Nov 3, 2013

Increasing statistical power in psychological research without increasing sample size

by

What is statistical power and precision?

This post is going to give you some practical tips to increase statistical power in your research. Before going there though, let’s make sure everyone is on the same page by starting with some definitions.

Statistical power is the probability that the test will reject the null hypothesis when the null hypothesis is false. Many authors suggest a statistical power rate of at least .80, which corresponds to an 80% probability of not committing a Type II error.

Precision refers to the width of the confidence interval for an effect size. The smaller this width, the more precise your results are. For 80% power, the confidence interval width will be roughly plus or minus 70% of the population effect size (Goodman & Berlin, 1994). Studies that have low precision have a greater probability of both Type I and Type II errors (Button et al., 2013).

To get an idea of how this works, here are a few examples of the sample sizes required to achieve .80 power for small, medium, and large correlations (Cohen, 1992), along with the expected precision of the resulting estimates:

Population Effect Size | Sample Size for 80% Power | Estimated Precision
r = .10 | 782 | 95% CI [.03, .17]
r = .30 | 84 | 95% CI [.09, .51]
r = .50 | 29 | 95% CI [.15, .85]
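If you want to check numbers like these yourself, here is a minimal sketch of my own (not from the original post's calculations) that approximates the required sample size with Fisher's z transformation; it lands within a participant or two of the table above.

```python
# Approximate sample size for 80% power to detect a correlation r,
# two-tailed alpha = .05, via the Fisher z approximation.
import math
from scipy.stats import norm

def n_for_correlation(r, power=0.80, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-tailed test
    z_power = norm.ppf(power)           # z corresponding to the desired power
    return round(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

for r in (0.10, 0.30, 0.50):
    print(r, n_for_correlation(r))      # ~783, ~85, ~29
```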

Studies in psychology are grossly underpowered

Okay, so now you know what power is. But why should you care? Fifty years ago, Cohen (1962) estimated that the average statistical power to detect a medium effect size in abnormal psychology research was about .48. That means roughly a 52% chance of missing a true medium-sized effect, which is no better than a coin flip! The situation has improved somewhat, but it remains a serious problem today. For instance, one review suggested that only 52% of articles in the applied psychology literature achieved .80 power for a medium effect size (Mone et al., 1996). This is partly because psychologists are studying small effects. One massive review of 322 meta-analyses covering some 8 million participants (Richard et al., 2003) suggested that the average effect size in social psychology is relatively small (r = .21). To put this into perspective, you would need 175 participants to have .80 power for a simple correlation between two variables at this effect size. Things get even worse for interaction effects: one review suggests that the average interaction effect is smaller still (f² = .009), which would require a sample of around 875 people to achieve .80 power (Aguinis et al., 2005).

Odds are, if you took the time to design a research study and collect data, you want to find a relationship if one really exists. You don't want to "miss" something that is really there. More than that, you probably want a reasonably precise estimate of the effect size (it's not very impressive to say only that a relationship is positive and probably non-zero). Below, I discuss concrete strategies for improving power and precision.

What can we do to increase power?

It is well known that increasing sample size increases statistical power and precision. Increasing the population effect size increases statistical power, but has no effect on precision (Maxwell et al., 2008). Increasing sample size improves power and precision by reducing the standard error of the effect size. Take a look at this formula for the confidence interval of a linear regression coefficient (McClelland, 2000):

b ± t(α/2) × √( MSE / (n × Vx × (1 − R²)) )

MSE is the mean square error, n is the sample size, Vx is the variance of X, and (1 − R²) is the proportion of the variance in X not shared by any other variables in the model. Okay, hopefully you didn't nod off there. There's a reason I'm showing you this formula: decreasing the numerator (MSE) or increasing anything in the denominator (n, Vx, 1 − R²) will decrease the standard error of the effect size, and will thus increase power and precision. In other words, there are at least three ways to increase statistical power aside from increasing sample size: (a) decreasing the mean square error; (b) increasing the variance of X; and (c) increasing the proportion of the variance in X not shared by any other predictors in the model. Below, I'll give you a few ways to do just that.
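To make the levers concrete, here is a toy function of my own (an illustration of the formula above, not code from McClelland, 2000). Halving MSE, doubling n, or doubling Vx all shrink the standard error by the same factor, while collinearity inflates it.

```python
# Toy illustration of the standard-error formula discussed above.
import math

def slope_se(mse, n, vx, r2_other=0.0):
    """Approximate SE of a regression slope: sqrt(MSE / (n * Vx * (1 - R^2)))."""
    return math.sqrt(mse / (n * vx * (1 - r2_other)))

baseline = slope_se(mse=1.0, n=100, vx=1.0)
print(slope_se(mse=0.5, n=100, vx=1.0) / baseline)   # halve MSE  -> SE * ~0.71
print(slope_se(mse=1.0, n=200, vx=1.0) / baseline)   # double n   -> SE * ~0.71
print(slope_se(mse=1.0, n=100, vx=2.0) / baseline)   # double Vx  -> SE * ~0.71
print(slope_se(mse=1.0, n=100, vx=1.0, r2_other=0.5) / baseline)  # collinearity -> SE * ~1.41
```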

Recommendation 1: Decrease the mean square error

Referring to the formula above, you can see that decreasing the mean square error will have about the same impact as increasing sample size. Okay. You've probably heard the term "mean square error" before, but the definition might be kind of fuzzy. Basically, your model makes a prediction for what the outcome variable (Y) should be, given certain values of the predictor (X). Naturally, it's not a perfect prediction, because you have measurement error and because there are other important variables you probably didn't measure. The mean square error is the average squared difference between what your model predicts and what the data actually show. So, anything that improves the quality of your measurement or accounts for potential confounding variables will reduce the mean square error, and thus improve statistical power. Let's make this concrete. Here are three specific techniques you can use:

a) Reduce measurement error by using more reliable measures (i.e., better internal consistency, test-retest reliability, inter-rater reliability, etc.). You've probably read that .70 is the "rule of thumb" for acceptable reliability. Okay, sure. That's publishable. But consider this: Let's say you want to test a correlation coefficient. If both measures have a reliability of .70, your observed correlation will be only about 70% of the true population value; put differently, the true correlation is about 1.43 times larger than what you observe (this follows from Spearman's correction for attenuation). Because you have a smaller observed effect size, you end up with less statistical power. Why do this to yourself? Reduce measurement error. If you're an experimentalist, make sure you execute your experimental manipulations exactly the same way each time, preferably by automating them. Slight variations in the manipulation (e.g., different locations, slight variations in timing) might reduce the reliability of the manipulation, and thus reduce power.
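Here is the attenuation arithmetic as a short sketch (the .30 "true" correlation is just an example value of mine):

```python
import math

def observed_r(true_r, rel_x, rel_y):
    """Spearman's attenuation formula: r_observed = r_true * sqrt(rel_x * rel_y)."""
    return true_r * math.sqrt(rel_x * rel_y)

true_r = 0.30
obs = observed_r(true_r, 0.70, 0.70)
print(f"true r = {true_r:.2f}, observed r = {obs:.2f}, ratio = {true_r / obs:.2f}")  # ratio ~1.43
```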

b) Control for confounding variables. In correlational research, this means including control variables that predict the outcome variable but are relatively uncorrelated with the other predictor variables. In experimental designs, it means taking great care to control for as many potential confounds as possible. In both cases, this reduces the mean square error and improves the overall predictive power of the model – and thus improves statistical power. Be careful when adding control variables to your models, though: there are diminishing returns. Adding a couple of good covariates is bound to improve your model, but you always have to balance predictive power against model complexity. Adding many predictors relative to your sample size can lead to overfitting (the model ends up describing noise or random error rather than signal). So, controlling for a couple of good covariates is generally a good idea, but too many covariates will probably make your model worse, not better, especially if the sample is small.

c) Use repeated-measures designs, in which participants are measured multiple times (e.g., once-a-day surveys, multiple trials in an experiment). Repeated-measures designs reduce the mean square error by partitioning out the variance due to individual participants. Depending on the kind of analysis you do, they can also increase the degrees of freedom substantially. For example, you might have only 100 participants, but if you measured them once a day for 21 days, you actually have 2,100 data points to analyze. The analysis can get tricky and the interpretation of the data may change, but many multilevel and structural equation models can take advantage of these designs by treating each measurement occasion (i.e., each day, each trial, etc.), rather than each participant, as the unit of analysis. Increasing the degrees of freedom acts much like increasing the sample size in terms of statistical power. I'm a big fan of repeated-measures designs, because they allow researchers to collect a lot of data from fewer participants.
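As a rough sketch of what such an analysis can look like, here is a simulated daily-diary dataset analyzed with a random-intercept multilevel model; the variable names, effect sizes, and the use of statsmodels are my own assumptions for illustration.

```python
# Sketch of a repeated-measures (multilevel) analysis on simulated diary data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_people, n_days = 100, 21
participant = np.repeat(np.arange(n_people), n_days)       # 100 x 21 = 2100 rows
person_effect = rng.normal(0, 1, n_people)[participant]    # stable individual differences
stress = rng.normal(0, 1, n_people * n_days)
mood = 0.3 * stress + person_effect + rng.normal(0, 1, n_people * n_days)
df = pd.DataFrame({"participant": participant, "stress": stress, "mood": mood})

# The random intercept partials out person-level variance, so the within-person
# effect of stress is tested against a much smaller error term.
fit = smf.mixedlm("mood ~ stress", data=df, groups=df["participant"]).fit()
print(fit.summary())
```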

Recommendation 2: Increase the variance of your predictor variable

Another, less well-known way to increase statistical power and precision is to increase the variance of your predictor variables (X). The formula above shows that doubling the variance of X has the same impact on statistical precision as doubling the sample size does! So it's worth figuring this out.

a) In correlational research, use more comprehensive continuous measures. That is, there should be a large possible range of values endorsed by participants. However, the measure should also capture many different aspects of the construct of interest; artificially increasing the range of X by adding redundant items (i.e., simply re-phrasing existing items to ask the same question) will actually hurt the validity of the analysis. Also, avoid dichotomizing your measures (e.g., median splits), because this reduces the variance and typically reduces power (MacCallum et al., 2002).
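A quick simulation (my own example, with arbitrary numbers) shows the cost of a median split: dichotomizing a predictor throws away variance and shrinks the observed correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_r = 100_000, 0.30
x = rng.normal(size=n)
y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)
x_split = (x > np.median(x)).astype(float)   # median split of the predictor

print(np.corrcoef(x, y)[0, 1])        # ~0.30 with the continuous predictor
print(np.corrcoef(x_split, y)[0, 1])  # ~0.24 after the median split
```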

b) In experimental research, allocating participants unequally across conditions can improve statistical power. Say you were designing an experiment with three conditions (low, medium, or high self-esteem). Most of us would assign participants equally to all three groups, right? Well, as it turns out, assigning participants equally across groups usually reduces statistical power. The idea behind unequal allocation is to maximize the variance of X for the particular kind of relationship under study, which, according to the formula I gave earlier, will increase power and precision. For example, the optimal design for a linear relationship would be 50% low, 50% high, omitting the medium condition entirely. The optimal design for a quadratic relationship would be 25% low, 50% medium, and 25% high. The proportions vary widely depending on the design and the kind of relationship you expect, so I recommend you check out McClelland (1997) for more information on efficient experimental designs. You might be surprised.
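To see why for the linear case, compare the variance of the design variable under the two allocations (a toy calculation of mine, coding the conditions as -1, 0, and +1):

```python
import numpy as np

equal_thirds = np.repeat([-1, 0, 1], 100)   # 1/3 low, 1/3 medium, 1/3 high
extremes_only = np.repeat([-1, 1], 150)     # 1/2 low, 1/2 high, no medium

print(equal_thirds.var())    # ~0.67
print(extremes_only.var())   # 1.00 -- 50% more variance in X for the same N,
                             # which buys power much like a 50% larger sample
```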

Recommendation 3: Make sure predictor variables are uncorrelated with each other

A final way to increase statistical power is to increase the proportion of the variance in X not shared with other variables in the model. When predictor variables are correlated with each other, this is known as collinearity. For example, depression and anxiety are positively correlated with each other; including both as simultaneous predictors (say, in multiple regression) means that statistical power will be reduced, especially if one of the two variables actually doesn't predict the outcome variable. Lots of textbooks suggest that we should only worry about this when collinearity is extremely high (e.g., correlations above .70). However, research has shown that even modest intercorrelations among predictor variables reduce statistical power (Mason & Perreault, 1991). Bottom line: if you can design a model in which your predictor variables are relatively uncorrelated with each other, you can improve statistical power.
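As a back-of-the-envelope illustration (my own numbers), the 1 − R² term works like a variance inflation factor: with two predictors, even moderate intercorrelations noticeably inflate the standard error of each coefficient.

```python
import math

def se_inflation(r_between_predictors):
    """Multiplier on a coefficient's SE when two predictors correlate r."""
    r2 = r_between_predictors ** 2      # variance shared with the other predictor
    return 1 / math.sqrt(1 - r2)

for r in (0.1, 0.3, 0.5, 0.7):
    print(f"r = {r:.1f}: SE x {se_inflation(r):.2f}")   # 1.01, 1.05, 1.15, 1.40
```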

Conclusion

Increasing statistical power is one of the rare cases where what is good for science and what is good for your career actually coincide. It increases the accuracy and replicability of results, so it's good for science. It also increases your likelihood of finding a statistically significant result (assuming the effect actually exists), making it more likely that you'll get something published. You don't need to torture your data with obsessive re-analysis until you get p < .05. Instead, put more thought into research design in order to maximize statistical power. Everyone wins, and you can use the time you used to spend sweating over p-values to do something more productive. Like volunteering with the Open Science Collaboration.

References

Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect Size and Power in Assessing Moderating Effects of Categorical Variables Using Multiple Regression: A 30-Year Review. Journal of Applied Psychology, 90, 94-107. doi:10.1037/0021-9010.90.1.94

Button, K. S., Ioannidis, J. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376. doi: 10.1038/nrn3475

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65, 145-153. doi:10.1037/h0045186

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159. doi:10.1037/0033-2909.112.1.155

Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine, 121, 200-206.

Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. In L. M. Collins & L. A. Seitz (Eds.), Advances in data analysis for prevention intervention research (NIDA Research Monograph 142, NIH Publication No. 94-3599, pp. 184–195). Rockville, MD: National Institutes of Health.

MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40. doi:10.1037/1082-989X.7.1.19

Mason, C. H., & Perreault, W. D. (1991). Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research, 28, 268-280. doi:10.2307/3172863

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537-563. doi:10.1146/annurev.psych.59.103006.093735

McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2, 3-19. doi:10.1037/1082-989X.2.1.3

McClelland, G. H. (2000). Increasing statistical power without increasing sample size. American Psychologist, 55, 963-964. doi:10.1037/0003-066X.55.8.963

Mone, M. A., Mueller, G. C., & Mauland, W. (1996). The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology, 49, 103-120. doi:10.1111/j.1744-6570.1996.tb01793.x

Open Science Collaboration. (in press). The Reproducibility Project: A model of large-scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leisch, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series). New York, NY: Taylor & Francis. doi:10.2139/ssrn.2195999

Richard, F. D., Bond, C. F., Jr., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331-363. doi:10.1037/1089-2680.7.4.331

Oct 25, 2013

It’s Easy Being Green (Open Access)

by

This is the first installment of the Open Science Toolkit, a recurring feature that outlines practical steps individuals and organizations can take to make science more open and reproducible.

Photo of Frank Farach

Congratulations! Your manuscript has been peer reviewed and accepted for publication in a journal. The journal is owned by a major publisher who wants you to know that, for $3,000, you can make your article open access (OA) forever. Anyone in the world with access to the Internet will have access to your article, which may be cited more often because of its OA status. Otherwise, the journal would be happy to make your paper available to subscribers and others willing to pay a fee.

Does this sound familiar? It sure does to me. For many years, when I heard about Open Access (OA) to scientific research, it was always about making an article freely available in a peer-reviewed journal -- the so-called “gold” OA option -- often at considerable expense. I liked the idea of making my work available to the widest possible audience, but the costs were prohibitive.

As it turns out, however, the “best-kept secret” of OA is that you can make your work OA for free by self-archiving it in an OA repository, even if it has already been published in a non-OA journal. Such “green” OA is possible because many journals have already provided blanket permission for authors to deposit their peer-reviewed manuscript in an OA repository.

The flowchart below shows how easy it is to make your prior work OA. The key is to make sure you follow the self-archiving policy of the journal your work was published in, and to deposit the work in a suitable repository.

Flowchart showing how to archive your research.

Journals typically display their self-archiving and copyright policies on their websites, but you can also search for them in the SHERPA/RoMEO database, which has a nicely curated collection of policies from 1,333 publishers. The database assigns a code to journals based on how far in the publication process their permissions extend. It distinguishes between a pre-print, which is the version of the manuscript before it underwent peer review, and a post-print, the peer-reviewed version before the journal copyedited and typeset it. Few journals allow you to self-archive their copyedited PDF version of your article, but many let you do so for the pre-print or post-print version. Unfortunately, some journals don’t provide blanket permission for self-archiving, or require you to wait for an embargo period to end before doing so. If you run into this problem, you should contact the journal and ask for permission to deposit the non-copyedited version of your article in an OA repository.

It has also become easy to find a suitable OA repository in which to deposit your work. Your first stop should be the Directory of Open Access Repositories (OpenDOAR), which currently lists over 2,200 institutional, disciplinary, and universal repositories. Although an article deposited in an OA repository will be available to anyone with Internet access, repositories differ in feature sets and policies. For example, some repositories, like figshare, automatically assign a CC BY license to all publicly shared papers; others, like Open Depot, allow you to choose a license before making the article public. A good OA repository will tell you how it ensures the long-term digital preservation of your content as well as what metadata it exposes to search engines and web services.

Once you’ve deposited your article in an OA repository, consider making others aware of its existence. Link to it on your website, mention it on social media, or add it to your CV.

In honor of Open Access Week, I am issuing a “Green OA Challenge” to readers of this blog who have published at least one peer-reviewed article. The challenge is to self-archive one of your articles in an OA repository and link to it in the comments below. Please also feel free to share any comments you have about the self-archiving process or about green OA. Happy archiving!

Oct 16, 2013

Thriving au naturel amid science on steroids

Brian A. Nosek
University of Virginia, Center for Open Science

Jeffrey R. Spies
Center for Open Science

Last fall, the present first author taught a graduate class called “Improving (Our) Science” at the University of Virginia. The class reviewed evidence suggesting that scientific practices are not operating ideally and are damaging the reproducibility of published findings. For example, the power of an experimental design in null hypothesis significance testing is a function of the effect size being investigated and the size of the sample used to test it—power is greater when effects are larger and samples are bigger. In the authors’ field of psychology, for example, estimates suggest that the power of published studies to detect an average effect size is .50 or less (Cohen, 1962; Sedlmeier & Gigerenzer, 1989). Even if all of the published effects were true, only about 50% of studies should therefore produce positive results (i.e., p < .05 supporting the hypothesis). In reality, more than 90% of published results are positive (Sterling, 1959; Fanelli, 2010).

How is it possible that the average study has power to detect the average true effect 50% of the time or less and yet does so about 90% of the time? It isn’t. Then how does this occur? One obvious contributor is selective reporting. Positive effects are more likely than negative effects to be submitted and accepted for publication (Greenwald, 1975). The consequences include [1] the published literature is more likely to exaggerate the size of true effects because with low powered designs researchers must still leverage chance to obtain a large enough effect size to produce a positive result; and [2] the proportion of false positives – there isn’t actually an effect to detect – will be inflated beyond the nominal alpha level of 5% (Ioannidis, 2005).

The class discussed this and other scientific practices that may interfere with knowledge accumulation. Some of the relatively common ones are described in Table 1 along with some solutions that we, and others, identified. Problem. Solution. Easy. The class just fixed science. Now, class members can adopt the solutions as best available practices. Our scientific outputs will be more accurate, and significant effects will be more reproducible. Our science will be better.

Alex Schiller, a class member and graduate student, demurred. He agreed that the new practices would make science better, but disagreed that we should do them all. A better solution, he argued, is to take small steps: adopt one solution, wait for that to become standard scientific practice, and then adopt another solution.

We know that some of our practices are deficient, we know how to improve them, but Alex is arguing that we shouldn’t implement all the solutions? Alex’s lapse of judgment can be forgiven—he’s just a graduate student. However, his point isn’t a lapse. Faced with the reality of succeeding as a scientist, Alex is right.

Table 1

Scientific practices that increase irreproducibility of published findings, possible solutions, and barriers that prevent adoption of those solutions

Practice | Problem | Possible Solution | Barrier to Solution
Run many low-powered studies rather than few high-powered studies | Inflates false positive and false negative rates | Run high-powered studies | Non-significant effects are a threat to publishability; risky to devote extra resources to high-powered tests that might not produce significant effects
Report significant effects and dismiss non-significant effects as methodologically flawed | Using outcome to evaluate method is a logical error and can inflate false positive rate | Report all effects with rationale for why some should be ignored; let reader decide | Non-significant and mixed effects are a threat to publishability
Analyze during data collection, stop when significant result is obtained or continue until significant result is obtained | Inflates false positive rate | Define data stopping rule in advance | Non-significant effects are a threat to publishability
Include multiple conditions or outcome variables, report the subset that showed significant effects | Inflates false positive rate | Report all conditions and outcome variables | Non-significant and mixed effects are a threat to publishability
Try multiple analysis strategies, data exclusions, data transformations, report cleanest subset | Inflates false positive rate | Prespecify data analysis plan, or report all analysis strategies | Non-significant and mixed effects are a threat to publishability
Report discoveries as if they had resulted from confirmatory tests | Inflates false positive rate | Pre-specify hypotheses; report exploratory and confirmatory analyses separately | Many findings are discoveries, but stories are nicer and scientists seem smarter if they had thought it in advance
Never do a direct replication | Inflates false positive rate | Conduct direct replications of important effects | Incentives are focused on innovation, replications are boring; original authors might feel embarrassed if their original finding is irreproducible

Note: For reviews of these practices and their effects see Ioannidis, 2005; Giner-Sorolla, 2012; Greenwald, 1975; John et al., 2012; Nosek et al., 2012; Rosenthal, 1979; Simmons et al., 2011; Young et al., 2008

The Context

In an ideal world, scientists use the best available practices to produce accurate, reproducible science. But, scientists don’t live in an ideal world. Alex is creating a career for himself. To succeed, he must publish. Papers are academic currency. They are Alex’s ticket to job security, fame, and fortune. Well, okay, maybe just job security. But, not everything is published, and some publications are valued more than others. Alex can maximize his publishing success by producing particular kinds of results. Positive effects, not negative effects (Fanelli, 2010; Sterling, 1959). Novel effects, not verifications of prior effects (Open Science Collaboration, 2012). Aesthetically appealing, clean results, not results with ambiguities or inconsistencies (Giner-Sorolla, 2012). Just look at the pages of Nature, or any other leading journal: they are filled with articles reporting positive, novel, beautiful results. They are wonderful, exciting, and groundbreaking. Who wouldn’t want that?

We do want that, and science advances in leaps with groundbreaking results. The hard reality is that few results are actually groundbreaking. And, even for important research, the results are often far from beautiful. There are confusing contradictions, apparent exceptions, and things that just don’t make sense. To those in the laboratory, this is no surprise. Being at the frontiers of knowledge is hard. We don’t quite know what we are looking at. That’s why we are studying it. Or, as Einstein said, “If we knew what we were doing, it wouldn’t be called research”.   But, those outside the laboratory get a different impression. When the research becomes a published article, much of the muck goes away. The published articles are like the pictures of this commentary’s authors at the top of this page. Those pictures are about as good as we can look. You should see the discards. Those with insider access know, for example, that we each own three shirts with buttons and have highly variable shaving habits. Published articles present the best-dressed, clean-shaven versions of the actual work.

Just as with people, when you replicate effects yourself to see them in person, they may not be as beautiful as they appeared in print. The published version often looks much better than reality. The effect is hard to get, dependent on a multitude of unmentioned limiting conditions, or entirely irreproducible (Begley & Ellis, 2012; Prinz et al., 2011).

The Problem

It is not surprising that effects are presented in their best light. Career advancement depends on publishing success. More beautiful looking results are easier to publish and more likely to earn rewards (Giner-Sorolla, 2012). Individual incentives align for maximizing publishability, even at the expense of accuracy (Nosek et al., 2012).

Consider three hypothetical papers shown in Table 2. For all three, the researchers identified an important problem and had an idea for a novel solution. Paper A is a natural beauty. Two well planned studies showed effects supporting the idea. Paper B and Paper C were conducted with identical study designs. Paper B is natural, but not beautiful; Paper C is a manufactured beauty. Both Paper B and Paper C were based on 3 studies. One study for each showed clear support for the idea. A second study was a mixed success for Paper B, but “worked” for Paper C after increasing the sample size a bit and analyzing the data a few different ways. A third study did not work for either. Paper B reported the failure with an explanation for why the methodology might be to blame, rather than the idea being incorrect. The authors of Paper C generated the same methodological explanation, categorized the study as a pilot, and did not report it at all. Also, Paper C described the final sample sizes and analysis strategies, but did not mention that extra data was collected after initial analysis, or that alternative analysis strategies had been tried and dismissed.

Table 2

Summary of research practices for three hypothetical papers

Step | Paper A | Paper B | Paper C
Data collection | Conducted two studies | Conducted three studies | Conducted three studies
Data analysis | Analyzed data after completing data collection following a pre-specified analysis plan | Analyzed data after completing data collection following a pre-specified analysis plan | Analyzed during data collection and collected more data to get to significance in one case; selected from multiple analysis strategies for all studies
Result reporting | Reported the results of the planned analyses for both studies | Reported the results of the planned analyses for all studies | Reported results of final analyses only; did not report one study that did not reach significance
Final paper | Two studies demonstrating clear support for idea | One study demonstrating clear support for idea, one mixed, one not at all | Two studies demonstrating clear support for idea

Paper A is clearly better than Paper B. Paper A should be published in a more prestigious outlet and generate more attention and accolades. Paper C looks like Paper A, but in reality it is like Paper B. The actual evidence is more equivocal than the apparent evidence. Based on the report, however, no one can tell the difference between Paper A and Paper C.

Two possibilities would minimize the negative impact of publishing manufactured beauties like Paper C. First, if replication were standard practice, then manufactured effects would be identified rapidly. However, direct replication is very uncommon (Open Science Collaboration, 2012). Once an effect is in the literature, there is little systematic ethic to self-correct. Rather than being weeded out, false effects persist or just slowly fade away. Second, scientists could simply avoid the practices that lead to Paper C, making this illustration an irrelevant hypothetical. Unfortunately, a growing body of evidence suggests that these practices occur, and some are even common (e.g., John et al., 2012).

To avoid the practices that produce Paper C, the scientist must be aware of and confront a conflict-of-interest—what is best for science versus what is best for me. Scientists have inordinate opportunity to pursue flexible decision-making in design and analysis, and there is minimal accountability for those practices. Further, humans’ prodigious capacity for motivated reasoning provides a way to decide that the outcome that looks best for us also has the most compelling rationale (Kunda, 1990). So, we may convince ourselves that the best course of action for us was the best course of action period. It is very difficult to stop doing suspect practices when we have thoroughly convinced ourselves that we are not doing them.

The Solution

Alex needs to publish to succeed. The practices in Table 1 are to the scientist what steroids are to the athlete. They amplify the likelihood of success in a competitive marketplace. If others are using and Alex decides to rely on his natural performance, then he will disadvantage his career prospects. Alex wants to do the best science he can and be successful for doing it. In short, he is the same as every other scientist we know, ourselves included. Alex shouldn’t have to make a choice between doing the best science and being successful—these should be the same thing.

Is Alex stuck? Must he wait for institutional regulation, audits, and the science police to fix the system? In a regulatory world, practices are enforced and he need not worry that he’s committing career suicide by following them. Many scientists are wary of a strong regulatory environment in science, particularly because of the possibility of stifling innovation. Some of the best ideas start with barely any evidence at all, and restrictive regulations on confidence in outputs could discourage taking risks on new ideas. Nonetheless, funders, governments, and other stakeholders are taking notice of the problematic incentive structures in science. If we don’t solve the problem ourselves, regulators may solve it for us.

Luckily, Alex has an alternative. The practices in Table 1 may be widespread, but the solutions are also well known and endorsed as good practice (Fuchs et al., 2012). That is, scientists easily understand the differences between Papers A, B, and C – if they have full access to how the findings were produced. As a consequence, the only way to be rewarded for natural achievements over manufactured ones is to make the process of obtaining the results transparent. Using the best available practices privately will improve science but hurt careers. Using the best available practices publicly will improve science while simultaneously improving the reputation of the scientist. With openness, success can be influenced by the results and by how they were obtained.  

Conclusion

The present incentives for publishing are focused on the one thing that we scientists are absolutely, positively not supposed to control - the results of the investigation. Scientists have complete control over the design, procedures, and execution of the study. The results are what they are.

A better science will emerge when the incentives for achievement align with the things that scientists can (and should) control with their wits, effort, and creativity. With results, beauty is contingent on what is known about their origin. Obfuscation of methodology can make ugly results appear beautiful. With methodology, if it looks beautiful, it is beautiful. The beauty of methodology is revealed by openness.

Most scientific results have warts. Evidence is halting, uncertain, incomplete, confusing, and messy. It is that way because scientists are working on hard problems. Exposing it will accelerate finding solutions to clean it up. Instead of trying to make results look beautiful when they are not, the inner beauty of science can be made apparent. Whatever the results, the inner beauty—strong design, brilliant reasoning, careful analysis—is what counts. With openness, we won’t stop aiming for A papers. But, when we get them, it will be clear that we earned them.

References

Begley, C. G., & Ellis, L. M. (2012). Raise standards for preclinical cancer research. Nature, 483, 531-533.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.

Fanelli, D. (2010). "Positive" results increase down the hierarchy of the sciences. PLoS ONE, 5(4), e10068. doi:10.1371/journal.pone.0010068

Fuchs, H., Jenny, M., & Fiedler, S. (2012). Psychologists are open to change, yet wary of rules. Perspectives on Psychological Science, 7, 634-637. doi:10.1177/1745691612459521

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.

Giner-Sorolla, R. (2012). Science or art? How esthetic standards grease the way through the publication bottleneck but undermine science. Perspectives on Psychological Science.

Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

John, L., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524-532. doi:10.1177/0956797611430953

Kunda, Z. (1990). The case for motivated reasoning. Psychological Bulletin, 108, 480-498. doi:10.1037/0033-2909.108.3.480

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7,615-631. doi:10.1177/1745691612459058

Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7, 657-660. doi:10.1177/1745691612462588

Prinz, F., Schlange, T. & Asadullah, K. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712-713.

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638-641. doi:10.1037/0033-2909.86.3.638

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance - or vice versa. Journal of the American Statistical Association, 54, 30-34.

Young, N. S., Ioannidis, J. P. A., & Al-Ubaydli, O. (2008). Why current publication practices may distort science. PLoS Medicine, 5, 1418-1422.

Oct 10, 2013

Opportunities for Collaborative Research

by

Photo of Jon Grahe

As a research methods instructor, I encourage my students to conduct “authentic” research projects. For now, consider authentic undergraduate research experiences to be student projects, at any level, where the findings might result in a conference presentation or a publishable paper. This includes getting IRB approval and attempting to present our findings at regional conferences, perhaps including a journal submission. Eventually, I participated in opportunities for my students to contribute to big science by joining two crowd-sourcing projects organized by Alan Reifman. The first was published in Teaching of Psychology (School Spirit Study Group, 2004) as an example for others to follow. The data included samples from 22 research methods classes that measured indices representative of the group name. Classes evaluated the data from their own school, and the researchers collapsed the data to look at generalization across institutions. The second collaborative project included surveys collected by students in 10 research methods classes. The topic was primarily focused on emerging adulthood and political attitudes, but included many other psychological constructs. Later, I note the current opportunity for emerging adulthood theorists to use these data for a special issue of the journal Emerging Adulthood. These two projects notwithstanding, there have been few open calls for instructors to participate in big science. Instructors who want to include authentic research experiences do so by completing all their own legwork, as characterized by Frank and Saxe (2012) or as displayed by many of the poster presentations you see at regional conferences.

However, that state of affairs has changed. Though instructors are likely to continue engaging in authentic research in their classrooms, they don’t have to develop the projects on their own anymore. I am very excited about the recent opportunities that allow students to contribute to “big science” by acting as crowd-sourcing experimenters. In full disclosure, I acknowledge my direct involvement in developing three of the following projects. However, I will save you a description of my own pilot test called the Collective Undergraduate Research Project (Grahe, 2010). Instead, I will briefly review recent “open invitation projects”, those where any qualified researcher can contribute. The first to emerge (Psych File Drawer, Reproducibility Project) were focused on PhD level researchers. However, since August 2012 there have been three research projects that specifically invite students to help generate data either as experimenters or as coders. More are likely to emerge soon as theorists grasp the idea that others can help them collect data.

This is a great time to be a research psychologist. These projects provide real opportunities for students or other researchers at any expertise level to get involved in not only authentic, but transformative research opportunities. The Council of Undergraduate Research recently published an edited volume (Karukstis & Hensel, 2010) dedicated to fostering transformative research for undergraduates. According to Wikipedia, “Transformative research is a term that became increasingly common within the science policy community in the 2000s for research that shifts or breaks existing scientific paradigms.” I consider open invitation projects transformative because they change the way that we view minimally acceptable standards for research. Each one is intended to change the basic premise of collaboration by bridging not only institutions, but also the chasm of acquaintance. Any qualified researcher can participate in data collection and authorship. Now, when I introduce research opportunities to my students, the ideas are grand. Here are research projects with grand ideas that invite contributions.

Many Labs Project - The Many Labs Team’s original invitation to contributors, sent out in February 2013, asked contributors to join their “wide-scale replication project [that was] conditionally accepted for publication in the special issue of Social Psychology.” They made a follow-up invitation in July reminding us of their October 1st, 2013 deadline for data collection. Their goal is to replicate 12 effects using the Project Implicit infrastructure. As is typical of these projects, contributors meeting a minimum goal (N > 80 cases in this instance) will be listed as coauthors on future publications. As they stated in their July post, “The project and accepted proposal has been registered on the Open Science Framework and can be found at this link: http://www.openscienceframework.org/project/WX7Ck/.” Richard Klein (project coordinator) reported that their project includes 18 different researchers at 15 distinct labs, with a plan for 38 labs contributing data before the deadline. Because the project is nearing its data collection deadline, new contributions are limited, but it remains an important exemplar of what crowd-sourced projects can look like.

Reproducibility Project – As stated on their invitation to new contributors on the Open Science Framework page, "Our primary goal is to conduct high quality replications of studies from three 2008 psychology journals and then compare our results to the original studies." As of right now, Johanna Cohoon from the Center for Open Science states, “We have 136 replication authors who come from 59 different institutions, with an additional 19 acknowledged replication researchers (155 replication researchers total). We also have an additional 40 researchers who are not involved in a replication that have earned authorship through coding/vetting 10 or more articles.” In short, this is a large collaboration and is welcoming more committed researchers. Though this project needs advanced researchers, it is possible for faculty to work closely with students who might wish to contribute.

PsychFileDrawer Project – This is less a collaborative effort between researchers than a compendium of replications, as stated by the project organizers: “PsychFileDrawer.org is a tool designed to address the File Drawer Problem as it pertains to psychological research: the distortion in the scientific literature that results from the failure to publish non-replications." It is collaborative in the sense that they host a “top 20” list of studies that the viewers want to see replicated. However, any researcher with a replication is invited to contribute the data. Further, although this was not initially targeted toward students, there is a new feature that allows contributors to identify a sample as a class project and the FAQ page asks instructors to comment on, “…the level and type of instructor supervision, and instructor’s degree of confidence in the fidelity with which the experimental protocol was implemented.”

Emerging Adulthood, Political Decisions, and More (2004) project – Alan Reifman is an early proponent of collaborative undergraduate research projects. After successfully guiding the School Spirit Study Group, he again called on research methods instructors to collectively conduct a survey that included two emerging adulthood scales, some political attitudes and intentions, and other scales that contributors wanted to add to the paper and pencil survey. By the time the survey was finalized, it was over 10 pages long and included hundreds of individual items measuring dozens of distinct psychological constructs on 12 scales. In retrospect, the survey was too long, and the invitation to add on measures might have caused the attrition of committed contributors that occurred. However, the final sample included over 1300 cases from 11 distinct institutions across the US. A list of contributors and initial findings can be found at Alan Reifman’s Emerging Adulthood Page. This project suffered from the amorphous structure of the collaboration. In short, no one instructor was interested in all the constructs. To resolve this situation where wonderfully rich data sit unanalyzed, the contributors are inviting emerging adulthood theorists to analyze the data and submit their work for a special issue of Emerging Adulthood. Details for how to request the data are available on the project’s OSF page. The deadline for a submission is July 2014.

International Situations Project (ISP) – David Funder and Esther Guillaume-Hanes from the University of California—Riverside have organized a coalition of international researchers (19 and counting) to complete an internet protocol. As they describe on their about page, “Participants will describe situations by writing a brief description of the situation they experienced the previous day at 7 pm. They will identify situational characteristics using the Riverside Situation Q-sort (RSQ) which includes 89 items that participants place into one of nine categories ranging from not at all characteristic to very characteristic. They then identify the behaviors they displayed using the Riverside Behavioral Q-Sort (RBQ) which includes 68 items using the same sorting procedure.” The UC-Riverside researchers are taking primary responsibility for writing research reports. They have papers in print and others in preparation where all contributors who provide more than 80 cases are listed as authors. In Fall 2012, Psi Chi and Psi Beta encouraged their members to replicate the ISP in the US to create a series of local samples, yielding 11 samples with 5 more committed contributors (Grahe, Guillaume-Hanes, & Rudmann, 2014). Currently, contributors are invited to complete either this project or the subsequent Personality project, which involves completing this protocol and then completing the California Adult Q-Sort two weeks later. Interested researchers should contact Esther Guillaume directly at eguil002@ucr.edu.

Collaborative Replications and Education Project (CREP) – This project has recently started inviting contributions and is explicitly designed for undergraduates. Instructors who guide student research projects are encouraged to share the available studies list with their students. Hans IJzerman and Mark Brandt reviewed the three most-cited empirical articles in the top journals in nine sub-disciplines, rated them for feasibility of being completed by undergraduates, and selected the nine most feasible studies. The project provides small ($200-$500) CREP research awards for completed projects (sponsored by Psi Chi and the Center for Open Science). Contributors will be encouraged and supported in writing and submitting reports for publication by the project coordinators. Anyone interested in participating should contact Jon Grahe (graheje@plu.edu).

Archival Project – This Center for Open Science project is also specifically designed as a crowd-sourcing opportunity for students. When it completes the beta-testing phase, the Archival Project will be publicly advertised. Unlike all the projects reviewed thus far, this project asks contributors to serve as coders rather than act as experimenters. It is a companion project to the OSC Reproducibility Project in that the target articles are from the same three journals, from the first three months of 2008. This project has a low bar for entry, as training can take little time (particularly with the now-available online tutorial) and coders can code as few as a single article and still make a real contribution. However, this project also has a system of honors, as stated on their “getting involved” page: “Contributors who provide five or more accurate codings will be listed as part of the collective authorship.” This project was designed with the expectation that instructors will find the opportunity pedagogically useful and that they will employ it as a course exercise. Alternatively, students in organized clubs (such as Psi Chi) are invited to participate to increase their own methodological competence while simultaneously accumulating evidence of their contributions to an authentic research project. Finally, graduate students are invited to participate without faculty supervision. Interested parties should contact Johanna Cohoon (johannacohoon@gmail.com) for more information.

Future Opportunities – While this is intended to be an exhaustive list of open invitation projects, the field is not static and this list is likely to grow. What is exciting is that we now have ample opportunities to participate in “big science” with relatively small contributions. When developed with care, these projects follow the recommendation of Grahe et al. (2012) to take advantage of the magnitude of research being conducted each year by psychology students. The bar for entry varies across these projects from relatively intensive (e.g., Reproducibility Project, CREP) to relatively easy (e.g., Archival Project, ISP), providing opportunities for individuals with varying resources, from graduate students and PhD-level researchers capable of completing high-quality replications to students and instructors who seek opportunities in the classroom. Beyond the basic opportunity to participate in transformative research, these projects provide exemplars for how future collaborative projects should be designed and managed.

This is surely an incomplete list of current or potential examples of crowd-sourcing research. Please share other examples as comments below. Further consider pointing out strengths, weaknesses, opportunities or threats that could emerge from collaborative research. Finally, any public statements about intentions to participate in this type of open science initiative are welcome.

References

Frank, M. C., & Saxe, R. (2012). Teaching replication. Perspectives On Psychological Science, 7(6), 600-604. doi:10.1177/1745691612460686

Grahe, J. E., Guillaume-Hanes, E., & Rudmann, J. (2014). Students collaborate to advance science: The International Situations Project. Council on Undergraduate Research Quarterly.

Grahe, J. E., Reifman, A., Hermann, A. D., Walker, M., Oleson, K. C., Nario-Redmond, M., & Wiebe, R. P. (2012). Harnessing the undiscovered resource of student research projects. Perspectives On Psychological Science, 7(6), 605-607. doi:10.1177/1745691612459057

Karukstis, K. K., & Hensel, N. (Eds.). (2010). Transformative research at predominantly undergraduate institutions. Washington, DC: Council on Undergraduate Research.

Oct 4, 2013

A publishing sting, but what was stung?

by

Before Open Science there was Open Access (OA) — a movement driven by the desire to make published research publicly accessible (after all, the public usually had paid for it), rather than hidden behind paywalls.

Open Access is, by now, its very own strong movement — follow for example Björn Brembs if you want to know what is happening there — and there are now nice Open Access journals, like PLOS and Frontiers, which are peer-reviewed and reputable. Some of these journals charge an article processing fee for OA articles, but in many cases funders have introduced or are developing provisions to cover these costs. (In fact, the big private funder in Sweden INSISTS on it.)

But, as always, situations where there is money involved and targets who are desperate (please please please publish my baby so I won’t perish) breed mimics and cuckoos and charlatans, ready to game the new playing field to their advantage. This is probably just a feature of the human condition (see Trivers’ “The Folly of Fools”).

There are lists of potentially predatory Open Access journals — I have linked some in on my private blog (Åse Fixes Science) here and here. Reputation counts. Buyers beware!

Demonstrating the challenges of this new marketplace, John Bohannon published in Science (a decidedly not Open Access journal) a sting operation in which he spoofed Open Access journals to test their peer-review system. The papers were machine generated nonsense — one may recall the Sokal Hoax from the previous Science Wars. One may also recall the classic Ceci paper from 1982, which made the rounds again earlier this year (and I blogged about that one too - on my other blog).

Crucially, all of Bohannon’s papers contained fatal flaws that a decent peer reviewer should catch. The big news? Lots of the journals did not catch them (though PLOS did). Here’s the Science article with its commentary stream, and a commentary from Retraction Watch.

This is, of course, interesting — and it is generating buzz. But, it is also generating some negative reaction. For one, Bohannon did not include the regular non-OA journals in his test, so the experiment lacks a control group, which means we can make no comparison and draw no firm inferences from the data. The reason he states (it is quoted on the Retraction Watch site) is the very long turnaround times for regular journals, which can be months, even a year (or longer, as I’ve heard). I kinda buy it, but this is really what is angering the Open Access crowd, who sees the piece as an attempt to implicate Open Access itself as the source of the problem. And, apart from Bohannon not including regular journals in his test, Science published what he wrote without peer reviewing it.

My initial take? I think it is important to test these things — to uncover the flaws in the system, and also to uncover the cuckoos and the mimics and the gamers. Of course, the problems in peer review are not solely on the shoulders of Open Access — Diederik Stapel, and other frauds, published almost exclusively in paywalled journals (including Science). The status of peer review warrants its own scrutiny.

But, I think the Open Access advocates have a really important point to make. As noted on Retraction Watch, Bohannon's study didn't include non-OA journals, so it's unclear whether the peer-review problems he identified in OA journals are unique to their OA status.

I’ll end by linking in some of the commentary that I have seen so far — and, of course, we’re happy to hear your comments.

Oct 2, 2013

Smoking on an Airplane

by

Photo of Denny Borsboom

People used to smoke on airplanes. It's hard to imagine, but it's true. In less than twenty years, smoking on airplanes has grown so unacceptable that it has become difficult to see how people ever condoned it in the first place. Psychological scientists used to refuse to share their data. It's not so hard to imagine, and it's still partly true. However, my guess is that a few years from now, data-secrecy will be as unimaginable as smoking on an airplane is today. We've already come a long way. When in 2005 Jelte Wicherts, Dylan Molenaar, Judith Kats, and I asked 141 psychological scientists to send us their raw data to verify their analyses, many of them told us to get lost - even though, at the time of publishing the research, they had signed an agreement to share their data upon request. "I don't have time for this," one famous psychologist said bluntly, as if living up to a written agreement is a hobby rather than a moral responsibility. Many psychologists responded in the same way. If they responded at all, that is.

Like Diederik Stapel.

I remember that Judith Kats, the student in our group who prepared the emails asking researchers to make data available, stood in my office. She explained to me how researchers had responded to our emails. Although many researchers had refused to share data, some of our Dutch colleagues had done so in an extremely colloquial, if not downright condescending way. Judith asked me how she should respond. Should she once again inform our colleagues that they had signed an APA agreement, and that they were in violation of a moral code?

I said no.

It's one of the very few things in my scientific career that I regret. Had we pushed our colleagues to the limit, perhaps we would have been able to identify Stapel's criminal practices years earlier. As his autobiography shows, Stapel counterfeited his data in an unbelievably clumsy way, and I am convinced that we would have easily identified his data as fake. I had many reasons for saying no, which seemed legitimate at the time, but in hindsight I think my behavior was a sign of adaptation to a defective research culture. I had simply grown accustomed to the fact that, when I entered an elevator, conversations regarding statistical analyses would fall silent. I took it as a fact of life that, after we methodologists had explained to students how to analyze data in a responsible way, some of our colleagues would take it upon themselves to show students how scientific data analysis really worked (today, these practices are known as p-hacking). We all lived in a scientific version of The Matrix, in which the reality of research was hidden from all - except those who had the dubious honor of being initiated. There was the science that people reported and there was the science that people did.

At Groningen University, where Stapel used to work, he was known as The Lord of the Data, because he never let anyone near his SPSS files. He pulled results out of thin air, throwing them around as presents to his co-workers, and when anybody asked him to show the underlying data files, he simply didn't respond. Very few people saw this as problematic, because, hey, these were his data, and why should Stapel share his data with outsiders?

That was the moral order of scientific psychology. Data are private property. Nosy colleagues asking for data? Just chase them away, like you chase coyotes from your farm. That is why researchers had no problem whatsoever denying access to their data, and that is why several people saw the data-sharing request itself as unethical. "Why don't you trust us?," I recall one researcher saying in a suspicious tone of voice.

It is unbelievable how quickly things have changed.

In the wake of the Stapel case, the community of psychological scientists committed to openness, data-sharing, and methodological transparency quickly reached a critical mass. The Open Science Framework allows researchers to archive all of their research materials, including stimuli, analysis code, and data, and to make them public by simply pressing a button. The new Journal of Open Psychology Data offers an outlet specifically designed to publish datasets, thereby giving these the status of a publication. PsychDisclosure.org asks researchers to document decisions regarding, e.g., sample size determination and variable selection, that were left unmentioned in publications; most researchers provide this information without hesitation - some actually do so voluntarily. The journal Psychological Science will likely implement requirements for this type of information in the submission process. Data-archiving possibilities are growing like crazy. Major funding institutes require data-archiving or are preparing regulations that do. In the Reproducibility Project, hundreds of studies are being replicated in a concerted effort. As a major symbol of these developments, we now have the Center for Open Science, which facilitates the massive grassroots effort to open up the scientific regime in psychology.

If you had told me that any of this would happen back in 2005, I would have laughed you away, just as I would have laughed you away in 1990, had you told me that the future would involve such bizarre elements as smoke-free airplanes.

The moral order of research in psychology has changed. It has changed for the better, and I hope it has changed for good.

Welcome!

by

Welcome to the blog of the Open Science Collaboration! We are a loose network of researchers, professionals, citizen scientists, and others with an interest in open science, metascience, and good scientific practices. We’ll be writing about:

  • Open science topics like reproducibility, transparency in methods and analyses, and changing editorial and publication practices
  • Updates on open science initiatives like the Reproducibility Project and opportunities to get involved in new projects like the Archival Project
  • Interviews and opinions from researchers, developers, publishers, and citizen scientists working to make science more transparent, rigorous, or reproducible.

We hope that the blog will stimulate open discussion and help improve the way science is done!

If you'd like to suggest a topic for a post, propose a guest post or column, or get involved with moderation, promotion, or technical development, we would love to hear from you! Email us at oscbloggers@gmail.com or tweet @OSCbloggers.

The OSC is an open collaboration - anyone is welcome to join, contribute to discussions, or develop a project. The OSC blog is supported by the Center for Open Science, a non-profit organization dedicated to increasing openness, integrity and reproducibility of scientific research.
