The 20% Statistician

A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, August 4, 2017

Towards a more collaborative science with StudySwap

The replication crisis is over. Sure, not everyone has gotten the memo (either about it having started, or about it having ended) but the majority of scientists agree that (slightly) too many findings from the past cannot easily be replicated. The underlying reasons are clear: publication bias, flexibility in the data analysis, low power, and not enough rewards for replication studies. The solutions are also clear: registered reports, sample size justification, better statistics training, and publishing and funding replication research.

So, researchers optimistic about what else we can improve beyond reproducibility are already looking forward. Since the beginning of 2017 we have entered the theory crisis, which, among other things, makes it very difficult to falsify theories. Young scholars are already getting enthusiastic about the upcoming measurement crisis, where we finally come to grips with a largely ignored issue concerning our measurement tools.

But here, I want to focus on one of the greatest challenges I think our science will face: the need to collaborate. Because collaboration is such a tricky issue, I project it will take the most time of all the crises to solve – but I also expect we will be rewarded with a Golden Age once researchers figure out how to most effectively coordinate our collective resources.

Figure 1. List of crises in psychology, taken from a slide from an introduction to psychology lecture in 2076. Yes, we are still using Powerpoint in 2076.

However, some precocious individuals are trying to prove me wrong by showing collaboration is not just possible, but easy. Randy McCarthy and Chris Chartier have started StudySwap: A website where you can advertise ‘haves’ and ‘needs’ to indicate you can collect data for others, or you are looking for others to collect data for you.

At my department, we sometimes ‘StudySwap’ among colleagues. It’s difficult to get people to the lab for a short 15-minute study that pays 3 euro, so where possible we combine studies into longer sets that are financially more attractive for participants to come to the lab for. StudySwap broadens the scope of this swapping. If you have a small participant pool, you can get more participants at another university. If you are looking for special populations (e.g., people from different cultures), you can post a need. But as a teacher, I can also imagine posting several ‘haves’ for our research practicum next year, where 100 students need to collect data in small groups, and we could use a replication study from another lab as the topic for some groups. Or, it may be beneficial to find studies that are “ready to go” if you have a student who needs to complete a study during a fixed period of time (e.g., a semester or an academic year).

Now, Randy and Chris are taking StudySwap in exciting new directions. They are coordinating a Nexus (similar to a special issue, but open indefinitely) in the journal Collabra about Collections2 – crowd-sourced research projects where groups of researchers, or collections of researchers, collect data that will be analyzed by grouping all data together (such as RRRs, the ManyLabs projects, or the Pipeline Projects). This approach of designing (sets of) studies that will be aggregated and synthesized is known as a prospective meta-analysis. When pre-registered, it is the absolute state-of-the-art of doing science. The Nexus in Collabra will highlight some exciting ways in which such prospective meta-analyses can be designed, such as collecting conceptual replications, or examining different outcome measures or populations. Another example I could see happening is Collections2 that focus explicitly on sampling both individuals and stimuli from a larger population.

The nice thing about Collections2 in the Nexus special issue in Collabra is that submissions can be Registered Reports, so any accepted project that is successfully executed will lead to a publication. Registered Reports help to emphasize the proposed hypothesis and methods (as opposed to the observed results) and will likely provide an important incentive for recruiting contributing labs. Follow StudySwap (@Study_Swap) and Collabra (@CollabraOA) on Twitter for official announcements about how you can get involved with the upcoming Nexus. Although the Nexus is not accepting proposals quite yet, it is not too soon to start planning a potential crowd-sourced project.

In my personal experience, joining a collaborative research project (in my case, the RP:P) was perhaps one of the most educational experiences I had as a young scholar. It is worth the time just for how much you can learn, but obviously, it is very nice that your time and effort are also rewarded with a publication.

I couldn’t be more excited about what Randy and Chris are working on with StudySwap. This is what having a vision looks like. They have identified one of the major limitations of psychological science – funding individuals to perform research lines in relative isolation – and are trying to make psychological science better. I will be joining them by posting haves and responding to needs, if only to try to prove my own prediction wrong that we will enter a Collaboration Crisis in 2036. If we look at fields around us that face similar difficulties in collecting high quality data (e.g., medicine, physics) then we know collaboration on a larger scale will need to happen. Recent successful collaborative projects such as RP:P and ManyLabs show it is feasible to work together on replications. If we figure out how to collaborate on novel lines of research, I’m confident psychology will enter a golden age where important insights are generated with a reliability and speed that will impress the general public, greatly enhancing the reputation of psychological science.

Monday, July 3, 2017

Impossibly hungry judges

I was listening to a recent Radiolab episode on blame and guilt, where the guest Robert Sapolsky mentioned a famous study on judges handing out harsher sentences before lunch than after lunch. The idea is that their mental resources deplete over time, and they stop thinking carefully about their decision – until having a bite replenishes their resources. The study is well-known, and often (as in the Radiolab episode) used to argue how limited free will is, and how much of our behavior is caused by influences outside of our own control. I had never read the original paper, so I decided to take a look. 

During the podcast, it was mentioned that the percentage of favorable decisions drops from 65% to 0% over the number of cases that are decided upon. This sounded unlikely. I looked at Figure 1 from the paper (below), and I couldn’t believe my eyes. Not only is the drop indeed as large as mentioned – it occurs three times in a row over the course of the day, and after a break, it returns to exactly 65%!

I’m not the first person to be surprised by these data (thanks to Nick Brown for pointing me to these papers on Twitter). There was a published criticism of the study in PNAS (which no one reads or cites), and more recently, an article by Andreas Glöckner explaining how the data could be explained through a more plausible mechanism (for a nice write-up, see this blog by Tom Stafford). I appreciate that people have tried to think about which mechanism could cause this effect, and if you are interested, I highly recommend reading the commentaries (and perhaps even the response by the authors).

But I want to take a different approach in this blog. I think we should dismiss this finding, simply because it is impossible. Once we consider how impossibly large the effect size is, anyone with even a modest understanding of psychology should be able to conclude that this data pattern cannot be caused by a psychological mechanism. As psychologists, we shouldn’t teach or cite this finding, nor use it in policy decisions as an example of psychological bias in decision making.

As Glöckner notes, one surprising aspect of this study is the magnitude of the effect: ‘A drop of favorable decisions from 65% in the first trial to 5% in the last trial as observed in DLA is equivalent to an odds ratio of 35 or a standardized mean difference of d = 1.96 (Chinn, 2000)’.
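The conversion Glöckner cites from Chinn (2000) approximates Cohen's d from an odds ratio by dividing the log odds by π/√3. A minimal sketch of that conversion (the function name is mine):

```python
from math import log, pi, sqrt

def odds_ratio_to_d(odds_ratio):
    """Approximate Cohen's d from an odds ratio via the logistic
    distribution, as described by Chinn (2000): d = ln(OR) / (pi / sqrt(3))."""
    return log(odds_ratio) * sqrt(3) / pi

print(round(odds_ratio_to_d(35), 2))  # 1.96
```

An odds ratio of 35 indeed corresponds to the standardized mean difference of d = 1.96 quoted above.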

Some people dislike statistics. They are only interested in effects that are so large, you can see them by just plotting the data. This study might seem to be a convincing illustration of such an effect. My goal in this blog is to argue against this idea. You need statistics, maybe especially when effects are so large they jump out at you.

When reporting findings, authors should report and interpret effect sizes. An important reason for this is that effects can be impossibly large. An example I give in my MOOC is the Ig Nobel prize winning finding that suicide rates among white people increased with the amount of airtime dedicated to country music. The reported (but not interpreted) correlation was a whopping r = 0.54. I once went to a Dolly Parton concert with my wife. It was a great 2-hour show. If the true correlation between listening to country music and white suicide rates was 0.54, this would not have been a great concert, but a mass suicide.

For comparison: the difference in height between 21-year-old men and women in The Netherlands is approximately 13 centimeters, which is a Cohen’s d of 2. That’s the effect size in the hungry judges study.

If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion. Just as manufacturers take size differences between men and women into account when producing items such as golf clubs or watches, we would stop teaching in the hours before lunch, doctors would not schedule surgery, and driving before lunch would be illegal. If a psychological effect is this big, we don’t need to discover it and publish it in a scientific journal – you would already know it exists. Sort of like how the ‘after-lunch dip’ is a strong and replicable finding that you can feel yourself (and that, as it happens, directly conflicts with the finding that judges perform better immediately after lunch – surprisingly, the authors don’t discuss the after-lunch dip).

We can look at the review paper by Richard, Bond, and Stokes-Zoota (2003) to see which effect sizes in law psychology come close to a Cohen’s d of 2, and we find two that are slightly smaller. The first is the effect that a jury’s final verdict is likely to be the verdict a majority initially favored, which 13 studies show has an effect size of r = 0.63, or d = 1.62. The second is that when a jury is initially split on a verdict, its final verdict is likely to be lenient, which 13 studies show to have an effect size of r = 0.63 as well. In their entire database, some effect sizes that come close to d = 2 are the findings that personality traits are stable over time (r = 0.66, d = 1.76), that people who deviate from a group are rejected from that group (r = 0.60, d = 1.50), and that leaders have charisma (r = 0.62, d = 1.58). You might notice the almost tautological nature of these effects. The biggest effect in their database is for ‘psychological ratings are reliable’ (r = 0.75, d = 2.26) – if we try to develop a reliable rating, it is pretty reliable. Those are the types of effects that have a Cohen’s d of around 2: tautologies. And that is, supposedly, the effect size that the passing of time (and subsequently eating lunch) has on parole hearing decisions.
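The r values above can be converted to Cohen's d with the standard formula d = 2r/√(1−r²), which assumes two groups of equal size. A quick sketch to check the numbers in this paragraph:

```python
from math import sqrt

def r_to_d(r):
    """Convert a correlation coefficient r to Cohen's d,
    assuming two groups of equal size: d = 2r / sqrt(1 - r^2)."""
    return 2 * r / sqrt(1 - r ** 2)

# effect sizes reported by Richard, Bond, & Stokes-Zoota (2003)
for r in (0.63, 0.66, 0.60, 0.62, 0.75):
    print(f"r = {r:.2f} -> d = {r_to_d(r):.2f}")
```

Running this reproduces the d values in the text (up to rounding in the last decimal).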

I think it is telling that most psychologists don’t seem to be able to recognize data patterns that are too large to be caused by psychological mechanisms. There are simply no plausible psychological effects strong enough to cause the data pattern in the hungry judges study. Implausibility is not a reason to completely dismiss empirical findings, but impossibility is. It is up to authors to interpret the effect size in their study, and to show the mechanism through which an impossibly large effect becomes plausible. Without such an explanation, the finding should simply be dismissed.

Monday, June 19, 2017

Verisimilitude, Belief, and Progress in Psychological Science

Does science offer a way to learn what is true about our world? According to the perspective in philosophy of science known as scientific realism, the answer is ‘yes’. Scientific realism is the idea that successful scientific theories that have made novel predictions give us a good reason to believe these theories make statements about the world that are at least partially true. According to the no miracle argument, only realism can explain the success of science – its ability to repeatedly make successful predictions (Duhem, 1906) – without requiring us to believe in miracles.

Not everyone thinks that it matters whether scientific theories make true statements about the world, as scientific realists do. Laudan (1981) argues against scientific realism based on a pessimistic meta-induction: If theories that were deemed successful in the past turn out to be false, then we can reasonably expect all our current successful theories to be false as well. Van Fraassen (1980) believes it is sufficient for a theory to be ‘empirically adequate’, and make true predictions about things we can observe, irrespective of whether these predictions are derived from a theory that describes how the unobservable world is in reality. This viewpoint is known as constructive empiricism. As Van Fraassen summarizes the constructive empiricist perspective (1980, p.12): “Science aims to give us theories which are empirically adequate; and acceptance of a theory involves as belief only that it is empirically adequate”.

The idea that we should ‘believe’ scientific hypotheses is not something scientific realists can get behind. Either they think theories make true statements about things in the world, but we will have to remain completely agnostic about when they do (Feyerabend, 1993), or they think that corroborating novel and risky predictions makes it reasonable to believe that a theory has some ‘truth-likeness’, or verisimilitude. The concept of verisimilitude is based on the intuition that a theory is closer to a true statement when the theory allows us to make more true predictions, and fewer false predictions. When data are in line with predictions, a theory gains verisimilitude; when data are not in line with predictions, a theory loses verisimilitude (Meehl, 1978). Popper clearly intended verisimilitude to be different from belief (Niiniluoto, 1998). Importantly, verisimilitude refers to how close a theory is to the truth, which makes it an ontological, not an epistemological, question. That is, verisimilitude is a function of the degree to which a theory is similar to the truth, but it is not a function of the degree of belief in, or the evidence for, a theory (Meehl, 1978, 1990). It is also not necessary for a scientific realist that we ever know what is true – we just need to be of the opinion that we can move closer to the truth (known as comparative scientific realism, Kuipers, 2016).

Attempts to formalize verisimilitude have proven challenging, and from the perspective of an empirical scientist, the abstract nature of this ongoing discussion does not make me optimistic it will be extremely useful in everyday practice. On a more intuitive level, verisimilitude can be regarded as the extent to which a theory makes the most correct (and least incorrect) statements about specific features of the world. One way to think about this is the ‘possible worlds’ approach (Niiniluoto, 1999), where for each combination of basic states of the world one can predict, there is a possible world that contains that unique combination of states.

For example, consider the experiments by Stroop (1935), where color related words (e.g., RED, BLUE) are printed either in congruent colors (i.e., the word RED in red ink) or incongruent colors (i.e., the word RED in blue ink). We might have a very simple theory predicting that people automatically process irrelevant information in a task. When we do two versions of a Stroop experiment, one where people are asked to read the words, and one where people are asked to name the colors, this simple theory would predict slower responses on incongruent trials, compared to congruent trials. A slightly more advanced theory predicts that congruency effects are dependent upon the salience of the word dimension and color dimension (Melara & Algom, 2003). Because in the standard Stroop experiment the word dimension is much more salient in both tasks than the color dimension, this theory predicts slower responses on incongruent trials, but only in the color naming condition. We have four possible worlds, two of which represent predictions from either of the two theories, and two that are not in line with either theory. 

            Responses Color Naming    Responses Word Naming
World 1     Slower                    Slower
World 2     Slower                    Not Slower
World 3     Not Slower                Slower
World 4     Not Slower                Not Slower

In an unpublished working paper, Meehl (1990b) discusses a ‘box score’ of the number of successfully predicted features, which he acknowledges is too simplistic. No widely accepted formalized measure of verisimilitude is available to express the similarity between the successfully predicted features by a theory, although several proposals have been put forward (Niiniluoto, 1998; Oddie, 2013, for an example based on Tversky's (1977) contrast model, see Cevolani, Crupi, & Festa, 2011). However, even if formal measures of verisimilitude are not available, it remains a useful concept to describe theories that are assumed to be closer to the truth because they make novel predictions (Psillos, 1999).

As empirical scientists, our main job is to decide which features are present in our world. Therefore, we need to know if predictions made by theories are corroborated or falsified in experiments. To be able to falsify a theory, it needs to forbid certain states of the world (Lakatos, 1978). This is not easy, especially for probabilistic statements, which are the bread and butter of psychological science. Where a single black swan is clearly observable, probabilistic statements only reach their true predicted value in infinity, and every finite sample will show some variation around the predicted value. However, according to Popper, probabilistic statements can be made falsifiable by interpreting probability as the relative frequency of a result in a specified hypothetical series of observations, and by deciding that reproducible regularities are not attributed to randomness (Popper, 2002). Even though any finite sample will show some variation, we can decide upon a limit to that variation. Researchers can use the limit of variation that is allowed as a methodological rule, and decide whether a set of observations falls in a ‘forbidden’ or a ‘permitted’ state of the world, according to some theoretical prediction.

This methodological falsification (Lakatos, 1978) is clearly inspired by a Neyman-Pearson perspective on statistical inference. Popper (2002, p. 168) acknowledges feedback from the statistician Abraham Wald, who developed statistical decision theory based on the work by Neyman and Pearson (Wald, 1992). Lakatos (1978, p. 25) writes how we can make predictions falsifiable by “specifying certain rejection rules which may render statistically interpreted evidence 'inconsistent' with the probabilistic theory” and notes: “this methodological falsificationism is the philosophical basis of some of the most interesting developments in modern statistics. The Neyman-Pearson approach rests completely on methodological falsificationism”. To use methodological falsification, Popper describes how empirical researchers need to decide upon an interval within which the predicted value will fall. We can then calculate, for any number of observations, the probability that our value will indeed fall within this range, and design a study such that this probability is very high, or that its complementary probability, which Popper denotes by ε, is small. We can recognize this procedure as a Neyman-Pearson hypothesis test, where ε is the Type 2 error rate. In other words, high statistical power, or, when the null is true, a very low alpha level, can corroborate a hypothesis.

Popper distinguishes between subjective probabilities (where the degree of probability is expressed as a feeling of certainty, or belief) and objective probabilities (where probabilities are relative frequencies with which an event occurs in a specified range of observations). Popper strongly believed that the corroboration of tests should be based on frequentist, not Bayesian, probabilities (Popper, 2002, p. 434): “As to degree of corroboration, it is nothing but a measure of the degree to which a hypothesis h has been tested, and of the degree to which it has stood up to tests. It must not be interpreted, therefore, as a degree of the rationality of our belief in the truth of h”. For a scientific realist, who believes the main goal of scientists is to identify features of the world that corroborate or falsify theories, what matters is whether theories are truthlike, not whether you believe they are truthlike. As Taper and Lele (2011) express this viewpoint: “It is not that we believe that Bayes' rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” Indeed, if the goal is to identify the presence or absence of features in the world to develop more truth-like theories, we mainly need procedures that allow us to make choices about the presence or absence of these features with high accuracy. Subjective belief plays no role in these procedures.

To identify the presence or absence of features with high accuracy, we need a statistical procedure that allows us to make decisions while controlling the probability that we make an error. This idea is translated into practice in the hypothesis testing procedures put forward by Neyman and Pearson (1933): “We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.” Any procedure with good error control can be used (although Popper stresses that these findings should also be replicable). Some authors prefer likelihood ratios where error rates have maximum bounds (Royall, 1997; Taper & Ponciano, 2016), but in general, frequentist hypothesis tests are used where both the Type 1 error rate and the Type 2 error rate are controlled.

Meehl (1978) believes “the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology”. Meehl is of this opinion, not because hypothesis tests are not useful, but because they are not used to test risky predictions. Meehl remarks that “When I was a rat psychologist, I unabashedly employed significance testing in latent-learning experiments; looking back I see no reason to fault myself for having done so in the light of my present methodological views” (Meehl, 1990a). When one theory predicts rats learn nothing, and another theory predicts rats learn something, even Meehl believed testing the difference between an experimental and control group was a useful test of a theoretical prediction. However, Meehl believes that many hypothesis tests are used in a way such that they actually do not increase the verisimilitude of theories at all. If you predict gender differences, you will find them more often than not in a large enough sample. Because people cannot be randomly assigned to gender conditions, the null hypothesis is most likely false, not predicted by any theory, and therefore rejecting the null hypothesis does not increase the verisimilitude of any theory. But as a scientific realist, Meehl believes accepting or rejecting predictions is a sound procedure, as long as you test risky predictions in procedures with low error rates. Using such procedures, we have observed an asymmetry in the Stroop experiments, where the interference effect is much greater in the color naming task than in the word naming task, which leads us to believe the theory that takes into account the salience of the word and color dimensions has higher truth-likeness.

From a scientific realism perspective, Bayes factors or Bayesian posteriors do not provide an answer to the main question of interest, which is the verisimilitude of scientific theories. Belief can be used to decide which questions to examine, but it cannot be used to determine the truth-likeness of a theory. Obviously, if you reject realism, and follow anti-realist philosophical viewpoints such as Van Fraassen’s constructive empiricism, then you also reject verisimilitude, or the idea that theories can be closer to an unobservable and unknowable truth. I understand most psychologists do not choose their statistical approaches to follow logically from their philosophy of science, and instead follow norms or hypes. But I think it is useful to at least reflect upon basic questions. What is the goal of science? Can we approach the truth, or can we only believe in hypotheses? There should be some correspondence between your choice of statistical inferences and your philosophy of science. Whenever I tell a fellow scientist that I am not particularly interested in evidence, and that I think error control is the most important goal in science, people often look at me like I’m crazy, and talk to me like I’m stupid. I might be both – but I think my statements follow logically from a scientific realist perspective on science, and are perfectly in line with thoughts by Neyman, Popper, Lakatos, and Meehl.

A final benefit of being a scientific realist is that I can believe it is close to 100% certain that this blog post is wrong, but testing my ideas against the literature, it seems to have pretty high verisimilitude. Nevertheless, this is a topic I am not an expert on, so use the comments to identify features of my blog that are incorrect, so that we can improve its truth-likeness.


Cevolani, G., Crupi, V., & Festa, R. (2011). Verisimilitude and belief change for conjunctive theories. Erkenntnis, 75(2), 183.

Feyerabend, P. (1993). Against method (3rd ed.). London; New York: Verso.

Kuipers, T. A. F. (2016). Models, postulates, and generalized nomic truth approximation. Synthese, 193(10), 3057–3077.

Lakatos, I. (1978). The methodology of scientific research programmes: Volume 1: Philosophical papers (Vol. 1). Cambridge University Press.

Laudan, L. (1981). A confutation of convergent realism. Philosophy of Science, 48(1), 19–49.

Meehl, P. E. (1978). Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.

Meehl, P. E. (1990a). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.

Meehl, P. E. (1990b). Corroboration and verisimilitude: Against Lakatos’ “sheer leap of faith” (Working Paper No. MCPS-90-01). Minneapolis: University of Minnesota, Center for Philosophy of Science.

Melara, R. D., & Algom, D. (2003). Driven by information: A tectonic theory of Stroop effects. Psychological Review, 110(3), 422–471.

Neyman, J., & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694–706), 289–337.

Niiniluoto, I. (1998). Verisimilitude: The Third Period. The British Journal for the Philosophy of Science, 49, 1–29.

Niiniluoto, I. (1999). Critical Scientific Realism. Oxford University Press.

Oddie, G. (2013). The content, consequence and likeness approaches to verisimilitude: compatibility, trivialization, and underdetermination. Synthese, 190(9), 1647–1687.

Popper, K. R. (2002). The logic of scientific discovery. London; New York: Routledge.

Psillos, S. (1999). Scientific realism: how science tracks truth. London; New York: Routledge.

Royall, R. (1997). Statistical evidence: A likelihood paradigm. London; New York: Chapman and Hall/CRC.

Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643.

Taper, M. L., & Lele, S. R. (2011). Evidence, evidence functions, and error probabilities. In P. S. Bandyopadhyay & M. R. Forster (Eds.), Philosophy of statistics (pp. 513–531). Amsterdam: Elsevier.

Taper, M. L., & Ponciano, J. M. (2016). Evidential statistics as a statistical modern synthesis to support 21st century science. Population Ecology, 58(1), 9–29.

Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327.

Van Fraassen, B. C. (1980). The scientific image. Oxford: Clarendon Press; New York: Oxford University Press.

Wald, A. (1992). Statistical Decision Functions. In S. Kotz & N. L. Johnson (Eds.), Breakthroughs in Statistics (pp. 342–357). Springer New York.

Thursday, May 11, 2017

How a power analysis implicitly reveals the smallest effect size you care about

When designing a study, you need to justify the sample size you aim to collect. If one of your goals is to observe a p-value lower than the alpha level you decided upon (e.g., 0.05), one justification for the sample size can be a power analysis. A power analysis tells you the probability of observing a statistically significant effect, based on a specific sample size, alpha level, and true effect size. At our department, people who use power as a sample size justification need to aim for 90% power if they want to get money from the department to collect data.

A power analysis is performed based on the effect size you expect to observe. When you expect an effect with a Cohen’s d of 0.5 in a two-tailed independent t-test, and you use an alpha level of 0.05, you will have 90% power with 86 participants in each group. What this means is that only 10% of the distribution of effect sizes you can expect when d = 0.5 and n = 86 falls below the critical value required to get a p < 0.05 in an independent t-test.
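Assuming scipy is available, the 86-per-group figure can be reproduced from the noncentral t distribution. This is a sketch of the standard calculation; dedicated tools such as G*Power give the same numbers.

```python
from math import sqrt
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided independent-samples t-test for true effect d."""
    df = 2 * n_per_group - 2
    ncp = d * sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # critical t-value
    # probability of a t-value beyond either critical value under H1
    return (1 - stats.nct.cdf(t_crit, df, ncp)
            + stats.nct.cdf(-t_crit, df, ncp))

print(round(power_two_sample_t(0.5, 86), 3))
```

For d = 0.5 and 86 participants per group this returns a value just above .90, while 85 per group falls just short of it.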

In the figure below, the power analysis is visualized by plotting the distribution of Cohen’s d given 86 participants per group when the true effect size is 0 (or the null-hypothesis is true), and when d = 0.5. The blue area is the Type 2 error rate (the probability of not finding p < α, when there is a true effect).

You’ve probably seen such graphs before (indeed, G*Power, the widely used power analysis software, provides these graphs as output). The only thing I have done is to transform the t-value distribution that is commonly used in these graphs, and calculate the distribution for Cohen’s d. This is a straightforward transformation, but instead of presenting the critical t-value, the figure provides the critical d-value. I think people find it easier to interpret d than t. Only t-tests that yield a t ≥ 1.974, or a d ≥ 0.30, will be statistically significant. All effects smaller than d = 0.30 will never be statistically significant with 86 participants in each condition.
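The critical d-value follows directly from the critical t-value: d = t × √(1/n₁ + 1/n₂). A sketch of the computation (assuming scipy; the function name is mine):

```python
from math import sqrt
from scipy import stats

def critical_d(n_per_group, alpha=0.05):
    """Smallest observable Cohen's d that reaches p < alpha in a
    two-sided independent t-test with n participants per group."""
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit * sqrt(2 / n_per_group)  # d = t * sqrt(1/n1 + 1/n2)

print(round(stats.t.ppf(0.975, 170), 3))  # critical t with 86 per group: 1.974
print(round(critical_d(86), 2))           # critical d: 0.3
```

With 86 participants per group, the critical t of 1.974 translates to a critical d of 0.30, matching the figure.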
If you design a study where results will be analyzed with a two-tailed independent t-test with α = 0.05, the smallest observed effect size that can be statistically significant is determined exclusively by the sample size. The (unknown) true effect size only determines how far to the right the distribution of d-values lies, and thus which percentage of observed effect sizes will be larger than the critical d-value and therefore statistically significant (this percentage is the statistical power).
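The critical d-value follows directly from the critical t-value, since for two equal groups d = t × √(2/n). A minimal sketch (the function name is my own):

```python
from scipy import stats

def critical_d(n_per_group, alpha=0.05):
    """Smallest Cohen's d that can reach p < alpha in a two-tailed
    independent t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit * (2 / n_per_group) ** 0.5  # d = t * sqrt(1/n1 + 1/n2)

print(round(critical_d(86), 2))  # 0.3
```

Note that the true effect size does not appear anywhere in this function: only n and alpha determine which observed effects can be significant.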

I think it is reasonable to assume that if you decide to collect data for a study where you plan to perform a null-hypothesis significance test, you are not interested in effect sizes that will never be statistically significant. If you design a study that has 90% power for a medium effect of d = 0.5, the sample size you decide to use means effects smaller than d = 0.3 will never be statistically significant. We can use this fact to infer what your smallest effect size of interest, or SESOI (Lakens, 2014), will be. Unless you state otherwise, we can assume your SESOI is d = 0.3, and any effects smaller than this effect size are considered too small to be interesting. Obviously, you are free to explicitly state that any effect smaller than d = 0.5 or d = 0.4 is already too small to matter for theoretical or practical purposes. But without such an explicit statement about what your SESOI is, we can infer it from your power analysis.

This is useful. Researchers who use null-hypothesis significance testing often only specify the effect they expect when the null is true (d = 0), but not the smallest effect size that should still be considered support for their theory when there is a true effect. This leads to a psychological science that is unfalsifiable (Morey & Lakens, under review). Alternative approaches to determining the smallest effect size of interest have recently been suggested. For example, Simonsohn (2015) suggested setting the smallest effect size of interest to the effect size the original study had 33% power to detect. If an original study used 20 participants per group, the smallest effect size of interest would be d = 0.49 (the effect size they had 33% power to detect with n = 20).

Let’s assume the original study used a sample size of n = 20 per group. The figure below shows that an observed effect size of d = 0.8 would be statistically significant (d = 0.8 lies to the right of the critical d-value), but that the critical d-value is d = 0.64. That means that effects smaller than d = 0.64 would never be statistically significant in a study with 20 participants per group in a between-subjects design. I think it makes more sense to assume the smallest effect size of interest for researchers who design a study with n = 20 is d = 0.64, rather than d = 0.49. 
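Both numbers can be reproduced from the formulas above; the sketch below (the function names are my own, scipy supplies the noncentral t-distribution) solves for the effect size a study with n = 20 per group detects with 33% power:

```python
from scipy import stats, optimize

def two_sample_power(d, n, alpha=0.05):
    """Power of a two-tailed independent t-test, equal group sizes."""
    df = 2 * n - 2
    ncp = d * (n / 2) ** 0.5
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

def d_for_power(target, n, alpha=0.05):
    """Effect size that an n-per-group study detects with the target power."""
    return optimize.brentq(lambda d: two_sample_power(d, n, alpha) - target,
                           0.01, 3)

print(round(d_for_power(0.33, 20), 2))  # ≈ 0.49, Simonsohn's value for n = 20
```

Compare this d = 0.49 to the critical d-value of 0.64 for n = 20: the 33%-power effect size lies below the smallest effect the study could ever declare significant.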

The figures can be produced by a new Shiny app I created (the Shiny app also plots power curves and the p-value distribution; not all of them are visible here, but you can try HERE as long as bandwidth lasts, or just grab the code and app from GitHub; I might discuss these figures in a future blog post). If you have designed your next study, check the critical d-value to make sure that the smallest effect size you care about isn’t smaller than the critical effect size you can actually detect. If you think smaller effects are interesting, but you don’t have the resources, specify your SESOI explicitly in your article. You can also use this smallest effect size of interest in an equivalence test, which allows you to statistically reject the presence of effects as large or larger than your SESOI (Lakens, 2017), and helps when interpreting t-tests where p > α. In short, we really need to start specifying the effects we expect under the alternative model, and if you don’t know where to start, your power analysis might have been implicitly telling you what your smallest effect size of interest is.
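As a sketch of how such an equivalence test works (a simplified TOST procedure for two equal groups; the function name and example numbers are my own, see Lakens, 2017, for the full procedure):

```python
from scipy import stats

def tost_p(m1, sd1, m2, sd2, n, sesoi_d, alpha=0.05):
    """Two one-sided tests (TOST) against equivalence bounds of
    +/- sesoi_d (in Cohen's d units), for two groups of n each.
    Equivalence is claimed when the returned p-value is below alpha."""
    df = 2 * n - 2
    sd_pooled = ((sd1 ** 2 + sd2 ** 2) / 2) ** 0.5  # equal group sizes
    se = sd_pooled * (2 / n) ** 0.5
    bound = sesoi_d * sd_pooled  # equivalence bound in raw units
    diff = m1 - m2
    p_lower = stats.t.sf((diff + bound) / se, df)   # H0: diff <= -bound
    p_upper = stats.t.cdf((diff - bound) / se, df)  # H0: diff >= +bound
    return max(p_lower, p_upper)

# Two groups of 86 with no observed difference: with a SESOI of
# d = 0.3, the effect is statistically equivalent to zero (p < .05).
print(tost_p(0.0, 1.0, 0.0, 1.0, 86, 0.3) < 0.05)  # True
```

This is the mirror image of the power analysis above: the same n = 86 that makes d = 0.3 the critical value for significance also makes d = 0.3 a bound you can just barely reject in an equivalence test when the observed difference is zero.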

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710.

Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science.

Morey, R. D., & Lakens, D. (under review). Why most of psychology is statistically unfalsifiable.

Simonsohn, U. (2015). Small Telescopes: Detectability and the Evaluation of Replication Results. Psychological Science, 26(5), 559–569.

Friday, April 14, 2017

Five reasons blog posts are of higher scientific quality than journal articles

The Dutch toilet cleaner ‘WC-EEND’ (literally: 'Toilet Duck') aired a famous commercial in 1989 that had the slogan ‘We from WC-EEND advise… WC-EEND’. It is now a common saying in The Netherlands whenever someone gives an opinion that is clearly aligned with their self-interest. In this blog, I will examine the hypothesis that blogs are, on average, of higher quality than journal articles. Below, I present 5 arguments in favor of this hypothesis.  [EDIT: I'm an experimental psychologist. Mileage of what you'll read below may vary in other disciplines].

1. Blogs have Open Data, Code, and Materials

When you want to evaluate scientific claims, you need access to the raw data, the code, and the materials. Most journals do not (yet) require authors to make their data publicly available (whenever possible). The worst case example when it comes to data sharing is the American Psychological Association. In the ‘Ethical Principles of Psychologists and Code of Conduct’ of this professional organization that supported torture, point 8.14 says that psychologists only have to share data when asked by ‘competent professionals’ who aim to ‘verify claims’, and that researchers can charge money to compensate for any costs incurred in responding to a request for data. Despite empirical evidence that most scientists do not share their data when asked, the APA considers this ‘ethical conduct’. It is not. It’s an insult to science. But it’s the standard that many relatively low quality scientific journals, such as the Journal of Experimental Psychology: General, hide behind to practice closed science.

On blogs, the norm is to provide access to the underlying data, code, and materials. For example, here is Hanne Watkins, who uses data she collected to answer some questions about the attitudes of early career researchers and researchers with tenure towards replications. She links to the data and materials, which are all available on the OSF. Most blogs on statistics will link to the underlying code, such as this blog by Will Gervais on whether you should run well-powered studies or many small-powered studies. On average, it seems to me almost all blogs practice open science to a much higher extent than scientific journals.

2. Blogs have Open Peer Review

Scientific journal articles use peer review as quality control. The quality of the peer review process is only as high as the quality of the peers involved in it, and the process is only as unbiased as those peers are. For most scientific journal articles, I cannot see who reviewed a paper, or check the quality, or the presence of bias, because the reviews are not open. Some of the highest quality journals in science, such as PeerJ and Royal Society Open Science, have Open Peer Review, and journals like Frontiers at least specify the names of the reviewers of a publication. Most low quality journals (e.g., Science, Nature) have 100% closed peer review, and we don’t even know the name of the handling editor of a publication. It is often impossible to know whether articles were peer reviewed to begin with, and what the quality of the peer review process was.

Some blogs have Open pre-publication Peer Review. If you read the latest DataColada blog post, you can see the two reviews of the post by experts in the field (Tom Stanley and Joe Hilgard) and several other people who shared thoughts before the post went online. On my blog, I sometimes ask people for feedback before I put a blog post online (and these people are thanked in the blog if they provided feedback), but I also have a comment section. This allows people to point out errors and add comments, and you can see how much support or criticism a blog has received. For example, in this blog on why omega squared is a better effect size to use than eta-squared, you can see why Casper Albers disagreed by following a link to a blog post he wrote in response. Overall, the peer review process in blog posts is much more transparent. If you see no comments on a blog post, you have the same information about the quality of the peer review process as you’d have for the average Science article. Sure, you may have subjective priors about the quality of the review process at Science (ranging from ‘you get in if your friend is an editor’ to ‘it’s very rigorous’) but you don’t have any data. But if a blog has comments, at least you can see what peers thought about a blog post, giving you some data, and often very important insights and alternative viewpoints.

3. Blogs have no Eminence Filter

Everyone can say anything they want on a blog, as long as it does not violate laws regarding freedom of speech. It is an egalitarian and democratic medium. This aligns with the norms in science. As Merton (1942) writes: “The acceptance or rejection of claims entering the lists of science is not to depend on the personal or social attributes of their protagonist; his race, nationality, religion, class, and personal qualities are as such irrelevant.” We see even Merton was a child of his times – he of course meant that his *or her* race, etcetera, is irrelevant.

Everyone can write a blog, but not everyone is allowed to publish in a scientific journal. As one example, criticism recently arose about a special section in Perspectives on Psychological Science about ‘eminence’ in which the only contribution from a woman was about gender and eminence. It was then pointed out that this special section only included the perspectives on eminence by old American men, and that there might be an issue with diversity in viewpoints in this outlet.

I was personally not very impressed by the published articles in this special section, probably because the views on how to do science as expressed by this generation of old American men does not align with my views on science. I have nothing against old (or dead) American men in general (Meehl be praised), but I was glad to hear some of the most important voices in my scientific life submitted responses to this special issue. Regrettably, all these responses were rejected. Editors can make those choices, but I am worried about the presence of an Eminence Filter in science, especially one that in this specific case filters out some of the voices that have been most important in shaping me as a scientist. Blogs allows these voices to be heard, which I think is closer to the desired scientific norms discussed by Merton.

4. Blogs have Better Error Correction

In a 2014 article, we published a Table 1 of sample sizes required to design informative studies for different statistical approaches. We stated these are sample sizes per condition, but for 2 columns, these are actually the total sample sizes you need. We corrected this in an erratum. I know this erratum was published, and I would love to link to it, but honest to Meehl, I cannot find it. I just spent 15 minutes searching for it in any way I can think of, but there is no link to it on the journal website, and I can’t find it in Google Scholar. I don’t see how anyone will become aware of this error when they download our article.

When I make an error in a blog post, I can go in and update it. I am pretty confident that I make approximately as many errors in my published articles as I make in my blog posts, but the latter are much easier to fix, and thus, I would consider my blogs more error-free, and of higher quality. There are some reasons why you can not just update scientific articles (we need a stable scientific record), and there might be arguments for better and more transparent version control of blog posts, but for the consumer, it’s just very convenient that mistakes can easily be fixed in blogs, and that you will always read the best version.

5. Blogs are Open Access (and might be read more).

It’s obvious that blogs are open access. This is a desirable property of high quality science. It makes the content more widely available, and I would not be surprised (though I have no data) if blog posts are *on average* read more than scientific articles because they are more accessible. Getting page views is not, per se, an indication of scientific quality. A video on Pen Pineapple Apple Pen gets close to 8 million views, and we don’t consider that high quality music (I hope). But views are one way to measure how much impact blogs have on what scientists think.

I only have data for page views from my own blog. I’ve made a .csv file with the page views of all my blog posts publicly available (so you can check my claims about the page views of specific blog posts below, cf. point 1 above). There is very little research on the impact of blogs on science. They are not cited a lot (even though you can formally cite them), but they can have clear impact, and it would be interesting to study how big their impact is. I think it would be a fun project to compare the impact of blogs with the impact of scientific articles more formally. It should be a fun thesis project for someone studying scientometrics.

Some blog posts that I wrote get more views than the articles I comment on. Take a commentary blog post I wrote on a paper which suggested there was ‘A surge of p-values between 0.041 and 0.049 in recent decades’. The paper received 7147 views at the time of writing; my blog post has received 11285 views so far. But it is not universally true that my blogs get more pageviews than the articles I comment on. A commentary I wrote on a horribly flawed paper by Gilbert and colleagues in Science, where they misunderstood how confidence intervals work, has only received 12190 hits so far, while the article info of their Science article tells me their article received three times as many views for the abstract (36334), and also more views for the full text (19124). On the other hand, I do have blog posts that have gotten more views than this specific Science article (e.g., this post on Welch’s t-test, which has 38127 hits so far). I guess the main point of these anecdotes is not surprising, but nevertheless worthwhile to point out: Blogs are read, sometimes a lot.


I’ve tried to measure blogs and journal articles on some dimensions that, I think, determine their scientific quality. It is my opinion that blogs, on average, score better on some core scientific values, such as open data and code, transparency of the peer review process, egalitarianism, error correction, and open access. It is clear blogs impact the way we think and how science works. For example, Sanjay Srivastava’s pottery barn rule, proposed in a 2012 blog, will be implemented in the journal Royal Society Open Science. This shows blogs can be an important source of scientific communication. If the field agrees with me, we might want to more seriously consider the curation of blogs, to make sure they won’t disappear in the future, and maybe even facilitate assigning DOIs to blogs, and the citation of blog posts.

Before this turns into a ‘we who write blogs recommend blogs’ post, I want to make clear that there is no intrinsic reason why blogs should have higher scientific quality than journal articles. It’s just that the authors of most blogs I read put some core scientific values into practice to a greater extent than editorial boards at journals. I am not recommending we stop publishing in journals, but I want to challenge the idea that journal publications are the gold standard of scientific output. They fall short on some important dimensions of scientific quality, where they are outperformed by blog posts. Pointing this out might inspire some journals to improve their current standards.