Not everyone thinks that it matters whether scientific theories make true statements about the world, as scientific realists do. Laudan (1981) argues against scientific realism based on a pessimistic metainduction: If theories that were deemed successful in the past turn out to be false, then we can reasonably expect all our current successful theories to be false as well. Van Fraassen (1980) believes it is sufficient for a theory to be ‘empirically adequate’, and make true predictions about things we can observe, irrespective of whether these predictions are derived from a theory that describes how the unobservable world is in reality. This viewpoint is known as constructive empiricism. As Van Fraassen summarizes the constructive empiricist perspective (1980, p.12): “Science aims to give us theories which are empirically adequate; and acceptance of a theory involves as belief only that it is empirically adequate”.
The idea that we should ‘believe’ scientific hypotheses is not something scientific realists can get behind. Either they think theories make true statements about things in the world, but we will have to remain completely agnostic about when they do (Feyerabend, 1993), or they think that corroborating novel and risky predictions makes it reasonable to believe that a theory has some ‘truthlikeness’, or verisimilitude. The concept of verisimilitude is based on the intuition that a theory is closer to a true statement when the theory allows us to make more true predictions, and fewer false predictions. When data are in line with predictions, a theory gains verisimilitude; when data are not in line with predictions, a theory loses verisimilitude (Meehl, 1978). Popper clearly intended verisimilitude to be different from belief (Niiniluoto, 1998). Importantly, verisimilitude refers to how close a theory is to the truth, which makes it an ontological, not epistemological question. That is, verisimilitude is a function of the degree to which a theory is similar to the truth, but it is not a function of the degree of belief in, or the evidence for, a theory (Meehl, 1978, 1990). It is also not necessary for a scientific realist that we ever know what is true – we just need to be of the opinion that we can move closer to the truth (known as comparative scientific realism, Kuipers, 2016).
Attempts to formalize verisimilitude have been a challenge, and from the perspective of an empirical scientist, the abstract nature of this ongoing discussion does not really make me optimistic it will be extremely useful in everyday practice. On a more intuitive level, verisimilitude can be regarded as the extent to which a theory makes the most correct (and least incorrect) statements about specific features in the world. One way to think about this is using the ‘possible worlds’ approach (Niiniluoto, 1999), where for each basic state of the world one can predict, there is a possible world that contains each unique combination of states.
For example, consider the experiments by Stroop (1935), where color-related words (e.g., RED, BLUE) are printed either in congruent colors (e.g., the word RED in red ink) or incongruent colors (e.g., the word RED in blue ink). We might have a very simple theory predicting that people automatically process irrelevant information in a task. When we do two versions of a Stroop experiment, one where people are asked to read the words, and one where people are asked to name the colors, this simple theory would predict slower responses on incongruent trials, compared to congruent trials, in both versions. A slightly more advanced theory predicts that congruency effects are dependent upon the salience of the word dimension and color dimension (Melara & Algom, 2003). Because in the standard Stroop experiment the word dimension is much more salient in both tasks than the color dimension, this theory predicts slower responses on incongruent trials, but only in the color naming condition. We have four possible worlds, two of which represent predictions from either of the two theories, and two that are not in line with either theory.
World      Responses Color Naming    Responses Word Naming
World 1    Slower                    Slower
World 2    Slower                    Not Slower
World 3    Not Slower                Slower
World 4    Not Slower                Not Slower

In an unpublished working paper, Meehl (1990b) discusses a ‘box score’ of
the number of successfully predicted features, which he acknowledges is too
simplistic. No widely accepted formalized measure of verisimilitude is
available to express the similarity between the features successfully predicted
by a theory, although several proposals have been put forward (Niiniluoto, 1998; Oddie, 2013; for an example based on Tversky's (1977) contrast model, see Cevolani, Crupi, & Festa, 2011). However, even if formal measures
of verisimilitude are not available, it remains a useful concept to describe
theories that are assumed to be closer to the truth because they make novel
predictions (Psillos, 1999).
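To make the ‘possible worlds’ idea and Meehl's box score a bit more concrete, here is a minimal sketch in Python. It enumerates the four worlds for the two Stroop features and counts how many features each theory predicts correctly. The theory labels, the function name `box_score`, and above all the choice of World 2 as the ‘actual’ world are assumptions made purely for illustration; this is the simplistic count Meehl himself considers too crude, not a formal measure of verisimilitude.

```python
from itertools import product

# The two features from the Stroop example: are responses slower on
# incongruent trials in each task?
FEATURES = ("color_naming", "word_naming")

# Every combination of feature states is one possible world (four in total),
# generated in the same order as World 1 to World 4 in the table.
possible_worlds = [dict(zip(FEATURES, states))
                   for states in product((True, False), repeat=len(FEATURES))]

# Predictions of the two theories as described in the text
# (the labels are hypothetical shorthand).
theories = {
    "automatic processing": {"color_naming": True, "word_naming": True},
    "dimensional salience": {"color_naming": True, "word_naming": False},
}


def box_score(prediction, world):
    """Count the features on which a prediction matches a given world."""
    return sum(prediction[f] == world[f] for f in FEATURES)


# ASSUMPTION for illustration only: treat World 2 as the actual world.
actual_world = {"color_naming": True, "word_naming": False}

scores = {name: box_score(pred, actual_world) for name, pred in theories.items()}
for name, score in scores.items():
    print(f"{name}: {score} of {len(FEATURES)} features predicted correctly")
```

A plain count like this treats every feature as equally important, which is one reason a box score is not a satisfying formalization.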
As empirical scientists, our main job is to decide which
features are present in our world. Therefore, we need to know if predictions made by theories are
corroborated or falsified in experiments. To be able to falsify a
theory, it needs to forbid certain states of the world (Lakatos, 1978). This is not easy, especially
for probabilistic statements, which are the bread and butter of psychological
science. Where a single black swan is clearly observable, probabilistic
statements only reach their true predicted value in infinity, and every finite
sample will have some variation around the predicted value. However, according
to Popper, probabilistic statements can be made
falsifiable by interpreting probability as the relative frequency of a result
in a specified hypothetical series of observations, and by deciding that
reproducible regularities are not attributed to randomness (Popper, 2002). Even though any finite
sample will show some variation, we can decide upon a limit of the variation. Researchers
can use the limit of variation that is allowed as a methodological rule, and decide whether a set of observations falls
in a ‘forbidden’ state of the world, or in a ‘permitted’ state of the world,
according to some theoretical prediction.
This methodological
falsification (Lakatos, 1978) is clearly inspired by a
Neyman-Pearson perspective on statistical inferences. Popper (2002, p. 168) acknowledges
feedback from the statistician Abraham Wald, who developed statistical decision
theory based on the work by Neyman and Pearson (Wald, 1992). Lakatos (1978, p. 25) writes
how we can make predictions falsifiable by “specifying certain rejection rules
which may render statistically interpreted evidence 'inconsistent' with the
probabilistic theory” and notes: “this methodological falsificationism is the
philosophical basis of some of the most interesting developments in modern
statistics. The Neyman-Pearson approach rests completely on methodological
falsificationism”. To use methodological falsification, Popper describes how
empirical researchers need to decide upon an interval within which the
predicted value will fall. We can then calculate for any number of observations
the probability that our value will indeed fall within this range, and design a
study such that this probability is very high, or that its complementary
probability, which Popper denotes by ε,
is small. We can recognize this procedure as a Neyman-Pearson hypothesis test,
where ε is the Type 2
error rate. In other words, high statistical power, or when the null is true, a
very low alpha level, can corroborate a hypothesis.
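The interval procedure described above can be sketched numerically. The snippet below assumes, purely for illustration, a binomial prediction (a predicted relative frequency `p`) and an allowed deviation `delta` around it; neither the numbers nor the function names come from Popper. It computes, for several sample sizes, the probability ε that the observed relative frequency falls outside the interval even though the prediction is correct.

```python
from math import ceil, exp, floor, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def prob_within(n, p, delta):
    """P(|k/n - p| <= delta) when k follows Binomial(n, p)."""
    # Small tolerance guards against floating-point noise at the interval edges.
    lo = max(0, ceil(n * (p - delta) - 1e-9))
    hi = min(n, floor(n * (p + delta) + 1e-9))
    return sum(exp(log_binom_pmf(k, n, p)) for k in range(lo, hi + 1))


def epsilon(n, p, delta):
    """Complementary probability: the observed frequency falls outside the interval."""
    return 1.0 - prob_within(n, p, delta)


# Hypothetical prediction: relative frequency 0.5, allowed deviation 0.05.
for n in (20, 100, 500, 2000):
    print(f"n = {n:5d}  epsilon = {epsilon(n, 0.5, 0.05):.5f}")
```

With the interval fixed in advance, ε shrinks as the number of observations grows, which is what allows a researcher to design a study in which landing outside the interval counts as a ‘forbidden’ outcome.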
Popper distinguishes between subjective probabilities (where
the degree of probability is expressed as feelings of certainty, or, belief),
and objective probabilities (where probabilities are relative frequencies with
which an event occurs in a specified range of observations). Popper strongly believed that the corroboration
of tests should be based on Frequentist, not Bayesian, probabilities (Popper,
2002, p. 434): “As to degree of corroboration, it is nothing but a measure of the
degree to which a hypothesis h has been tested, and of the degree to which it
has stood up to tests. It must not be interpreted, therefore, as a degree of
the rationality of our belief in the truth of h”. For a scientific
realist, who believes the main goal of scientists is to identify features of
the world that corroborate or falsify theories, what matters is whether
theories are truthlike, not whether you believe
they are truthlike. As Taper and Lele (2011) express this viewpoint: “It
is not that we believe that Bayes' rule or Bayesian mathematics is flawed, but
that from the axiomatic foundational definition of probability Bayesianism is
doomed to answer questions irrelevant to science. We do not care what you
believe, we barely care what we believe, what we are interested in is what you
can show.” Indeed, if the goal is to identify the presence or absence of
features in the world to develop more truthlike theories, we mainly need
procedures that allow us to make choices about the presence or absence of these
features with high accuracy. Subjective belief plays no role in these
procedures.
To identify the presence or absence of features with high
accuracy, we need a statistical procedure that allows us to make decisions
while controlling the probability we make an error. This idea is translated
into practice in hypothesis testing procedures put forward by Neyman and
Pearson (1933): “We are inclined to think
that as far as a particular hypothesis is concerned, no test based upon the
theory of probability can by itself provide any valuable evidence of the truth
or falsehood of that hypothesis. But we may look at the purpose of tests from
another viewpoint. Without
hoping to know whether each separate hypothesis is true or false, we may search
for rules to govern our behaviour with regard to them, in following which we
insure that, in the long run of experience, we shall not be too often wrong.” Any
procedure with good error control can be used (although Popper stresses that
these findings should also be replicable). Some authors prefer likelihood
ratios where error rates have maximum bounds (Royall, 1997; Taper &
Ponciano, 2016),
but in general, frequentist hypothesis tests are used where both the Type 1
error rate and the Type 2 error rate are controlled.
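What “not too often wrong in the long run” means can be illustrated with a small simulation. The sketch below is a simplification under stated assumptions: a two-sided z-test with known standard deviation, 50 observations per study, an alpha of 0.05, and a hypothetical true effect of 0.5 standard deviations when the alternative is true. All function names and numbers are made up for the example.

```python
import random
from statistics import NormalDist, mean


def z_test_rejects(sample, mu0, sigma, alpha):
    """Two-sided z-test with known sigma: True if H0 (mean == mu0) is rejected."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def long_run_rejection_rate(true_mean, n, alpha, n_studies, seed):
    """Proportion of simulated studies in which H0 (mean == 0) is rejected."""
    rng = random.Random(seed)
    rejections = sum(
        z_test_rejects([rng.gauss(true_mean, 1.0) for _ in range(n)], 0.0, 1.0, alpha)
        for _ in range(n_studies)
    )
    return rejections / n_studies


# When the null is true, every rejection is a Type 1 error.
type1_rate = long_run_rejection_rate(true_mean=0.0, n=50, alpha=0.05,
                                     n_studies=5000, seed=1)
# When the (hypothetical) effect of 0.5 is real, every non-rejection is a Type 2 error.
power = long_run_rejection_rate(true_mean=0.5, n=50, alpha=0.05,
                                n_studies=5000, seed=2)

print(f"Long-run Type 1 error rate: {type1_rate:.3f}")
print(f"Long-run Type 2 error rate: {1 - power:.3f}")
```

No single simulated study tells us whether its hypothesis is true; the procedure only guarantees how often we will be wrong across many studies, which is the point of the Neyman and Pearson quote.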
Meehl (1978) believes “the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology”. Meehl is of this opinion not because hypothesis tests are not useful, but because they are not used to test risky predictions. Meehl remarks that “When I was a rat psychologist, I unabashedly employed significance testing in latent-learning experiments; looking back I see no reason to fault myself for having done so in the light of my present methodological views” (Meehl, 1990a). When one theory predicts rats learn nothing, and another theory predicts rats learn something, even Meehl believed testing the difference between an experimental and control group was a useful test of a theoretical prediction. However, Meehl believes that many hypothesis tests are used in such a way that they do not actually increase the verisimilitude of theories at all. If you predict gender differences, you will find them more often than not in a large enough sample. Because people can not be randomly assigned to gender conditions, the null hypothesis is most likely false, not predicted by any theory, and therefore rejecting the null hypothesis does not increase the verisimilitude of any theory. But as a scientific realist, Meehl believes accepting or rejecting predictions is a sound procedure, as long as you test risky predictions in procedures with low error rates. Using such procedures, we have observed an asymmetry in the Stroop experiments, where the interference effect is much greater in the color naming task than in the word naming task, which leads us to believe the theory that takes into account the salience of the word and color dimensions has higher truthlikeness.
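Meehl's worry about non-risky predictions can be illustrated with a quick power calculation. The sketch below assumes a trivially small true standardized difference of d = 0.05 between two non-randomized groups (a made-up value) and, as a simplification, a two-sided two-sample z-test with known variance. It shows how the probability of rejecting the null hypothesis changes with the sample size.

```python
from statistics import NormalDist


def power_two_sided_z(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test for a standardized difference d."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic under the assumed true difference.
    ncp = d * (n_per_group / 2) ** 0.5
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)


# Hypothetical, theoretically uninteresting difference between the groups.
tiny_effect = 0.05
for n in (100, 1000, 10000, 100000):
    print(f"n per group = {n:6d}  P(reject H0) = {power_two_sided_z(tiny_effect, n):.3f}")
```

If almost any non-randomized comparison has some small non-zero difference, a large enough sample will reject the null nearly every time, so the rejection forbids very little and adds little to the verisimilitude of the theory that predicted ‘a difference’.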
From a
scientific realism perspective, Bayes Factors or Bayesian posteriors do not provide
an answer to the main question of interest, which is the verisimilitude of
scientific theories. Belief can be used to decide which questions to
examine, but it can not be used to determine the truthlikeness of a theory. Obviously,
if you reject realism, and follow anti-realist philosophical viewpoints such as
Van Fraassen’s constructive empiricism, then you also reject verisimilitude, or the
idea that theories can be closer to an unobservable and unknowable truth. I
understand most psychologists do not choose their statistical approaches to
follow logically from their philosophy of science, and instead follow norms or
hypes. But I think it is useful to at least reflect upon basic questions. What
is the goal of science? Can we approach the truth, or can we only believe in hypotheses? There should be
some correspondence between your choice of statistical inferences, and your
philosophy of science. Whenever I tell a fellow scientist that I am not
particularly interested in evidence, and that I think error control is the most
important goal in science, people often look at me like I’m crazy, and talk to
me like I’m stupid. I might be both – but I think my statements follow
logically from a scientific realist perspective on science, and are perfectly
in line with thoughts by Neyman, Popper, Lakatos, and Meehl.
A final benefit of being a scientific realist is that I can
believe it is close to 100% certain that this blog post is wrong, but testing
my ideas against the literature, it seems to have pretty high verisimilitude.
Nevertheless, this is a topic I am not an expert on, so use the comments to
identify features of my blog that are incorrect, so that we can improve its
truthlikeness.
References
Cevolani, G., Crupi, V., & Festa, R. (2011). Verisimilitude and belief change for conjunctive theories. Erkenntnis, 75(2), 183.
Feyerabend, P. (1993). Against method (3rd ed.). London; New York: Verso.
Kuipers, T. A. F. (2016). Models, postulates, and generalized nomic truth approximation. Synthese, 193(10), 3057–3077. https://doi.org/10.1007/s11229-015-0916-9
Lakatos, I. (1978). The methodology of scientific research programmes: Volume 1: Philosophical papers (Vol. 1). Cambridge University Press.
Laudan, L. (1981). A confutation of convergent realism. Philosophy of Science, 48(1), 19–49.
Meehl, P. E. (1978). Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.
Meehl, P. E. (1990a). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.
Meehl, P. E. (1990b). Corroboration and verisimilitude: Against Lakatos’ “sheer leap of faith” (Working Paper, MCPS9001). Minneapolis: University of Minnesota, Center for Philosophy of Science. Retrieved from http://meehl.umn.edu/sites/g/files/pua1696/f/146corroborationverisimilitude.pdf
Melara, R. D., & Algom, D. (2003). Driven by information: A tectonic theory of Stroop effects. Psychological Review, 110(3), 422–471. https://doi.org/10.1037/0033-295X.110.3.422
Neyman, J., & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009
Niiniluoto, I. (1998). Verisimilitude: The Third Period. The British Journal for the Philosophy of Science, 49, 1–29.
Niiniluoto, I. (1999). Critical Scientific Realism. Oxford University Press.
Oddie, G. (2013). The content, consequence and likeness approaches to verisimilitude: compatibility, trivialization, and underdetermination. Synthese, 190(9), 1647–1687. https://doi.org/10.1007/s11229-011-9930-8
Popper, K. R. (2002). The logic of scientific discovery. London; New York: Routledge.
Psillos, S. (1999). Scientific realism: how science tracks truth. London; New York: Routledge.
Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. London; New York: Chapman and Hall/CRC.
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643.
Taper, M. L., & Lele, S. R. (2011). Evidence, evidence functions, and error probabilities. In P. S. Bandyopadhyay & M. R. Forster (Eds.), Philosophy of Statistics (pp. 513–531). Elsevier, USA.
Taper, M. L., & Ponciano, J. M. (2016). Evidential statistics as a statistical modern synthesis to support 21st century science. Population Ecology, 58(1), 9–29.
Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327.
Van Fraassen, B. C. (1980). The scientific image. Oxford; New York: Clarendon Press; Oxford University Press.
Wald, A. (1992). Statistical Decision Functions. In S. Kotz & N. L. Johnson (Eds.), Breakthroughs in Statistics (pp. 342–357). Springer New York. https://doi.org/10.1007/978-1-4612-0919-5_22
Very nice post! I'll need more time to think about the substantive issues, but here's some nitpicking: I'm not sure that "this methodological falsification (Lakatos, 1978) is clearly inspired by a Neyman-Pearson perspective on statistical inferences." Logik der Forschung was published in 1934, while Neyman and Pearson's papers were published in the 1930s, I think (1933 for the paper you quote). Given the slow communication between Austria and Great Britain at that time, I think it's more likely that they developed their thinking independently of each other (I don't think Wald was already writing statistical papers at that time). But I'd be glad to be proved wrong!
But Popper didn't die in 1934, and in later (translated and updated) editions, he added the following footnote indicating he talked to Wald:
"Here the word ‘all’ is, I now believe, mistaken, and should be replaced, to be a little more precise, by ‘all those . . . that might be used as gambling systems’. Abraham Wald showed me the need for this correction in 1935. Cf. footnotes *1 and *5 to section 58 above (and footnote 6, referring to A. Wald, in section *54 of my Postscript)"
If only falsifying hypotheses was so easy all the time ;)
Maybe I should have added I draw heavily on the 3rd addendum in later editions of Popper's book. I cite the 2002 version intentionally.
But I'm still puzzled: even in 1935, is it obvious that Wald had had the time to read the Neyman-Pearson papers? And if he had had the time, why doesn't Popper quote them in Logik der Forschung? Perhaps more importantly, I'm unsure whether the modification suggested by Wald is really fundamental; and if it isn't, we might think that Popper had already built up his own ideas independently of Neyman-Pearson ;) (it's just a detail, I'll admit it!)
"From a scientific realism perspective, Bayes Factors or Bayesian posteriors do not provide an answer to the main question of interest, which is the verisimilitude of scientific theories. Belief can be used to decide which questions to examine, but it can not be used to determine the truthlikeness of a theory."
If Bayes factors tell you the plausibility of one hypothesis over another then doesn't that also imply that they tell you something about the truthlikeness or verisimilitude of the hypothesis, relative to the other (i.e., the one with greater plausibility is closer to the truth based on the observable data)?
No, belief and truthlikeness are not the same. Note that the problem is not the relative likelihood (likelihoods are fine and can be used) the problem is the prior.
This is a well-written, dense blog post. It seems to be a quite concise summary of your position. Thanks for writing it.
Well, you read van Fraassen and Feyerabend and still believe in scientific realism. So no need to recapitulate their arguments, I guess. If you want more food for thought though, maybe try Adorno's Negative Dialectics for a very dense text on incommensurability.
One of your other points is whether Bayesian posteriors can map the verisimilitude of scientific theories. This is an intriguing question. I'd argue that if reality exists in a verisimilitude fashion, then only as Dirac or Kronecker delta functions. Consider that it is questionable whether any prior (but the oracle prior) can ever converge to such a function in finite time, or finite iterations of experiments. Even more so if we assume that the delta function is nonstationary, or if the objective scientific experiment generating the evidence is nonreproducible (e.g. prediction of an election result, or similar). Therefore it could be there is a set of statements about reality, which might never be captured by Bayesian updating. In that regard, i fully agree with you that it needs a jump of faith for verisimilitude, maybe using thresholding at which point we treat a belief function as a delta function. But there exist many ways how this could be incorporated.
Consider that even hard Bayesians would accept that Trump won the election as inevitable fact, i.e. their posterior is 1 on Trump and 0 on Hillary. So I might not really understand your line of reasoning against Bayesian updating here. Hm. Maybe you are more wondering whether a Bayesian may use thresholding also for probabilistic statements, for which we could still perform reproducible experiments to gain further evidence?
Hi Robert, thanks for your comments (even though I'm pretty sure I didn't understand the second paragraph, but I'll google). I guess you are right that if outcome of the Frequentist and Bayesian decision procedure are the same, there is only a philosophical difference, but not one in practice. I think Bayesian updating can be used combined with a decision threshold as long as the frequentist error rates are ok (If I understand your main point!).
It seems quite a stretch to note that Meehl accepted NP type testing under certain conditions and then go on to argue that his writings support the idea that, "error control is the most important goal in science."
I typically don't reply to anonymous comments.
It's a pleasure to read these posts where the contrast of methods and philosophy of science is underscored. The Meehl objection to 'NHST everywhere' in psychology is a weak version of that of Gelman (no such things as 'null effect' or 'null HP', why are you testing against it?) and very similar to that of Gigerenzer in one of his recent talks (https://www.youtube.com/watch?v=4VSqfRnxvV8&t=1910s): NHST is perfectly OK and may add a lot to the theory, as long as you are pitting two proper alternative explanations against each other (his example relates to the use of heuristics in accurate decision-making: instead of pitting heuristic A against H0, you should pit heuristic A against heuristic B and check which is more accurate). This gives incremental theoretical value to statistically significant results.
My position here is this: I agree with Meehl and Gigerenzer (not with Gelman). But, Feyerabend makes an extreme point which we should be mindful of: there is no 'one method' to do science, and thus I'll remain open to NHST against 'pure H0', while maybe asking for a higher burden of proof there than I would in NHST 'explanation 1 vs explanation 2'.
Hi Daniel, I enjoy your blog and I appreciate you emphasizing the importance of philosophy in evaluating statistical inferences. You state that:
"From a scientific realism perspective, Bayes Factors or Bayesian posteriors do not provide an answer to the main question of interest, which is the verisimilitude of scientific theories."
I'm sure you've heard the similar Bayesian critique of frequentist methods, which is that p-values and decisions about statistical significance don't answer the question we are usually interested in. From talking to my non-statistician friends about how they interpret statistical results, I've found that they all want the p-value to be the probability that their results were due to chance, so that they can interpret a small p-value as the probability their research hypothesis is incorrect. This was Cohen's critique in "The Earth Is Round (p < .05)":
"What's wrong with NHST? Well, among many other things it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!"
I've found that my students in introductory statistics also instinctively want to interpret the p-value as the probability of the null. This could be because they are just being introduced to NHST and the logic is somewhat convoluted and so they initially go with the simpler (and incorrect) interpretation of statistical significance. I suspect that it is also because the incorrect interpretation of statistical significance makes the most intuitive sense, and answers the question that is of most interest to them.
Of course, the clever students eventually learn the model, and understand the logic of rules such as "we treat population parameters as having fixed but unknown values, and so therefore we cannot make probabilistic statements about these values. It is only our data that are random, not the truth." But usually learning this is a struggle.
I know you qualified your statement with "from a scientific realism perspective" - does treating probability as epistemological rather than ontological mean having to rule out or suspend scientific realism? It seems to me you can both treat probability as referring to a state of knowledge *and* believe that there is a truth out there that is ultimately beyond our reach, even as we constantly strive to improve our understanding of it. I don't see the conflict here. For example I'm allowed to put a "normally distributed random error" term in a model even though I know that what I'm treating as "error" is really governed, at least in part, by other deterministic forces. In this sense, "normal random error" is a substitute for uncertainty; I know that I can't model everything and make perfect predictions and so I'm going to pretend that "normal random error" explains all of the observed variation that my model fails to predict. It's certainly fine to call this a frequency. It's also fine to call it a model of uncertainty, without having to give up on objective reality.
There is a difference between accepting model assumptions, and including belief in your model. You can believe there is a truth out there, but since your belief is not relevant for it, scientific realism suggests there is no rationale to include it in a statistical test.
Regarding Meehl, you write:
"Meehl believes accepting or rejecting predictions is a sound procedure, as long as you test risky predictions in procedures with low error rates"
I agree, but I also take Meehl's position as meaning that nearly all "significant" results are useless, given sufficient power. The error rates will be low but the results will (perhaps ironically) tell you less and less the more power you have. From the abstract to "Theory Testing in Psychology" (1967):
"Because physical theories typically predict numerical values, an improvement in experimental precision reduces the tolerance range and hence increases corroborability. In most psychological research, improved power of a statistical design leads to a prior probability approaching 1/2 of finding a significant difference in the theoretically predicted direction. Hence the corroboration yielded by "success" is very weak, and becomes weaker with increased precision. "Statistical significance" plays a logical role in psychology precisely the reverse of its role in physics..."
So yes, Meehl would agree with the goal of error control, but I read this above quote as saying that you can't get error control AND the testing of risky predictions using a procedure that attempts to reject a special case of "not the hypothesis" instead of attempting to directly reject the hypothesis. Do you see many cases of NHST being used to test risky predictions, in which "reject Ho" means "reject my scientific hypothesis"?
It will become much easier, and we will see more, now people are starting to use equivalence testing: http://journals.sagepub.com/doi/full/10.1177/1948550617697177
I hope you are correct and that equivalence testing gains popularity. I fear that most practicing scientists have too strong an incentive to continue with "nil hypothesis" testing: it is easy to do, requires almost no understanding of what is actually being done, and it substantially increases the chances of getting a paper published. I appreciate your work in pushing for a much more philosophically sound alternative.
Dear Daniel, Thank you very much for your effort here, a very constructive post. Two quick things.
First, when something is really unknown, one probably would prefer to run a "door-to-door" search to find it using some initial clue (Bayesian Inference) rather than take a null position and wait for some null-falsifying evidence to reject that null position (Frequentist Inference).
Second, inference is important **only after** correct probability modeling. A HUGE share of social and behavioral research uses measurement tools that are either dichotomously scored or on a Likert scale. Such research findings must be stochastically modeled accurately using discrete probability modeling (e.g., negative binomial, hypergeometric), taking into account the possible overdispersion almost always present in such research data.
I think **After** we really accurately model an actual research using an accurate probability model, the issue of inference **reasonably** just starts.
I very much look forward to a day when two things in social and behavioral sciences happen. (A) we don't use t-tests and (M)AN(C)OVAs and LMs when really the measurement tools we see in social & behavioral research cry out loud for Generalized Linear Models, and Discrete Probability Modeling. (B) Efforts to make an inference happen only after (A) is met.
Great stuff Danny. Helpful topics for my notes. Thank you