PESTER More Often! Use Probabilistic Estimation for Sound Theory Evaluation and Revision

Halfway around the world there’s a group of scholars, mainly philosophers, who engage daily in discussions about the future of statistical analysis in the social, cognitive, and behavioral sciences. There’s Geoff and Fiona and, of course, Neil, the benevolent, who invites metaphorical flies like me to rest on the wall.

For some, like Geoff, I may be more real than metaphorical in my fly-likeness. I’ve pestered him enough lately about ESCI, his handy set of SETs: statistical estimation tools, which help enable the application of what he calls the New Statistics. “Geoff, could you add more lines to the meta-analysis tool? I have 38 studies I’d like to assess, and ESCI stops at 30.” “Geoff, can ESCI handle meta-analyses for correlated designs? … No, that’s okay; I understand, but could you perhaps add that when you get the chance? It would be very, very useful, since a lot of what we psychologists do involves correlated designs.” Stuff like that. Flies, after all, aren’t the least bit shy. (The metaphor is imperfect, I realize: flies are known to feast on piles of fecal matter, and I by no means wish to imply that my feasting on ESCI means that ESCI is of fecal constitution. I remind you that flies also like many other things that you and I would find quite delicious.)

At any rate, much of the discussion I’ve been overhearing lately has to do with the relationships among concepts such as statistical estimation (the crux of the New Statistics), hypothesis testing, null hypothesis significance testing (or NHST, the pest), and theories in science. Some questions recently posed include: What is hypothesis testing in science without NHST? And could hypothesis testing proceed effectively with statistical estimation alone, sans significance testing?

First, let me offer a simple example of hypothesis testing that comes to us straight from the annals of psychology: Peter Wason’s rule discovery task, also known as the 2-4-6 task. The task is a familiar one, so let me just sketch the barest details. The experimenter tells the subject that he has a rule in mind which determines which number triplets are members of a target set and, conversely, which are excluded. He gives the subject an example that fits the rule: {2, 4, 6}. The subject’s task is to discover what rule the experimenter has in mind. To do so, he may generate other triplets, and the experimenter will inform him whether each one fits the rule or not. What Wason found was that most subjects generated triplets that fit their hypothesized rule. He mistakenly labelled this behavior confirmation bias, but as Klayman and Ha (1987) later noted, the bias shown is not one of confirming the hypothesis but one of conforming to the hypothesis, a tendency Klayman and his colleague labelled a positive-test strategy. Positive hypothesis tests, in principle, could yield either confirmatory or disconfirmatory evidence. If I test an instance of what I hypothesize to be the experimenter’s rule and it is shown to be false, then I’ve collected disconfirmatory evidence. Unless I give that evidence less weight than I would have given confirmatory evidence, I cannot legitimately be accused of confirmation bias.
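To make that last point concrete, here is a minimal sketch in Python. The experimenter’s rule below (“any ascending triple”) is the one usually reported for Wason’s task; the subject’s hypothesized rule (“the numbers change by a constant step”) is my own hypothetical stand-in, chosen so that some positive tests of the hypothesis earn a “no” from the experimenter.

```python
# A toy sketch of the 2-4-6 task. The experimenter's rule is the one usually
# reported for Wason's task; the subject's hypothesis is a hypothetical
# stand-in chosen for illustration.

def experimenters_rule(triple):
    """Secret rule: the numbers are in strictly ascending order."""
    a, b, c = triple
    return a < b < c

def subjects_hypothesis(triple):
    """Hypothesized rule: the numbers change by a constant step, in any direction."""
    a, b, c = triple
    return (b - a) == (c - b)

# Positive tests: triples the subject believes fit his own hypothesized rule.
positive_tests = [(2, 4, 6), (10, 20, 30), (6, 4, 2), (5, 5, 5)]

for triple in positive_tests:
    assert subjects_hypothesis(triple)      # confirm these really are positive tests
    feedback = experimenters_rule(triple)   # the experimenter's yes/no answer
    if feedback:
        print(f"{triple}: 'yes' -- consistent with the hypothesis, but proves nothing")
    else:
        print(f"{triple}: 'no' -- a positive test that disconfirms the hypothesis outright")
```

The first two triples earn a “yes” and merely strengthen the subject’s (wrong) hypothesis; the last two earn a “no” and falsify it outright, which is exactly the sense in which a positive-test strategy can still deliver disconfirmation.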

At any rate, I merely wanted to show how hypothesis testing could easily be conducted without NHST. Of course, the task has many convenient attributes that make life easier. The experimenter gives perfectly accurate feedback on each triplet the subject tests, thus ruling out a key source of uncertainty. The rule itself is categorical. There are no fuzzy cases that partially fit the rule and partially don’t; it’s all black and white. Therefore, if the feedback on even a single triplet contradicts the rule the subject has in mind, he can be absolutely sure that he has the wrong rule in mind. Accruing confirmatory evidence, however, merely strengthens support for the hypothesis. It never proves the hypothesized rule to be true.

In most behavioral science research, the situation is quite different, mainly because, for a variety of reasons, a single case or even multiple cases of disconfirmatory evidence don’t prove anything. Evidence tends to be tagged with multiple uncertainties. Theories and even hypotheses tend to be fuzzily stated, making evidential interpretations more equivocal and subjective. The use of NHST as the primary means of statistically testing hypotheses further frustrates the scientific process, since it promotes a focus on the null hypothesis, which is not the hypothesis of interest. More damning is the fact that everything but the null hypothesis becomes the alternative! This actively promotes a disregard for precision and clear thinking. It encourages satisfaction with merely showing differences between conditions, without thinking carefully about how large or small those differences ought to be given one’s hypotheses. That, of course, severely undermines scientific progress, since the same evidence is often treated by one theoretical camp as supporting a proposition while an opposing camp treats it as contradicting that proposition. The wishy-washiness of the process actively encourages theoretical dogfights. At least, the camp members see dogfights. From far outside, the fight looks more like two barely living chickens pecking at each other’s broken bodies. To the outsiders, neither chicken appears to be winning and neither looks too wise.

Take an idealized example of the sort of chicken fight behavioral decision theorists love to have. The camps even have fancy names like Meliorists or Pessimists versus Panglossians or Optimists. The former tend to cast human judgment and decision making as irrational, while the latter see it as ecologically well-adapted and, well, more or less rational under the circumstances. Let’s say a certain decision-making test is conducted and, by chance, we’d expect a 50% accuracy rate. Of course, quite often, researchers disagree on what “accurate” means in a given context, but let’s say that all sides agree. Let’s say a sample is collected and found to have a 75% accuracy rate, which, given the sample size, is significantly different from chance performance (50%). The Optimists say, “See, people are significantly better than chance. There is good evidence for human rationality!” The Pessimists counter, “Not so fast! People are significantly worse than the optimal (100%). There is good evidence for human irrationality!”

Now, let’s say we had a confidence interval around the 75% sample estimate, such that we were 95% confident that the accuracy rate was 75% plus or minus 10%. Note that this may do little to stop nearly dead chickens from fighting. Each side can still make essentially the same claims as they would with NHST. However, switching to estimation may do something, since with a two-sided confidence interval it may be a bit harder to ignore the other camp’s perspective. Yes, 50% is outside the interval, but so is 100%, and each camp has to face both exclusions at once. So perhaps an estimative approach would encourage perspective taking a bit more in cases such as these. Maybe, but I won’t hold my breath.
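Here is a back-of-the-envelope sketch of that interval in Python. The post never states the sample size, so the n = 72 below is a hypothetical choice: it is roughly what makes a 95% Wald interval around a 75% observed rate span about plus or minus 10 percentage points.

```python
import math

# Hypothetical back-of-the-envelope version of the estimation view.
n = 72                      # assumed number of respondents (not given in the post)
p_hat = 0.75                # observed accuracy rate
z = 1.96                    # critical value for a 95% two-sided interval

# Wald interval for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - half_width, p_hat + half_width
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")   # about [0.65, 0.85]

# Both benchmarks fall outside the interval, so both camps' claims are visible at once.
for label, benchmark in [("chance (Optimists' benchmark)", 0.50),
                         ("optimal (Pessimists' benchmark)", 1.00)]:
    inside = lower <= benchmark <= upper
    print(f"{label}: {'inside' if inside else 'outside'} the interval")
```

The same computation that lets the Optimists exclude 50% forces them to watch 100% being excluded too, which is the modest perspective-taking gain I am (not too hopefully) pointing to.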

The disagreement in this case stems mainly from the application of differing standards. The Optimists compare human performance to a dart-throwing chimp; the Pessimists compare it to Laplace’s demon. They disagree mainly on what the proper benchmark for gauging human performance should be. But, since they are supposed to be empiricists, they pretend to be arguing about the evidence they collected. Secretly, both sides know that performance is better than chance and far from optimal. However, neither side feels it can admit to the full statement, so the unnecessary pecking continues. It’s not even clear whether there are any imaginary pit bulls in the ring. Perhaps the adversarialism is merely a mish-mash of ritualism and strategic calculation.

At any rate, I’m back where I always seem to end up: believing that what science and scientists need most of all is clear, well-explicated reasoning that is allowed to flourish. The conditions that promote such flourishing work against ritualism and dogma, against gaming the system, and against imprecision and deception (including self-deception).

Probabilistic estimation of the kind actively promoted in the New Statistics is a positive step forward, but it should be for something. For what? Hypothesis testing? That is too narrow an aim. Silly hypotheses may be tested in order to buttress crumbling theories, and sound probabilistic estimation may be recruited in the service of such ignoble aims. Those who are advancing the statistical methods are well placed to remind scientists of that.

No, we need probabilistic estimation to be for more than hypothesis testing. It should, in general, be for theory evaluation and revision. I liked that phrase at the start of this post because it spelled out PETER and was catchy (get it: Probabilistic Estimation for Theory Evaluation and Revision). But now I realize it’s sorely incomplete, because probabilistic estimation can be used for theory evaluation and revision in ways that are either sound or unsound. Much of what I’ve commented on here comes down to the fact that the soundness of that process is what matters most.

So, while PETER was catchy, all is not lost. I’ve decided to PESTER you instead, by proposing that we use Probabilistic Estimation for Sound Theory Evaluation and Revision.

If using the New Statistics for PESTER can be called pestering, then I wish my Australian colleagues much success pestering scientists and getting them to be pesterers themselves. I would gladly join them in that cause.