1 Extended description of experiment
The Christmas 2018 statistical cognition experiment ran continuously from 16 December 2018 to 1 January 2019. Participants were recruited via social media (particularly Twitter and Facebook). We used Qualtrics to deploy the experiment, which was written in HTML/CSS/JavaScript. Participants were asked to perform a series of fictitious experiments with a two-group design and come to a conclusion regarding which of the two groups was “faster.” The way the fictitious results were reported to the participants was unusual — no numbers were given — in order to test participants’ ability to use significance testing logic.
Of particular interest to us were:
- whether participants sought information relevant to a significance test, and ignored irrelevant information,
- whether participants could come to the right conclusion with high probability,
- whether participants’ conclusions were reasonable given the information they were given, and
- whether participants’ descriptions of their strategies were consistent with significance testing logic.
1.1 Basic task setup
Toy name | Hidden effect size (δ) | Probability |
---|---|---|
whizbang balls | 0.00 | 0.25 |
constructo bricks | 0.10 | 0.107 |
rainbow clickers | 0.19 | 0.107 |
doodle noodles | 0.30 | 0.107 |
singing bling rings | 0.43 | 0.107 |
brahma buddies | 0.60 | 0.107 |
magic colorclay | 0.79 | 0.107 |
moon-candy makers | 1.00 | 0.107 |
Participants were randomly assigned to one of two evidence powers (wide, \(q=3\); or narrow, \(q=7\)) with equal probability. Participants were also randomly assigned to one of eight “true” effect sizes. Because the behaviour of participants when there is no true effect was of particular interest, the probability of assignment to no effect (\(\delta=0\)) was 25%. The remaining 75% probability was distributed evenly across the seven other effect sizes listed in Table 1.1 (0.75/7 ≈ 0.107 each).
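The assignment scheme can be sketched as follows. This is a minimal illustration in Python (the experiment itself was implemented in JavaScript); the function and variable names are ours.

```python
import random

# Hidden effect sizes from Table 1.1: delta = 0 gets 25% probability,
# and the other seven effect sizes share the remaining 75% equally.
EFFECT_SIZES = [0.00, 0.10, 0.19, 0.30, 0.43, 0.60, 0.79, 1.00]
WEIGHTS = [0.25] + [0.75 / 7] * 7

def assign_condition(rng=random):
    """Randomly assign an evidence power q and a true effect size delta."""
    q = rng.choice([3, 7])  # wide (q=3) or narrow (q=7), with equal probability
    delta = rng.choices(EFFECT_SIZES, weights=WEIGHTS, k=1)[0]
    return q, delta
```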
The cover story (which can be read here) presented a problem in which it was desired to know which of two groups of elves (“Sparklies” or “Jinglies”) was faster. Participants were presented with the results of fictitious experiments as they requested them.
Participants could increase or decrease the sample size for the experiments as well; importantly, they were not aware of the actual sample size. Participants could adjust the sample size with a slider that had 20 divisions. The corresponding 20 hidden sample sizes are shown in Table 1.2.
n index | n | Time (s) |
---|---|---|
1 | 10 | 1 |
2 | 12 | 2 |
3 | 14 | 2 |
4 | 16 | 2 |
5 | 19 | 2 |
6 | 22 | 3 |
7 | 26 | 3 |
8 | 30 | 3 |
9 | 35 | 4 |
10 | 41 | 5 |
11 | 48 | 5 |
12 | 57 | 6 |
13 | 66 | 7 |
14 | 78 | 8 |
15 | 91 | 10 |
16 | 106 | 11 |
17 | 125 | 13 |
18 | 146 | 15 |
19 | 171 | 18 |
20 | 200 | 20 |
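The hidden sample sizes in Table 1.2 are consistent with 20 geometrically spaced values from 10 to 200, rounded to integers. The following one-liner is our reconstruction of that spacing, not the authors' code:

```python
# Hidden sample sizes appear geometrically spaced between 10 and 200:
# n_i = round(10 * 20**((i - 1) / 19)) reproduces Table 1.2 exactly.
hidden_n = [round(10 * 20 ** ((i - 1) / 19)) for i in range(1, 21)]
print(hidden_n)
# [10, 12, 14, 16, 19, 22, 26, 30, 35, 41, 48, 57, 66, 78, 91, 106, 125, 146, 171, 200]
```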
The results were returned to the participants in a visual fashion: instead of being presented with a test statistic, each result was associated with a color intensity (from white to red) and a horizontal location that we will describe as left (-1) to right (1). Results on the far left were red and were associated with maximum evidence for the Sparklies being faster; results in the center were white and were not evidence for either group; and results on the far right were again red and were evidence for the Jinglies being faster. The intensity of the red color was defined by a linear gradient on the transparency (alpha), from fully transparent at the center (0) to fully opaque at the extremes (-1 and 1), as defined using CSS.1 The resulting horizontal axis is shown in the figure below.
1.2 Underlying statistical details
Each fictitious result was produced by applying a transformation to a randomly sampled \(Z\) statistic. The distribution of the \(Z\) statistic was a function of the randomly assigned (but unknown to the participant) effect size \(\delta\) and a group sample size \(n\) (adjustable by, but unknown to, the participant):
\[ Z \sim \mbox{Normal}(\delta\sqrt{n/2}, 1) \]
The \(x\) location of the result, and hence the color, was then defined by the transformation: \[ x = \mbox{sgn}(Z)\left[1 - \left(1 - F_{\chi_1^2}\left(Z^2\right)\right)^{\frac{1}{q}}\right] \] where evidence power \(q\in\{3,7\}\), the location \(x \in (-1,1)\), and \(F_{\chi_1^2}\) is the cumulative distribution function of the \(\chi_1^2\) distribution.
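The generative process and transformation can be sketched in a few lines. This is an illustrative Python translation (the experiment itself ran in JavaScript). It uses the fact that \(1 - F_{\chi_1^2}(Z^2)\) is simply the two-sided \(p\) value of \(Z\), which `math.erfc` gives directly:

```python
import math
import random

def sample_z(delta, n, rng=random):
    """Draw a Z statistic for true effect size delta and group sample size n."""
    return rng.gauss(delta * math.sqrt(n / 2), 1)

def x_location(z, q):
    """Map a Z statistic to a horizontal location in (-1, 1).

    1 - F_chi2_1(z^2) is the two-sided p value of z, i.e. erfc(|z|/sqrt(2)),
    so x = sgn(z) * (1 - p**(1/q)).
    """
    p = math.erfc(abs(z) / math.sqrt(2))
    return math.copysign(1 - p ** (1 / q), z)
```

For example, \(Z = 1.96\) (two-sided \(p \approx 0.05\)) maps to \(x \approx 0.63\) under the wide evidence power (\(q=3\)) but only \(x \approx 0.35\) under the narrow one (\(q=7\)).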
The figure below shows the transformation from \(Z\) statistics (top axis) and one-sided \(p\) values (bottom axis) to \(x\) locations.
2 Visual examples of task
2.1 Video example
2.2 One participant’s final result
This participant’s (id: R_0xLf49ng52wHp1n) final decision was “no_detect.”
3 Difficulties for likelihood or Bayesian accounts
The task is set up specifically to make it difficult to apply other modes of inference to the problem, while encouraging significance testing. We would not argue that significance testing is the only mode of inference that participants might use in other contexts, or that it is their preferred mode; rather, the goal was to test whether they had enough of an understanding of significance testing to apply it.
In order to test this, it was necessary to block other kinds of inference. Two aspects of the task make other inference methods difficult: the arbitrary transformation of the test statistic, and the removal of sample size information.
3.1 Likelihood inference
Consider what would be necessary for a likelihood inference from these data. A likelihoodist would require a model with a parameter \(\theta\): \[ l(\theta; \mathbf x) \propto \prod_{i=1}^M f_{n_i}(x_i;\theta) \] where \(\mathbf x\) is the vector of length \(M\) of all evidence/\(x\)-locations of experimental samples produced by the participant, \(x_i\) represents the \(i\)th element of \(\mathbf x\), and \(f_{n_i}\) represents the density function of experiments for the hidden sample size \(n_i\). The density functions are unknown, as are the sample sizes. There is no obvious measure of effect size on which to build a model.
One might choose an impoverished model that throws out information about \(x\) and only uses the ordering of null samples and experimental samples: \[ \theta = Pr(X>X_0), \] where \(X\) is a draw from an experiment and \(X_0\) is a draw from the shuffle reports. But then \(\theta\) would depend on \(n_i\) in an unknown way, except under the null hypothesis, where \(\theta=0.5\) (because under the null hypothesis experiments and random shuffle reports have the same distribution, by definition). Thus, the null hypothesis is the only hypothesis under which the participant knows the value of \(\theta\).
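A small simulation (our sketch, assuming the generative model of Section 1.2) illustrates the point: under the null, \(\theta = 0.5\) regardless of \(n\), but under an alternative \(\theta\) depends strongly on the hidden \(n\). Because the transformation is strictly increasing in \(Z\), comparing \(x\) locations is equivalent to comparing the underlying \(Z\) statistics.

```python
import math
import random

def estimate_theta(delta, n, trials=100_000, seed=1):
    """Monte Carlo estimate of theta = Pr(X > X0).

    The x-transformation is monotone in Z, so Pr(X > X0) = Pr(Z > Z0)
    and we can compare the Z statistics directly.
    """
    rng = random.Random(seed)
    mu = delta * math.sqrt(n / 2)
    hits = 0
    for _ in range(trials):
        z_exp = rng.gauss(mu, 1)   # experimental sample
        z_null = rng.gauss(0, 1)   # random shuffle report
        hits += z_exp > z_null
    return hits / trials
```

With \(\delta=0\) the estimate is near 0.5 for any \(n\); with \(\delta=0.5\) it is roughly 0.79 when \(n=10\) but nearly 1 when \(n=200\), so \(\theta\) alone cannot separate effect size from sample size.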
Even a clever participant who knew about the arbitrary transformation from the \(z\) statistic would have difficulty. They might use the following strategy:
- Sample many random shuffle reports, estimate their distribution
- Find a transformation \(z_i = g(x_i)\) to take the shuffle reports to standard normal deviates
- Use this transformation to transform experimental samples to \(z\) statistics
- Use likelihood inference on the mean of the \(z\) statistics from experimental samples
This strategy would require extensive knowledge of statistical theory and, likely, sophisticated programming during the task. It is difficult to imagine anyone applying it (and certainly no participant reported using such a strategy). But even this strategy is frustrated, because the transformation conflates the effect size and the sample size: only \(\delta\sqrt{n/2}\) is estimable, and \(n\) is unknown. No inference about \(\delta\) is possible, except that it differs from 0.
3.2 Bayesian inference
“For [statistical] hypotheses, Bayes’ theorem tells us this: Unless the observed facts are absolutely impossible on hypothesis \(H_0\), it is meaningless to ask how much those facts tend ‘in themselves’ to confirm or refute \(H_0\). Not only the mathematics, but also our innate common sense (if we think about it for a moment) tell us that we have not asked any definite, well-posed question until we specify the possible alternatives to \(H_0\).” – Jaynes (2003)
Bayesian inference, being dependent on the likelihood, is saddled with the difficulties outlined above, as well as an additional one: the Bayesian needs a prior. As Jaynes points out, Bayesian inference requires specified alternatives. Typically, these would be rendered as a prior distribution over some parameter, \(p(\theta)\). In this problem there is no clear parameter, and hence it is not clear what the prior would be placed on. Additionally, the sample sizes are unknown, so it is unclear how a Bayesian would update the prior to a posterior; the sample sizes are clearly relevant information, but hidden from the analyst. The final difficulty is that the likelihood is unknown, due to the arbitrary transformation applied to the \(z\) statistic.
Suppose, however, that the Bayesian analyst knew about the transformation, and applied the strategy described in the previous section to reverse-engineer the transformation back to \(z\) scores. The inference about \(\delta\) would then depend on their prior over the sample sizes. As emphasized by Bayesians who appeal to the Jeffreys-Lindley paradox (Lindley 1957), the statistical support for the null hypothesis \(\delta=0\) depends on the sample size: for a fixed test statistic, assuming a larger sample size yields stronger evidence for the null. The inference about \(\delta\) is completely confounded with \(n\): assuming larger \(n\) means inferring smaller \(\delta\).
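The confounding is easy to see numerically (our illustration): a recovered mean \(\bar z\) fixes only \(\delta\sqrt{n/2}\), so the inferred effect size scales as \(1/\sqrt{n}\) in the assumed sample size.

```python
import math

def implied_delta(z_bar, assumed_n):
    """Effect size implied by a mean z statistic under an assumed sample size."""
    return z_bar / math.sqrt(assumed_n / 2)

# The same evidence supports very different effect sizes:
# z_bar = 3 implies delta ~ 1.34 if n = 10, but delta = 0.3 if n = 200.
```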
It is possible that a Bayesian making just the right assumptions, and working very hard to reverse-engineer the transformation, could perform the task well. However, their inference would be dependent on these strong assumptions.
No participant reported using such a strategy (or anything like it).
4 Self-reported understanding of shuffle reports
Question: “Do you understand why the random shuffle reports could be useful?”
5 Participants’ confidence in their responses
Question: How confident are you in your assessment above?
- Not confident at all
- Somewhat doubtful
- Somewhat confident
- Very confident
6 Open-ended strategy questions
6.2 Frequencies of strategies
The figure below shows the frequencies of various self-reported strategies as coded by the authors.
Strategy | Comparison | Samp. variance | Asymmetry | Inc. asymmetry | No shuffles |
---|---|---|---|---|---|
Comparison | 1 | 0.71 | 0.59 | 0.64 | 0 |
Samp. variance | 0.7 | 1 | 0.62 | 0.6 | 0.07 |
Asymmetry | 0.62 | 0.65 | 1 | 1 | 0.71 |
Inc. asymmetry | 0.2 | 0.19 | 0.29 | 1 | 0.25 |
No shuffles | 0 | 0.01 | 0.07 | 0.08 | 1 |
7 Exploratory sanity checks
This section reports a few analyses based on the coded open-text responses. Analyses in this section should be treated with skepticism due to their exploratory nature and the fact that they condition on an observed variable that is likely correlated in interesting ways with the decision. Also, some people did not respond to these questions, so their strategy is unknown; yet they are categorized with those who responded and offered an invalid strategy.
7.1 Error rates
We can compare the error rates of those who reported using a strong significance-testing strategy with those who did not.
7.2 Sampling behaviour
Did participants who reported more strong significance-testing strategies use more null samples, on average? It appears so (left figure below; note the log scale).
Did participants who reported more strong significance-testing strategies use more (or fewer) experimental samples, on average? There is not much evidence either way, but importantly (see plot) those who reported fewer strong significance-testing strategies did not also sample fewer experiments on average. There is nothing here to suggest that they were less engaged.
8 Signal detection analysis
We can treat the participants as “detectors” of the true effect, and ask how well, in the aggregate, they were able to distinguish signal from noise. To do this, we fit a signal detection model that allowed \(d'\) to vary as a function of the true effect size.
Let \(d'_i\) (\(i=-7,\ldots,7\)) be the means for the 15 effect size conditions from \(\delta=-1\) to \(\delta=1\). We constrain \(d'_0=0\), \(d'_i\leq d'_j\) when \(i<j\) (monotonicity), and \(d'_{i}=-d'_{-i}\) (symmetry). The probabilities of responding “Jinglies” or “Sparklies” in the \(i\)th effect size condition are \[ \begin{eqnarray} Pr_i(\mbox{Sparklies}) &=& \Phi\left(c_1 - d'_i\right)\\ Pr_i(\mbox{Jinglies}) &=& 1 - \Phi\left(c_2 - d'_i\right) \end{eqnarray} \] where \(\Phi\) is the cumulative distribution function of the standard normal distribution and \(c_1,c_2\) are criteria (not necessarily symmetric). The probability of responding “same” or “cannot detect” is the remaining probability.
The model has 9 parameters (two criteria and seven means, though the monotonicity assumption adds more constraint) for 30 non-redundant data points. The model was fit using maximum likelihood (R code is available in the source of this document).
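The response probabilities above can be written directly in code. This is a sketch of the model's likelihood terms, not the authors' R code; `math.erf` provides \(\Phi\):

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def response_probs(d_prime, c1, c2):
    """Probabilities of responding 'Sparklies', 'same/cannot detect',
    and 'Jinglies' given d' and criteria c1 < c2."""
    p_sparklies = phi(c1 - d_prime)
    p_jinglies = 1 - phi(c2 - d_prime)
    p_same = 1 - p_sparklies - p_jinglies  # the remaining probability
    return p_sparklies, p_same, p_jinglies
```

With \(d'=0\) and symmetric criteria the two directional responses are equally likely, and increasing \(d'\) shifts probability toward “Jinglies,” as the model requires.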
Figure 8.1 shows the estimated \(d'\) parameters as a function of effect size. They range from 1.415 when \(\delta=0.1\) to 3.237 when \(\delta=1\).
Figure 8.2 shows the predicted and observed probabilities of all responses. The good fit of this model licenses the interpretation of the \(d'\) parameters and the collapsing across the sign of the effect in Figure 3 of the main document.
Compiled 2020-11-04 16:46:21 (Europe/London) under R version 4.0.3 (2020-10-10).
References
The CSS code for the transition from red (-1) to white (0) was `linear-gradient(to right, rgba(255,0,0,1), rgba(255,0,0,0))`; from 0 to 1 the transition was the reverse.↩︎