1 Extended description of experiment

The Christmas 2018 statistical cognition experiment ran continuously from 16 December 2018 to 1 January 2019. Participants were recruited via social media (particularly Twitter and Facebook). The experiment was written in HTML/CSS/JavaScript and deployed via Qualtrics. Participants were asked to perform a series of fictitious experiments with a two-group design and come to a conclusion regarding which of the two groups was “faster.” The way the fictitious results were reported to the participants was unusual, in that no numbers were given; this was done to test participants’ ability to use significance testing logic.

Of particular interest to us were:

  • whether participants sought information relevant to a significance test, and ignored irrelevant information,
  • whether participants could come to the right conclusion with high probability,
  • whether participants’ conclusions were reasonable given the information they were given, and
  • whether participants’ descriptions of their strategies were consistent with significance testing logic.

1.1 Basic task setup

Table 1.1: “Toy” name, corresponding true effect sizes used in the study (in standard deviation units) and the probability that a participant would be randomly assigned to that effect size.
Toy name              Hidden effect size (δ)   Probability
whizbang balls        0.00                     0.25
constructo bricks     0.10                     0.11
rainbow clickers      0.19                     0.11
doodle noodles        0.30                     0.11
singing bling rings   0.43                     0.11
brahma buddies        0.60                     0.11
magic colorclay       0.79                     0.11
moon-candy makers     1.00                     0.11

Participants were randomly assigned to one of two evidence powers (wide, \(q=3\); or narrow, \(q=7\)) with equal probability. Participants were also randomly assigned to one of the eight “true” effect sizes in Table 1.1. Because the behaviour of participants when there is no true effect was of particular interest, the probability of assignment to no effect (\(\delta=0\)) was 25%; the remaining 75% was distributed evenly across the seven other effect sizes.
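
A minimal R sketch of this assignment scheme (illustrative only; the deployed experiment ran in JavaScript, and the sign of the effect, i.e. which elf group was truly faster, is assumed here to be randomized with equal probability, consistent with the \(\delta=-1,\ldots,1\) conditions in the signal detection analysis below):

```r
# Randomly assign a participant to an effect size (Table 1.1) and an evidence
# power. The sign randomization is an assumption (see lead-in above).
effect_sizes <- c(0, 0.10, 0.19, 0.30, 0.43, 0.60, 0.79, 1.00)
probs        <- c(0.25, rep(0.75 / 7, 7))   # 25% null; 75% spread over the rest

delta <- sample(effect_sizes, 1, prob = probs) * sample(c(-1, 1), 1)
q     <- sample(c(3, 7), 1)                 # wide (q = 3) or narrow (q = 7)
```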

The cover story (which can be read here) presented a problem in which it was desired to know which of two groups of elves (“Sparklies” or “Jinglies”) was faster. Participants were presented with the results of fictitious experiments as they requested them.

Participants could increase or decrease the sample size for the experiments as well; importantly, they were not aware of the actual sample size. Participants could adjust the sample size with a slider that had 20 divisions. The corresponding 20 hidden sample sizes are shown in Table 1.2.

Table 1.2: Sample size index, underlying hidden per-group sample size, and corresponding time delay in seconds taken to return the experimental result, if the result requested was an experimental sample. Random shuffle reports were returned instantaneously.
n index   n     Time (s)
1         10    1
2         12    2
3         14    2
4         16    2
5         19    2
6         22    3
7         26    3
8         30    3
9         35    4
10        41    5
11        48    5
12        57    6
13        66    7
14        78    8
15        91    10
16        106   11
17        125   13
18        146   15
19        171   18
20        200   20
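
Incidentally, the hidden sample sizes in Table 1.2 are consistent with logarithmic spacing between 10 and 200; the following sketch reproduces the column exactly (an observed pattern, not a design rule documented here):

```r
# Twenty logarithmically spaced per-group sample sizes from 10 to 200
# reproduce the hidden n column of Table 1.2 exactly.
n_hidden <- round(10 * 20^((0:19) / 19))
n_hidden
#>  [1]  10  12  14  16  19  22  26  30  35  41  48  57  66  78  91 106 125 146 171 200
```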

The results were returned to the participants in a visual fashion: instead of being presented with a test statistic, each result was associated with a color intensity (from white to red) and a horizontal location that we will describe as left (-1) to right (1). Results on the far left were red and were associated with maximum evidence for the Sparklies being faster; results in the center were white and were not evidence for either group; results on the far right were again red and were evidence for the Jinglies being faster. The intensity of the red color was defined by a linear gradient on the transparency (alpha) from 0 out to either -1 or 1, as defined using CSS.1 The resulting horizontal axis is shown in the figure below.

Figure 1.1: The interface on which fictitious results were presented to the participants.

1.2 Underlying statistical details

Each fictitious result was generated by applying a transformation to a randomly sampled \(Z\) statistic. The distribution of the \(Z\) statistic was a function of the randomly assigned (but unknown to the participant) effect size \(\delta\) and a per-group sample size \(n\) (adjustable by, but unknown to, the participant):

\[ Z \sim \mbox{Normal}(\delta\sqrt{n/2}, 1) \]

The \(x\) location of the result, and hence the color, was then defined by the transformation: \[ x = \mbox{sgn}(Z)\left[1 - \left(1 - F_{\chi_1^2}\left(Z^2\right)\right)^{\frac{1}{q}}\right] \] where evidence power \(q\in\{3,7\}\), the location \(x \in (-1,1)\), and \(F_{\chi_1^2}\) is the cumulative distribution function of the \(\chi_1^2\) distribution.
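
For concreteness, a direct R translation of these two formulas (a sketch; the function names are ours, and the experiment itself ran in JavaScript):

```r
# Map a Z statistic to an x location in (-1, 1) for evidence power q;
# pchisq(z^2, df = 1) is the chi-squared(1) CDF evaluated at Z^2.
x_location <- function(z, q) {
  sign(z) * (1 - (1 - pchisq(z^2, df = 1))^(1 / q))
}

# Simulate one fictitious result given effect size delta, hidden per-group
# sample size n, and evidence power q.
simulate_result <- function(delta, n, q) {
  z <- rnorm(1, mean = delta * sqrt(n / 2), sd = 1)
  x_location(z, q)
}

simulate_result(delta = 0.43, n = 41, q = 7)  # e.g., one result in one condition
```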

The figure below shows the transformation from \(Z\) statistics (top axis) and one-sided \(p\) values (bottom axis) to \(x\) locations.

Figure 1.2: Transformation between traditional test statistics (\(Z\), \(p\)) and the \(x\) location

Figure 1.3: Evidence distributions as functions of underlying effect size, for non-negative effect sizes. Distributions from left to right correspond to increasingly large true effect sizes. Evidence distributions for negative effect sizes were mirror images of these distributions. A: Smallest sample size, wide evidence distribution; B: Largest sample size, wide evidence distribution; C: Smallest sample size, narrow evidence distribution; D: Largest sample size, narrow evidence distribution

Figure 1.4: Selected two-sided \(p\) values for the null distribution of the evidence for the narrow evidence distribution (\(q=7\))

Figure 1.5: Selected two-sided \(p\) values for the null distribution of the evidence for the wide evidence distribution (\(q=3\))

2 Visual examples of task

2.1 Video example

2.2 One participant’s final result

Figure 2.1: Selected participant’s random shuffle report samples.

Figure 2.2: Selected participant’s experimental samples.

This participant’s (id: R_0xLf49ng52wHp1n) final decision was “no_detect.”

3 Difficulties for likelihood or Bayesian accounts

The task is set up specifically to make it difficult to apply other modes of inference to the problem, while encouraging significance testing. We would not argue that significance testing is the only mode of inference that participants might use in other contexts, or that it is their preferred mode; rather, the goal was to test whether they had enough of an understanding of significance testing to apply it.

In order to test this, it was necessary to block other kinds of inference. Two aspects of the task make other inference methods difficult: the arbitrary transformation of the test statistic, and the removal of sample size information.

3.1 Likelihood inference

Consider what would be necessary for a likelihood inference from these data. A likelihoodist would require a model with a parameter \(\theta\): \[ l(\theta; \mathbf x) \propto \prod_{i=1}^M f_{n_i}(x_i;\theta) \] where \(\mathbf x\) is the vector of length \(M\) of all evidence/\(x\)-locations of experimental samples produced by the participant, \(x_i\) represents the \(i\)th element of \(\mathbf x\), and \(f_{n_i}\) represents the density function of experiments for the hidden sample size \(n_i\). The density functions are unknown, as are the sample sizes. There is no obvious measure of effect size on which to build a model.

One might choose an impoverished model that throws out information about \(x\) and only uses the ordering of null samples and experimental samples: \[ \theta = Pr(X>X_0), \] where \(X\) is a draw from an experiment and \(X_0\) is a draw from the shuffle reports. But \(\theta\) would then depend on \(n_i\) in an unknown way, except under the null hypothesis, where \(\theta=0.5\) (because under the null hypothesis experiments and random shuffle reports have the same distribution, by definition). Thus, the null distribution is the only distribution the participant knows.
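
A quick simulation makes the dependence concrete (illustrative values; x_location() is the sketch from Section 1.2):

```r
# theta = Pr(X > X0) is 0.5 under the null, but under an alternative it depends
# on the hidden sample size. Illustrative values: delta = 0.3, q = 3.
x_location <- function(z, q) sign(z) * (1 - (1 - pchisq(z^2, 1))^(1 / q))
set.seed(2)
x0 <- x_location(rnorm(1e5, 0), q = 3)                   # null shuffle reports
for (n in c(10, 200)) {
  x <- x_location(rnorm(1e5, 0.3 * sqrt(n / 2)), q = 3)  # experimental samples
  cat("hidden n =", n, "-> theta is about", round(mean(x > x0), 2), "\n")
}
#> hidden n = 10 -> theta is about 0.68
#> hidden n = 200 -> theta is about 0.98
```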

Even a clever participant who knew about the arbitrary transformation from the \(z\) statistic would have difficulty. They might use the following strategy:

  1. Sample many random shuffle reports, estimate their distribution
  2. Find a transformation \(z_i = g(x_i)\) to take the shuffle reports to standard normal deviates (sketched in code after this list)
  3. Use this transformation to transform experimental samples to \(z\) statistics
  4. Use likelihood inference on the mean of the \(z\) statistics from experimental samples
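
Steps 2 and 3 amount to a probability integral transform based on the empirical null distribution. A minimal sketch (our illustration, not anything a participant reported):

```r
# Estimate g from a large sample of null shuffle-report x locations using the
# probability integral transform: the empirical CDF of the nulls followed by
# the standard normal quantile function. Because the z-to-x transformation is
# monotone, this recovers z up to Monte Carlo error in the empirical CDF.
estimate_g <- function(x_null) {
  F_hat <- ecdf(x_null)
  function(x) qnorm(pmin(pmax(F_hat(x), 1e-6), 1 - 1e-6))  # clamp away from 0 and 1
}

# Hypothetical usage, given shuffle-report locations `x_null` and experimental
# locations `x_exp`:
#   g <- estimate_g(x_null)
#   z_exp <- g(x_exp)  # step 3; step 4 would do likelihood inference on mean(z_exp)
```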

This strategy would require extensive knowledge of statistical theory and, likely, sophisticated programming during the task. It is difficult to imagine anyone applying it (and certainly no participant reported using such a strategy). But even this strategy is frustrated, because the transformation conflates the effect size and the sample size. Only \(\delta\sqrt{n/2}\) is estimable, and \(n\) is unknown. No inference about \(\delta\) is possible, except that it is different from 0.
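
The conflation is easy to verify by simulation: two quite different \((\delta, n)\) pairs with equal \(\delta\sqrt{n/2}\) produce exactly the same evidence distribution (illustrative values again):

```r
# delta = 0.5 with n = 32 and delta = 1.0 with n = 8 both give
# delta * sqrt(n/2) = 2, and therefore identical x distributions.
x_location <- function(z, q) sign(z) * (1 - (1 - pchisq(z^2, 1))^(1 / q))
set.seed(3)
x_a <- x_location(rnorm(1e5, 0.5 * sqrt(32 / 2)), q = 3)
x_b <- x_location(rnorm(1e5, 1.0 * sqrt(8 / 2)),  q = 3)
ks.test(x_a, x_b)  # the two samples come from the same distribution
```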

3.2 Bayesian inference

“For [statistical] hypotheses, Bayes’ theorem tells us this: Unless the observed facts are absolutely impossible on hypothesis \(H_0\), it is meaningless to ask how much those facts tend ‘in themselves’ to confirm or refute \(H_0\). Not only the mathematics, but also our innate common sense (if we think about it for a moment) tell us that we have not asked any definite, well-posed question until we specify the possible alternatives to \(H_0\).” – Jaynes (2003)

Bayesian inference, being dependent on the likelihood, is saddled with the difficulties outlined above, as well as an additional one: the Bayesian needs a prior. As Jaynes points out, Bayesian inference requires specified alternatives. Typically, these would be rendered as a prior distribution over some parameter, \(p(\theta)\). In this problem there is no clear parameter, and hence it is not clear what the prior would be placed on. Additionally, the sample sizes are unknown, so it is unclear how a Bayesian would update the prior to the posterior; the sample sizes are clearly relevant information, but they are hidden from the analyst. The final difficulty is that the likelihood is unknown, due to the arbitrary transformation applied to the \(z\) statistic.

Suppose, however, that the Bayesian analyst knew about the transformation and applied the strategy described in the previous section to reverse-engineer it back to \(z\) scores. The inference about \(\delta\) would then depend on their prior over the sample sizes. As emphasized by Bayesians who appeal to the Jeffreys-Lindley paradox (Lindley 1957), the statistical support for the null hypothesis \(\delta=0\) depends on the sample size: for a fixed test statistic, assuming a larger sample size yields stronger support for the null. The inference about \(\delta\) is completely confounded with \(n\): assuming larger \(n\) means inferring smaller \(\delta\).
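
In miniature (hypothetical numbers): the same recovered \(z\) statistic implies a different effect size under each assumed sample size, since the implied estimate is \(\hat{\delta} = z/\sqrt{n/2}\).

```r
# One recovered z statistic, three assumed sample sizes, three incompatible
# effect-size estimates (all values hypothetical).
z <- 2.5
for (n in c(10, 50, 200)) {
  cat("assumed n =", n, "-> delta-hat =", round(z / sqrt(n / 2), 2), "\n")
}
#> assumed n = 10 -> delta-hat = 1.12
#> assumed n = 50 -> delta-hat = 0.5
#> assumed n = 200 -> delta-hat = 0.25
```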

It is possible that a Bayesian making just the right assumptions, and working very hard to reverse-engineer the transformation, could perform the task well. However, their inference would be dependent on these strong assumptions.

No participant reported using such a strategy (or anything like it).

4 Self-reported understanding of shuffle reports

Question: “Do you understand why the random shuffle reports could be useful?”

5 Participants’ confidence in their responses

Question: “How confident are you in your assessment above?”

  • Not confident at all
  • Somewhat doubtful
  • Somewhat confident
  • Very confident
Figure 5.1: Reported confidence ratings (vertical axis) in their response for participants who judged that they either could not detect a difference between the groups, or that the groups were the same, by true effect size (horizontal axis). Columns with very few participants in them (when the effect size was large, but they did not respond that there was a difference) are faded to indicate lack of trustworthiness of the frequencies.

Figure 5.2: Reported confidence ratings (vertical axis) in their response for participants who judged that they either could not detect a difference between the groups, or that the groups were the same, by true effect size (horizontal axis).

6 Open-ended strategy questions

6.2 Frequencies of strategies

The figure below shows the frequencies of various self-reported strategies as coded by the authors.

Figure 6.1: Coded frequencies of different strategies in the open-ended responses. A “Strong” response indicates comparison to the null distribution, or its use to assess sampling variability or the sampling distribution. A “Weak only” response indicates some mention of symmetry or asymmetry (or its use as a strategy).

Table 6.1: The conditional probabilities of the various responses. Each entry shows the probability of the response type in the row given the response type in the column. These probabilities exclude the 29 missing/irrelevant responses.

                  (given)
                  Comparison   Samp. variance   Asymmetry   Inc. asymmetry   No shuffles
Comparison        1            0.71             0.59        0.64             0
Samp. variance    0.7          1                0.62        0.6              0.07
Asymmetry         0.62         0.65             1           1                0.71
Inc. asymmetry    0.2          0.19             0.29        1                0.25
No shuffles       0            0.01             0.07        0.08             1
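
For reference, a sketch of how such a table can be computed from a participants-by-strategies 0/1 coding matrix (variable names are illustrative; this is not the authors' coding pipeline):

```r
# P(row code | column code) from a 0/1 matrix `codes` with one row per
# participant and one column per strategy code. crossprod() counts
# co-occurrences; dividing column j by the count of code j conditions on it.
cond_prob <- function(codes) {
  codes <- as.matrix(codes)
  co <- crossprod(codes)              # co[i, j] = # participants with both codes
  sweep(co, 2, colSums(codes), "/")   # divide each column by its code's count
}
```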

7 Exploratory sanity checks

This section reports a few analyses based on the coded open-text responses. Analyses in this section should be treated with skepticism due to their exploratory nature and the fact that they condition on an observed variable that is likely correlated in interesting ways with the decision. Also, some people did not respond to these questions, so their strategy is unknown; yet they are categorized with those who responded and offered an invalid strategy.

7.1 Error rates

We can compare the error rates of participants who reported using a strong significance-testing strategy with those of participants who did not.

Figure 7.1: Proportion who gave a “difference” response as a function of effect size, for participants who were coded as reporting 0, 1, or 2 strong significance testing strategies. Ribbons are standard errors.

7.2 Sampling behaviour

Did participants who reported more strong significance testing strategies use more null samples, on average? It appears so (left figure below; note the log scale).

Did participants who reported more strong significance testing strategies use more (or fewer) experimental samples, on average? There’s not much evidence either way, but importantly (see plot) those who reported fewer strong significance testing strategies didn’t also sample fewer experiments on average. There is nothing here to suggest they were less engaged on average.

Figure 7.2: Boxes, in order from left to right (red, green, blue), correspond to 0, 1, and 2 strong significance testing strategies, respectively. Lines are robust regression fits. Stars show participants who indicated they did not use the random shuffle reports. For null samples (left), the lines order from bottom to top as (0, 1, 2); for experimental samples (right), the order from bottom to top is (3, 0, 1).

8 Signal detection analysis

We can treat the participants as “detectors” of the true effect and ask how well they were able to distinguish signal from noise in the aggregate. To do this, we fit a signal detection model that allowed \(d'\) to vary as a function of the true effect size.

Let \(d'_i\) (\(i=-7,\ldots,7\)) be the means for the 15 effect size conditions from \(\delta=-1\) to \(\delta=1\). We constrain \(d'_0=0\), \(d'_i\leq d'_j\) when \(i<j\) (monotonicity), and \(d'_{i}=-d'_{-i}\) (symmetry). The probabilities of responding “Jinglies” or “Sparklies” in the \(i\)th effect size condition are \[ \begin{eqnarray} Pr_i(\mbox{Sparklies}) &=& \Phi\left(c_1 - d'_i\right)\\ Pr_i(\mbox{Jinglies}) &=& 1 - \Phi\left(c_2 - d'_i\right) \end{eqnarray} \] where \(\Phi\) is the cumulative distribution function of the standard normal distribution and \(c_1,c_2\) are criteria (not necessarily symmetric). The probability of responding “same” or “cannot detect” is the remaining probability.

The model has 9 parameters (two criteria and seven means, though the monotonicity assumption adds more constraint) for 30 non-redundant data points. The model was fit using maximum likelihood (R code is available in the source of this document).
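
The authors' fitting code is in the source of this document; as a schematic stand-in, a sketch of the constrained likelihood is below (the parameter layout, names, and the log-increment trick for monotonicity are ours):

```r
# Negative log-likelihood for the aggregate signal detection model. theta holds
# the two criteria and seven log-increments; cumsum(exp(...)) enforces the
# monotonicity constraint d'_1 <= ... <= d'_7 (all positive, so d'_0 = 0 fits
# below them), and symmetry supplies the negative-effect-size conditions via
# d'_{-i} = -d'_i. counts is a 15 x 3 matrix of response counts (Sparklies,
# same/cannot detect, Jinglies) ordered from delta = -1 to delta = 1. A full
# implementation would also constrain c1 < c2.
negloglik <- function(theta, counts) {
  c1 <- theta[1]
  c2 <- theta[2]
  d_pos <- cumsum(exp(theta[3:9]))   # d'_1, ..., d'_7 (increasing, positive)
  dp <- c(-rev(d_pos), 0, d_pos)     # d'_i for i = -7, ..., 7
  nll <- 0
  for (i in 1:15) {
    p_spark <- pnorm(c1 - dp[i])
    p_jing  <- 1 - pnorm(c2 - dp[i])
    p_same  <- 1 - p_spark - p_jing  # "same" / "cannot detect" responses
    nll <- nll - dmultinom(counts[i, ], prob = c(p_spark, p_same, p_jing),
                           log = TRUE)
  }
  nll
}

# e.g.: fit <- optim(c(-1, 1, rep(0, 7)), negloglik, counts = counts)
```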

Figure 8.1: Aggregate signal detection analysis. Fitted d’ for each effect size. See text for a description of the fitted model.

Figure 8.1 shows the estimated \(d'\) parameters as a function of effect size. They range from 1.415 when \(\delta=.1\) to 3.237 when \(\delta=1\).

Figure 8.2: Observed and fitted probabilities for each effect size and response. Lines show predicted probabilities; ribbons show where 68% of the observed probabilities should fall given the predicted probabilities. These limits are approximate due to the discreteness of the response.

Figure 8.2 shows the predicted and observed probabilities of all responses. The good fit of this model licenses the interpretation of the \(d'\) parameters and the collapsing across the sign of the effect in Figure 3 of the main document.


Compiled 2020-11-04 16:46:21 (Europe/London) under R version 4.0.3 (2020-10-10).


References

Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge, UK: Cambridge University Press.
Lindley, D. V. 1957. “A Statistical Paradox.” Biometrika 44: 187–92.

  1. The CSS code for the transition from red (-1) to white (0) was linear-gradient(to right, rgba(255,0,0,1), rgba(255,0,0,0)); from 0 to 1 the transition was the reverse.