As we were writing the paper and analyzing participant responses, we had a number of thoughts that might be helpful for interpreting the results and designing future studies. We include them here.

Fallacies/errors in the responses

Misunderstanding of test statistics

The task did not specify what the “evidence” was; in reality, it was a transformed \(z\) statistic. Like most test statistics, it was constructed so that its null distribution is the same for every sample size.
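To make this concrete, the sketch below uses a generic two-sample \(z\) statistic rather than the actual transformation used in the task; the simulated sample sizes and repetition count are illustrative assumptions. It shows that the spread of null (shuffled) test statistics does not shrink as the per-group sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def null_z(n, reps=10_000):
    """Two-sample z statistics computed on null (no-difference) data."""
    a = rng.normal(size=(reps, n))
    b = rng.normal(size=(reps, n))
    # The difference of means has variance 2/n, so dividing by sqrt(2/n)
    # gives a standard normal statistic for every n.
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(2.0 / n)

for n in (10, 100, 1000):
    print(f"n = {n:5d}: SD of null z statistics ~ {null_z(n).std():.3f}")  # ~1 for every n
```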

Many participants were confused by the fact that the variability of the random shuffle reports did not decrease with N, apparently expecting the “evidence” to behave like an effect-size estimate (i.e., with sampling uncertainty that decreases as sample size increases). They took the constant sampling variability to mean that there was little practical difference between the smallest and largest sample sizes, or that the sample size was already very large.

A representative comment along these lines came from R_1kZDBRI07FeT7mM, who said they used the smallest sample size because the random shuffle distribution did not change with the sample size.

Shuffle reports as data

Although the vast majority of participants understood the task, a small number (in the single digits) treated the random shuffle reports as data and based their decision largely on them, presumably because the shuffle reports took no time to sample.

Other uses of shuffle reports

Some people used the random shuffle reports to learn about the task, to check assumptions (e.g., “they should be unbiased/normal”; “the problem is well-behaved”; etc.), or to try to infer something about the differences in the sample sizes (as described above). This was a use of the shuffle reports we had not anticipated.

Confusions between power and sample size

For instance, R_2CVYgvlxT5zVGTk said “I conducted 3 highly powered studies to detect even small differences. The results were not conclusive and had similar variability as a shuffle study.” Because no measure of effect size was calculable and the sample sizes themselves were unknown, participants had no way of assessing the power of the designs they chose; indeed, they could not even know what sample size they had used. Moreover, power requires a decision criterion, and none was given.
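To illustrate why power was unassessable, here is a minimal sketch using a standard two-sided two-sample \(z\)-test approximation (not anything from the task itself): power depends jointly on a standardized effect size \(d\), a per-group sample size \(n\), and a criterion \(\alpha\), none of which was available to participants. The particular values below are arbitrary illustrations.

```python
from scipy.stats import norm

def power_two_sample_z(d, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    d: standardized effect size, n: per-group sample size, alpha: decision criterion.
    """
    z_crit = norm.ppf(1 - alpha / 2)
    shift = d * (n / 2) ** 0.5  # noncentrality of the test statistic
    return norm.cdf(-z_crit + shift) + norm.cdf(-z_crit - shift)

print(power_two_sample_z(d=0.5, n=50))  # ~0.70
print(power_two_sample_z(d=0.2, n=50))  # ~0.17: same n, much lower power
```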

Fooled by random “patterns”

Sometimes people were fooled by apparent patterns in random variation (e.g., R_2twK11fFMD0Fk77). There does not seem to be a way to avoid this; it is the nature of randomness. But one could be more explicit about the design so that participants know more about the sampling process, or trust it more.

Interface issues

Sampling delay

Some participants said the experiments took too long, while others said they did not, and participants used a variety of sample sizes (i.e., they did not automatically use the largest sample size every time). This suggests that we were successful in inducing people to think about trade-offs in resource use.

Instructions

It would have been good to have the instructions available during the task. We offered a short recap before the experiment, but a few participants said they went through the instructions too quickly and missed something.

Visual overwriting

Some people complained about the overlap between points in the interface; we had not anticipated that participants would sample so much. There are a variety of ways of addressing this (jittering the points, or showing a histogram, for instance), but they all have drawbacks (in particular, participants need to understand the visualization), so it is a trade-off. Given the informal nature of the desired inference, it is not clear this is truly a “problem”: we wanted an impression from people, not an inference for which a precise interface would be needed. The summary box also helps (e.g., participants could see the most recent result there).
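As an aside, a minimal sketch of one possible fix, vertical jitter, is shown below. It assumes a simple matplotlib display with simulated values, which is only a stand-in for the actual task interface.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
evidence = rng.normal(size=200)  # stand-in for many sampled "evidence" values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 2.5), sharex=True)
# Left: points on a single line overwrite one another as samples accumulate.
ax1.plot(evidence, np.zeros_like(evidence), "o", alpha=0.4)
ax1.set_title("overplotted")
# Right: small random vertical jitter keeps individual points visible.
ax2.plot(evidence, rng.uniform(-0.3, 0.3, size=evidence.size), "o", alpha=0.4)
ax2.set_title("jittered")
plt.show()
```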

Importance of training in understanding the task

The task was meant to be intuitive in a way that vignettes are often not: many vignettes use technical terminology like \(t\) statistics and \(p\) values outside of the context of any other results. The danger is that although people might understand the basic logic of significance testing, they may not have a strong grasp of this particular way of expressing a result.

However, our task suffers from the same issue (though to a lesser extent, we would argue). In the instructions we made use of ideas such as “null” and “sampling distribution”, which are part of scientists’ training. This allowed us to draw on scientists’ prior knowledge and spend less time training them in the task.

Other populations will not know such terms. A failure to properly train participants from other populations may result in poor performance, for the simple reason that these participants don’t understand the task (not because they can’t reason about it). Translating the task to other populations will require care.