Statistical Learning and Scientific Decisions

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University
Blakely B. McShane, Kellogg School of Management, Northwestern University

30 Sep 2017

There are a lot of ways to gain replicable quantitative knowledge about the world. In the design stage, we try to gather data that are _representative_ of the population of interest, collected under _realistic_ conditions, with _measurements_ that are accurate and relevant to the underlying questions of interest. Over the past century, statisticians and applied researchers have refined methods such as probability sampling, randomized experimentation, and reliable and valid measurement so as to gather data that speak ever more directly to the scientific hypotheses and predictions of interest. Each of these protocols is associated with statistical methods for assessing systematic and random error, which is necessary given that we are inevitably interested in making inferences that generalize beyond the data at hand. In practice, it is necessary to go even further and introduce assumptions to deal with issues such as missing data, nonresponse, selection based on unmeasured factors, and plain old interpolation and extrapolation.

All the above is basic to statistics, and the usual textbook story--at least until recently--was that it was all working fine. Researchers design experiments, gather data, add some assumptions, and produce inferences and probabilistic predictions that are approximately calibrated. And the scientific community verifies this through out-of-sample tests--that is, replications.

On the ground, though, something else has been happening. Yes, researchers conduct their studies, but often with little regard for the reliability and validity of measurement. Why? Because the necessary step on the way to publication and to furthering a line of research was _not_ prediction, _not_ replicable inference, but rather statistical significance--p-values below 0.05--a _threshold_ that was supposed to protect science from false alarms but instead, for reasons of "researcher degrees of freedom" (Simmons, Nelson, and Simonsohn, 2011) or the "garden of forking paths" (Gelman and Loken, 2014), was stunningly easy to attain even from pure noise.

In the past few years, it has become increasingly clear that there are areas of science, including certain subfields of social psychology and medicine, where false alarms (which can be defined as cases in which a strong claim is made that is not--and often cannot be--supported by the data at hand) have been sounding so loudly and frequently that little else can be heard. There are two pieces of evidence to support the claim that these areas of science are broken. First, the use of significance testing as a threshold screening rule in domains with noisy data and abundant researcher degrees of freedom yields claims with exaggerated effect sizes (type M or magnitude errors) that are frequently in the wrong direction (type S or sign errors), as discussed, for example, by Button et al. (2013) and Gelman and Carlin (2014). Second, empirical efforts to replicate published findings have frequently failed (Open Science Collaboration, 2015), with many prominent examples; see, for example, Engber (2016) and Bartels (2017). This combined onslaught from statistical theory and empirical evidence has led to the replication crisis being recognized in nearly all corners of the biomedical and social sciences.
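To make the first, theoretical point more concrete, here is a minimal simulation sketch in the spirit of Gelman and Carlin (2014). The numbers are illustrative assumptions rather than estimates from any actual study: a small true effect of 0.1 on a standardized scale, measured with a standard error of 0.5, replicated many times, with only the "statistically significant" estimates retained.

# Illustrative sketch only: a severely underpowered design with a small true
# effect, filtered through the p < 0.05 threshold. All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.1   # assumed small true effect (standardized units)
se = 0.5            # assumed standard error of each study's estimate
n_sims = 100_000

estimates = rng.normal(true_effect, se, size=n_sims)   # replicated noisy estimates
pvals = 2 * stats.norm.sf(np.abs(estimates / se))      # two-sided p-values
significant = pvals < 0.05                             # the publication filter

power = significant.mean()
type_m = np.abs(estimates[significant]).mean() / true_effect               # exaggeration ratio
type_s = (np.sign(estimates[significant]) != np.sign(true_effect)).mean()  # wrong-sign rate

print(f"power: {power:.2f}")
print(f"type M (exaggeration ratio among significant results): {type_m:.1f}")
print(f"type S (share of significant results with the wrong sign): {type_s:.2f}")

With these assumed numbers, power is very low, and the estimates that pass the significance filter both greatly exaggerate the small true effect (type M error) and are, a nontrivial fraction of the time, in the wrong direction (type S error).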
What should be done? In a widely discussed recent paper, Benjamin et al. (2017) recommend replacing the conventional p<0.05 threshold with the more stringent p<0.005 for "claims of new discoveries." However, because merely changing the p-value threshold does nothing to address the theoretical issues discussed above and is unlikely to resolve the empirical ones, we and our colleagues recommend moving beyond the paradigm in which "discoveries" are made by single studies and single (thresholded) p-values. In particular, we suggest that the p-value be treated continuously, as just one among many pieces of evidence--along with related prior evidence, plausibility of mechanism, study design and data quality, real-world costs and benefits, novelty of finding, and other factors that vary by research domain. In doing so, we offer concrete recommendations--for editors and reviewers as well as for authors--for how this can be achieved in practice; see McShane et al. (2017).

Looking forward, we think more work is needed in designing experiments and taking measurements that are more precise and more closely tied to theory and constructs, in making within-person comparisons as much as possible, and in using models that harness prior information, feature varying treatment effects, and are multilevel or meta-analytic in nature (Gelman, 2015, 2017; McShane and Bockenholt, 2017a, b)--and, of course, in tying all of this to realism in experimental conditions.
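To give a sense of what the simplest version of such a model looks like, here is a minimal partial-pooling sketch for a normal-normal meta-analysis, with hypothetical study estimates and standard errors and a between-study standard deviation that, purely for simplicity, is treated as known rather than estimated (in practice it would be estimated or given a prior, for example in a full Bayesian fit).

# Illustrative sketch only: partial pooling in a normal-normal meta-analysis
# with hypothetical inputs and a between-study sd assumed known for simplicity.
import numpy as np

y = np.array([0.9, -0.2, 0.4, 0.1, 0.6])     # hypothetical study-level estimates
se = np.array([0.45, 0.50, 0.30, 0.20, 0.55])  # their standard errors
tau = 0.2                                    # assumed between-study standard deviation

# Precision-weighted estimate of the overall mean effect (random-effects style)
w = 1.0 / (se**2 + tau**2)
mu_hat = np.sum(w * y) / np.sum(w)

# Partial pooling: each study's estimate is pulled toward mu_hat, and more
# strongly so when its standard error is large relative to tau
shrinkage = tau**2 / (tau**2 + se**2)
theta_hat = mu_hat + shrinkage * (y - mu_hat)

for y_j, se_j, t_j in zip(y, se, theta_hat):
    print(f"raw {y_j:+.2f} (se {se_j:.2f}) -> partially pooled {t_j:+.2f}")
print(f"estimated overall effect: {mu_hat:+.2f}")

Rather than declaring each study a "discovery" or not, the multilevel structure stabilizes noisy study-level estimates by borrowing strength across studies and pulling them toward the overall mean.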
Acknowledgements

Published in the Brains Blog, http://philosophyofbrains.com/2017/10/02/should-we-redefine-statistical-significance-a-brains-blog-roundtable.aspx

We thank the National Science Foundation, Institute of Education Sciences, Office of Naval Research, and Defense Advanced Research Projects Agency for partial support of this work.

References

Bartels, M. (2017). 'Power poses' don't really make you more powerful, nine more studies confirm. Newsweek, 13 Sep. http://www.newsweek.com/power-poses-dont-make-you-more-powerful-studies-664261
Benjamin, D. J., et al. (2017). Redefine statistical significance. Nature Human Behaviour. doi:10.1038/s41562-017-0189-z
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B., Flint, J., Robinson, E. S. J., and Munafo, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14, 1-12.
Engber, D. (2016). Everything is crumbling. Slate, 6 Mar. http://www.slate.com/articles/health_and_science/cover_story/2016/03/ego_depletion_an_influential_theory_in_psychology_may_have_just_been_debunked.html
Gelman, A. (2015). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management 41, 632-643.
Gelman, A. (2017). The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Personality and Social Psychology Bulletin.
Gelman, A., and Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science 9, 641-651.
Gelman, A., and Loken, E. (2014). The statistical crisis in science. American Scientist 102, 460-465.
McShane, B. B., and Bockenholt, U. (2017a). Single paper meta-analysis: Benefits for study summary, theory-testing, and replicability. Journal of Consumer Research 43, 1048-1063.
McShane, B. B., and Bockenholt, U. (2017b). Multilevel multivariate meta-analysis with application to choice overload. Psychometrika.
McShane, B. B., Gal, D., Gelman, A., Robert, C., and Tackett, J. L. (2017). Abandon statistical significance. http://www.stat.columbia.edu/~gelman/research/unpublished/abandon.pdf
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 349, 943.
Simmons, J., Nelson, L., and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22, 1359-1366.