Prologue

The history and practice of science is convoluted (Wootton 2015), but as a student it was taught to me in a relatively uncomplicated manner. Among the things I remember from my high school science classes are how to convert a metric distance into astronomical units (AUs) and that there was something called the empirical research cycle (I always forgot the separate steps and their order, which will, ironically, be a crucial subject of this dissertation). Those classes presented things such as the AU and the empirical cycle as unambiguous truths. In hindsight, it is difficult to imagine these constructed ideas as historically unambiguous. For example, I was taught the AU conversion as simple arithmetic, yet performing that calculation means accepting the outcome of a historically complex process full of debate on how an AU should be defined (Standish 2004). As such, that calculation was path-dependent, just as the history and practice of science in general is path-dependent (Latour and Woolgar 1986; Gelman and Loken 2013).
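To make the triviality of that arithmetic concrete: under the definition fixed by the International Astronomical Union in 2012, one AU equals exactly 149,597,870,700 m, so converting a metric distance amounts to a single division,

\[
d_{\mathrm{AU}} = \frac{d_{\mathrm{m}}}{1.495978707 \times 10^{11}\ \mathrm{m}}.
\]

The division is trivial; the constant in the denominator is where the contested history hides.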

Scientific textbooks understandably present a distillation of the scientific process. Not everyone needs the (full) history of discussions after broad consensus has already been reached. This is a useful heuristic for progress, but it also minimizes (maybe even belittles) the importance of the process (Latour and Woolgar 1986). As such, textbook science (vademecum science; Fleck 1984), with which science teaching starts, offers high certainty and little detail, and thereby provides the breeding ground for a view of science as producing certain knowledge. Through this kind of teaching, storybook images of scientists and science might arise, depicting scientists as discoverers of absolute truths rather than as participants in an uncertain and iterative production of coherent and consistent knowledge. Such storybook images likely result in substantially higher ratings of scientists than of non-scientists with respect to objectivity, rationality, skepticism, rigor, and ethics, even after taking educational level into account (Veldkamp et al. 2016).

Scientific research articles tend to provide more detail and less certainty than scientific textbooks, but they still present storified findings that simplify a complicated process into a single, linear narrative (particularly salient in tutorials on writing journal publications; Bem 2000). Whereas scientific textbooks present a narrative across many studies, scientific articles present a narrative across relatively few. Hence, scientific articles should be relatively better suited than textbooks for assessing the validity of findings, because they have more space to add nuance, provide detail, and contextualize research findings. Nonetheless, the linear narrative of the scientific article distills and distorts a complicated, non-linear research process and thereby leaves little space for the full nuance, detail, and context of findings. Moreover, storification of research results requires flexibility, and its manifestation in flexible analyses may be one of the main culprits behind false positive findings (i.e., incorrectly claiming an effect; Ioannidis 2005) and detracts from accurate reporting. The lack of detail and (excessive) storification go hand in hand with misrepresenting the chronology of events to present a more comprehensible narrative to the reader and researcher. For example, breaks from the main narrative (i.e., nonconfirming results) may be excluded from the report. Such misrepresentation becomes particularly problematic when the validity of the presented findings rests on the actual and complete order of events, as it does in the prevalent epistemological model based on the empirical research cycle (De Groot 1994). Moreover, the storification within scholarly articles can create highly discordant stories across scholarly articles, leading to conflicting narratives and confusion in research fields or news reports and, ultimately, a less coherent understanding of science by both general and specialized audiences.

When I started as a psychology student in 2009, I implicitly perceived science and scientists in the storybook way. I was the first in my immediate family to go to university, so I had no previous informal education about what ‘true’ scientists or ‘true’ science looked like; I was only influenced by depictions in the media and popular culture. In other words, I thought scientists were objective, disinterested, skeptical, rigorous, ethical (and predominantly male). The textbook- and article-based education I received at university did not disconfirm or recalibrate this storybook image and, in hindsight, might have served to reinforce it (e.g., textbooks provided a decontextualized history that presented the path of discovery as linear and ‘the truth’ as unequivocal, multiple-choice exams allowed only correct or wrong answers, and certified stories came in the form of peer-reviewed publications). Granted, the empirical scientist was accorded these storybook qualities precisely because the empirical research cycle provided a way to overcome human biases and grounded the widespread belief that the search for ‘the truth’ was more important than individual gain.

As I progressed through my science education, a series of events that undercut the very epistemological model granting these qualities made apparent how naive the storybook image of science and the scientist was. As a result of these events, I had what I somewhat dramatically called two ‘personal crises of epistemological faith in science’ (or, put plainly, wake-up calls). These crises coincided with several major events within the psychology research community and raised doubts about the value of the research I was studying and conducting. Both crises made me consider leaving scientific research, and I am sure I was not alone in experiencing this sentiment.

My first crisis of epistemological faith came when the psychology professor who got me interested in research publicly confessed to having fabricated data throughout his academic career (Stapel 2012). Having been inspired to go down the path of scholarly research by this very professor, and having worked as a research assistant for him, I doubted myself and my abilities and asked myself whether I was critical enough to conduct, and to recognize, valid research. After all, I had not had the slightest suspicion while working with him. Moreover, I wondered what to make of my interest in research, given that the person who inspired me appeared to be such a bad example to model myself on. Ultimately, I considered it unlikely that the majority of researchers would be fraudsters like this professor, and I simply realized that research could fail at various stages (e.g., data sharing, peer review). This event also unveiled to me the politics of science and showed that validity, rigor, and ‘truth’ finding were not a given (see for example Broad and Wade 1983). Regardless, the self-reported prevalence of fraudulent behaviors among scientists (viz. 2%; Fanelli 2009) was sufficiently low not to undermine the epistemological effort of the scientific collective (although it could still severely distort it). As a result, I became more skeptical of the certified stories in peer-reviewed journals and of my own and others’ research. I ultimately shifted my focus towards studying statistics to improve research.

A second epistemological crisis arose when I took a class showing that scientists undermine the empirical research cycle on a large scale. These behaviors were sometimes intentional, sometimes unintentional, but often the result of misconceptions and flawed procedures adopted to play the game of getting published (Bakker, van Dijk, and Wicherts 2012). More specifically, this epistemological crisis originated from learning how loose application of statistical procedures can produce statistically significant results from pretty much anything (e.g., Simmons, Nelson, and Simonsohn 2011). Additionally, these behaviors result in biased publication of results (Mahoney 1977) through the invisible (and often unaccountable) hand of peer review (Harnad 2000), which itself suffers from various misconceptions. This combination potentially leads to a vicious cycle: overestimated (and sometimes false positive) effects lead to underpowered research that is selectively published, which again leads to overestimated effects and underpowered research, and so on until the cycle is disrupted. These issues are not necessarily new and have been discussed in some form for over 40 years (Sedlmeier and Gigerenzer 1989; Cohen 1962; Rosenthal 1979; Marszalek et al. 2011; Kerr 1998; Mills 1993). Given this longstanding vicious cycle, it seemed unlikely that the issues in empirical research would resolve themselves; they seemed more likely to be further exacerbated if left unattended. Progress on these issues would not be trivial or self-evident, given that previous awareness subsided and earlier attempts to improve the situation did not stick in the long run. It also indicated to me that the reforms needed had to be substantial, because the improvements made over the last decades remained insufficient (although the historical context is highly relevant; see Spellman 2015). Because of the failed attempts in the past and the sustained awareness of these issues over the last six years or so, my epistemological crisis is ongoing and oscillates between frustration and hope for improvement.

Nonetheless, these two epistemological crises led me to become increasingly engaged with various initiatives and research domains to actively contribute towards improving science. This was not only my personal way of coping with these crises and the more specific incidents behind them; it also felt like an exciting space to contribute to. In late 2012, I was introduced to the concept of Open Science for my first big research project. It seemed evident to me that Open Science was a great way to improve the verifiability of research (see also Hartgerink 2015a). The Open Science Framework had launched only recently (Spies 2017), and that is where I started to document my work openly. I found it scary and difficult, and I did not know where to start, simply because I had never been taught to do science this way, nor did anyone really know how. It led me to experiment with these new tools and processes and to find out the practicalities of actually making my own work open, and I have continued to do so ever since. It made me work in a more reproducible, open manner and also led me to become engaged in what are often called the Open Access and Open Science movements. Both movements aim to make knowledge available to all in various ways, going beyond dumping excessive amounts of information by also making it comprehensible, for example by providing clear documentation for data. Not only are the communities behind these movements supportive in educating each other in open practices, they also activated me to help others see the value of Open Science and how to implement it (it all started with Hartgerink 2014). Through this, activism within the realm of science became part of my daily scientific practice.

Actively improving science through doing research became my main motivation to pursue a PhD project. Initially, we set out to focus the PhD project purely on the statistical detection of data fabrication (linking back to the first epistemological crisis). After all, the proposed methods to detect data fabrication had not been widely tested or validated, and there was a clear opportunity for a valuable contribution. Rather quickly, our attention widened to a broader set of issues, resulting in a broad perspective on problems in science: we looked not only at data fabrication, but also at questionable research practices and at statistical results and the reporting thereof, complemented by thinking about how to incentivize rigorous practices. This dissertation presents the results of this work in two parts.

Part 1 of this dissertation (chapters 1-6) pertains to research on understanding and detecting three types of research practice (the good [responsible], the bad [fraudulent], and the ugly [questionable] practices, so to speak). Chapter 1 reviews the literature on research misconduct, questionable research practices, and responsible conduct of research. In addition to introducing these three topics in a systematic way by asking ‘What is it?’, ‘What do researchers do?’, and ‘How can we improve?’, the chapter also proposes a practical computer folder structure for transparent research practices in an attempt to promote responsible conduct of research. In Chapter 2, I report a reanalysis of data indicating widespread \(p\)-hacking across various scientific domains (Head et al. 2015b; Head et al. 2015a). The original research was itself highly reproducible, but slight and justifiable changes to the analyses failed to confirm the finding of widespread \(p\)-hacking across scientific domains. This chapter offered an initial indication of how difficult it is to robustly detect \(p\)-hacking. In an attempt to improve the detection and estimation of \(p\)-hacking, Chapter 3 replicates and extends the findings from Chapter 2. We replicated the analyses using an independent data set of statistical results in psychology (Nuijten, Hartgerink, et al. 2015) and found that \(p\)-value distributions are distorted by reporting habits (e.g., rounding to two decimals). Additionally, we set out to create and apply new statistical models in an attempt to improve the detection of \(p\)-hacking. Chapter 4 focuses on the opposite of false positive results, namely false negative results. Here we argue that, given the published statistically nonsignificant results in combination with typically small sample sizes, researchers let many potentially true effects slip off their radar when nonsignificant findings are interpreted as true zero effects. We introduce the adjusted Fisher method for testing the presence of non-zero true effects among a set of statistically nonsignificant results (a brief sketch of the underlying idea follows this overview), and present three applications of this method. In Chapter 5, I report on a dataset containing over half a million statistical results extracted from the psychology literature with the tool statcheck. This chapter, in the form of a data paper, explains the methodology underlying the data collection process, how the data can be downloaded, that there are no copyright restrictions on the data, and what the limitations of the data are. The dataset was documented and shared for further research on understanding reporting and reported results (original research using these data has already been conducted; Aczel, Palfi, and Szaszi 2017). Chapter 6 presents the results of two studies in which we tried to classify genuine and fabricated data using only statistical methods. In these two studies, we relied heavily on openly shared data from two Many Labs projects (R. A. Klein et al. 2014; Ebersole et al. 2016) and had a total of 67 researchers fabricate data in a controlled setting to determine which statistical methods best distinguish between genuine and fabricated data.
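To give a flavor of that idea (Chapter 4 gives the precise formulation), Fisher’s classic method combines \(k\) independent \(p\)-values into a single test statistic that, under the null hypothesis that none of the underlying effects exist, follows a \(\chi^2\) distribution with \(2k\) degrees of freedom. Because only nonsignificant results are combined here, each \(p\)-value is first rescaled from the interval \((\alpha, 1]\) back to \((0, 1]\), with \(\alpha\) the significance threshold (conventionally .05):

\[
p_i^{*} = \frac{p_i - \alpha}{1 - \alpha}, \qquad \chi^2_{2k} = -2 \sum_{i=1}^{k} \ln p_i^{*}.
\]

An improbably large value of \(\chi^2_{2k}\) then suggests that at least some of the statistically nonsignificant results reflect non-zero true effects.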

Part 2 of this dissertation (chapters 7-9) pertains to practical ways to improve the epistemological sustainability of science, which concerns both the reliability of the knowledge produced and the longevity of the system that produces it. Chapter 7 specifically focuses on retrieving data from empirical research articles that present vector images. We developed and tested software to this end, which is a promising way to mitigate the rapidly decreasing odds of data retrieval as a paper gets older (Vines et al. 2014). In Chapter 8, we present a conceptual redesign of the scholarly communication system based on piecemeal modules, focusing on how networked scholarly communication might facilitate improved evaluation of research and researchers. This conceptual redesign takes into account the issues of restricted access, researcher degrees of freedom, publication biases, perverse incentives for researchers, and other human biases in the conduct of research. The basis of this redesign is a shift from a reconstructive, text-based research article to a decomposed set of research modules that are communicated continuously and contain information in any form (e.g., text, code, data, video). Chapter 9 extends this new form of scholarly communication in its technical foundations and contextualizes it in library and information science (LIS). From LIS, five key functions of a scholarly communication system emerge: registration, certification, preservation, awareness, and incentives (Roosendaal and Geurts 1998; Sompel et al. 2004). First, I describe how the article-based scholarly communication system takes a narrow and unsatisfactory approach to these five functions. Second, I explain how new Web protocols, when used to implement the redesign proposed in Chapter 8, could fulfill the five scholarly communication functions in a wider and more satisfactory sense. In the Epilogue, I provide a high-level framework to inform radical change in the scientific system, which brings together all the lessons from this dissertation.

The order of the chapters in this dissertation does not reflect the exact chronological order of events. Table 0.1 re-sorts the chapters into chronological order and provides additional information for each chapter. More specifically, it includes the copyright license (all chapters can be freely reused and redistributed without permission), a direct link to the collection of materials underlying that chapter (if relevant), whether the chapter was shared as a preprint, and the associated peer-reviewed publication (if any). If published, the chapters in this dissertation may differ slightly in wording or formatting, but they contain substantively the same content. These additional aspects attempt to improve the reproducibility of the chapters, in order to prevent the kinds of issues underlying my epistemological crises. A digital version of this dissertation is available at https://phd.chjh.nl.

Table 0.1: Chronologically ordered dissertation chapters, supplemented with identifiers for the data package, preprint, and peer-reviewed article (where available).

| Chapter | Data package          | Preprint                          | Article                  |
|---------|-----------------------|-----------------------------------|--------------------------|
| 4       | https://osf.io/4d2g9/ | http://doi.org/c9tf               | http://doi.org/c9s7      |
| 2       |                       |                                   | http://doi.org/c9s5      |
| 6       | http://doi.org/c9th   | http://doi.org/c9td               | http://doi.org/c9s6      |
| 3       | http://doi.org/c9tj   | http://doi.org/c9tc               | http://doi.org/c9s8      |
| 5       | http://doi.org/c9tk   | http://doi.org/c9tg               | http://doi.org/gfrjj3    |
| 8       | http://doi.org/c9tm   | https://arxiv.org/abs/1709.02261  |                          |
| 9       |                       | http://doi.org/c9tb               | http://doi.org/c9s9      |
| 1       |                       |                                   |                          |
| 7       | http://doi.org/c9tn   | http://doi.org/c9tq               |                          |
| 10      | http://doi.org/c9tp   |                                   | http://doi.org/gf4hpr    |
| 11      |                       |                                   |                          |
| 12      |                       |                                   |                          |

References

Aczel, Balazs, Bence Palfi, and Barnabas Szaszi. 2017. “Estimating the Evidential Value of Significant Results in Psychological Science.” Edited by Jelte M. Wicherts. PLOS ONE 12 (8). Public Library of Science (PLoS): e0182651. doi:10.1371/journal.pone.0182651.

Bakker, Marjan, Annette van Dijk, and Jelte M Wicherts. 2012. “The rules of the game called psychological science.” Perspectives on Psychological Science 7 (6): 543–54. doi:10.1177/1745691612459060.

Bem, Daryl J. 2000. “Writing an Empirical Article.” In Guide to Publishing in Psychology Journals, edited by Robert J. Sternberg, 3–16. Cambridge, UK: Cambridge University Press. doi:10.1017/cbo9780511807862.002.

Broad, William, and Nicholas Wade. 1983. Betrayers of the Truth. New York, NY: Simon & Schuster.

Cohen, Jacob. 1962. “The Statistical Power of Abnormal-Social Psychological Research: A Review.” The Journal of Abnormal and Social Psychology 65 (3). American Psychological Association (APA): 145–53. doi:10.1037/h0045186.

De Groot, A.D. 1994. Methodologie: Grondslagen van Onderzoek En Denken in de Gedragswetenschappen [Methodology: Foundations of Research and Thinking in the Behavioral Sciences]. Assen, the Netherlands: Van Gorcum.

Ebersole, Charles R., Olivia E. Atherton, Aimee L. Belanger, Hayley M. Skulborstad, Jill M. Allen, Jonathan B. Banks, Erica Baranski, et al. 2016. “Many Labs 3: Evaluating Participant Pool Quality Across the Academic Semester via Replication.” Journal of Experimental Social Psychology 67 (November). Elsevier BV: 68–82. doi:10.1016/j.jesp.2015.10.012.

Fanelli, Daniele. 2009. “How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data.” PloS ONE 4 (5): e5738. doi:10.1371/journal.pone.0005738.

Fleck, Ludwig. 1984. Genesis and Development of a Scientific Fact. Chicago, IL: University of Chicago Press.

Gelman, Andrew, and Eric Loken. 2013. “The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No ‘Fishing Expedition’ or ‘P-Hacking’ and the Research Hypothesis Was Posited Ahead of Time.” https://wayback.archive.org/web/20180712075516/http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf.

Harnad, Stevan. 2000. “The Invisible Hand of Peer Review.” Exploit Interactive. http://cogprints.org/1646/.

Hartgerink, Chris H. J. 2014. “Poster: Low Threshold Open Science.” figshare. doi:10.6084/m9.figshare.928315.v2.

Hartgerink, Chris H. 2015a. “Do Not Trust Science, Verify It.” Authorea, Inc. doi:10.15200/winn.144232.26366.

Head, Megan, Luke Holman, Rob Lanfear, Andrew Kahn, and Michael Jennions. 2015a. “Data from: The extent and consequences of p-hacking in science.” Dryad Digital Repository. doi:10.5061/dryad.79d43.

Head, Megan, Luke Holman, Rob Lanfear, Andrew Kahn, and Michael Jennions. 2015b. “The extent and consequences of p-hacking in science.” PLOS Biology 13: e1002106. doi:10.1371/journal.pbio.1002106.

Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2 (8). Public Library of Science (PLoS): e124. doi:10.1371/journal.pmed.0020124.

Kerr, Norbert L. 1998. “HARKing: Hypothesizing After the Results Are Known.” Personality and Social Psychology Review 2 (3). SAGE Publications: 196–217. doi:10.1207/s15327957pspr0203_4.

Klein, Richard A., Kate A Ratliff, Michelangelo Vianello, Reginald B Adams Jr., Štěpán Bahník, Michael J Bernstein, Konrad Bocian, et al. 2014. “Investigating Variation in Replicability.” Social Psychology 45 (3): 142–52. doi:10.1027/1864-9335/a000178.

Latour, Bruno, and Steve Woolgar. 1986. Laboratory Life: The Construction of Scientific Facts. Princeton, NJ: Princeton University Press. http://aacoa.net/bookinfo/laboratory-life-the-construction-of-scientific-facts.pdf/.

Mahoney, Michael J. 1977. “Publication Prejudices: An Experimental Study of Confirmatory Bias in the Peer Review System.” Cognitive Therapy and Research 1 (2). Springer Nature: 161–75. doi:10.1007/bf01173636.

Marszalek, Jacob M., Carolyn Barber, Julie Kohlhart, and B. Holmes Cooper. 2011. “Sample Size in Psychological Research over the Past 30 Years.” Perceptual and Motor Skills 112 (2). SAGE Publications: 331–48. doi:10.2466/03.11.pms.112.2.331-348.

Mills, James L. 1993. “Data Torturing.” New England Journal of Medicine 329 (16). New England Journal of Medicine (NEJM/MMS): 1196–9. doi:10.1056/nejm199310143291613.

Nuijten, Michèle B., Chris H. J. Hartgerink, Marcel A. L. M. Van Assen, Sacha Epskamp, and Jelte M. Wicherts. 2015. “The Prevalence of Statistical Reporting Errors in Psychology (1985–2013).” Behavior Research Methods 48 (4). Springer Nature: 1205–26. doi:10.3758/s13428-015-0664-2.

Roosendaal, Hans E, and Peter A Th M Geurts. 1998. “Forces and Functions in Scientific Communication: An Analysis of Their Interplay.” http://web.archive.org/web/20180223112609/http://www.physik.uni-oldenburg.de/conferences/crisp97/roosendaal.html.

Rosenthal, Robert. 1979. “The File Drawer Problem and Tolerance for Null Results.” Psychological Bulletin 86 (3). American Psychological Association (APA): 638–41. doi:10.1037/0033-2909.86.3.638.

Sedlmeier, Peter, and Gerd Gigerenzer. 1989. “Do Studies of Statistical Power Have an Effect on the Power of Studies?” Psychological Bulletin 105 (2). American Psychological Association (APA): 309–16. doi:10.1037/0033-2909.105.2.309.

Simmons, Joseph P, Leif D Nelson, and Uri Simonsohn. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science 22 (11): 1359–66. doi:10.1177/0956797611417632.

Sompel, Herbert Van de, Sandy Payette, John Erickson, Carl Lagoze, and Simeon Warner. 2004. “Rethinking Scholarly Communication.” D-Lib Magazine 10 (9). CNRI Acct. doi:10.1045/september2004-vandesompel.

Spellman, Barbara A. 2015. “A Short (Personal) Future History of Revolution 2.0.” Perspectives on Psychological Science 10 (6). SAGE Publications: 886–99. doi:10.1177/1745691615609918.

Spies, Jeffrey Robert. 2017. “The Open Science Framework: Improving Science by Making It Open and Accessible,” April. Center for Open Science. doi:10.31237/osf.io/t23za.

Standish, E. M. 2004. “The Astronomical Unit Now.” Proceedings of the International Astronomical Union 2004 (IAUC196). Cambridge University Press (CUP): 163–79. doi:10.1017/s1743921305001365.

Stapel, Diederik A. 2012. Ontsporing [Derailment]. Amsterdam, the Netherlands: Prometheus.

Veldkamp, Coosje L. S., Chris H. J. Hartgerink, Marcel A. L. M. Van Assen, and Jelte M. Wicherts. 2016. “Who Believes in the Storybook Image of the Scientist?” Accountability in Research 24 (3). Informa UK Limited: 127–51. doi:10.1080/08989621.2016.1268922.

Vines, Timothy H., Arianne Y.K. Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, Jean-Sébastien Moore, Sébastien Renaut, and Diana J. Rennison. 2014. “The Availability of Research Data Declines Rapidly with Article Age.” Current Biology 24 (1). Elsevier BV: 94–97. doi:10.1016/j.cub.2013.11.014.

Wootton, David. 2015. The Invention of Science: A New History of the Scientific Revolution. Toronto, Canada: Allen Lane.