Summary
The chapters of this dissertation either seek to understand and detect threats to the epistemology of science (Chapters 1-6) or make practical advances to remedy epistemological threats (Chapters 7-9).
Chapter 1 reviews the literature on responsible conduct of research, questionable research practices, and research misconduct. Responsible conduct of research is often defined in terms of a set of abstract, normative principles, professional standards, and ethics in doing research. In order to accommodate the normative principles of scientific research, the professional standards, and a researcher’s moral principles, transparent research practices can serve as a framework for responsible conduct of research. Here I suggest a ‘prune-and-add’ project structure to enhance transparency and, by extension, responsible conduct of research. Questionable research practices are defined as practices that are detrimental to the research process. Their prevalence remains largely unknown, and the reproducibility of findings has been shown to be problematic. Transparent practices discourage questionable research practices, because such practices become more apparent to scientific peers. Most effective might be preregistration of the research design, hypotheses, and analyses, which reduces particularism of results by providing an a priori research scheme. Research misconduct is defined as fabrication, falsification, and plagiarism (FFP) and is clearly the worst type of research practice. Even though it is clearly wrong, it can be approached from both a scientific and a legal perspective. The legal perspective treats research misconduct as a form of white-collar crime. The scientific perspective seeks to answer the question “were results invalidated because of the misconduct?” I review how misconduct is typically detected, how its detection can be improved, and how prevalent it might be. Institutions could facilitate the detection of data fabrication and falsification by implementing data auditing. Nonetheless, the effects of misconduct are pervasive: many retracted articles are still cited after the retraction has been issued.
Head et al. (2015b) provided a large collection of \(p\)-values that, from their perspective, indicates widespread statistical significance seeking (i.e., \(p\)-hacking). Chapter 2 inspects this result for robustness. Theoretically, the \(p\)-value distribution should be a smooth, decreasing function, but the distribution of reported \(p\)-values shows systematically more \(p\)-values reported as .01, .02, .03, .04, and .05 than \(p\)-values reported to three decimal places, due to apparent tendencies to round \(p\)-values to two decimal places. Head et al. (2015b) correctly argue that an aggregate \(p\)-value distribution could show a bump below .05 when left-skew \(p\)-hacking occurs frequently. However, the elimination of \(p=.045\) and \(p=.05\), as done in the original paper, is debatable. Given that eliminating \(p=.045\) stems from the need for symmetric bins, and that systematically more \(p\)-values are reported to two decimal places than to three, I did not exclude \(p=.045\) and \(p=.05\). I applied Fisher’s method to \(.04<p<.05\) and reanalyzed the data by adjusting the bin selection to \(.03875<p\leq.04\) versus \(.04875<p\leq.05\). Results of the reanalysis indicate that no evidence for left-skew \(p\)-hacking remains when I consider the entire range \(.04<p<.05\) or when I inspect the second decimal. Taking reporting tendencies into account when selecting the bins to compare is especially important because this dataset does not allow recalculation of the \(p\)-values. Moreover, inspecting the bins that include two-decimal reported \(p\)-values potentially increases sensitivity if strategic rounding down of \(p\)-values, as a form of \(p\)-hacking, is widespread. Given the far-reaching implications of supposed widespread \(p\)-hacking throughout the sciences (Head et al. 2015b), it is important that these findings are robust to data analysis choices if the conclusion is to be considered unequivocal. Although no evidence of widespread left-skew \(p\)-hacking is found in this reanalysis, this does not mean that there is no \(p\)-hacking at all. These results nuance the conclusion of Head et al. (2015b), indicating that their results are not robust and that the evidence for widespread left-skew \(p\)-hacking is ambiguous at best.
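To make the combination step concrete, the sketch below (in Python, not the chapter’s actual analysis code) shows one way Fisher’s method can be applied to the interval \(.04<p<.05\); the rescaling direction, which maps a pile-up just below .05 onto small transformed values, is an illustrative assumption.

```python
# Minimal sketch of Fisher's method on a truncated p-value interval.
# Under uniformity (no left-skew p-hacking) the rescaled values are uniform on (0, 1);
# a pile-up just below .05 maps to values near 0 and inflates the chi-square statistic.
import math
from scipy import stats

def fisher_left_skew(p_values, lower=0.04, upper=0.05):
    """Combine reported p-values strictly inside (lower, upper) into one Fisher test."""
    selected = [p for p in p_values if lower < p < upper]
    rescaled = [(upper - p) / (upper - lower) for p in selected]
    chi2 = -2 * sum(math.log(r) for r in rescaled)
    df = 2 * len(rescaled)
    return chi2, df, stats.chi2.sf(chi2, df)

# Example with invented reported p-values; only those between .04 and .05 are used.
chi2, df, p = fisher_left_skew([0.049, 0.048, 0.047, 0.041, 0.044])
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.3f}")
```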
Chapter 3 examined 258,050 test results across 30,710 articles from eight high-impact journals to investigate whether there is a peculiar prevalence of \(p\)-values just below .05 (i.e., a bump) in the psychological literature and whether it increased over time. I indeed found evidence for a bump just below .05 in the distribution of exactly reported \(p\)-values in the journals Developmental Psychology, Journal of Applied Psychology, and Journal of Personality and Social Psychology, but the bump did not increase over the years and disappeared when using recalculated \(p\)-values. I found clear and direct evidence for the QRP “incorrect rounding of \(p\)-value” (John, Loewenstein, and Prelec 2012) in all psychology journals. Finally, I also investigated monotonic excess of \(p\)-values, an effect of certain QRPs that has been neglected in previous research, and developed two measures to detect it by modeling the distributions of statistically significant \(p\)-values. Using simulations and applying the two measures to the retrieved test results, I argue that, although one of the measures suggests the use of QRPs in psychology, it is difficult to draw general conclusions about QRPs based on modeling of \(p\)-value distributions.
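As an illustration of the “incorrect rounding of \(p\)-value” check, the sketch below recomputes a \(p\)-value from a reported \(t\) statistic and compares it with the reported value; the fixed rounding margin is a simplification of how reporting precision is handled in practice.

```python
# Illustrative sketch (not the chapter's code): recompute the p-value from the
# reported test statistic and degrees of freedom, then ask whether the reported
# p-value is consistent with it once rounding is taken into account.
from scipy import stats

def check_t_result(t, df, reported_p, decimals=3, alpha=0.05):
    recalculated = 2 * stats.t.sf(abs(t), df)   # two-tailed p-value
    margin = 0.5 * 10 ** -decimals              # simple allowance for rounding
    inconsistent = abs(reported_p - recalculated) > margin
    # "Gross" inconsistency: the reporting error flips statistical significance.
    gross = inconsistent and ((reported_p <= alpha) != (recalculated <= alpha))
    return recalculated, inconsistent, gross

# t(58) = 1.92 corresponds to p of roughly .06; reporting it as .05 flips significance.
print(check_t_result(t=1.92, df=58, reported_p=0.05))
```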
In Chapter 4 I examined evidence for false negatives in nonsignificant results in three different ways. I adapted the Fisher method to detect the presence of at least one false negative in a set of statistically nonsignificant results. Simulations show that the adapted Fisher method is generally a powerful method to detect false negatives. I examined evidence for false negatives in the psychology literature in three applications of the adapted Fisher method. These applications indicate that (i) the observed effect size distribution of nonsignificant effects exceeds the expected distribution assuming a null effect, and approximately two out of three (66.7%) psychology articles reporting nonsignificant results contain evidence for at least one false negative; (ii) nonsignificant results on gender effects contain evidence of true nonzero effects; and (iii) the statistically nonsignificant replications from the Reproducibility Project: Psychology (RPP) do not warrant strong conclusions about the absence or presence of true zero effects underlying these nonsignificant results. I conclude that false negatives deserve more attention in the current debate on statistical practices in psychology. Potentially neglecting effects due to a lack of statistical power can lead to a waste of research resources and stifle the scientific discovery process.
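A minimal sketch of the adapted Fisher method is given below, assuming the transformation \(p^{*}=(p-\alpha)/(1-\alpha)\) for nonsignificant \(p\)-values: under the null hypothesis of true zero effects the transformed values are uniform on (0, 1), so small transformed values provide evidence for at least one false negative in the set.

```python
# Hedged sketch of an adapted Fisher test for false negatives among
# nonsignificant results, using the rescaling described in the lead-in.
import math
from scipy import stats

def adapted_fisher(nonsignificant_p, alpha=0.05):
    transformed = [(p - alpha) / (1 - alpha) for p in nonsignificant_p if p > alpha]
    chi2 = -2 * sum(math.log(p_star) for p_star in transformed)
    df = 2 * len(transformed)
    return chi2, df, stats.chi2.sf(chi2, df)

# Example: several nonsignificant results just above .05 yield a small combined p,
# i.e., evidence that at least one of them is a false negative.
print(adapted_fisher([0.06, 0.08, 0.11, 0.07, 0.09]))
```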
Chapter 5 describes a dataset that is the result of content mining 167,318 published articles for statistical test results reported according to the standards prescribed by the American Psychological Association (APA). Articles published by the APA, Springer, Sage, and Taylor & Francis were included (mining from Wiley and Elsevier was actively blocked). As a result of this content mining, 688,112 results from 50,845 articles were extracted. To provide a comprehensive set of data, the statistical results are supplemented with metadata from the articles they originate from. The dataset is provided as a comma-separated values (CSV) file in long format. For each of the 688,112 results, 20 variables are included, of which seven are article metadata and 13 pertain to the individual statistical result (e.g., the reported and recalculated \(p\)-value). A five-pronged approach was taken to generate the dataset: (i) collect journal lists; (ii) spider journal pages for articles; (iii) download articles; (iv) add article metadata; and (v) mine articles for statistical results. All materials, scripts, etc. are available at https://github.com/chartgerink/2016statcheck_data and preserved at http://dx.doi.org/10.5281/zenodo.59818.
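The sketch below indicates how such a long-format file could be loaded and summarized; the file name and column names (e.g., `doi`, `reported_p`, `recalculated_p`) are placeholders for illustration and not necessarily the variable names used in the published dataset.

```python
# Illustrative use of the long-format CSV: one row per statistical result,
# supplemented with article metadata (placeholder column names).
import pandas as pd

results = pd.read_csv("statcheck_data.csv")   # hypothetical file name

# Group by an article identifier to get the number of results per article.
per_article = results.groupby("doi").size()
print(per_article.describe())

# Flag results whose reported p-value deviates from the recalculated one
# by more than a crude rounding margin.
results["inconsistent"] = (results["reported_p"] - results["recalculated_p"]).abs() > 0.005
print(results["inconsistent"].mean())
```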
In Chapter 6, I test the validity of statistical methods to detect fabricated data in two studies. In Study 1, I tested the validity of statistical methods to detect fabricated data at the study level using summary statistics. Using (arguably) genuine data from the Many Labs 1 project on the anchoring effect (\(k=36\)) and data for the same effect fabricated by our participants (\(k=39\)), I tested the validity of our newly proposed ‘reversed Fisher method’, variance analyses, and extreme effect sizes, as well as a combination of these three indicators using the original Fisher method. Results indicate that the variance analyses perform fairly well when the homogeneity of population variances is accounted for, and that extreme effect sizes perform similarly well in distinguishing genuine from fabricated data. The performance of the ‘reversed Fisher method’ was poor and depended on the types of tests included. In Study 2, I tested the validity of statistical methods to detect fabricated data using raw data. Using (arguably) genuine data from the Many Labs 3 project on the classic Stroop task (\(k=21\)) and data for the same effect fabricated by our participants (\(k=28\)), I investigated the performance of digit analyses, variance analyses, multivariate associations, and extreme effect sizes, as well as a combination of these four methods using the original Fisher method. Results indicate that variance analyses, extreme effect sizes, and multivariate associations perform fairly well to excellent in detecting fabricated data using raw data, whereas digit analyses perform at chance level. The two studies provide mixed results on how the use of random number generators affects the detection of data fabrication. Ultimately, I consider the variance analyses, extreme effect sizes, and multivariate associations valuable tools to detect potential data anomalies in empirical (summary or raw) data. However, I argue against widespread (possibly automatic) application of these tools, because some fabricated data may be irregular in one aspect but not in another. Considering that violations of the assumptions of fabrication detection methods may yield high false positive or false negative probabilities, I recommend comparing potentially fabricated data to genuine data on the same topic.
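To illustrate one of the digit analyses referred to above, the sketch below tests whether terminal (final) digits of raw observations are uniformly distributed, as would be expected for fine-grained genuine measurements but not necessarily for humanly fabricated numbers; the data and reporting precision are invented for the example.

```python
# Illustrative terminal-digit analysis: chi-square goodness-of-fit test of the
# last reported digit against a uniform distribution over 0-9.
from collections import Counter
from scipy import stats

def terminal_digit_test(values, decimals=2):
    digits = [int(round(abs(v) * 10 ** decimals)) % 10 for v in values]
    counts = Counter(digits)
    observed = [counts.get(d, 0) for d in range(10)]
    expected = [len(digits) / 10] * 10
    return stats.chisquare(observed, expected)

# Hypothetical reaction times recorded to two decimals.
print(terminal_digit_test([0.52, 0.61, 0.47, 0.55, 0.49, 0.63, 0.58, 0.51, 0.66, 0.44]))
```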
Chapter 7 tackles the issue of data extraction. It is common for authors to communicate their results in graphical figures, but those data are frequently unavailable for reanalysis. Reconstructing data points from a figure manually requires measuring their coordinates, either on printed pages using a ruler or on the display screen using a cursor. This is time-consuming (often hours), error-prone, and limited by the precision of the display or ruler. What is often not realised is that, if the figure is stored in vector format, the data themselves are held in the PDF document to much higher precision (usually 0.0-0.01 pixels). We developed alpha software to automatically reconstruct data from vector figures and tested it on funnel plots in the meta-analysis literature. The results indicate that reconstructing data from vector-based figures is promising: data were correctly extracted for 12 out of 24 funnel plots (50%). However, vector-based figures turned out to be relatively sparse (15 out of 136 papers with funnel plots), and I strongly encourage publishers to provide more vector-based data figures in the near future for the benefit of the scholarly community.
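The coordinate step in such a reconstruction can be sketched as follows: once point coordinates have been extracted from the vector figure, two known axis ticks per axis define a linear map from PDF units to data units. The calibration values below are invented, and logarithmic axes would require an additional transform.

```python
# Hedged sketch of mapping extracted vector (PDF) coordinates to data coordinates
# using two reference ticks per axis (linear axes assumed).
def make_axis_map(pdf_a, data_a, pdf_b, data_b):
    """Return a function mapping a PDF coordinate to a data coordinate."""
    scale = (data_b - data_a) / (pdf_b - pdf_a)
    return lambda pdf_value: data_a + (pdf_value - pdf_a) * scale

# Suppose the x-axis ticks 0 and 1 sit at 72 and 432 PDF points, and the
# y-axis ticks 0 and 0.5 sit at 90 and 340 PDF points (invented calibration).
x_map = make_axis_map(72, 0.0, 432, 1.0)
y_map = make_axis_map(90, 0.0, 340, 0.5)

extracted = [(180.0, 215.0), (324.0, 140.0)]   # point coordinates read from the figure
print([(x_map(x), y_map(y)) for x, y in extracted])   # -> [(0.3, 0.25), (0.7, 0.1)]
```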
Scholarly research faces threats to its sustainability in multiple domains (access, incentives, reproducibility, inclusivity). In Chapter 8 I argue that “after-the-fact” research papers do not help remedy these threats and actually cause some of them, because the chronology of the research cycle is lost in a research paper. I propose to give up the academic paper in favor of a digitally native “as-you-go” alternative. In this design, modules of research outputs are communicated along the way and are directly linked to each other to form a network of outputs that can facilitate research evaluation. This embeds chronology in the design of scholarly communication and facilitates recognition of more diverse outputs that go beyond the paper (e.g., code, materials). Moreover, using network analysis to investigate the relations between linked outputs could help align evaluation tools with evaluation questions. I illustrate how such a modular “as-you-go” design of scholarly communication could be structured and how network indicators could be computed to assist in the evaluation process, with specific use cases for funders, universities, and individual researchers.
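A toy sketch of such a network indicator is given below; the module identifiers, output types, and the specific indicator (how often an output is built upon) are illustrative choices, not a prescribed evaluation metric.

```python
# Illustrative sketch: research outputs as nodes in a directed network, with
# each module linking to the modules it builds on. A simple indicator counts
# how often each output is built upon (its in-degree).
from collections import defaultdict

# module id -> (output type, parent module ids it links to); hypothetical example
modules = {
    "theory-1": ("theory", []),
    "design-1": ("design", ["theory-1"]),
    "data-1":   ("data",   ["design-1"]),
    "code-1":   ("code",   ["data-1"]),
    "report-1": ("text",   ["data-1", "code-1"]),
}

reuse_count = defaultdict(int)
for output_type, parents in modules.values():
    for parent in parents:
        reuse_count[parent] += 1   # how often a module is linked to by later modules

print(dict(reuse_count))   # e.g., data-1 is built upon twice
```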
A scholarly communication system needs to register, distribute, certify, archive, and incentivize knowledge production. Chapter 9 proposes that the current article-based system technically fulfills these functions, but suboptimally. I propose a module-based communication infrastructure that takes a wider view of these functions and attempts to optimize their fulfillment. Scholarly modules are conceptualized as the constituent parts of a research process, as determined by the researcher. These can be text, but also code, data, and any other relevant piece of information. The chronology of these modules is registered by iteratively linking them to each other, creating a provenance record of parent and child modules (and a network of modules). These scholarly modules are linked to scholarly profiles, creating a network of profiles as well as a network of profiles and their constituent modules. All these scholarly modules would be communicated on the new peer-to-peer Web protocol Dat (datproject.org), which provides a decentralized register that is immutable, facilitates greater content integrity through verification, and is open by design. Being open by design would also allow diversity to arise in the way content is consumed, discovered, and evaluated. This initial proposal needs to be refined and developed further based on technical developments of the Dat protocol and its implementations, as well as discussions within the scholarly community to evaluate the qualities claimed here. Nonetheless, a minimal prototype is available today, showing that the proposal is technically feasible.
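The sketch below illustrates the module and provenance idea only, not the Dat protocol itself; content hashes stand in loosely for the integrity verification a peer-to-peer protocol would provide, and all identifiers are hypothetical.

```python
# Hedged sketch of scholarly modules linked into a provenance record:
# each module records its creator profile and the identifiers of its parents.
import hashlib
import json

def make_module(profile, content, parents):
    record = {"profile": profile, "content": content, "parents": parents}
    record_id = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    return record_id, record

registry = {}
theory_id, theory = make_module("researcher-A", "theory text", [])
registry[theory_id] = theory
data_id, data = make_module("researcher-A", "data description", [theory_id])
registry[data_id] = data

# Walking the parent links reconstructs the chronology (provenance) of the work.
print(registry[data_id]["parents"] == [theory_id])
```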
References
Head, Megan, Luke Holman, Rob Lanfear, Andrew Kahn, and Michael Jennions. 2015b. “The extent and consequences of p-hacking in science.” PLOS Biology 13: e1002106. doi:10.1371/journal.pbio.1002106.
John, Leslie K, George Loewenstein, and Drazen Prelec. 2012. “Measuring the prevalence of questionable research practices with incentives for truth telling.” Psychological Science 23 (5): 524–32. doi:10.1177/0956797611430953.
This GitHub repository has been deleted since this chapter was previously published. The links are included to remain consistent with the published version.