Statistical Forensics


A p-value below 0.05 is widely regarded as the threshold a result must meet for a paper to be publishable. As a consequence, many researchers have started engaging in p-hacking to make their results publishable. P-hacking is the process of manipulating data or analyses so that non-significant results become statistically significant. It is usually not outright fraud but falls within a researcher’s freedom to make choices about study design and analysis. This could include deciding which factors to control for, or analyzing many measures but reporting only those with a p-value of less than 0.05.

Resources about p-hacking:

Science isn’t broken. Fivethirtyeight

Statistical Techniques to Detect Fraud:

Checking the Distribution of Rightmost Digits:
When someone fabricates data, the leftmost digits must be chosen to match the desired magnitude, but the rightmost digit is usually given little thought. Statistically speaking, rightmost digits are approximately uniformly distributed in many circumstances. Humans, however, have a very difficult time producing a uniform distribution of digits even when they are trying to do so. This makes testing the rightmost digit for a uniform distribution a powerful tool for detecting fraud.
Check Rightmost Digits for Uniform Distribution
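
As a rough illustration of the rightmost-digit check described above, the sketch below computes a chi-square goodness-of-fit statistic against a uniform distribution over the digits 0 to 9. The function names are hypothetical, the input is assumed to be a list of whole-number values, and the cutoff 16.919 is the standard table value for 9 degrees of freedom at the 0.05 level. A large statistic is only a reason to look closer, not proof of fabrication.

```python
from collections import Counter

# Critical value of the chi-square distribution with 9 degrees of freedom
# at the 0.05 significance level (standard table value).
CHI2_CRIT_DF9_05 = 16.919


def last_digit_chi_square(values):
    """Chi-square statistic for uniformity of the rightmost digits.

    Assumes `values` are whole numbers (hypothetical input format).
    """
    digits = [abs(int(v)) % 10 for v in values]
    counts = Counter(digits)
    expected = len(digits) / 10  # uniform: each digit equally likely
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


def looks_non_uniform(values):
    """True if the rightmost digits deviate from uniform at the 0.05 level."""
    return last_digit_chi_square(values) > CHI2_CRIT_DF9_05
```

The test needs a reasonably large sample to be meaningful; a common rule of thumb is an expected count of at least five per digit, i.e. 50 or more values.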

Testing for Linguistic Obfuscation:
Two Stanford researchers, Markowitz and Hancock, found that papers retracted for scientific fraud shared similar linguistic patterns. Scientists who are committing fraud often use more jargon, avoid positive language and use more negative language, and write with generally lower readability. In the future, these findings could be used to create a computer program that searches papers for these patterns before they are published. Journals would then be able to investigate flagged studies in more depth to ensure that they are not fraudulent.
Markowitz, D.M. & Hancock, J.T. (2015). Linguistic Obfuscation in Fraudulent Science. Journal of Language and Social Psychology, 35(4), 435-445.
Stanford researchers uncover patterns in how scientists lie about their data

P-Curves:
The paper cited below proposes using p-curves, the distribution of the statistically significant p-values reported for an effect, as a way to determine whether a set of statistically significant results has evidential value. When a true effect exists, the p-curve will be right skewed, with more very small p-values (such as 0.01) than p-values just under 0.05. When a researcher p-hacks, they are likely to stop upon reaching statistical significance (p < 0.05), which would cause the p-curve of a p-hacked set of results to be left skewed. Analyzing p-curves can therefore be employed as a way to screen studies: those with left-skewed p-curves can be evaluated by journals and other researchers in more detail.
Simonsohn, U., Nelson, L.D., & Simmons, J.P. (2014). P-Curve: A Key to the File-Drawer. Journal of Experimental Psychology: General, 143(2), 534-547.
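
One simple way to check the skew of a p-curve is a binomial test on the significant p-values: with no true effect and no p-hacking, about half of them should fall below 0.025. The sketch below is a simplified illustration under that assumption, not the full procedure from the paper, and the function name and return keys are hypothetical.

```python
from math import comb


def p_curve_binomial(p_values, alpha=0.05):
    """Binomial skew check on the significant p-values.

    Compares how many significant p-values fall below alpha / 2 with how
    many fall at or above it. Under a flat p-curve each is equally likely.
    """
    sig = [p for p in p_values if 0 < p < alpha]
    n = len(sig)
    low = sum(1 for p in sig if p < alpha / 2)
    high = n - low

    def upper_tail(k):
        # P(X >= k) for X ~ Binomial(n, 0.5)
        return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

    return {
        "n_significant": n,
        "low": low,
        "high": high,
        "p_right_skew": upper_tail(low),   # small: evidence of a real effect
        "p_left_skew": upper_tail(high),   # small: pattern expected from p-hacking
    }
```

A small `p_left_skew` means the significant results cluster just under 0.05, which is the pattern worth a closer look.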

Replicability Index: a statistical tool that can be used to estimate the replicability of studies without investing the resources needed to actually replicate them.
Replicability Index

How Should the Scientific Community Move Forward?

Joseph Simmons, Leif D. Nelson, and Uri Simonsohn (2013), Life After P-Hacking, in NA – Advances in Consumer Research Volume 41, eds. Simona Botti and Aparna Labroo, Duluth, MN : Association for Consumer Research.
To prevent p-hacked results, the scientific community must value statistical power and ensure that underpowered studies are unsustainable and difficult to publish. P-hacking can be of some value during scientific exploration, but any results found through such exploration need to be replicated. Furthermore, emphasizing large samples and replication will decrease the quantity of published scientific papers, but the quality of these papers will drastically increase.
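
To make the point about statistical power concrete, the sketch below estimates how many participants per group a simple two-group comparison would need. It uses the standard normal-approximation formula n = 2((z_{1-alpha/2} + z_{power}) / d)^2, where d is the standardized effect size; the function name and default values are assumptions chosen for illustration, not something taken from the paper above.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-group test.

    Normal approximation; `effect_size` is the standardized mean
    difference (Cohen's d). An exact t-test calculation gives slightly
    larger numbers.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)
```

Under this approximation a medium effect (d = 0.5) needs about 63 participants per group for 80% power, while a small effect (d = 0.2) needs nearly 400, which shows why small studies of small effects are so often underpowered.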

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359-1366.
To avoid false-positive results, authors must take a disclosure-based approach and justify any manipulation of variables, data, conditions, etc. within their published papers. Reviewers must ensure that studies are not underpowered, and require replication if a study’s results are not plausible.

Simmons, J.P. (2016). What I Want Our Field To Prioritize. DataColada.
The scientific community must value studies that are easily replicable, and stress the importance of creating methods that are clear and easy to replicate given certain conditions. Replication studies should be encouraged and more highly incentivized by the scientific community.

Statistical Forensics at Work:

Research 2000: Research 2000’s polls had been ranked among the most inaccurate. However, these polls were praised for their transparency because the firm published detailed crosstabs and methodology reports. Three men approached Daily Kos, which paid for and published the Research 2000 polls, with concerns that the data had been falsified. Their analysis determined that the results of Research 2000’s polls were extremely statistically improbable. Their report analyzed three features, quoted directly here: “

A large set of number pairs which should be independent of each other in detail, yet almost always are either both even or both odd.

A set of polls on separate groups which track each other far too closely, given the statistical uncertainties.

The collection of week-to-week changes, in which one particular small change (zero) occurs far too rarely. This test is particularly valuable because the reports exhibit a property known to show up when people try to make up random sequences.”
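
The first feature in the quoted list can be checked with a simple binomial test. The sketch below assumes each pair holds two independent whole numbers, each equally likely to be even or odd, so that the parities should match about half the time. The function name is hypothetical and this is an illustration of the idea, not the method used in the report.

```python
from math import comb


def parity_match_test(pairs):
    """Count pairs whose two numbers share parity and test against chance.

    Returns (matches, n, tail), where tail is P(X >= matches) for
    X ~ Binomial(n, 0.5), i.e. how likely that many matches is by chance.
    """
    n = len(pairs)
    # Two integers share parity exactly when their difference is even.
    matches = sum(1 for a, b in pairs if (a - b) % 2 == 0)
    tail = sum(comb(n, i) for i in range(matches, n + 1)) / 2 ** n
    return matches, n, tail
```

With only a handful of pairs even a strong pattern can occur by chance; over hundreds of pairs, a match rate near 100% produces a vanishingly small tail probability.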

Daily Kos terminated their relationship with Research 2000 and filed a lawsuit against Research 2000 that was later settled. The report concluded that it was unknown how the Research 2000 results were created but that they were certainly not random polls. The full report is available in the link below.
Research 2000: Problems in plain sight
Statistical Forensics Launches a Polling Donnybrook