If you've been living under a rock [Ed: note irony, as it's doubtful this is on your mind], you may be unaware of the tremendous controversy brewing in academic circles over the reproducibility of published research. For those who think this is a silly intellectual argument, the truth is quite alarming. The whole point of published research is to collaborate and build on findings, but if an experiment cannot be reproduced, that is a big problem, and it is widespread. Closely related to the reproducibility crisis, or as some argue, a harbinger of it, is p-hacking.
FiveThirtyEight has a great non-technical way of getting your head around the concept, which, in conjunction with the graphic below (via Nature), might make you say "ahhhh." P-hacking is especially alarming when dealing with very large multi-dimensional data, where causation plays a distant second fiddle to correlation. This, in essence, is the challenge of novel analytical techniques (which I try not to lump together as "Big Data," forever in quotes).
If you've made it this far, you are in for a treat! The comic-style explanation (below) of evaluating p-values brings it all home. The idea is fairly straightforward, but the implications are vaster-than-vast as non-stats types conjure up spurious-er claims, leading decision makers into potentially murky and unfounded territory. Having a foundational understanding is crucial.
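To make the trap concrete, here's a quick simulation sketch (my own illustration, not from the FiveThirtyEight piece or the comic): dredge through enough variables that are pure noise and some of them will look "significant" at p < 0.05 purely by chance.

```python
import numpy as np
from scipy import stats

# Illustration of p-hacking: test many variables that are pure noise
# and count how many "significant" results appear anyway.
rng = np.random.default_rng(42)
n_variables = 100      # number of unrelated variables we dredge through
n_per_group = 30       # sample size in each of two groups
alpha = 0.05

false_positives = 0
for _ in range(n_variables):
    group_a = rng.normal(size=n_per_group)   # no real effect:
    group_b = rng.normal(size=n_per_group)   # both groups come from the same distribution
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < alpha:
        false_positives += 1

print(f"{false_positives} of {n_variables} comparisons were 'significant' at p < {alpha}")
# Expect roughly 5 spurious hits purely by chance -- that's the p-hacking trap.
```

Report only the handful of hits and bury the other ninety-odd comparisons, and you have a "finding" that no one else will be able to reproduce.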
I’m currently reading Everybody Lies by Seth Stephens-Davidowitz and this anecdote about data scientists sums it up for me:
Too many data scientists today are accumulating massive sets of data and telling us very little of importance, e.g., that the [New York] Knicks are very popular in New York.