
If there’s one thing you need to know about data, it’s that it lies. A lot.

Science is a game of listening to the whispers of mother nature. Sometimes those whispers are relatively easy to hear (for example, objects fall to the Earth with identical acceleration) and sometimes those whispers can barely be discerned above the noise (for example, the theory of general relativity).

But those whispers can be very deceitful, leading to conclusions that aren’t fundamentally sound.

The reason why the whispers don’t always tell the truth is that the whispers must be heard. In other words, we need to design, build, and operate listening devices. We need to record the data that they gather, and we need to interpret that raw data into meaningful, useful conclusions. And that’s where the lies come in.

Every instrument built for the purpose of science, from massive telescopes and atom-smashers to surveys and study groups, introduces bias. A telescope will preferentially see one kind of light over another. A survey will preferentially reach one kind of population over another. And so on. Bias is a part of scientific life. Not intentional, of course. Every scientist would love to live the bias-free life when it comes to collecting data, but it’s simply a part of the game.

So even the raw data, before any analyses or interpretations are done, before charts or graphs are constructed, before dry, obtuse language is applied in a journal article, is full of lies. We never collect what nature is really saying; we collect what our instruments report nature as saying. These are different things, and that distinction is crucial. To blindly trust the data without treating it with caution and wariness is a fool’s errand, and will simply lead to useless, if not outright wrong, results.

The biases get worse as experiments grow more complex and the effects being hunted grow tinier. If there’s a strong signal – a supernova, a clear preference between populations, etc. – then even with bias it can be picked out. But a weak signal, barely perceptible to begin with, can easily be confused for any number of results.

This is what makes evidence-based thinking and policy much harder than it seems on the surface. To the uninitiated, the data are there, ready to be plucked and used to inform our decisions and actions. But what if you didn’t account for bias? What if what you thought were the biases in your instruments were off-base? What if the signal is so small, so subtle, that you can’t possibly hope to cleanly separate useful information from bad?

The antidote is always more data, more experiments, more variation, more trials, more independent looks. Eventually reality reveals itself, but only through a slow, agonizing, meticulous process. Or processes. And who has time for that?
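A toy simulation can make the point concrete. In the sketch below (the numbers and function names are invented for illustration, not drawn from any real experiment), a single instrument carries a fixed, unknown bias, so no number of repeated trials ever converges on the truth. But many independent instruments, each with its own bias, can average toward the real signal, provided the biases aren’t all leaning the same way:

```python
import random

random.seed(42)

TRUE_SIGNAL = 0.1  # the weak effect nature is actually whispering

def run_experiment(n_trials, instrument_bias, noise=1.0):
    """Average n_trials noisy readings from one biased instrument."""
    readings = [TRUE_SIGNAL + instrument_bias + random.gauss(0, noise)
                for _ in range(n_trials)]
    return sum(readings) / n_trials

# One instrument, no matter how many trials, converges on
# signal + bias, not the signal itself:
one_lab = run_experiment(100_000, instrument_bias=0.5)  # ~0.6, not 0.1

# Many independent looks, each with its own unknown bias, can
# average toward the truth if those biases are independent:
labs = [run_experiment(1_000, instrument_bias=random.gauss(0, 0.5))
        for _ in range(200)]
consensus = sum(labs) / len(labs)  # ~0.1
```

The catch, of course, is that this only works when the biases really are independent; if every telescope or survey shares the same blind spot, piling up more of them just makes everyone more confidently wrong.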
