The Sketchy Data Collector and the Problem with Repeated Measurements

I’ve been asked before to explain how repeated measurements can impact statistical models. We often bake repeated measurements – more than one measurement take on the same person, the same experimental animal, etc – into our experiments. They can give us more confidence in noisy data. The downside is that they do need to be accounted for properly when doing the analysis afterwards.

For example, an experiment with repeated measurements probably shouldn’t be analyzed with a simple t-test.

Let’s use an example to give us an intuition for why.

Imagine you wanted to know whether carb-loading resulted in faster race times for mid-distance runners. Say you hire someone to collect some data for you: they recruit runners and assign them to either carb-load or eat a balanced meal before running a timed 5k.

You see a dataset that looks something like this:


Looks pretty good, right? You do a t-test (the data is about normal) and find a significant p-value: .001. Great! This looks significant! You get ready to publish a paper (or a blog post).

But then imagine that your data collector neglected to mention to you that the measurements were all taken on the same person. The data collector is an avid runner themselves, and managed to run 100 5ks. Do you still believe your t-test?

Of course you don’t. Those measurements are all correlated; they’re quite a good measurement of that one person, but who can say how well that one person generalizes to all runners?

And this is the intuition you are looking for: the measurements collected on your sketchy 5k-running data collector are all correlated, and are not independent. A t-test assumes that measurements are independent, and so if you used that test, you would see a falsely-inflated p-value.

A slightly better way to do this experiment would be to identify ten different runners, and have them run ten races each: five with each type of preparatory meal. In this  case we would still be looking at correlated data, because the measurements taken on one person will be similar to each other. Even better would be fifty independent runners, each running two races.

What if the study budget only allows for ten runners? Or what if you expect very noisy measurements and so you want to collect more than one measurement from each subject? These experimental designs are still fine, and we can test our carb-loading hypothesis with something called a mixed-effects model. And that’s a topic for another post.