Data Science Notes: Confidence Intervals
This year I resolved to start learning Data Science in my free time, with the expectation of finding a way to use it in my everyday work.
To help me in the process I have been writing some notes in the format of a private blog (which is a recurring tip that experts give to beginners). I’ll start publishing my private notes some other day, but today I wanted to write about the latest item I have read about.
Disclaimer: The following are more my notes and less a tutorial.
Confidence Intervals
According to Harvey Motulsky, in Essential Biostatistics:
CIs express precision, or margin of error, and so let you make a general conclusion from limited data. But this only works under the following assumptions:
- random sample / representative sample
- independent observations
- correctly tabulated data / free of bias
Meaning of 95% CI
If you calculated a 95% CI from each of many independent samples, you would expect about 95% of those CIs to include the true population value. For any single CI, however, there is no way for you to know whether it includes the true value or not.
We do not say “there is a 95% chance that the true population value is in my CI”; that’s flipping things around. The true population value is fixed and it is the CI that changes from sample to sample, so we say there is a 95% chance that the CI contains it.
To make this clearer: let’s say you left your keys on the kitchen table (but you don’t remember where). If I ask you to guess where your keys are you may say something like “I am 95% sure that I left them at home”, which is different from “There is a 95% chance that the keys are at home”. Your keys are 100% at home, whether you know it or not.
A simulation
Harvey proposes the following exercise to understand this better. I added some code to help.
Imagine you have a bowl with 100 balls: 25 of them are red and 75 are black. Pretend you are a researcher who doesn’t know the real distribution.
Next we mix the balls, choose one at random, note its color and put it back in the bowl. We repeat this process 15 times and calculate the 95% CI for the observed proportion of red balls.
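A minimal Python sketch of one such sample. It is not the code from the repository linked at the end. The Wilson score interval is just one of several ways to compute a CI for a proportion and is an assumption here, since these notes do not say which method the book uses. The names wilson_ci, bowl and draws, and the seed 42, are only illustrative.

```python
import math
import random


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoots.
    return max(0.0, center - half), min(1.0, center + half)


rng = random.Random(42)  # arbitrary fixed seed so the run is repeatable
bowl = ["red"] * 25 + ["black"] * 75

# Draw 15 balls, putting each one back before the next draw.
draws = [rng.choice(bowl) for _ in range(15)]
reds = draws.count("red")

low, high = wilson_ci(reds, 15)
print(f"observed {reds}/15 red, 95% CI: {low:.3f} to {high:.3f}")
```

With only 15 draws the interval is wide: for 3 red balls out of 15 it runs from roughly 7% to 45%.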
If we repeat this exercise 20 times we should see that:
- about half of the time the observed proportion is above the true proportion
- the other half of the time it is below it
- about 5% of the time (1 in 20, on average) the calculated CI does not include the true proportion
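The repeated exercise can be sketched in Python as follows. This is again an illustration under assumptions, not the repository code: the Wilson score interval is my choice of method, and run_experiments, its parameters and the seeds are hypothetical names and values.

```python
import math
import random


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def run_experiments(n_experiments=20, n_draws=15, p_red=0.25, seed=0):
    """Return one (observed proportion, CI low, CI high) tuple per experiment."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_experiments):
        # Each draw is red with probability p_red (sampling with replacement).
        reds = sum(rng.random() < p_red for _ in range(n_draws))
        low, high = wilson_ci(reds, n_draws)
        results.append((reds / n_draws, low, high))
    return results


TRUE_P = 0.25
results = run_experiments()

above = sum(p_hat > TRUE_P for p_hat, _, _ in results)
below = sum(p_hat < TRUE_P for p_hat, _, _ in results)
missed = sum(not (low <= TRUE_P <= high) for _, low, high in results)

print(f"observed proportion above the true value: {above} of {len(results)}")
print(f"observed proportion below the true value: {below} of {len(results)}")
print(f"CIs that miss the true value: {missed} of {len(results)}")
```

With only 20 repetitions the number of misses jumps around from run to run (0, 1 or 2 are all unsurprising), so the 5% figure is best read as a long-run average. Calling run_experiments(n_experiments=2000) gives a steadier estimate. Because the count of red balls is discrete, the long-run coverage is only approximately 95%.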
The following figure shows the 20 confidence intervals in the form of bars, with a line in the middle indicating the observed proportion. A horizontal line shows the true proportion (25% of the balls are red).
Figure 1: Confidence intervals of 20 samples from a binomial distribution B(15,0.25)
The code
You can find and run the full code at: github.com/ariera/essential-biostatistics