Data Science Notes: Confidence Intervals

This year I resolved to start learning Data Science on my free time with the expectation of finding a way to use it in my every day work.

To help me in the process I have been writing some notes in the format of a private blog (which is a recurring tip that experts give to begginners). I’ll start publishing my private notes some other day, but today I wanted to write about the lastest item I have read about.

Disclaimer: The following are more my notes and less a tutorial.

Confidence Intervals

According to Harvey Motulsky, in Essential Biostatistics:

CIs express precission or margin of errors and so let you make a general conclusion from limited data. But this only works under the following assumptions:

random sample / representative sample
independent observations
correctly tabulated data / free of bias

Meaning of 95% CI

If you calculate the 95% CI of a given observation, you would expect the real population to be encompassed by your CI 95% of the times. But there is no way for you to know whether the real population lies within your CI or not.

We do not say there is a 95% chance that the true population is in my CI, that’s flipping things around. The true population is fixed, and there is 95% chance that your CI contains it.

To make this more clear: let’s say you forgot your keys at the kitchen table (but you don’t remember where). If I ask you to guess where your keys are you may say something like “I am 95% sure that I forgot them at home”, which is different from “There is a 95% chance that the keys are at home”. Your keys are 100% at home, whether you know it or not.

A simulation

Harvey proposes the following exercise to understand better. I added some code to help.

Imagine you have a bowl with 100 balls, 25 of them are red and 75 are black. Pretend you are a researcher who doesn’t know the real distribution.

red_balls = ['red' for i in range(25)]
black_balls = ['black' for i in range(75)]
bowl = red_balls + black_balls

Next we mix the balls, choose one randomly and put it back again in the bowl. We repeat this process 15 times and calculate the 95% CI for this proportion:

import random
from statsmodels.stats.proportion import proportion_confint

def simulation():
    number_of_red_balls_observed = 0
    NUMBER_OF_TRIALS = 15
    CONFIDENCE_LEVEL = 0.95
    for i in range(NUMBER_OF_TRIALS):
        random.shuffle(bowl)
        ball = random.choice(bowl)
        if ball == 'red':
            number_of_red_balls_observed += 1
    ci_low, ci_up = proportion_confint(number_of_red_balls_observed, NUMBER_OF_TRIALS, alpha=1.0 - CONFIDENCE_LEVEL)
    observed_proportion = number_of_red_balls_observed/NUMBER_OF_TRIALS
    return (ci_low, ci_up, observed_proportion)

If we repeat this exercise 20 times we should see that:

about half of the times the observed proportion is above the real population
the other half of the times it would be lower
5% of the times the calculate CI will not encompass the real population

The following figure shows the 20 confidence intervals in the form of bars, with a line in the middle indicating the observed proportion. A horizontal line show the true proportion (25% of the balls are red).

Figure 1: Confidence intervals of 20 samples from a binomial distribution B(15,0.25)

The code

You can find and run the full code at: github.com/ariera/essential-biostatistics