Day 3: Statistical Inference - Estimation and Confidence Intervals

Welcome to Day 3. Today we discuss the most powerful magic trick in statistics: Inference.

Imagine you want to know the average height of every human on Earth (the Population). It is physically impossible to measure 8 billion people. But, through the math of Statistical Inference, if you measure a Sample of just a few thousand people, you can predict the global average with stunning accuracy.

Point Estimation vs. Interval Estimation

Point Estimate: A single guess. (e.g., "The average height is exactly 5'8\""). This is almost guaranteed to be wrong.
Interval Estimate: A range. (e.g., "The average height is between 5'7\" and 5'9\""). This is much safer, but how mathematically confident are we?

This leads us to the Confidence Interval (CI).

The Confidence Interval

A Confidence Interval provides a range of values within which the true population "Mean" is likely to lie. Standard practice in AI and science is to use a 95% Confidence Interval.

This means: "If I took 100 random samples and calculated the interval each time, the true global average would be inside my interval 95 times."

How to Calculate a CI

Let's look at day3_ex1.py. We have a Sample of 100 numbers. We want to know the true Mean.

The formula requires three things: 1. Sample Mean: The simple average of our 100 data points. 2. Standard Error: How spread out is our data? We calculate this using the Standard Deviation divided by the square root of \(n\) (our sample size). 3. Z-Score: A mathematical constant dictating how confident we want to be. For 95% confidence on a normal distribution, the Z-score is roughly 1.96. We can grab the exact number using scipy.stats.norm.ppf.

# day3_ex1.py
import numpy as np
from scipy.stats import norm

# Generate 100 random data points (This is our "Sample"!)
data = np.random.normal(loc=50, scale=10, size=100)

# Calculate the Sample Mean and Standard Deviation
mean = np.mean(data)
# ddof=1 means we use "Bessel's correction" for a sample rather than a population!
std = np.std(data, ddof=1) 
n = len(data)

# Calculate the 95% Confidence Interval
z_value = norm.ppf(0.975) # This gets the ~1.96 Z-score!
margin_of_error = z_value * (std / np.sqrt(n))

ci = (mean - margin_of_error, mean + margin_of_error)

print("Sample Mean: ", mean)
print("95% Confidence Interval: ", ci)
# Output: (48.1, 52.0)
# We isolated the true average using only 100 data points!

The T-Distribution

What happens if you have an incredibly small dataset? Imagine you are building a medical AI and you only have trial data for 7 patients.

If your Sample Size (\(n\)) is less than 30, the Z-Score math breaks. Instead, William Sealy Gosset invented the T-Distribution (under the pseudonym "Student"). It makes the confidence interval wider to account for the massive uncertainty of a tiny sample!

# day3_sample.py
import numpy as np
from scipy.stats import t

# Sample Data (n = 7)
data = [12, 14, 15, 16, 17, 18, 19]

mean = np.mean(data)
std = np.std(data, ddof=1)
n = len(data)

# Use the Student's T distribution! df = degrees of freedom (n-1)
t_value = t.ppf(0.975, df=n-1)
margin_of_error = t_value * (std / np.sqrt(n))
ci = (mean - margin_of_error, mean + margin_of_error) 

print("95% Confidence Interval: ", ci)

Wrapping Up Day 3

Whenever you report a metric in Machine Learning (like "My model achieves 92% accuracy"), you should always include a Confidence Interval ("92% accuracy ± 1.5%"). It proves you understand that your test set was just a sample!

Tomorrow on Day 4: Hypothesis Testing, we take everything we learned about intervals and use it to formally prove hypotheses about our data using the fabled P-Value!