Day 1: Probability Theory and Random Variables
Welcome to Day 1 of Probability and Statistics Week. We begin with the foundation of uncertainty: Probability.
When an AI tries to translate a sentence from English to French, it doesn't "know" the right word. It calculates the probability of every possible word and picks the one with the highest score.
Basic Probability Concepts
- Sample Space: The set of all possible outcomes. (e.g., a six-sided die has a sample space of {1, 2, 3, 4, 5, 6}).
- Event: A specific outcome or subset of outcomes we care about. (e.g., rolling an even number: {2, 4, 6}).
- Independence: Two events are independent if the outcome of one does not affect the outcome of the other. If you flip a coin twice, the first flip has zero effect on the second.
We can simulate probability effortlessly with numpy:
# day1_ex1.py
import numpy as np
# Simulate rolling a 6-sided die 10,000 times!
rolls = np.random.randint(1, 7, size=10000)
# What is the probability of rolling an Even number?
# (Sum up all the True values, and divide by the total number of rolls)
P_even = np.sum(rolls % 2 == 0) / len(rolls)
print("P(Even): ", P_even)
# Output: ~0.50 (As expected, 50% chance!)
# What is the probability of rolling greater than a 4? (5 or 6)
P_greater_than_4 = np.sum(rolls > 4) / len(rolls)
print("P(Greater than 4): ", P_greater_than_4)
# Output: ~0.33 (33% chance!)
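Independence can be checked empirically too. As a sketch (assuming numpy's newer `default_rng` interface), we can simulate pairs of coin flips and confirm that P(both heads) matches P(first heads) × P(second heads):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate 100,000 pairs of fair coin flips (0 = tails, 1 = heads)
flips = rng.integers(0, 2, size=(100_000, 2))

p_first_heads = np.mean(flips[:, 0] == 1)
p_second_heads = np.mean(flips[:, 1] == 1)
p_both_heads = np.mean((flips[:, 0] == 1) & (flips[:, 1] == 1))

# For independent events, P(A and B) = P(A) * P(B)
print("P(both heads):        ", p_both_heads)
print("P(first) * P(second): ", p_first_heads * p_second_heads)
# Both values land near 0.25
```

If the flips were dependent (say, the coin "remembered" its last result), these two numbers would disagree.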
Random Variables
A Random Variable (RV) is a mathematical way to map the outcome of a random event to a number.
- Discrete RVs take specific, countable values. (e.g., the number rolled on a die).
- Continuous RVs can take any value within a range. (e.g., the exact height of a human).
RVs are described by their distributions:
- A PMF (Probability Mass Function) gives the probability of each possible value of a Discrete variable.
- A PDF (Probability Density Function) gives the density over the values of a Continuous variable. For a PDF, probabilities come from the area under the curve, not its height.
# day1_ex2.py
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import uniform
# We can plot the PDF of the Continuous Uniform(0, 1) distribution
# Uniform means the density is the same everywhere in [0, 1]:
# no single region of values is more likely than any other
x = np.linspace(0, 1, 100)
pdf = uniform.pdf(x, loc=0, scale=1)
plt.plot(x, pdf, color="red")
plt.title("PDF of Uniform(0,1)")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.show() # Draws a perfectly flat red line!
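For contrast, here is a minimal sketch of the PMF of a Discrete variable, the fair die, using `scipy.stats.randint` (an assumption: scipy is available, as in the PDF example above):

```python
import numpy as np
from scipy.stats import randint

# Discrete uniform over {1, ..., 6}: a fair die
# (the upper bound is exclusive, like np.random.randint)
die = randint(1, 7)

values = np.arange(1, 7)
pmf = die.pmf(values)
print(dict(zip(values.tolist(), pmf.round(4).tolist())))
# Each value has probability 1/6, and a PMF always sums to exactly 1
print("Sum of PMF:", pmf.sum())
```

Unlike PDF heights, these PMF values are genuine probabilities: you can read P(roll = 3) straight off the function.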
Expectation and Variance
If you play a game of chance infinitely, what is your average mathematical outcome? This is called your Expectation (or Mean).
But how "wild" is the game? Does it always result in a number close to the mean, or does it swing wildly between massive wins and massive losses? That spread is the Variance (and its square root, the Standard Deviation).
# day1_sample.py
import numpy as np
# A fair 6-sided die
outcomes = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.array([1/6] * 6)
# The Expectation is the Sum of (Outcome * Probability of that Outcome)
expectation = np.sum(outcomes * probabilities)
print("Expectation (Mean): ", expectation)
# Output: 3.5
variance = np.sum((outcomes - expectation)**2 * probabilities)
std_dev = np.sqrt(variance)
print("Variance: ", variance) # Output: ~2.92
print("Standard Deviation: ", std_dev) # Output: ~1.71
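The law of large numbers says that if you actually play the game many times, the simulated average converges to this theoretical Expectation. A quick sanity check, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(42)
# Roll the fair die one million times
rolls = rng.integers(1, 7, size=1_000_000)

# The sample mean and variance approach the theoretical
# values of 3.5 and ~2.92 as the number of rolls grows
print("Simulated mean:    ", rolls.mean())
print("Simulated variance:", rolls.var())
```

Try shrinking the sample to 10 rolls: the estimates get noticeably noisier, which is exactly the "spread" that Variance quantifies.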
Wrapping Up Day 1
You now understand the difference between discrete and continuous variables, and how to calculate mathematically what you "expect" to happen in a scenario involving uncertainty.
But these were simple, flat distributions. Real world data isn't flat; it clusters into beautiful, predictable bell curves. Tomorrow, on Day 2: Probability Distributions in ML, we will dive deep into Gaussians!