Skip to content

Day 1: Probability Theory and Random Variables

Welcome to Day 1 of Probability and Statistics Week. We begin with the foundation of uncertainty: Probability.

When an AI tries to translate a sentence from English to French, it doesn't "know" the right word. It calculates the probability of every possible word and picks the one with the highest score.

Basic Probability Concepts

  • Sample Space: The set of all possible outcomes. (e.g., A six-sided die has a sample space of {1, 2, 3, 4, 5, 6}).
  • Event: A specific outcome or subset of outcomes we care about. (e.g., Rolling an even number: {2, 4, 6}).
  • Independence: Two events are independent if the outcome of one does not affect the outcome of the other. If you flip a coin twice, the first flip has zero effect on the second flip!

We can simulate probability effortlessly with numpy:

# day1_ex1.py
import numpy as np

# Simulate rolling a 6-sided die 10,000 times!
rolls = np.random.randint(1, 7, size=10000)

# What is the probability of rolling an Even number?
# (Sum up all the True values, and divide by the total number of rolls)
P_even = np.sum(rolls % 2 == 0) / len(rolls)
print("P(Even): ", P_even) 
# Output: ~0.50 (As expected, 50% chance!)

# What is the probability of rolling greater than a 4? (5 or 6)
P_greater_than_4 = np.sum(rolls > 4) / len(rolls)
print("P(Greater than 4): ", P_greater_than_4)
# Output: ~0.33 (33% chance!)

Random Variables

A Random Variable (RV) is a mathematical way to map the outcome of a random event to a number. * Discrete RVs have specific, countable values. (e.g., The number rolled on a die). * Continuous RVs can be any value within a range. (e.g., The exact height of a human).

RVs are defined by their curves: * A PMF (Probability Mass Function) shows the probability curve for Discrete variables. * A PDF (Probability Density Function) shows the curve for Continuous variables.

# day1_ex2.py
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import uniform

# We can plot the PDF of a Continuous "Uniform" curve
# A uniform curve means every single number has the exact same probability of being picked!
x = np.linspace(0, 1, 100)
pdf = uniform.pdf(x, loc=0, scale=1)

plt.plot(x, pdf, color="red")
plt.title("PDF of Uniform(0,1)")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.show() # Draws a perfectly flat red line!

Expectation and Variance

If you play a game of chance infinitely, what is your average mathematical outcome? This is called your Expectation (or Mean).

But how "wild" is the game? Does it always result in a number close to the mean, or does it swing wildly between massive wins and massive losses? That spread is the Variance (and its square root, the Standard Deviation).

# day1_sample.py
import numpy as np

# A fair 6-sided die
outcomes = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.array([1/6]* 6)

# The Expectation is the Sum of (Outcome * Probability of that Outcome)
expectation = np.sum(outcomes * probabilities)
print("Expectation (Mean): ", expectation) 
# Output: 3.5 

variance = np.sum((outcomes - expectation)**2 * probabilities)
std_dev = np.sqrt(variance)

print("Variance: ", variance) # Output: ~2.91
print("Standard Deviation: ", std_dev) # Output: ~1.70
This tells us that the average roll is a 3.5, and we expect most rolls to deviate from that average by roughly 1.7!

Wrapping Up Day 1

You now understand the difference between discrete and continuous variables, and how to calculate mathematically what you "expect" to happen in a scenario involving uncertainty.

But these were simple, flat distributions. Real world data isn't flat; it clusters into beautiful, predictable bell curves. Tomorrow, on Day 2: Probability Distributions in ML, we will dive deep into Gaussians!