Day 4: Introduction to Classification and Logistic Regression

Welcome to Day 4. Today we shift from predicting continuous numbers (Regression) to predicting discrete categories (Classification).

If I give an AI a photo and ask "Is this a Hotdog?", the answer is binary: Yes (1) or No (0).

Logistic Regression

The most famous beginner classification algorithm is ironically named Logistic Regression.

If you fit standard Linear Regression to labels of 1 and 0, nothing constrains its output: the fitted line keeps rising or falling without bound, so for some inputs it might predict 45,000 or -30. Neither value makes sense as a probability, which must lie between 0 and 1.
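To make the problem concrete, here is a minimal sketch that fits an ordinary least-squares line to 0/1 labels using NumPy's `polyfit`. The toy data (hours studied versus pass/fail) and the helper name `linear_predict` are invented for illustration and are not part of the course files.

```python
import numpy as np

# Hypothetical toy data (not from the course): hours studied and pass (1) / fail (0).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fit a straight line passed ≈ slope * hours + intercept by least squares.
slope, intercept = np.polyfit(hours, passed, deg=1)

def linear_predict(value):
    """Evaluate the fitted line at a single input value."""
    return slope * value + intercept

low = linear_predict(0.0)    # an input below the training range
high = linear_predict(20.0)  # an input far above the training range

print(f"prediction at 0 hours:  {low:.2f}")
print(f"prediction at 20 hours: {high:.2f}")
```

Both predictions land outside the 0-to-1 range (one is negative, the other is greater than 1), so neither can be read as a probability.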

To fix this, Logistic Regression takes the same linear sum (a weighted sum of the input features plus a bias) and passes it through a squashing function called the Sigmoid Function.
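As a rough sketch of that two-step idea, the snippet below computes a linear sum and then applies the sigmoid. The function name `predict_probability`, the weights, the bias, and the input rows are all hypothetical values picked for illustration, not parameters learned from any course data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_probability(X, weights, bias):
    """Linear sum z = X @ weights + bias, followed by the sigmoid."""
    z = X @ weights + bias
    return sigmoid(z)

# Hypothetical parameters and inputs, chosen only for illustration.
weights = np.array([0.8, -0.5])
bias = -0.2
X = np.array([
    [2.0, 1.0],   # z = 0.8*2.0 - 0.5*1.0 - 0.2 = 0.9
    [0.0, 3.0],   # z = 0.8*0.0 - 0.5*3.0 - 0.2 = -1.7
])

probabilities = predict_probability(X, weights, bias)
print(probabilities)
```

The first row has a positive linear sum and so gets a probability above 0.5; the second has a negative sum and gets a probability below 0.5.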

The Sigmoid Function

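The Sigmoid function, denoted \(\sigma(z)\) and defined as \(\sigma(z) = \frac{1}{1 + e^{-z}}\), takes any real number, however large or small, and squashes it into the range between 0 and 1 (approaching, but never mathematically reaching, either end).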

We can easily visualize this mathematical curve:

# day4_sigmoid.py
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # Maps any real number z into the range (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
plt.plot(z, sigmoid(z))
plt.title("Sigmoid Function")
plt.show()
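If you prefer numbers to a plot, a few spot checks show the same shape. This is a small sketch that repeats the `sigmoid` definition so it runs on its own; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))    # 0.5: the midpoint of the curve
print(sigmoid(10.0))   # very close to 1
print(sigmoid(-10.0))  # very close to 0

# Symmetry around the midpoint: sigmoid(-z) equals 1 - sigmoid(z).
z = np.array([-3.0, -1.0, 0.5, 2.0])
print(np.allclose(sigmoid(-z), 1 - sigmoid(z)))  # True
```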

Because the output is always between 0 and 1, we treat it as a Probability.

* If the model outputs 0.85, it estimates an 85% probability that the photo is a hotdog.
* By default, the Decision Boundary sits at a probability of 0.5. Anything above 0.5 is classified as a 1, anything else as a 0.
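Turning probabilities into labels is a single comparison. The sketch below uses a hypothetical helper named `classify` and made-up probabilities; it also shows that the 0.5 cutoff is a choice you can change, not a fixed rule.

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Return 1 where the probability exceeds the threshold, otherwise 0."""
    return (np.asarray(probabilities) > threshold).astype(int)

# Hypothetical model outputs, for illustration only.
probabilities = np.array([0.85, 0.40, 0.55, 0.10])

default_labels = classify(probabilities)                # cutoff at 0.5
strict_labels = classify(probabilities, threshold=0.8)  # demand more confidence

print(default_labels)  # [1 0 1 0]
print(strict_labels)   # [1 0 0 0]
```

Raising the cutoff makes the model say "yes" less often, which can be useful when a false positive is costly.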

Hands-On: Let's Classify!

Let's look at day4_ex1.py. We generate synthetic data where Age and Salary determine if an individual will make a Purchase (1) or not (0).

# day4_ex1.py
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# ... (Synthetic Data Generation Hidden) ...

# 1. Split Data
X_train, X_test, y_train, y_test = train_test_split(
    df[['Age', 'Salary']], 
    df['Purchase'], 
    test_size=0.2, 
    random_state=42
)

# 2. Train the Logistic Regression classifier
model = LogisticRegression() # Scikit-Learn handles the Calculus!
model.fit(X_train, y_train)

# 3. Predict the class labels (outputs an array of 1s and 0s)
y_pred = model.predict(X_test)
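If you are curious what "fitting" involves, here is a from-scratch sketch using plain batch gradient descent on the average log loss. It is an illustration of the idea only: scikit-learn's `LogisticRegression` uses more sophisticated solvers and applies regularization by default, so it will not produce the same weights. The function name `fit_logistic`, the learning rate, the step count, and the toy data are all assumptions made for this example; the data is not the course's Age/Salary set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, n_steps=1000):
    """Plain batch gradient descent on the average log loss (no regularization)."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ weights + bias)   # current predicted probabilities
        error = p - y                     # gradient of the log loss w.r.t. z
        weights -= learning_rate * (X.T @ error) / n_samples
        bias -= learning_rate * error.mean()
    return weights, bias

# Hypothetical, already-scaled toy features with a clear gap between classes.
X = np.array([[-2.0, -1.5], [-1.5, -1.0], [-1.0, -2.0], [-1.0, -0.5],
              [2.0, 1.5], [1.5, 1.0], [1.0, 2.0], [1.0, 0.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

weights, bias = fit_logistic(X, y)
predictions = (sigmoid(X @ weights + bias) > 0.5).astype(int)

print("weights:", weights, "bias:", bias)
print("training accuracy:", (predictions == y).mean())
```

Each step nudges the weights in the direction that makes the predicted probabilities agree better with the labels; on this cleanly separated toy data every training point ends up on the correct side.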

Classification Metrics

How do we know if it worked? MSE and \(R^2\) are designed for continuous targets, so they are a poor fit now that we are predicting discrete labels. Instead we use Accuracy: the fraction of predictions the model got correct.

from sklearn.metrics import accuracy_score

print("Accuracy:", accuracy_score(y_test, y_pred))
# Example output: Accuracy: 0.95 (95% of its predictions were correct)
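Accuracy is simple enough to compute by hand, which helps demystify the number. The sketch below uses a hypothetical helper named `accuracy` and made-up labels; it is meant to mirror the definition, not to replace `accuracy_score`.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 0, 1]

print("Accuracy:", accuracy(y_true, y_pred))  # 8 of 10 match -> 0.8
```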

Wrapping Up Day 4

You have officially trained a Classification model. You can now plot the sigmoid curve, predict binary labels, and measure how often those predictions are correct.

But is Accuracy always the best metric? If you are predicting a rare disease that affects only 1% of the population, a model that blindly predicts "No Disease" for every single person will achieve 99% accuracy while being completely useless.
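You can reproduce this trap in a few lines. The population below is hypothetical (1,000 people, 10 of them sick), and the "model" is just an array of zeros.

```python
import numpy as np

# Hypothetical population: 1,000 people, 10 of whom (1%) have the disease.
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that ignores its input and always predicts "No Disease" (0).
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)
sick_found = int(np.sum((y_true == 1) & (y_pred == 1)))

print("Accuracy:", accuracy)                  # 990 of 1,000 correct
print("Sick patients detected:", sick_found)  # none of the 10
```

The accuracy looks excellent, yet not a single sick patient is identified.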

Tomorrow, on Day 5: Model Evaluation, we will learn about the Confusion Matrix, Precision, and Recall to truly audit our AI models.