Day 2: Data Scaling and Normalization

Welcome to Day 2! Imagine you are predicting the price of a house. Your dataset has two features: 1. Number of Bedrooms: Values range from 1 to 5. 2. Square Footage: Values range from 800 to 5,000.

If you feed this into a distance-based AI algorithm (like k-NN), the math breaks. The algorithm will look at a difference of \(2,000\) square feet and panic, thinking it is infinitely more important than a difference of \(3\) bedrooms.

To fix this, we mathematically force every single numerical column to be the exact same size. This process is called Scaling.

Two Major Scaling Techniques

1. Min-Max Scaling (Normalization)

This technique squashes all data into a hard range between 0 and 1. The smallest number in the column becomes 0. The largest becomes 1. Everything else is a decimal in between.

Pros: Perfectly retains the exact shape of your original distribution. Great for Neural Networks!
Cons: Highly sensitive to outliers. If one house is a \(100,000\) sq-ft mansion, the mansion becomes 1, and every normal house is crushed into 0.001.

2. Standardization (Z-Score Scaling)

This is the golden standard. Instead of hard boundaries, Standardization centers the data perfectly around 0, with a Standard Deviation of 1.

Pros: Handles outliers gracefully. Most numbers land between -3 and 3. This is required for Logistic Regression, SVMs, and PCA!

Hands-On Let's Scale!

Let's look at day2_ex.py. We load the Iris Dataset and test a k-NN model three times: Unscaled, Min-Max, and Standardized!

# day2_ex.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# ... [Load data and split into X_train, y_train hidden] ...

# 1. Train WITHOUT Scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy Without Scaling:", accuracy_score(y_test, knn.predict(X_test)))

# 2. Train WITH Min-Max Scaling
min_max_scaler = MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X)
# ... [Split and train model again] ...
print("Accuracy with Min-Max Scaling:", accuracy_score(y_test_scaled, knn_scaled.predict(X_test_scaled)))

# 3. Train WITH Standardization
std_scaler = StandardScaler()
X_stand = std_scaler.fit_transform(X)
# ... [Split and train model again] ...
print("Accuracy with Standardization:", accuracy_score(y_test_std, knn_stand.predict(X_test_std)))

Depending on the random state and the exact distribution of the data, scaling almost always boosts accuracy by 2%-10%! Furthermore, gradient-descent algorithms (like Neural Networks) will train exponentially faster if the data is scaled to zero.

Which Algorithms DON'T need Scaling?

There is one family of ML algorithms that does not care about scaling at all: Tree-Based Models (Decision Trees, Random Forests, XGBoost). Because they split data using logical thresholds ("Is Square Footage > 1000?"), the scale of the number is completely irrelevant!

Wrapping Up Day 2

You now know how to stabilize numerical gradients using MinMaxScaler and StandardScaler.

But what if the column doesn't have numbers at all? Tomorrow, on Day 3: Encoding Categorical Variables, we learn how to convert words like "Male" and "Female" into pure mathematics.