Day 7: Cross Validation and Hyperparameter Tuning
Welcome to the end of Week 6. You now know the complete pipeline:
Data -> Fill Missing -> Scale -> Encode -> Split -> Train -> Evaluate.
Today, we learn how to streamline all of that into a few lines of code using ColumnTransformer, and how to build a model that automatically tunes itself using GridSearchCV.
1. The ColumnTransformer
Manually calling fit_transform on six different pandas columns is tedious. Scikit-Learn provides ColumnTransformer to automate the entire Feature Engineering step in one fell swoop!
# day7_project.py (Loading Titanic Survival Dataset)
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# ... Clean Missing Values First ...
# We build an array of instruction tuples!
# We tell the Transformer EXACTLY what to do with what columns.
preprocessor = ColumnTransformer(
transformers=[
# Apply the StandardScaler ONLY to 'Age' and 'Fare'
('num', StandardScaler(), ['Age', 'Fare']),
# Apply the OneHotEncoder ONLY to the categorical columns
# (handle_unknown='ignore' stops predict() from crashing on unseen categories)
('cat', OneHotEncoder(handle_unknown='ignore'), ['Pclass', 'Sex', 'Embarked'])
]
)
# Boom! The entire dataset is scaled, encoded, and transformed in one command!
X_preprocessed = preprocessor.fit_transform(X)
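To see exactly what comes out the other end, here is a minimal, runnable sketch on a hypothetical toy frame (the values below are stand-ins, not the real Titanic data):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the Titanic features
X = pd.DataFrame({
    'Age':      [22.0, 38.0, 26.0, 35.0],
    'Fare':     [7.25, 71.28, 7.92, 53.10],
    'Pclass':   [3, 1, 3, 1],
    'Sex':      ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', 'S'],
})

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Age', 'Fare']),
    ('cat', OneHotEncoder(), ['Pclass', 'Sex', 'Embarked']),
])

X_pre = preprocessor.fit_transform(X)

# 2 scaled numeric columns + one column per category:
# Pclass {1, 3} -> 2, Sex {male, female} -> 2, Embarked {S, C} -> 2
print(X_pre.shape)  # (4, 8)
```

Notice that the one-hot columns are created from whatever categories appear in the data, so the output width depends on your dataset, not on the code.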
2. Hyperparameter Tuning with GridSearchCV
Every model has internal settings you must pick. Let's look at RandomForestClassifier. How many trees should be in the forest (n_estimators)? How deep should each tree be (max_depth)?
These are Hyperparameters.
Instead of manually changing the code and re-running the script 50 times, we can use GridSearchCV. We give it a "grid" (a dictionary) of all the settings we want to try. It systematically trains a model on every possible combination of settings, uses K-Fold Cross Validation to make sure each score is robust, and reports the highest-scoring combination of hyperparameters!
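K-Fold Cross Validation is worth seeing on its own before we wire it into the grid search. A minimal sketch using cross_val_score on synthetic stand-in data (the real script would use the preprocessed Titanic matrix):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical, just for illustration)
rng = np.random.RandomState(42)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-Fold CV: train on 4 folds, score on the held-out fold, 5 times over
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores)         # five accuracy values, one per fold
print(scores.mean())  # the average is the robust estimate
```

GridSearchCV runs exactly this procedure once per hyperparameter combination and keeps the combination with the best mean score.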
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# 1. Define the parameters you want to test!
# (3 estimators * 3 depths * 3 splits = 27 totally different models!)
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10]
}
# 2. Set up the GridSearch autonomous bot
# (With cv=5, each of the 27 combinations is trained 5 times... that's 135 fits!)
grid_search = GridSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_grid=param_grid,
scoring='accuracy',
cv=5, # Use 5-Fold Cross Validation for safety!
n_jobs=-1 # Use every CPU core in parallel!
)
# 3. Hit run, and watch your laptop fan spin up!
grid_search.fit(X_preprocessed, y)
# 4. Which of the 27 combinations won the tournament?
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best Accuracy: {grid_search.best_score_:.2f}")
# Example output: Best hyperparameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 50}
# Example output: Best Accuracy: 0.83 (your exact numbers depend on the data and random seed)
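One refinement worth knowing: you can chain the preprocessor and the classifier into a single Pipeline and grid-search the whole thing. That way the scaling and encoding are re-fit inside every cross-validation fold, which avoids leaking information from the validation folds. A minimal sketch on hypothetical toy data (the columns and values below are stand-ins, not the real Titanic frame):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the Titanic frame
X = pd.DataFrame({
    'Age':  [22, 38, 26, 35, 28, 54, 2, 27, 14, 4],
    'Fare': [7.2, 71.3, 7.9, 53.1, 8.1, 51.9, 21.1, 11.1, 30.1, 16.7],
    'Sex':  ['m', 'f', 'f', 'm', 'm', 'm', 'f', 'f', 'f', 'm'],
})
y = pd.Series([0, 1, 1, 0, 0, 0, 1, 1, 1, 1])

pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('num', StandardScaler(), ['Age', 'Fare']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
    ])),
    ('clf', RandomForestClassifier(random_state=42)),
])

# Prefix each hyperparameter with its step name ('clf__') to reach
# inside the pipeline
param_grid = {
    'clf__n_estimators': [10, 50],
    'clf__max_depth': [None, 5],
}

grid = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_)
```

After fitting, grid.best_estimator_ is a fully refit pipeline: you can call .predict() on raw, unscaled, unencoded rows and it handles everything.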
Wrapping Up Week 6!
Congratulations! You just built a complete, professional Machine Learning pipeline: you engineered a clean feature set and grid-searched your way to the best-scoring model.
Next week, we graduate to Week 7: Advanced Machine Learning Algorithms. We will look at Ensemble models, Boosting, and the most dominant tabular algorithm in the world: XGBoost. See you there!