Day 4: Data Cleaning and Preparation with Pandas
Welcome to Day 4. Today we confront a reality of the AI industry: data is rarely handed to you on a silver platter. Real datasets, generated by humans or logging systems, are usually messy.
They contain missing fields, misspelled column names, mismatched types, and are often split across a dozen different tables. Feed missing data into most machine learning algorithms and they will either raise an error or quietly produce unreliable results.
Today we learn how to clean and merge our data!
Coping with Missing Data
When you load a .csv into Pandas, any blank cells are automatically converted into a special NumPy value called NaN (Not a Number). You have to handle them before you model.
You basically have three choices:
1. Drop them: Delete the entire row. Fast, but you lose data. (df.dropna())
2. Fill them: Replace them with a default value, such as the mean (average) of the column. (df.fillna())
3. Interpolate: Estimate the missing value based on the values before and after it. (df.interpolate())
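Option 1 is the simplest, so here's a minimal sketch of it on a tiny made-up DataFrame:

```python
import pandas as pd
import numpy as np

# A tiny illustrative DataFrame (made up for this sketch)
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

print(len(df))         # 3 rows before dropping
dropped = df.dropna()  # remove every row containing at least one NaN
print(len(dropped))    # only the first, fully-complete row survives
```

Notice how aggressive this is: two of the three rows each had only one missing cell, yet both were deleted entirely.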
Let's look at day4_ex1.py, where we generate a broken dataset, use Pandas' .mean() method to fill in missing ages, and use interpolation to estimate a missing exam score!
# day4_ex1.py
import pandas as pd
import numpy as np # We import numpy so we can artificially inject NaNs
# Create a sample dataset full of missing (NaN) values
data = {
"Name": ["Alice", "Bob", np.nan, "David"],
"Age": [25, np.nan, 30, 35],
"Score": [85, 90, np.nan, 88],
}
df = pd.DataFrame(data)
print(df)
# 1. Fill missing 'Age' with the statistical MEAN (Average) of the Age column
df["Age"] = df["Age"].fillna(df["Age"].mean())
# 2. Interpolate the missing 'Score'
# (It will look at the 90 before it, and 88 after it, and guess 89!)
df["Score"] = df["Score"].interpolate()
# 3. Clean up messy column names with clearer, underscore-separated ones
df = df.rename(columns={"Name": "Student_Name", "Score": "Exam_Score"})
print("Cleaned Dataset: \n", df)
Name Age Score
0 Alice 25.0 85.0
1 Bob NaN 90.0
2 NaN 30.0 NaN
3 David 35.0 88.0
Cleaned Dataset:
Student_Name Age Exam_Score
0 Alice 25.0 85.0
1 Bob 30.0 90.0
2 NaN 30.0 89.0
3 David 35.0 88.0
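Before committing to any fill strategy, it's worth counting how many values are actually missing. Here's a quick sketch using the same sample data as day4_ex1.py:

```python
import pandas as pd
import numpy as np

# Same sample data as day4_ex1.py
data = {
    "Name": ["Alice", "Bob", np.nan, "David"],
    "Age": [25, np.nan, 30, 35],
    "Score": [85, 90, np.nan, 88],
}
df = pd.DataFrame(data)

# Count NaNs per column before deciding how to handle them
print(df.isna().sum())  # each column contains exactly one NaN
```

If a column were mostly NaN, filling it with the mean would be misleading, and dropping the column might be the better call.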
Creating New Features (Data Transformation)
Data Cleaning isn't just fixing errors; it's also preparing the data to be more useful for the AI. Often, you will take existing columns and do math on them to create entirely new columns. In AI, this is called "Feature Engineering."
# Create a new percentage column (assuming the raw 'Score' is out of a maximum of 200)
df["Score_Percentage"] = (df["Score"] / 200) * 100
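The fragment above can be made self-contained. Here's a minimal runnable sketch; the names and the maximum score of 200 are assumptions for illustration, not part of the earlier exercises:

```python
import pandas as pd

# Made-up raw scores, assumed to be out of a maximum of 200
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [150, 180]})

# Feature Engineering: derive a new column from an existing one
df["Score_Percentage"] = (df["Score"] / 200) * 100
print(df["Score_Percentage"].tolist())  # [75.0, 90.0]
```

The model never sees your intent, only your columns, so a well-chosen derived column like a percentage can carry more signal than the raw number.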
Bringing Datasets Together
Imagine you have one spreadsheet with Customer Names and Addresses, and a second spreadsheet with their Purchase History. To do ML, you must merge them. Pandas handles this elegantly with .merge(), .join(), and pd.concat().
Let's look at Exercise 2 (day4_ex2.py). We have two totally different DataFrames that share a single common column: an ID.
# day4_ex2.py
import pandas as pd
df1 = pd.DataFrame({
"ID": [1,2,3],
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
})
df2 = pd.DataFrame({
"ID": [1,2,3],
"Score": [85, 90, 88]
})
# We MERGE them together, joining matching rows on the "ID" column!
merged = pd.merge(df1, df2, how="inner", on="ID")
print("Merged Dataset: \n", merged)
# Output shows a beautiful Table containing ID, Name, Age, AND Score!
Types of Merges
The how="..." parameter determines what happens when an ID doesn't appear in both tables:
* inner: Only keep rows where the ID exists in both tables.
* left: Keep all rows from the first table, filling with NaN where the second table has no match.
* right: The mirror image of left: keep all rows from the second table.
* outer: Keep all rows from both tables, filling with NaN wherever a match is missing.
Stacking tables on top of each other is a different operation entirely: for that, use pd.concat([df1, df2], axis=0).
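The merge variants above are easiest to see on a pair of small made-up tables whose IDs only partially overlap:

```python
import pandas as pd

# Two tables with a partially overlapping "ID" column (made-up data)
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Score": [90, 88, 95]})

inner = pd.merge(df1, df2, how="inner", on="ID")  # only IDs 2 and 3 match
left = pd.merge(df1, df2, how="left", on="ID")    # all of df1; ID 1 gets NaN Score

print(len(inner))                  # 2
print(len(left))                   # 3
print(left["Score"].isna().sum())  # 1 (Alice has no matching score)

# Stacking rows instead of joining on a key:
stacked = pd.concat([df1, df1], axis=0)
print(len(stacked))                # 6
```

Choosing the wrong `how` is a common silent bug: an inner merge on mismatched IDs can shrink your dataset without any warning.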
Wrapping Up Day 4
Congratulations. You are now equipped to scrub bad data from datasets, engineer new features, and merge dozens of spreadsheets into a single master table.
Tomorrow, on Day 5: Data Aggregation and Grouping in Pandas, we learn how to answer high-level questions about our data by grouping rows together and summarizing them!