Handling Missing Data - Replacement
February 13, 2018
Handling Missing Data - Replacement
Ideally, all data will be complete and without errors. In the real world, this cannot be further from the truth. Thankfully, Pandas comes with many methods that make addressing these issues easier. This is especially important as most machine learning algorithms are unable to handle missing data.
There are in general, two ways to handle entries with missing data. We either fill the blanks with reasonable values, or remove the entire record from the dataset. This article will discuss how we can identify missing fields and fill them - you can refer to link if you are interested to learn about how to remove such entries.
Import Libraries
import pandas as pd
import numpy as np
How Does Pandas Identify “Missing” Data
To understand how we can remove missing data, we need to first understand how Pandas determines that something is “missing”.
The following is an exhaustive list of what Pandas considers as missing.
- The Python keyword
None
np.nan
pd.NaT
(Not a Date)
Create DataFrame with Missing Data
screening_data = {'name': ['Amy', 'Barry', 'Cory'],
'weight_kg': [60, np.nan, 70],
'height_m': [1.6, np.nan, np.nan]}
screening_df = pd.DataFrame(screening_data)
screening_df
height_m | name | weight_kg | |
---|---|---|---|
0 | 1.6 | Amy | 60.0 |
1 | NaN | Barry | NaN |
2 | NaN | Cory | 70.0 |
If we wish to replace all blank values with a single value, simply pass it as a parameter to the fillna
function.
screening_df.fillna(50)
height_m | name | weight_kg | |
---|---|---|---|
0 | 1.6 | Amy | 60.0 |
1 | 50.0 | Barry | 50.0 |
2 | 50.0 | Cory | 70.0 |
If you want to fill blanks by the averages of each column, you can do the following.
screening_df.fillna(screening_df.mean())
height_m | name | weight_kg | |
---|---|---|---|
0 | 1.6 | Amy | 60.0 |
1 | 1.6 | Barry | 65.0 |
2 | 1.6 | Cory | 70.0 |
Another common way to fill missing values will be to determine the group that they belong to and use the means for that group.
scores = {'gender': ['m', 'm', 'm', 'f', 'f', 'f'],
'score': [1, 1, np.nan, 10, 10, np.nan]}
scores_df = pd.DataFrame(scores)
scores_df
gender | score | |
---|---|---|
0 | m | 1.0 |
1 | m | 1.0 |
2 | m | NaN |
3 | f | 10.0 |
4 | f | 10.0 |
5 | f | NaN |
If we believe that there is a distinct difference for test performance between males and females and that the mean test scores for each gender will be a good proxy for the missing values, we can utilise the following code snippet to accomplish this.
fill_mean = lambda scores: scores.fillna(scores.mean())
scores_df = scores_df.groupby('gender')['score'].apply(fill_mean)
scores_df
0 1.0
1 1.0
2 1.0
3 10.0
4 10.0
5 10.0
Name: score, dtype: float64