Handling Missing Data - Replacement

February 13, 2018

Handling Missing Data - Replacement

Ideally, all data will be complete and without errors. In the real world, this cannot be further from the truth. Thankfully, Pandas comes with many methods that make addressing these issues easier. This is especially important as most machine learning algorithms are unable to handle missing data.

There are in general, two ways to handle entries with missing data. We either fill the blanks with reasonable values, or remove the entire record from the dataset. This article will discuss how we can identify missing fields and fill them - you can refer to link if you are interested to learn about how to remove such entries.


Import Libraries

import pandas as pd
import numpy as np


How Does Pandas Identify “Missing” Data

To understand how we can remove missing data, we need to first understand how Pandas determines that something is “missing”.

The following is an exhaustive list of what Pandas considers as missing.


Create DataFrame with Missing Data

screening_data = {'name': ['Amy', 'Barry', 'Cory'], 
                  'weight_kg': [60, np.nan, 70], 
                  'height_m': [1.6, np.nan, np.nan]}

screening_df = pd.DataFrame(screening_data)
screening_df
height_m name weight_kg
0 1.6 Amy 60.0
1 NaN Barry NaN
2 NaN Cory 70.0


If we wish to replace all blank values with a single value, simply pass it as a parameter to the fillna function.

screening_df.fillna(50)
height_m name weight_kg
0 1.6 Amy 60.0
1 50.0 Barry 50.0
2 50.0 Cory 70.0


If you want to fill blanks by the averages of each column, you can do the following.

screening_df.fillna(screening_df.mean())
height_m name weight_kg
0 1.6 Amy 60.0
1 1.6 Barry 65.0
2 1.6 Cory 70.0


Another common way to fill missing values will be to determine the group that they belong to and use the means for that group.

scores = {'gender': ['m', 'm', 'm', 'f', 'f', 'f'], 
          'score': [1, 1, np.nan, 10, 10, np.nan]}

scores_df = pd.DataFrame(scores)
scores_df
gender score
0 m 1.0
1 m 1.0
2 m NaN
3 f 10.0
4 f 10.0
5 f NaN


If we believe that there is a distinct difference for test performance between males and females and that the mean test scores for each gender will be a good proxy for the missing values, we can utilise the following code snippet to accomplish this.

fill_mean = lambda scores: scores.fillna(scores.mean())
scores_df = scores_df.groupby('gender')['score'].apply(fill_mean)
scores_df
0     1.0
1     1.0
2     1.0
3    10.0
4    10.0
5    10.0
Name: score, dtype: float64
comments powered by Disqus