Handling Missing Data

Handling Missing Data - Replacement

February 13, 2018

Handling Missing Data - Replacement

Ideally, all data will be complete and without errors. In the real world, this cannot be further from the truth. Thankfully, Pandas comes with many methods that make addressing these issues easier. This is especially important as most machine learning algorithms are unable to handle missing data.

There are in general, two ways to handle entries with missing data. We either fill the blanks with reasonable values, or remove the entire record from the dataset. This article will discuss how we can identify missing fields and fill them - you can refer to link if you are interested to learn about how to remove such entries.

Import Libraries

import pandas as pd
import numpy as np

How Does Pandas Identify “Missing” Data

To understand how we can remove missing data, we need to first understand how Pandas determines that something is “missing”.

The following is an exhaustive list of what Pandas considers as missing.

The Python keyword None
np.nan
pd.NaT (Not a Date)

Create DataFrame with Missing Data

screening_data = {'name': ['Amy', 'Barry', 'Cory'], 
                  'weight_kg': [60, np.nan, 70], 
                  'height_m': [1.6, np.nan, np.nan]}

screening_df = pd.DataFrame(screening_data)
screening_df

	height_m	name	weight_kg
0	1.6	Amy	60.0
1	NaN	Barry	NaN
2	NaN	Cory	70.0

If we wish to replace all blank values with a single value, simply pass it as a parameter to the fillna function.

screening_df.fillna(50)

	height_m	name	weight_kg
0	1.6	Amy	60.0
1	50.0	Barry	50.0
2	50.0	Cory	70.0

If you want to fill blanks by the averages of each column, you can do the following.

screening_df.fillna(screening_df.mean())

	height_m	name	weight_kg
0	1.6	Amy	60.0
1	1.6	Barry	65.0
2	1.6	Cory	70.0

Another common way to fill missing values will be to determine the group that they belong to and use the means for that group.

scores = {'gender': ['m', 'm', 'm', 'f', 'f', 'f'], 
          'score': [1, 1, np.nan, 10, 10, np.nan]}

scores_df = pd.DataFrame(scores)
scores_df

	gender	score
0	m	1.0
1	m	1.0
2	m	NaN
3	f	10.0
4	f	10.0
5	f	NaN

If we believe that there is a distinct difference for test performance between males and females and that the mean test scores for each gender will be a good proxy for the missing values, we can utilise the following code snippet to accomplish this.

fill_mean = lambda scores: scores.fillna(scores.mean())
scores_df = scores_df.groupby('gender')['score'].apply(fill_mean)
scores_df

0     1.0
1     1.0
2     1.0
3    10.0
4    10.0
5    10.0
Name: score, dtype: float64

Handling Missing Data - Replacement

February 13, 2018