Handling Missing Data - Removal

February 11, 2018

Ideally, all data would be complete and free of errors. In the real world, this could not be further from the truth. Thankfully, Pandas comes with many methods that make addressing these issues easier. This is especially important because many machine learning algorithms are unable to handle missing data.

There are, in general, two ways to handle entries with missing data: we either fill the blanks with reasonable values or remove the entire record from the dataset. This article discusses how we can identify and remove these records - you can refer to link if you are interested in learning how we can reasonably plug the gaps.


Import Libraries

import numpy as np
import pandas as pd


How Does Pandas Identify “Missing” Data?

To understand how we can remove missing data, we need to first understand how Pandas determines that something is “missing”.

Pandas treats the following values as missing: None, np.nan (NaN) and pd.NaT. Versions from 1.0 onwards also add pd.NA.
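One way to check how Pandas classifies a value is pd.isna. The sketch below tests the three markers used later in this article (None, np.nan and pd.NaT). The empty string and zero are contrast cases added here for illustration and are not part of the original text.

```python
import numpy as np
import pandas as pd

# Values to test; the last two are illustrative contrast cases.
candidates = {
    "None": None,
    "np.nan": np.nan,
    "pd.NaT": pd.NaT,
    "empty string": "",
    "zero": 0,
}

# pd.isna returns True for values Pandas treats as missing.
is_missing = {label: bool(pd.isna(value)) for label, value in candidates.items()}

for label, flag in is_missing.items():
    print(f"{label}: {flag}")
```

Note that an empty string and zero are ordinary values as far as Pandas is concerned, so dropna will not remove them.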


Removing Missing Data

If the missing data can be ignored without any detriment to the analysis, we can use the dropna method.

In a Series, dropna is pretty straightforward - it removes any value that Pandas considers as missing.


Create Series with Missing Data

weights = pd.Series([50, pd.NaT, 60, None, 70, np.nan])
weights
0      50
1     NaT
2      60
3    None
4      70
5     NaN
dtype: object
weights.dropna()
0    50
2    60
4    70
dtype: object
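As the output above suggests, dropna returns a new Series and keeps the original index labels (0, 2, 4). The following is a minimal sketch of that behaviour. The variable names cleaned and renumbered are illustrative, and the reset_index step is an optional extra that the original text does not cover.

```python
import numpy as np
import pandas as pd

weights = pd.Series([50, pd.NaT, 60, None, 70, np.nan])

# dropna returns a new Series; the original is left untouched.
cleaned = weights.dropna()

# The surviving values keep their original index labels.
print(list(cleaned.index))

# Optionally renumber the index from 0 after dropping.
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```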


Things are a little more complicated for dropna on a DataFrame because the data is two-dimensional: we can drop either rows or columns.


Create DataFrame with Missing Data

food_pref = {'name': ['Amy', 'Berry', 'Cory'],
             'ice_cream_pref': ['Chocolate', 'Vanilla', None], 
             'drinks_pref': ['Coke', None, None]}

food_pref_df = pd.DataFrame(food_pref)
food_pref_df
   drinks_pref  ice_cream_pref   name
0         Coke       Chocolate    Amy
1         None         Vanilla  Berry
2         None            None   Cory


Calling dropna on the DataFrame without any additional parameters drops every row that contains at least one missing value.

food_pref_df.dropna()
   drinks_pref  ice_cream_pref name
0         Coke       Chocolate  Amy
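To see which rows such a call would remove before actually dropping them, one option is to combine isna with any and sum. This is a sketch under the assumption that the DataFrame is built as above. The names rows_with_missing and missing_per_column are illustrative.

```python
import pandas as pd

food_pref_df = pd.DataFrame({
    'name': ['Amy', 'Berry', 'Cory'],
    'ice_cream_pref': ['Chocolate', 'Vanilla', None],
    'drinks_pref': ['Coke', None, None],
})

# True for each row that has at least one missing value;
# these are the rows a plain dropna() would remove.
rows_with_missing = food_pref_df.isna().any(axis=1)
print(rows_with_missing)

# Number of missing values in each column.
missing_per_column = food_pref_df.isna().sum()
print(missing_per_column)
```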


If you wish to drop every column that contains a missing value instead, pass the parameter axis=1 to the function.

food_pref_df.dropna(axis=1)
    name
0    Amy
1  Berry
2   Cory


If you want to retain only the rows with at least n non-missing values, you can use the thresh parameter.

# keeps all rows with at least 2 non-blank values
food_pref_df.dropna(thresh=2)
   drinks_pref  ice_cream_pref   name
0         None       Chocolate    Amy
1         None         Vanilla  Berry
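For comparison, dropna also accepts the how and subset parameters, which are not covered in the text above. The sketch below places them next to thresh on the same data. The variable names are illustrative, and the choice of 'ice_cream_pref' as the subset column is an assumption made only for this example.

```python
import pandas as pd

food_pref_df = pd.DataFrame({
    'name': ['Amy', 'Berry', 'Cory'],
    'ice_cream_pref': ['Chocolate', 'Vanilla', None],
    'drinks_pref': ['Coke', None, None],
})

# Keep rows with at least 2 non-missing values (as in the article).
at_least_two = food_pref_df.dropna(thresh=2)

# Drop a row only if every field in it is missing.
# No row qualifies here, because each row has a name.
not_all_missing = food_pref_df.dropna(how='all')

# Only consider the listed columns when looking for missing values.
has_ice_cream = food_pref_df.dropna(subset=['ice_cream_pref'])

print(list(at_least_two.index))
print(list(not_all_missing.index))
print(list(has_ice_cream.index))
```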