Handling Missing Data - Removal
February 11, 2018
Ideally, all data would be complete and free of errors. In the real world, this could not be further from the truth. Thankfully, Pandas comes with many methods that make addressing these issues easier. This is especially important because many machine learning algorithms are unable to handle missing data.
There are, in general, two ways to handle entries with missing data: we either fill the blanks with reasonable values or remove the entire record from the dataset. This article discusses how we can identify and remove these records; how to reasonably fill the gaps is covered in a separate article.
Import Libraries
import numpy as np
import pandas as pd
How Does Pandas Identify “Missing” Data?
To understand how we can remove missing data, we need to first understand how Pandas determines that something is “missing”.
Pandas considers the following values as missing:
- The Python keyword None
- np.nan
- pd.NaT (Not a Time, the missing value for datetime-like data)
Pandas 1.0 and later also treat pd.NA as missing.
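As a quick check of the list above, the following sketch passes a few values to pd.isna, which reports whether Pandas treats a value as missing. The candidates list is only an illustration; the empty string and 0 are included to show that values which merely look "empty" are not counted as missing.

```python
import numpy as np
import pandas as pd

# Illustrative values: three missing markers, then two ordinary values
# that look "empty" but are not treated as missing.
candidates = [None, np.nan, pd.NaT, "", 0]

# pd.isna returns True for values that Pandas considers missing.
missing_flags = [bool(pd.isna(value)) for value in candidates]
print(missing_flags)  # [True, True, True, False, False]
```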
Removing Missing Data
If it has been determined that the missing data can be ignored without any detriment to the analysis, we can utilise the dropna function.
In a Series, dropna is pretty straightforward - it removes any value that Pandas considers as missing.
Create Series with Missing Data
weights = pd.Series([50, pd.NaT, 60, None, 70, np.nan])
weights
0 50
1 NaT
2 60
3 None
4 70
5 NaN
dtype: object
weights.dropna()
0 50
2 60
4 70
dtype: object
Things are a little more complicated for the dropna function in the DataFrame because the data is two-dimensional: we have to decide whether to drop rows or columns.
Create DataFrame with Missing Data
food_pref = {'name': ['Amy', 'Berry', 'Cory'],
             'ice_cream_pref': ['Chocolate', 'Vanilla', None],
             'drinks_pref': ['Coke', None, None]}
food_pref_df = pd.DataFrame(food_pref)
food_pref_df
|   | drinks_pref | ice_cream_pref | name |
|---|---|---|---|
| 0 | Coke | Chocolate | Amy |
| 1 | None | Vanilla | Berry |
| 2 | None | None | Cory |
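Before removing anything, it can help to see where the gaps are. The sketch below rebuilds the same DataFrame and counts missing values per column and per row using isna and sum. The names missing_per_column and missing_per_row are illustrative, not part of the original example.

```python
import pandas as pd

food_pref_df = pd.DataFrame({
    'name': ['Amy', 'Berry', 'Cory'],
    'ice_cream_pref': ['Chocolate', 'Vanilla', None],
    'drinks_pref': ['Coke', None, None],
})

# isna() marks every missing cell as True; summing counts the True values.
missing_per_column = food_pref_df.isna().sum()     # one count per column
missing_per_row = food_pref_df.isna().sum(axis=1)  # one count per row

print(int(missing_per_column['drinks_pref']))  # 2
print(missing_per_row.tolist())                # [0, 1, 2]
```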
Calling dropna on the DataFrame without any additional parameters drops every row that contains at least one missing value.
food_pref_df.dropna()
|   | drinks_pref | ice_cream_pref | name |
|---|---|---|---|
| 0 | Coke | Chocolate | Amy |
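Note that dropna returns a new DataFrame by default and leaves the original untouched, so the result needs to be assigned to a variable to be kept. A minimal sketch, where cleaned_df is a name chosen only for this illustration:

```python
import pandas as pd

food_pref_df = pd.DataFrame({
    'name': ['Amy', 'Berry', 'Cory'],
    'ice_cream_pref': ['Chocolate', 'Vanilla', None],
    'drinks_pref': ['Coke', None, None],
})

# dropna() returns a new DataFrame; food_pref_df itself is not modified.
cleaned_df = food_pref_df.dropna()

print(len(food_pref_df))  # 3
print(len(cleaned_df))    # 1
```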
If you wish to drop any columns that have missing values, pass the parameter axis=1 to the function.
food_pref_df.dropna(axis=1)
|   | name |
|---|---|
| 0 | Amy |
| 1 | Berry |
| 2 | Cory |
If you are looking to retain only the rows with at least n non-missing values, you can utilise the thresh parameter.
# keeps all rows with at least 2 non-blank values
food_pref_df.dropna(thresh=2)
|   | drinks_pref | ice_cream_pref | name |
|---|---|---|---|
| 0 | Coke | Chocolate | Amy |
| 1 | None | Vanilla | Berry |
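To make the behaviour of thresh concrete, the sketch below reproduces the same filter by hand: it counts the non-missing values in each row with notna and keeps the rows whose count is at least 2. The names non_missing_counts and kept_by_mask are illustrative.

```python
import pandas as pd

food_pref_df = pd.DataFrame({
    'name': ['Amy', 'Berry', 'Cory'],
    'ice_cream_pref': ['Chocolate', 'Vanilla', None],
    'drinks_pref': ['Coke', None, None],
})

# Rows kept by dropna with a threshold of 2 non-missing values.
kept_by_thresh = food_pref_df.dropna(thresh=2)

# The same filter written out: count non-missing values per row,
# then keep the rows that meet the threshold.
non_missing_counts = food_pref_df.notna().sum(axis=1)
kept_by_mask = food_pref_df[non_missing_counts >= 2]

print(non_missing_counts.tolist())          # [3, 2, 1]
print(kept_by_thresh.equals(kept_by_mask))  # True
```

Cory's row has only one non-missing value (the name), so it falls below the threshold and is dropped.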