Selecting and Filtering Data

February 17, 2018

Selecting and Filtering Data

Being able to select and filter data easily is probably the most important feature in Pandas. We begin by demonstrating how this works for Series objects, followed by DataFrames.


Import Libraries

import pandas as pd
import seaborn as sns # to retrieve the titanic dataset


Load Data

names = ['Amy', 'Berry', 'Cory', 'Dewy', 'Eeewy']
flavors = ['chocolate', 'vanilla', 'strawberry', 'vanilla', 'strawberry']
preferences = pd.Series(data=flavors, index=names)
preferences
Amy       chocolate
Berry       vanilla
Cory     strawberry
Dewy        vanilla
Eeewy    strawberry
dtype: object


Data in Pandas can be assessed either by label, or by integer (row/column numbers).


Selection by Integer

The iloc operator allows us to select the values by integer (row or column numbers).

print(preferences.iloc[0])
chocolate


Series objects also allow for Python indexing and slicing.

print(preferences[0])
chocolate
print(preferences[0:2])
Amy      chocolate
Berry      vanilla
dtype: object


Selection by Label

loc allows us to select using an identifier, or a list of identifiers.

print(preferences.loc['Amy'])
chocolate
print(preferences.loc[['Amy', 'Berry']])
Amy      chocolate
Berry      vanilla
dtype: object


Series Filtering

There are multiple ways that we can filter data in a Series.


The first is to place the criteria (a string that evaluates to True/False) within square brackets.

preferences[preferences=='chocolate']
Amy    chocolate
dtype: object


Boolean operators (such as & and |) can also be used.

preferences[(preferences=='chocolate') | (preferences=='vanilla')]
Amy      chocolate
Berry      vanilla
Dewy       vanilla
dtype: object


The filter function also allows us to select via the labels.

preferences.filter(items=['Amy'])
Amy    chocolate
dtype: object


DataFrame Selection and Filtering

There are strong parallels between how selection and filtering is performed for Series and DataFrames, and expectedly it is slightly more involved for DataFrames due to their 2d nature.


Load DataFrame Data

titanic = sns.load_dataset('titanic')
cols_of_interest = ['survived', 'pclass', 'sex', 'age', 'sibsp']
titanic[cols_of_interest].head()
survived pclass sex age sibsp
0 0 3 male 22.0 1
1 1 1 female 38.0 1
2 1 3 female 26.0 0
3 1 1 female 35.0 1
4 0 3 male 35.0 0


DataFrame Selection

You can select a column by placing its identifier in square brackets. Multiple columns can be selected using a list of identifiers.

titanic[['sex']].head()
sex
0 male
1 female
2 female
3 female
4 male
titanic[['survived', 'sex']].head()
survived sex
0 0 male
1 1 female
2 1 female
3 1 female
4 0 male


You can also select columns using the loc function. For a DataFrame, the first parameter of loc references the rows, while the second references the columns.

# retrieve all rows
titanic.loc[:, ['survived', 'sex']].head()
survived sex
0 0 male
1 1 female
2 1 female
3 1 female
4 0 male
# retrieve only the first 2 rows
titanic.loc[0:2, ['survived', 'sex']]
survived sex
0 0 male
1 1 female
2 1 female


DataFrame Filtering

Filtering a DataFrame is very similar to how you would filter a Series - simply place the criterion/criteria between square brackets.

titanic[titanic['sex']=='male'][cols_of_interest].head(1)
survived pclass sex age sibsp
0 0 3 male 22.0 1


You can also access columns using . if you do not wish to use the square brackets.

titanic[titanic.sex=='male'][cols_of_interest].head(1)
survived pclass sex age sibsp
0 0 3 male 22.0 1


You can filter on multiple criteria by chaining them with boolean operators.

titanic[(titanic['sex']=='male') & (titanic['pclass']==3)][cols_of_interest].head(1)
survived pclass sex age sibsp
0 0 3 male 22.0 1
titanic[(titanic.sex=='male') & (titanic.pclass==3)][cols_of_interest].head(1)
survived pclass sex age sibsp
0 0 3 male 22.0 1
comments powered by Disqus