Selecting and Filtering Data
February 17, 2018
Selecting and Filtering Data
Being able to select and filter data easily is probably the most important feature in Pandas. We begin by demonstrating how this works for Series objects, followed by DataFrames.
Import Libraries
import pandas as pd
import seaborn as sns # to retrieve the titanic dataset
Load Data
names = ['Amy', 'Berry', 'Cory', 'Dewy', 'Eeewy']
flavors = ['chocolate', 'vanilla', 'strawberry', 'vanilla', 'strawberry']
preferences = pd.Series(data=flavors, index=names)
preferences
Amy chocolate
Berry vanilla
Cory strawberry
Dewy vanilla
Eeewy strawberry
dtype: object
Data in Pandas can be assessed either by label, or by integer (row/column numbers).
Selection by Integer
The iloc
operator allows us to select the values by integer (row or column numbers).
print(preferences.iloc[0])
chocolate
Series
objects also allow for Python indexing and slicing.
print(preferences[0])
chocolate
print(preferences[0:2])
Amy chocolate
Berry vanilla
dtype: object
Selection by Label
loc
allows us to select using an identifier, or a list of identifiers.
print(preferences.loc['Amy'])
chocolate
print(preferences.loc[['Amy', 'Berry']])
Amy chocolate
Berry vanilla
dtype: object
Series Filtering
There are multiple ways that we can filter data in a Series.
The first is to place the criteria (a string that evaluates to True
/False
) within square brackets.
preferences[preferences=='chocolate']
Amy chocolate
dtype: object
Boolean operators (such as &
and |
) can also be used.
preferences[(preferences=='chocolate') | (preferences=='vanilla')]
Amy chocolate
Berry vanilla
Dewy vanilla
dtype: object
The filter
function also allows us to select via the labels.
preferences.filter(items=['Amy'])
Amy chocolate
dtype: object
DataFrame Selection and Filtering
There are strong parallels between how selection and filtering is performed for Series and DataFrames, and expectedly it is slightly more involved for DataFrames due to their 2d nature.
Load DataFrame Data
titanic = sns.load_dataset('titanic')
cols_of_interest = ['survived', 'pclass', 'sex', 'age', 'sibsp']
titanic[cols_of_interest].head()
survived | pclass | sex | age | sibsp | |
---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 |
1 | 1 | 1 | female | 38.0 | 1 |
2 | 1 | 3 | female | 26.0 | 0 |
3 | 1 | 1 | female | 35.0 | 1 |
4 | 0 | 3 | male | 35.0 | 0 |
DataFrame Selection
You can select a column by placing its identifier in square brackets. Multiple columns can be selected using a list of identifiers.
titanic[['sex']].head()
sex | |
---|---|
0 | male |
1 | female |
2 | female |
3 | female |
4 | male |
titanic[['survived', 'sex']].head()
survived | sex | |
---|---|---|
0 | 0 | male |
1 | 1 | female |
2 | 1 | female |
3 | 1 | female |
4 | 0 | male |
You can also select columns using the loc
function. For a DataFrame, the first parameter of loc
references the rows, while the second references the columns.
# retrieve all rows
titanic.loc[:, ['survived', 'sex']].head()
survived | sex | |
---|---|---|
0 | 0 | male |
1 | 1 | female |
2 | 1 | female |
3 | 1 | female |
4 | 0 | male |
# retrieve only the first 2 rows
titanic.loc[0:2, ['survived', 'sex']]
survived | sex | |
---|---|---|
0 | 0 | male |
1 | 1 | female |
2 | 1 | female |
DataFrame Filtering
Filtering a DataFrame is very similar to how you would filter a Series - simply place the criterion/criteria between square brackets.
titanic[titanic['sex']=='male'][cols_of_interest].head(1)
survived | pclass | sex | age | sibsp | |
---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 |
You can also access columns using .
if you do not wish to use the square brackets.
titanic[titanic.sex=='male'][cols_of_interest].head(1)
survived | pclass | sex | age | sibsp | |
---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 |
You can filter on multiple criteria by chaining them with boolean operators.
titanic[(titanic['sex']=='male') & (titanic['pclass']==3)][cols_of_interest].head(1)
survived | pclass | sex | age | sibsp | |
---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 |
titanic[(titanic.sex=='male') & (titanic.pclass==3)][cols_of_interest].head(1)
survived | pclass | sex | age | sibsp | |
---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 |