Applying Functions to Columns or Rows
February 9, 2018
When tackling a machine learning problem, we often find that the features we are given can be combined into new features that improve the model. This process is known as feature engineering, and Pandas comes with several built-in functions for this purpose.
Load Libraries
import pandas as pd
import numpy as np
import seaborn as sns # to retrieve the titanic dataset
Load Data
The titanic dataset contains information about the passengers aboard the ill-fated ship, along with whether they survived.
titanic = sns.load_dataset('titanic')
cols_of_interest = ['survived', 'pclass', 'sex', 'age', 'who', 'alive']
titanic = titanic.loc[:, cols_of_interest]
titanic.head()
| | survived | pclass | sex | age | who | alive |
---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | man | no |
1 | 1 | 1 | female | 38.0 | woman | yes |
2 | 1 | 3 | female | 26.0 | woman | yes |
3 | 1 | 1 | female | 35.0 | woman | yes |
4 | 0 | 3 | male | 35.0 | man | no |
Map
The map function, as the name suggests, maps each input value to an output value. A common use is to replace the values in a Series.
We will demonstrate by recreating the alive column, using the survived column as its input.
The groupby below shows how the two columns relate: each survived value pairs with exactly one alive value.
list(titanic.groupby(['survived', 'alive']).groups)
[(0, 'no'), (1, 'yes')]
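Another way to check such a relationship is a cross-tabulation, which also shows how many rows fall into each pairing. The following is a minimal sketch on a toy frame whose values are invented for illustration; on the real data the same call would take titanic['survived'] and titanic['alive'].

```python
import pandas as pd

# Toy stand-in for the two titanic columns (values invented for illustration).
toy = pd.DataFrame({
    'survived': [0, 1, 1, 0, 1],
    'alive': ['no', 'yes', 'yes', 'no', 'yes'],
})

# Rows are survived values, columns are alive values, cells are row counts.
counts = pd.crosstab(toy['survived'], toy['alive'])
print(counts)

# If every survived code pairs with a single alive label,
# each row of the table has exactly one non-zero cell.
each_code_has_one_label = bool(((counts > 0).sum(axis=1) == 1).all())
print(each_code_has_one_label)
```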
alive_map = {0: 'no', 1: 'yes'}
titanic = titanic.drop('alive', axis=1)
titanic.head()
| | survived | pclass | sex | age | who |
---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | man |
1 | 1 | 1 | female | 38.0 | woman |
2 | 1 | 3 | female | 26.0 | woman |
3 | 1 | 1 | female | 35.0 | woman |
4 | 0 | 3 | male | 35.0 | man |
titanic['alive'] = titanic['survived'].map(alive_map)
titanic[['survived', 'alive']].head()
| | survived | alive |
---|---|---|
0 | 0 | no |
1 | 1 | yes |
2 | 1 | yes |
3 | 1 | yes |
4 | 0 | no |
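One detail worth knowing when mapping with a dictionary is that values without a matching key become NaN rather than raising an error. The sketch below illustrates this on a small Series; the code 2 is a hypothetical value that does not occur in the titanic data. It also shows that map accepts a function in place of a dictionary.

```python
import pandas as pd

# 2 is a hypothetical code with no entry in the dictionary.
codes = pd.Series([0, 1, 1, 2])
alive_map = {0: 'no', 1: 'yes'}

# Dictionary lookup: values missing from the dictionary come back as NaN.
mapped = codes.map(alive_map)
n_unmapped = int(mapped.isna().sum())
print(mapped)
print(n_unmapped)

# map also accepts a function, which is called once per element.
mapped_fn = codes.map(lambda code: 'yes' if code == 1 else 'no')
print(mapped_fn)
```

Counting the NaN values after a dictionary-based map is a cheap way to confirm that the dictionary covers every value in the column.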
Apply
A closely related function is apply. As the name suggests, it accepts a function and applies it to each row or column. We will demonstrate it by recreating the who column.
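The axis argument decides what the function receives. As a rough illustration, the sketch below uses a toy frame (column names a and b are made up) to contrast the default axis=0, where each column is passed in as a Series, with axis=1, where each row is passed in.

```python
import pandas as pd

# Toy frame; the column names are placeholders for illustration.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series.
col_ranges = toy.apply(lambda col: col.max() - col.min())
print(col_ranges)

# axis=1: the function receives each row as a Series indexed by column name.
row_sums = toy.apply(lambda row: row['a'] + row['b'], axis=1)
print(row_sums)
```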
From the following groupby statement, and from the maximum age of a ‘child’, we can derive the logic used to create this column.
list(titanic.groupby(['sex', 'who']).groups)
[('female', 'child'), ('female', 'woman'), ('male', 'child'), ('male', 'man')]
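The age cut-off can be read off by aggregating age within each who group. The snippet below is a sketch on a toy frame with invented ages; on the real data the equivalent call would be titanic.groupby('who')['age'].agg(['min', 'max']).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the who and age columns (values invented for illustration).
toy = pd.DataFrame({
    'who': ['child', 'child', 'man', 'woman', 'man'],
    'age': [4.0, 15.0, 22.0, 38.0, np.nan],
})

# Minimum and maximum age per group; missing ages are skipped.
age_range = toy.groupby('who')['age'].agg(['min', 'max'])
print(age_range)
```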
We summarise this relationship in the table below.
Sex | Age | ‘Who’ |
---|---|---|
female | <=15 | child |
female | >15 | woman |
male | <=15 | child |
male | >15 | man |
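def generate_who(row):
    # Passengers aged 15 or under are children regardless of sex.
    # A missing age (NaN) fails this comparison and falls through to the sex check.
    if row['age'] <= 15:
        return 'child'
    elif row['sex'] == 'male':
        return 'man'
    else:
        return 'woman'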
titanic['who_recreated'] = titanic.apply(generate_who, axis=1)
titanic[['who', 'who_recreated']].head(10)
| | who | who_recreated |
---|---|---|
0 | man | man |
1 | woman | woman |
2 | woman | woman |
3 | woman | woman |
4 | man | man |
5 | man | man |
6 | man | man |
7 | child | child |
8 | woman | woman |
9 | child | child |
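Comparing the first ten rows by eye is only a spot check. A fuller check compares the whole column, and the same rule can also be written without a row-wise apply. The sketch below uses np.select as one possible vectorised alternative (this is a swapped-in technique, not the approach used above) on a toy frame with invented values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the relevant titanic columns (values invented for illustration).
toy = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female', 'male'],
    'age': [22.0, 38.0, 2.0, 14.0, np.nan],
    'who': ['man', 'woman', 'child', 'child', 'man'],
})

# Conditions are tested in order and the first match wins,
# mirroring the if / elif / else chain of a row-wise function.
conditions = [toy['age'] <= 15, toy['sex'] == 'male']
choices = ['child', 'man']
toy['who_vectorised'] = np.select(conditions, choices, default='woman')

# Compare every row instead of inspecting only the head.
all_match = bool((toy['who'] == toy['who_vectorised']).all())
print(all_match)
```

On large frames a vectorised expression like this typically runs faster than apply with axis=1, because the latter calls a Python function once per row.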