Applying Functions to Columns or Rows

February 9, 2018

When tackling a machine learning problem, we often find that the features we are given can be combined into new features that improve the model. This process is known as feature engineering, and Pandas provides several built-in functions for it.


Load Libraries

import pandas as pd
import numpy as np
import seaborn as sns # to retrieve the titanic dataset


Load Data

The titanic dataset contains information about the passengers aboard the ill-fated ship, along with whether they survived.

titanic = sns.load_dataset('titanic')
cols_of_interest = ['survived', 'pclass', 'sex', 'age', 'who', 'alive']
titanic = titanic.loc[:, cols_of_interest]
titanic.head()
   survived  pclass     sex   age    who  alive
0         0       3    male  22.0    man     no
1         1       1  female  38.0  woman    yes
2         1       3  female  26.0  woman    yes
3         1       1  female  35.0  woman    yes
4         0       3    male  35.0    man     no


Map

The map function does what its name suggests: it maps each input value to an output value. A common use is to change the values in a Series.

We will demonstrate by recreating the alive column, which uses the survived column as its input.

The groupby below shows how the two columns relate: a survived value of 0 always pairs with 'no', and 1 always pairs with 'yes'.

list(titanic.groupby(['survived', 'alive']).groups)
[(0, 'no'), (1, 'yes')]
alive_map = {0: 'no', 1: 'yes'}
titanic = titanic.drop('alive', axis=1)
titanic.head()
   survived  pclass     sex   age    who
0         0       3    male  22.0    man
1         1       1  female  38.0  woman
2         1       3  female  26.0  woman
3         1       1  female  35.0  woman
4         0       3    male  35.0    man
titanic['alive'] = titanic['survived'].map(alive_map)
titanic[['survived', 'alive']].head()
   survived  alive
0         0     no
1         1    yes
2         1    yes
3         1    yes
4         0     no
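
Two details of map are worth a quick sketch: values that have no key in the dictionary become NaN, and map also accepts a function in place of a dictionary. The Series below is a toy stand-in for titanic['survived'], and the value 2 is a made-up code added only to show the unmapped case.

```python
import pandas as pd

# Toy stand-in for titanic['survived']; the 2 is a hypothetical unmapped code.
survived = pd.Series([0, 1, 1, 0, 2])

alive_map = {0: 'no', 1: 'yes'}

# Values missing from the dictionary (here, 2) are mapped to NaN.
alive_from_dict = survived.map(alive_map)

# map also accepts a function, which is called once per element.
alive_from_func = survived.map(lambda s: 'yes' if s == 1 else 'no')

print(alive_from_dict.tolist())
print(alive_from_func.tolist())
```

Checking the mapped column with isna() after a dictionary-based map is a cheap way to catch categories the dictionary forgot.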


Apply

A closely related function is apply. As the name suggests, it accepts a function and applies it to each row or each column of a DataFrame, depending on the axis argument (axis=1 passes rows, axis=0 passes columns). We will demonstrate it by recreating the who column.
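
As a brief aside before recreating the column, here is a minimal sketch of how the axis argument changes what apply passes to the function. The DataFrame and its column names a and b are made up for illustration.

```python
import pandas as pd

# Hypothetical numeric frame, used only to show the effect of axis.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series.
col_totals = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series.
row_totals = df.apply(lambda row: row.sum(), axis=1)

print(col_totals)
print(row_totals)
```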

By combining the following groupby statement with the maximum age of a ‘child’, we can derive the logic used to create this column.

list(titanic.groupby(['sex', 'who']).groups)
[('female', 'child'), ('female', 'woman'), ('male', 'child'), ('male', 'man')]
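
The age cut-off itself can be read off by aggregating age within each who category. The sketch below uses a handful of hypothetical rows rather than the real dataset, so the numbers it prints are illustrative only; the same groupby call could be run on titanic.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female', 'female'],
    'age': [4.0, 15.0, 22.0, 38.0, 16.0],
    'who': ['child', 'child', 'man', 'woman', 'woman'],
})

# Minimum and maximum age within each 'who' category.
age_range = toy.groupby('who')['age'].agg(['min', 'max'])
print(age_range)
```

The largest age among children and the smallest age among adults bracket the threshold.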


We summarise this relationship in the table below.

Sex     Age    ‘Who’
female  <=15   child
female  >15    woman
male    <=15   child
male    >15    man
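def generate_who(row):
    # Passengers aged 15 or under are children, regardless of sex.
    if row['age'] <= 15:
        return 'child'
    # A missing age (NaN) fails the comparison above, so such rows are treated as adults.
    if row['sex'] == 'male':
        return 'man'
    return 'woman'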

titanic['who_recreated'] = titanic.apply(generate_who, axis=1)
titanic[['who', 'who_recreated']].head(10)
     who  who_recreated
0    man            man
1  woman          woman
2  woman          woman
3  woman          woman
4    man            man
5    man            man
6    man            man
7  child          child
8  woman          woman
9  child          child
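
Row-wise apply calls a Python function once per row, which can be slow on large frames. As an alternative sketch (a different technique, not the approach used above), the same rule can be written with NumPy's np.where and checked against the row-wise result. The toy rows are hypothetical, and generate_who is repeated here so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including a missing age, used only for illustration.
toy = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female', 'male'],
    'age': [22.0, 38.0, 4.0, 14.0, np.nan],
})

def generate_who(row):
    # Same rule as above: 15 or under is a child, otherwise split by sex.
    if row['age'] <= 15:
        return 'child'
    return 'man' if row['sex'] == 'male' else 'woman'

row_wise = toy.apply(generate_who, axis=1)

# Vectorised version: pick the adult label by sex, then overwrite children.
adult = np.where(toy['sex'] == 'male', 'man', 'woman')
vectorised = pd.Series(np.where(toy['age'] <= 15, 'child', adult), index=toy.index)

# Both approaches should agree on every row.
print((row_wise == vectorised).all())
```

On the real data, the analogous check would compare titanic['who'] with titanic['who_recreated'] across all rows rather than only the first ten.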