Applying Functions to Columns or Rows

February 9, 2018

When tackling a machine learning problem, we often find that the features we are given can be combined into new features that improve the model. This process is known as feature engineering, and pandas provides several built-in functions for it.

Load Libraries

import pandas as pd
import numpy as np
import seaborn as sns # to retrieve the titanic dataset

Load Data

The Titanic dataset contains information about the passengers aboard the ill-fated ship, including whether they survived.

titanic = sns.load_dataset('titanic')
cols_of_interest = ['survived', 'pclass', 'sex', 'age', 'who', 'alive']
titanic = titanic.loc[:, cols_of_interest]
titanic.head()

	survived	pclass	sex	age	who	alive
0	0	3	male	22.0	man	no
1	1	1	female	38.0	woman	yes
2	1	3	female	26.0	woman	yes
3	1	1	female	35.0	woman	yes
4	0	3	male	35.0	man	no

Map

The map method, as its name suggests, maps each value in a Series to a new value, using a dictionary, another Series, or a function. A common use is to recode the values in a Series.

We will demonstrate by recreating the alive column from the survived column.

The groupby below shows how the two columns are related: each value of survived pairs with exactly one value of alive.

list(titanic.groupby(['survived', 'alive']).groups)

[(0, 'no'), (1, 'yes')]
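The same relationship can be checked with counts rather than group keys. The following is a minimal sketch using pd.crosstab on a small hand-made DataFrame (the five rows are illustrative stand-ins for the real data, not taken from a fresh download); a one-to-one mapping shows up as zeros everywhere off the diagonal.

```python
import pandas as pd

# Illustrative stand-in for the two columns of interest
df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0],
    'alive': ['no', 'yes', 'yes', 'yes', 'no'],
})

# Count how often each (survived, alive) pair occurs
counts = pd.crosstab(df['survived'], df['alive'])
print(counts)
```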

alive_map = {0: 'no', 1: 'yes'}
titanic = titanic.drop('alive', axis=1)
titanic.head()

	survived	pclass	sex	age	who
0	0	3	male	22.0	man
1	1	1	female	38.0	woman
2	1	3	female	26.0	woman
3	1	1	female	35.0	woman
4	0	3	male	35.0	man

titanic['alive'] = titanic['survived'].map(alive_map)
titanic[['survived', 'alive']].head()

	survived	alive
0	0	no
1	1	yes
2	1	yes
3	1	yes
4	0	no
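One detail worth knowing when mapping with a dictionary is what happens to values that have no matching key. The sketch below uses a small made-up Series (the value 2 is deliberately added and does not occur in the real survived column) to contrast a dictionary mapping with a function mapping.

```python
import pandas as pd

# Made-up stand-in for titanic['survived']; the 2 is an intentional stray value
survived = pd.Series([0, 1, 1, 0, 2])

alive_map = {0: 'no', 1: 'yes'}

# With a dictionary, values without a matching key become NaN
mapped_with_dict = survived.map(alive_map)

# With a function, every element is passed through the function instead
mapped_with_func = survived.map(lambda s: 'yes' if s == 1 else 'no')

print(mapped_with_dict.tolist())
print(mapped_with_func.tolist())
```

If unmapped values should be caught rather than silently turned into NaN, checking mapped_with_dict.isna().sum() after the call is a simple safeguard.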

Apply

A closely related method is apply. As the name suggests, it accepts a function and applies it to each row or each column of a DataFrame, depending on the axis argument. We will demonstrate its functionality by recreating the who column.
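Whether the function sees rows or columns is controlled by the axis argument of DataFrame.apply. As a brief aside, here is a minimal sketch on made-up numbers (the column names a and b are purely illustrative) showing both directions.

```python
import pandas as pd

# Tiny illustrative frame; 'a' and 'b' are hypothetical column names
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)

print(col_ranges)
print(row_sums)
```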

From the following groupby statement, together with the maximum age of a passenger labelled ‘child’, we can derive the logic used to create this column.

list(titanic.groupby(['sex', 'who']).groups)

[('female', 'child'), ('female', 'woman'), ('male', 'child'), ('male', 'man')]

We summarise this relationship in the table below.

Sex	Age	‘Who’
female	<=15	child
female	>15	woman
male	<=15	child
male	>15	man

def generate_who(row):
    # A missing age (NaN) fails the comparison and falls through to the sex check
    if row['age'] <= 15:
        return 'child'
    elif row['sex'] == 'male':
        return 'man'
    else:
        return 'woman'

titanic['who_recreated'] = titanic.apply(generate_who, axis=1)
titanic[['who', 'who_recreated']].head(10)

	who	who_recreated
0	man	man
1	woman	woman
2	woman	woman
3	woman	woman
4	man	man
5	man	man
6	man	man
7	child	child
8	woman	woman
9	child	child
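Row-wise apply calls a Python function once per row, which can be slow on large frames. One possible alternative, sketched here under the same age and sex rules as the table above, is numpy's np.select, which evaluates the conditions on whole columns at once. The five-row DataFrame and the who_vectorised column name are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the sex and age columns, including a missing age
df = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male', 'male'],
    'age': [22.0, 38.0, 2.0, 14.0, np.nan],
})

# Conditions are checked in order; the first one that holds picks the label
conditions = [df['age'] <= 15, df['sex'] == 'male']
choices = ['child', 'man']

# Rows matching neither condition fall back to the default label
df['who_vectorised'] = np.select(conditions, choices, default='woman')

print(df)
```

As with the row-wise version, a missing age never satisfies the first condition, so that passenger is labelled by sex alone.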
