Data Discretization

February 25, 2018

One way to transform a continuous variable is to bin it. In general this is not a good idea because it discards information, but it can be useful when we expect a discontinuity in the response at certain points (e.g. the legal driving age).

We will demonstrate how Pandas performs data discretization, followed by an example of how to perform custom discretization.

Import Libraries

import numpy as np
import pandas as pd

Load Data

scores = [10, 15, 20, 25, 30, 
          60, 70, 80, 90, 100]

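The built-in Pandas function cut splits the range of the data into intervals of equal width.

In our example, if we ask for 3 discrete buckets, the function determines the range of each bucket by dividing the distance between the smallest and largest values (10 and 100) into three equal parts, giving edges at 10, 40, 70 and 100. With right=False each interval includes its left edge and excludes its right edge, so 70 falls in the last bucket; Pandas extends the final edge slightly so that the maximum value, 100, is still included.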

labels = ['low', 'medium', 'high']
pd.cut(scores, bins=3, right=False, labels=labels)

[low, low, low, low, low, medium, high, high, high, high]
Categories (3, object): [low < medium < high]
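
To check which edges cut actually chose, one option is to pass retbins=True, which returns the computed bin edges alongside the binned values. The sketch below reuses the scores and labels from above; the variable names binned and edges are only illustrative.

```python
import pandas as pd

scores = [10, 15, 20, 25, 30,
          60, 70, 80, 90, 100]
labels = ['low', 'medium', 'high']

# retbins=True makes cut return a (values, edges) pair
binned, edges = pd.cut(scores, bins=3, right=False,
                       labels=labels, retbins=True)

# Interior edges sit at equal steps of (100 - 10) / 3 = 30
print(edges)

# How many scores landed in each bucket
print(pd.Series(binned).value_counts(sort=False))
```

The last edge is printed slightly above 100 because, with right=False, the final interval would otherwise exclude the maximum value.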

Custom Discretization

What if we do not want the bucket edges to be derived from the smallest and largest values in the data, but instead wish to place values into pre-defined buckets? For instance, it is common to group ages in ranges of 10.

One way to do this is with a small custom function. Python's built-in round can round to the nearest ten, hundred or thousand, but here we always want to round down to the start of the bucket.

The trick is to make use of the floor or ceil function in Python’s math module: floor rounds a float down to the nearest integer, and ceil rounds it up.

import math

print('The floor of 1.2 is', math.floor(1.2))
print('The ceiling of 1.2 is', math.ceil(1.2))

The floor of 1.2 is 1
The ceiling of 1.2 is 2
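
The difference between rounding to the nearest ten and flooring to the start of a bucket can be seen on a single value. This is a small illustration using 27 as an arbitrary example age: round with a negative ndigits is Python's built-in, while the floor and ceil lines apply a divide, round, multiply pattern.

```python
import math

age = 27  # arbitrary example value

nearest = round(age, -1)              # nearest multiple of 10
floored = math.floor(age / 10) * 10   # start of the 10-wide bucket
ceiled = math.ceil(age / 10) * 10     # first multiple of 10 at or above age

print(nearest, floored, ceiled)  # 30 20 30
```

Only the floored value identifies the bucket that 27 belongs to, which is why flooring rather than rounding is used for the age groups.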

We will create the floor_ages function, which first divides an age by 10 to create a float, then floors that float, and finally multiplies the result by 10. For example, 27 becomes 2.7, then 2, then 20.

ages = pd.DataFrame(data={'age': [18, 27, 35, 42, 50, 69, 70, 81, 93]})

def floor_ages(age):
    # Round down to the start of the 10-year bucket, e.g. 27 -> 20
    return math.floor(age / 10) * 10

ages['age_floored'] = ages['age'].apply(floor_ages)
print(ages)

   age  age_floored
0   18           10
1   27           20
2   35           30
3   42           40
4   50           50
5   69           60
6   70           70
7   81           80
8   93           90
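
For larger DataFrames, the same flooring can be written without apply by using floor division on the whole column. The helper below, floor_to_width, is a hypothetical name, and its width parameter is an assumption added so that the bucket size (tens, hundreds, thousands) can be varied.

```python
import pandas as pd

ages = pd.DataFrame(data={'age': [18, 27, 35, 42, 50, 69, 70, 81, 93]})

def floor_to_width(values, width=10):
    # Hypothetical helper: // drops the remainder, * width restores the scale
    return values // width * width

ages['age_floored'] = floor_to_width(ages['age'])
print(ages)
```

Calling floor_to_width(ages['age'], width=100) would group the same values into buckets of one hundred instead.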

We will create a dictionary that will map our age_floored values to our desired string representations.

age_category_map = {10: '10 to 19', 20: '20 to 29', 30: '30 to 39',
                    40: '40 to 49', 50: '50 to 59', 60: '60 to 69',
                    70: '70 to 79', 80: '80 to 89', 90: '90 to 99'}
ages['category'] = ages['age_floored'].map(age_category_map)
print(ages.head())

   age  age_floored  category
0   18           10  10 to 19
1   27           20  20 to 29
2   35           30  30 to 39
3   42           40  40 to 49
4   50           50  50 to 59
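
As an alternative to the floor-and-map approach, pd.cut also accepts an explicit list of bin edges, which may be a shorter route to the same labels. The following is a sketch under the assumption that all ages lie between 10 and 99; the names edges and bucket_labels are illustrative, and values outside the edges come back as NaN.

```python
import pandas as pd

ages = pd.DataFrame(data={'age': [18, 27, 35, 42, 50, 69, 70, 81, 93]})

# Edges 10, 20, ..., 100; right=False makes each bucket [low, low + 10)
edges = list(range(10, 101, 10))
bucket_labels = [f'{low} to {low + 9}' for low in edges[:-1]]

ages['category'] = pd.cut(ages['age'], bins=edges,
                          right=False, labels=bucket_labels)
print(ages)
```

Unlike the dictionary lookup, this produces an ordered categorical column, so the buckets sort in age order rather than alphabetically.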
