Data Discretization
February 25, 2018
One way to transform a continuous variable is to bin it. In general this is not a good idea because binning discards information, but it can be useful when we expect a discontinuity in the response at certain points (e.g., the legal driving age).
We will demonstrate how Pandas performs data discretization, followed by an example of how to perform custom discretization.
Import Libraries
import numpy as np
import pandas as pd
Load Data
scores = [10, 15, 20, 25, 30,
60, 70, 80, 90, 100]
The built-in Pandas function cut splits the range of the data into bins of equal width.
In our example, if we wish to have 3 discrete buckets, the function determines the range of each bucket by dividing the distance between the smallest and largest values into three equal parts. Setting right=False makes each bucket include its left edge and exclude its right edge.
labels = ['low', 'medium', 'high']
pd.cut(scores, bins=3, right=False, labels=labels)
[low, low, low, low, low, medium, high, high, high, high]
Categories (3, object): [low < medium < high]
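To see the bucket boundaries that cut computed, one option is to request them with retbins=True. The sketch below assumes the same scores and labels as above; the names binned and edges are illustrative.

```python
import numpy as np
import pandas as pd

scores = [10, 15, 20, 25, 30,
          60, 70, 80, 90, 100]
labels = ['low', 'medium', 'high']

# retbins=True returns the bin edges alongside the binned values
binned, edges = pd.cut(scores, bins=3, right=False,
                       labels=labels, retbins=True)

# Interior edges split the span from 10 to 100 into three equal widths;
# pandas nudges the outermost edge slightly so the maximum value is included
print(edges)
print(list(binned))
```

The interior edges fall at 40 and 70, which is why 60 is the only score labelled medium.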
Custom Discretization
What if we do not want ranges of equal widths but instead wish to place values into pre-defined buckets? For instance, it is common to group ages in ranges of 10.
One way to do this is with a custom function that rounds each value down to the start of its bucket. Python's built-in round can round to the nearest ten (round(27, -1) returns 30), but it rounds to the nearest multiple rather than down.
The trick is to make use of the floor function (or its counterpart ceil) in Python’s math module, which round a float down or up to the nearest integer.
import math
print('The floor of 1.2 is', math.floor(1.2))
print('The ceiling of 1.2 is', math.ceil(1.2))
The floor of 1.2 is 1
The ceiling of 1.2 is 2
We will create the floor_ages function, which first divides each age by 10 to create a float. The float is then floored, and lastly the result is multiplied by 10.
ages = pd.DataFrame(data={'age': [18, 27, 35, 42, 50, 69, 70, 81, 93]})
def floor_ages(age):
    # Round an age down to the nearest multiple of 10
    return math.floor(age / 10) * 10
ages['age_floored'] = ages['age'].apply(floor_ages)
print(ages)
   age  age_floored
0   18           10
1   27           20
2   35           30
3   42           40
4   50           50
5   69           60
6   70           70
7   81           80
8   93           90
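The same flooring can likely be done without apply by using integer floor division on the whole column. This is a sketch of an alternative, not the approach used above; the column name age_floored_vec is made up for illustration.

```python
import math
import pandas as pd

ages = pd.DataFrame(data={'age': [18, 27, 35, 42, 50, 69, 70, 81, 93]})

# Floor division (//) drops the remainder, so 27 // 10 is 2 and 2 * 10 is 20
ages['age_floored_vec'] = ages['age'] // 10 * 10

# Compare against the math.floor version applied value by value
expected = ages['age'].apply(lambda age: math.floor(age / 10) * 10)
print((ages['age_floored_vec'] == expected).all())
```

Working on the whole column at once avoids calling a Python function for every row, which tends to matter more as the dataset grows.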
We will create a dictionary that maps our age_floored values to the desired string representations.
age_category_map = {10: '10 to 19', 20: '20 to 29', 30: '30 to 39',
40: '40 to 49', 50: '50 to 59', 60: '60 to 69',
70: '70 to 79', 80: '80 to 89', 90: '90 to 100'}
ages['category'] = ages['age_floored'].map(age_category_map)
print(ages.head())
   age  age_floored  category
0   18           10  10 to 19
1   27           20  20 to 29
2   35           30  30 to 39
3   42           40  40 to 49
4   50           50  50 to 59
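For comparison, cut also accepts an explicit list of bin edges, which may be a shorter route to the same categories. The sketch below assumes the top bucket should include 100, matching the '90 to 100' label; the names edges, bucket_labels and category_cut are illustrative.

```python
import pandas as pd

ages = pd.DataFrame(data={'age': [18, 27, 35, 42, 50, 69, 70, 81, 93]})

# Edges 10, 20, ..., 90, then 101 so that the last bucket covers 90 to 100
edges = list(range(10, 100, 10)) + [101]
bucket_labels = ['10 to 19', '20 to 29', '30 to 39',
                 '40 to 49', '50 to 59', '60 to 69',
                 '70 to 79', '80 to 89', '90 to 100']

# right=False makes each bucket include its lower edge: [10, 20), [20, 30), ...
ages['category_cut'] = pd.cut(ages['age'], bins=edges,
                              right=False, labels=bucket_labels)
print(ages)
```

Ages that fall outside the listed edges (below 10 or above 100 here) come back as NaN rather than raising an error.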