Basics of the Pandas GroupBy Object
February 21, 2018
Basics of the Pandas GroupBy Object
Before we can discuss how Pandas performs aggregations, we have to first understand the properties of the GroupBy object that is created when we call the groupby
function.
Graphically, the GroupBy object represents the objects listed in the “Split” step, before any functions are applied to the groups.
Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns # to retrieve the tips dataset
Import Data
We will be exploring the GroupBy object using the “tips” dataset found in seaborn
. The tips dataset contains information about how much people tipped, as well as their gender and whether they are smokers.
tips = sns.load_dataset('tips')
tips.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
When the groupby
function is called, no splitting occurs until it is truly necessary. It only verifies that a valid mapping has been passed to the function.
group_sex_smoker = tips.groupby(['sex', 'smoker'])
print(group_sex_smoker)
<pandas.core.groupby.DataFrameGroupBy object at 0x0000025B92933DD8>
We can iterate the GroupBy object and print the label for each group. If we want to, we can also access the data of each group, but we will not be doing that here.
for group, data in group_sex_smoker:
print(group)
('Male', 'Yes')
('Male', 'No')
('Female', 'Yes')
('Female', 'No')
Optimised Aggregation Functions
Common aggregations can be performed using in-built, optimised implementations. Custom aggregations are also possible if you create and pass a custom function.
count
returns the number of records in each group.
print(group_sex_smoker.count())
total_bill tip day time size
sex smoker
Male Yes 60 60 60 60 60
No 97 97 97 97 97
Female Yes 33 33 33 33 33
No 54 54 54 54 54
sum
adds the values of each column and returns the total. This will not be performed for non-numeric columns.
print(group_sex_smoker.sum())
total_bill tip size
sex smoker
Male Yes 1337.07 183.07 150
No 1919.75 302.00 263
Female Yes 593.27 96.74 74
No 977.68 149.77 140
mean
calculates the average of each column for each group. This will not be performed for non-numeric columns.
print(group_sex_smoker.mean())
total_bill tip size
sex smoker
Male Yes 22.284500 3.051167 2.500000
No 19.791237 3.113402 2.711340
Female Yes 17.977879 2.931515 2.242424
No 18.105185 2.773519 2.592593
median
returns the value at the 50th percentile of each group. This will not be performed for non-numeric columns.
print(group_sex_smoker.median())
total_bill tip size
sex smoker
Male Yes 20.39 3.00 2
No 18.24 2.74 2
Female Yes 16.27 2.88 2
No 16.69 2.68 2
min
and max
returns the minimum/maximum value of each group. This will not be performed for non-numeric columns.
print('Minimum values for each group.\n')
print(group_sex_smoker.min())
print('\n')
print('Maximum values for each group.\n')
print(group_sex_smoker.max())
Minimum values for each group.
size tip total_bill
sex smoker
Male Yes 1 1.00 7.25
No 2 1.25 7.51
Female Yes 1 1.00 3.07
No 1 1.00 7.25
Maximum values for each group.
size tip total_bill
sex smoker
Male Yes 5 10.0 50.81
No 6 9.0 48.33
Female Yes 4 6.5 44.30
No 6 5.2 35.83
prod
multiplies the values in each group and returns the result. This will not be performed for non-numeric values.
print(group_sex_smoker.prod())
total_bill tip size
sex smoker
Male Yes 2.173399e+78 2.860617e+26 3.151799e+22
No 1.413952e+122 4.609979e+43 4.604192e+39
Female Yes 4.250240e+39 1.400379e+14 1.304596e+11
No 1.458771e+66 9.340368e+21 4.986245e+20
first
returns, literally the first value of each group.
print(group_sex_smoker.first())
total_bill tip day time size
sex smoker
Male Yes 38.01 3.00 Sat Dinner 4
No 10.34 1.66 Sun Dinner 3
Female Yes 3.07 1.00 Sat Dinner 1
No 16.99 1.01 Sun Dinner 2
Applying Multiple Functions
In SQL, it is common that we perform multiple aggregation functions in a single query. For instance, it is common to run queries like the following, where the intention is to calculate the average tip per day, as well as retrieve the counts of each day in the dataset.
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
This is addressed in Pandas using the agg
function, where we can pass a dictionary of aggregation methods to the function which will then be applied onto the DataFrameGroupBy
object.
The following code snippet demonstrates how we can perform multiple aggregations on the DataFrame.
group_by_day = tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
group_by_day = group_by_day.rename(columns={'tip': 'avg_tip', 'day': 'day_count'})
print(group_by_day)
avg_tip day_count
day
Thur 2.771452 62
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76