Basics of the Pandas GroupBy Object

February 21, 2018

Basics of the Pandas GroupBy Object

Before we can discuss how Pandas performs aggregations, we have to first understand the properties of the GroupBy object that is created when we call the groupby function.

png

Graphically, the GroupBy object represents the objects listed in the “Split” step, before any functions are applied to the groups.


Import Libraries

import numpy as np
import pandas as pd
import seaborn as sns # to retrieve the tips dataset


Import Data

We will be exploring the GroupBy object using the “tips” dataset found in seaborn. The tips dataset contains information about how much people tipped, as well as their gender and whether they are smokers.

tips = sns.load_dataset('tips')
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4


When the groupby function is called, no splitting occurs until it is truly necessary. It only verifies that a valid mapping has been passed to the function.

group_sex_smoker = tips.groupby(['sex', 'smoker'])
print(group_sex_smoker)
<pandas.core.groupby.DataFrameGroupBy object at 0x0000025B92933DD8>


We can iterate the GroupBy object and print the label for each group. If we want to, we can also access the data of each group, but we will not be doing that here.

for group, data in group_sex_smoker:
    print(group)
('Male', 'Yes')
('Male', 'No')
('Female', 'Yes')
('Female', 'No')


Optimised Aggregation Functions

Common aggregations can be performed using in-built, optimised implementations. Custom aggregations are also possible if you create and pass a custom function.


count returns the number of records in each group.

print(group_sex_smoker.count())
               total_bill  tip  day  time  size
sex    smoker                                  
Male   Yes             60   60   60    60    60
       No              97   97   97    97    97
Female Yes             33   33   33    33    33
       No              54   54   54    54    54


sum adds the values of each column and returns the total. This will not be performed for non-numeric columns.

print(group_sex_smoker.sum())
               total_bill     tip  size
sex    smoker                          
Male   Yes        1337.07  183.07   150
       No         1919.75  302.00   263
Female Yes         593.27   96.74    74
       No          977.68  149.77   140


mean calculates the average of each column for each group. This will not be performed for non-numeric columns.

print(group_sex_smoker.mean())
               total_bill       tip      size
sex    smoker                                
Male   Yes      22.284500  3.051167  2.500000
       No       19.791237  3.113402  2.711340
Female Yes      17.977879  2.931515  2.242424
       No       18.105185  2.773519  2.592593


median returns the value at the 50th percentile of each group. This will not be performed for non-numeric columns.

print(group_sex_smoker.median())
               total_bill   tip  size
sex    smoker                        
Male   Yes          20.39  3.00     2
       No           18.24  2.74     2
Female Yes          16.27  2.88     2
       No           16.69  2.68     2


min and max returns the minimum/maximum value of each group. This will not be performed for non-numeric columns.

print('Minimum values for each group.\n')
print(group_sex_smoker.min())
print('\n')
print('Maximum values for each group.\n')
print(group_sex_smoker.max())
Minimum values for each group.

               size   tip  total_bill
sex    smoker                        
Male   Yes        1  1.00        7.25
       No         2  1.25        7.51
Female Yes        1  1.00        3.07
       No         1  1.00        7.25


Maximum values for each group.

               size   tip  total_bill
sex    smoker                        
Male   Yes        5  10.0       50.81
       No         6   9.0       48.33
Female Yes        4   6.5       44.30
       No         6   5.2       35.83


prod multiplies the values in each group and returns the result. This will not be performed for non-numeric values.

print(group_sex_smoker.prod())
                  total_bill           tip          size
sex    smoker                                           
Male   Yes      2.173399e+78  2.860617e+26  3.151799e+22
       No      1.413952e+122  4.609979e+43  4.604192e+39
Female Yes      4.250240e+39  1.400379e+14  1.304596e+11
       No       1.458771e+66  9.340368e+21  4.986245e+20


first returns, literally the first value of each group.

print(group_sex_smoker.first())
               total_bill   tip  day    time  size
sex    smoker                                     
Male   Yes          38.01  3.00  Sat  Dinner     4
       No           10.34  1.66  Sun  Dinner     3
Female Yes           3.07  1.00  Sat  Dinner     1
       No           16.99  1.01  Sun  Dinner     2


Applying Multiple Functions

In SQL, it is common that we perform multiple aggregation functions in a single query. For instance, it is common to run queries like the following, where the intention is to calculate the average tip per day, as well as retrieve the counts of each day in the dataset.

SELECT day, AVG(tip), COUNT(*) 
FROM tips
GROUP BY day;


This is addressed in Pandas using the agg function, where we can pass a dictionary of aggregation methods to the function which will then be applied onto the DataFrameGroupBy object.

The following code snippet demonstrates how we can perform multiple aggregations on the DataFrame.

group_by_day = tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
group_by_day = group_by_day.rename(columns={'tip': 'avg_tip', 'day': 'day_count'})
print(group_by_day)
       avg_tip  day_count
day                      
Thur  2.771452         62
Fri   2.734737         19
Sat   2.993103         87
Sun   3.255132         76
comments powered by Disqus