Exploratory Data Analysis

March 1, 2018

Before we can build models that learn from data, we first need to understand the data's underlying structure and properties. This process is known as Exploratory Data Analysis (EDA). It was championed by the mathematician John Tukey to “encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments”. This article focuses on how data visualisation can aid EDA.

Import Libraries

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns # plotting; also provides the tips dataset

%matplotlib inline

Set Seaborn Visualisation Options

sns.set_style('whitegrid')
sns.set_palette('pastel')

Import Data

We will be exploring the “tips” dataset in seaborn, which contains data recorded by a waiter about the tips that he received.

tips = sns.load_dataset('tips')
tips.head(5)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

For a continuous variable such as total_bill or tip, a histogram succinctly captures the distribution by “bucketing” the values into discrete ranges and counting how many values fall into each range.

# sns.distplot is deprecated in newer seaborn releases; histplot (seaborn 0.11+) draws a histogram of counts
bill_hist = sns.histplot(tips['total_bill'])
bill_hist.set(xlabel='Total Bill', ylabel='Count')
bill_hist.set_ylabel(bill_hist.get_ylabel(), rotation=0, labelpad=25)
sns.despine()

[Figure: histogram of Total Bill (x-axis) against Count (y-axis)]
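
To make the “bucketing” concrete, here is a minimal sketch in plain Python that counts values per fixed-width bucket. The function name bucket_counts and the bucket width of 5 are assumptions chosen for illustration, and the input is only the five total_bill values from the preview above. Seaborn picks its own bin edges, so the plotted bars will not line up with these buckets exactly.

```python
from collections import Counter


def bucket_counts(values, bin_width):
    """Count how many values fall into each half-open bucket [start, start + bin_width)."""
    # Floor-divide to find the bucket a value belongs to, then key by the bucket's left edge.
    starts = (int(v // bin_width) * bin_width for v in values)
    return dict(sorted(Counter(starts).items()))


# The five total_bill values shown in tips.head(5).
bills = [16.99, 10.34, 21.01, 23.68, 24.59]

# A bucket width of 5 is an arbitrary choice for this illustration.
counts = bucket_counts(bills, 5)
print(counts)  # {10: 1, 15: 1, 20: 3}
```

Each key is the left edge of a bucket and each value is the height the corresponding bar would have.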

Discrete variables such as sex and smoker can be summarized using a bar chart.

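# name the index and the counts explicitly, so the result does not depend on the
# column names that value_counts() and reset_index() produce in a given pandas version
sex_counts = (tips['sex'].value_counts()
                         .rename_axis('gender')
                         .reset_index(name='count'))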
sex_counts

	gender	count
0	Male	157
1	Female	87

# to control the aspect ratio of the plot
fig, ax = plt.subplots(figsize=(2.5,5))

sex_counts_bar = sns.barplot(x=sex_counts['gender'], y=sex_counts['count'])
ax.set_xlabel('Gender')
ax.set_ylabel('Count', labelpad=25, rotation=0)

sns.despine()

[Figure: bar chart of Count by Gender]

To understand the relationship between two continuous variables, such as total_bill and tip, we can place their values on a scatter plot. Here, we can see that in general, as the total bill increases, the tip increases too. However, the points are spread widely around that trend, so the total bill alone does not fully explain the size of the tip.

bill_tip_scatter = sns.regplot(x=tips['total_bill'], y=tips['tip'], fit_reg=False)
bill_tip_scatter.set_xlabel('Total Bill')
bill_tip_scatter.set_ylabel('Tip', rotation=0, labelpad=15)
sns.despine()

[Figure: scatter plot of Tip against Total Bill]
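
One way to put a number on the trend in a scatter plot is the Pearson correlation coefficient, which is close to 1 when two variables rise together along a straight line and close to 0 when there is no linear relationship. The sketch below computes it by hand on the five rows from the preview above. The helper name pearson_r is an assumption for illustration, and five rows are far too few to say anything about the full dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (both must vary)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations, and sums of squared deviations for each variable.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# The five (total_bill, tip) rows shown in tips.head(5).
bills = [16.99, 10.34, 21.01, 23.68, 24.59]
tip_values = [1.01, 1.66, 3.50, 3.31, 3.61]

r = pearson_r(bills, tip_values)
print(round(r, 2))
```

On the full DataFrame, tips['total_bill'].corr(tips['tip']) computes the same coefficient with pandas.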

It would be reasonable to believe that the time of day affects the tips received. To confirm or refute this hunch, we can draw a box plot of the tips, split by the time of day.

tips['time'].unique()

[Dinner, Lunch]
Categories (2, object): [Dinner, Lunch]

# to control the aspect ratio
fig, ax = plt.subplots(figsize=(2.5, 6.5))

time_tip_box = sns.boxplot(x=tips['time'], y=tips['tip'])
time_tip_box.set_xlabel('Time')
time_tip_box.set_ylabel('Tip ($)', rotation=0, labelpad=20)
sns.despine()

[Figure: box plots of Tip ($) for Dinner and Lunch]
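
A box plot is a drawing of a few summary statistics: the box spans the first to the third quartile, the line inside marks the median, and with seaborn's default settings the whiskers reach the furthest points within 1.5 times the interquartile range of the box. The sketch below computes those numbers in plain Python for the five tip values in the preview above, all of which were recorded at Dinner. The helper name box_summary is an assumption for illustration, and the 'inclusive' quantile method it uses interpolates linearly between the sorted data points.

```python
from statistics import quantiles


def box_summary(values):
    """Return the quartiles and the 1.5 * IQR fences that a box plot is built from."""
    q1, median, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return {
        'q1': q1,
        'median': median,
        'q3': q3,
        # Points outside these fences would be drawn as individual outliers.
        'lower_fence': q1 - 1.5 * iqr,
        'upper_fence': q3 + 1.5 * iqr,
    }


# The five tip values shown in tips.head(5), all recorded at Dinner.
dinner_tips = [1.01, 1.66, 3.50, 3.31, 3.61]
summary = box_summary(dinner_tips)
print(summary)
```

To compare Lunch and Dinner as the plot does, the same function would be applied to the tips of each group separately.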

Last, but certainly not least, is the mighty line plot for exploring time-series data. For our example, we will look at Singapore’s population data.

year = [2005, 2006, 2007, 2008, 2009, 2010, 
        2011, 2012, 2013, 2014, 2015]
population = [4.2658, 4.4014, 4.5886, 4.8394, 4.9876, 5.0767, 
              5.1837, 5.3124, 5.3992, 5.4697, 5.535]

pop_line = sns.pointplot(x=year, y=population)
pop_line.set_xlabel('Year')
pop_line.set_ylabel('Population\n(millions)', rotation=0, labelpad=35)
sns.despine()

[Figure: line plot of Population (millions) by Year, 2005 to 2015]
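
A line plot shows the level of a series over time; computing the year-on-year percentage change is a complementary way to read the same data. Here is a minimal sketch using the figures listed above, where the helper name yearly_growth is an assumption for illustration:

```python
def yearly_growth(years, values):
    """Return (year, percent change from the previous year) pairs."""
    return [
        (year, (current - previous) / previous * 100)
        for year, previous, current in zip(years[1:], values, values[1:])
    ]


# Same figures as in the line plot above (population in millions).
year = [2005, 2006, 2007, 2008, 2009, 2010,
        2011, 2012, 2013, 2014, 2015]
population = [4.2658, 4.4014, 4.5886, 4.8394, 4.9876, 5.0767,
              5.1837, 5.3124, 5.3992, 5.4697, 5.535]

growth = yearly_growth(year, population)
for yr, pct in growth:
    print(f'{yr}: {pct:+.2f}%')
```

The first year has no predecessor, so the result starts at 2006 and has one entry fewer than the input.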
