Exploratory Data Analysis

March 1, 2018

Exploratory Data Analysis

Before we can build models that can learn from data, we first need to understand its underlying structure and properties. This process is known as Exploratory Data Analysis (EDA), and is championed by mathematician John Tukey to “encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments”. This article will focus on how data visualisation can aid EDA.


Import Libraries

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns # to retrieve the tips dataset

%matplotlib inline


Set Seaborn Visualisation Options

sns.set_style('whitegrid')
sns.set_palette('pastel')


Import Data

We will be exploring the “tips” dataset in seaborn, which contains data recorded by a waiter about the tips that he received.

tips = sns.load_dataset('tips')
tips.head(5)
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4


For continuous variables such as total_bill and tip, a histogram succinctly captures its distribution by “bucketing” the values into discrete ranges.

bill_hist = sns.distplot(tips['total_bill'], kde=False)
bill_hist.set(xlabel='Total Bill', ylabel='Count')
bill_hist.set_ylabel(bill_hist.get_ylabel(), rotation=0, labelpad=25)
sns.despine()

png


Discrete variables such as sex and smoker can be summarized using a bar chart.

sex_counts = tips['sex'].value_counts().reset_index()
sex_counts = sex_counts.rename(columns={'index': 'gender', 
                                        'sex': 'count'})
sex_counts
gender count
0 Male 157
1 Female 87
# to control the aspect ratio of the plot
fig, ax = plt.subplots(figsize=(2.5,5))

sex_counts_bar = sns.barplot(x=sex_counts['gender'], y=sex_counts['count'])
ax.set_xlabel('Gender')
ax.set_ylabel('Count', labelpad=25, rotation=0)

sns.despine()

png


To understand the relationship between two continuous variables, such as total_bill and tip, we can place their values on a scatter plot. Here, we can see that in general, as total bill increases, tip increases too. However, there is some variance that cannot be explained simply by these 2 variables.

bill_tip_scatter = sns.regplot(x=tips['total_bill'], y=tips['tip'], fit_reg=False)
bill_tip_scatter.set_xlabel('Total Bill')
bill_tip_scatter.set_ylabel('Tip', rotation=0, labelpad=15)
sns.despine()

png


It will certainly be reasonable to believe that the time of day has a direct impact on the tips received. To confirm or denounce our hunch, we can plot a boxplot for the tips, split by the time of day.

tips['time'].unique()
[Dinner, Lunch]
Categories (2, object): [Dinner, Lunch]
# to control the aspect ratio
fig, ax = plt.subplots(figsize=(2.5, 6.5))

time_tip_box = sns.boxplot(x=tips['time'], y=tips['tip'])
time_tip_box.set_xlabel('Time')
time_tip_box.set_ylabel('Tip ($)', rotation=0, labelpad=20)
sns.despine()

png


Last, but certainly not least, is the mighty line plot to explore time-series data. For our example, we will explore Singapore’s population data.

year = [2005, 2006, 2007, 2008, 2009, 2010, 
        2011, 2012, 2013, 2014, 2015]
population = [4.2658, 4.4014, 4.5886, 4.8394, 4.9876, 5.0767, 
              5.1837, 5.3124, 5.3992, 5.4697, 5.535]

pop_line = sns.pointplot(x=year, y=population)
pop_line.set_xlabel('Year')
pop_line.set_ylabel('Population\n(millions)', rotation=0, labelpad=35)
sns.despine()

png

comments powered by Disqus