Univariate Analysis

March 3, 2018

In general, there are three types of quantitative analysis, distinguished by the number of variables examined at once:

univariate (one variable)
bivariate (two variables)
multivariate (more than two variables)

We will focus on univariate analysis in this article.

Import Libraries

import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so sp.stats.mode is available
import seaborn as sns

%matplotlib inline

Set Seaborn Visualisation Options

sns.set_style('whitegrid')
sns.set_palette('pastel')

Load Dataset

We will use the tips dataset bundled with seaborn, which records the tips one waiter received along with characteristics of each dining party (total bill, day, time, party size and so on).

tips = sns.load_dataset('tips')
tips.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

Nominal Variables

For nominal variables, we can summarise the data using either a frequency table or a bar chart.

sex_freq_table = (
    tips['sex']
    .value_counts()
    .rename_axis('sex')           # name the category column explicitly
    .reset_index(name='count')    # name the frequency column explicitly
)
sex_freq_table

	sex	count
0	Male	157
1	Female	87

# to control the aspect ratio of the plot
fig, ax = plt.subplots(figsize=(2.5, 5))

# draw on the axes created above so the labels below apply to this plot
sex_bar = sns.barplot(data=sex_freq_table, x='sex', y='count', ax=ax)
ax.set_xlabel('Gender')
ax.set_ylabel('Count', rotation=0, labelpad=25)

sns.despine()

[Figure: bar chart of diner counts by gender]
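A frequency table is often easier to read when it shows proportions next to the raw counts. The sketch below illustrates one way to do this with pandas. The Series is a small hypothetical stand-in for a nominal column such as tips['sex'], so the numbers are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a nominal column such as tips['sex']
sex = pd.Series(['Male', 'Female', 'Male', 'Male', 'Female'], name='sex')

# Absolute counts per category, most frequent first
counts = sex.value_counts()

# Proportions per category; normalize=True divides each count by the total
proportions = sex.value_counts(normalize=True)

# Combine both views into a single frequency table
freq_table = pd.DataFrame({'count': counts, 'proportion': proportions})
print(freq_table)
```

Swapping the hypothetical Series for tips['sex'] should give the same table layout for the real data.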

Interval (Numerical) Variables

For numerical variables, we are interested in the following:

Measures of central tendency (mean, mode, median).
Measures of dispersion (standard deviation, variance).

Measures of Central Tendency

NumPy can calculate the mean and median, while we will use SciPy to find the mode.

print(np.mean(tips['total_bill']))

19.785942622950824

print(np.median(tips['total_bill']))

17.795

print(sp.stats.mode(tips['total_bill']))

ModeResult(mode=array([13.42]), count=array([3]))
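To see how the three measures relate, it can help to work through a tiny example by hand. The sketch below uses hypothetical bill amounts (not taken from the tips dataset) and finds the mode with np.unique instead of SciPy, which also makes ties easy to spot because every most-frequent value is returned.

```python
import numpy as np

# Hypothetical bill amounts, not taken from the tips dataset
bills = np.array([10.0, 12.5, 12.5, 20.0, 45.0])

mean = np.mean(bills)      # arithmetic average
median = np.median(bills)  # middle value of the sorted data

# Mode: count each distinct value and keep the most frequent one(s)
values, counts = np.unique(bills, return_counts=True)
modes = values[counts == counts.max()]

print(mean, median, modes)
```

In this toy sample the single large bill pulls the mean above the median, which is one reason to report more than one measure of central tendency.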

Measures of Dispersion

NumPy also provides functions for standard deviation and variance. Both divide by N - ddof, where N is the number of observations, so ddof=0 (the default) gives the population statistic and ddof=1 gives the sample statistic. Here we use ddof=0.

print(np.std(tips['total_bill'], ddof=0))

8.88415057777113

print(np.var(tips['total_bill'], ddof=0))

78.92813148851113
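The effect of ddof is easiest to see by comparing NumPy's results against the textbook formulas. The sketch below does this on a small hypothetical sample (not taken from the tips dataset), checking that ddof=0 divides the sum of squared deviations by n and ddof=1 divides it by n - 1.

```python
import numpy as np

# Hypothetical sample of bill amounts, not taken from the tips dataset
bills = np.array([10.0, 12.5, 12.5, 20.0, 45.0])
n = bills.size

# Sum of squared deviations from the mean
ss = np.sum((bills - bills.mean()) ** 2)

pop_var = np.var(bills, ddof=0)     # divides by n
sample_var = np.var(bills, ddof=1)  # divides by n - 1

# Both results match the textbook formulas
assert np.isclose(pop_var, ss / n)
assert np.isclose(sample_var, ss / (n - 1))

# Standard deviation is the square root of the variance
sample_std = np.std(bills, ddof=1)
assert np.isclose(sample_std, np.sqrt(sample_var))

print(pop_var, sample_var, sample_std)
```

The sample variance is always at least as large as the population variance for the same data, and the gap shrinks as n grows.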
