Categoricals
February 27, 2018
In addition to the standard Python datatypes (strings, integers, floats, etc.), Pandas lets you declare a variable as a categorical. Categoricals correspond to categorical variables in statistics: they take on a limited, and usually fixed, number of possible values. If you are more familiar with R, categoricals are very similar to R’s factors.
You should consider converting a variable to a Pandas categorical if:
- the variable is of string datatype and takes only a small, fixed set of possible values. Storing it as a categorical can save memory, because each distinct string is stored once and the rows hold compact integer codes.
- its logical order is different from its lexical (alphabetical) order (e.g. first, second and third)
- you wish to signal to other Python libraries that this variable is categorical.
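The memory point above can be checked directly. The following is a minimal sketch on made-up data (the `colours` Series and its values are assumptions for illustration, not part of the original example); the exact byte counts will vary with the Pandas version and platform.

```python
import pandas as pd

# Hypothetical data: 30,000 rows drawn from only three distinct strings.
# dtype=object keeps the values as ordinary Python strings.
colours = pd.Series(['red', 'green', 'blue'] * 10_000, dtype=object)

# deep=True also counts the memory held by the string objects themselves.
object_bytes = colours.memory_usage(deep=True)
category_bytes = colours.astype('category').memory_usage(deep=True)

print('object dtype:  ', object_bytes, 'bytes')
print('category dtype:', category_bytes, 'bytes')
```

The saving shrinks as the number of distinct values approaches the number of rows, so this conversion pays off mainly for columns with many repeats.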
Import Libraries
import pandas as pd
The Likert scale is a psychometric scale commonly used in research that employs questionnaires. In its string representation below, Pandas is unable to determine that there is a logical order to the values and therefore sorting is performed lexically.
likert = ['Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree']
responses = pd.Series(likert)
responses = responses.sort_values()
responses
1 Agree
3 Disagree
2 Neutral
0 Strongly Agree
4 Strongly Disagree
dtype: object
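For comparison, here is a sketch of the same Likert values stored as an ordered categorical, so that sorting follows the logical order rather than the alphabet. The names `likert_type` and `shuffled` are introduced only for this illustration, and the sketch assumes `CategoricalDtype` from `pandas.api.types` as the way to declare the order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

likert = ['Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree']

# The position in the categories list defines the sort order.
likert_type = CategoricalDtype(categories=likert, ordered=True)

# Illustrative responses in an arbitrary order.
shuffled = ['Neutral', 'Strongly Disagree', 'Agree', 'Strongly Agree', 'Disagree']
responses = pd.Series(shuffled).astype(likert_type)

sorted_responses = responses.sort_values()
print(sorted_responses.tolist())
```

Here the sorted values come back in the order given by `likert`, not in alphabetical order.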
It is easy to convert an existing column in a DataFrame to a categorical.
Load Data
We will demonstrate the use of categoricals using the following fictitious classroom results. Through some error, the students’ scores were lost, but the teacher was able to vaguely remember that Amy performed better than Bary, and that Cory had the highest score.
students = {'name': ['Amy', 'Bary', 'Cory'],
            'score': ['better', 'good', 'best']}
students_df = pd.DataFrame(data=students)
students_df
| | name | score |
|---|---|---|
| 0 | Amy | better |
| 1 | Bary | good |
| 2 | Cory | best |
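To enable Pandas to sort the score column correctly, we need to convert it to an ordered categorical. We describe the categories and their order with the CategoricalDtype class, then pass that dtype to astype.
from pandas.api.types import CategoricalDtype
# The order of the categories list defines the sort order, lowest first.
score_type = CategoricalDtype(categories=['good', 'better', 'best'], ordered=True)
students_df['score'] = students_df['score'].astype(score_type)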
students_df = students_df.sort_values(by='score')
students_df
| | name | score |
|---|---|---|
| 1 | Bary | good |
| 0 | Amy | better |
| 2 | Cory | best |
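The declared order is not limited to sorting. As a small, self-contained sketch (rebuilding the same classroom data, and assuming `CategoricalDtype` from `pandas.api.types` to declare the order), an ordered categorical column can also be compared against a category label and summarised with `min()` and `max()`. The variable names `above_good`, `lowest` and `highest` are only for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

score_type = CategoricalDtype(categories=['good', 'better', 'best'], ordered=True)

students_df = pd.DataFrame({'name': ['Amy', 'Bary', 'Cory'],
                            'score': ['better', 'good', 'best']})
students_df['score'] = students_df['score'].astype(score_type)

# Comparisons use the declared order: 'better' and 'best' both rank above 'good'.
above_good = students_df.loc[students_df['score'] > 'good', 'name'].tolist()

# min() and max() also follow the declared order, not alphabetical order.
lowest = students_df['score'].min()
highest = students_df['score'].max()

print(above_good, lowest, highest)
```

Comparing against a label that is not one of the declared categories raises an error, which can help catch typos in the data.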