Categoricals

February 27, 2018

In addition to the standard Python datatypes (strings, integers, floats, etc.), a variable can be declared as a categorical in Pandas. Categoricals correspond to categorical variables in statistics and typically take on a limited, and usually fixed, number of values. If you are more familiar with R, categoricals are very similar to R’s factors.

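You should consider converting a variable to a Pandas categorical if it is a string variable with only a few distinct values (which can save memory), if its logical order differs from its lexical order, or if you want to signal to other Python libraries that the column should be treated as categorical.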


Import Libraries

import pandas as pd

The Likert scale is a psychometric scale commonly used in research that employs questionnaires. When the responses are stored as plain strings, as below, Pandas cannot tell that the values have a logical order, so it sorts them lexically.

likert = ['Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree']
responses = pd.Series(likert)

responses = responses.sort_values()
responses
1                Agree
3             Disagree
2              Neutral
0       Strongly Agree
4    Strongly Disagree
dtype: object
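As a minimal sketch of how the same responses could be given a logical order, the example below declares an ordered categorical type before sorting. The names likert_order, likert_type and ordered_responses are introduced here for illustration, and the direction of the ordering (most negative to most positive) is an assumption.

```python
import pandas as pd

likert = ['Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree']

# Assumed ordering, from the most negative to the most positive response.
likert_order = ['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']
likert_type = pd.api.types.CategoricalDtype(categories=likert_order, ordered=True)

# Convert the string Series to the ordered categorical type.
ordered_responses = pd.Series(likert).astype(likert_type)

# Sorting now follows the declared category order rather than the alphabet.
sorted_responses = ordered_responses.sort_values()
print(sorted_responses.tolist())
```

With the order declared, 'Strongly Disagree' should sort first and 'Strongly Agree' last.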

It is easy to convert an existing column in a DataFrame to a categorical.


Load Data

We will demonstrate the use of categoricals using the following fictitious classroom results. Through some error, the students’ scores were lost, but the teacher was able to vaguely remember that Amy performed better than Bary and that Cory had the highest score.

students = {'name': ['Amy', 'Bary', 'Cory'], 
          'score': ['better', 'good', 'best']}

students_df = pd.DataFrame(data=students)
students_df
   name   score
0   Amy  better
1  Bary    good
2  Cory    best


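To enable Pandas to correctly sort the score column, we need to convert it to an ordered categorical. Here the category order is declared with the CategoricalDtype class and then passed to astype.

# Declare the categories from lowest to highest and mark them as ordered
score_type = pd.api.types.CategoricalDtype(categories=['good', 'better', 'best'],
                                           ordered=True)
students_df['score'] = students_df['score'].astype(score_type)
# Sorting now follows the declared order instead of alphabetical order
students_df = students_df.sort_values(by='score')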

students_df
   name   score
1  Bary    good
0   Amy  better
2  Cory    best
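Sorting is not the only operation that benefits from an ordered categorical. The sketch below rebuilds the same table and assumes the same category order ('good' < 'better' < 'best'), then filters and summarises by that order. The variable name above_good is introduced here for illustration only.

```python
import pandas as pd

students_df = pd.DataFrame({'name': ['Amy', 'Bary', 'Cory'],
                            'score': ['better', 'good', 'best']})

# Same ordered type as in the sorting example.
score_type = pd.api.types.CategoricalDtype(categories=['good', 'better', 'best'],
                                           ordered=True)
students_df['score'] = students_df['score'].astype(score_type)

# Comparisons against a category label respect the declared order.
above_good = students_df[students_df['score'] > 'good']
print(above_good['name'].tolist())

# min() and max() are defined for ordered categoricals.
print(students_df['score'].min(), students_df['score'].max())
```

Comparing against a label that is not one of the declared categories raises an error, so a misspelt label is reported rather than silently matching nothing.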