Categoricals

February 27, 2018

In addition to the standard Python datatypes (strings, integers, floats, etc.), a variable can be declared as a categorical in Pandas. Categoricals correspond to categorical variables in statistics, and typically take on a limited, and usually fixed, number of values. If you are more familiar with R, categoricals are very similar to R’s factors.

You should consider converting a variable to a Pandas categorical if:

the variable is of string datatype and limited to a fixed number of possible values. Storing each distinct value once and referring to it by an integer code can save memory.
its logical order is different from its lexical (alphabetical) order (e.g. first, second and third).
you wish to signal to other Python libraries that this variable is categorical.
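The memory point in the list above can be checked with a small sketch. The colours Series is made-up data used only for illustration, and the exact byte counts will vary with the Pandas version and platform.

```python
import pandas as pd

# Hypothetical data: 30,000 strings drawn from only three distinct values.
colours = pd.Series(['red', 'green', 'blue'] * 10_000)
as_category = colours.astype('category')

# deep=True includes the memory held by the string values themselves.
object_bytes = colours.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print('as strings:    ', object_bytes)
print('as categorical:', category_bytes)
```

The saving shrinks as the number of distinct values approaches the length of the Series, so the conversion pays off mainly for columns with many repeats.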

Import Libraries

import pandas as pd

The Likert scale is a psychometric scale commonly used in research that employs questionnaires. When the responses are stored as plain strings, as below, Pandas cannot tell that the values have a logical order, so it sorts them lexically.

likert = ['Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree']
responses = pd.Series(likert)

responses = responses.sort_values()
responses

1                Agree
3             Disagree
2              Neutral
0       Strongly Agree
4    Strongly Disagree
dtype: object
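For comparison, here is a minimal sketch of the same responses stored as an ordered categorical. The category order used (from strongest disagreement to strongest agreement) is an assumption made for illustration; the reverse order would be equally valid.

```python
import pandas as pd

likert = ['Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree']

# Assumed logical order, listed from lowest to highest agreement.
likert_order = ['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']

ordered_responses = pd.Series(
    pd.Categorical(likert, categories=likert_order, ordered=True))

# Sorting now follows the declared category order rather than the alphabet.
sorted_responses = ordered_responses.sort_values()
print(sorted_responses.tolist())
```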

It is easy to convert an existing column in a DataFrame to a categorical.

Load Data

We will demonstrate the use of categoricals using the following fictitious classroom results. Through some error, the students’ scores were lost, but the teacher was able to vaguely remember that Amy performed better than Bary and that Cory had the highest score.

students = {'name': ['Amy', 'Bary', 'Cory'], 
          'score': ['better', 'good', 'best']}

students_df = pd.DataFrame(data=students)
students_df

	name	score
0	Amy	better
1	Bary	good
2	Cory	best

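To enable Pandas to sort the score column correctly, we need to convert it to an ordered categorical. One way is to describe the categories and their order with the CategoricalDtype class and pass the result to astype.

# List the categories from lowest to highest; ordered=True makes the order meaningful
score_type = pd.CategoricalDtype(categories=['good', 'better', 'best'], ordered=True)
students_df['score'] = students_df['score'].astype(score_type)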
students_df = students_df.sort_values(by='score')

students_df

	name	score
1	Bary	good
0	Amy	better
2	Cory	best
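Sorting is not the only operation that respects the declared order. The sketch below rebuilds the same classroom data so that it runs on its own, then tries a comparison, min and max, and the integer codes behind the labels. The variable name above_good is introduced here purely for illustration.

```python
import pandas as pd

students_df = pd.DataFrame({'name': ['Amy', 'Bary', 'Cory'],
                            'score': ['better', 'good', 'best']})
score_type = pd.CategoricalDtype(categories=['good', 'better', 'best'], ordered=True)
students_df['score'] = students_df['score'].astype(score_type)

# Comparisons against a category label follow the declared order.
above_good = students_df[students_df['score'] > 'good']
print(above_good['name'].tolist())

# min() and max() also use the category order, not the alphabet.
lowest = students_df['score'].min()
highest = students_df['score'].max()
print(lowest, highest)

# Each value is stored as an integer position in the category list.
codes = students_df['score'].cat.codes.tolist()
print(codes)
```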
