Handling Multiple Values in One Column
February 15, 2018
Handling Multiple Values in One Column
It is common to see multiple values stored in a single column, and this makes it challenging to analyse the data. This article discusses how we can split the values and appropriately assign them to rows that they are referencing.
Load Libraries
import pandas as pd
Load Data
The data for this exercise is adapted from a task in Udacity’s Data Analyst Nanodegree.
data = [{'movie_id': 135397, 'original_title': 'Jurassic World',
'genres': 'Action|Adventure|Science Fiction|Thriller'},
{'movie_id': 262500, 'original_title': 'Insurgent',
'genres': 'Adventure|Science Fiction|Thriller'}]
df = pd.DataFrame(data)
df.head()
genres | movie_id | original_title | |
---|---|---|---|
0 | Action|Adventure|Science Fiction|Thriller | 135397 | Jurassic World |
1 | Adventure|Science Fiction|Thriller | 262500 | Insurgent |
We first split the genres
column to create a list of genre lists.
genres = df.loc[:, ['movie_id', 'genres']]
genres_list = genres['genres'].str.split('|').tolist()
genres_list
[['Action', 'Adventure', 'Science Fiction', 'Thriller'],
['Adventure', 'Science Fiction', 'Thriller']]
We next create a stacked dataframe.
stacked_genres_df = pd.DataFrame(genres_list, index=genres['movie_id']).stack()
stacked_genres_df.head(7)
movie_id
135397 0 Action
1 Adventure
2 Science Fiction
3 Thriller
262500 0 Adventure
1 Science Fiction
2 Thriller
dtype: object
Resetting the index has the nice effect of interpolating the movie_id
.
stacked_genres_df = stacked_genres_df.reset_index()
stacked_genres_df.head(7)
movie_id | level_1 | 0 | |
---|---|---|---|
0 | 135397 | 0 | Action |
1 | 135397 | 1 | Adventure |
2 | 135397 | 2 | Science Fiction |
3 | 135397 | 3 | Thriller |
4 | 262500 | 0 | Adventure |
5 | 262500 | 1 | Science Fiction |
6 | 262500 | 2 | Thriller |
Lastly, we remove the unnecessary columns and rename column 0
to genre
.
movie_genre_mapping = stacked_genres_df.loc[:, ['movie_id', 0]]
movie_genre_mapping.columns = ['movie_id', 'genre']
movie_genre_mapping.head(7)
movie_id | genre | |
---|---|---|
0 | 135397 | Action |
1 | 135397 | Adventure |
2 | 135397 | Science Fiction |
3 | 135397 | Thriller |
4 | 262500 | Adventure |
5 | 262500 | Science Fiction |
6 | 262500 | Thriller |