Handling Multiple Values in One Column

February 15, 2018

Handling Multiple Values in One Column

It is common to see multiple values stored in a single column, and this makes it challenging to analyse the data. This article discusses how we can split the values and appropriately assign them to rows that they are referencing.


Load Libraries

import pandas as pd


Load Data

The data for this exercise is adapted from a task in Udacity’s Data Analyst Nanodegree.

data = [{'movie_id': 135397, 'original_title': 'Jurassic World', 
         'genres': 'Action|Adventure|Science Fiction|Thriller'},
        {'movie_id': 262500, 'original_title': 'Insurgent', 
         'genres': 'Adventure|Science Fiction|Thriller'}]

df = pd.DataFrame(data)
df.head()
genres movie_id original_title
0 Action|Adventure|Science Fiction|Thriller 135397 Jurassic World
1 Adventure|Science Fiction|Thriller 262500 Insurgent


We first split the genres column to create a list of genre lists.

genres = df.loc[:, ['movie_id', 'genres']]
genres_list = genres['genres'].str.split('|').tolist()
genres_list
[['Action', 'Adventure', 'Science Fiction', 'Thriller'],
 ['Adventure', 'Science Fiction', 'Thriller']]


We next create a stacked dataframe.

stacked_genres_df = pd.DataFrame(genres_list, index=genres['movie_id']).stack()
stacked_genres_df.head(7)
movie_id   
135397    0             Action
          1          Adventure
          2    Science Fiction
          3           Thriller
262500    0          Adventure
          1    Science Fiction
          2           Thriller
dtype: object


Resetting the index has the nice effect of interpolating the movie_id.

stacked_genres_df = stacked_genres_df.reset_index()
stacked_genres_df.head(7)
movie_id level_1 0
0 135397 0 Action
1 135397 1 Adventure
2 135397 2 Science Fiction
3 135397 3 Thriller
4 262500 0 Adventure
5 262500 1 Science Fiction
6 262500 2 Thriller


Lastly, we remove the unnecessary columns and rename column 0 to genre.

movie_genre_mapping = stacked_genres_df.loc[:, ['movie_id', 0]]
movie_genre_mapping.columns = ['movie_id', 'genre']
movie_genre_mapping.head(7)
movie_id genre
0 135397 Action
1 135397 Adventure
2 135397 Science Fiction
3 135397 Thriller
4 262500 Adventure
5 262500 Science Fiction
6 262500 Thriller
comments powered by Disqus