Metflix: How to recommend movies

Where are we at?

This is what we have done so far:

In part 0, we downloaded our data from MovieLens, did some EDA and created our user item matrix. The matrix has 671 unique users, 9066 unique movies and is 98.35% sparse.
In part 1, we described 3 of the most common recommendation methods: User Based Collaborative Filtering, Item Based Collaborative Filtering and Matrix Factorization.
In part 2, this part, we will implement Matrix Factorization through ALS and find similar movies.

Matrix Factorization

We want to factorize our user item interaction matrix into a User matrix and an Item matrix. To do that, we will use the Alternating Least Squares (ALS) algorithm. We could write our own implementation of ALS, as has been done in this post or this post, or we can use the fast implementation already made available by Ben Frederickson. The ALS model here comes from the implicit package, which can be installed with pip or, if you use Anaconda, with conda.

import implicit

model = implicit.als.AlternatingLeastSquares(factors=10,
                                             iterations=20,
                                             regularization=0.1,
                                             num_threads=4)
model.fit(user_item.T)

Here, we called ALS with the following parameters:

10 factors. This indicates the number of latent factors to be used
20 iterations
0.1 regularization. This regularization term is the lambda in the loss function
4 threads. The computation can be parallelized, which makes it fast: training takes about 5 seconds.

One thing to note is that the input for the ALS model is an item user interaction matrix, so we just have to pass the transpose of our user item matrix to the model's fit function.
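To make these parameters more concrete, here is a minimal NumPy sketch of the alternating idea: hold the item factors fixed and solve a regularized least squares problem for each user, then swap roles and repeat. It is a toy explicit-feedback version on a small dense matrix with made-up ratings, not the confidence-weighted algorithm that implicit runs. The function name als_sketch and the mask argument are our own and exist only for this illustration.

```python
import numpy as np

def als_sketch(R, mask, factors=2, iterations=20, regularization=0.1, seed=0):
    """Toy explicit-feedback ALS (illustration only, not implicit's algorithm)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    # Small random starting points for the user and item factor matrices
    U = rng.normal(scale=0.1, size=(n_users, factors))
    V = rng.normal(scale=0.1, size=(n_items, factors))
    # The regularization term is the lambda of the loss function
    reg = regularization * np.eye(factors)
    for _ in range(iterations):
        # Item factors fixed: one small ridge regression per user
        for u in range(n_users):
            seen = mask[u]
            Vu = V[seen]
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ R[u, seen])
        # User factors fixed: one small ridge regression per item
        for i in range(n_items):
            seen = mask[:, i]
            Ui = U[seen]
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[seen, i])
    return U, V

# Made-up ratings: 0 means "not rated"
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
mask = R > 0

U, V = als_sketch(R, mask)
# Reconstruction error on the observed entries only
rmse = np.sqrt(np.mean((R[mask] - (U @ V.T)[mask]) ** 2))
print(round(rmse, 3))
```

Each inner solve is independent of the others, which is why this kind of algorithm parallelizes well across threads.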

Recommending similar movies

It’s time to get some results. We want to find similar movies for a selected title. The implicit module offers a ready-to-use method that returns similar items given the movie's index in the item user matrix. However, we need to translate that index to the movie ID in the movies table.

movies_table = pd.read_csv("data/ml-latest-small/movies.csv")
movies_table.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy

def similar_items(item_id, movies_table, movies, N=5):
    """
    Input
    -----

    item_id: int
        MovieID in the movies table

    movies_table: DataFrame
        DataFrame with movie ids, movie title and genre

    movies: list
        Mapping between movieID in the movies_table and id in the item user matrix
        (the position of a movieID in this list is its index in the matrix)

    N: int
        Number of similar movies to return

    Output
    -----

    recommendation: DataFrame
        DataFrame with selected movie in first row and similar movies for N next rows

    """
    # Get the movie's index in the item user matrix from the mapping list
    user_item_id = movies.index(item_id)
    # Get similar movies from the ALS model (the first result is the movie itself)
    similars = model.similar_items(user_item_id, N=N+1)
    # ALS similar_items provides (id, score), we extract a list of ids
    indices = [item[0] for item in similars]
    # Convert those ids to movieID from the mapping list
    ids = [movies[index] for index in indices]
    # Make a DataFrame of the movieIds
    ids = pd.DataFrame(ids, columns=['movieId'])
    # Add movie title and genres by joining with the movies table
    recommendation = pd.merge(ids, movies_table, on='movieId', how='left')

    return recommendation
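As a rough picture of what a similar-items lookup does with the learned item factors, the sketch below ranks items by cosine similarity between their factor vectors and then maps the matrix indices back to movie IDs. The factor values, the helper name top_similar and the four-entry mapping list are all made up for illustration, and this is not necessarily how implicit computes its scores internally.

```python
import numpy as np

def top_similar(item_factors, item_index, N=5):
    """Return (index, score) pairs for the N items closest to item_index."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1e-12  # guard against division by zero
    # Cosine similarity between the selected item and every item
    scores = item_factors @ item_factors[item_index] / (norms * norms[item_index])
    best = np.argsort(-scores)[:N]
    return [(int(i), float(scores[i])) for i in best]

# Hypothetical 2-factor item matrix: rows 0-1 point one way, rows 2-3 another
item_factors = np.array([[1.0, 0.1],
                         [0.9, 0.2],
                         [0.1, 1.0],
                         [0.2, 0.8]])
# Hypothetical mapping list: position = matrix index, value = movieId
movies = [10, 208, 500, 586]

similars = top_similar(item_factors, movies.index(10), N=2)
print([movies[i] for i, _ in similars])  # prints [10, 208]
```

The selected item always comes back first with a score of 1, which is why the function above asks the model for N+1 results.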

Let’s try it!

Let’s see which similar movies we get for a James Bond movie, GoldenEye.

df = similar_items(10, movies_table, movies, 5)
df

	movieId	title	genres
0	10	GoldenEye (1995)	Action|Adventure|Thriller
1	208	Waterworld (1995)	Action|Adventure|Sci-Fi
2	316	Stargate (1994)	Action|Adventure|Sci-Fi
3	592	Batman (1989)	Action|Crime|Thriller
4	185	Net, The (1995)	Action|Crime|Thriller
5	153	Batman Forever (1995)	Action|Adventure|Comedy|Crime

Interesting recommendations. One thing to notice is that all recommended movies are also in the Action genre. Remember that the ALS algorithm was given no information about movie genres; it only saw the user ratings. Let’s try another example.

df = similar_items(500, movies_table, movies, 5)
df

	movieId	title	genres
0	500	Mrs. Doubtfire (1993)	Comedy|Drama
1	586	Home Alone (1990)	Children|Comedy
2	587	Ghost (1990)	Comedy|Drama|Fantasy|Romance|Thriller
3	597	Pretty Woman (1990)	Comedy|Romance
4	539	Sleepless in Seattle (1993)	Comedy|Drama|Romance
5	344	Ace Ventura: Pet Detective (1994)	Comedy

The selected movie is a comedy and so are the recommendations. Another interesting thing to note is that the recommended movies are from the same time frame (the early 90s).

df = similar_items(1, movies_table, movies, 5)
df

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	527	Schindler's List (1993)	Drama|War
2	356	Forrest Gump (1994)	Comedy|Drama|Romance|War
3	260	Star Wars: Episode IV - A New Hope (1977)	Action|Adventure|Sci-Fi
4	318	Shawshank Redemption, The (1994)	Crime|Drama
5	593	Silence of the Lambs, The (1991)	Crime|Horror|Thriller

This is a case where the recommendations are not relevant. Recommending Silence of the Lambs to a user who just watched Toy Story does not seem like a good idea.
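One rough way to put a number on "relevant" in these examples is the genre overlap between the selected movie and each recommendation. The sketch below uses the Jaccard similarity of the pipe-separated genre strings shown in the tables above. The helper names are our own, and genre overlap is only a sanity check that we are assuming here, not a proper evaluation metric.

```python
def genre_overlap(a, b):
    """Jaccard similarity between two pipe-separated genre strings."""
    set_a, set_b = set(a.split('|')), set(b.split('|'))
    return len(set_a & set_b) / len(set_a | set_b)

def mean_overlap(seed, recommendations):
    """Average genre overlap between a seed movie and its recommendations."""
    return sum(genre_overlap(seed, r) for r in recommendations) / len(recommendations)

# Genres copied from the GoldenEye and Toy Story tables above
goldeneye = "Action|Adventure|Thriller"
goldeneye_recs = ["Action|Adventure|Sci-Fi", "Action|Adventure|Sci-Fi",
                  "Action|Crime|Thriller", "Action|Crime|Thriller",
                  "Action|Adventure|Comedy|Crime"]

toy_story = "Adventure|Animation|Children|Comedy|Fantasy"
toy_story_recs = ["Drama|War", "Comedy|Drama|Romance|War",
                  "Action|Adventure|Sci-Fi", "Crime|Drama",
                  "Crime|Horror|Thriller"]

print(round(mean_overlap(goldeneye, goldeneye_recs), 2))  # prints 0.48
print(round(mean_overlap(toy_story, toy_story_recs), 3))  # prints 0.054
```

The gap between the two averages matches the impression from reading the tables: the GoldenEye neighbours share about half of its genres, while the Toy Story neighbours share almost none.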

Make it fancy

So far, the recommendations are displayed in a DataFrame. Let’s make it fancy by showing the movie posters instead of just the titles. This might help us later when we deploy our model and separate the work into Front End and Back End. To do that, we will download movie metadata that I found on Kaggle. We will need the following files:

movies_metadata.csv
links.csv

metadata = pd.read_csv('data/movies_metadata.csv')
metadata.head(2)

	adult	budget	genres	imdb_id	...	title	revenue	runtime
0	False	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	tt0114709	...	Toy Story	373554033.0	81.0
1	False	65000000	[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...	tt0113497	...	Jumanji	262797249.0	104.0

2 rows × 24 columns

From this metadata file we only need the imdb_id and poster_path columns.

image_data = metadata[['imdb_id', 'poster_path']]
image_data.head()

	imdb_id	poster_path
0	tt0114709	/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1	tt0113497	/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2	tt0113228	/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3	tt0114885	/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4	tt0113041	/e64sOI48hQXyru7naBFyssKFxVd.jpg

We want to merge the poster_path column with the movies table, so we need the links file to map between the IMDb id and movieId.

links = pd.read_csv("data/links.csv")
links.head()

	movieId	imdbId	tmdbId
0	1	114709	862.0
1	2	113497	8844.0
2	3	113228	15602.0
3	4	114885	31357.0
4	5	113041	11862.0

links = links[['movieId', 'imdbId']]

Merging the ids will be done in 2 steps:

First merge the poster path with the mapping links
Then merge with movies_table

But first we need to remove the rows with missing IMDb ids and extract the integer ID from the remaining ones:

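# Keep only the rows that have an IMDb id; copy to avoid SettingWithCopyWarning
image_data = image_data[~ image_data.imdb_id.isnull()].copy()

def app(x):
    # Drop the leading "tt" and convert the rest to an integer
    try:
        return int(x[2:])
    except ValueError:
        # Malformed id: show it and return None so the row is dropped below
        print(x)

image_data['imdbId'] = image_data.imdb_id.apply(app)

image_data = image_data[~ image_data.imdbId.isnull()].copy()

image_data['imdbId'] = image_data.imdbId.astype(int)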

image_data = image_data[['imdbId', 'poster_path']]

image_data.head()

	imdbId	poster_path
0	114709	/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1	113497	/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2	113228	/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3	114885	/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4	113041	/e64sOI48hQXyru7naBFyssKFxVd.jpg
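The same cleaning step can also be written without a Python-level function. The sketch below is one possible vectorized version on a small made-up sample: the first two ids come from the table above, while the missing and malformed rows and their poster paths are hypothetical. Note one difference: it only accepts ids of the form tt followed by digits, whereas the function above simply drops the first two characters.

```python
import pandas as pd

# Made-up sample: two real ids from the table above, one missing, one malformed
sample = pd.DataFrame({
    'imdb_id': ['tt0114709', 'tt0113497', None, 'bad'],
    'poster_path': ['/rhIRbceoE9lR4veEXuwCC2wARtG.jpg',
                    '/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg',
                    '/hypothetical_a.jpg',
                    '/hypothetical_b.jpg'],
})

# Capture the digits after "tt"; rows that do not match become NaN
digits = sample['imdb_id'].str.extract(r'^tt(\d+)$', expand=False)
sample['imdbId'] = pd.to_numeric(digits, errors='coerce')

# Drop the rows without a usable id, then convert to integers
clean = sample.dropna(subset=['imdbId']).copy()
clean['imdbId'] = clean['imdbId'].astype(int)

print(clean[['imdbId', 'poster_path']])
```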

posters = pd.merge(image_data, links, on='imdbId', how='left')

posters = posters[['movieId', 'poster_path']]

posters = posters[~ posters.movieId.isnull()]

posters.movieId = posters.movieId.astype(int)

posters.head()

	movieId	poster_path
0	1	/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1	2	/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2	3	/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3	4	/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4	5	/e64sOI48hQXyru7naBFyssKFxVd.jpg

movies_table = pd.merge(movies_table, posters, on='movieId', how='left')
movies_table.head()

	movieId	title	genres	poster_path
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy	/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2	3	Grumpier Old Men (1995)	Comedy\|Romance	/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance	/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4	5	Father of the Bride Part II (1995)	Comedy	/e64sOI48hQXyru7naBFyssKFxVd.jpg
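To see what the two merges do on something small enough to check by eye, here is a sketch on three-row toy frames. The ids and titles are taken from the tables above, except for imdbId 999999 and its poster path, which are made up. For brevity it uses an inner merge for the first step instead of a left merge followed by dropping the unmatched rows; on these frames the result is the same.

```python
import pandas as pd

# Toy versions of the three tables used above
image_data = pd.DataFrame({
    'imdbId': [114709, 113497, 999999],  # 999999 is a made-up id
    'poster_path': ['/rhIRbceoE9lR4veEXuwCC2wARtG.jpg',
                    '/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg',
                    '/hypothetical.jpg'],
})
links = pd.DataFrame({'movieId': [1, 2, 3],
                      'imdbId': [114709, 113497, 113228]})
movies_table = pd.DataFrame({'movieId': [1, 2, 3],
                             'title': ['Toy Story (1995)',
                                       'Jumanji (1995)',
                                       'Grumpier Old Men (1995)']})

# Step 1: attach movieId to each poster, keeping only posters that have one
posters = pd.merge(image_data, links, on='imdbId', how='inner')
posters = posters[['movieId', 'poster_path']]

# Step 2: attach the poster to each movie, keeping every movie
merged = pd.merge(movies_table, posters, on='movieId', how='left')
print(merged)
```

Because the second merge is a left merge, a movie without a poster keeps its row and gets a missing poster_path, which any display code has to be prepared for.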

Now that we have the poster path, we need to download them from a website. One way to do it is to use the TMDB API to get movie posters. However, we will have to make an account on the website, apply to use the API and wait for approval to get a token ID. We don’t have time for that, so we’ll improvise.

All movie posters can be accessed through a base URL plus the movie poster path that we got, and using the HTML class from IPython.display we can show them directly in a Jupyter Notebook.

from IPython.display import HTML
from IPython.display import display

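def display_recommendations(df):
    # Build one <img> tag per poster and render them side by side
    images = ''
    for ref in df.poster_path:
        # Skip movies without a poster (NaN after the left merge, or empty)
        if isinstance(ref, str) and ref != '':
            # poster_path already starts with a "/"
            link = 'http://image.tmdb.org/t/p/w185' + ref
            images += ("<img style='width: 120px; margin: 0px; "
                       "float: left; border: 1px solid black;' src='%s' />"
                       % link)
    display(HTML(images))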

df = similar_items(500, movies_table, movies, 5)
display_recommendations(df)
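If we later split the work into Front End and Back End, it may help to separate building the HTML from displaying it, so the string can be checked outside a notebook. Below is a small sketch of that idea. The function name poster_html and its base_url and width parameters are our own; the base URL is the one used above.

```python
def poster_html(poster_paths, base_url='http://image.tmdb.org/t/p/w185', width=120):
    """Return one <img> tag per valid poster path, concatenated into a string."""
    tags = []
    for ref in poster_paths:
        # Skip missing posters: NaN (a float) or an empty string
        if not isinstance(ref, str) or ref == '':
            continue
        # Join base URL and path with exactly one slash between them
        url = base_url.rstrip('/') + '/' + ref.lstrip('/')
        tags.append("<img style='width: %dpx; margin: 0px; float: left; "
                    "border: 1px solid black;' src='%s' />" % (width, url))
    return ''.join(tags)

# One real path from the tables above, plus a missing and an empty value
html = poster_html(['/rhIRbceoE9lR4veEXuwCC2wARtG.jpg', float('nan'), ''])
print(html)
```

In a notebook, the result can then be passed to display(HTML(html)) as before.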

Let’s put all of it into one small function:

def similar_and_display(item_id, movies_table, movies, N=5):

    df = similar_items(item_id, movies_table, movies, N=N)

    display_recommendations(df)

similar_and_display(10, movies_table, movies, 5)

Conclusion

In this post we implemented ALS through the implicit module to find similar movies. Additionally, we did some hacking to display the movie posters instead of just a DataFrame. In the next post we will see how to make recommendations for users depending on what movies they’ve seen. We will also see how we can set up an evaluation scheme and optimize the ALS parameters.

Stay tuned!