from fastai.collab import *
from fastai.tabular.all import *
set_seed(42)
Collaborative filtering - How recommendation systems work
Have you ever wondered how websites make recommendations for you? Think about YouTube, Netflix, Amazon, or how Facebook and LinkedIn suggest friends and connections. Often companies have just user ids, product ids, and maybe a rating showing how well a product satisfied the user. For such a problem there is a solution that has been used by a variety of companies around the globe, and it's none other than collaborative filtering.
The way collaborative filtering works is that it looks at what the current user has used or liked, finds other users that have liked similar products, and then recommends to the current user other products that those users have liked or used.
In this tutorial we are going to build a recommender from scratch, and thereafter I shall show you how to do the same much faster. We will briefly be looking at these:
- Dot product multiplication from scratch
- Using the collab_learner framework
- Using deep learning
As usual, let's import the libraries…
The dataset we shall use is MovieLens, a free recommendation dataset with 25 million ratings. It has user id, movie id, rating, and timestamp variables. Although I tried using the whole dataset for this tutorial, it was taking too long and made me impatient, so I decided to use a smaller version of the data with just 100k rows. You can consider using the whole dataset if you like.
path = untar_data(URLs.ML_100k)
Jupyter notebooks let you run both terminal commands and Python code. Let's check what we've downloaded.
!ls {path}
README u.data u.item u1.base u2.test u4.base u5.test ub.base
allbut.pl u.genre u.occupation u1.test u3.base u4.test ua.base ub.test
mku.sh u.info u.user u2.base u3.test u5.base ua.test
When you read through the README file, it tells us that the data is stored in u.data and is tab separated, with the columns ordered as user, movie, rating, timestamp. However, the columns are not labelled. We can use pandas to sort everything out.
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
ratings.head()
user | movie | rating | timestamp
---|---|---|---
196 | 242 | 3 | 881250949
186 | 302 | 3 | 891717742
22 | 377 | 1 | 878887116
244 | 51 | 2 | 880606923
166 | 346 | 1 | 886397596
ratings.shape
(100000, 4)
The shape shows that we have 100,000 rows and 4 columns. The dataframe above contains all the data that we need, but it's not in a friendly view. Here is a cross-tabulated Excel version depicting the details in a more friendly way.
The data represents ratings that users have given to movies. The blank gaps mean that the user hasn't given a rating yet; most probably we can assume they haven't watched the movie. In our case, the blank gaps are the values we want to predict.
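If you want to build such a cross-tabulated view yourself, here is a minimal sketch using pandas; the pd.crosstab call and the crosstab variable name are illustrative additions, not part of the original notebook:

```python
# Pivot the ratings into a users x movies grid; the missing entries (NaN)
# are exactly the ratings we want the model to predict.
crosstab = pd.crosstab(ratings['user'], ratings['movie'],
                       values=ratings['rating'], aggfunc='mean')
crosstab.iloc[:10, :10]  # peek at a small corner of the grid
```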
Latent factors
The idea behind collaborative filtering is latent factors. Latent factors make it possible for a model to tell which products you may like, and these kinds of factors can be learned from what other users like.
jargon: Latent factors are factors that are important for predicting the recommendations, but are not explicitly given to the model; they are learned instead.
Learning the factors
Step 1: Randomly initialize some parameters. The left matrix represents the embedding matrix for the user ids, whereas the upper matrix is for the movie ids, and all of these parameters are randomly initialised. In both cases, blue represents the biases. For each id (whether movie id or user id) the number of latent factors is the same: in our case each id has 5 factors plus an extra 6th, the bias.
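To make Step 1 concrete, here is a minimal sketch of that random initialization; the sizes (10 users, 15 movies, 5 factors) and the demo_ names are made up purely for illustration:

```python
# Illustrative Step 1: random latent factors plus one bias per user and per movie.
n_demo_users, n_demo_movies, n_demo_factors = 10, 15, 5
demo_user_factors = torch.randn(n_demo_users, n_demo_factors)
demo_user_bias = torch.randn(n_demo_users)
demo_movie_factors = torch.randn(n_demo_movies, n_demo_factors)
demo_movie_bias = torch.randn(n_demo_movies)
```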
jargon: An embedding matrix is what we multiply a one-hot-encoded index by (equivalently, what we look an embedding up in); in this collaborative filtering problem it is learned through training.
Step 2: Calculate the predictions. Think of the simple equation \(y=mx+c\), where we have two parameters, m and c, with c being the bias term. Our bias works in the same way. We use a dot product to multiply the user factors by the movie factors, then add both biases. This is how we obtain the values in the white space.
jargon: A dot product is when you multiply the corresponding elements of two vectors and add them up.
Step 3: The next step is to calculate the loss. We can use any loss function as long as it can be optimized. Remember, at each step we compare the predicted value with the actual value, then use stochastic gradient descent to make the parameters better and better. In this tutorial we use mean squared error since it works well here; other loss functions like mean absolute error can work well too.
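As a rough sketch of Steps 2 and 3, reusing the illustrative demo_ tensors from above (the indices and the true rating below are made up), the prediction and loss for a single user-movie pair could look like this:

```python
# Step 2: dot product of the user and movie factors, plus both biases.
u, m = 3, 7                                                  # made-up indices
pred = (demo_user_factors[u] * demo_movie_factors[m]).sum()
pred += demo_user_bias[u] + demo_movie_bias[m]
# Step 3: mean squared error against a made-up actual rating.
actual = torch.tensor(4.0)
loss = (pred - actual)**2
```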
Creating DataLoaders
Remember, when we listed what was in our path there were many things, including the movie names in u.item. I think it is better to use these than the ids; it makes our dataframe more human friendly. We can use pandas again to do this.
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head()
movie | title
---|---
1 | Toy Story (1995)
2 | GoldenEye (1995)
3 | Four Rooms (1995)
4 | Get Shorty (1995)
5 | Copycat (1995)
We can then use merge to combine our movies dataframe with the ratings:
ratings = ratings.merge(movies)
ratings.shape
(100000, 5)
And wow, our dataframe looks nice. We can then build a DataLoaders object from this table. By default, it takes the first column for the user, the second column for the item (here our movies), and the third column for the ratings. In our case we need to change the value of item_name so that it uses the titles instead of the ids:
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
user | title | rating
---|---|---
542 | My Left Foot (1989) | 4
422 | Event Horizon (1997) | 3
311 | African Queen, The (1951) | 4
595 | Face/Off (1997) | 4
617 | Evil Dead II (1987) | 1
158 | Jurassic Park (1993) | 5
836 | Chasing Amy (1997) | 3
474 | Emma (1996) | 3
466 | Jackie Chan's First Strike (1996) | 3
554 | Scream (1996) | 3
To represent collaborative filtering in PyTorch we can’t just use the crosstab representation directly, especially if we want it to fit into our deep learning framework. We can represent our movie and user latent factor tables as simple matrices:
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
We can check the shapes to ensure everything is as expected.
user_factors.shape, movie_factors.shape
(torch.Size([944, 5]), torch.Size([1665, 5]))
To calculate the result for a particular movie and user combination, we have to look up the index of the movie in our movie latent factor matrix and the index of the user in our user latent factor matrix; then we can take the dot product of the two latent factor vectors. But looking up an index is not an operation our deep learning models know how to do; they know how to do matrix products and activation functions.
Fortunately, it turns out that we can represent an index lookup as a matrix product. The trick is to replace our indices with one-hot-encoded vectors.
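Here is a minimal sketch of that trick, using fastai's one_hot helper (this particular check isn't in the original notebook, and index 3 is an arbitrary example):

```python
# Multiplying the transposed factor matrix by a one-hot vector returns the
# same values as simply indexing into the matrix.
one_hot_3 = one_hot(3, n_users).float()
user_factors.t() @ one_hot_3   # identical to user_factors[3]
```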
Collaborative Filtering from Scratch
PyTorch already provides a Module class, which provides some basic foundations that we want to build on. We just add the name of this superclass after the name of the class that we are defining.
The final thing that you need to know to create a new PyTorch module is that when your module is called, PyTorch will call a method in your class called forward, and will pass along to that any parameters that are included in the call. Here is the class defining our dot product model:
If you haven't seen object-oriented programming before, it can be a little bit intimidating, but don't worry: it's only a few lines of code and is easy to read when you go through it line by line. And don't hesitate to look up a tutorial to get a sense of how object-oriented programming works.
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        # dot product of user and movie factors, then add both biases
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        # squash the result into the rating range
        return sigmoid_range(res, *self.y_range)
We now have our architecture ready. Let's train it and see how everything goes.
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.958848 | 0.956008 | 00:10 |
1 | 0.872096 | 0.876626 | 00:10 |
2 | 0.747695 | 0.839320 | 00:10 |
3 | 0.600863 | 0.822649 | 00:10 |
4 | 0.487104 | 0.822984 | 00:10 |
And wow, this looks fine with just 5 epochs. You can try training for more epochs and changing the learning rate to see whether you can do better. The wd argument stands for weight decay, which most deep learning practitioners also call L2 regularization, so it's worth knowing such jargon since that's what most research papers use.
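Conceptually, weight decay adds the sum of the squared parameters, scaled by wd, to the loss. The sketch below is just that idea written out; loss and parameters are placeholder names, not the actual training-loop variables:

```python
# Conceptual sketch of weight decay / L2 regularization: the wd * sum(w**2)
# penalty nudges the weights towards smaller values during training.
def loss_with_wd(loss, parameters, wd=0.1):
    return loss + wd * sum((p**2).sum() for p in parameters)
```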
Previously we used PyTorch's embedding layer to initialize our parameters; we can decide not to do that and build our own instead. Let's do that. We begin by creating a function that takes a size as an argument and returns a tensor of randomly initialized parameters.
def create_params(size):
return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
We can then use this function instead of PyTorch's embedding layer.
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
Let’s train again.
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.944513 | 0.958574 | 00:11 |
1 | 0.847001 | 0.873356 | 00:10 |
2 | 0.736139 | 0.838051 | 00:11 |
3 | 0.584478 | 0.821091 | 00:11 |
4 | 0.484529 | 0.821726 | 00:10 |
And wow, it even looks a little bit better than using PyTorch's embedding layer. This shows you the power of doing things from scratch.
Interpreting Embeddings and Biases
To be an expert deep learning practitioner you need to be able to interpret models; there are some insights you can't get from the data until you run the models. For example, let's try interpreting the biases. We use argsort to sort the values in ascending order and return the indices of the first five. When we use those indices to look up the title class, we get the movie names and print them.
movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]
['Children of the Corn: The Gathering (1996)',
'Lawnmower Man 2: Beyond Cyberspace (1996)',
'Robocop 3 (1993)',
'Home Alone 3 (1997)',
'Mortal Kombat: Annihilation (1997)']
The interpretation is that these movies have very low biases. Say Robocop 3 in this list is an action movie (I am not sure it is one); even people who like that kind of movie tend not to like it. This implies that this particular movie won't be recommended to many users.
We can do the same for movies with a high bias by passing descending=True to argsort. This prints the movies that are generally liked. Take Titanic, for example: a user might like it even if he or she doesn't like romantic movies, so such a movie will be recommended to many users.
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
['L.A. Confidential (1997)',
'Shawshank Redemption, The (1994)',
"Schindler's List (1993)",
'Titanic (1997)',
'Silence of the Lambs, The (1991)']
We can also interpret the embeddings. We have 50 factors, and representing them all on a 2D graph is not possible, so we can use PCA to reduce the factors to just a few dimensions and plot two of them. Surprisingly, similar movies tend to cluster near each other; this is something we didn't program. We didn't even do any feature engineering, but the model does this for us. And as I said, these are insights you cannot get from the data until you run the models.
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x, y, i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()
All of this that we've written from scratch can be done using fastai's collab_learner framework. Let's try it and see whether we get similar results.
Using fastai Collab-Learner
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.962636 | 0.950697 | 00:10 |
1 | 0.851842 | 0.885249 | 00:12 |
2 | 0.757142 | 0.839390 | 00:11 |
3 | 0.572357 | 0.825587 | 00:11 |
4 | 0.490504 | 0.826604 | 00:11 |
We can also take a look inside our model:
learn.model
EmbeddingDotBias(
(u_weight): Embedding(944, 50)
(i_weight): Embedding(1665, 50)
(u_bias): Embedding(944, 1)
(i_bias): Embedding(1665, 1)
)
And, reassuringly, when we check the movies with the largest bias weights, we get results very similar to those from our model built from scratch.
movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
['L.A. Confidential (1997)',
'Silence of the Lambs, The (1991)',
'Titanic (1997)',
'Shawshank Redemption, The (1994)',
'Star Wars (1977)']
These days the cool kids are using deep learning. Let's try it out and see how it works as well.
Deep learning
fastai has a function get_emb_sz that recommends embedding sizes to use for the factors, though you can use different sizes if you prefer.
embs = get_emb_sz(dls)
embs
[(944, 74), (1665, 102)]
And like in other deep learning models, we need a sequential stack of linear layers and ReLU activations to pass our data through. In the forward function we concatenate the embeddings for the users and the movies, and this is what we pass to the sequential layers. PyTorch will then calculate the gradients for us and handle stochastic gradient descent as well.
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
    def forward(self, x):
        embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)
We can train as usual.
model = CollabNN(*embs)

learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.921936 | 0.952098 | 00:12 |
1 | 0.868925 | 0.927091 | 00:10 |
2 | 0.885512 | 0.885665 | 00:11 |
3 | 0.850595 | 0.873478 | 00:11 |
4 | 0.761019 | 0.877106 | 00:11 |
Surprisingly, all our previous models beat the deep learning approach. This shows why deep learning shouldn't always be the first choice; at times simpler models work perfectly well.
Thank you, and I hope this tutorial has been helpful.