Presenting users with the most relevant information is an important task for any product to fulfill. To do this properly, you need to be able to extract their preferences from your raw data. Here’s a framework for you to start doing that.
Deducing interpretations from your raw data can be tricky, because to succeed you need to:
- Understand what the users’ needs are: You will typically only have very limited, implicit data about what a user might be interested in. For instance, Netflix needs to infer its users’ movie preferences from the movies they have watched previously. The users won’t explicitly tell Netflix what they like.
- Prioritise all matches: Even if a company like Netflix is able to model user preferences in movies satisfactorily, it still has a big problem: There are more than 50,000 movies out there, of which thousands may fit the user’s preferences. Which movies should Netflix recommend first?
As a data scientist at OfferZen, I was recently involved in implementing a recommender system. Since everybody knows how Netflix works, I am going to explain the underlying concept using the example of movie recommendations. I’ll give you some guidance on how you can get started building your own and share some practical learnings from our own implementation.
There are two main methods for selecting what to recommend:
- Collaborative filtering: Items, for example movies, are recommended based on how similar your user profile is to other users’ profiles. The method finds the users that are most similar to you and then recommends items that they have shown a preference for. It suffers from the so-called cold-start problem: If there is a new movie, no one will have liked or watched it yet, so it is not going to appear in your list of recommended movies, even if you’d love it.
- Content-based filtering: This method uses attributes of the content to recommend similar content. It doesn’t have this cold-start problem because it works through attributes or tags of the content, such as actors, genres or directors, so new movies can be recommended right away.
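To make the cold-start problem concrete, here is a minimal sketch of user-based collaborative filtering. The users, the liked movies and the title "Brand New Movie" are made up for illustration, and Jaccard overlap is just one simple similarity measure chosen here; it is not necessarily what Netflix or OfferZen use.

```python
# Hypothetical users and the movies each one liked.
likes = {
    "you": {"Toy Story", "Monsters, Inc."},
    "user_a": {"Toy Story", "Monsters, Inc.", "Terminator 2"},
    "user_b": {"Terminator 2"},
}
catalogue = {"Toy Story", "Monsters, Inc.", "Terminator 2", "Brand New Movie"}


def jaccard(a, b):
    """Overlap between two sets of liked movies (0 = nothing shared, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_collaborative(user, likes):
    """Recommend what the most similar other user liked and `user` has not."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return likes[nearest] - likes[user]


recommendations = recommend_collaborative("you", likes)

# Movies nobody has liked yet can never be recommended this way.
never_recommended = catalogue - set().union(*likes.values())
print(recommendations, never_recommended)
```

In this toy setup the new movie never shows up, simply because no user has interacted with it yet. A content-based method only needs the movie's tags.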
Based on this, I’m going to introduce you to content-based filtering for a movie recommender system. I’ll use Python as the programming language for the implementation.
Step 1: Choosing your data
The first thing to do when starting a data science project is to decide what data sets are going to be relevant to your problem. This stage of the project is referred to as data selection and is highly important: if you choose the wrong data source, you won’t get good results, however good your model is.
Whenever you're dealing with content-based filtering, you’ll need to find those attributes of your content that you think are relevant to the problem. That way, you can later rank the content for your users or recommend relevant parts to them.
Here’s how this would look for our movie recommendation example:
I’m using the publicly available MovieLens data set. This data set consists of a sequence of tags such as actors, genres, moods, events or directors for each movie. These tags were generated using user-contributed content including ratings and textual reviews. We’ll collectively refer to the tags associated with a given movie as a document. For example, the movie Toy Story has 178 tags in our chosen data set, some of which are:
pixar animation bullying fun unusual plot structure happy ending action space destiny 3d loneliness
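As a rough sketch of what "a document per movie" means in code, the snippet below groups individual (movie, tag) rows into one list of tags per movie. The rows are hypothetical and hand-written; the actual file layout of the MovieLens data is not assumed here.

```python
from collections import defaultdict

# Hypothetical (movie, tag) rows; the real data set has many more tags per movie.
tag_rows = [
    ("Toy Story", "pixar"),
    ("Toy Story", "animation"),
    ("Toy Story", "happy ending"),
    ("Terminator 2", "action"),
    ("Terminator 2", "time travel"),
]


def build_documents(rows):
    """Group tags by movie so that each movie maps to one list of tags (its document)."""
    documents = defaultdict(list)
    for movie, tag in rows:
        documents[movie].append(tag)
    return dict(documents)


documents = build_documents(tag_rows)
print(documents["Toy Story"])
```

Keeping each tag as its own list entry avoids splitting multi-word tags such as "happy ending" into separate words later on.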
How do you extract data that is relevant for the content that you want to recommend? This depends on your specific problem and what data is available or can be collected: At OfferZen, for example, we used a company’s activity on our platform as the main data to build the indicator for company preference and what they are looking for.
The models used in data science are fundamentally mathematical in nature and thus require us to represent the data in vector format - an array of numbers stored in memory. These vectors are called feature vectors. In content-based recommender systems, the term content vectors is also used.
So how do we convert the above tags into a vector representation?
Step 2: Encoding your data
There are a number of popular encoding schemes.
For our example, we will use the term frequency–inverse document frequency (TF-IDF) encoding scheme.
The advantage of TF-IDF encoding is that it will weigh a term (a tag for a movie in our example) according to the importance of the term within the document: The more frequently the term appears, the larger its weight will be. At the same time, it weighs the term inversely to its frequency across the entire dataset: It will emphasise terms that are relatively rare in the general dataset but important to the specific content at hand. That means that words such as ‘is’, ‘are’, ‘by’ or ‘a’, which are likely to show up in every movie description but aren’t useful for our recommendations, will be weighed less than words that are more unique to the content that we are recommending.
The formula used to calculate the TF-IDF weight for term i in document j is:
w[i,j] = tf[i,j] * log(N / df[i])
where tf[i,j] is the term frequency of term i in document j, df[i] is the document frequency (the number of documents that contain term i) and N stands for the total number of documents in the dataset.
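The formula can be applied directly with the standard library. The sketch below uses three hypothetical toy documents (not MovieLens data) and the raw tag count as the term frequency. Library implementations often add smoothing and normalisation, so their numbers may differ from this plain version.

```python
import math
from collections import Counter

# Toy documents (lists of tags); hypothetical, not taken from MovieLens.
documents = {
    "Movie A": ["animation", "fun", "fun", "space"],
    "Movie B": ["action", "space"],
    "Movie C": ["action", "destiny"],
}


def tf_idf(documents):
    """Return {document: {term: weight}} using w[i,j] = tf[i,j] * log(N / df[i])."""
    n_docs = len(documents)
    # df[i]: number of documents that contain term i at least once.
    df = Counter(term for tags in documents.values() for term in set(tags))
    weights = {}
    for name, tags in documents.items():
        tf = Counter(tags)  # tf[i,j]: raw count of term i in document j
        weights[name] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


weights = tf_idf(documents)
for term, weight in sorted(weights["Movie A"].items()):
    print(f"{term}: {weight:.3f}")
```

In "Movie A", the tag "fun" gets the largest weight because it appears twice and in no other document, while "space" is down-weighted because it also appears in "Movie B".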
A TF-IDF-encoded document will look like this:
array([ 1., 0.46036753, 0.16328608, ..., 0.29024403, 0.36014058, 0.23019143])
Each element in the vector represents a TF-IDF weight associated with a term in a document.
Step 3: Recommending content
Recommending content involves making a prediction about how likely it is that a user is going to like the recommended content, buy an item or watch a movie.
There is a large number of methods and a large body of literature available on recommender systems.
We are going to use a simple similarity-based method called cosine similarity as it is easy to understand, but does a good job at illustrating the fundamental concept of making recommendations.
I’ll use Python and the numerical library NumPy for illustration, where x and y are two documents represented by the feature vectors introduced in Step 1:
x = [2, 0, 1]
y = [2, 0, 1]
Vectors have direction and magnitude. Because of this, we can calculate the angle between two vectors. A popular measure in data science is the cosine of this angle computed as follows:
cos(x,y) = dot(x,y) / (|x| * |y|)
This measure will equal 1 when the vectors are parallel (they point in the same direction) and 0 when the vectors are orthogonal. Vectors that point in the same direction are more similar than vectors that are orthogonal.
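A minimal NumPy sketch of this measure could look as follows. The helper name `cosine_similarity` and the third vector `z` are introduced here for illustration only, and returning 0.0 for a zero vector is a choice made in this sketch, since the angle is undefined in that case.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine of the angle between x and y: dot(x, y) / (|x| * |y|)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    norm_product = np.linalg.norm(x) * np.linalg.norm(y)
    if norm_product == 0.0:
        # The angle is undefined for a zero vector; 0.0 is a convention chosen here.
        return 0.0
    return float(np.dot(x, y) / norm_product)


x = [2, 0, 1]
y = [2, 0, 1]
z = [0, 3, 0]  # hypothetical third vector, orthogonal to x

same = cosine_similarity(x, y)        # parallel vectors: approximately 1
orthogonal = cosine_similarity(x, z)  # orthogonal vectors: 0
print(same, orthogonal)
```

Because TF-IDF weights are never negative, the similarity between two content vectors stays between 0 and 1.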
Now we start to see how this can be helpful to us: For example, the movies Toy Story and Monsters, Inc have a cosine similarity of 0.74. We would have expected these movies to have a relatively high similarity. In contrast, the cosine similarity between the movies Toy Story and Terminator 2 is 0.28 - as expected much lower.
We can now recommend movies based on the movies that a user has already watched or rated using the cosine similarity. We would recommend movies with the largest similarity to the ones already highly rated by the user.
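The ranking step can be sketched in a few lines of plain Python. The three-dimensional content vectors below are hypothetical stand-ins for real TF-IDF vectors, so the scores will not match the 0.74 and 0.28 quoted above; only the ordering is meant to illustrate the idea.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equally long lists of numbers."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical content vectors; real TF-IDF vectors have one entry per tag.
content_vectors = {
    "Toy Story": [0.9, 0.1, 0.0],
    "Monsters, Inc.": [0.8, 0.2, 0.1],
    "Terminator 2": [0.1, 0.9, 0.3],
}


def most_similar(title, vectors, top_n=2):
    """Rank all other movies by cosine similarity to `title`, best first."""
    target = vectors[title]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != title
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


ranking = most_similar("Toy Story", content_vectors)
for other, score in ranking:
    print(f"{other}: {score:.2f}")
```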
Generating user preference profiles
Instead of recommending movies based on specific movies that a user has already watched, we could also attempt to build profiles of the users' preferences.
This will allow us to gain an aggregate view of the users’ preferences and then recommend content based on their behaviour over time without skewing the recommendations by outliers.
Let’s take user #1 in the dataset. This user has rated the following movies on a scale from 0.5 (dislike) to 5 (like):
    title                                              rating
0   Braveheart (1995)                                  1.0
1   Basketball Diaries, The (1995)                     4.5
2   Godfather, The (1972)                              5.0
3   Godfather: Part II, The (1974)                     5.0
4   Dead Poets Society (1989)                          5.0
5   Breakfast Club, The (1985)                         4.0
6   Sixth Sense, The (1999)                            4.5
7   Ferris Bueller's Day Off (1986)                    5.0
8   Fight Club (1999)                                  4.0
9   Memento (2000)                                     4.0
10  Donnie Darko (2001)                                5.0
11  Igby Goes Down (2002)                              5.0
12  Batman Begins (2005)                               4.0
13  Superbad (2007)                                    3.5
14  Dark Knight, The (2008)                            4.0
15  Iron Man (2008)                                    5.0
16  Star Trek (2009)                                   5.0
17  Harry Potter and the Half-Blood Prince (2009)      5.0
18  Sherlock Holmes (2009)                             5.0
19  Harry Potter and the Deathly Hallows: Part 1 (...  5.0
20  The Hunger Games (2012)                            2.5
21  Sherlock Holmes: A Game of Shadows (2011)          5.0
22  Perks of Being a Wallflower, The (2012)            5.0
23  Hobbit: An Unexpected Journey, The (2012)          0.5
24  Django Unchained (2012)                            4.0
25  Whiplash (2014)                                    5.0
How do we build a preference profile for this user?
There are many ways to build the preference profile. For simplicity, I will take a less principled approach: I take the mean of the TF-IDF vectors of the movies the user has rated, weighted by the user’s ratings. This simple weighted mean will then constitute the user’s preference profile.
All we have to do now is compute the cosine similarity between the user profile vector and each content vector. We can then recommend the most similar items.
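Here is a minimal sketch of both steps, assuming the ratings themselves are used as weights. The two-dimensional content vectors and the title "Some Fantasy Film" are made up for illustration; they are not the TF-IDF vectors behind the results in this article.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equally long lists of numbers."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def build_profile(ratings, content_vectors):
    """Rating-weighted mean of the content vectors of the movies a user rated."""
    total = sum(ratings.values())  # assumes at least one positive rating
    weighted = (
        [rating * value for value in content_vectors[title]]
        for title, rating in ratings.items()
    )
    return [sum(column) / total for column in zip(*weighted)]


def recommend(profile, content_vectors, seen, top_n=3):
    """Rank movies the user has not rated yet by similarity to the profile."""
    scored = [
        (title, cosine(profile, vec))
        for title, vec in content_vectors.items()
        if title not in seen
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical content vectors with two entries: "crime" and "fantasy".
content_vectors = {
    "Godfather, The (1972)": [1.0, 0.0],
    "Hobbit: An Unexpected Journey, The (2012)": [0.0, 1.0],
    "Goodfellas": [0.9, 0.1],
    "Some Fantasy Film": [0.1, 0.9],
}
ratings = {
    "Godfather, The (1972)": 5.0,
    "Hobbit: An Unexpected Journey, The (2012)": 0.5,
}

profile = build_profile(ratings, content_vectors)
recommendations = recommend(profile, content_vectors, seen=set(ratings))
print(profile, recommendations)
```

Highly rated movies pull the profile towards their tags, while poorly rated ones contribute very little, so the crime film ranks first in this toy example.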
Based on this, user #1's top recommendations are:
- The Shawshank Redemption
- Logan
- Stand by Me
- American Beauty
- 11.22.63
- City of God
- The Usual Suspects
- Goodfellas
Based on the user’s rating of other movies, these appear to be good recommendations for this user.
A drawback of the weighted mean approach is that it will tend to give recommendations that are just that: the mean of preferred items. This can easily be a problem when a user’s interests sit at opposite ends of the spectrum, because the mean then lands somewhere in between. In order to address this issue, we could resort to using machine learning methods.
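A tiny, made-up example illustrates the effect. Assume a two-tag space ("comedy", "horror") and a user who loves one pure comedy and one pure horror film equally; all vectors are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equally long lists of numbers."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical two-tag space: first entry "comedy", second entry "horror".
liked = [[1.0, 0.0], [0.0, 1.0]]  # one pure comedy, one pure horror film
profile = [sum(column) / len(liked) for column in zip(*liked)]  # [0.5, 0.5]

pure_comedy = [1.0, 0.0]
comedy_horror_blend = [1.0, 1.0]

score_pure = cosine(profile, pure_comedy)
score_blend = cosine(profile, comedy_horror_blend)
print(f"pure comedy: {score_pure:.3f}, blend: {score_blend:.3f}")
```

The averaged profile ranks the comedy-horror blend above another pure comedy, although the user has only ever shown interest in pure films of either kind.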
What to do if you don’t have explicit user ratings?
Following our example of using movie ratings to recommend content, you might have realised that we are implicitly assuming that the user ratings are available. However, frequently there is no such explicit data. What to do in this case? The solution is to determine implicitly when a user liked or disliked an item.
At OfferZen, we deal with this in two ways:
- We track whenever a job seeker profile has been skipped and when it is viewed. On Netflix, one would be able to track if a user has actually watched a movie all the way through or stopped after the first few minutes.
- We register interview invitations to job seekers as a stronger signal of intent than a profile view. This way, we can generate an implicit rating without explicitly asking companies to rate their interest, which would frequently put unnecessary cognitive load on the user. On Netflix, adding a movie to a user’s “saved movies list” might be given less weight than a “like”.
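One way to turn such signals into implicit ratings is sketched below. The event names and the numeric weights are hypothetical; only their ordering (skip, then view, then interview invitation) follows the description above, and this is not OfferZen's actual implementation.

```python
# Hypothetical weights: the ordering follows the text, the numbers are made up.
SIGNAL_WEIGHTS = {
    "skipped": 0.0,
    "viewed": 1.0,
    "interview_invite": 3.0,
}


def implicit_ratings(events):
    """Collapse (user, item, event) rows into one score per (user, item).

    The strongest signal observed for a pair wins; summing the weights
    would be another reasonable choice.
    """
    ratings = {}
    for user, item, event in events:
        key = (user, item)
        ratings[key] = max(ratings.get(key, 0.0), SIGNAL_WEIGHTS[event])
    return ratings


# Hypothetical event log.
events = [
    ("company_1", "candidate_a", "viewed"),
    ("company_1", "candidate_a", "interview_invite"),
    ("company_1", "candidate_b", "skipped"),
    ("company_2", "candidate_a", "viewed"),
]

ratings = implicit_ratings(events)
print(ratings)
```

The resulting scores can then play the role of the explicit ratings when building a preference profile.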
I’d love to hear your feedback and suggestions. Are you currently implementing, or planning to implement, data science projects in your company?
- Github repo containing the source code for this article
- Good tutorial on TF-IDF
- Useful tutorial on cosine similarity
- Coursera specialisation on Recommender Systems
- The MovieLens dataset
Helge Reikeras is a Data Scientist at OfferZen. He has recently been involved in the implementation of a candidate recommender system at OfferZen. He is on a mission to democratise Machine Learning and Data Science and help make this new and exciting technology more accessible to people and companies around the world. You can best catch him on Twitter or Github.