You are on page 1of 1

B656: Web Mining, Prof.

Fil Menczer
Project proposal: Learning to Predict Movie Ratings From the Netflix Dataset

The Netflix Prize is a competition in which contestants are asked to devise a learning algorithm that
will accurately predict movie ratings based on users' past rating history. The prize was set up in 2006
by netflix.com to encourage the development of an algorithm which, compared to Netflix's own
Cinematch algorithm, will perform at least 10% better with regard to the root mean squared error
(RMSE) of predicted ratings against actual ratings. Since the start of the competition hundreds of teams
have entered it and the current leading solutions improve Cinematch by more than 9%.

In our project, we propose to use the Netflix dataset to investigate and compare the accuracy with
which supervised learning algorithms predict movie ratings by users. This will include implementing
classifiers such as Naïve Bayes and SVM (and potentially others as we do some more research), as well
as investigating metrics for similarity between users.

In addition, we intend to extend the original Netflix dataset with additional features. This will let us
compare how the algorithms perform on data of higher dimension, and it will also allow us to
investigate whether the learning algorithm or the availability of more background knowledge affects
the performance of the classification more. We suspect that using additional data features will do more
to improve our results than the choice of algorithm.

In more detail, the current Netflix data set consists of <UserID, Rating, Date> data items for each of
about 18,000 movies identified by movie ids. There is also a mapping of ids to the titles of the movies.
We plan to use the movie titles to crawl movie portal sites (such as IMDB or Rotten Tomatoes, this is
still to be decided), and gather additional features like the following: genre, director, writer, lead actor,
lead actress. The exact set of features might change slightly; we might consider adding more features if
we see fit. Additionally, if time allows, we would like to mine movie reviews in order to come up with
a list of adjectives (e.g. “boring”, “sad”, “sentimental”, etc.) that describe the movie, and use these with
our data as well.

Since the Netflix dataset is huge we will consider dividing our datasets into several groups according to
a related property (e.g., genre) and then build training/test sets by selecting elements of each group. In
addition, we will consider using Latent Semantic Index (LSI) to reduce the computation time of some
of our training algorithms. Since the dataset given by Netflix is in a plain text format, we will import
the data to postrgeSQL which we believe can provide good performance with a large size of data.

We consider approaching the initial implementation of the classifiers and the setting up of our crawling
infrastructure in parallel. Before we begin any implementation, we need to research and evaluate
existing APIs or open source projects to help with the following portions of our project:
– Classification algorithms
– Crawler
– HTML parser
– Some storage/indexing system

You might also like