Dan Cocuzzo
Stephen Wu
dcocuzzo@stanford.edu
shw@stanford.edu
Introduction
Feature films are a multibillion-dollar industry. Given the sheer number of films produced as well as the level of scrutiny to which they are exposed, it may be possible to predict the success of an unreleased film based on publicly-available data. A large amount of data representing feature films, maintained by the Internet Movie Database (IMDb), was extracted and prepared for use in training several machine learning algorithms. The goal of this project is to build a system that can closely predict the average user rating and degree of profitability of a given movie by learning from historical movie data. Since there is a strong correlation between a film's budget and its gross US earnings, predicting raw gross earnings is not particularly indicative of a film's success. Instead, we transform the gross earnings of a film into a multiple of its budget, which is a much more meaningful indicator of a film's success.
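The budget-multiple transformation described above can be sketched in a few lines. The function name and its interface are illustrative, not taken from the project's code:

```python
def budget_multiple(gross_usd, budget_usd):
    """Express a film's US gross earnings as a multiple of its budget.

    A multiple of 1.0 means the film earned back exactly its budget in
    US theaters; values above 1.0 indicate domestic profitability.
    """
    if budget_usd <= 0:
        raise ValueError("budget must be positive")
    return gross_usd / budget_usd

# A $20M film grossing $50M domestically has a budget multiple of 2.5.
print(budget_multiple(50_000_000, 20_000_000))  # 2.5
```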
Data
Movie data is drawn from the Internet Movie Database
(IMDb), found at http://www.imdb.com/.
Access
IMDb makes its data publicly available for research purposes, from which a local, offline database can be populated. FTP servers sponsored and maintained by IMDb contain stores of flat-format .list files holding the same information found online through IMDb's web interface. For this project, data access and preparation were facilitated by two existing software systems: sqlite and imdbpy. sqlite is a widely-used SQL implementation supporting all standard SQL constructs and can be used to query all information in the database in a high-level, declarative manner. imdbpy is a freely-available Python package designed for use with the IMDb database that implements various functions to search through and obtain data. Python scripts were developed to automatically pull the required feature data from the local sqlite database.
Pruning
The full database contains nearly 3 million titles, of which
roughly 700,000 are feature films. Many of the titles found in
the database contain incomplete information or are inappropriate for the scope of this investigation. Thus, in an attempt
to both decrease training time and increase the accuracy of
the prediction, the full title list was pruned using a series of
SQL queries. The criteria by which IMDb titles were omitted
are as follows:
- Titles that are not movies (e.g., TV series, video games)
- Adult films
- Films missing budget data in US dollars
- Films missing gross earnings data in US dollars
- Films missing user rating data
- Films not released in the United States
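The pruning criteria above amount to a single declarative SQL filter. The sketch below demonstrates the idea on an in-memory database with an invented schema; the table and column names are illustrative only, and the real IMDb/imdbpy sqlite schema differs:

```python
import sqlite3

# Hypothetical miniature schema standing in for the local IMDb database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE title
                (name TEXT, kind TEXT, is_adult INTEGER,
                 budget_usd INTEGER, gross_usd INTEGER,
                 rating REAL, released_us INTEGER)""")
conn.executemany("INSERT INTO title VALUES (?,?,?,?,?,?,?)", [
    ("Keeper",   "movie",     0, 20_000_000, 50_000_000, 7.1, 1),
    ("TV Thing", "tv series", 0, None,       None,       8.0, 1),
    ("No Data",  "movie",     0, None,       1_000_000,  6.0, 1),
    ("Foreign",  "movie",     0, 5_000_000,  9_000_000,  6.5, 0),
])

# Mirror of the pruning criteria: movies only, non-adult, complete
# US budget/gross/rating data, and released in the US.
kept = conn.execute("""
    SELECT name FROM title
    WHERE kind = 'movie'
      AND is_adult = 0
      AND budget_usd IS NOT NULL
      AND gross_usd IS NOT NULL
      AND rating IS NOT NULL
      AND released_us = 1
""").fetchall()
print(kept)  # [('Keeper',)]
```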
After pruning the entire database of nearly 3 million entries, only 4,260 titles remain (roughly 0.14% of the original database). While this quantity is a tiny fraction of the overall database, the pruning constraints are justifiable for the purposes of this prediction system: the pruned titles include films that were not released in major theater circuits, films for which we cannot generate labels, and films not released in the US.
Note that gross earnings reported in US dollars (our focus here) correspond to earnings from US theaters only, and therefore the financial metrics for film success are strictly an indicator of domestic performance.
Features
Currently, the following features are drawn from each
training film:
- cast list
- director(s)
- producer(s)
- composer(s)
- cinematographer(s)
- distributor(s)
- genre(s)
- MPAA rating
- budget
- release month
- runtime
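A training film's features might be represented as a simple mapping from feature names to values. The field names and values below are hypothetical, chosen only to mirror the feature list above; they are not the project's actual data structures:

```python
# Illustrative feature record for one training film. Multi-valued
# fields (cast, genres, ...) are lists; scalar fields are numbers
# or strings. All field names here are invented for illustration.
example_film = {
    "cast": ["Actor A", "Actor B", "Actor C"],
    "directors": ["Director X"],
    "producers": ["Producer Y"],
    "composers": ["Composer Z"],
    "cinematographers": ["DP W"],
    "distributors": ["Studio Q"],
    "genres": ["Drama", "Thriller"],
    "mpaa_rating": "PG-13",
    "budget": 20_000_000,   # US dollars
    "release_month": 10,    # 1-12
    "runtime": 112,         # minutes
}

# Labels to predict: average user rating and budget multiple.
example_labels = {"rating": 7.1, "bmult": 2.5}
```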
AUTUMN 2013
Prediction. Prediction is performed using the built-in libsvm prediction function, applied to the previously-fitted model.
Parameter Tuning
Naïve Bayes
Cast list length. The MAX_ACTORS parameter determines how far down the cast list the model looks in order to train and predict on an example. It is reasonable to assume that the most important actors (i.e., the ones who receive the most screen time and publicity) tend to appear high on the cast list. Increasing this parameter improves accuracy locally, but in the long term leads to increased computational and storage complexity, since we must store conditional probabilities for a greater number of actors. Additionally, the naïve Bayes classifier is not cognizant of the cast list ordering, since it predicts based only on inclusion. This means that extremely high values of this parameter will cause minor cast members to heavily sway the prediction. Several values of MAX_ACTORS were used to train and test on the dataset (see Table 1); 10 was chosen as it yielded an acceptable balance of prediction accuracy and complexity.
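Because the classifier conditions only on inclusion of a cast member, truncating the cast list is a simple preprocessing step. A minimal sketch, where the helper function itself is illustrative (only MAX_ACTORS = 10 comes from this report):

```python
MAX_ACTORS = 10  # value chosen via the sweep reported in Table 1

def cast_features(cast_list, max_actors=MAX_ACTORS):
    """Keep only the top of the cast list. Inclusion is all that
    matters to the naive Bayes model, so a set suffices."""
    return set(cast_list[:max_actors])

# A 50-person cast is reduced to its 10 highest-billed members.
long_cast = [f"actor_{i}" for i in range(50)]
feats = cast_features(long_cast)
print(len(feats))                                  # 10
print("actor_0" in feats, "actor_49" in feats)     # True False
```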
Output bin boundaries. Output bins for the budget multiple were selected such that the distribution of movies within the bins is relatively uniform; see Figure 5 for the resulting distribution. This decreases the chance that a large budget-multiple prior dominates the prediction. For example, using bins of size 0.25 leads to a relatively high prior for the "0-0.25" budget-multiple bin (i.e., films that tank tend to tank badly), causing most predictions to be placed into this bin. While this performs respectably in terms of error rate, the result is not very enlightening and leads to a high false negative rate for strong movies.
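Choosing bin boundaries so that the classes are roughly uniform amounts to taking empirical quantiles of the training budget-multiples. The helper below is an illustrative sketch, not the project's code (the boundaries actually used appear in Figure 5):

```python
def quantile_bin_edges(values, n_bins):
    """Interior bin edges at the k/n_bins empirical quantiles, so each
    bin receives roughly the same number of training examples."""
    xs = sorted(values)
    return [xs[len(xs) * k // n_bins] for k in range(1, n_bins)]

# Toy data, heavily skewed toward small budget multiples
# (as in the real distribution: films that tank tend to tank badly):
bmults = [0.1, 0.1, 0.2, 0.2, 0.3, 0.5, 0.8, 1.5, 3.0, 9.0]
edges = quantile_bin_edges(bmults, 5)
print(edges)  # [0.2, 0.3, 0.8, 3.0]
```

With fixed-width bins most of these examples would crowd into the lowest bin; the quantile edges spread them across all five classes.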
Since the ratings prediction contains more bins, we did not carry out this procedure for that model: the resulting gains would be smaller, and it would render the output less readable (e.g., it may not make sense to have a "3.5-4" rating bin). Note that this means good movies tend to be underrated by the prediction system, and bad movies tend to be
overrated. We deemed this acceptable, as particularly strong
or weak movies will still stand out in the predicted ratings.
Support Vector Machine
There are two parameters to be chosen: C, the SVM penalty parameter, and γ, the RBF kernel parameter. A grid search was performed on these parameters, varying them independently and exponentially from 2^-2 to 2^7. From this, (C_rating, γ_rating) and (C_bmult, γ_bmult) are chosen to minimize prediction error. See Figure 1 and Figure 2 for the results of this sweep. The values (C = 0.5, γ = 0.5) were found to work well for both predictive models.
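The sweep can be sketched as an exhaustive loop over powers-of-two (C, γ) pairs. Here cv_error is a stand-in for whatever cross-validated error the real pipeline computes (the toy error surface below is invented purely to exercise the search):

```python
import itertools

def grid_search(cv_error, exponents=range(-2, 8)):
    """Exhaustive sweep over powers-of-two (C, gamma) pairs, returning
    the pair minimizing the caller-supplied cv_error(C, gamma)."""
    grid = [2.0 ** e for e in exponents]
    return min(itertools.product(grid, grid),
               key=lambda cg: cv_error(*cg))

# Toy error surface whose minimum happens to sit at C = gamma = 0.5:
best = grid_search(lambda C, g: (C - 0.5) ** 2 + (g - 0.5) ** 2)
print(best)  # (0.5, 0.5)
```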
Results
The predictors performed moderately well on the test data.
Qualitatively, many of the rating and bmult predictions were
exactly correct, while those that were incorrect were "close" in the sense that the predicted value-bin was typically adjacent to the actual value-bin. For instance, an incorrectly predicted movie with a true average user rating of 7 is most often predicted as a 6. Figures 7, 8, 9, and 10 show the confusion
matrices of rating and bmult prediction results to illustrate
the distribution of misclassifications for rating and bmult. We
not only report absolute error (correct or incorrect classification) for each test sample, but also the error as a measure of
absolute distance from the true value. The absolute distance
error of rating and bmult predictions is a valuable indicator of our system's performance, especially when misclassified test samples are close to the true rating or bmult bin.
Note that this is similar to the approach taken in typical regression problems, in which mean absolute error or mean
squared error is often the quality metric of choice. Indeed,
since our classification bins have a natural ordering for both
average user rating and bmult gross earnings, such a metric is
likely to be more accurate than a simple measure of error. If not for the issues described earlier in constructing a meaningful SVM feature vector, a regression approach might well have yielded similar results.
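The bin-distance metric can be computed directly from predicted and true bin indices; a minimal sketch (the helper name is illustrative):

```python
def mean_bin_distance(predicted, actual):
    """Mean absolute distance between predicted and true bin indices.
    0 for a perfect classifier; each adjacent-bin miss contributes 1/n."""
    assert len(predicted) == len(actual) and predicted
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Three exact hits and one adjacent-bin miss:
print(mean_bin_distance([3, 5, 7, 6], [3, 5, 7, 7]))  # 0.25
```

Unlike 0/1 classification error, this metric rewards near misses, which is appropriate here because both rating and bmult bins carry a natural ordering.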
Naïve Bayes
The distribution of absolute distance error of average user
rating predictions is shown in Figure 3, and distribution of
absolute distance error of budget-multiple gross earnings
predictions is shown in Figure 4. Both figures report error for a 70%/30% holdout test. These figures also report the
priors for both problems, though conditional probabilities for
the personas/attributes are omitted for brevity.
A 10-fold cross-validation was performed across the entire set of 4,260 movie titles, and the test error rates for rating
and bmult predictions along with random prediction performance are shown in Figure 11. A summary of testing and
tuned parameters is provided in Table 2.
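The 10-fold split can be sketched in a few lines of plain Python. This is a stand-in for however the real pipeline partitioned the titles; note that a real cross-validation would typically shuffle the examples before splitting:

```python
def k_fold_indices(n, k=10):
    """Partition indices 0..n-1 into k nearly equal contiguous folds;
    each fold serves once as the held-out test set."""
    fold_size, remainder = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 4,260 titles split evenly into 10 folds of 426.
folds = k_fold_indices(4260, 10)
print(len(folds), len(folds[0]))  # 10 426
```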
Support Vector Machine
Alongside the Naïve Bayes model's performance, Figures 4 and 6 show the distributions of absolute distance error of average user rating predictions and budget-multiple gross earnings predictions, respectively, for a 70%/30% holdout test.
10-fold cross-validation was also performed for SVM, the
results of which are shown in Figure 12.
Discussion
A few general points to take away from the results of this
experiment:
Table 1
Test error for several values of MAX_ACTORS

MAX_ACTORS    Rating error    BMult error
0             0.604           0.536
2             0.604           0.523
4             0.611           0.523
6             0.608           0.523
8             0.601           0.521
10            0.600           0.521
Figure 5. Earnings (distribution of budget-multiple bins)
Figure 8. Confusion matrix for SVM rating prediction
Table 2
Example test results, 70/30 holdout validation

Model         MAX_ACTORS    C      γ      Rating error    Rating dist.    BMult error    BMult dist.
Naïve Bayes   10            n/a    n/a    0.600           0.792           0.531          0.928
SVM           n/a           0.5    0.5    0.621           0.832           0.587          1.122