
Data Mining & Unsupervised Clustering on IMDb Movie Database


Alwin Bethel
Sheila Tupker

Abstract. With a plethora of merit-based awards held up as measures of a movie's success (Oscars, Emmys, Golden Globes, People's Choice), the question arises: what determines a "good" movie? To answer this question, a dataset containing the Internet Movie Database's (IMDb) top 250 movies was analyzed to surface meaningful relationships that may yield insight. Machine learning models were applied to the dataset, focusing on mining and processing the data, feature selection, dimensionality reduction, and unsupervised clustering. The techniques employed in this study include web scraping, k-means clustering, the elbow method, text mining (a component of natural language processing (NLP)), and Principal Component Analysis (PCA). Because the dataset changes dynamically, our results reflect a single instance of the data; for that instance, with the aforementioned algorithms applied, our results indicated that movies which invoke an emotional response (according to the text mining and Principal Component Analysis) tend to perform better, as do movies that contain well-known actors or are created by acclaimed directors (according to the Principal Component Analysis). Our k-means unsupervised clustering produced eight clusters that grouped similar movies together.

Keywords: unsupervised learning, machine learning, clustering, analytics, web scraping, k-means, elbow method, text mining, principal component analysis, natural language processing, IMDb, movie data

1 Introduction & Problem

Our search for an interesting and viable project led us to movie rating data, chosen for its intrinsic significance, its practicality, and its capacity to engage thought. As we gathered ideas about how to analyze this data, the question arose: "What determines a 'good' movie?" In their 2018 study, McCullough and Conway state that award-winning films consistently tend to exhibit a lower degree of integrative complexity [1]. Integrative complexity is defined as "a person's ability to differentiate between the different but relevant perspectives of a problem and, at higher levels, the ability to integrate those perspectives in some coherent manner" [2]. Those conclusions come from the realm of human behavior, and the corresponding studies were conducted by psychologists. The conclusions drawn in our study, by contrast, rest on trends and analytical results derived from an IMDb dataset containing information on the top 250 movies of all time, as ranked by IMDb. The dataset consists of variables that naturally correspond to movie data, including but not limited to: movie title, release year, parental rating, genre, director, actors, and box office results. Typically, in a supervised learning model, a target variable is assigned and algorithms are applied to the corresponding data to surface trends; for the IMDb top 250 movies of all time, a target variable like "reception" or "rating" with values like "good" or "bad" would be applied. In this study, however, we focused on applying an unsupervised clustering algorithm, which naturally lacks a target variable and therefore allowed the freedom to surface trends without a predetermined target to fulfill. We anticipate that the resulting trends will offer insight into our original question, "What determines a 'good' movie?" This project seeks to agglomerate these intrinsic qualities and structure an informed prediction of the characteristics of a "good" movie. Moreover, the project systematically employs machine learning techniques, namely web scraping, k-means clustering, the elbow method, text mining, and dimensionality reduction via Principal Component Analysis (PCA), and integrates elements of natural language processing (NLP). Our use of predictive modeling rests on the ability to determine, from common characteristics within the data, why a movie is likely to be considered "good." This notion could be extended to determine whether a movie would be considered award-winning, but that is beyond the scope of this project. To reiterate, our initial questions are as follows: Given this particular dataset, can we predict why a movie would be considered a "good" movie? Based on our results, are there identifiable attributes among these movies that are more likely to contribute to a "good" movie? In practical terms, answering these questions does not take into account influences from other fields of study, such as the aforementioned work in human behavior; it relies solely on machine learning methods and statistical analysis.

2 Dataset
Our dataset is hosted by the Internet Movie Database (IMDb) and can be downloaded through its API (see source file for API key). The dataset we analyzed contains the 250 top-rated movies according to IMDb. Acquiring the data requires the accessor to input a movie ID, after which the relevant content for that movie is returned. Downloading the dataset occurs natively within Python, in our case inside a Jupyter notebook. To view the raw list of the top 250 movies (not including the relevant data that corresponds to each movie), please see the following link: http://www.imdb.com/chart/top. As a note to future accessors: the dataset changes dynamically (as new information is added, the rankings change), and all data collected and analyzed within this document reflects a single instance of the dataset. The data imported into Python contains 250 rows and 9 original columns (programmatic additions changed this number as categorical variables were converted to numerical form and corresponding dummy variables were added). Generally speaking, the dataset contains a higher number of features than samples. We have created a data dictionary, Table 1, which describes our raw dataset, includes details about the features, and can serve as a ledger for the methodologies in our research study.

Table 1. Data dictionary for the original data (before programmatic additions) [3]. Contents of the table appear on the following page. Legend: $ Plot of the movie, stored as a string. $$ Metascore is considered the rating of a film: scores are assigned to a movie's reviews by a large group of the world's most respected critics, and a weighted average is applied to summarize the range of their opinions.
3 Methodology
Our goal is to learn from the dataset why a movie is considered a "good" movie and to develop general trends or a prediction outcome. We endeavor to use unsupervised learning models to establish natural clusters within the top 250 movies of all time, using several intrinsic features of the dataset. We also apply pre-processing and dimensionality reduction techniques to improve the resulting clusters.
We followed a loosely scientific approach for this project, using a step-by-step trial-and-error methodology and a range of cited resources. Although we did not formulate an explicit hypothesis, we used the initial questions posed in our proposal as a guide. We resolved to experiment with what would work, what would not, and why, and we continued refining our model after answering our initial questions. Initially, we sought viable solutions by following online tutorials and guides. Ultimately, we applied k-means clustering, text mining, the elbow method, and Principal Component Analysis as our means of answering our initial questions. The following sub-sections describe our methods, from acquiring the data to employing unsupervised learning to answer our questions.

3.1 Collecting & Loading the Data

The initial process began with seeking out data corresponding to a topic of interest, choosing a predictive model, and writing Python code to execute it (see the beginning of the Methodology section for the holistic approach). Executing our code began with importing the various packages (and corresponding models) we deemed necessary. These packages include, but are not limited to: NumPy, Pandas, Requests, components of BeautifulSoup, components of sklearn, json, datetime, nltk (the natural language processing library), and Matplotlib. After importing our libraries, we imported the Top 250 Movies data from IMDb. This was a relatively lengthy process: the data was a direct download from IMDb's website and had to be accessed through an application programming interface (see the Dataset section above for more details). Acquiring the top 250 movies required inputting a movie ID, after which the relevant content for that movie was returned. Initially, we scraped the top 250 movies page from IMDb to acquire a list of IMDb's designated movie IDs. This list of IDs was then used to query the API, which returned a set of information for each value queried. The process was facilitated by the BeautifulSoup Python library, which parsed the HTML code and returned a cleanly structured set of information. This information consisted of general data about each movie, such as the plot, parental rating, language, actors, title, genre, etc. (please see the data dictionary, Table 1, for the features in our model). Within this set of information, we also queried the budget and revenue for each movie. From there, we performed a general analysis of the data, using code to determine unique values within a variable (unique function), discern natural trends within the data (head and tail functions), and understand the shape of our data (shape function).
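
The sketch below illustrates this collection pipeline under stated assumptions: the CSS selector for the chart page reflects IMDb's 2018-era markup, and API_URL and API_KEY are placeholders for the endpoint and key kept in our source file (the response fields are assumed to match the data dictionary in Table 1).

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Scrape the Top 250 chart and collect IMDb movie IDs (e.g. "tt0111161")
# from the title links, which have the form /title/tt0111161/.
chart = requests.get("http://www.imdb.com/chart/top")
soup = BeautifulSoup(chart.text, "html.parser")
movie_ids = [a["href"].split("/")[2] for a in soup.select("td.titleColumn a")]

API_URL = "https://example-movie-api.com/"  # placeholder; see source file
API_KEY = "YOUR_KEY_HERE"                   # placeholder; see source file

# Query the API once per movie ID; each response is a dict of movie fields.
records = [requests.get(API_URL, params={"i": mid, "apikey": API_KEY}).json()
           for mid in movie_ids]

movies = pd.DataFrame(records)  # 250 rows, one column per returned field
print(movies.shape)
print(movies.head())
```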

3.2 Pre-Processing the Data

We then proceeded to an initial feature selection process that essentially created a subset of the data (employing a new dataset, dataFrame, and adding and removing components as the project progressed), selecting the following variables: Title, Director, Actors, Genre, Production, Rated (Parental Rating), and Plot. Irrelevant variables, such as IMDB Rating and IMDB ID, were removed, as they did not pertain to the analysis. Data was also re-encoded to ensure the models employed could read it and form a viable solution. For example, the Year variable was converted to a categorical variable according to how recent the movie was, with movies released after 2000 (recent movies) labeled 1 and movies released before 2000 (non-recent) labeled 0. We cleaned up the Runtime variable and replaced it with MovLength, treated in the same binary fashion as Year. To remove a variable, we used the drop function. We cleaned some variables, such as Name, using a function (splitting) that separates an actor's full name into first-name and last-name columns, stored as separate lists. Another function used later in the project, freqSort, takes a column name as input and returns a list of that column's elements sorted by frequency of occurrence, in descending order. The plotColumn function plots a bar chart for a given column of the dataframe: it takes the column name as input and, contingent on a user-defined number of top elements, returns a bar chart of the values occurring most frequently in that column.
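
A minimal sketch of these steps follows, assuming the movies dataframe from Section 3.1; the 120-minute MovLength cutoff is a hypothetical threshold chosen for illustration, as the exact rule lives in the source code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assume `movies` is the raw dataframe loaded in Section 3.1.
dataFrame = movies[["Title", "Director", "Actors", "Genre", "Production",
                    "Rated", "Plot", "Year", "Runtime"]].copy()

# Recode Year as a binary flag: 1 for movies released after 2000, else 0.
dataFrame["Year"] = (dataFrame["Year"].astype(int) > 2000).astype(int)

# Replace Runtime (e.g. "142 min") with a binary MovLength flag; the
# 120-minute cutoff here is hypothetical, for illustration only.
minutes = dataFrame["Runtime"].str.extract(r"(\d+)")[0].astype(int)
dataFrame["MovLength"] = (minutes > 120).astype(int)
dataFrame = dataFrame.drop(columns=["Runtime"])

def freqSort(column):
    """Return a column's values sorted by frequency, descending."""
    return dataFrame[column].value_counts().index.tolist()

def plotColumn(column, top_n=10):
    """Bar chart of the top_n most frequent values in a column."""
    dataFrame[column].value_counts().head(top_n).plot(kind="bar")
    plt.show()
```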

3.3 Pre-Analysis by Visualization

Next, we used the aforementioned custom functions to perform visual analyses of the data and its subsets. Initially, we ran the plotColumn function on the Director variable to discern the top-occurring directors (line 18 of the source code). The freqSort function was used to further determine the top values in the Director variable. Unique values were extracted from the Genre variable and displayed as a bar chart using plotColumn (line 22 of the source code). Rated was then recast as binary variables, as opposed to a single categorical variable with several values; the splitting function was used to create new columns that serve as binary indicators, with 0 and 1 as possible values. Actors was analyzed by visualization, and via the freqSort function the top-occurring actors were added to the dataFrame. A similar procedure was performed for the Production variable, and the top production companies were added to our dataFrame. These visualizations can be found in the source code starting on line 18.
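
Below is a hedged sketch of this indicator-column step; the cutoff of ten top values is illustrative, and the two actor names (both of which surface again in Section 5) stand in for whatever freqSort actually returns.

```python
# Add 0/1 indicator columns for the most frequent directors and production
# companies, using the freqSort helper from Section 3.2; top_n is illustrative.
top_n = 10
for column in ["Director", "Production"]:
    for value in freqSort(column)[:top_n]:
        dataFrame[column + ": " + str(value)] = (
            dataFrame[column] == value).astype(int)

# Actors holds comma-separated names, so test for substring membership.
for actor in ["Leonardo DiCaprio", "Jack Nicholson"]:  # illustrative names
    dataFrame["Actor: " + actor] = dataFrame["Actors"].str.contains(
        actor, regex=False).astype(int)
```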

3.4 Text Mining & Integrating Concepts of Natural Language Processing

For any columns containing text-intensive content (for example, the Plot variable), a process was employed to extract the most frequently occurring words and create corresponding dummy columns. Plot was a challenging column from which to extract top words, as it contained many words reflecting natural spoken language (as opposed to a generic list of information, categorical data, or small descriptive clauses). From the IMDb top 250 data, and given that Plot reflects spoken language, the most commonly extracted words tended to be articles, conjunctions, prepositions, and pronouns common to natural language, like "the", "a", "and", "for", etc. (see Figure 1). Consequently, we compiled a list of the top 21 significant words occurring in this column and assigned it to a new variable, wordsOfInterest. The words appropriated to this variable were: "man", "boy", "him", "woman", "girl", "her", "love", "war", "journey", "murder", "friendship", "police", "battle", "beautiful", "team", "detective", "fight", "death", "crime", "struggles", and "family". The wordsOfInterest variable was then integrated into the existing dataFrame as a new variable, plotNew, using the functions discoverPlot and wordExist.
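
The following sketch reproduces this step under stated assumptions: nltk's English stopword list stands in for our manual filtering of natural-language filler words, and the substring test approximates the source code's discoverPlot and wordExist functions, which may differ in detail.

```python
from collections import Counter
from nltk.corpus import stopwords  # nltk.download("stopwords") on first run

# Tokenize all plots crudely and count word frequencies, dropping stopwords.
tokens = [t.strip(".,!?\"'()") for t in
          " ".join(dataFrame["Plot"]).lower().split()]
stop = set(stopwords.words("english"))
counts = Counter(t for t in tokens if t and t not in stop)
print(counts.most_common(30))  # basis for choosing wordsOfInterest

wordsOfInterest = ["man", "boy", "him", "woman", "girl", "her", "love",
                   "war", "journey", "murder", "friendship", "police",
                   "battle", "beautiful", "team", "detective", "fight",
                   "death", "crime", "struggles", "family"]

# One binary indicator column per word of interest (wordExist-style check).
for word in wordsOfInterest:
    dataFrame["plot: " + word] = dataFrame["Plot"].str.lower().str.contains(
        word, regex=False).astype(int)
```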

3.5 Dimensionality Reduction & Principal Component Analysis

At the halfway point of this project, we determined the shape of the dataFrame and realized we had 250 entries and 103 features, the amalgamation of the data from the previous sections of the project. In the context of modeling, a high number of features relative to the number of data points can mute relevant information. According to machine learning researcher Matthew Mayo, irrelevant attributes mixed in with powerful predictors have a negative effect on the resulting model [4]. Moreover, "…as the feature set grows, the feature space grows exponentially, pushing the relative separation between data points to reach parity. That means that the feature set ceases to provide predictive power for grouping similar points because all points are equally dissimilar" [5]. We recognized that further feature selection tools would need to be implemented to reduce dimensionality. After corresponding dummy variables were added for each of the existing variables within the dataFrame, a final dataframe called lastDataFrame was created containing said dummy variables, narrowed down by feature selection. Each variable is now a binary indicator designating the presence (or absence) of a respective value.
Before employing our clustering algorithm, we reduced the number of features in our model by applying a Principal Component Analysis. We began by converting lastDataFrame to a NumPy matrix, assigned to the variable X. We instantiated StandardScaler() from sklearn.preprocessing and passed X into its fit_transform() method. We then called pca.fit() on the scaled X to perform the Principal Component Analysis. After reviewing the results of the initial PCA, we decided to additionally analyze the first component and determine which variables were most influential to it.
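
A minimal sketch of this step, assuming lastDataFrame holds the final dummy-variable matrix described above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = lastDataFrame.values                   # 250 samples x dummy features
X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance

pca = PCA()
pca.fit(X_std)

# Cumulative explained variance; in our snapshot, 40 components covered
# roughly 72% of the variance, with the first component near 5%.
print(np.cumsum(pca.explained_variance_ratio_)[:40])

# Which features load most heavily on the first component?
loadings = pd.Series(pca.components_[0], index=lastDataFrame.columns)
print(loadings.abs().sort_values(ascending=False).head(10))
```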

3.6 Employing a Clustering Algorithm & The Elbow Method

The last phase of the project entailed ascertaining which clustering model to employ on the processed data and fitting the respective algorithm to the model. The determination process was extensive and trial-and-error driven, and concluded that k-means clustering was the best-suited solution for this model, largely because the data was already pre-processed. For the k-means clustering algorithm, we introduced a new variable, Xpca, derived from the variable X and employing the pcaComponents variable from the Principal Component Analysis. The parameters employed in our model include n_clusters, init, and random_state. The n_clusters parameter allows the user to indicate "the number of clusters to form as well as the number of centroids to generate" [6]. In our model, we iterated n_clusters=i over a range, which let us implement the elbow method to determine the optimal number of clusters; the elbow method and its findings are further discussed in the k-means section of the results. The init parameter in our model was set to 'k-means++', which "selects initial cluster centers for k-means clustering in a smart way to speed up convergence" [6]. The random_state parameter "determines random number generation for centroid initialization" [6] and, in our model, was set to 0. Following the establishment of the optimal number of clusters by the elbow method, a subsequent k-means model was run using MiniBatchKMeans. We employed this method because it reduces computational cost while maintaining the quality of the output; the scikit-learn documentation demonstrates the high-quality output of MiniBatchKMeans compared to a standard instantiation of k-means [7]. Within this iteration, we used the following parameters and respective arguments: init='k-means++' (as above), max_iter=500 ("Maximum number of iterations over the complete dataset before stopping independently of any early stopping criterion heuristics" [8]), n_init=1000 ("Number of random initializations that are tried. In contrast to KMeans, the algorithm is only run once, using the best of the n_init initializations as measured by inertia" [8]), init_size=1000 ("Number of samples to randomly sample for speeding up the initialization" [8]), and batch_size=1000 (the "size of the mini batches" [8]).
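
The sketch below combines the elbow loop and the final MiniBatchKMeans fit with the parameters reported above; deriving Xpca by keeping the first 40 principal components is our assumption, consistent with the PCA results in Section 4.2.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans

# Assumed derivation of Xpca: project onto the leading 40 components.
Xpca = pca.transform(X_std)[:, :40]

# Elbow method: plot within-cluster sum of squares (inertia) against k.
wcss = []
for i in range(1, 16):
    km = KMeans(n_clusters=i, init="k-means++", random_state=0).fit(Xpca)
    wcss.append(km.inertia_)
plt.plot(range(1, 16), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()

# Final model, using the parameters and arguments listed above.
mbk = MiniBatchKMeans(n_clusters=8, init="k-means++", max_iter=500,
                      n_init=1000, init_size=1000, batch_size=1000,
                      random_state=0)
labels = mbk.fit_predict(Xpca)
```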

Figure 1. [From source file.] Code used to extract the most frequently occurring words from the Plot variable and create corresponding dummy columns. The full list is shown in output 149. Input 151 creates the new variable wordsOfInterest, using words deemed significant and omitting parts of speech common to natural spoken language.

4 Experimental Results
We have broken down our results into the corresponding phases of our project, as designated by the following sections. Each section contains objective experimental results with minimal reasoning or interpretation. The subsequent Interpretation & Conclusion section develops subjective rationales for our findings and answers our initially posed questions.

4.1 Phase 1: Text Mining Results

Text mining and components of natural language processing were applied to find insight in text-intensive variables like Plot. From the IMDb top 250 data, and given that Plot reflects spoken language, the most commonly extracted words tended to be articles, conjunctions, prepositions, and pronouns common to natural language, like "the", "a", "and", "for", etc. Consequently, we compiled a list of the top 21 significant words occurring in this column and assigned it to a new variable, wordsOfInterest. The words appropriated to this variable were: "man", "boy", "him", "woman", "girl", "her", "love", "war", "journey", "murder", "friendship", "police", "battle", "beautiful", "team", "detective", "fight", "death", "crime", "struggles", and "family". (See Section 5, Interpretation & Conclusion.)

4.2 Phase 2: Dimensionality Reduction Results

"Principal component analysis is a method that rotates the dataset in a way such that the rotated features are statistically uncorrelated" [9]. The algorithm proceeds by finding the direction of maximum variance (which contains the most information), then finding the direction that contains the most remaining information while being orthogonal (oriented at a right angle) to the first [9]. In our application of Principal Component Analysis, with the intention of dimensionality reduction, we found that (for this instance of the dataset) 40 components explained 72.26% of the variance in the data, with the first component accounting for 5% of the variance (line 55 of the source code). We then took the components with contributions greater than 1% (to determine which features had the greatest influence on our first component) and printed these most influential features. Of the 141 original features, we printed the first 40, as they accounted for most of the variance. (See Section 5, Interpretation & Conclusion.)

4.3 Phase 3: Clustering Results

From the Principal Component Analysis, we employed the elbow method to help determine the optimal number of clusters for our k-means model (line 57 of the source code). The elbow method uses the within-cluster sum of squares to return this information [10]. We employed a MiniBatchKMeans model with the parameter n_clusters=8 on the variable Xpca and coupled it with the k-means labels to determine how many movies were in each cluster. The first cluster had 182 movies, while the eighth cluster had four. We initially tried to visualize the clusters and their respective centroids but found that, due to the dimensionality of the dataset, we were not able to display the visualization in a clear and intuitive way. Instead, we opted to list each cluster with its associated features (line 66 of the source code). Notice that Cluster 0 contained the following words: genre: Romance, genre: Sci-Fi, Director: Akira Kurosawa, genre: Adventure, genre: Music, while the titles include movies such as The Shawshank Redemption, The Godfather, The Godfather: Part II, 12 Angry Men, Schindler's List, and Pulp Fiction.
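
A short sketch of this inspection step follows; summarizing a cluster by averaging its standardized feature columns is our own approximation of the per-cluster listing, not necessarily the exact approach in the source code.

```python
import numpy as np
import pandas as pd

# Movies per cluster; in our run the largest held 182 and the smallest four.
print(np.bincount(labels))

# Titles and strongest original features for one cluster (here, Cluster 0).
cluster = 0
print(dataFrame["Title"][labels == cluster].tolist()[:6])
centroid = pd.Series(X_std[labels == cluster].mean(axis=0),
                     index=lastDataFrame.columns)
print(centroid.sort_values(ascending=False).head(5))
```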

5 Interpretation & Conclusion


As stated, we took a scientific approach to this research study, and our models were therefore able to expand and progress over the course of the project. We looked for innovative solutions to our questions and applied conceptual strategies from a number of different sources. Based on the results of this research project, we can conclude that the specific characteristics intrinsic to this dataset can be used to predict why a movie is considered a "good" movie and what those defining characteristics are. We want to be clear: this project was based purely on analytics from the raw IMDb dataset and does not include influence from other realms of thinking, such as the psychology of movie data or theatrical appeal to consumers; the project loosely followed the scientific method.
From the text mining phase of our study, we saw that the most frequently used words of significance (after removal of common parts of speech like "a", "the", etc.) included positive words like "love", "beautiful", "friendship", and "family". In the context of entertainment, these words and concepts are full of virtue and possibly contribute to the "feel-good" moments of a movie. Surprisingly, there was also a plethora of negative words, like "war", "murder", "battle", "fight", "struggles", "crime", and "death". These words likely reflect plot conflicts within these top 250 movies, and their coupling with the aforementioned positive words may be what contributes to a movie being considered "good". This may be due to the pairing of conflict and resolution, the essence of what is called "drama", which can invoke an emotional response in the viewer. We also noticed that among the words of interest in our Plot variable, there tended to be a greater number of words associated with males, like "man", "boy", and "him", as opposed to words associated with females, like "woman", "girl", and "her". This may indicate a higher number of movies with a male lead character rather than a female lead. An interesting future project may be to analyze movie ratings and their association with the presence of positive and negative words in the plot, or to examine whether movies with a stronger male or female presence tend to perform better.
The dimensionality reduction phase of our project, which incorporated a Principal Component Analysis, indicated that 40 components explained 72.26% of the variance in the data and that the first component contributed 5% of the variance. When analyzing this information, we see that the top feature of the first component is Rating: PG-13, and most of the top 40 components consist of certain parental ratings and genres. Other contributing factors indicate that movies with certain actors, like Leonardo DiCaprio and Jack Nicholson, and/or led by certain directors, like Quentin Tarantino and Alfred Hitchcock, show a degree of influence. It is worth asking why this might be the case, and we think the answer is intuitive: a movie-goer will naturally be inclined to watch a movie that features a favorite actor or actress, is created by a favorite director, or aligns with his or her favorite genre. The influence of parental ratings might indicate that movies are prioritized if they are family-friendly or approved for general audiences. It is also interesting to note that action and crime movies tended to have more influence than family movies. This may further corroborate our finding from the text mining phase that movies which invoke an emotional response from a viewer have a higher chance of being a "good" movie. One caveat is that Principal Component Analysis may not be optimal for binary data [9], so a future iteration of this project might use a better-suited technique, such as Multiple Correspondence Analysis.
Applying k-means clustering was the final phase of our project and essentially its main requirement. We successfully applied k-means clustering to unlabeled raw data, using unsupervised learning techniques for pre-processing, dimensionality reduction, and clustering. In this final phase, we saw that among our eight clusters, the first cluster contained 182 movies, while the eighth cluster contained four. Analyzing the first cluster, we see that the words associated with it place specific emphasis on genres and directors. The model clustered together similar movies, and we saw this clearly in the output (line 66 of the source code).
In conclusion, we set out to answer the questions, "Given this particular data set, can we predict why a movie would be considered a 'good' movie? Based on our results, are there identifiable attributes among these movies that are more likely to contribute to a 'good' movie?" The answer is yes: by means of unsupervised machine learning focused on clustering, we determined that there are clear trends intrinsic to the IMDb dataset. Among the identifiable attributes, the data suggests that movies which invoke an emotional response tend to perform better, and movies of a popular genre (particularly those containing drama), with popular actors, or created by popular directors naturally tend to perform well. Our k-means unsupervised clustering produced eight clusters that grouped similar movies together.
References
1. McCullough H, Conway LG (2018) "And the Oscar goes to . . .": Integrative complexity's predictive power in the film industry. Psychology of Aesthetics, Creativity, and the Arts 12:392–398. doi: 10.1037/aca0000149

2. Suedfeld P, Tetlock P (1977) Integrative Complexity of Communications in International Crises. Journal of Conflict Resolution 21:169–184. doi: 10.1177/002200277702100108

3. The Internet Movie Database (2018) Top Rated Movies: Top 250 as Rated by IMDb Users. https://www.imdb.com/chart/top

4. Mayo M (2017) The Practical Importance of Feature Selection. In: KDnuggets. https://www.kdnuggets.com/2017/06/practical-importance-feature-selection.html. Accessed 1 Dec 2018

5. Man H (2017) Data Clustering Using Unsupervised Learning: What type of movies are in the IMDB top 250? In: Medium. https://medium.com/hanman/data-clustering-what-type-of-movies-are-in-the-imdb-top-250-7ef59372a93b. Accessed 2018

6. scikit-learn (2018) sklearn.cluster.KMeans. In: scikit-learn 0.19.2 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. Accessed 1 Dec 2018

7. scikit-learn (2018) Comparison of the K-Means and MiniBatchKMeans clustering algorithms. In: scikit-learn 0.19.2 documentation. https://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html. Accessed 1 Dec 2018

8. scikit-learn (2018) sklearn.cluster.MiniBatchKMeans. In: scikit-learn 0.19.2 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html. Accessed 1 Dec 2018

9. Müller AC, Guido S (2017) Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly, Beijing

10. Naik K (2018) K means clustering in python - Machine Learning Tutorial with Python and R, Part 12. In: YouTube. https://www.youtube.com/watch?v=tAY6jtFoNEA. Accessed 2 Dec 2018
