
Gerald Gutwein

Movie Analysis
Econ4400w - Term Paper Spring 2015
Professor Chun Wang

Introduction
This paper is, in essence, a rudimentary analysis of the movie business based on a data set of
close to 300 movies, ranging from large-budget to small-budget films. The original data was
compiled for a case study done at New York University's School of Business. For some
of the movies in the data set certain figures were missing; to fill those gaps I used data from
reliable internet sources.

While the data set contains several variables that could make for an interesting dependent
variable, I decided to use the total domestic gross figure as the Y variable, for the simple reason
that this is the number movie producers would find most valuable to be able to predict.
As regressors, I focused primarily on first weekend gross, number of screens, and budget as
the predictors of success. Initially, I thought that the budget of the movies in the data set would
be an important indicator, or should at least be taken into account in some of the regression models.
However, I found that budget was a largely unimportant factor, and I will expound on the
reasons for this later in the paper.

Analysis
The movie industry is a multibillion-dollar enterprise in the US. Usually, the companies that
produce the movies make their profit on the back end, through overseas distribution and licensing
deals. The main factor determining whether a movie is likely to earn this back-end revenue
is the domestic gross revenue; as such, it would be of great interest for a movie studio to
be able to predict this figure as early as possible. The earliest indicator available is the
opening weekend revenue. Using gross domestic revenue as the dependent variable and opening
weekend revenue as the regressor, I ran a bivariate regression to see whether there was a
correlation between the two variables, and to assess the potential value of a statistically
significant model.

First I checked the summary statistics for any abnormalities.


The Summary Statistics

Statistic             Y - Total domestic gross    X1 - First weekend gross
Mean                  31582964                    9135000
Standard Error        3327650                     975888.4
Median                8070311                     618674
Mode                  #N/A                        #N/A
Standard Deviation    57347684                    16818156
Sample Variance       3.29E+15                    2.83E+14
Kurtosis              17.90563                    13.69186
Skewness              3.697367                    3.299113
Range                 4.36E+08                    1.16E+08
Minimum               14811                       3202
Maximum               4.36E+08                    1.16E+08
Sum                   9.38E+09                    2.71E+09
Count                 297                         297

Both variables have a very large variance and very high skewness (3.70 for total domestic gross
and 3.30 for first weekend gross).
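To make the skewness figure concrete, the sketch below computes the adjusted sample skewness (the formula commonly documented for Excel's SKEW function) on a small made-up list of grosses. The numbers and the function name are illustrative assumptions and are not taken from the data set.

```python
import math
import statistics

def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator).
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    cubes = sum(((v - mean) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubes

# Hypothetical grosses in dollars: several small films and one blockbuster.
grosses = [20_000, 150_000, 900_000, 4_000_000,
           12_000_000, 45_000_000, 400_000_000]

raw_skew = sample_skewness(grosses)
mean_gross = statistics.mean(grosses)
median_gross = statistics.median(grosses)

print(f"skewness = {raw_skew:.2f}")
print(f"mean = {mean_gross:,.0f}, median = {median_gross:,.0f}")
```

A mean far above the median, as in the summary table, is the usual sign of this kind of right skew.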

After running an initial regression using these variables I found that the intercept coefficient was
too large, essentially rendering the model useless for the purposes of this analysis. To help
correct for this problem and normalize the data I used a log-log model for the regression,
where
$\log \hat{y} = \hat{b}_0 + \hat{b}_1 \log x_1$
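As a rough illustration of how such a log-log specification can be estimated, the sketch below fits the bivariate model by ordinary least squares using the closed-form slope and intercept. The function name and the five data points are hypothetical. The paper's own regressions were run on the full data set, apparently in a spreadsheet, so this is not the author's procedure.

```python
import math

def fit_loglog(x, y):
    """OLS fit of log10(y) = b0 + b1 * log10(x); returns (b0, b1)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope = covariance(log x, log y) / variance(log x).
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical (first weekend gross, total domestic gross) pairs in dollars.
first_weekend = [50_000, 400_000, 2_500_000, 15_000_000, 60_000_000]
total_gross = [300_000, 1_500_000, 9_000_000, 40_000_000, 200_000_000]

b0, b1 = fit_loglog(first_weekend, total_gross)
print(f"log10(total) = {b0:.2f} + {b1:.2f} * log10(first weekend)")
```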

Here are the results:

At first the regression seems fine. We can clearly see a positive linear
relationship, as shown by the R square value of 0.77, and the coefficients are
statistically significant, as shown by the corresponding p-values. The resulting equation from the
model is

$\log(\mathrm{TDG}) = 2.48 + 0.72 \log(\mathrm{FWG})$

with standard errors of 0.13 for the intercept and 0.023 for the slope, where TDG is total
domestic gross and FWG is first weekend gross.

The slope coefficient means that a 1% increase in first weekend
gross is associated with a 0.72% increase in total domestic gross. The R square value tells us
that 77% of the variability in log total domestic gross is explained by the variability in
log first weekend gross.
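The percentage reading of the slope is an approximation that holds for small changes. The sketch below computes the exact change implied by a log-log slope, using the reported 0.72. The function name is an assumption made for illustration.

```python
def percent_change_in_y(percent_change_in_x, slope):
    """Exact % change in y implied by log(y) = b0 + slope * log(x)."""
    ratio = (1 + percent_change_in_x / 100) ** slope
    return (ratio - 1) * 100

# A 1% rise in first weekend gross: very close to the 0.72% rule of thumb.
small = percent_change_in_y(1, 0.72)

# A 50% rise: the exact figure is noticeably below 0.72 * 50 = 36%.
large = percent_change_in_y(50, 0.72)

print(f"1% more opening gross  -> {small:.3f}% more total gross")
print(f"50% more opening gross -> {large:.1f}% more total gross")
```

For large differences between films, the exact ratio is therefore the safer way to read the coefficient.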
However, the seemingly good fit of this model is misleading. We can see from the scatter
plot that there is far greater variability in total domestic gross in the region where first weekend
gross is small; in other words, our model cannot be predictive for smaller movies. We can also see this
from the standard error of the regression, 0.48. At a 95% confidence level the
prediction interval for log domestic gross is approximately $\pm 2 \times 0.48 \approx \pm 1$, meaning that
we can only be confident of the prediction to within a factor of ten (since log base 10 was used for
the variables). Our prediction for any given input could therefore be as much as 10 times
or as little as 1/10 of the true domestic gross. Obviously there is not much predictive
value in this model.
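The factor-of-ten argument can be checked with a few lines of arithmetic. The sketch below treats the 95% prediction interval as roughly two regression standard errors on either side of the predicted log value, a simplification that ignores uncertainty in the estimated coefficients. The function names and the example prediction of 7.0 (a predicted gross of 10 million) are assumptions.

```python
def prediction_factor(standard_error, multiplier=2.0):
    """Multiplicative half-width of an approximate interval in log10 units."""
    return 10 ** (multiplier * standard_error)

def prediction_range(predicted_log10, standard_error, multiplier=2.0):
    """Approximate (low, high) dollar range around a predicted log10 gross."""
    half_width = multiplier * standard_error
    return 10 ** (predicted_log10 - half_width), 10 ** (predicted_log10 + half_width)

factor = prediction_factor(0.48)          # standard error reported for the pooled model
low, high = prediction_range(7.0, 0.48)   # hypothetical prediction of $10 million

print(f"prediction good to within a factor of about {factor:.1f}")
print(f"range: ${low:,.0f} to ${high:,.0f}")
```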
At first I thought that the best way to resolve this issue and construct a better model would be
to incorporate the budgets of the movies in the regression. My reasoning was that
this would help correct for the major discrepancies between small and large movies that
corrupted the first regression. I tried several different approaches, including using the budget as an
additional regressor and dividing the variables by the budgets, but in none of the models were the
coefficients close to being statistically significant, either individually or jointly. The
reason is that, as regressors, first weekend gross and budget are too highly correlated, which is
not surprising. Using the budget was therefore not the answer I needed.
Then I looked at the number of screens on which a movie was shown during its first week. I
ran the multiple regression

$\log \hat{y} = \hat{b}_0 + \hat{b}_1 \log x_1 + \hat{b}_2 x_2$

where $x_1$ is first weekend gross, $x_2$ is first weekend number of screens, and the dependent
variable is the log domestic gross.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.880113
R Square             0.774598
Adjusted R Square    0.773049
Standard Error       0.479984
Observations         294

ANOVA
              df     SS          MS          F           Significance F
Regression    2      230.3908    115.1954    500.0138    7.17E-95
Residual      291    67.04188    0.230384
Total         293    297.4327

                                   Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept                          2.474308        0.276432          8.950867    4.26E-17    1.930248     3.018367
First weekend number of screens    -3.7E-06        5.04E-05          -0.07436    0.940772    -0.0001      9.54E-05
Log 1st weekend                    0.721171        0.055868          12.90849    1.9E-30     0.611215     0.831128

And correlation table:

                                   First weekend number of screens    Log 1st weekend
First weekend number of screens    1
Log 1st weekend                    0.913851                           1

We can see two problems: the correlation between the x variables is dangerously high (0.91), and the slope
coefficient for the number of screens is not statistically significant, as shown by its
p-value of 0.94, which is greater than the 0.05 alpha. Furthermore, the adjusted R
square is 0.77, the same as the R square of the bivariate model, so the multivariate model would
not have added any value regardless.
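Both observations can be quantified from the reported figures. With only two regressors, the variance inflation factor (a standard collinearity diagnostic that the paper does not itself report) depends only on their pairwise correlation, and adjusted R square follows from R square, the sample size and the number of regressors. The function names below are illustrative.

```python
def variance_inflation_factor(r):
    """VIF for a two-regressor model, given the correlation r between them."""
    return 1 / (1 - r ** 2)

def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R square: penalises R square for each added regressor."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Inputs taken from the correlation table and regression output above.
vif = variance_inflation_factor(0.913851)
adj = adjusted_r_squared(0.774598, n_obs=294, n_regressors=2)

print(f"VIF = {vif:.2f}")
print(f"adjusted R square = {adj:.6f}")
```

A VIF of about 6 means the coefficient variances are roughly six times what they would be with uncorrelated regressors, which fits the larger standard errors in the multiple regression.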
I then realized that perhaps the data needed to be split into two different sets: one for
movies with first weekend grosses of over 2,000,000 and one for movies grossing less than
2,000,000 in the first weekend. I chose 2,000,000 as the cutoff point based on the earlier scatter plot
of the log-log bivariate model, which showed that at around 2,000,000 (for the x variable) and
under, the residuals are much greater than for observations over 2,000,000. The data split quite evenly,
with 156 observations below 2,000,000 and 141 above.
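A split of this kind is straightforward to express in code. The sketch below partitions a list of records at the cutoff. The record layout, the field name `first_weekend`, and the choice to place a film grossing exactly 2,000,000 in the upper group are all assumptions, since the paper does not specify them.

```python
CUTOFF = 2_000_000  # first weekend gross threshold used in the paper

def split_by_opening(movies, cutoff=CUTOFF):
    """Return (small, large) lists split on first weekend gross."""
    small = [m for m in movies if m["first_weekend"] < cutoff]
    large = [m for m in movies if m["first_weekend"] >= cutoff]
    return small, large

# Hypothetical records, for illustration only.
movies = [
    {"title": "A", "first_weekend": 35_000, "total": 210_000},
    {"title": "B", "first_weekend": 1_900_000, "total": 7_500_000},
    {"title": "C", "first_weekend": 2_000_000, "total": 6_000_000},
    {"title": "D", "first_weekend": 48_000_000, "total": 150_000_000},
]

small, large = split_by_opening(movies)
print(len(small), "films below the cutoff,", len(large), "at or above it")
```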

I then ran a bivariate regression on each set using the same log-log method. First let us look at the
under 2,000,000 set.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.50667
R Square             0.256714
Adjusted R Square    0.251888
Standard Error       0.639917
Observations         156

ANOVA
              df     SS          MS          F           Significance F
Regression    1      21.78025    21.78025    53.18817    1.49E-11
Residual      154    63.06212    0.409494
Total         155    84.84238

                   Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept          2.739849        0.446217          6.140177    6.71E-09    1.858353     3.621345
Log 1st weekend    0.667723        0.091556          7.293022    1.49E-11    0.486854     0.848592

We can tell right away that although the coefficients are statistically significant, R square falls
drastically to 0.26, and the standard error is 0.64, which is not good for a log-log model.
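The headline statistics of this output are tied together by simple identities, which makes them easy to cross-check. The sketch below recomputes R square, the F statistic and the regression standard error from the ANOVA sums of squares and degrees of freedom for the under 2,000,000 set. The variable names are illustrative.

```python
import math

# Values read from the ANOVA table for the under 2,000,000 regression.
ss_regression = 21.78025
ss_residual = 63.06212
df_regression = 1
df_residual = 154

ss_total = ss_regression + ss_residual

# R square: share of total variation explained by the regression.
r_squared = ss_regression / ss_total

# F statistic: mean square of the regression over mean square of the residuals.
ms_residual = ss_residual / df_residual
f_stat = (ss_regression / df_regression) / ms_residual

# Standard error of the regression: square root of the residual mean square.
standard_error = math.sqrt(ms_residual)

print(f"R square = {r_squared:.4f}, F = {f_stat:.2f}, SE = {standard_error:.4f}")
```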

However, let's look at the above 2,000,000 regression:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.959545
R Square             0.920727
Adjusted R Square    0.920157
Standard Error       0.126163
Observations         141

ANOVA
              df     SS          MS          F           Significance F
Regression    1      25.69719    25.69719    1614.441    2.17E-78
Residual      139    2.212474    0.015917
Total         140    27.90966

                Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept       -0.13519        0.191898          -0.70447    0.482319    -0.5146      0.244231
X Variable 1    1.084367        0.026988          40.18011    2.17E-78    1.031008     1.137727

We see a drastic improvement. The slope coefficient is statistically significant, and R square is 0.92,
showing that 92% of the variation in the log of domestic gross is explained by the variation in the
log of first weekend gross. Also, the standard error is reduced to 0.126, so that at a 95%
confidence level the prediction interval for log domestic gross is approximately
$\pm 2 \times 0.126 \approx \pm 0.25$, a marked improvement, since we can now expect our prediction to be
within a factor of $10^{0.25} \approx 1.78$ of the true domestic gross.

This leaves us with the regression equation:


$\log(\text{domestic gross}) = -0.135 + 1.08 \log(\text{first weekend gross})$

Conclusions
We can see that there is much greater predictive value in a model for larger-scale movies. This
is because for smaller movies the variation in the residuals is too large. There may be a couple of
reasons for this. Smaller movies are much more hit-or-miss in terms of success than
larger movies, much like penny stocks compared with blue-chip companies: we can expect
one to be much more volatile and harder to predict than the other. So for production companies
it would probably be advisable to use a model based on early performance only for larger-scale
movies, as we have seen.
