
Media Sentiment and Stock Market Indices
Shadi Taha and Abhishek Yada

Abstract
Sentiment analysis of New York Times (NYT) articles was combined with stock market data to see whether it could improve the accuracy of predictive ARIMA models of stock market indices. While the results look promising at a glance, a deeper dive into the composition of the models shows that the accuracy gains are likely due to statistical noise, and the null hypothesis (that sentiment does not aid the model) cannot be rejected.

Introduction
With the US election having recently come to a close, we wanted to see what its impact was on the stock market. The stock market experienced a dip and then a subsequent rise in the 24 hours after Trump's victory, so our question was simple: was there a relationship between the general media optimism about the election and the value of stocks?
The idea is straightforward- stock markets are speculative, and optimism coming from trustworthy sources could make investors more optimistic, potentially raising the value of the indices. The reverse effect would also be intuitive.
There are two parts to our modeling process- the evaluation of sentiment, and then its combination with time series models. ARIMA was chosen as the time series model because it is commonly employed and would be a new tool to learn.

Method
Abhishek and I worked fairly independently of each other and simultaneously, but my final output depended on his, so I will discuss his work first.
Abhishek collected New York Times data through the API and attempted a variety of clustering and sentiment analysis techniques before ultimately deciding the results weren't particularly good and switching to AFINN. AFINN scores words on a -5 to 5 scale, with higher numbers indicating positivity and lower numbers negativity.
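For illustration, here is a minimal sketch of AFINN scoring in Python using the open-source afinn package; we are not claiming this is the exact code Abhishek used, and the headline text is invented:

# Sketch only: word-level sentiment with the `afinn` package (pip install afinn).
from afinn import Afinn

afinn = Afinn()

# The score of a text is the sum of the AFINN scores of its individual words.
headline = "Markets rally as optimism grows"  # hypothetical example text
print(afinn.score(headline))                  # a positive total indicates positive sentiment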
The AFINN scores were generated for the NYT articles, then passed over to me. I built ARIMA models to make a basic prediction of the NASDAQ, the S&P 500, and the Dow Jones. Then I built additional models incorporating the new sentiment information and compared their MAPE (Mean Absolute Percentage Error) values. The results were then validated by doing a deep dive into the p-values of the different model elements.

Process

[ABHISHEK: INCLUDE YOUR CLUSTER INFORMATION AND SENTIMENT MINING HERE]

After the sentiment was scored, it was passed on to a new notebook.
This notebook initially visualized the financial data in order to get an idea of what the data looked like. It was immediately apparent that the indices all had very similar, albeit not identical, shapes. The graph of the NASDAQ index is below:

This data encompasses January through December 2016. We see a slow downward trend at the start of the year, after the holiday season, but then see recovery as the market steadily increases over time. The other graphs were similar enough that attaching them to this report would be redundant; however, they're viewable through the PyNotebook.
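As a rough sketch, the index data can be loaded and plotted along these lines; the CSV filename comes from the Code Instructions section, while the Date and Close column names are assumptions on our part:

import pandas as pd
import matplotlib.pyplot as plt

# Assumed column names; the actual CSV headers may differ.
nasdaq = pd.read_csv("NASDAQ Composite.csv", parse_dates=["Date"], index_col="Date")

nasdaq["Close"].plot(title="NASDAQ Composite, 2016")
plt.ylabel("Index value")
plt.show()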
The similarity of the data shapes was fairly expected- we expect the indices to react
to the same or similar set of causal factors. Still, we wanted to use multiple indices
to avoid playing into or against the niche of a particular index. Any differences in
scope of impact, considering we only had NY Times sentiment data, would then be
fascinating enough to look into.
Plotting the data against each other would seem like a good way to get an understanding of the differences- but of course the y-axes for these indices are very different, so the patterns disappear. To understand how they stack up against each other, however, we can plot the returns:

Here the pattern similarity is more obvious. While there's variation in the return levels and the slopes of the lines, the dips and peaks and relative slope changes are all fairly similar, so our initial expectation would be to see fairly similar effects using the same methods.
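A minimal sketch of that comparison, assuming the S&P 500 and Dow Jones frames (sp500 and dow here) were loaded the same way as the NASDAQ above:

# Daily percentage returns put all three indices on a comparable scale.
returns = pd.DataFrame({
    "NASDAQ": nasdaq["Close"].pct_change(),
    "S&P 500": sp500["Close"].pct_change(),
    "Dow Jones": dow["Close"].pct_change(),
}).dropna()

returns.plot(title="Daily returns, 2016")
plt.show()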
We'll want to be able to gauge the seasonal (weekly, in this case) impacts as well as the year-over-year impacts (the trend line), so we used a technique called decomposition, which is available in the statsmodels time series analysis package that comes with Anaconda.
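Continuing the loading sketch from earlier, the decomposition can be produced roughly as follows; the weekly period of 5 trading days is our assumption, and older statsmodels releases name the argument freq rather than period:

from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition into trend, weekly (5 trading day) seasonality, and residuals.
decomposition = seasonal_decompose(nasdaq["Close"], model="additive", period=5)
decomposition.plot()
plt.show()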

These sets of graphs show the effects of our decomposition. The first graph is the observed data- it's the blue NASDAQ line from before, squished to fit a smaller y scale. The second graph shows the general trend line, which we can see is increasing over time; this is essentially a moving average. The third graph shows the seasonal variation, and the fourth graph shows the residuals.
This decomposition is important- we want to be able to control for the trend when performing our analysis, as well as for the seasonal variation. Our hope is also that the residuals are centered on zero. If we can control for these effects, with residuals centered on zero, then we'll have fulfilled many of the assumptions necessary for using regression models.
This is a more statistical approach than a typical machine learning one- but it serves as an introduction to a subject we're fairly unfamiliar with, so we decided that the foundation, and the application of that foundation, was a good place to start.
We can test the series against the trend line (the rolling mean) using a stationarity test. The Dickey-Fuller test is the most commonly used one. Its null hypothesis is that the data is not stationary- therefore the goal is to reject that null hypothesis before moving on with the analysis. We can visualize the test by graphing the data and its rolling averages, which we show below:

Here we see a p-value of roughly 0.71, so we cannot reject the null hypothesis- the data is very likely not stationary. Knowing this, we'll have to transform the data to continue.
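For reference, a minimal sketch of this stationarity check, continuing from the earlier sketches; the 12-day rolling window is an arbitrary choice of ours:

from statsmodels.tsa.stattools import adfuller

def test_stationarity(series, window=12):
    """Plot the series with its rolling statistics and print the Dickey-Fuller result."""
    series.plot(label="Original")
    series.rolling(window).mean().plot(label="Rolling mean")
    series.rolling(window).std().plot(label="Rolling std")
    plt.legend()
    plt.show()

    adf_stat, p_value = adfuller(series.dropna(), autolag="AIC")[:2]
    print("ADF statistic:", adf_stat, "p-value:", p_value)

test_stationarity(nasdaq["Close"])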
There are a few different transformations that are commonly used in time series analysis. We settled on a fairly basic one- the first difference. The approach is simple:
X = X - X.shift(1)
In other words, you're calculating the difference between each point and the previous point, and doing that for every point. The transformation considers not the values themselves, but the size of the change.
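In pandas this is a one-liner; the sketch below reuses the test_stationarity helper defined above:

# First difference: each point becomes the change from the previous point.
nasdaq_diff = nasdaq["Close"] - nasdaq["Close"].shift(1)
nasdaq_diff = nasdaq_diff.dropna()   # the first observation has no predecessor

test_stationarity(nasdaq_diff)       # the p-value should now be far below 0.05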
In an exponentially changing data set, this may not work. However, our data seemed to be rather linear in nature, so we thought it might. Our findings are as follows:

There appears to be some variation in the rolling average, and at first glance it may appear that the data is not stationary- but it's still centered on zero. Simply because there's variance doesn't mean it's not centered around the same point, and we can see this is correct when we look at the p-value, which is incredibly low.

With this improvement, we decided to move forward. There are interesting questions (what's happening in July?) that could be raised, but time was limited.
We quickly checked whether we could get a better differenced set using an exponentially weighted difference- similar to first differencing, but with weights on the different shift values. Since we were only using a first difference, however, the effect was small and not an improvement, so we did not use it.
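One common reading of this technique is to subtract an exponentially weighted moving average from the series rather than the raw lag; sketched roughly, with an arbitrary halflife, it might look like this:

# Exponentially weighted alternative to the plain first difference.
ewma = nasdaq["Close"].ewm(halflife=5).mean()
nasdaq_ewm_diff = (nasdaq["Close"] - ewma).dropna()

test_stationarity(nasdaq_ewm_diff)   # in our case, no improvement over the first difference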
The next step was to check our autocorrelation values. This is because our plan was to use an ARIMA time series model, which uses both AR (Auto Regressive) terms and MA (Moving Average) terms. The autocorrelation functions measure how much a variable is correlated with lagged versions of itself. Ultimately, we want to include these lags in our model as AR or MA terms, depending on the autocorrelation function used. There is the standard ACF (Autocorrelation Function) and the PACF (Partial Autocorrelation Function), which controls for the effects of other lagged variables (similar to how a multiple regression controls for other variables).
We set up what is essentially a t-test around 0, then test our correlations. When the correlation exceeds the confidence interval, we record that lag as a parameter to pass into our ARIMA function. Visually, the ACF looks like this:

In our case, we cross the confidence band at a lag of 1- so we record that 1 and pass it into our model. Our PACF looked similar, and also crossed the band at 1.
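statsmodels can draw both plots with confidence bands directly; a minimal sketch on the differenced series:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# The shaded band is the confidence interval around zero; the first lag that crosses it
# suggests the MA order (from the ACF) and the AR order (from the PACF).
plot_acf(nasdaq_diff, lags=20)
plot_pacf(nasdaq_diff, lags=20)
plt.show()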

Once we have our parameters, we create our model of the differences, and then find our RSS (residual sum of squares) so we have a general baseline for error.
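A sketch of that step with statsmodels; since the series is already differenced we set d to 0, and the order (1, 0, 1) reflects the lag-1 crossings above (older statsmodels releases expose this class as statsmodels.tsa.arima_model.ARIMA instead):

from statsmodels.tsa.arima.model import ARIMA

# One AR term and one MA term fitted to the already-differenced series.
model = ARIMA(nasdaq_diff, order=(1, 0, 1))
results = model.fit()

# Residual sum of squares as a rough baseline for error.
rss = ((results.fittedvalues - nasdaq_diff) ** 2).sum()
print("RSS:", rss)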
Visually, the model looks like this:

At a glance it looks like just a jumbled mess of blue and green, and it's hard to see how good these results really are. Is our RSS terrible? The number is high, but RSS never decreases with additional data points, and we have a lot of data points. Additionally, it increases with the scale of the data, and our differences are in the hundreds- so how do we know?
We'll have to scale it back. Our conversion was simple: we created a cumulative sum of these differences over time, then re-applied it to our very first value. This creates a new forecast altogether, on the original scale, and we can then use more interpretable error metrics to evaluate our model and get a clearer idea of where we stand.
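A sketch of that conversion, using the fitted differences from the model above:

# Undo the differencing: cumulatively sum the predicted changes, then re-anchor the
# result to the first observed index value to get a forecast on the original scale.
predicted_diff = results.fittedvalues
forecast = nasdaq["Close"].iloc[0] + predicted_diff.cumsum()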

The model looks horrible at a glance- and it is generally bad, but not as bad as it initially appears. The y-axis runs from 3500 to 5500, so while the plot is good for visualizing the gap between the points, it does blow that gap out of proportion. Still, we want to see this level of detail because we want to understand the general shape of our model.
We have a Mean Absolute Percentage Error (MAPE) of 0.2068, or roughly 21% error on average. Not horrible, but the shape of the model looks a bit like an inversion of the actual series. It isn't quite- you can see that the model begins to move in the same direction as the actual results around mid-August 2016- but it's an interesting find.
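MAPE itself is straightforward to compute; a small sketch using the reconstructed forecast from above:

import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned as a fraction (0.21 means 21%)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual))

print(mape(nasdaq["Close"].loc[forecast.index], forecast))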
Overall, the model doesn't seem particularly useful. It seems to be picking up the same effects as the initial data- it has similar spikes- but it doesn't seem centered properly and it doesn't trend in the proper direction.
Moving forward, the question was: what happens when we add our previously created sentiment data?
We won't repeat all the steps again; instead, here is the graph of the new model:

The 9.5 percentage point improvement in MAPE means we brought the error down to 0.1118 from 0.2068, so our reduction in total error was actually about 46%- a massive error reduction! Here we can see the model trends in the same direction much more often, and even overlaps in places. It is, visually speaking, a much better model overall.
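The mechanics of adding sentiment can be sketched with the same statsmodels class; here the daily AFINN scores are assumed to be a date-indexed Series named sentiment, which enters the model as an exogenous regressor:

# Align the differenced index values with the daily sentiment scores,
# keeping only the dates present in both series.
aligned = nasdaq_diff.to_frame("diff").join(sentiment.rename("sentiment"), how="inner")

# Same ARIMA order as before, with sentiment as an exogenous term.
model_sent = ARIMA(aligned["diff"], exog=aligned[["sentiment"]], order=(1, 0, 1))
results_sent = model_sent.fit()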

Results
Of course, if you inspect the x-axis on the second graph, you will see the scale is different. We didn't have sentiment data for the full time period, so we computed our final results on a time-adjusted model (the prior model restricted to a smaller date range). Our results are as follows:

Here we see improvements to accuracy across all three indices of 4.7% to 5.0%. This is fantastic, as we're cutting out over a third of the error in all three cases, and even visually the model is much better.
But can we be sure the improvement is truly due to the addition of sentiment? We evaluated the terms composing the model and found:

Our autoregressive terms were irrelevant, and we could verify this by changing the parameters we passed into our ARIMA (something we did test, although it isn't present in the notebook because looking at the same graph twice is unexciting). The MA terms were highly relevant, but of interest is our sentiment term- it has a p-value of 0.292. In other words, it is statistically insignificant.
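The fitted model exposes these figures directly; for example, continuing the sketch above:

# Coefficient estimates and p-values for every term, including the exogenous sentiment column.
print(results_sent.summary())
print(results_sent.pvalues)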
So what's going on?
It's possible that our sentiment values were somehow correlated with our MA or AR terms, and thus they were explaining some of the same variance. Generally, we find that unlikely. We think one of the following is more probable:
1) Sentiment is correlated with another variable, but not perfectly, and that variable is driving our higher model accuracy.
2) The accuracy results are simple overfitting- ARIMA is a regression model, and regression models cannot lose in-sample accuracy by including additional terms.
As a result, we fail to reject the null hypothesis and find no evidence that optimism about the political landscape in the NY Times has an effect on any of the stock market indices that we measured.

Summary
Our process was an involved one- it required a lot of data processing and preparation to compare the different data sets, and then to combine them for modeling purposes. In the end, we did not find evidence of a connection between the NYT sentiment scoring and these indices.
That is not to say there is no connection. In the future, we would want to expand to include other media outlets and different measures of sentiment. Additionally, different model types would likely be appropriate, as we are using relatively simple statistical models that have very likely been improved upon in the machine learning space. Using the model in an ensemble with a deep learner, for example, may prove promising.

Code Instructions

[ABHISHEK: INCLUDE YOUR INSTRUCTIONS HERE]
Final step: run Final Project.ipynb. You will want to run it in the same directory where you've stored NASDAQ Composite.csv, DowJones_IndustrialAvg.csv, SP500.csv, and the output from AFINN_Sentiment_Analysis.ipynb, if one was generated. If not, run the notebook with the camp_news_mood.csv file in the same directory as well- one has been supplied with the rest of the attached materials.

References
Some code was taken from the following tutorials and introductions to time series
analysis.
1. http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/
2. https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/
Other tutorials and resources used in the creation of our project include:

[ABHISHEK: INCLUDE YOUR RESOURCES HERE]
