Table of Contents
Foreword
R6 Fintech in Investment Management
R7 Correlation and Regression
R8 Multiple Regression and Machine Learning
R9 Time-Series Analysis
R10 Simulations
Foreword
BONUS:
High-Yield Q-Bank®: We have identified the most important
practice problems from the curriculum that you must do. Ideally you
should do all practice problems, but if you are time constrained you
Thank you for trusting IFT to help you with your exam preparation.
regression equation.
$Y_i = b_0 + b_1 X_i$
Confidence interval for the predicted value of the dependent
variable (Y)
The prediction interval is given by $\hat{Y} \pm t_c \times s_f$, where $s_f$ is the
standard error of the forecast.
Analysis of variance (ANOVA)
Analysis of variance is a statistical procedure for dividing the
variability of a variable into components that can be attributed to
different sources. We use ANOVA to determine the usefulness of the
independent variable or variables in explaining variation in the
dependent variable.
ANOVA table
Source of variation                 Degrees of freedom   Sum of squares   Mean sum of squares
Regression (explained variation)    k                    RSS              MSR = RSS / k
Error (unexplained variation)       n − k − 1            SSE              MSE = SSE / (n − k − 1)
Total variation                     n − 1                SST
With one independent variable (k = 1), the error degrees of freedom are n − 2.
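As an illustration, the table entries can be computed directly for a regression with one independent variable (k = 1). The sketch below is a minimal example in plain Python; the data and the function name anova_table are hypothetical and are not taken from the curriculum.

```python
def anova_table(x, y):
    """Return the ANOVA quantities for a regression of y on a single x."""
    n = len(x)
    k = 1  # one independent variable
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Ordinary least squares slope and intercept
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]

    rss = sum((fi - mean_y) ** 2 for fi in fitted)          # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
    sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation

    return {
        "n": n, "k": k,
        "RSS": rss, "SSE": sse, "SST": sst,
        "MSR": rss / k,
        "MSE": sse / (n - k - 1),
    }


# Hypothetical sample data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
table = anova_table(x, y)
for name, value in table.items():
    print(name, value)
```

In this sketch RSS and SSE add up to SST, which is the decomposition the table describes.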
• Omitted variables
• Not transforming variables that should be transformed before using them in a regression
• Pooling data from different samples that should not have been
pooled
Qualitative dependent variables
Qualitative dependent variables are dummy variables used as
dependent variables instead of independent variables. For example,
bankrupt or not bankrupt.
Probit (based on normal distribution) and Logit (based on logistic
distribution) models estimate the probability of a discrete outcome
given the values of the independent variables used to explain that
outcome.
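A minimal sketch of how the two models turn a linear combination of the independent variables into a probability. The coefficients b0 and b1 and the leverage input are hypothetical values chosen only for illustration; in practice the logit and probit coefficients are estimated separately and would differ.

```python
import math


def logit_probability(z):
    """Logistic distribution function: maps any real z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def probit_probability(z):
    """Standard normal cumulative distribution function evaluated at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical coefficients and input (assumptions, not from the curriculum)
b0, b1 = -1.0, 2.0
leverage = 0.8
z = b0 + b1 * leverage

print("logit  P(bankrupt):", logit_probability(z))
print("probit P(bankrupt):", probit_probability(z))
```

Both functions return values strictly between 0 and 1, which is why they suit a dependent variable such as bankrupt or not bankrupt.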
Machine learning and distinction between supervised and
unsupervised learning
Machine learning (ML) is a subset of artificial intelligence (AI),
where machines are programmed to improve performance in
specified tasks with experience.
Formal definition: A computer program is said to learn from
experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P, improves
with experience E. (Mitchell, 1997)
Supervised learning is machine learning that makes use of labeled
training data.
Formal definition: “Supervised learning is the process of training an
algorithm to take a set of inputs X and find a model that best relates
them to the output Y.”
Unsupervised learning is machine learning that does not make use
of labeled training data.
Penalized regression
• It is a computationally efficient technique used in prediction
problems.
• The regression coefficients are chosen to minimize the sum of
squared residuals plus a penalty term that increases with the
number of independent variables.
• Because of this penalty, the model remains parsimonious and
only the most important variables for explaining Y remain in
the model.
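The trade-off can be sketched with one common form of penalty, the LASSO, where the penalty is a constant times the sum of the absolute values of the slope coefficients. The choice of LASSO, the parameter name lam, and the data below are assumptions made for illustration; the notes above do not name a specific penalty.

```python
def penalized_objective(intercept, coefs, X, y, lam):
    """Sum of squared residuals plus a LASSO-style penalty (assumed form)."""
    ssr = 0.0
    for row, yi in zip(X, y):
        prediction = intercept + sum(b * xj for b, xj in zip(coefs, row))
        ssr += (yi - prediction) ** 2
    penalty = lam * sum(abs(b) for b in coefs)
    return ssr + penalty


# Hypothetical data: y depends strongly on x1 and only weakly on x2
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.1], [4.0, 0.4]]
y = [2.05, 3.97, 6.01, 8.04]

small_model = [2.0, 0.0]  # uses x1 only
large_model = [2.0, 0.1]  # uses x1 and x2

for lam in (1.0, 0.01):
    small = penalized_objective(0.0, small_model, X, y, lam)
    large = penalized_objective(0.0, large_model, X, y, lam)
    print(lam, "small model:", small, "large model:", large)
```

With the larger penalty the one-variable model has the lower objective, so the weak variable is left out; with the smaller penalty the two-variable model is preferred.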
CART
• It can be applied to predict either a categorical or continuous
target variable.
• If we are predicting a categorical target variable, then a
classification tree is produced.
• If we are predicting a continuous target variable, then a
regression tree is produced.
Random forests
• A random forest classifier is a collection of classification trees.
• Instead of just one classification tree, several classification
trees are built based on random selection of features.
R9 Time-Series Analysis
Time series
A time series is a set of observations on a variable measured over
different time periods. A time series model allows us to make
predictions about the future values of a variable.
Linear vs log-linear trend models
• When the dependent variable changes at a constant amount with
time, a linear trend model is used.
The linear trend equation is given by $y_t = b_0 + b_1 t + \varepsilon_t$,
$t = 1, 2, \ldots, T$
• When the dependent variable changes at a constant rate (grows
exponentially), a log-linear trend model is used.
The log-linear trend equation is given by $\ln y_t = b_0 + b_1 t + \varepsilon_t$,
$t = 1, 2, \ldots, T$
• A limitation of trend models is that their errors tend to be
serially correlated, which makes the models unreliable.
• The Durbin-Watson statistic is used to test for serial
correlation. If this statistic differs significantly from 2, then we
can conclude that the errors are serially correlated. To
overcome this problem, we use autoregressive (AR) time-series
models.
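The points above can be sketched in a few lines of Python. The example fits a linear trend and a log-linear trend by ordinary least squares and computes the Durbin-Watson statistic from the residuals. The series (growing at a constant 5% per period) and the function names are hypothetical.

```python
import math


def fit_linear_trend(y):
    """OLS fit of y_t = b0 + b1 * t + e_t for t = 1..T.

    Returns (b0, b1, residuals).
    """
    T = len(y)
    t = list(range(1, T + 1))
    mean_t = sum(t) / T
    mean_y = sum(y) / T
    b1 = (sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, y))
          / sum((ti - mean_t) ** 2 for ti in t))
    b0 = mean_y - b1 * mean_t
    residuals = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]
    return b0, b1, residuals


def durbin_watson(residuals):
    """Sum of squared changes in residuals divided by sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical series that grows at a constant rate of 5% per period
series = [100 * 1.05 ** t for t in range(1, 21)]

# A linear trend is the wrong model here, so its residuals are serially correlated
_, _, linear_resid = fit_linear_trend(series)
dw_linear = durbin_watson(linear_resid)

# A log-linear trend fits ln(y_t); its slope is the continuously compounded growth rate
b0, b1, _ = fit_linear_trend([math.log(v) for v in series])

print("Durbin-Watson (linear trend):", dw_linear)
print("log-linear slope:", b1, "vs ln(1.05):", math.log(1.05))
```

A Durbin-Watson value far below 2 for the linear fit signals positively correlated errors, while the log-linear slope recovers ln(1.05) for this constructed series.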
Random walk
A random walk is a time series in which the value of the series in
one period is the value of the series in the previous period plus an
unpredictable random error.
The equation for a random walk without a drift is:
$x_t = x_{t-1} + \varepsilon_t$
The equation for a random walk with a drift is:
$x_t = b_0 + x_{t-1} + \varepsilon_t$
Random walks do not have a mean-reverting level and are therefore not
covariance stationary. Currency exchange rates are an example.
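A minimal simulation sketch of the two equations, assuming normally distributed errors. The function name, the drift of 0.1, and the seed are hypothetical choices for illustration.

```python
import random


def simulate_random_walk(n_periods, drift=0.0, start=0.0, sigma=1.0, seed=None):
    """Simulate x_t = drift + x_{t-1} + e_t with e_t drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_periods):
        shock = rng.gauss(0.0, sigma)  # unpredictable random error
        path.append(drift + path[-1] + shock)
    return path


# Same seed, so both paths receive identical shocks
no_drift = simulate_random_walk(250, seed=42)
with_drift = simulate_random_walk(250, drift=0.1, seed=42)

print("final value without drift:", no_drift[-1])
print("final value with drift:   ", with_drift[-1])
```

Because each value is built on the previous one, a shock never fades out; the path with drift differs from the path without drift by the accumulated drift term.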
Unit root
• For an AR(1) model to be covariance stationary, the absolute
value of the lag coefficient b1 must be less than 1. When the
error terms, then the model has been corrected for seasonality.
Autoregressive conditional heteroskedasticity (ARCH)
• If the variance of the error in a time series depends on the
variance of the previous errors, then this condition is called
autoregressive conditional heteroskedasticity (ARCH).
• If ARCH exists, the standard errors for the regression parameters
will not be correct. We will have to use generalized least squares
or other methods that correct for heteroskedasticity.
• To test for first-order ARCH, we regress the squared residual on
the squared residual from the previous period:
$\hat{\varepsilon}_t^2 = a_0 + a_1 \hat{\varepsilon}_{t-1}^2 + u_t$
If the coefficient $a_1$ is statistically significant, the time-
series model has ARCH(1) errors.
• If a time-series model has significant ARCH, then we can predict
the next period error variance using the formula:
$\hat{\sigma}_{t+1}^2 = \hat{a}_0 + \hat{a}_1 \hat{\varepsilon}_t^2$
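The ARCH(1) regression and the variance forecast can be sketched as follows. This is a minimal example in plain Python; the residuals are hypothetical, the function names are illustrative, and the significance test on the slope coefficient is left out.

```python
def arch1_regression(residuals):
    """Regress squared residuals on their first lag; returns (a0_hat, a1_hat)."""
    squared = [e ** 2 for e in residuals]
    x = squared[:-1]  # squared residual from the previous period
    y = squared[1:]   # squared residual in the current period
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    a1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    a0 = mean_y - a1 * mean_x
    return a0, a1


def forecast_variance(a0, a1, last_residual):
    """Predicted error variance for the next period."""
    return a0 + a1 * last_residual ** 2


# Hypothetical residuals from a fitted time-series model
residuals = [0.5, -1.2, 1.5, -0.3, 0.2, -0.1, 0.9, -1.4, 1.1, 0.1]
a0_hat, a1_hat = arch1_regression(residuals)
next_variance = forecast_variance(a0_hat, a1_hat, residuals[-1])

print("a0:", a0_hat, "a1:", a1_hat)
print("forecast of next-period error variance:", next_variance)
```

In practice the slope estimate would be tested for statistical significance before concluding that ARCH(1) errors are present.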
Working with two time series
If a linear regression is used to model the relationship between two
time series, a test such as the Dickey-Fuller test should be
performed to determine whether either time series has a unit root.
• If neither of the time series has a unit root, then we can safely
use linear regression.
• If only one of the two time series has a unit root, then we should
not use linear regression.
• If both time series have a unit root and they are cointegrated
(exposed to the same macroeconomic variables), we may
safely use linear regression.
• If both time series have a unit root but are not cointegrated,
then we cannot use linear regression.
The Engle-Granger/Dickey-Fuller test is used to determine whether two
time series are cointegrated.
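The four cases above reduce to a short decision rule. The sketch below assumes the unit root and cointegration test results are already known and are passed in as booleans; the function name is hypothetical.

```python
def can_use_linear_regression(x_has_unit_root, y_has_unit_root, cointegrated=False):
    """Apply the four cases for regressing one time series on another."""
    if not x_has_unit_root and not y_has_unit_root:
        return True   # neither series has a unit root
    if x_has_unit_root != y_has_unit_root:
        return False  # only one series has a unit root
    return cointegrated  # both have a unit root: usable only if cointegrated


print(can_use_linear_regression(False, False))
print(can_use_linear_regression(True, False))
print(can_use_linear_regression(True, True, cointegrated=True))
print(can_use_linear_regression(True, True, cointegrated=False))
```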
Selecting an appropriate time-series model
Section 12 from the curriculum provides a step-by-step guide on
selecting an appropriate time-series model.
1. Understand the investment problem you have, and make an
initial choice of model. One alternative is a regression model that
predicts the future behavior of a variable based on hypothesized
causal relationships with other variables. Another is a time-series
model that attempts to predict the future behavior of a variable
based on the past behavior of the same variable.
2. If you have decided to use a time-series model, compile the time
series and plot it to see whether it looks covariance stationary.
The plot might show important deviations from covariance
stationarity, including the following:
• a linear trend;
• an exponential trend;
• seasonality; or
• a significant shift in the time series during the sample period
(for example, a change in mean or variance).
3. If you find no significant seasonality or shift in the time series,
then perhaps either a linear trend or an exponential trend will be
sufficient to model the time series. In that case, take the following
steps:
• Determine whether a linear or exponential trend seems most
reasonable (usually by plotting the series).
R10 Simulations
Steps in running a simulation
The four major steps used to run a simulation are as follows:
1. Determine probabilistic variables.
2. Define probability distributions for these variables.
3. Check for correlation across variables.
4. Run the simulation.
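The four steps can be sketched with a small Monte Carlo example. Everything specific here is an assumption made for illustration: the two probabilistic variables (revenue growth and profit margin), their distributions and parameters, the base revenue of 100, and the decision to treat the two inputs as uncorrelated.

```python
import random
import statistics


def run_simulation(n_trials=10_000, base_revenue=100.0, seed=0):
    """Simulate next-period profit = base_revenue * (1 + growth) * margin."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_trials):
        # Steps 1 and 2: probabilistic variables and their assumed distributions
        growth = rng.gauss(0.05, 0.02)      # revenue growth ~ Normal(5%, 2%)
        margin = rng.uniform(0.10, 0.20)    # profit margin ~ Uniform(10%, 20%)
        # Step 3: the inputs are drawn independently (assumed uncorrelated)
        # Step 4: compute the outcome for this trial
        outcomes.append(base_revenue * (1.0 + growth) * margin)
    return outcomes


profits = run_simulation()
mean_profit = statistics.mean(profits)
stdev_profit = statistics.stdev(profits)

print("trials:", len(profits))
print("mean profit:", mean_profit)
print("standard deviation of profit:", stdev_profit)
```

The output of a simulation is a distribution of outcomes rather than a single estimate, which is why the sketch reports both the mean and the standard deviation.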
Three ways to define the probability distributions for a
simulation’s variables
Three ways to define a probability distribution for a simulation’s
variables are:
1. Historical data
2. Cross-sectional data
3. Statistical distribution and parameters (subjective estimation)
How to treat correlation across variables in a simulation
If the input variables are correlated with each other, there are two
ways to treat the correlation:
• Pick the input that has the greatest impact on value and drop