Quantitative Methods
CFA Level II Curriculum Framework
Summary of Readings and Framework
SS 3
Framework
Correlation and Regression
1. Scatter Plots
2. Covariance and Correlation
3. Interpretations of Correlation Coefficients
4. Significance Test of the Correlation
5. Limitations to Correlation Analysis
6. The Basics of Simple Linear Regression
7. Interpretation of Regression Coefficients
8. Standard Error of Estimate & Coefficient of Determination (R²)
9. Analysis of Variance (ANOVA)
10. Regression Coefficient Confidence Interval
11. Hypothesis Testing about the Regression Coefficient
12. Predicted Value of the Dependent Variable
13. Limitations of Regression Analysis
Scatter Plots
A scatter plot graphs the observations of two data series in two dimensions; it is a useful first look at whether the two variables appear related.
Covariance and Correlation
Covariance:
Covariance measures how one random variable moves with another random variable; it captures the linear relationship between them.
$$\mathrm{Cov}(X,Y)=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$
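The calculation is easy to verify numerically. Below is a minimal Python sketch (the two data arrays are invented for illustration) that applies the formula directly and checks it against NumPy's built-ins; the last lines also compute the correlation coefficient, the covariance standardized by the two sample standard deviations, which the next slides interpret.

```python
# Minimal sketch: sample covariance and correlation (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
n = len(x)

# Sum of cross-deviations divided by (n - 1), per the formula above
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
# Correlation = covariance / (s_x * s_y)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, np.cov(x, y, ddof=1)[0, 1])   # manual vs. NumPy covariance
print(r, np.corrcoef(x, y)[0, 1])           # manual vs. NumPy correlation
```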
Interpretations of Correlation Coefficients
The correlation coefficient is a measure of linear association.
It is a simple number with no unit of measurement attached, so the
correlation coefficient is much easier to explain than the covariance.
Significance Test of the Correlation
Test H0: ρ = 0 versus Ha: ρ ≠ 0 (a two-tailed test) with
$$t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}},\qquad df=n-2$$
Decision rule: reject H0 if t > +t_critical or t < −t_critical
Example: An analyst is interested in predicting annual sales for XYZ Company,
a maker of paper products. The following table reports a regression of the annual
sales for XYZ against paper product industry sales. The correlation between
company and industry sales is 0.9757. The regression was based on five
observations.
                          Coefficient    Standard error of the coefficient
Intercept                   -94.88                 32.97
Slope (industry sales)       0.2796                 0.0363
$$t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}=\frac{0.9757\sqrt{5-2}}{\sqrt{1-0.952}}=7.72$$
From the t-table, with df = 3 and a 5% significance level, the two-tailed critical t-values are ±3.182 (recall that for this t-test, degrees of freedom = n − 2). Because the computed t is greater than +3.182, the correlation coefficient is significantly different from zero.
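The worked example can be reproduced in a few lines of Python; scipy is used only to look up the ±3.182 critical value.

```python
# Minimal sketch: significance test of r = 0.9757 with n = 5 observations.
import math
from scipy import stats

r, n = 0.9757, 5
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # ≈ 7.72
t_crit = stats.t.ppf(0.975, df=n - 2)            # ≈ 3.182 (5% two-tailed)
print(t, t_crit, abs(t) > t_crit)                # True → reject H0: ρ = 0
```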
Limitations to Correlation Analysis
Outliers
Outliers represent a few extreme values for sample observations. Relative to the rest of the sample data, the value of an outlier may be extraordinarily large or small.
Outliers can result in apparent statistical evidence that a significant relationship exists when, in fact, there is none, or that there is no relationship when, in fact, there is one.
Limitations to Correlation Analysis
Nonlinear relationships
Correlation only measures the linear relationship between two variables, so it does not capture strong nonlinear relationships between variables.
For example, two variables could have a nonlinear relationship such as Y = (1 − X)³, and the correlation coefficient would be close to zero, which is a limitation of correlation analysis.
The Basics of Simple Linear Regression
Linear regression allows you to use one variable to make predictions
about another, test hypotheses about the relation between two variables,
and quantify the strength of the relationship between the two variables.
Linear regression assumes a linear relation between the dependent and
the independent variables.
The dependent variable (Y) is the variable whose variation is explained by the independent variable. The dependent variable is also referred to as the explained variable, the endogenous variable, or the predicted variable.
The independent variable (X) is the variable whose variation is used to explain the variation of the dependent variable. The independent variable is also referred to as the explanatory variable, the exogenous variable, or the predicting variable.
The Basics of Simple Linear Regression
The simple linear regression model
$$Y_i=b_0+b_1X_i+\varepsilon_i,\quad i=1,\dots,n$$
where
Yi = ith observation of the dependent variable, Y
Xi = ith observation of the independent variable, X
b0 = regression intercept term
b1 = regression slope coefficient
εi = the residual for the ith observation (also referred to as the disturbance term or error term)
Interpretation of Regression Coefficients
The estimated intercept coefficient ( b̂0 ) is interpreted as the value
of Y when X is equal to zero.
The estimated slope coefficient ( b̂1 ) defines the sensitivity of Y to a change in X. The estimated slope coefficient equals the covariance of X and Y divided by the variance of X:
$$\hat{b}_1=\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}$$
$$\hat{b}_0=\bar{Y}-\hat{b}_1\bar{X}$$
Example
An estimated slope coefficient of 2 would indicate that the dependent
variable will change two units for every 1 unit change in the
independent variable.
An intercept term of 2% can be interpreted to mean that when the independent variable is zero, the expected value of the dependent variable is 2%.
An example: calculate a regression coefficient
The individual observations on countries' annual average money supply growth from 1970–2001 are denoted Xi, and the individual observations on countries' annual average inflation rates from 1970–2001 are denoted Yi.
Country           Money Supply     Inflation   Cross-Product        Squared Deviations   Squared Deviations
                  Growth Rate Xi   Rate Yi     (Xi − X̄)(Yi − Ȳ)   (Xi − X̄)²           (Yi − Ȳ)²
Australia             0.1166        0.0676        0.000169              0.000534             0.000053
Canada                0.0915        0.0519        0.000017              0.000004             0.000071
New Zealand           0.1060        0.0815        0.000265              0.000156             0.000449
Switzerland           0.0575        0.0339        0.000950              0.001296             0.000697
United Kingdom        0.1258        0.0758        0.000501              0.001043             0.000240
United States         0.0634        0.0509        0.000283              0.000906             0.000088
Sum                   0.5608        0.3616        0.002185              0.003939             0.001598
Average               0.0935        0.0603
Covariance                                        0.000437
Variance              0.000788                                                               0.000320
Standard deviation    0.028071                                                               0.017889
An answer: calculate a regression coefficient
$$\hat{b}_1=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}=\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}=\frac{0.000437}{0.000788}=0.5545,\ \text{and}$$
$$\hat{b}_0=\bar{Y}-\hat{b}_1\bar{X}=0.0603-0.5545(0.0935)=0.0084$$
$$\hat{Y}=0.0084+0.5545X$$
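The same estimates fall out of a short Python sketch applied to the six raw observations from the table (up to small rounding differences).

```python
# Minimal sketch: slope and intercept from the money growth/inflation data.
import numpy as np

x = np.array([0.1166, 0.0915, 0.1060, 0.0575, 0.1258, 0.0634])  # Xi
y = np.array([0.0676, 0.0519, 0.0815, 0.0339, 0.0758, 0.0509])  # Yi

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)   # ≈ 0.5545 and ≈ 0.0084, matching the slide
```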
Standard Error of Estimate & Coefficient of Determination (R²)
Standard Error of Estimate (SEE) measures the degree of variability of
the actual Y-values relative to the estimated Y-values from a regression
equation.
The SEE gauges the “fit” of the regression line. The smaller the standard
error, the better the fit.
The SEE is the standard deviation of the error terms in the regression.
Standard Error of Estimate & Coefficient of Determination (R²)
The coefficient of determination (R²) is defined as the percentage of the total variation in the dependent variable explained by the independent variable.
ANOVA Table

             df       SS     MSS
Regression   k = 1    RSS    MSR = RSS/k
Error        n − 2    SSE    MSE = SSE/(n − 2)
Total        n − 1    SST    —

Standard error of estimate:
$$SEE=\sqrt{MSE}=\sqrt{\frac{SSE}{n-2}}$$
Coefficient of determination (R²):
$$R^2=\frac{RSS}{SST}=1-\frac{SSE}{SST}=\frac{\text{explained variation}}{\text{total variation}}=1-\frac{\text{unexplained variation}}{\text{total variation}}$$
For simple linear regression, R² is equal to the squared correlation coefficient (i.e., R² = r²).
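As a quick numeric illustration of the two formulas, the sketch below plugs in hypothetical sums of squares (n, SSE, and SST are invented; they are not from the reading).

```python
# Minimal sketch: SEE and R² from ANOVA quantities (hypothetical values).
import math

n, sse, sst = 30, 3.0, 10.0        # invented for illustration
rss = sst - sse                    # SST = RSS + SSE

see = math.sqrt(sse / (n - 2))     # SEE = sqrt(MSE) = sqrt(SSE / (n - 2))
r2 = rss / sst                     # equivalently 1 - SSE/SST
print(see, r2, 1 - sse / sst)
```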
Example:
An analyst ran a regression and got the following result:
             df    SS      MSS
Regression    1    7000     ?
Error         ?    3000     ?
Total        41      ?      —
Regression Coefficient Confidence Interval
$$\hat{b}_1\pm t_c\,s_{\hat{b}_1}$$
If the confidence interval at the desired level of significance does not include zero, the null is rejected, and the coefficient is said to be statistically different from zero.
$s_{\hat{b}_1}$ is the standard error of the regression coefficient. As SEE rises, $s_{\hat{b}_1}$ also increases, and the confidence interval widens, because SEE measures the variability of the data about the regression line; the more variable the data, the less confidence there is in the regression model's estimate of a coefficient.
Hypothesis Testing about the Regression Coefficient
$$t=\frac{\hat{b}_1-b_1}{s_{\hat{b}_1}},\qquad df=n-2$$
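A short Python sketch, reusing the earlier XYZ example (slope 0.2796, standard error 0.0363, n = 5), runs the t-test and the matching confidence interval side by side.

```python
# Minimal sketch: t-test and confidence interval for a slope coefficient.
from scipy import stats

b1_hat, se_b1, n = 0.2796, 0.0363, 5
t = (b1_hat - 0) / se_b1                  # H0: b1 = 0; t ≈ 7.70
t_crit = stats.t.ppf(0.975, df=n - 2)     # ≈ 3.182
ci = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)
print(t, abs(t) > t_crit, ci)             # reject H0; CI excludes zero
```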
Predicted Value of the Dependent Variable
The predicted value for a given forecast of the independent variable, Xp:
$$\hat{Y}=\hat{b}_0+\hat{b}_1X_p$$
Limitations of Regression Analysis
Regression relations can change over time (parameter instability); regression results may be of limited use if other market participants also know of and act on the relation; and if the regression assumptions are violated, hypothesis tests and predictions are invalid.
Summary of Readings and Framework
SS 3
Framework
Multiple Regression
1. The Basics of Multiple Regression
2. Interpreting the Multiple Regression Results
3. Hypothesis Testing about the Regression Coefficient
4. Regression Coefficient F-test
5. Coefficient of Determination (R²)
6. Analysis of Variance (ANOVA)
7. Dummy variables
8. Multiple Regression Assumptions
9. Multiple Regression Assumption Violations
10. Model Misspecification
11. Qualitative Dependent Variables
The Basics of Multiple Regression
Multiple regression uses two or more independent variables to explain the variation of the dependent variable:
$$Y_i=b_0+b_1X_{1i}+b_2X_{2i}+\dots+b_kX_{ki}+\varepsilon_i,\quad i=1,\dots,n$$
Interpreting the Multiple Regression Results
The intercept term is the value of the dependent variable when the independent variables are all equal to zero.
Each slope coefficient bj is the expected change in the dependent variable for a one-unit change in Xj, holding the other independent variables constant (for this reason, the slope coefficients are also called partial regression coefficients).
Hypothesis Testing about the Regression Coefficient
Significance test for a regression coefficient
H0: bj=0
$$t=\frac{\hat{b}_j}{s_{\hat{b}_j}},\qquad df=n-k-1$$
p-value: the smallest significance level for which the null hypothesis can be
rejected
Reject H0 if p-value<α
Fail to reject H0 if p-value>α
Regression coefficient confidence interval
$$\hat{b}_j\pm t_c\,s_{\hat{b}_j}$$
Estimated regression coefficient ± (critical t-value)(coefficient standard error)
Regression Coefficient F-test
An F-test assesses whether all of the slope coefficients are simultaneously equal to zero:
H0: b1 = b2 = … = bk = 0 versus Ha: at least one bj ≠ 0
$$F=\frac{MSR}{MSE}=\frac{RSS/k}{SSE/(n-k-1)},\qquad df_{num}=k,\quad df_{denom}=n-k-1$$
Decision rule
Reject H0 if F (test statistic) > Fc (critical value).
Rejection of the null hypothesis at a stated level of significance indicates that at least one of the coefficients is significantly different from zero, which is interpreted to mean that at least one of the independent variables in the regression model makes a significant contribution to the explanation of the dependent variable.
The F-test here is always a one-tailed test.
The test assesses the effectiveness of the model as a whole in explaining the dependent variable.
Coefficient of Determination (R2)
Interpretation
The percentage of variation in the dependent variable that is collectively
explained by all of the independent variables. For example, an R2 of 0.63
indicates that the model, as a whole, explains 63% of the variation in the
dependent variable.
Adjusted R²
R² by itself may not be a reliable measure of the explanatory power of the multiple regression model, because R² almost always increases as variables are added to the model, even if the marginal contribution of the new variables is not statistically significant. The adjusted R² corrects for this:
$$\bar{R}^2=1-\left(\frac{n-1}{n-k-1}\right)(1-R^2)$$
Analysis of Variance (ANOVA)
ANOVA Table
             d.f.        SS     MSS
Regression   k           RSS    MSR = RSS/k
Error        n − k − 1   SSE    MSE = SSE/(n − k − 1)
Total        n − 1       SST    —
Dummy variables
Interpreting the coefficients
Example: EPSt = b0 + b1·Q1t + b2·Q2t + b3·Q3t + εt
EPSt = a quarterly observation of earnings per share
Q1t = 1 if period t is the first quarter, Q1t = 0 otherwise
Q2t = 1 if period t is the second quarter, Q2t = 0 otherwise
Q3t = 1 if period t is the third quarter, Q3t = 0 otherwise

   y        x1    x2    x3
  EPSt      Q1    Q2    Q3
EPS09Q4      0     0     0
EPS09Q3      0     0     1
EPS09Q2      0     1     0
EPS09Q1      1     0     0
EPS08Q4      0     0     0
EPS08Q3      0     0     1
EPS08Q2      0     1     0
EPS08Q1      1     0     0
   …         …     …     …

The intercept term, b0, represents the average value of EPS for the fourth quarter.
The slope coefficient on each dummy variable estimates the difference in earnings per share (on average) between the respective quarter (i.e., quarter 1, 2, or 3) and the omitted quarter (the fourth quarter in this case).
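A minimal statsmodels sketch of this seasonal-dummy setup is shown below. The EPS series is fabricated purely for illustration; the fitted intercept approximates the fourth-quarter average and each dummy coefficient approximates that quarter's average difference from Q4.

```python
# Minimal sketch: quarterly dummy-variable regression (fabricated EPS data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
quarters = np.tile([1, 2, 3, 4], 8)                     # 8 years, quarterly
eps = 2.0 + 0.3 * (quarters == 1) - 0.1 * (quarters == 2) \
      + rng.normal(0, 0.05, quarters.size)

# Dummies for Q1-Q3; Q4 is the omitted (base) quarter
X = np.column_stack([(quarters == q).astype(float) for q in (1, 2, 3)])
X = sm.add_constant(X)                                  # intercept = Q4 level
fit = sm.OLS(eps, X).fit()
print(fit.params)   # const ≈ Q4 average; each dummy ≈ difference vs. Q4
```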
Unbiased and consistent estimators
An estimator is unbiased if its expected value equals the true parameter value; it is consistent if the probability of estimates far from the true value falls as the sample size grows.
Multiple Regression Assumptions
The assumptions of the multiple linear regression
A linear relationship exists between the dependent and independent
variables
The independent variables are not random (i.e., X is not correlated with the error terms).
There is no exact linear relation between any two or more independent variables.
The expected value of the error term is zero (i.e., E(εi)=0 )
The variance of the error term is constant (i.e., the error terms are
homoskedastic)
The error term is uncorrelated across observations (i.e., E(εiεj)=0 for
all i≠j)
The error term is normally distributed
Multiple Regression Assumption Violations
Heteroskedasticity
Heteroskedasticity refers to the situation in which the variance of the error term is not constant (i.e., the error terms are not homoskedastic).
Unconditional heteroskedasticity occurs when the heteroskedasticity is not related to the level of the independent variables, which means that it does not systematically increase or decrease with changes in the value of the independent variables. It usually causes no major problems with the regression.
Conditional heteroskedasticity exists when the variance of the error term is related to the level of the independent variables. Conditional heteroskedasticity does create significant problems for statistical inference.
Multiple Regression Assumption Violations
Effect of Heteroskedasticity on Regression Analysis
It does not affect the consistency of the regression parameter estimators.
Consistency: the larger the sample, the lower the probability of estimation error.
The coefficient estimates (the b̂j) are not affected.
The standard errors are usually unreliable estimates.
If the standard errors are too small, but the coefficient estimates themselves are not affected, the t-statistics will be too large and the null hypothesis of no statistical significance will be rejected too often (Type I errors).
The opposite is true if the standard errors are too large (Type II errors).
The F-test is also unreliable.
Multiple Regression Assumption Violations
Detecting Heteroskedasticity
Two methods to detect heteroskedasticity:
(1) residual scatter plots (residuals vs. the independent variable)
(2) the Breusch-Pagan χ² test
H0: no heteroskedasticity
BP = n × R²residual, df = k, one-tailed test
(Note: R²residual is the coefficient of determination from regressing the squared residuals on the independent variables.)
Decision rule: reject H0 if the BP statistic exceeds the critical value from the χ² table; a small BP statistic supports the null.
Correcting heteroskedasticity
Robust standard errors (also called White-corrected standard errors)
Generalized least squares
(The CFA exam does not require knowing how to correct the regression equation itself; just know that, when heteroskedasticity is present, robust standard errors are used to compute the t-statistics.)
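For reference, a sketch of the Breusch-Pagan test on simulated conditionally heteroskedastic data; statsmodels' het_breuschpagan reports the same n × R² statistic as the manual auxiliary regression of the squared residuals on X.

```python
# Minimal sketch: Breusch-Pagan test on simulated data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1 + 2 * x + rng.normal(size=200) * (1 + np.abs(x))  # variance depends on x

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid
bp_stat, p_value, _, _ = het_breuschpagan(resid, X)

aux = sm.OLS(resid**2, X).fit()          # regress squared residuals on X
print(bp_stat, len(x) * aux.rsquared)    # BP = n × R² of that regression
print(p_value)                           # small p-value → heteroskedasticity
```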
Multiple Regression Assumption Violations
Serial correlation (autocorrelation)
Serial correlation refers to the situation in which the regression error terms are correlated across observations (i.e., E(εiεj) ≠ 0 for some i ≠ j), as is common in time-series data.
Effect of Serial Correlation on Regression Analysis
Positive serial correlation → Type I errors & unreliable F-test
It does not affect the consistency of the estimated regression coefficients.
Because of the tendency of the data to cluster together from observation to observation, positive serial correlation typically results in coefficient standard errors that are too small, which causes the computed t-statistics to be too large.
Positive serial correlation is much more common in economic and financial data, so we focus our attention on its effects.
Negative serial correlation → Type II errors (not required for the exam)
Because of the tendency of the data to diverge from observation to observation, negative serial correlation typically causes standard errors that are too large, which leads to computed t-statistics that are too small.
Durbin-Watson test
$$DW\approx 2(1-r)$$
where r is the sample correlation between residuals from one period and those from the previous period.
Decision rule (testing for positive serial correlation):
d < d_l: reject H0 and conclude positive serial correlation
d_l ≤ d ≤ d_u: inconclusive
d > d_u: fail to reject the null hypothesis of no positive serial correlation
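Computing the DW statistic itself is a one-liner with statsmodels; the residuals below are stand-ins, and the d_l and d_u bounds still have to come from a Durbin-Watson table.

```python
# Minimal sketch: Durbin-Watson statistic on stand-in residuals.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

resid = np.random.default_rng(2).normal(size=100)  # white-noise residuals
print(durbin_watson(resid))   # ≈ 2 → no evidence of serial correlation
```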
Correcting serial correlation
Adjust the coefficient standard errors (e.g., using the Hansen method, which also corrects for conditional heteroskedasticity), or modify the regression equation itself to eliminate the serial correlation.
Multiple Regression Assumption Violations
Multicollinearity
Multicollinearity refers to the situation in which two or more independent variables are highly correlated with each other.
In practice, multicollinearity is often a matter of degree rather than of absence or presence.
Two methods to detect multicollinearity:
(1) t-tests indicate that none of the individual coefficients is significantly different from zero, while the F-test indicates overall significance and the R² is high
(2) the absolute value of the sample correlation between any two independent variables is greater than 0.7 (i.e., |r| > 0.7)
Methods to correct multicollinearity: omit one or more of the
correlated independent variables
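For a model with two independent variables, the |r| > 0.7 screen is a one-liner; the sketch below fabricates two deliberately collinear regressors.

```python
# Minimal sketch: pairwise-correlation screen for multicollinearity.
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)   # built to be collinear
r = np.corrcoef(x1, x2)[0, 1]
print(r, abs(r) > 0.7)                        # True → likely multicollinearity
```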
Multiple Regression Assumption Violations
Summary of assumption violations:

Violation                        Effect                              Detection                          Correction
Conditional heteroskedasticity   Standard errors unreliable          Residual scatter plots;            Robust (White-corrected)
                                 (coefficients unaffected)           Breusch-Pagan χ² test              standard errors
Positive serial correlation      Standard errors too small           Durbin-Watson test                 Adjusted (e.g., Hansen)
                                 (Type I errors); unreliable F-test                                     standard errors
Multicollinearity                Standard errors too large;          Significant F-test and high R²     Omit one or more of the
                                 t-tests lack power                  with insignificant t-tests;        correlated variables
                                                                     |r| > 0.7 between two variables
Model Misspecification
There are three broad categories of model misspecification, or ways in which the
regression model can be specified incorrectly, each with several subcategories:
1. The functional form can be misspecified.
Important variables are omitted.
Variables should be transformed.
Data is improperly pooled.
2. Explanatory variables are correlated with the error term in time series
models.
A lagged dependent variable is used as an independent variable.
A function of the dependent variable is used as an independent variable
("forecasting the past").
Independent variables are measured with error.
3. Other time-series misspecifications that result in nonstationarity.
Effects of model misspecification: the regression coefficients are biased and/or inconsistent.
Qualitative Dependent Variables
Qualitative dependent variable models (e.g., probit, logit, and discriminant analysis) are used when the dependent variable is categorical rather than continuous, such as a bankrupt/not-bankrupt outcome.
Summary of Readings and Framework
SS 3
Framework
Time-Series Analysis
1. Trend Models
2. Autoregressive Models (AR)
3. Random Walks
4. Autoregressive Conditional Heteroskedasticity (ARCH)
5. Regression with More Than One Time Series
6. Steps in Time-Series Forecasting
Trend Models
Linear trend model:
$$y_t=b_0+b_1t+\varepsilon_t$$
Trend Models
Log-linear trend model
$$y_t=e^{b_0+b_1t},\qquad \ln(y_t)=b_0+b_1t+\varepsilon_t$$
Model the natural log of the series using a linear trend.
Use the Durbin-Watson statistic to detect autocorrelation.
Trend Models
Factors that Determine Which Model is Best
A linear trend model may be appropriate if the data points appear to
be equally distributed above and below the regression line (inflation
rate data).
A log-linear model may be more appropriate if the data plot with a non-linear (curved) shape; in that case, the residuals from a linear trend model will be persistently positive or negative for a period of time (e.g., stock indices and stock prices).
Limitations of Trend Model
Time-series data usually exhibit serial correlation, which means that a trend model is not appropriate for the series; the estimates of b0 and b1 are then inconsistent.
The mean and variance of the time series may change over time.
Autoregressive Models (AR)
An autoregressive (AR) model uses past values of the dependent variable as its independent variables. An AR(p) model of order p is:
$$x_t=b_0+b_1x_{t-1}+b_2x_{t-2}+\dots+b_px_{t-p}+\varepsilon_t$$
Autoregressive Models (AR)
Before relying on an AR model, verify that there is:
No autocorrelation of the residuals
A covariance-stationary series
No conditional heteroskedasticity
Autoregressive Models (AR)
Detecting autocorrelation in an AR model
Compute the autocorrelations of the residuals.
Use t-tests to determine whether the residual autocorrelations differ significantly from 0:
$$t=\frac{\hat{\rho}_{\varepsilon_t,\varepsilon_{t-k}}}{1/\sqrt{n}}$$
where ρ̂ is the autocorrelation of the residuals at lag k and n is the number of observations in the time series.
If the residual autocorrelations differ significantly from 0, the model is not correctly specified, so we may need to modify it (e.g., for seasonality).
Correction: add lagged values of the dependent variable (e.g., a seasonal lag).
Autoregressive Models (AR)
Covariance-stationary series
Statistical inference based on OLS estimates for a lagged time series
model assumes that the time series is covariance stationary
Three conditions for covariance stationary
Constant and finite expected value of the time series
Constant and finite variance of the time series
Constant and finite covariance with leading or lagged values
Stationarity in the past does not guarantee stationarity in the future.
All covariance-stationary time series have a finite mean-reverting
level.
Autoregressive Models (AR)
Mean reversion
A time series exhibits mean reversion if it has a tendency to move towards its mean.
For an AR(1) model, the mean-reverting level is:
$$x_t=\frac{b_0}{1-b_1}$$
If $x_t>\frac{b_0}{1-b_1}$, the model predicts that x_{t+1} will be lower than x_t; if $x_t<\frac{b_0}{1-b_1}$, the model predicts that x_{t+1} will be higher than x_t.
If an autoregressive model has no finite mean-reverting level, the series follows a random walk.
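A tiny numeric sketch with hypothetical AR(1) estimates makes the prediction concrete: a value above the mean-reverting level is forecast to fall toward it.

```python
# Minimal sketch: mean-reverting level of an AR(1) model (hypothetical b0, b1).
b0, b1 = 1.2, 0.7              # |b1| < 1, so a finite level exists
mrl = b0 / (1 - b1)            # mean-reverting level = 4.0

x_t = 5.0                      # current value, above the level
forecast = b0 + b1 * x_t       # 4.7 < x_t: the model predicts a decline
print(mrl, forecast)
```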
Autoregressive Models (AR)
Models estimated with shorter time series are usually more stable than those estimated with longer time series, because the underlying relationship is less likely to have changed over a shorter sample period.
Compare forecasting power with RMSE
The root mean squared error (RMSE) of out-of-sample forecasts is used to compare the forecasting power of competing models: the model with the lower out-of-sample RMSE is the more accurate forecaster.
Random Walks
Random walk
A special AR(1) model with b0 = 0 and b1 = 1
Simple random walk: x_t = x_{t−1} + ε_t
The best forecast of x_t is x_{t−1}
Random walk with a drift:
x_t = b0 + b1·x_{t−1} + ε_t, with b0 ≠ 0 and b1 = 1
The time series is expected to increase or decrease by a constant amount (the drift, b0) each period
Random Walks
Covariance stationarity
A random walk has an undefined mean-reverting level, since b0/(1 − b1) = 0/0.
A time series must have a finite mean-reverting level to be covariance stationary.
A random walk, with or without a drift, is not covariance stationary.
A time series is said to have a unit root if the lag coefficient is equal to one.
Random Walks
The unit root test of nonstationarity
The time series is said to have a unit root if the lag coefficient is equal to
one
An ordinary t-test of the hypothesis that b1 = 1 is invalid for testing a unit root.
Dickey-Fuller test (DF test) to test the unit root
Start with an AR(1) model: x_t = b0 + b1·x_{t−1} + ε_t
Subtract x_{t−1} from both sides: x_t − x_{t−1} = b0 + (b1 − 1)·x_{t−1} + ε_t, or x_t − x_{t−1} = b0 + g·x_{t−1} + ε_t, where g = b1 − 1
H0: g = 0 (the series has a unit root and is nonstationary); Ha: g < 0 (the series does not have a unit root and is stationary)
Calculate the conventional t-statistic and compare it to a revised (Dickey-Fuller) set of critical values
If we can reject the null, the time series does not have a unit root and
is stationary
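In practice the (augmented) Dickey-Fuller test is available in statsmodels; the sketch below applies it to a simulated random walk, where the unit-root null should not be rejected.

```python
# Minimal sketch: ADF unit-root test on a simulated random walk.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
random_walk = np.cumsum(rng.normal(size=200))   # x_t = x_{t-1} + eps_t
adf_stat, p_value, *_ = adfuller(random_walk)
print(adf_stat, p_value)   # large p-value → cannot reject H0 of a unit root
```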
Random Walks – if a time series appears to have a unit root
If a time series appears to have a unit root, how should we model it?
One method that is often successful is to first-difference the time series (as discussed previously) and try to model the first-differenced series as an autoregressive time series.
e.g. 2, 5, 10, 17, ?, 37
First differencing
Define y_t as y_t = x_t − x_{t−1} (for a simple random walk, y_t = ε_t).
This is an AR(1) model, y_t = b0 + b1·y_{t−1} + ε_t, with b0 = b1 = 0.
The first-differenced variable y_t is covariance stationary.
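First-differencing the slide's example series (using only the known leading values; the blank entry is left blank) shows how differencing exposes the simpler underlying pattern.

```python
# Minimal sketch: first-differencing a short series.
import numpy as np

x = np.array([2, 5, 10, 17])   # known leading values from the example
y = np.diff(x)                 # y_t = x_t - x_{t-1}
print(y)                       # [3 5 7]: the differences follow a simple trend
```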
Autoregressive Conditional Heteroskedasticity (ARCH)
To test for ARCH(1), the squared residuals are regressed on their own lagged values:
$$\hat{\varepsilon}_t^2=a_0+a_1\hat{\varepsilon}_{t-1}^2+u_t$$
If the coefficient a1 is significantly different from 0, the time series is ARCH(1).
If ARCH errors are present, generalized least squares must be used to develop a predictive model.
Use the ARCH model to predict the variance of the residuals in the following period:
$$\hat{\sigma}_{t+1}^2=\hat{a}_0+\hat{a}_1\hat{\varepsilon}_t^2$$
Regression with More Than One Time Series
In linear regression, if any time series contains a unit root, OLS may be
invalid
Use DF tests on each of the time series to detect unit roots; there are three possible scenarios:
None of the time series has a unit root: we can use multiple regression.
At least one time series has a unit root while at least one does not: we cannot use multiple regression.
Each time series has a unit root: we need to establish whether the time series are cointegrated.
If cointegrated, we can estimate the long-term relation between the two series (but this may not be the best model of the short-term relationship between the two series).
Summary of Readings and Framework
SS 3
Simulation
Steps in Simulation
Determine “probabilistic” variables
Define probability distributions for these variables
Historical data
Cross sectional data
Statistical distribution and parameters
Check for correlation across variables
When two variables are strongly correlated, one solution is to pick only one of the two inputs; the other is to build the correlation explicitly into the simulation.
Run the simulation
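The sketch below strings these steps together for a toy valuation: two probabilistic inputs are given distributions, the trials are run, and the output is a distribution rather than a point estimate. All names and parameters are hypothetical.

```python
# Minimal sketch: a Monte Carlo simulation with two probabilistic inputs.
import numpy as np

rng = np.random.default_rng(5)
n_trials = 10_000

# Steps 1-2: choose probabilistic variables and assign distributions
revenue_growth = rng.normal(0.05, 0.02, n_trials)    # hypothetical
margin = rng.uniform(0.10, 0.20, n_trials)           # hypothetical
# (Step 3, the correlation check, is skipped: the inputs are independent.)

# Step 4: run the simulation through a simple valuation model
value = 1_000 * (1 + revenue_growth) * margin
print(value.mean(), np.percentile(value, [5, 95]))   # a distribution of outcomes
```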
Simulation
Advantage of using simulation in decision making
Better input estimation
A distribution for expected value rather than a point estimate
Simulations with Constraints
Book value constraints
Regulatory capital restrictions
Financial service firms
Negative book value for equity
Earnings and cash flow constraints
Either internally or externally imposed
Market value constraints
Model the effect of distress on expected cash flows and discount rates
Comparing the Approaches
Choose scenario analysis, decision trees, or simulations
Selective versus full risk analysis
Type of risk
Discrete risk vs. Continuous risk
Concurrent risk vs. Sequential risk
Correlation across risks
Correlated risks are difficult to model in decision trees
Risk type and Probabilistic Approaches
Discrete/     Correlated/    Sequential/    Risk approach
Continuous    Independent    Concurrent
Discrete      Correlated     Sequential     Decision trees
Discrete      Independent    Concurrent     Scenario analysis
Continuous    Either         Either         Simulations