1. Regression
2. Classification
3. Clustering
Regression:
1. Linear regression
a. Simple linear regression (lm.fit = lm(medv ~ lstat, data = Boston))
b. Multiple linear regression (lm.fit = lm(medv ~ ., data = Boston))
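The two lm() calls above can be run end to end; here a small synthetic data frame stands in for the Boston data set (the column names lstat, age, medv mirror the book's example but the values are made up):

```r
# Simple vs. multiple linear regression on a toy stand-in for Boston.
set.seed(1)
df <- data.frame(lstat = runif(50, 1, 30),   # % lower-status population (synthetic)
                 age   = runif(50, 10, 90))  # home age (synthetic)
df$medv <- 30 - 0.8 * df$lstat + 0.05 * df$age + rnorm(50)  # response with noise

lm.fit  <- lm(medv ~ lstat, data = df)  # simple: one predictor
lm.fit2 <- lm(medv ~ ., data = df)      # multiple: "." means all other columns

coef(lm.fit)                 # intercept and lstat slope
summary(lm.fit2)$r.squared   # fit of the multiple model
```

The `~ .` shorthand is what makes the multiple-regression call so compact: every column except the response is used as a predictor.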
Classification:
1. Logistic regression
(glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket, family = binomial))
2. LDA (Linear discriminant analysis) (lda.fit = lda(Direction ~ Lag1 + Lag2,
data = Smarket, subset = train))
3. QDA (Quadratic discriminant analysis)
(qda.fit = qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train))
4. KNN (K-Nearest Neighbor) (knn.pred = knn(train.X, test.X, train.Direction, k = 1))
5. Decision trees (tree.carseats = tree(High ~ . - Sales, Carseats, subset = train))
a. Regression trees
b. Classification trees
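The logistic-regression call above can be sketched end to end; here a synthetic data frame stands in for the Smarket data (Lag1, Lag2, and Direction are assumed column names matching the book's example, and the values are simulated):

```r
# Minimal logistic-regression sketch on a toy stand-in for Smarket.
set.seed(1)
n <- 200
Lag1 <- rnorm(n); Lag2 <- rnorm(n)
# Direction depends on the lags plus noise, so the model has signal to find.
Direction <- factor(ifelse(Lag1 + Lag2 + rnorm(n) > 0, "Up", "Down"))
d <- data.frame(Direction, Lag1, Lag2)

glm.fit   <- glm(Direction ~ Lag1 + Lag2, data = d, family = binomial)
glm.probs <- predict(glm.fit, type = "response")      # P(Direction = "Up")
glm.pred  <- ifelse(glm.probs > 0.5, "Up", "Down")    # 0.5 decision threshold
mean(glm.pred == d$Direction)                         # training accuracy
```

With a factor response, glm() models the probability of the second factor level ("Up" here, since levels sort alphabetically), which is why predict(type = "response") is read as P(Up).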
Examples:
Stock market price direction (prediction); response: up, down; input: yesterday's
price movement (% change), the two previous days' price movements (% change), etc.
Illness classification (inference); response: ill, healthy; input: resting
heart rate, resting breath rate, mile run time.
Clustering:
1. K-means (km.out = kmeans(x, 2, nstart = 20))
2. Hierarchical (hc.complete = hclust(dist(x), method = "complete"))
x<-c(1,2,3,4)
Length: length(x)
ls() list out all of the objects (data and functions) that we have saved so far
rm() to remove the saved objects
rm(list=ls()) to remove all saved objects
matrix() to create a matrix
x= matrix(data=c(1,2,3,4),nrow=2, ncol=2)
or
x=matrix(c(1,2,3,4),2,2)
Output (filled column by column):
     [,1] [,2]
[1,]    1    3
[2,]    2    4
X=matrix(c(1,2,3,4), 2, 2, byrow=TRUE)
Output (filled row by row):
     [,1] [,2]
[1,]    1    2
[2,]    3    4
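The column-major vs. row-major filling above can be checked directly:

```r
# matrix() fills column by column unless byrow = TRUE.
x <- matrix(c(1, 2, 3, 4), 2, 2)               # columns: (1,2) then (3,4)
X <- matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE) # rows: (1,2) then (3,4)
x[1, 2]  # 3: first row, second column of the column-filled matrix
X[1, 2]  # 2: same position in the row-filled matrix
```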
cor(x,y) to compute the correlation between x and y
set.seed(1) to generate same set of random numbers
mean() to find mean
var() to find variance
sd() to find standard deviation = square root of variance
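A quick check of the three summary functions on a small vector (note var() and sd() use the sample versions, dividing by n-1):

```r
y <- c(2, 4, 6, 8)
mean(y)  # 5
var(y)   # sample variance: sum of squared deviations / (n - 1) = 20/3
sd(y)    # standard deviation = sqrt(var(y))
```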
plot(x,y), plot(x,y,xlab=,ylab=,main=) to produce a scatterplot of x versus y (with axis labels and a title)
pdf() to create pdf
jpeg() to create a jpeg
dev.off() to indicate the plot end
seq(a,b), seq(a,b,length=n) to create a sequence
image() and contour() to produce a color-coded plot, i.e. a heatmap
persp() to produce 3-d plot
A=matrix(1:16,4,4)
A[2,3] 2nd row 3rd column
A[c(1,3),c(2,4)] 1st row and 3rd row, 2nd and 4th column
A[1:3,2:4] 1st,2nd,and 3rd row and 2nd,3rd,and 4th column
A[1:2,] first 2 rows and all columns
A[,1:2] all rows and first 2 columns
A[-c(1,3),] 2nd and 4th row and all columns
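The indexing rules above can be verified on the same A (remember matrix() fills column by column, so A[2,3] is the 10th stored value):

```r
A <- matrix(1:16, 4, 4)    # filled column by column
A[2, 3]                    # 10: 2nd row, 3rd column
A[c(1, 3), c(2, 4)]        # rows 1 & 3 crossed with columns 2 & 4
A[-c(1, 3), ]              # negative index drops rows 1 and 3
dim(A)                     # 4 4
```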
dim(A) matrix dimension (rows - 4, cols - 4)
read.table() to load a data set
read.csv()
write.table() to export data
Auto=read.table("Auto.data")
fix(Auto) to view the data frame in a spreadsheet-like window (similar to Excel)
Auto=read.table("Auto.data",header=T,na.strings="?")
fix(Auto)
na.strings to point out which string marks a missing value
names() to check the variable names
attach(Auto) to make the variables in this data frame available by name
A small p-value indicates that there is an association between the predictor and the
response.
We reject the null hypothesis, that is, we declare a relationship to exist between X
and Y, if the p-value is small enough.
Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%.
RSE is an estimate of the standard deviation of the error term ε
R^2 statistic provides an alternative measure of fit
R^2= 1- RSS/TSS
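The formula above can be checked against R's own computation on toy data (the simulated x and y here are only for illustration):

```r
# R^2 = 1 - RSS/TSS, verified against summary(lm)$r.squared.
set.seed(1)
x <- rnorm(40)
y <- 2 + 3 * x + rnorm(40)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```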
Carefully explain the differences between the KNN classifier and KNN
regression methods.
The KNN classifier and KNN regression methods are closely related in form.
However, the final result of the KNN classifier is a classification output for Y
(qualitative), whereas KNN regression predicts a quantitative value for f(X).
Clustering:
Finding subgroups or clusters in a data set.
K-means clustering:
Partitioning a data set into K distinct, non-overlapping clusters:
1. Randomly assign a number, from 1 to K, to each of the observations
2. Iterate until the cluster assignments stop changing:
a. For each of the K clusters, find the centroid
b. Assign each observation to the cluster whose centroid is closest
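The two iterated steps above are what kmeans() performs internally; a minimal sketch on synthetic data with two well-separated groups:

```r
# K-means on two synthetic clusters; nstart = 20 runs 20 random starts
# and keeps the assignment with the lowest within-cluster sum of squares.
set.seed(2)
x <- matrix(rnorm(100 * 2), ncol = 2)  # 100 points in 2 dimensions
x[1:50, ] <- x[1:50, ] + 4             # shift first half to form a second group

km.out <- kmeans(x, 2, nstart = 20)
km.out$cluster[1:5]    # cluster label for the first few observations
km.out$tot.withinss    # total within-cluster sum of squares (minimized)
```

Because the true groups are far apart here, the recovered labels match the construction: the first 50 points land in one cluster, the rest in the other.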