1. Regression
2. Classification
3. Clustering
Regression:
1. Linear regression
a. Simple linear regression (lm.fit = lm(medv ~ lstat, data = Boston))
b. Multiple linear regression (lm.fit = lm(medv ~ ., data = Boston))
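The two lm() calls above can be run end to end; here a small synthetic data frame stands in for the Boston data set (the column names lstat, age, medv mirror the book's example but the values are made up):

```r
# Simple vs. multiple linear regression on a toy stand-in for Boston.
set.seed(1)
df <- data.frame(lstat = runif(50, 1, 30),   # % lower-status population (synthetic)
                 age   = runif(50, 10, 90))  # home age (synthetic)
df$medv <- 30 - 0.8 * df$lstat + 0.05 * df$age + rnorm(50)  # response with noise

lm.fit  <- lm(medv ~ lstat, data = df)  # simple: one predictor
lm.fit2 <- lm(medv ~ ., data = df)      # multiple: "." means all other columns

coef(lm.fit)                 # intercept and lstat slope
summary(lm.fit2)$r.squared   # fit of the multiple model
```

The `~ .` shorthand is what makes the multiple-regression call so compact: every column except the response is used as a predictor.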
Classification:
1. Logistic regression
(glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket, family = binomial))
2. LDA (Linear discriminant analysis) (lda.fit = lda(Direction ~ Lag1 + Lag2,
data = Smarket, subset = train))
3. QDA (Quadratic discriminant analysis)
(qda.fit = qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train))
4. KNN (K-Nearest Neighbor) (knn.pred = knn(train.X, test.X, train.Direction, k = 1))
5. Decision trees (tree.carseats = tree(High ~ . - Sales, Carseats, subset = train))
a. Regression trees
b. Classification trees
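The logistic-regression call above can be sketched end to end; here a synthetic data frame stands in for the Smarket data (Lag1, Lag2, and Direction are assumed column names matching the book's example, and the values are simulated):

```r
# Minimal logistic-regression sketch on a toy stand-in for Smarket.
set.seed(1)
n <- 200
Lag1 <- rnorm(n); Lag2 <- rnorm(n)
# Direction depends on the lags plus noise, so the model has signal to find.
Direction <- factor(ifelse(Lag1 + Lag2 + rnorm(n) > 0, "Up", "Down"))
d <- data.frame(Direction, Lag1, Lag2)

glm.fit   <- glm(Direction ~ Lag1 + Lag2, data = d, family = binomial)
glm.probs <- predict(glm.fit, type = "response")      # P(Direction = "Up")
glm.pred  <- ifelse(glm.probs > 0.5, "Up", "Down")    # 0.5 decision threshold
mean(glm.pred == d$Direction)                         # training accuracy
```

With a factor response, glm() models the probability of the second factor level ("Up" here, since levels sort alphabetically), which is why predict(type = "response") is read as P(Up).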
Examples:
Stock market price direction (prediction); response: up, down; input: yesterday's
price movement (% change), the two previous days' price movements (% change), etc.
Illness classification (inference); response: ill, healthy; input: resting
heart rate, resting breath rate, mile run time.
Clustering:
1. K-means (km.out = kmeans(x, 2, nstart = 20))
2. Hierarchical (hc.complete = hclust(dist(x), method = "complete"))
x<-c(1,2,3,4)
Length: length(x)
ls() list out all of the objects (data and functions) that we have saved so far
rm() to remove the saved objects
rm(list=ls()) to remove all saved objects
matrix() to create a matrix
x= matrix(data=c(1,2,3,4),nrow=2, ncol=2)
or
x=matrix(c(1,2,3,4),2,2)
Output (filled column by column):
     [,1] [,2]
[1,]    1    3
[2,]    2    4
X=matrix(c(1,2,3,4), 2, 2, byrow=TRUE)
Output (filled row by row):
     [,1] [,2]
[1,]    1    2
[2,]    3    4
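The column-major vs. row-major filling above can be checked directly:

```r
# matrix() fills column by column unless byrow = TRUE.
x <- matrix(c(1, 2, 3, 4), 2, 2)               # columns: (1,2) then (3,4)
X <- matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE) # rows: (1,2) then (3,4)
x[1, 2]  # 3: first row, second column of the column-filled matrix
X[1, 2]  # 2: same position in the row-filled matrix
```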
cor(x,y) to compute the correlation between x and y
set.seed(1) to generate same set of random numbers
mean() to find mean
var() to find variance
sd() to find standard deviation = square root of variance
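A quick check of the three summary functions on a small vector (note var() and sd() use the sample versions, dividing by n-1):

```r
y <- c(2, 4, 6, 8)
mean(y)  # 5
var(y)   # sample variance: sum of squared deviations / (n - 1) = 20/3
sd(y)    # standard deviation = sqrt(var(y))
```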
plot(x,y), plot(x,y,xlab=,ylab=,main=) to produce a scatterplot of x versus y (with axis labels and a title)
pdf() to create pdf
jpeg() to create a jpeg
dev.off() to indicate the plot end
seq(a,b), seq(a,b,length=n) to create a sequence
image() and contour() to produce a color-coded plot, i.e. a heatmap
persp() to produce 3-d plot
A=matrix(1:16,4,4)
A[2,3] 2nd row 3rd column
A[c(1,3),c(2,4)] 1st row and 3rd row, 2nd and 4th column
A[1:3,2:4] 1st,2nd,and 3rd row and 2nd,3rd,and 4th column
A[1:2,] first 2 rows and all columns
A[,1:2] all rows and first 2 columns
A[-c(1,3),] 2nd and 4th row and all columns
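The indexing rules above can be verified on the same A (remember matrix() fills column by column, so A[2,3] is the 10th stored value):

```r
A <- matrix(1:16, 4, 4)    # filled column by column
A[2, 3]                    # 10: 2nd row, 3rd column
A[c(1, 3), c(2, 4)]        # rows 1 & 3 crossed with columns 2 & 4
A[-c(1, 3), ]              # negative index drops rows 1 and 3
dim(A)                     # 4 4
```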
dim(A) matrix dimension (rows - 4, cols - 4)
read.table() to load a data set
read.csv()
write.table() to export data
Auto=read.table("Auto.data")
fix(Auto) to view the data frame in a spreadsheet-like window (similar to Excel)
Auto=read.table("Auto.data",header=T,na.strings="?")
fix(Auto)
na.strings to point out which string marks a missing value
names() to check the variable names
attach(Auto) to make the variables in this data frame available by name
A small p-value indicates that there is an association between the predictor and the
response.
We reject the null hypothesis, that is, we declare a relationship to exist between X
and Y, if the p-value is small enough.
Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%.
RSE is an estimate of the standard deviation of the error term ε
R^2 statistic provides an alternative measure of fit
R^2= 1- RSS/TSS
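The formula above can be checked against R's own computation on toy data (the simulated x and y here are only for illustration):

```r
# R^2 = 1 - RSS/TSS, verified against summary(lm)$r.squared.
set.seed(1)
x <- rnorm(40)
y <- 2 + 3 * x + rnorm(40)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```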
Carefully explain the differences between the KNN classifier and KNN
regression methods.
The KNN classifier and KNN regression methods are closely related in form.
However, the final result of the KNN classifier is a classification output for Y
(qualitative), whereas KNN regression predicts a quantitative value for f(X).
Clustering:
Finding subgroups or clusters in a data set.
K-means clustering:
Partitioning a data set into K distinct, non-overlapping clusters:
1. Randomly assign a number, from 1 to K, to each of the observations
2. Iterate until the cluster assignments stop changing:
a. For each of the K clusters, find the centroid
b. Assign each observation to the cluster whose centroid is closest
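The two iterated steps above are what kmeans() performs internally; a minimal sketch on synthetic data with two well-separated groups:

```r
# K-means on two synthetic clusters; nstart = 20 runs 20 random starts
# and keeps the assignment with the lowest within-cluster sum of squares.
set.seed(2)
x <- matrix(rnorm(100 * 2), ncol = 2)  # 100 points in 2 dimensions
x[1:50, ] <- x[1:50, ] + 4             # shift first half to form a second group

km.out <- kmeans(x, 2, nstart = 20)
km.out$cluster[1:5]    # cluster label for the first few observations
km.out$tot.withinss    # total within-cluster sum of squares (minimized)
```

Because the true groups are far apart here, the recovered labels match the construction: the first 50 points land in one cluster, the rest in the other.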