
Uploaded by kookmasteraj

Data mining


data if it were a signal.

Evans' Rule (conservative): n/p ≥ 10 (at least 10 observations per predictor)

Doane's Rule (relaxed): n/p ≥ 5 (at least 5 observations per predictor)

Standardize the data

Training Partition (typically the largest partition) contains the data used to build the various models we are examining. The same training partition is generally used to develop multiple models.

Validation Partition (sometimes called the test partition) is used to assess the performance of each model so that you can compare models and pick the best one.

Test Partition (sometimes called the holdout or evaluation partition) is used if we need to assess the performance of the chosen model with new data.
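
The three partitions can be sketched in a few lines of Python. This is only an illustration: the function name `partition`, the 60/20/20 split, and the fixed seed are assumptions, not values from the notes.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=1):
    """Shuffle records and split them into training, validation, and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                 # used to build the models
    valid = shuffled[n_train:n_train + n_valid]  # used to compare models
    test = shuffled[n_train + n_valid:]        # held out for the chosen model
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Every record lands in exactly one partition, so the validation and test records never influence model fitting.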

Simple linear regression: Regression analysis involving one independent variable (X) and one dependent variable (Y) in which the relationship between the variables is approximated by a straight line.

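Mean square error (MSE) = SSE/(n-k-1). Its square root is the Residual Standard Error (#8 on the table on the next page). Mean square regression (MSR) = SSR/k (k = # of predictors).

Coefficient of determination: $R^2 = \mathrm{SSR}/\mathrm{SST}$, the proportion of the variation in the response variable that is explained by the estimated regression equation.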

To test for a significant regression relationship, conduct a hypothesis test to determine whether the value of $\beta_1$ is 0.

$H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$

Reject $H_0$ if p-value < $\alpha$.

Test statistic for hypothesis tests about the slope: $t = b_1 / se(b_1)$, where $se(b_1)$ is the standard error of $b_1$. Note: the degrees of freedom are n-2.
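
As a rough illustration of the slope test statistic, the sketch below computes b1, its standard error, and t = b1/se(b1) from scratch. The data are made up and the function name `slope_t_statistic` is hypothetical. Turning t into a p-value needs a t distribution with n-2 degrees of freedom, which is left out here.

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, se_b1, t) for the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = y_bar - b1 * x_bar             # estimated intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))        # residual standard error, n - 2 degrees of freedom
    se_b1 = s / math.sqrt(sxx)          # standard error of the slope
    return b1, se_b1, b1 / se_b1

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b1, se_b1, t = slope_t_statistic(x, y)
print(b1, se_b1, t)
```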

The coded variables are called dummy variables. If a categorical variable has m levels, then we have to introduce m-1 dummy variables in the model.

Interpretation of the $\beta$s, with B as the base level: $\beta_0 = \mu_B$ (mean of the base level); $\beta_1 = \mu_A - \mu_B$; $\beta_2 = \mu_C - \mu_B$.
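
A small sketch of the m-1 coding may help. It assumes a three-level variable with levels A, B, and C and B as the base level; the helper name `dummy_code` is hypothetical.

```python
def dummy_code(levels, base):
    """Map each level of a categorical variable to its m - 1 dummy (0/1) values."""
    distinct = sorted(set(levels))
    others = [lv for lv in distinct if lv != base]   # one dummy per non-base level
    return {lv: [1 if lv == o else 0 for o in others] for lv in distinct}

codes = dummy_code(["A", "B", "C", "A", "C"], base="B")
# The base level B is coded as all zeros; A and C each switch on one dummy.
print(codes)
```

With m = 3 levels there are two dummy columns, and the base level is the row where both are 0.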

1. Backward elimination (Backward stepwise)

Variable selection can lead to Type I errors (including some unimportant independent variables in the model) or Type II errors (eliminating some important independent variables). $R_a^2$, the adjusted coefficient of determination, penalizes the inclusion of useless predictors.

Regression selection criteria: $R^2$, $R_a^2$, and $C_p$. Want the smallest $C_p$.

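$R^2 = \mathrm{SSR}/\mathrm{SST} = 1 - \mathrm{SSE}/\mathrm{SST} = 1 - (n-k-1)\,\mathrm{MSE}/\mathrm{SST}$

$R_a^2 = 1 - (n-1)(\mathrm{MSE}/\mathrm{SST}) = 1 - [(n-1)/(n-k-1)](1 - R^2)$

$C_p = \mathrm{SSE}_k/\mathrm{MSE}_L + 2(k+1) - n$, where $\mathrm{SSE}_k$ is the SSE of the candidate model with k predictors and $\mathrm{MSE}_L$ is the MSE of the largest model.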

$R^2 \geq R_a^2$, and for poor-fitting models $R_a^2$ may be negative. Choose the model with the highest adjusted R-squared.
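
The relationship between the two measures can be checked numerically. The sketch below uses made-up sums of squares, with n observations and k predictors; the function names are illustrative.

```python
def r_squared(sse, sst):
    """Coefficient of determination: 1 - SSE/SST."""
    return 1 - sse / sst

def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r_squared(sse, sst))

# Made-up sums of squares for illustration only.
r2 = r_squared(sse=20.0, sst=100.0)
ra2 = adjusted_r_squared(sse=20.0, sst=100.0, n=25, k=4)
# A poor fit with many predictors relative to n can push the adjusted value below zero.
poor = adjusted_r_squared(sse=99.0, sst=100.0, n=10, k=5)
print(r2, ra2, poor)
```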

If we reject the null hypothesis, there is enough evidence to support that at least one of the coefficients is nonzero. The overall model appears to be statistically useful for predicting y.

If we cannot reject the null hypothesis, there is not enough evidence to support that at least one of the coefficients is nonzero. The overall model does not appear to be statistically useful for predicting y.

In k-fold cross-validation, all observations in the dataset are eventually used for both training and testing, and each observation is used for validation exactly once.
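
The "validated exactly once" property can be seen in a minimal index-splitting sketch. The function name `k_fold_indices` and the simple round-robin fold assignment are assumptions made for illustration; in practice the records are usually shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) pairs; each index appears in exactly one validation fold."""
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin assignment to k folds
    for i, valid in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid

splits = list(k_fold_indices(n=10, k=5))
for train_idx, valid_idx in splits:
    print(valid_idx, train_idx)
```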

The idea in k-nearest neighbor methods is to identify k records in the training dataset that are similar to the new record that we wish to classify. We then use these similar (neighboring) records to classify the new record into a class, assigning the new record to the predominant class among these neighbors.

If we choose k too low, we may be fitting to the noise of the data.

If we choose k too high, we will miss out on the method's ability to capture the local structure in the data.

If k = n, we simply assign all records to the majority class in the training data.

Typically, values of k fall in the range of 1-20.

An odd number is normally chosen to avoid ties.
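
A minimal sketch of the majority-vote idea follows, assuming Euclidean distance and made-up records. The function name `knn_classify` and the class labels are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, new_record, k):
    """train is a list of (features, label) pairs; return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training records for illustration only.
train = [((1.0, 1.0), "owner"), ((1.2, 0.8), "owner"), ((0.9, 1.1), "owner"),
         ((3.0, 3.2), "nonowner"), ((3.1, 2.9), "nonowner")]

print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (3.0, 3.0), k=3))
# With k equal to the number of training records, every new record gets the majority class.
print(knn_classify(train, (3.0, 3.0), k=len(train)))
```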

We partition the data into training data and validation data. For example, 18 data points for the training data and 6 data points for the validation data.

Use the training data to classify the records in the validation data, then compute the error rates for various choices of k.

Perform k-fold cross-validation and record the error rate for various choices of k.

Notes - The default cutoff value is 0.5, but it can be set differently. k-NN can be used to classify a response with more than two classes. It can also be applied to a numerical response.

Steps

The first step, determining neighbors by computing distances, remains unchanged.

The second step is modified such that we take the average response value of the k nearest neighbors to determine the prediction.

The best k can be determined by a measure other than the misclassification rate.
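
For a numerical response, the only change from classification is the second step. A minimal sketch, assuming Euclidean distance and made-up data; the function name `knn_predict` is hypothetical.

```python
import math

def knn_predict(train, new_record, k):
    """train is a list of (features, response) pairs; return the mean response of the k nearest."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_record))[:k]
    return sum(y for _, y in nearest) / k

# Made-up training records with one predictor each.
train = [((1.0,), 10.0), ((2.0,), 12.0), ((3.0,), 20.0), ((10.0,), 50.0)]
pred = knn_predict(train, (2.1,), k=3)
print(pred)
```

Here the three nearest records have responses 12, 20, and 10, so the prediction is their average.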

Advantages - It is simple. It does not require parametric assumptions. It performs well with a large enough training set.

Shortcomings - The time to find the nearest neighbors in a large training set can be prohibitive. The number of records required in the training set to qualify as large increases exponentially with the number of predictors p.

k-NN only works with quantitative variables.

Misclassification rate = misclassified records divided by all records: error = (b+c)/(a+b+c+d), where b and c are the misclassified counts in the confusion matrix and a and d are the correctly classified counts.

Accuracy = 1 - error = (a+d)/(a+b+c+d)
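
The two formulas can be checked with a made-up confusion matrix. The cell names a, b, c, d follow the formulas above, with a and d as the correct counts; the function name is illustrative.

```python
def error_and_accuracy(a, b, c, d):
    """Return (misclassification rate, accuracy) from the four confusion-matrix counts."""
    total = a + b + c + d
    error = (b + c) / total      # misclassified records over all records
    return error, 1 - error

# Made-up counts for illustration only.
error, accuracy = error_and_accuracy(a=50, b=10, c=5, d=35)
print(error, accuracy)
```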

If the dataset is too small to partition, partitioning may yield unstable results.

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.

If the p-value is greater than alpha (the level of significance), we fail to reject the null hypothesis; if it is less than alpha, we reject it.

1-Residuals = The residuals are the differences between the actual values of the variable you're predicting and the predicted values.

2-Significance Stars = Shorthand for significance levels.

3-Estimated Coefficient = The estimated coefficient is the value of the slope calculated by the regression.

4-Standard Error of the Coefficient Estimate = Measure of the variability in the estimate for the coefficient. Lower means better, but this number is relative to the value of the coefficient.

5-T value of coefficient estimate = Score that measures whether or not the coefficient for this variable is meaningful for the model. You probably won't use this value itself, but know that it is used to calculate the p-value and the significance levels.

6-Variable P Value = Probability of seeing a t value this extreme if the variable were NOT relevant (that is, if its true coefficient were zero). You want this number to be as small as possible.

8-Residual Std Error/Degrees of Freedom = The Residual Std Error is just the standard deviation of your residuals. The Degrees of Freedom is the difference between the number of observations included in your training sample and the number of variables used in your model (the intercept counts as a variable).

9-R2 = Metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best. Corresponds with the amount of variability in what you're predicting that is explained by the model.

10-F-Statistic & resulting p-value = Performs an F-test on the model. This takes the parameters of our model (in our case we only have 1) and compares it to a model that has fewer parameters. In theory, the model with more parameters should fit better. If the model with more parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (the boost is probably NOT significant). If the model with more parameters is better than the model with fewer parameters, you will have a lower p-value.

N=number of records

CP is the complexity parameter. Any split that does not decrease the overall lack of fit by a factor of CP is not attempted (default is 0.01).

Relative Error is found from the whole dataset and it is always the same for different runs.

Xerror can change for different runs

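Error rate (whole dataset) = the root node error times the relative error

The root node error is reported as a fraction: the smaller number is the count of misclassified records and the larger number is the total number of records.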

10-fold CV error rate = root node error times the xerror
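
The two conversions are simple multiplications. The sketch below uses made-up values; the argument names mirror the quantities named in these notes and are not read from any real output.

```python
def absolute_error_rates(root_node_error, rel_error, xerror):
    """Convert relative errors to absolute error rates by scaling with the root node error."""
    whole_dataset_rate = root_node_error * rel_error   # error rate on the whole dataset
    cv_rate = root_node_error * xerror                 # cross-validation error rate
    return whole_dataset_rate, cv_rate

# Made-up values: 40 of 100 records misclassified at the root node.
whole_rate, cv_rate = absolute_error_rates(root_node_error=40 / 100, rel_error=0.5, xerror=0.6)
print(whole_rate, cv_rate)
```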

Prune Tree

Step 1: Find the best subtree of each size (1, 2, 3, ...).

Step 2: Pick the tree in the sequence that gives the smallest misclassification error in the validation set.

The idea behind pruning is to recognize that a very large tree is likely to be overfitting the training data and that the weakest branches, which hardly reduce the error rate, should be removed.

The tree method can also be used for numerical response variables (regression tree).

Regression Tree

Both the principle and the procedure are the same. There are three details that are different from the classification tree.

(i) Prediction

(ii) Impurity measures

(iii) Evaluating performance

Logistic Regression

A logistic regression model explains the relationship between a binary response and predictors using a logit link function.

Y is used to represent the binary response.

P(Y = 1) or p is the probability of belonging to class 1

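$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$

$\text{Odds} = \frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}$

$p = \frac{\text{Odds}}{1 + \text{Odds}} = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}$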

Odds

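If $x_j$ increases by 1 unit, then the odds change by $(e^{\beta_j} - 1)(100)\%$ (holding all other predictors constant).

Odds ratio

$\frac{e^{\beta_0 + \beta_{CD}(1) + \dots + \beta_k x_k}}{e^{\beta_0 + \beta_{CD}(0) + \dots + \beta_k x_k}} = e^{\beta_{CD}}$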
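
The link between probability, odds, and the odds ratio can be verified numerically. The coefficients below are made up and the function names are illustrative. The sketch assumes one binary predictor, so the odds ratio compares x = 1 against x = 0.

```python
import math

def probability(betas, x):
    """betas = [b0, b1, ..., bk]; x = [x1, ..., xk]. Return P(Y = 1)."""
    linear = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    odds_value = math.exp(linear)
    return odds_value / (1 + odds_value)

def odds(p):
    """Odds of belonging to class 1."""
    return p / (1 - p)

def percent_change_in_odds(beta_j):
    """Percent change in the odds when x_j increases by 1 unit, other predictors held constant."""
    return (math.exp(beta_j) - 1) * 100

betas = [-1.0, 0.5]                     # made-up intercept and slope
p0 = probability(betas, [0])
p1 = probability(betas, [1])
odds_ratio = odds(p1) / odds(p0)        # equals exp(0.5) up to rounding
print(p0, p1, odds_ratio, percent_change_in_odds(0.5))
```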

Association Rules

Support

$s = \frac{\text{no. transactions that include both condition and result item sets}}{\text{the total number of records}}$

Confidence

$c = \frac{\text{no. transactions that include both condition and result item sets}}{\text{no. transactions with condition item sets}} = P(\text{result} \mid \text{condition}) = \frac{P(\text{condition and result})}{P(\text{condition})}$

Lift ratio

$\text{Lift ratio} = \frac{\text{confidence}}{P(\text{result})} = \frac{P(\text{condition and result})}{P(\text{condition})\,P(\text{result})}$

A lift ratio greater than 1.0 suggests that there is some usefulness to the rule: the level of association between the condition and result item sets is higher than would be expected if they were independent. The larger the lift ratio, the greater the strength of the association.

The support indicates the rule's impact in terms of overall size. If only a small number of transactions are affected, the rule may be of little use (unless the consequent is very valuable and/or the rule is very efficient in finding it).
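
Support, confidence, and lift ratio can be computed directly from transaction counts. The sketch below uses a made-up set of five transactions; the function name `rule_metrics` and the item names are hypothetical.

```python
def rule_metrics(transactions, condition, result):
    """Return (support, confidence, lift ratio) for the rule condition -> result."""
    n = len(transactions)
    n_cond = sum(1 for t in transactions if condition <= t)             # condition item set present
    n_result = sum(1 for t in transactions if result <= t)              # result item set present
    n_both = sum(1 for t in transactions if (condition | result) <= t)  # both present
    support = n_both / n
    confidence = n_both / n_cond
    lift = confidence / (n_result / n)
    return support, confidence, lift

# Made-up transactions for illustration only.
transactions = [{"red", "white"}, {"red", "white", "green"}, {"red"},
                {"white", "blue"}, {"green"}]
support, confidence, lift = rule_metrics(transactions, {"red"}, {"white"})
print(support, confidence, lift)
```

In this toy data the lift ratio comes out slightly above 1, so "red" and "white" appear together a little more often than independence would predict.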

Cluster Analysis

$d_{ij}$ is a distance metric, or dissimilarity measure, between records i and j.

$(x_{i1}, x_{i2}, \dots, x_{ip})$ is the vector of p measurements for record i.

$(x_{j1}, x_{j2}, \dots, x_{jp})$ is the vector of p measurements for record j.

The following properties are required.

Nonnegative: $d_{ij} \geq 0$.

Self-Proximity: $d_{ii} = 0$.

Symmetry: $d_{ij} = d_{ji}$.

Triangle Inequality: $d_{ij} \leq d_{ik} + d_{kj}$.
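
As an illustration, the four properties can be checked for Euclidean distance on a few made-up records. Euclidean distance is an assumption here, chosen as one common metric that satisfies all four.

```python
import math
from itertools import product

# Made-up records with p = 2 measurements each.
records = [(1.0, 2.0), (4.0, 6.0), (0.0, 0.0)]

def d(i, j):
    """Euclidean distance between records i and j."""
    return math.dist(records[i], records[j])

idx = range(len(records))
tol = 1e-12  # slack for floating-point rounding in the triangle check
nonnegative = all(d(i, j) >= 0 for i, j in product(idx, repeat=2))
self_proximity = all(d(i, i) == 0 for i in idx)
symmetry = all(d(i, j) == d(j, i) for i, j in product(idx, repeat=2))
triangle = all(d(i, j) <= d(i, k) + d(k, j) + tol for i, j, k in product(idx, repeat=3))
print(nonnegative, self_proximity, symmetry, triangle)
```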

std deviation

Interpretation

We explore the characteristics of each cluster by

a. Obtaining summary statistics from each cluster on each measurement that was used in the cluster analysis

b. Examining the clusters for the presence of some common feature (variable) that was not used in the cluster analysis

c. Cluster labeling: based on the interpretation, trying to assign a name or label to each cluster.

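AIC = L + 2k, where L = residual deviance (-2 × log-likelihood for a binary logistic model) and k = # of parameters. The residual deviance is usually given in the output, so it is used as is rather than doubled.

Lower AIC and BIC are better.

AIC indicates how good the estimates that maximize the chance of obtaining the data are.

AIC gives a penalty for a higher # of predictors.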
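
A small sketch of the comparison, under the assumption that the residual deviance already equals -2 times the log-likelihood (as it does for a binary logistic model fit to ungrouped data). The deviances and parameter counts are made up.

```python
def aic(residual_deviance, k):
    """AIC from the residual deviance and the number of estimated parameters k."""
    return residual_deviance + 2 * k

# Made-up models: the larger one fits slightly better but pays a bigger penalty.
small_model = aic(residual_deviance=120.0, k=3)
large_model = aic(residual_deviance=118.5, k=5)
preferred = "small" if small_model < large_model else "large"
print(small_model, large_model, preferred)
```

The extra two parameters lower the deviance by only 1.5 but add 4 to the penalty, so the smaller model has the lower AIC.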

- Holding the other variables constant, the (response variable) is more/less likely to be in class 1 if the variable is 1.

Root node error × xerror = cross-validation error rate

function of the combined set^then lowest of the new numbers after function

Splitting values: order the values from lowest to highest, then take the halfway point of each consecutive pair going down the row.

(Different Note)

Rel. Error = relative error, or misclassification, for the tree at that stage. To convert it to absolute error, multiply by the root node error.

Yes, take left number add all..no takes right
