Professional Documents
Culture Documents
Instructor: E-mail: Room: Class hours: Monday Thursday Friday 10:0011:00 12:0013:00 13:0014:00 LB01 ICT Lab 2 LB107 Brett Houlding brett.houlding@tcd.ie Lloyds Room 129
Homework
There will be one course assignment. This will be compulsory and will count 15% towards your total mark for ST3007. This will require knowledge of the statistics software that will be taught in the laboratory sessions.
Books
There is no compulsory textbook for this course, but the following cover some (or all) of the material: Krzanowski, W. Principles of Multivariate Analysis; Chateld, C. and Collins, A. Introduction to Multivariate Analysis; Lattin, J., Carroll, J. D. and Green, P. E. Analyzing Multivariate Data; Hosmer, D. and Lemeshow, S. Applied Logistic Regression; Everitt, B. Cluster Analysis; Gordon, A. Classication; McLachlan, G. Discriminant Analysis and Statistical Pattern Recognition; Cox, T. and Cox, M. Multidimensional Scaling. Any additional suggestions are welcome.
Multivariate Data
Multivariate data arises when two or more attributes are recorded for each of a set of objects. Example: The heights and the weights of a set of students could be recorded. Example: A bank may record the current account balance, savings account balance and the credit card balance for its customers. In some cases we have only recorded numerical attributes (as above), but in other cases some of the attributes are numerical and some are categorical. Example: In the above banking example, the gender and geographical location of the customers may also be recorded.
Italian Wine
Forina et al. (1986) collected data recording the chemical and physical properties of Barbera, Barola and Grignolino wines. Chemical Properties Alcohol Alkalinity of Ash Flavonoids Color Intensity Proline The aim of the study was to develop a method of classifying wines into type from their chemical properties. Malic Acid Magnesium Nonavonoid Phenols Hue Ash Total Phenols Proanthocyanins OD280/OD315 Of Diluted Wines
10
US Arrests
The number of arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states were recorded.
Murder Assault UrbanPop Rape Alabama 13.2 236 58 21.2 Alaska 10.0 263 48 44.5 Arizona 8.1 294 80 31.0 Arkansas 8.8 190 50 19.5 California 9.0 276 91 40.6 Colorado 7.9 204 78 38.7 Connecticut 3.3 110 77 11.1 .......................................... Wisconsin 2.6 53 66 10.8 Wyoming 6.8 161 60 15.6
11
12
13
Resting Pulse
A researcher is interested in understanding the eect of smoking and weight upon resting pulse rate (Low/High).
PULSE SMOKES WEIGHT 1 Low No 140 2 Low No 145 3 Low Yes 160 4 Low Yes 190 5 Low No 155 6 Low No 165 7 High No 150 8 Low No 190 .................... 9 High No 150 10 Low No 108
14
G-Force Blackout
Military pilots sometimes black out when their brains are deprived of oxygen due to G-forces during violent maneuvers. Glaister and Miller (1990) produced similar symptoms by exposing volunteers lower bodies to negative air pressure, likewise decreasing oxygen to the brain. The data lists the subjects ages and whether they showed syncopal blackout related signs during an 18 minute period. Subject JW JM DT LK JK MK FP DG Age 39 42 20 37 20 21 41 52 BlackOut 0 1 0 1 1 0 1 1
15
Bank Loans
A bank in South Carolina looked at a data set consisting of 750 applications for loans. For each application, the following variables were recorded.
Variable SUCCESS AI XMD DF DR DS DA NNWP NMFI NA Description Was the loan given or not? Applicants (and co-applicants) income Applicants debt minus mortgage payment 1=Female, 0=Male 1=Non white, 0=White 1=Single, 0=Not Single Age of house Percentage non-white in neighbourhood Mean family income in neighbourhood Mean age of houses in neighbourhood
16
17
Graphical Summaries
It is worth plotting your data to look for interesting structure. For two continuous variables, a scatter plot is a good choice.
90 waiting 50 1.5 60 70 80
2.0
2.5
3.0
3.5 eruptions
4.0
4.5
5.0
18
HighJump
24.5
X200m
23.0
LongJump
44
50
Javelin
38
X800m
13.0
13.6
14.2
12
14
16
5.6
6.0
6.4
130
140
150
19
5.6
6.2
12
14
ShotPutt
16
13.0
13.8
Multivariate analysis
Our objective is to: Analyze the internal structure of the data in order to understand and/or explain the data in a reduced number of dimensions, e.g., Principal Components, or Factor Analysis. Assign group membership: 1. supervised methods assume a given structure within the data, e.g., classication, or discriminant analysis. 2. unsupervised methods discover structure from the data alone, e.g., cluster analysis.
20
Data Reduction
Multivariate data can have many variables recorded. Sometimes we can nd a way of producing a reduced number of variables that contain most of the information of the original data. This is the aim of data reduction. Example: In the decathlon (heptathlon) data, the athletes are awarded points. The points are supposed to measure the overall ability of the athlete. Good athletes have high points and poor athletes have low points. The points score is a single number that is used to describe the athletes performance in ten (seven) events. We will look at data-driven methods of reducing the dimensionality of the data that retains much of the information contained in the original data. The method of data reduction used depends on what information we wish to retain.
21
6600
4 Points 6400 17
6200
8 9 5 11 10 73 14 12 18 13 16 15
5800
6000
1 4 3 2 1 PC1 0 1 2
24
Discriminant Analysis/Classication
The olive oil data is for oils collected from three regions (Southern Italy, Sicily, Northern Italy). Can we use the data to construct a rule that would help us to determine where new olive oil samples come from? This is the aim of discriminant analysis (or classication). In fact, we know the origin of the oils to an even greater accuracy (North-Apulia, Calabria, South-Apulia, Sicily, Inland-Sardinia, Coast-Sardinia, East-Liguria, West-Liguria, Umbria). Can we classify the oils to this ne an accuracy?
25
Dimension 2
300
100
100
500
1000
Dimension 1
Dimension 1
We can classify to the three regions with about 100% accuracy and to the nine areas with about 97% accuracy.
26
Cluster Analysis
When doing discriminant analysis, we know that there are groups in the data. We want to be able to classify observations to the known groups. Cluster analysis is used to establish if groups exist in the data. Once the existence of groups is established, we then want to classify observations to groups.
27
2.0
2.5
3.0
3.5 eruptions
4.0
4.5
5.0
28
Hierarchical Clustering
Hierarchical clustering builds a tree, where similar objects are joined low down and less similar objects are joined higher up.
Cluster Dendrogram
50 Height 0 10 30
d
29
k -Means Clustering
k -Means Clustering assigns objects into k groups where the observations within a group are very similar and the groups are very dierent from each other. A two group clustering of the data follows...
90 waiting 50 1.5 60 70 80
2.0
2.5
3.0
3.5 eruptions
4.0
4.5
5.0
30
Logistic Regression
Logistic regression is a special type of regression method where the response variable is a binary variable. Recall that in ordinary regression the response variable is continuous. Example: Recall the bank loan example. We could have a binary variable recording whether the loan was granted or not. Logistic regression could tell us which factors inuence the decision to award a loan or not. For this data, the most important factors were XMD (Applicants debt minus mortgage payment), AI (Applicants income) and DA (Age of house). Logistic regression could be used on the G-Force or Resting Pulse data too.
31