Advanced Statistical Techniques 1 (AST-1)
Shankar Venkatagiri, QMIS

The QM core and a course on linear algebra are adequate prerequisites for this course. Here is an attempt to recap the basics.

As you know, statistics helps us examine phenomena whose outcomes are uncertain. This uncertainty comes in various shapes. While there may be several thousand employees working at a firm, their basic salaries tend to occupy a fixed number of slots. This scenario is modelled by a random variable, call it S, that "maps" each employee to a real number reflecting a salary. The histogram of the variable depicts the underlying distribution. The values that S assumes occupy the x-axis, and their corresponding frequencies are drawn as bars on the y-axis. This distribution is used to calculate parameters such as the average salary, the deviation in pay, the middle 50%, the percentiles, and so on.

Students begin their statistical journey by studying "canonical" random variables, such as the Binomial, Poisson, Normal, and Exponential. Their distributions can be harnessed to model real-world phenomena. For example, the number of customer arrivals at a store, recorded over fixed time intervals, can be fitted to a Poisson distribution, whose parameter ($\lambda$) is derived from the arrivals data-set. A chi-square distribution is then used to assess the goodness of this fit.

The building block for all statistics is the random sample, which is a representative set of observations of a random variable X. Suppose a journalist wants to investigate the debts of fresh MBA graduates, and ascertain their average debt level. Because it is impossible to poll every fresh MBA, her conclusion may not be precise. She could hazard a guess by surveying a (limited) sample of fresh MBAs, recording their outstanding loan amounts plus credit card balances, and averaging the observations. Suppose our journalist samples 30 graduates. Her one set of observations $(x_1, x_2, \dots, x_{30})$ may provide a decent estimate for the actual average debt level of fresh MBAs. It's just an estimate, not the actual value!
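The Poisson fit and chi-square check described in this recap can be sketched in a few lines. This is a minimal illustration, not the course's prescribed method: the arrival counts are made-up data, and the names `poisson_pmf` and `poisson_chi_square`, as well as the choice to pool counts of 3 or more into one bin, are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical arrival counts for 30 fixed-length intervals (made-up data).
arrivals = [0, 1, 2, 1, 3, 2, 0, 1, 4, 2,
            1, 0, 2, 3, 1, 2, 1, 0, 5, 2,
            1, 2, 3, 1, 0, 2, 1, 3, 2, 1]


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_chi_square(counts, max_bin):
    """Fit a Poisson by its sample mean and compute the chi-square statistic.

    Counts of max_bin or more are pooled into one final bin.
    Returns (lambda_hat, chi_square, degrees_of_freedom).
    """
    n = len(counts)
    lam = sum(counts) / n  # estimate of lambda from the data
    observed = Counter(min(c, max_bin) for c in counts)
    chi_square = 0.0
    for k in range(max_bin + 1):
        if k < max_bin:
            prob = poisson_pmf(k, lam)
        else:
            # Tail probability P(X >= max_bin) for the pooled bin.
            prob = 1.0 - sum(poisson_pmf(j, lam) for j in range(max_bin))
        expected = n * prob
        chi_square += (observed[k] - expected) ** 2 / expected
    # Number of bins, minus 1, minus 1 more for the estimated parameter.
    dof = (max_bin + 1) - 1 - 1
    return lam, chi_square, dof


lam_hat, chi_sq, dof = poisson_chi_square(arrivals, max_bin=3)
print(f"lambda = {lam_hat:.3f}, chi-square = {chi_sq:.3f}, dof = {dof}")
```

The statistic would then be compared with a chi-square critical value for the computed degrees of freedom (about 5.991 at the 5% level for 2 degrees of freedom); a value below the critical value gives no reason to reject the Poisson fit.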

Any real-valued function over a sample is termed a statistic. Typically, a statistic is used to estimate a population's parameter. Take the sample mean, $\bar{X} = (X_1 + X_2 + \dots + X_n)/n$, which can be used to estimate the population mean of the random variable X. For this choice of statistic, we have a powerful result in the Central Limit Theorem (CLT).
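A small simulation can make the behaviour of the sample mean concrete. The sketch below is only an illustration under stated assumptions: the population is taken to be Exponential with rate 1 (mean 1, standard deviation 1, strongly skewed), the sample size of 30 and the 5000 repetitions are arbitrary choices, and `simulate_sample_means` is a hypothetical helper name.

```python
import math
import random
import statistics


def simulate_sample_means(sample_size, repeats, seed=1):
    """Draw `repeats` samples from an Exponential(1) population and
    return the mean of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(repeats)
    ]


means = simulate_sample_means(sample_size=30, repeats=5000)

centre = statistics.fmean(means)       # should sit near the population mean, 1
spread = statistics.stdev(means)       # should sit near sigma / sqrt(n)
theory_spread = 1.0 / math.sqrt(30)

# Share of sample means within 1.96 standard deviations of the centre;
# for a Normal distribution this share is about 95%.
inside = sum(abs(m - centre) <= 1.96 * spread for m in means) / len(means)

print(f"mean of sample means: {centre:.3f}")
print(f"sd of sample means:   {spread:.3f} (theory: {theory_spread:.3f})")
print(f"share within 1.96 sd: {inside:.3f}")
```

Even though individual Exponential draws are far from bell-shaped, the sample means cluster around the population mean with a spread close to $\sigma/\sqrt{n}$.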

The CLT tells us that for sufficiently large sample sizes, the distribution of sample means for a random variable X (computed across multiple samples, of course) is approximately Normal, regardless of how X is distributed, provided X has a finite variance. Furthermore, the mean parameter for X can be approximately determined by averaging a (smaller) sample. A confidence interval may be built around this point estimate; this interval specifies a range of values that the mean can plausibly assume. This is because different samples shall produce different average values. Normality helps us contain the bounds of this interval.

On the flip side, the CLT helps us to validate statistical hypotheses we make about the population, such as $H_0: \mu = 27{,}300$ versus $H_a: \mu < 27{,}300$. The null hypothesis is rejected if sample evidence swings "significantly" in favour of $H_a$, i.e. the sample mean is considerably smaller than 27,300. A hypothesis test checks if the sample evidence produces a statistic that lies beyond critical values, which are specified based on the desired confidence level. Other parameters like the population variance $\sigma^2$ are estimated with a suitable sample statistic ($s^2$), whose sampling distribution (Chi-square) is known a priori.

Next, the student goes on to compare and contrast two or more populations with techniques such as the t-test and ANOVA. Chi-square tests are employed to detect dependencies between two or more variables. The QM core concludes with a discourse on simple linear regression, which helps characterize hidden relationships between two random variables.

AST-1 Outline

AST-1 builds upon the conceptual foundations laid by QM, and will help the student analyse much more complex, multivariate datasets. A researcher may be interested in the influence of multiple (pre-determined) factors on the price of a house. Each factor is considered a random variable, and is termed an explanatory variable. These factors can be of any scale: nominal, ordinal, interval or ratio. The house price is termed the dependent or response variable. Given supporting conditions, we can base our price forecast on data from a single (random) sample of houses that have been sold. One method we can use is multiple regression, which is an extension of the simple regression model to many variables.
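To make the idea of multiple regression concrete, here is a minimal least-squares sketch using only the Python standard library. Everything in it is an assumption for illustration: the two explanatory variables (area and rooms), the made-up pricing rule used to generate noise-free prices, and the helper names `solve` and `fit_ols`. It solves the normal equations directly, which is fine for a toy example but is not how a statistics package would typically do it.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] obtained from
    the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of ones
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


# Hypothetical houses: (area, rooms) in arbitrary units.
houses = [(10, 2), (12, 3), (15, 3), (8, 1), (20, 4), (11, 2)]
# Prices generated from a made-up rule so that the fit can be checked.
prices = [50 + 3 * area + 10 * rooms for area, rooms in houses]

intercept, b_area, b_rooms = fit_ols(houses, prices)
print(intercept, b_area, b_rooms)
```

Because the prices here follow the rule exactly, the fitted coefficients recover it up to rounding error; with real data the fit would leave residuals, which is where regression diagnostics come in.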

                  Buyer 1   Buyer 2   ...   Buyer m
Price             .         .         ...   .
Income            .         .         ...   .
...
Marital status    .         .         ...   .
...
Factor n          .         .         ...   .

Understandably, a prudent approach to handle such multivariate datasets is with matrix notation. We will begin the course with a discussion on matrix algebra, as it is applied to statistics. Geometrical transformations such as scaling, projection and rotation can be represented as matrices. These matrices, applied to multivariate data-sets, provide new insights on the structure of the data. Having assimilated the framework of QM, any student faced with the prospect of increased complexity in a multivariate dataset could pose a few "natural" questions:

Can we split up the factor variables as dependent and independent variables?
Are these factors correlated, perhaps via linear combinations of a subset of them?
Can the factors be clustered into different camps?
Are there redundancies within the factors?
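A first, rough probe of the correlation and redundancy questions is a pairwise correlation matrix. The sketch below is an illustration only: the three factors and their values are made up, the third factor is deliberately built as the sum of the other two so that it is redundant, and `pearson` and `correlation_matrix` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical factors for six buyers (made-up numbers).
income = [30, 45, 52, 61, 75, 88]
savings = [5, 9, 4, 12, 10, 20]
factors = {
    "income": income,
    "savings": savings,
    # Built as income + savings, so it carries no new information.
    "total": [i + s for i, s in zip(income, savings)],
}

corr = correlation_matrix(factors)
for name, row in corr.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

A pairwise matrix only flags two-variable relationships; a factor that is a linear combination of several others may not stand out in any single entry.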

These questions and many more will drive the discussions in the AST course.

Textbooks

The primary textbook for the course is Analyzing Multivariate Data, by Lattin, Carroll and Green, which is available in the Indian Edition. A secondary reference could be Multivariate Analysis, 6th Edition, by Hair, Anderson, Tatham and Black. There is a substantial bookshelf of references on this subject at the library.

Methodology

We shall meet three hours a week to discuss the topics. From time to time, the instructor will try to solve problems in a tutorial mode. Assignments may be solved using a package such as R, which is available from http://www.r-project.org/. The session slides and related links shall be uploaded on Moodle.

Schedule

It is advisable for students to come prepared with the assigned readings, which are chapters from the textbook (Lattin et al). The first chapter provides an excellent overview of various MDA methods. Students are advised to read the chapter at their leisure.

Hours  Topic                                                            References
       Random vectors, variance-covariance matrix, geometry of the SVD  Johnson & Wichern: Chapter 3; Lattin et al: Chapter 2
       Multivariate Normal Distribution                                 Johnson & Wichern: Chapter 4
       Regression Analysis & Diagnostics                                Lattin et al: Chapter 3
       Principal Components Analysis                                    Lattin et al: Chapter 4
       Exploratory Factor Analysis                                      Lattin et al: Chapters
       Multidimensional Scaling                                         Lattin et al: Chapter 7
       Cluster Analysis                                                 Lattin et al: Chapter 8
       Canonical Correlation                                            Lattin et al: Chapter 9
       Logit Choice Models                                              Lattin et al: Chapter 13

Grading Policy

There shall be in-class quizzes, an in-class final exam, and a take-home assignment. The grading policy is shown below.

Mid-term             20%
Assignment           20%
Paper presentation   20%
Final Exam           40%

Contact

Shankar Venkatagiri
Assistant Professor, QMIS
Indian Institute of Management Bangalore
Phone: 097422-21347

