
Data Analysis Assignment 2

Introduction

During the last few years, there has been an enormous and exciting development of Activity-Based Computing and Human Activity Recognition. This has been made possible by the advent of miniaturized sensing technology that can be worn directly by individuals. These devices measure variables that capture movement and allow researchers to predict the type of human activity under way. As recent research has shown, human activity can be predicted using a single tri-axial accelerometer [1]. In particular, smartphones equipped with gyroscopes and accelerometers have been used to measure changes in movement parameters experimentally and correlate them with activities, which may allow human activity to be predicted accurately. The relevance of this kind of research lies in the possibility of developing smartphones that anticipate the services their users require.

In this paper we provide a predictive model for human activity, using the Human Activity Recognition Using Smartphones Dataset [2]. These data were collected in an experiment carried out with a group of 30 volunteers aged 19 to 48, during which each person performed six activities: a) walking, b) walking upstairs, c) walking downstairs, d) sitting, e) standing, and f) laying. We used a random forest to detect relevant variables and then built a predictive tree. We also used pruning to obtain a smaller model that is easier to interpret.

Methods

First we renamed the variables to avoid duplicate names. Then we partitioned the data into a Train Set and a Test Set. Our Train Set included the data collected for subjects 1, 3, 5, 6 and 7, and our Test Set included the data collected for subjects 27, 28, 29 and 30. We also used a Validation Set with subjects 8, 9, 10 and 11. We used a combination of predictive methods: random forest, predictive tree and pruning.
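The subject-based partition described in Methods can be sketched as follows. This is a minimal Python illustration, not the code used for the analysis (the output shown in this paper comes from R). The row layout (a list of dicts with a `subject` key), the function name `split_by_subject` and the toy rows are assumptions made for the example; only the subject IDs come from the text.

```python
# Subject IDs for each partition, as stated in the Methods section.
TRAIN_SUBJECTS = {1, 3, 5, 6, 7}
VALIDATION_SUBJECTS = {8, 9, 10, 11}
TEST_SUBJECTS = {27, 28, 29, 30}


def split_by_subject(rows):
    """Assign each row to a partition according to its 'subject' field.

    `rows` is assumed to be a list of dicts with a 'subject' key
    (a hypothetical layout, not the authors' data structure).
    Rows from subjects outside the three groups are left out.
    """
    parts = {"train": [], "validation": [], "test": []}
    for row in rows:
        subject = row["subject"]
        if subject in TRAIN_SUBJECTS:
            parts["train"].append(row)
        elif subject in VALIDATION_SUBJECTS:
            parts["validation"].append(row)
        elif subject in TEST_SUBJECTS:
            parts["test"].append(row)
    return parts


# Toy rows for illustration only (not taken from the dataset).
toy_rows = [
    {"subject": 1, "activity": "walk"},
    {"subject": 8, "activity": "sitting"},
    {"subject": 30, "activity": "laying"},
    {"subject": 15, "activity": "standing"},  # belongs to no partition
]
parts = split_by_subject(toy_rows)
```

Splitting by subject rather than by row keeps all recordings of one person in a single partition, so the model is evaluated on people it has not seen.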
The random forest was developed by Leo Breiman and Adele Cutler [3]. It is a very efficient algorithm that uses model aggregation and ensemble methods for both classification and regression problems. As Genuer et al. explain, the principle of random forests is to combine many binary decision trees, each built on a bootstrap sample drawn from the learning sample L, choosing at random at each node a subset of the explanatory variables X [4].

Results

As a first step in training the predictive model, we fitted a random forest to our Train Set, allowing all variables as predictors. The results were:
randomForest(formula = as.factor(activity) ~ ., data = train.set, proximity = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 23
OOB estimate of error rate: 1.29%
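The two sources of randomness in a random forest (a bootstrap sample per tree and a random subset of variables per node) and the idea behind the out-of-bag (OOB) estimate can be sketched as follows. This is a Python illustration of the sampling steps only, not of the randomForest implementation. The row count 1623 is the training size reported in the tree summaries of this paper and 23 is the number of variables tried at each split; the variable count 561 is an assumption based on the public dataset and is not stated in the text. All function names are hypothetical.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows, in_bag):
    """Rows never drawn into the bootstrap sample of a given tree."""
    drawn = set(in_bag)
    return [i for i in range(n_rows) if i not in drawn]


def candidate_variables(n_variables, mtry, rng):
    """Random subset of variables considered at a single node."""
    return rng.sample(range(n_variables), mtry)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 1623           # training rows, from the tree summaries
n_variables = 561       # assumed number of predictors (not stated in the text)
mtry = 23               # variables tried at each split, from the output above

in_bag = bootstrap_indices(n_rows, rng)
oob = out_of_bag(n_rows, in_bag)
subset = candidate_variables(n_variables, mtry, rng)
```

Each tree is fitted on its in-bag rows, and its out-of-bag rows act as a small hold-out set for that tree. Aggregating those hold-out predictions over all trees gives the OOB error estimate.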

The accuracy of the random forest classification is high, but the result is a complicated model (500 trees, with 23 variables tried at each split) that also carries a risk of overfitting. Therefore, following the advice of the literature [4], we decided to use this tool only to detect relevant variables and then use them in a simpler tree, with fewer variables and easier to interpret. The random forest gives a measure of importance for each variable in the prediction trees it grows, called MeanDecreaseGini. We used this measure to select the most relevant variables: our criterion kept the variables with an importance value of 11 % or more, which resulted in 23 variables to be included in our predictive model. With these selected variables we then fitted a new tree. The results were:
Classification tree using Random Forest to select relevant variables:
tree(formula = as.factor(activity) ~ V42 + V57 + V560 + V41 + V51 + V54 + V53 + V559 + V50 + V58 + V10 + V382 + V505 + V394 + V4 + V390 + V348 + V228 + V232 + V70 + V97 + V97, data = train.set)
Variables actually used in tree construction:
[1] "V382" "V57" "V51" "V505" "V70" "V53" "V58"
Number of terminal nodes: 9
Residual mean deviance: 0.3687 = 595.2 / 1614
Misclassification error rate: 0.06654 = 108 / 1623
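The importance measure used for the selection, MeanDecreaseGini, is built from the drop in Gini impurity that a split produces. The sketch below shows that drop for a single split in Python. It illustrates the underlying quantity, not the exact aggregation performed by the randomForest package. The labels are toy data and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity drop obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Toy labels for illustration only (not taken from the dataset).
parent = ["sitting"] * 4 + ["walk"] * 4

# A split that separates the two activities completely.
perfect = gini_decrease(parent, ["sitting"] * 4, ["walk"] * 4)

# A split that leaves both children as mixed as the parent.
mixed_half = ["sitting", "sitting", "walk", "walk"]
useless = gini_decrease(parent, mixed_half, mixed_half)
```

A variable whose splits tend to produce large drops, like `perfect` here, accumulates a high importance value, while a variable whose splits resemble `useless` contributes almost nothing.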

This tree showed a low error rate (0.066), although higher than the OOB estimate of the random forest (1.29%). After this, we fitted a tree with only the variables that were actually used (seven variables) and pruned it (with best=6) to get the smallest model that fits our data. Our aim was a model that is easier to use and interpret. Our predictive model is as follows:

E(HA) = V382 + V57 + V51 + V505 + V70
E(HA) is the Expected Human Activity
V382 is the Body Acceleration Jerk Band Energy 1,8
V57 is the Gravity Acceleration Energy in X
V51 is the Gravity Acceleration Maximum in Z
V505 is the Body Acceleration Magnitude Median Absolute Deviation
V70 is the Gravity Acceleration Autoregression Coefficient in Y

Figure 1. Predictive Tree for Human Activity Recognition Using Smartphone

As the figure of the tree shows, the Body Acceleration Jerk variable (V382) allows us to differentiate two clusters of activities: on one side standing, sitting and laying, and on the other side walk, walkup and walkdown. Within the left cluster, Gravity Acceleration Energy in X (V57) separates laying from standing and sitting, and the latter two are separated by Gravity Acceleration Maximum in Z (V51). Within the right cluster, the Median Absolute Deviation of Body Acceleration Magnitude (V505) differentiates walkup and walk from walkdown. Then walkup is separated from walk by the Gravity Acceleration Autoregression Coefficient in Y (V70). Once our predictive model was established, we validated it using our Validation Set.
Classification tree:
snip.tree(tree = ValidTree, nodes = 12L)
Number of terminal nodes: 6
Residual mean deviance: 0.289 = 41.33 / 143
Misclassification error rate: 0.04027 = 6 / 149
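The two ratios in the summary above can be read as follows: the residual mean deviance divides the total deviance by the number of observations minus the number of terminal nodes (149 - 6 = 143), while the misclassification error rate divides the number of errors by the number of observations. A short Python check of that arithmetic, using only the figures printed in the summary (the function name is hypothetical):

```python
def residual_mean_deviance(deviance, n_obs, n_terminal_nodes):
    """Total deviance divided by (observations minus terminal nodes)."""
    return deviance / (n_obs - n_terminal_nodes)


# Figures copied from the validation summary above.
n_obs = 149
n_terminal_nodes = 6
deviance = 41.33
n_errors = 6

rmd = residual_mean_deviance(deviance, n_obs, n_terminal_nodes)
error_rate = n_errors / n_obs

print(f"residual mean deviance: {rmd:.4f}")
print(f"misclassification error rate: {error_rate:.5f}")
```

The same relation holds for the other tree summaries in this paper, for example 1623 - 9 = 1614 for the nine-node tree fitted on the Train Set.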

Table 1. Confusion matrix using Validation Set

           Laying  Sitting  Standing  Walk  Walkdown  Walkup
Laying         28        0         0     0         0       0
Sitting         0       23         0     0         0       0
Standing        0        0        26     0         0       0
Walk            0        0         0    23         1       0
Walkdown        0        0         0     1        19       0
Walkup          0        0         0     3         2      24

The misclassification error rate was very low (0.04), so we considered our model validated and proceeded to test it on the actual Test Set.
Classification tree:
snip.tree(tree = TestTree, nodes = c(12L, 11L, 7L))
Variables actually used in tree construction:
[1] "V382" "V57" "V505" "V70"
Number of terminal nodes: 6
Residual mean deviance: 0.5751 = 850.5 / 1479
Misclassification error rate: 0.1051 = 156 / 1485

Table 2. Confusion matrix for Test Set using Predictive Model

           Laying  Sitting  Standing  Walk  Walkdown  Walkup
Laying        293        0         0     0         0       0
Sitting         0      204         0     0         0       0
Standing        0       60       283     0         0       0
Walk            0        0         0   209         3       3
Walkdown        0        0         0     0       189      62
Walkup          0        0         0    20         8     151

The error rate we obtained on the Test Set (0.105) was higher than the error rate on our validation data (0.040). Nevertheless, the error rate is still low, and the model is quite simple, easy to interpret and fast to compute.
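The Test Set error rate can be recomputed directly from the counts in Table 2, which also shows where the errors concentrate. The Python sketch below copies the counts from Table 2. The function names are hypothetical, and the code makes no assumption about which axis holds the predicted class, so it reports confusions simply as (count, row label, column label).

```python
LABELS = ["Laying", "Sitting", "Standing", "Walk", "Walkdown", "Walkup"]

# Counts copied from Table 2, rows and columns in the order of LABELS.
TABLE_2 = [
    [293, 0, 0, 0, 0, 0],
    [0, 204, 0, 0, 0, 0],
    [0, 60, 283, 0, 0, 0],
    [0, 0, 0, 209, 3, 3],
    [0, 0, 0, 0, 189, 62],
    [0, 0, 0, 20, 8, 151],
]


def misclassification_rate(matrix):
    """Share of observations that fall off the main diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


def largest_confusions(matrix, labels, top=3):
    """Largest off-diagonal cells as (count, row label, column label)."""
    cells = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(matrix))
        for j in range(len(matrix))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(cells, reverse=True)[:top]


rate = misclassification_rate(TABLE_2)
worst = largest_confusions(TABLE_2, LABELS)
```

The rate agrees with the 156 / 1485 reported in the tree summary, and most of the errors come from two pairs of classes: Walkdown with Walkup, and Standing with Sitting.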

Conclusions

We were able to construct a predictive model for human activity recognition using only five variables measuring movement from miniaturized sensing technology located in a smartphone. The methods used involved random forest for variable selection, trees for prediction and pruning to reduce the number of variables. Our model has limitations because we did not explore the problem of high correlation among the predictors used.

References

1. Khan, Adil Mehmood. Human Activity Recognition Using a Single Tri-axial Accelerometer. PhD thesis, Kyung Hee University, South Korea, 2011.
2. Anguita, Davide; Ghio, Alessandro; Oneto, Luca; Parra, Xavier and Reyes-Ortiz, Jorge L. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012), Vitoria-Gasteiz, Spain, Dec. 2012.
3. Breiman, Leo. Random Forests. Machine Learning 45 (1): pp. 5-32, 2001.
4. Genuer, Robin; Poggi, Jean-Michel and Tuleau-Malot, Christine. Variable Selection using Random Forests. Pattern Recognition Letters 31 (14): pp. 2225-2236, 2010.
