The document describes building a machine learning model to recognize weight lifting exercises from sensor data. Various classification models are trained on a dataset and evaluated using 10-fold cross-validation. The random forest model performs best with an average accuracy of 58.8% and out-of-sample error of 0.41, so it is used to make predictions on a test dataset.
Data Summary

This analysis aims to build a machine learning model that automatically recognizes the activity type from wearable sensor data (such as accelerometer readings). The data comes from a publication that performs a similar analysis [1]. Several types of classification models are fit to the training dataset, and cross-validation is used to pick the best-performing model. The random forest model appears to perform best for the given data.

Data Processing

The training and testing datasets provided for this analysis come from the data set available here [2].

Exploratory Data Analysis

    trainDS <- read.csv("pml-training.csv", stringsAsFactors = FALSE)
    testDS <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)

The training and testing datasets each have 160 variables. The training dataset has 19622 observations; the testing dataset has 20.

In the training dataset, some columns hold raw measurements such as acceleration, pitch, roll, and yaw from sensor units on the belt, forearm, dumbbell, etc. Other columns hold aggregates and descriptive statistics of those raw measurements, such as min, max, avg, stddev, var, amplitude, skew, and kurtosis. These aggregate columns take on non-NA values for only a very small fraction of the observations (2.07%); the non-NA values occur only when new_window is yes. It appears that these values are computed over the measurements taken during a time window and are missing for all other observations. Because so many observations lack these values, these variables are excluded from the training feature set.

The first 7 variables record identification values such as timestamps, usernames, and other flag values. These variables are also excluded from the training set.
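The two exclusions described above could be sketched as follows. This is a minimal illustration on a toy data frame, not the code used in the report: the column names, the 50% missing-value cutoff (na_threshold), and the number of leading identification columns (n_id_cols, 2 here instead of 7) are all assumptions chosen for the example. The conversion of classe to a factor is also an assumption, added because the files are read with stringsAsFactors = FALSE.

```r
# Toy stand-in for the training data; all column names here are hypothetical.
toy <- data.frame(
  X             = 1:6,                            # identification column
  user_name     = c("a", "a", "b", "b", "c", "c"), # identification column
  roll_belt     = c(1.4, 1.5, 1.3, 1.6, 1.2, 1.1), # raw measurement, always present
  max_roll_belt = c(NA, NA, NA, NA, NA, 2.0),      # window aggregate, mostly NA
  classe        = c("A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

n_id_cols    <- 2    # assumed: leading identification columns (7 in the report)
na_threshold <- 0.5  # assumed cutoff: drop columns that are more than 50% NA

# Fraction of missing values in each column
na_frac <- colMeans(is.na(toy))

# Keep columns that are mostly complete, then drop the leading id columns
keep <- names(toy)[na_frac <= na_threshold]
keep <- setdiff(keep, names(toy)[seq_len(n_id_cols)])
toy2 <- toy[, keep, drop = FALSE]

# Classifiers need a factor response
toy2$classe <- factor(toy2$classe)

print(names(toy2))
```

The same steps applied to trainDS would use n_id_cols <- 7; the appropriate cutoff depends on how the missing values are encoded in the file.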
Subsets of the training and testing datasets are created from the 59 raw measurements plus the target variable classe for the training dataset and the variable problem_id for the test dataset. These datasets are stored in trainDS2 and testDS2.

Analysis

To evaluate the fitness of the models, a 10-fold cross-validation scheme is used. The following learning methods are compared: Random Forest, Support Vector Machines, and Gradient Boosted Regression.

    library(randomForest)  # randomForest()
    library(e1071)         # svm()
    library(gbm)           # gbm(), gbm.perf()

    kFolds <- 10
    rfAcc <- rep(NA, kFolds)
    svmAcc <- rep(NA, kFolds)
    gbmAcc <- rep(NA, kFolds)
    lrAcc <- rep(NA, kFolds)  # allocated but not filled in below
    # Fold size; folds are contiguous blocks of rows in file order
    nC <- floor(nrow(trainDS2)/kFolds)
    for (j in 1:kFolds) {
        # Rows held out for validation in fold j
        minI <- (j - 1) * nC + 1
        maxI <- j * nC
        cvChunkIndex <- minI:maxI
        cvChunk <- trainDS2[cvChunkIndex, ]
        trainChunk <- trainDS2[-cvChunkIndex, ]
        # Fit each model on the remaining rows (classe is expected to be a factor)
        rfM <- randomForest(classe ~ ., data = trainChunk)
        svmM <- svm(classe ~ ., data = trainChunk)
        gbmM <- gbm(classe ~ ., data = trainChunk, cv.folds = 3, distribution = "multinomial")
        # gbm returns per-class scores; take the highest-scoring class for each row
        best.iter <- gbm.perf(gbmM, method = "cv", plot.it = FALSE)
        gbmP <- predict(gbmM, newdata = cvChunk, n.trees = best.iter)
        gbmP <- levels(cvChunk$classe)[sapply(1:nrow(cvChunk), function(i) {
            which.max(gbmP[i, , 1])
        })]
        # Fraction of held-out rows classified correctly
        rfAcc[j] <- sum(predict(rfM, newdata = cvChunk) == cvChunk$classe)/nrow(cvChunk)
        svmAcc[j] <- sum(predict(svmM, newdata = cvChunk) == cvChunk$classe)/nrow(cvChunk)
        gbmAcc[j] <- sum(gbmP == cvChunk$classe)/nrow(cvChunk)
    }

The 10-fold cross-validated accuracies are plotted in the following figure.

The average cross-validated accuracies and average out-of-sample error estimates can be seen in the following table.

    Method  Accuracy (%)  OOSError
    RF      58.80         0.41
    SVM     50.91         0.49
    GBM     21.60         0.78

Based on these results, the random forest method appears best suited to the given data among the methods tried. A random forest model is therefore built using the entire training set and used to predict the activity type for the test set observations.
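Two details of the cross-validation step can be illustrated in base R. The sketch below is not part of the original analysis. First, it shows a randomized fold assignment, a common alternative to holding out contiguous row blocks that may matter if the rows of the file are ordered in some way (an assumption not checked here). Second, it checks that the out-of-sample errors in the table are consistent with one minus the mean accuracy, a definition the report does not state explicitly. The helpers make_folds and oos_error are hypothetical names introduced for illustration.

```r
# Hypothetical helper: assign each of n rows to one of k folds at random,
# so every row is held out exactly once and fold sizes differ by at most one.
make_folds <- function(n, k, seed = 1) {
  set.seed(seed)  # fixed seed so the assignment is reproducible
  sample(rep(seq_len(k), length.out = n))
}

# Assumed definition: out-of-sample error = 1 - mean per-fold accuracy
oos_error <- function(acc) 1 - mean(acc)

fold_id <- make_folds(n = 19622, k = 10)  # 19622 rows, as in the training set
fold_sizes <- table(fold_id)

# Inside a cross-validation loop, fold j would then be selected with
#   cvChunk    <- trainDS2[fold_id == j, ]
#   trainChunk <- trainDS2[fold_id != j, ]

# Consistency check against the table: accuracy in percent -> error as a fraction
acc_pct <- c(RF = 58.80, SVM = 50.91, GBM = 21.60)
err <- round(1 - acc_pct / 100, 2)
print(err)
```

With a vector of per-fold accuracies such as rfAcc, oos_error(rfAcc) would give the corresponding error estimate under the same assumed definition.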
The final model is fit on trainDS2 as follows.

    library(caret)  # trainControl(), train()

    # Tune and fit the random forest, using 3-fold cross-validation for resampling
    tControl <- trainControl(method = "cv", number = 3)
    rfM1 <- train(classe ~ ., data = trainDS2, trControl = tControl, method = "rf")
    # Predict the activity type for the 20 test observations
    rfP1 <- predict(rfM1, newdata = testDS2)

Results

A 10-fold cross-validation approach is used to pick a model. The random forest model is selected based on its higher average accuracy and lower out-of-sample error. The predictions for the test data set, made with the model built on the entire training set, are as follows.

    ProblemId  Prediction
    1          B
    2          A
    3          B
    4          A
    5          A
    6          E
    7          D
    8          B
    9          A
    10         A
    11         B
    12         C
    13         B
    14         A
    15         E
    16         E
    17         A
    18         B
    19         B
    20         B

References

[1]. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 201

[2]. Groupware.les.inf.puc-rio.br, (2014). [online] Available at: http://groupware.les.inf.puc-rio.br/static/WLE/WearableComputing_weight_lifting_exercises_biceps_curl_variations.csv [Accessed 22 Jun. 2014].