Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the dataset website (see the section on the Weight Lifting Exercise Dataset).
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
Using a random forest classifier with 7-fold cross-validation, the optimal model has an accuracy of 0.993 and an OOB error rate of 0.66%. The variable importance plot shows that roll_belt was the most important variable for predicting the classe variable.
Applying our model to the test set, we attain a similar accuracy of 0.993. Applying the model to the 20 test cases in our validation set, we achieve 100% accuracy in predicting the correct classe variable.
# load (and install if missing) the required packages
pacman::p_load(data.table, caret, parallel, doParallel, purrr, visdat, dplyr, printr, kableExtra, corrplot, e1071, randomForest)
# training data
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
training <- fread(url_train,
                  na.strings = c("#DIV/0", "", "NA"),
                  stringsAsFactors = TRUE)
# testing data
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testing <- fread(url_test,
                 na.strings = c("#DIV/0", "", "NA"),
                 stringsAsFactors = TRUE)
# glimpse(training)
Looking at our data (output is too large to show), we see that there are 160 variables available for building our model. Most of these are not useful for prediction, especially the first seven columns, which are just row numbers, usernames, timestamps, and so on.
training <- training[, -c(1:7)]
testing <- testing[, -c(1:7)]
rbind(training = dim(training),
      testing = dim(testing)) %>%
  kbl() %>%
  kable_classic(full_width = F, html_font = "Cambria")
|          |  rows | columns |
|----------|------:|--------:|
| training | 19622 |     153 |
| testing  |    20 |     153 |
Next, to reduce the number of unnecessary variables, we set a threshold for the proportion of NAs a variable may have. I’m going to set the threshold at 70% and use the discard function from the purrr package to drop variables that exceed it. (Another way to do this is with the nearZeroVar function from caret, which finds variables with near-zero variance, or with PCA; a sketch of the nearZeroVar alternative is shown after the table below.)
# remove columns whose percentage of NAs exceeds the given threshold
na_remove_col <- function(data, threshold) {
  data %>%
    discard(~ sum(is.na(.x)) / length(.x) * 100 > threshold)
}
clean_train <- na_remove_col(training, 70)
clean_test <- na_remove_col(testing, 70)
rbind(training = dim(clean_train),
      testing = dim(clean_test)) %>%
  kbl() %>%
  kable_classic(full_width = F, html_font = "Cambria")
|          |  rows | columns |
|----------|------:|--------:|
| training | 19622 |      53 |
| testing  |    20 |      53 |
Now we see that exactly 100 variables were removed after applying the threshold.
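For comparison, the near-zero-variance route mentioned above could look roughly like the sketch below. This is only an illustration of the alternative, not the approach used in the rest of this report, and it may keep or drop a somewhat different set of columns than the NA-threshold filter.

# sketch: drop predictors with near-zero variance using caret::nearZeroVar
nzv_cols <- nearZeroVar(training)                  # indices of near-zero-variance columns
nzv_train <- training[, -nzv_cols, with = FALSE]   # drop them (data.table syntax)
dim(nzv_train)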
The data we have consists of a training set and a validation set (the 20 test cases). The standard procedure is to partition the training set into a train set and a test set, and then apply the final model to the validation set. The createDataPartition function will be used to split the data.
set.seed(2021) # for reproducibility
inTrain <- createDataPartition(clean_train$classe, p=0.7, list=FALSE)
train <- clean_train[inTrain, ]
test <- clean_train[-inTrain, ]
Now we have our training and test data, which are 70% and 30% of our initial training data respectively.
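As a quick sanity check (not part of the original analysis), we can confirm the split proportion and that the classe distribution looks similar in both partitions:

# rough check of the 70/30 split and class balance
nrow(train) / nrow(clean_train)            # should be close to 0.7
round(prop.table(table(train$classe)), 3)  # class proportions in the train set
round(prop.table(table(test$classe)), 3)   # class proportions in the test set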
By doing a bit of EDA on our training data, we can check whether any variables are highly correlated, using the corrplot library.
corr_data <- select_if(train, is.numeric)
corrplot(
  cor(corr_data),
  method = "color",
  tl.pos = "n",
  insig = "blank"
)
Our correlation plot shows that most of our variables are not strongly correlated, with the exception of the first few columns at the upper left and some columns in the middle. Highly correlated predictors can cause issues when building models such as a random forest, which is the model I will be using for this prediction.
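If we wanted to act on that observation, caret’s findCorrelation can flag predictors above a chosen correlation cutoff. The sketch below is only an option we could take; the model later in this report is trained on all remaining predictors, and the 0.9 cutoff is an assumption.

# sketch: flag predictors with pairwise correlation above 0.9
high_corr <- findCorrelation(cor(corr_data), cutoff = 0.9)
names(corr_data)[high_corr]   # candidate columns to drop
# train_reduced <- select(train, -all_of(names(corr_data)[high_corr]))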
To predict the classe variable in our data, which is a factor variable, we need a classifier. I’m going to use a random forest model because it’s a flexible and easy-to-use ensemble learning algorithm that generally gives highly accurate predictions.
That said, building random forest models can be computationally expensive, so we register for parallel processing with the parallel and doParallel packages.
cluster <- makeCluster(detectCores() - 1) # leave one core free for the OS
registerDoParallel(cluster)
As mentioned before, we use cross-validation: the trainControl function randomly splits the training set into train and hold-out folds based on the given number of folds (k), in this case 7. This means our model will be trained and evaluated 7 times on the cross-validated data. We also set allowParallel to TRUE to allow parallel processing.
Using caret, model training is done with the train function, and our method is “rf”, which stands for random forest. (We could also pre-process the predictors, for example with PCA, via the preProcess argument, but no pre-processing is applied here.)
set.seed(2021)
fitControl <- trainControl(method = "cv",
                           number = 7,
                           allowParallel = TRUE)
rf.fit <- train(
  classe ~ .,
  method = "rf",
  data = train,
  trControl = fitControl
)
# stop cluster
stopCluster(cluster)
registerDoSEQ()
# save model into an rds file to save time
saveRDS(rf.fit,file="rfmodel.rds")
After training the model, we stop the cluster and return to sequential processing, then save the model to an RDS file to save time; we can load it later for downstream analysis.
Now we measure our model’s performance with statistics like accuracy and kappa, along with some plots.
model.rf <- readRDS(file = "rfmodel.rds")
model.rf
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (7 fold)
## Summary of sample sizes: 11774, 11774, 11776, 11775, 11774, 11776, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9929389 0.9910672
## 27 0.9921382 0.9900546
## 52 0.9863865 0.9827784
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
From the results, we see that the optimal model (mtry = 2) has an accuracy of about 0.993.
model.rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.66%
## Confusion matrix:
## A B C D E class.error
## A 3903 3 0 0 0 0.0007680492
## B 16 2638 4 0 0 0.0075244545
## C 0 21 2372 3 0 0.0100166945
## D 0 0 36 2215 1 0.0164298401
## E 0 0 3 4 2518 0.0027722772
The OOB estimate is our out-of-sample error rate, which is 0.66%. This means our accuracy is high and acceptable for our prediction.
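To see where that 0.66% comes from, we can recompute it from the final model’s confusion matrix (a quick check, not part of the original output):

# out-of-bag error = 1 - (correctly classified OOB samples / total samples)
oob_conf <- model.rf$finalModel$confusion[, 1:5] # drop the class.error column
1 - sum(diag(oob_conf)) / sum(oob_conf)          # roughly 0.0066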
Below is a plot of the prediction error for each classe as the number of trees increases. The OOB error flattens out at around 150 trees, so we could train with ntree = 150 if we decide to fine-tune the model further (a sketch follows the plot).
plot(model.rf$finalModel)
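A minimal sketch of that refit, assuming we keep the same fitControl; note that ntree is passed through train to randomForest rather than set in trainControl:

# sketch: refit with fewer trees; extra arguments are forwarded to randomForest()
rf.fit.150 <- train(
  classe ~ .,
  method = "rf",
  data = train,
  trControl = fitControl,
  ntree = 150
)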
importance <- varImp(model.rf, scale = FALSE)
plot(importance, top=10)
The varImp function tells us that, in our model, the most important feature for predicting the classe variable is roll_belt.
We can now apply our trained model to the test set and observe its accuracy.
pred.rf <- predict(model.rf, test)
confM <- confusionMatrix(test$classe, pred.rf)
confM$table %>%
  kbl() %>%
  kable_paper("hover", full_width = F)
|   |    A |    B |    C |   D |    E |
|---|-----:|-----:|-----:|----:|-----:|
| A | 1673 |    1 |    0 |   0 |    0 |
| B |    9 | 1130 |    0 |   0 |    0 |
| C |    0 |    8 | 1017 |   1 |    0 |
| D |    0 |    0 |   15 | 948 |    1 |
| E |    0 |    0 |    3 |   2 | 1077 |
confM$overall["Accuracy"]
## Accuracy
## 0.9932031
We obtain an accuracy of about 0.993, which means only around 1% of the classe values in the test set were misclassified.
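That misclassification rate can be read directly off the confusion-matrix object (a quick check, not part of the original output):

# test-set misclassification rate = 1 - accuracy
1 - confM$overall["Accuracy"]   # roughly 0.0068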
Finally, we apply our model to the 20 test cases given in the validation data.
final.pred.rf <- predict(model.rf, clean_test)
summary(final.pred.rf)
## A B C D E
## 7 8 1 1 3
final.pred.rf
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] randomForest_4.6-14 e1071_1.7-4 corrplot_0.84
## [4] kableExtra_1.3.1 printr_0.1 dplyr_1.0.2
## [7] visdat_0.5.3 purrr_0.3.4 doParallel_1.0.16
## [10] iterators_1.0.13 foreach_1.5.1 caret_6.0-86
## [13] ggplot2_3.3.2 lattice_0.20-41 data.table_1.13.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 lubridate_1.7.9 class_7.3-17
## [4] digest_0.6.27 ipred_0.9-9 R6_2.5.0
## [7] plyr_1.8.6 stats4_4.0.2 evaluate_0.14
## [10] highr_0.8 httr_1.4.2 pillar_1.4.7
## [13] rlang_0.4.10 curl_4.3 rstudioapi_0.13
## [16] rpart_4.1-15 Matrix_1.2-18 rmarkdown_2.5
## [19] splines_4.0.2 webshot_0.5.2 gower_0.2.2
## [22] stringr_1.4.0 munsell_0.5.0 compiler_4.0.2
## [25] xfun_0.19 pkgconfig_2.0.3 htmltools_0.5.0
## [28] nnet_7.3-14 tidyselect_1.1.0 tibble_3.0.4
## [31] prodlim_2019.11.13 codetools_0.2-16 viridisLite_0.3.0
## [34] crayon_1.3.4 withr_2.3.0 MASS_7.3-52
## [37] recipes_0.1.15 ModelMetrics_1.2.2.2 grid_4.0.2
## [40] nlme_3.1-149 gtable_0.3.0 lifecycle_0.2.0
## [43] pacman_0.5.1 magrittr_2.0.1 pROC_1.17.0.1
## [46] scales_1.1.1 stringi_1.5.3 reshape2_1.4.4
## [49] timeDate_3043.102 xml2_1.3.2 ellipsis_0.3.1
## [52] generics_0.1.0 vctrs_0.3.5 lava_1.6.8.1
## [55] tools_4.0.2 glue_1.4.2 survival_3.2-3
## [58] yaml_2.2.1 colorspace_2.0-0 rvest_0.3.6
## [61] knitr_1.30
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. In: Proceedings of the 21st Brazilian Symposium on Artificial Intelligence (SBIA 2012), Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.