Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the dataset website (see the section on the Weight Lifting Exercise Dataset).
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
Using a random forest classifier with 7-fold cross-validation, the optimal model has an accuracy of 0.993 and an OOB error rate of 0.66%. The variable importance plot shows that roll_belt was the most important variable for predicting the classe variable.
Applying our model to the test set, we attain a similar accuracy of 0.993. Applying the model to the 20 test cases in our validation set, we achieve 100% accuracy in predicting the correct classe variable.
# load (and install if missing) the required packages
pacman::p_load(data.table, caret, parallel, doParallel, purrr, visdat, dplyr, printr, kableExtra, corrplot, e1071, randomForest)
# training data
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
training <- fread(url_train,
                  na.strings = c("#DIV/0", "", "NA"),
                  stringsAsFactors = TRUE)
# testing data
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testing <- fread(url_test,
                 na.strings = c("#DIV/0", "", "NA"),
                 stringsAsFactors = TRUE)
# glimpse(training)
Looking at our data (output is too large to show), we see that there are 160 variables available for building our model. Most of these are not useful for prediction, especially the first seven columns, which are just row numbers, usernames, timestamps, and so on.
training <- training[, -c(1:7)]
testing <- testing[, -c(1:7)]
rbind(training = dim(training),
      testing = dim(testing)) %>%
  kbl() %>%
  kable_classic(full_width = F, html_font = "Cambria")
|          |  rows | columns |
|----------|------:|--------:|
| training | 19622 |     153 |
| testing  |    20 |     153 |
Next, to reduce the number of unnecessary variables, we set a threshold for the proportion of NAs a variable may have. I’m going to set the threshold at 70% and use the discard function from the purrr package to drop variables that exceed it. (Another way to do this is with the nearZeroVar function from caret, which finds variables with near-zero variance, or with PCA; a sketch of the nearZeroVar alternative is shown after the table below.)
# remove columns whose percentage of NAs exceeds the given threshold
na_remove_col <- function(data, threshold) {
  data %>%
    discard(~ sum(is.na(.x)) / length(.x) * 100 > threshold)
}
clean_train <- na_remove_col(training, 70)
clean_test <- na_remove_col(testing, 70)
rbind(training = dim(clean_train),
      testing = dim(clean_test)) %>%
  kbl() %>%
  kable_classic(full_width = F, html_font = "Cambria")
|          |  rows | columns |
|----------|------:|--------:|
| training | 19622 |      53 |
| testing  |    20 |      53 |
Now we see that exactly 100 variables were removed after applying the threshold.
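For comparison, the near-zero-variance route mentioned above could look roughly like the sketch below. This is only an illustration of the alternative, not the approach used in the rest of this report, and it may keep or drop a somewhat different set of columns than the NA-threshold filter.

# sketch: drop predictors with near-zero variance using caret::nearZeroVar
nzv_cols <- nearZeroVar(training)                  # indices of near-zero-variance columns
nzv_train <- training[, -nzv_cols, with = FALSE]   # drop them (data.table syntax)
dim(nzv_train)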
The data we have consists of a training set and a validation set (the 20 test cases). The standard procedure is to partition the training set into a train set and a test set, and then apply the final model to the validation set. The createDataPartition function will be used to split the data.
set.seed(2021) # for reproducibility
inTrain <- createDataPartition(clean_train$classe, p=0.7, list=FALSE)
train <- clean_train[inTrain, ]
test <- clean_train[-inTrain, ]
Now we have our training and test data, which are 70% and 30% of our initial training data respectively.
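As a quick sanity check (not part of the original analysis), we can confirm the split proportion and that the classe distribution looks similar in both partitions:

# rough check of the 70/30 split and class balance
nrow(train) / nrow(clean_train)            # should be close to 0.7
round(prop.table(table(train$classe)), 3)  # class proportions in the train set
round(prop.table(table(test$classe)), 3)   # class proportions in the test set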
By doing a bit of EDA on our training data, we can check whether any variables are highly correlated, using the corrplot library.
corr_data <- select_if(train, is.numeric)
corrplot(
  cor(corr_data),
  method = "color",
  tl.pos = "n",
  insig = "blank"
)
Our correlation plot shows that most of our variables are not strongly correlated, with the exception of the first few columns at the upper left and some columns in the middle. Highly correlated predictors can cause issues when building models such as a random forest, which is the model I will be using for this prediction.
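If we wanted to act on that observation, caret’s findCorrelation can flag predictors above a chosen correlation cutoff. The sketch below is only an option we could take; the model later in this report is trained on all remaining predictors, and the 0.9 cutoff is an assumption.

# sketch: flag predictors with pairwise correlation above 0.9
high_corr <- findCorrelation(cor(corr_data), cutoff = 0.9)
names(corr_data)[high_corr]   # candidate columns to drop
# train_reduced <- select(train, -all_of(names(corr_data)[high_corr]))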
To predict the classe variable in our data, which is a factor variable, we need a classifier. I’m going to use a random forest model because it’s a flexible and easy-to-use ensemble learning algorithm that generally gives highly accurate predictions.
That said, building random forest models can be computationally expensive, so we register for parallel processing with the parallel and doParallel packages.
cluster <- makeCluster(detectCores() - 1) # leave one core free for the OS
registerDoParallel(cluster)
As mentioned before, we use cross-validation: the trainControl function randomly splits the training set into train and hold-out folds based on the given number of folds (k), in this case 7. This means our model will be trained and evaluated 7 times on the cross-validated data. We also set allowParallel to TRUE to allow parallel processing.
Using caret, model training is done with the train function, and our method is “rf”, which stands for random forest. (We could also pre-process the predictors, for example with PCA, via the preProcess argument, but no pre-processing is applied here.)
set.seed(2021)
fitControl <- trainControl(method = "cv",
                           number = 7,
                           allowParallel = TRUE)
rf.fit <- train(
  classe ~ .,
  method = "rf",
  data = train,
  trControl = fitControl
)
# stop cluster
stopCluster(cluster)
registerDoSEQ()
# save model into an rds file to save time
saveRDS(rf.fit,file="rfmodel.rds")
After training the model, we stop the cluster and return to sequential processing, then save the model to an RDS file to save time; we can load it later for downstream analysis.
Now we measure our model’s performance with statistics like accuracy and kappa, along with some plots.
model.rf <- readRDS(file = "rfmodel.rds")
model.rf
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (7 fold)
## Summary of sample sizes: 11774, 11774, 11776, 11775, 11774, 11776, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9929389 0.9910672
## 27 0.9921382 0.9900546
## 52 0.9863865 0.9827784
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
From the results, we see that the optimal model (mtry = 2) has an accuracy of about 0.993.
model.rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.66%
## Confusion matrix:
## A B C D E class.error
## A 3903 3 0 0 0 0.0007680492
## B 16 2638 4 0 0 0.0075244545
## C 0 21 2372 3 0 0.0100166945
## D 0 0 36 2215 1 0.0164298401
## E 0 0 3 4 2518 0.0027722772
The OOB estimate is our out-of-sample error rate, which is 0.66%. This means our accuracy is high and acceptable for our prediction.
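To see where that 0.66% comes from, we can recompute it from the final model’s confusion matrix (a quick check, not part of the original output):

# out-of-bag error = 1 - (correctly classified OOB samples / total samples)
oob_conf <- model.rf$finalModel$confusion[, 1:5] # drop the class.error column
1 - sum(diag(oob_conf)) / sum(oob_conf)          # roughly 0.0066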
Below is a plot of the prediction error for each classe as the number of trees increases. The OOB error flattens out at around 150 trees, so we could train with ntree = 150 if we decide to fine-tune the model further (a sketch follows the plot).
plot(model.rf$finalModel)
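A minimal sketch of that refit, assuming we keep the same fitControl; note that ntree is passed through train to randomForest rather than set in trainControl:

# sketch: refit with fewer trees; extra arguments are forwarded to randomForest()
rf.fit.150 <- train(
  classe ~ .,
  method = "rf",
  data = train,
  trControl = fitControl,
  ntree = 150
)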
importance <- varImp(model.rf, scale = FALSE)
plot(importance, top=10)
The varImp function tells us that, in our model, the most important feature for predicting the classe variable is roll_belt.
We can now apply our trained model to the test set and observe its accuracy.
pred.rf <- predict(model.rf, test)
confM <- confusionMatrix(test$classe, pred.rf)
confM$table %>%
  kbl() %>%
  kable_paper("hover", full_width = F)
|   |    A |    B |    C |   D |    E |
|---|-----:|-----:|-----:|----:|-----:|
| A | 1673 |    1 |    0 |   0 |    0 |
| B |    9 | 1130 |    0 |   0 |    0 |
| C |    0 |    8 | 1017 |   1 |    0 |
| D |    0 |    0 |   15 | 948 |    1 |
| E |    0 |    0 |    3 |   2 | 1077 |
confM$overall["Accuracy"]
## Accuracy
## 0.9932031
We obtain an accuracy of about 0.993, which means only around 1% of the classe values in the test set were misclassified.
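That misclassification rate can be read directly off the confusion-matrix object (a quick check, not part of the original output):

# test-set misclassification rate = 1 - accuracy
1 - confM$overall["Accuracy"]   # roughly 0.0068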
Finally, we apply our model to the 20 test cases given in the validation data.
final.pred.rf <- predict(model.rf, clean_test)
summary(final.pred.rf)
## A B C D E
## 7 8 1 1 3
final.pred.rf
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] randomForest_4.6-14 e1071_1.7-4 corrplot_0.84
## [4] kableExtra_1.3.1 printr_0.1 dplyr_1.0.2
## [7] visdat_0.5.3 purrr_0.3.4 doParallel_1.0.16
## [10] iterators_1.0.13 foreach_1.5.1 caret_6.0-86
## [13] ggplot2_3.3.2 lattice_0.20-41 data.table_1.13.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 lubridate_1.7.9 class_7.3-17
## [4] digest_0.6.27 ipred_0.9-9 R6_2.5.0
## [7] plyr_1.8.6 stats4_4.0.2 evaluate_0.14
## [10] highr_0.8 httr_1.4.2 pillar_1.4.7
## [13] rlang_0.4.10 curl_4.3 rstudioapi_0.13
## [16] rpart_4.1-15 Matrix_1.2-18 rmarkdown_2.5
## [19] splines_4.0.2 webshot_0.5.2 gower_0.2.2
## [22] stringr_1.4.0 munsell_0.5.0 compiler_4.0.2
## [25] xfun_0.19 pkgconfig_2.0.3 htmltools_0.5.0
## [28] nnet_7.3-14 tidyselect_1.1.0 tibble_3.0.4
## [31] prodlim_2019.11.13 codetools_0.2-16 viridisLite_0.3.0
## [34] crayon_1.3.4 withr_2.3.0 MASS_7.3-52
## [37] recipes_0.1.15 ModelMetrics_1.2.2.2 grid_4.0.2
## [40] nlme_3.1-149 gtable_0.3.0 lifecycle_0.2.0
## [43] pacman_0.5.1 magrittr_2.0.1 pROC_1.17.0.1
## [46] scales_1.1.1 stringi_1.5.3 reshape2_1.4.4
## [49] timeDate_3043.102 xml2_1.3.2 ellipsis_0.3.1
## [52] generics_0.1.0 vctrs_0.3.5 lava_1.6.8.1
## [55] tools_4.0.2 glue_1.4.2 survival_3.2-3
## [58] yaml_2.2.1 colorspace_2.0-0 rvest_0.3.6
## [61] knitr_1.30
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. In: Proceedings of the 21st Brazilian Symposium on Artificial Intelligence (SBIA 2012), Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.