library(data.table)
library(tidyr)
library(dplyr)
library(tibble)   # needed for rownames_to_column() used below
library(ggplot2)
library(caret)
library(rattle)
library(tictoc)
library(AppliedPredictiveModeling)
library(kableExtra)

Synopsis

The data for this project was sourced from the Pontifical Catholic University of Rio de Janeiro (PUC-Rio).\(^1\)

This project employs the caret package in R to build machine learning models that predict which exercise an individual is performing from wearable sensor data. Sensor data were collected from 4 healthy subjects over 8 hours of activity, categorized into 5 classes (sitting-down, standing-up, standing, walking, and sitting).

The repository for this project can be found on my GitHub.

# Download the training and testing files (once) and read them in,
# treating empty strings, 'NA', and '#DIV/0!' entries as missing
if(!file.exists('./data')){dir.create('./data')}
fileUrl1 <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
fileUrl2 <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
if(!file.exists('./data/pml-training.csv')){download.file(fileUrl1, './data/pml-training.csv')}
if(!file.exists('./data/pml-testing.csv')){download.file(fileUrl2, './data/pml-testing.csv')}
pml_training <- read.csv('./data/pml-training.csv', na.strings=c('', 'NA', '#DIV/0!'))
pml_testing <- read.csv('./data/pml-testing.csv', na.strings = c('', 'NA', '#DIV/0!'))
print(names(pml_training), quote = F)
##   [1] X                        user_name                raw_timestamp_part_1    
##   [4] raw_timestamp_part_2     cvtd_timestamp           new_window              
##   [7] num_window               roll_belt                pitch_belt              
##  [10] yaw_belt                 total_accel_belt         kurtosis_roll_belt      
##  [13] kurtosis_picth_belt      kurtosis_yaw_belt        skewness_roll_belt      
##  [16] skewness_roll_belt.1     skewness_yaw_belt        max_roll_belt           
##  [19] max_picth_belt           max_yaw_belt             min_roll_belt           
##  [22] min_pitch_belt           min_yaw_belt             amplitude_roll_belt     
##  [25] amplitude_pitch_belt     amplitude_yaw_belt       var_total_accel_belt    
##  [28] avg_roll_belt            stddev_roll_belt         var_roll_belt           
##  [31] avg_pitch_belt           stddev_pitch_belt        var_pitch_belt          
##  [34] avg_yaw_belt             stddev_yaw_belt          var_yaw_belt            
##  [37] gyros_belt_x             gyros_belt_y             gyros_belt_z            
##  [40] accel_belt_x             accel_belt_y             accel_belt_z            
##  [43] magnet_belt_x            magnet_belt_y            magnet_belt_z           
##  [46] roll_arm                 pitch_arm                yaw_arm                 
##  [49] total_accel_arm          var_accel_arm            avg_roll_arm            
##  [52] stddev_roll_arm          var_roll_arm             avg_pitch_arm           
##  [55] stddev_pitch_arm         var_pitch_arm            avg_yaw_arm             
##  [58] stddev_yaw_arm           var_yaw_arm              gyros_arm_x             
##  [61] gyros_arm_y              gyros_arm_z              accel_arm_x             
##  [64] accel_arm_y              accel_arm_z              magnet_arm_x            
##  [67] magnet_arm_y             magnet_arm_z             kurtosis_roll_arm       
##  [70] kurtosis_picth_arm       kurtosis_yaw_arm         skewness_roll_arm       
##  [73] skewness_pitch_arm       skewness_yaw_arm         max_roll_arm            
##  [76] max_picth_arm            max_yaw_arm              min_roll_arm            
##  [79] min_pitch_arm            min_yaw_arm              amplitude_roll_arm      
##  [82] amplitude_pitch_arm      amplitude_yaw_arm        roll_dumbbell           
##  [85] pitch_dumbbell           yaw_dumbbell             kurtosis_roll_dumbbell  
##  [88] kurtosis_picth_dumbbell  kurtosis_yaw_dumbbell    skewness_roll_dumbbell  
##  [91] skewness_pitch_dumbbell  skewness_yaw_dumbbell    max_roll_dumbbell       
##  [94] max_picth_dumbbell       max_yaw_dumbbell         min_roll_dumbbell       
##  [97] min_pitch_dumbbell       min_yaw_dumbbell         amplitude_roll_dumbbell 
## [100] amplitude_pitch_dumbbell amplitude_yaw_dumbbell   total_accel_dumbbell    
## [103] var_accel_dumbbell       avg_roll_dumbbell        stddev_roll_dumbbell    
## [106] var_roll_dumbbell        avg_pitch_dumbbell       stddev_pitch_dumbbell   
## [109] var_pitch_dumbbell       avg_yaw_dumbbell         stddev_yaw_dumbbell     
## [112] var_yaw_dumbbell         gyros_dumbbell_x         gyros_dumbbell_y        
## [115] gyros_dumbbell_z         accel_dumbbell_x         accel_dumbbell_y        
## [118] accel_dumbbell_z         magnet_dumbbell_x        magnet_dumbbell_y       
## [121] magnet_dumbbell_z        roll_forearm             pitch_forearm           
## [124] yaw_forearm              kurtosis_roll_forearm    kurtosis_picth_forearm  
## [127] kurtosis_yaw_forearm     skewness_roll_forearm    skewness_pitch_forearm  
## [130] skewness_yaw_forearm     max_roll_forearm         max_picth_forearm       
## [133] max_yaw_forearm          min_roll_forearm         min_pitch_forearm       
## [136] min_yaw_forearm          amplitude_roll_forearm   amplitude_pitch_forearm 
## [139] amplitude_yaw_forearm    total_accel_forearm      var_accel_forearm       
## [142] avg_roll_forearm         stddev_roll_forearm      var_roll_forearm        
## [145] avg_pitch_forearm        stddev_pitch_forearm     var_pitch_forearm       
## [148] avg_yaw_forearm          stddev_yaw_forearm       var_yaw_forearm         
## [151] gyros_forearm_x          gyros_forearm_y          gyros_forearm_z         
## [154] accel_forearm_x          accel_forearm_y          accel_forearm_z         
## [157] magnet_forearm_x         magnet_forearm_y         magnet_forearm_z        
## [160] classe

The data contains 160 variables. One of these, $classe, is the outcome; the rest may potentially serve as predictors. First I check the integrity of the data; then it can be prepared for model creation.

Preparing the Data

Check integrity of data

Tabulating the percentage of usable (non-missing) data in each variable quickly shows which columns are useful and which are not.

# Percent of non-NA entries per column, rounded, then tallied
useableData <- table(round(colMeans(!is.na(pml_training))*100, 1))
as.data.table(useableData) %>%
        kbl(col.names = c('% Useable Data', '# of Variables')) %>% 
        kable_styling(bootstrap_options = c('striped',
                                            'hover', 
                                            'condensed', 
                                            'responsive'), 
                      full_width = F)

Table 1. Table showing the count of variables containing the shown amount of useable data.

% Useable Data    # of Variables
             0                 6
           1.6                 7
           1.7                 4
           1.9                 2
           2                  12
           2.1                69
           100                60

Many variables have < 2% useable data. Additionally, some variables are not meaningful to the exercise being performed (e.g., user_name, the timestamps, and the window columns). These are removed before any model training is performed.

Also, the outcome variable $classe is coerced to a factor.

Subsetting Data and Partitioning Training/Testing Sets

# Coerce the outcome to a factor
pml_training$classe <- factor(pml_training$classe)
# Flag columns with < 5% usable data
badCols <- as.vector(colMeans(!is.na(pml_training))*100 < 5)
pml_training_data <- pml_training[,!badCols]
# Drop the identifier, timestamp, and window columns (1:7)
pml_training_data <- pml_training_data[,-c(1:7)]
pml_testing_data <- pml_testing[,!badCols]
pml_testing_data <- pml_testing_data[,-c(1:7)]
# 60/40 train/test partition, stratified on classe
set.seed(13415)
inTrain = createDataPartition(pml_training_data$classe, 
                              p = 0.6, list = F)
training <- pml_training_data[inTrain,]
testing <- pml_training_data[-inTrain,]

Comparing the variables with summary() showed that their ranges and distributions differ greatly. However, feature scaling is not necessary for tree-based models, which are insensitive to monotonic transformations of individual predictors; scaling would only reduce the readability of the data and models.
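For contrast, if a scale-sensitive model such as k-NN were used instead, caret can standardize the predictors inside train(). A minimal sketch (illustrative only; no k-NN model is fit in this report):

# Illustrative only: the tree models below do not need scaling.
# preProcess centers and scales each predictor before the fit
modFitKNN <- train(classe ~ .,
                   method = 'knn',
                   preProcess = c('center', 'scale'),
                   data = training,
                   trControl = trainControl(method = 'cv', number = 5))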

Testing Different Models

Since I am predicting a categorical outcome, I used tree-based predictors, and the performance metric was accuracy (the fraction of correct predictions) rather than RMSE or \(R^2\). Kappa (a measure of concordance that corrects for chance agreement) is also reported, but the class distribution is not skewed enough for it to tell a different story than accuracy:

prop.table(table(training$classe))
## 
##         A         B         C         D         E 
## 0.2843071 0.1935292 0.1744226 0.1638927 0.1838485

For model creation I used the caret package because of its uniform function syntax across algorithms. (Week 3 of the JHU Data Science specialization on Coursera covers decision tree models, and I chose to use the models described there.)

I tested four different models:

  1. Decision Tree
  2. Bootstrap Aggregated Tree (bagging)
  3. Random Forest
  4. Boosted Model

Cross validation can easily be performed with each algorithm using the trainControl() function. For this dataset I used k-fold cross validation: the observations are split into k folds, and each fold is held out in turn while the model is trained on the remaining k - 1 folds. I chose to perform 5-fold cross validation here.

trainC <- trainControl(method = 'cv', number = 5)
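To see what this splitting looks like, caret's createFolds() can generate the same style of stratified folds explicitly (train() builds its folds internally; this chunk is illustrative only):

set.seed(13415)
# Each element holds the indices of one held-out fold; fold sizes are
# roughly equal and class proportions are preserved within each fold
folds <- createFolds(training$classe, k = 5)
sapply(folds, length)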

For each model, I trained on the training partition, predicted outcomes on the testing partition, and analyzed the out of sample accuracy of those predictions with confusionMatrix(). I also measured each model's training time with Sys.time().
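This pattern repeats for each model below; a hypothetical helper (fitAndEvaluate is my own name and is not used in the sections that follow) could wrap it:

# Hypothetical wrapper for the repeated train/predict/evaluate pattern;
# '...' forwards extra arguments to the underlying fitting function
fitAndEvaluate <- function(method, ...) {
        start <- Sys.time()
        fit <- train(classe ~ ., method = method, data = training,
                     trControl = trainC, ...)
        elapsed <- Sys.time() - start
        pred <- predict(fit, newdata = testing)
        list(fit = fit, time = elapsed,
             cm = confusionMatrix(pred, testing$classe))
}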

1. Decision Tree

Decision trees estimate a probability per node \(m\) for each classification \(k\):

\[ \hat{P}_{mk} = \frac{1}{N_m}\sum_{i:\,x_i\in R_m} 1(y_i = k) \]

  • Here node \(m\) represents region \(R_m\), with \(N_m\) observations.
  • This is the proportion of observations in leaf \(m\) that belong to class \(k\) - the count of class \(k\) in the leaf divided by \(N_m\).
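For example, a leaf \(m\) containing \(N_m = 10\) training observations, 7 of which are class A, gives \(\hat{P}_{mA} = 7/10 = 0.7\); the leaf predicts its majority class, A.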
set.seed(13415)
start <- Sys.time()
modFitTree <- train(classe ~ ., 
                    method = 'rpart', 
                    data = training, 
                    trControl = trainC)
Treetime <- Sys.time() - start
predTree <- predict(modFitTree, newdata = testing)
cmTree <- confusionMatrix(predTree, factor(testing$classe))
cmTree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2038  639  659  574  207
##          B   42  517   47  239  206
##          C  147  362  662  473  372
##          D    0    0    0    0    0
##          E    5    0    0    0  657
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4938          
##                  95% CI : (0.4826, 0.5049)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3378          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9131  0.34058  0.48392   0.0000  0.45562
## Specificity            0.6297  0.91561  0.79098   1.0000  0.99922
## Pos Pred Value         0.4950  0.49191  0.32837      NaN  0.99245
## Neg Pred Value         0.9480  0.85269  0.87890   0.8361  0.89073
## Prevalence             0.2845  0.19347  0.17436   0.1639  0.18379
## Detection Rate         0.2598  0.06589  0.08437   0.0000  0.08374
## Detection Prevalence   0.5247  0.13395  0.25695   0.0000  0.08437
## Balanced Accuracy      0.7714  0.62810  0.63745   0.5000  0.72742
Treetime
## Time difference of 3.325004 secs

It is no surprise that the out of sample accuracy is so low (49.4%): the fitted tree never predicts class 'D' (its row of the confusion matrix is all zeros).
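The rattle package loaded earlier can draw the fitted tree, which makes the absence of a 'D' leaf easy to see (plot not shown here):

# Visualize the single decision tree fit by rpart
fancyRpartPlot(modFitTree$finalModel)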

2. Bootstrap Aggregated Tree (bagging)

This model uses the bootstrap principle to build 25 trees, each on a bootstrap resample of the training data, and then aggregates their votes into a final model that should be more accurate than any single tree.

set.seed(13415)
start <- Sys.time()
modFitTB <- train(classe ~ ., 
                  method = 'treebag', 
                  data = training, 
                  trControl = trainC)
TBtime <- Sys.time() - start
predTB <- predict(modFitTB, newdata = testing)
cmTB <- confusionMatrix(predTB, factor(testing$classe))
cmTB
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2218   29    1    1    1
##          B    6 1468   16    5    8
##          C    8   12 1337   31    6
##          D    0    8   14 1246    0
##          E    0    1    0    3 1427
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9809          
##                  95% CI : (0.9776, 0.9838)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9758          
##                                           
##  Mcnemar's Test P-Value : 2.474e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9937   0.9671   0.9773   0.9689   0.9896
## Specificity            0.9943   0.9945   0.9912   0.9966   0.9994
## Pos Pred Value         0.9858   0.9767   0.9591   0.9826   0.9972
## Neg Pred Value         0.9975   0.9921   0.9952   0.9939   0.9977
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2827   0.1871   0.1704   0.1588   0.1819
## Detection Prevalence   0.2868   0.1916   0.1777   0.1616   0.1824
## Balanced Accuracy      0.9940   0.9808   0.9843   0.9828   0.9945
TBtime
## Time difference of 33.166 secs

The ‘D’ outcome is now being predicted, and the out of sample accuracy has increased to 98.1%, at the cost of roughly 30 additional seconds of training time.
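caret's 'treebag' method wraps ipred's bagging(), so the number of bootstrap trees can be raised from the default of 25 through the nbagg argument. A hedged sketch (modFitTB50 is illustrative and was not run for this report):

# Illustrative: more bagged trees, at proportionally higher training cost
modFitTB50 <- train(classe ~ ., method = 'treebag', nbagg = 50,
                    data = training, trControl = trainC)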

3. Random Forest

Random forest not only bootstraps the observations used to grow each tree, but also samples a random subset of the predictors to consider at each split (the mtry parameter).

\[ p(c \mid v) = \frac{1}{T}\sum_{t=1}^{T} p_t(c \mid v) \]

\(t\) = tree, \(T\) = number of trees, \(c\) = outcome, \(v\) = observation

  • \(p_t(c \mid v)\) is the probability tree \(t\) assigns to outcome \(c\) given observation \(v\).
  • This probability is averaged over the total number of trees \(T\).
set.seed(13415)
start <- Sys.time()
modFitRF <- train(classe ~ .,
                  method = 'rf',
                  # NB: randomForest's argument is spelled 'ntree'; the
                  # misspelled 'ntrees' appears to be silently ignored, so
                  # the default of 500 trees was grown (hence the run time)
                  ntrees = 25,
                  data = training,
                  trControl = trainC)
RFtime <- Sys.time() - start
predRF <- predict(modFitRF, newdata = testing)
cmRF <- confusionMatrix(predRF, factor(testing$classe))
cmRF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2226   18    0    0    0
##          B    4 1489   21    0    0
##          C    2   11 1343   29    1
##          D    0    0    4 1255    1
##          E    0    0    0    2 1440
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9881          
##                  95% CI : (0.9855, 0.9904)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.985           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9973   0.9809   0.9817   0.9759   0.9986
## Specificity            0.9968   0.9960   0.9934   0.9992   0.9997
## Pos Pred Value         0.9920   0.9835   0.9690   0.9960   0.9986
## Neg Pred Value         0.9989   0.9954   0.9961   0.9953   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2837   0.1898   0.1712   0.1600   0.1835
## Detection Prevalence   0.2860   0.1930   0.1767   0.1606   0.1838
## Balanced Accuracy      0.9971   0.9885   0.9875   0.9876   0.9992
RFtime
## Time difference of 4.8081 mins

The random forest model provides a further increase in out of sample accuracy (98.8%) with a time increase of ~4 minutes.
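By default caret tunes mtry (the number of predictors sampled at each split) over a small set of values; an explicit grid can also be supplied. A sketch with values I chose for illustration (not run for this report):

# Hypothetical explicit mtry grid; mtry = 52 would consider every predictor
rfGrid <- expand.grid(mtry = c(2, 7, 27, 52))
modFitRF2 <- train(classe ~ ., method = 'rf', ntree = 25,
                   tuneGrid = rfGrid, data = training, trControl = trainC)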

4. Boosting with Trees (gbm)

Boosting with trees is similar to bagging in that a large number of weak classifiers are aggregated into a stronger predictor. However, boosting adjusts the weight given to each observation based on whether the previous classifiers got it right:

\[ f(x) = \operatorname{sgn}\left(\sum_{t = 1}^{T}\alpha_t h_t(x)\right) \]

  • If an observation was previously classified incorrectly, its weight is increased for the next round, while \(\alpha_t\) weights each classifier \(h_t\) by its accuracy. (This is the AdaBoost formulation; gbm implements gradient boosting, but the idea of concentrating on hard cases carries over.)
  • The default number of trees for gbm is 100.
set.seed(13415)
start <- Sys.time()
modFitGBM <- train(classe ~ .,
                   method = 'gbm',
                   data = training,
                   trControl = trainC)
GBMtime <- Sys.time() - start
predGBM <- predict(modFitGBM, newdata = testing)
cmGBM <- confusionMatrix(predGBM, factor(testing$classe))
cmGBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2192   56    0    1    2
##          B   28 1410   61    0   14
##          C    8   46 1284   44   14
##          D    3    3   19 1229   14
##          E    1    3    4   12 1398
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9576          
##                  95% CI : (0.9529, 0.9619)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9463          
##                                           
##  Mcnemar's Test P-Value : 1.156e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9821   0.9289   0.9386   0.9557   0.9695
## Specificity            0.9895   0.9837   0.9827   0.9941   0.9969
## Pos Pred Value         0.9738   0.9319   0.9198   0.9692   0.9859
## Neg Pred Value         0.9929   0.9829   0.9870   0.9913   0.9932
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2794   0.1797   0.1637   0.1566   0.1782
## Detection Prevalence   0.2869   0.1928   0.1779   0.1616   0.1807
## Balanced Accuracy      0.9858   0.9563   0.9607   0.9749   0.9832
GBMtime
## Time difference of 2.306483 mins

The gbm model resulted in a lower out of sample accuracy (95.8%) than the random forest and bagged tree models, with a training time of ~2.3 minutes.
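caret tunes four gbm parameters (n.trees, interaction.depth, shrinkage, n.minobsinnode) over a small default grid; an explicit grid can be supplied instead. A hedged sketch with illustrative values (not run for this report):

# Hypothetical tuning grid for gbm
gbmGrid <- expand.grid(n.trees = c(100, 150),
                       interaction.depth = c(1, 3, 5),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
modFitGBM2 <- train(classe ~ ., method = 'gbm', tuneGrid = gbmGrid,
                    data = training, trControl = trainC, verbose = FALSE)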

Model Comparison and Interpretation

Comparing the out of sample performance of the four models, the random forest model modFitRF is the best choice, with an accuracy of 98.8% and a kappa of 0.985.
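A compact way to collate these results, assuming the confusion matrix objects computed above:

# Collect out of sample accuracy and kappa for all four models
results <- data.frame(
        Model = c('Decision Tree', 'Bagged Tree', 'Random Forest', 'GBM'),
        Accuracy = c(cmTree$overall['Accuracy'], cmTB$overall['Accuracy'],
                     cmRF$overall['Accuracy'], cmGBM$overall['Accuracy']),
        Kappa = c(cmTree$overall['Kappa'], cmTB$overall['Kappa'],
                  cmRF$overall['Kappa'], cmGBM$overall['Kappa']))
results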

The top 20 most important variables can be viewed with varImp(modFitRF):

varImp(modFitRF)
## rf variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                      Overall
## roll_belt             100.00
## yaw_belt               73.10
## magnet_dumbbell_z      66.01
## pitch_forearm          59.96
## magnet_dumbbell_y      59.76
## pitch_belt             57.87
## roll_forearm           53.24
## magnet_dumbbell_x      49.71
## accel_belt_z           43.69
## magnet_belt_z          43.65
## accel_dumbbell_y       42.71
## magnet_belt_y          41.76
## roll_dumbbell          40.87
## accel_dumbbell_z       35.90
## accel_forearm_x        31.95
## roll_arm               31.60
## accel_arm_x            28.81
## accel_dumbbell_x       27.40
## yaw_dumbbell           27.14
## total_accel_dumbbell   27.11

It may be useful to determine which variables contribute most strongly to successful prediction; this information could be used to prune the sensor apparatus and save cost.

For example, there are no ‘gyros’ variables in the top 20 important variables, indicating that this component of the sensor array may not be necessary for successful prediction of exercise type.
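One way to test that hypothesis (a sketch only, not run for this report) is to refit without the gyroscope channels and compare out of sample accuracy:

# Hypothetical check: drop all gyroscope columns and refit
noGyro <- training %>% select(-starts_with('gyros_'))
modFitRFng <- train(classe ~ ., method = 'rf', ntree = 25,
                    data = noGyro, trControl = trainC)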

Visualizing important variables in the chosen model

# Extract the importance table, sort it, and keep the top 6 predictors
imp_var <- varImp(modFitRF)$importance
imp_var <- rownames_to_column(imp_var) %>% arrange(desc(Overall))
imp_data <- training %>% select(all_of(imp_var[1:6, 1]))
transparentTheme(trans = 0.3)
featurePlot(x = imp_data, 
            y = factor(training$classe),
            plot = 'box',
            scales = list(x = list(relation = 'free'),
                          y = list(relation = 'free')),
            auto.key = list(columns = 5)
            )

Figure 1. Box plots of the top 6 predictors for the random forest model.

transparentTheme(trans = 0.3)
featurePlot(x = imp_data, 
            y = factor(training$classe),
            plot = 'density',
            scales = list(x = list(relation = 'free'),
                          y = list(relation = 'free')),
            pch = '|',
            auto.key = list(columns = 5)
            )

Figure 2. Density plots of the top 6 predictors for the random forest model.

The box plots reveal noticeable differences between exercise types in the $pitch_forearm variable, but there is little obvious separation in the other variables. The density plots also show some visual differences between exercise types, though overlapping trends make the classes hard to distinguish. A 'pairs' plot (not shown) did not reveal any interesting visual patterns.

Final Prediction on 20 Test Cases

# Column 53 is problem_id (the testing file has no classe column)
predCases <- predict(modFitRF, newdata = pml_testing_data[,-53])
predCases
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
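For submission, each prediction can be paired with its case identifier using the problem_id column retained in pml_testing_data:

# Pair predictions with their case IDs
data.frame(problem_id = pml_testing_data$problem_id, prediction = predCases)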

Conclusion

In conclusion, the random forest model performed best, with an out of sample error rate of 1.2%. The same 5-fold cross validation was used for every model; further tuning of the number of trees for the bagged, random forest, and gbm models through a grid search could yield higher accuracy, but it would considerably increase training time (hours per model on my hardware), so I chose not to pursue that optimization at this time.

References

  1. Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. In: Advances in Artificial Intelligence - SBIA 2012, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.