library(data.table)
library(tidyr)
library(dplyr)
library(ggplot2)
library(caret)
library(rattle)
library(tictoc)
library(AppliedPredictiveModeling)
library(kableExtra)
The data for this project was sourced from the Pontifical Catholic University of Rio de Janeiro (PUC-Rio).\(^1\) This project employs the caret package in R to build machine learning models that predict which exercise an individual is performing from wearable sensor data. The sensor data were collected from 4 healthy subjects performing 8 hours of activities, and the activities were categorized into 5 classes (sitting-down, standing-up, standing, walking, and sitting).
The repository for this project can be found on my GitHub.
if(!file.exists('./data')){dir.create('./data')}
fileUrl1 <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
fileUrl2 <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
if(!file.exists('./data/pml-training.csv')){download.file(fileUrl1, './data/pml-training.csv')}
if(!file.exists('./data/pml-testing.csv')){download.file(fileUrl2, './data/pml-testing.csv')}
pml_training <- read.csv('./data/pml-training.csv', na.strings=c('', 'NA', '#DIV/0!'))
pml_testing <- read.csv('./data/pml-testing.csv', na.strings = c('', 'NA', '#DIV/0!'))
print(names(pml_training), quote = F)
## [1] X user_name raw_timestamp_part_1
## [4] raw_timestamp_part_2 cvtd_timestamp new_window
## [7] num_window roll_belt pitch_belt
## [10] yaw_belt total_accel_belt kurtosis_roll_belt
## [13] kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## [16] skewness_roll_belt.1 skewness_yaw_belt max_roll_belt
## [19] max_picth_belt max_yaw_belt min_roll_belt
## [22] min_pitch_belt min_yaw_belt amplitude_roll_belt
## [25] amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt
## [28] avg_roll_belt stddev_roll_belt var_roll_belt
## [31] avg_pitch_belt stddev_pitch_belt var_pitch_belt
## [34] avg_yaw_belt stddev_yaw_belt var_yaw_belt
## [37] gyros_belt_x gyros_belt_y gyros_belt_z
## [40] accel_belt_x accel_belt_y accel_belt_z
## [43] magnet_belt_x magnet_belt_y magnet_belt_z
## [46] roll_arm pitch_arm yaw_arm
## [49] total_accel_arm var_accel_arm avg_roll_arm
## [52] stddev_roll_arm var_roll_arm avg_pitch_arm
## [55] stddev_pitch_arm var_pitch_arm avg_yaw_arm
## [58] stddev_yaw_arm var_yaw_arm gyros_arm_x
## [61] gyros_arm_y gyros_arm_z accel_arm_x
## [64] accel_arm_y accel_arm_z magnet_arm_x
## [67] magnet_arm_y magnet_arm_z kurtosis_roll_arm
## [70] kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm
## [73] skewness_pitch_arm skewness_yaw_arm max_roll_arm
## [76] max_picth_arm max_yaw_arm min_roll_arm
## [79] min_pitch_arm min_yaw_arm amplitude_roll_arm
## [82] amplitude_pitch_arm amplitude_yaw_arm roll_dumbbell
## [85] pitch_dumbbell yaw_dumbbell kurtosis_roll_dumbbell
## [88] kurtosis_picth_dumbbell kurtosis_yaw_dumbbell skewness_roll_dumbbell
## [91] skewness_pitch_dumbbell skewness_yaw_dumbbell max_roll_dumbbell
## [94] max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## [97] min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## [100] amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## [103] var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell
## [106] var_roll_dumbbell avg_pitch_dumbbell stddev_pitch_dumbbell
## [109] var_pitch_dumbbell avg_yaw_dumbbell stddev_yaw_dumbbell
## [112] var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## [115] gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## [118] accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## [121] magnet_dumbbell_z roll_forearm pitch_forearm
## [124] yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## [127] kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## [130] skewness_yaw_forearm max_roll_forearm max_picth_forearm
## [133] max_yaw_forearm min_roll_forearm min_pitch_forearm
## [136] min_yaw_forearm amplitude_roll_forearm amplitude_pitch_forearm
## [139] amplitude_yaw_forearm total_accel_forearm var_accel_forearm
## [142] avg_roll_forearm stddev_roll_forearm var_roll_forearm
## [145] avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm
## [148] avg_yaw_forearm stddev_yaw_forearm var_yaw_forearm
## [151] gyros_forearm_x gyros_forearm_y gyros_forearm_z
## [154] accel_forearm_x accel_forearm_y accel_forearm_z
## [157] magnet_forearm_x magnet_forearm_y magnet_forearm_z
## [160] classe
The data contains 160 variables. One of these variables represents the outcome, $classe, while the rest may potentially serve as predictors. First I check the integrity of the data; then it can be prepared for model creation.
By tabulating the percentage of useable data in each variable, I can quickly determine which columns are useful and which are not.
useableData <- table(round(colMeans(!is.na(pml_training)) * 100, 1))
as.data.table(useableData) %>%
  kbl(col.names = c('% Useable Data', '# of Variables')) %>%
  kable_styling(bootstrap_options = c('striped',
                                      'hover',
                                      'condensed',
                                      'responsive'),
                full_width = F)
Table 1. Count of variables by percentage of useable data.

| % Useable Data | # of Variables |
|---|---|
| 0 | 6 |
| 1.6 | 7 |
| 1.7 | 4 |
| 1.9 | 2 |
| 2 | 12 |
| 2.1 | 69 |
| 100 | 60 |
Many variables contain roughly 2% or less useable data. Additionally, some variables are not meaningful to the exercise being performed (e.g., user_name, the timestamps, and the window variables). These are removed before any model training is performed. The outcome variable $classe is also coerced to a factor.
pml_training$classe <- factor(pml_training$classe)
# Flag columns with less than 5% useable data and drop them,
# along with the identifier, timestamp, and window columns (1:7)
badCols <- as.vector(colMeans(!is.na(pml_training)) * 100 < 5)
pml_training_data <- pml_training[, !badCols]
pml_training_data <- pml_training_data[, -c(1:7)]
pml_testing_data <- pml_testing[, !badCols]
pml_testing_data <- pml_testing_data[, -c(1:7)]
set.seed(13415)
# 60/40 split into training and testing sets
inTrain <- createDataPartition(pml_training_data$classe,
                               p = 0.6, list = F)
training <- pml_training_data[inTrain,]
testing <- pml_training_data[-inTrain,]
summary() was used to compare the variables in the data, and it showed that their distributions and scales differ greatly from one another. However, feature scaling is not necessary for decision-tree-type models, since they are not sensitive to differences in scale between variables (scaling would only reduce the readability of the data and the models).
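For instance, comparing two of the predictors side by side shows how different their scales are (an illustrative check; the two columns were chosen arbitrarily from the list above):
# Quick scale comparison of two predictors
summary(training[, c('roll_belt', 'magnet_dumbbell_z')])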
Since I am predicting a categorical outcome, I used tree-based predictors, and the metric used to judge performance was Accuracy (the fraction of correct predictions) rather than RMSE or \(R^2\). Kappa (a measure of concordance) could also be used; however, the class frequencies are not skewed enough to require that normalized metric:
prop.table(table(training$classe))
##
## A B C D E
## 0.2843071 0.1935292 0.1744226 0.1638927 0.1838485
For model creation I used the caret package because of its uniform function syntax across algorithms. (Week 3 of the JHU Data Science specialization on Coursera describes decision tree models, and I chose to use the models described there.) I tested four different models: a single decision tree (rpart), bagged trees (treebag), a random forest (rf), and boosted trees (gbm).
Cross validation can easily be performed with each algorithm using the trainControl() function. For this dataset I used k-fold cross validation: the observations are split into k subsets (folds), and each fold is held out in turn while the model is trained on the remaining folds. I've chosen to perform 5-fold cross validation here.
trainC <- trainControl(method = 'cv', number = 5)
For each model, I trained on the training data, predicted outcomes on the testing data, then assessed the out of sample accuracy by comparing the predictions to the testing outcomes with confusionMatrix(). I also measured how long each model took to train using Sys.time().
Decision trees estimate, within each node \(m\), a probability for each class \(k\):
\[ \hat{P}_{mk} = \frac{1}{N_m}\sum_{i:\,x_i\in R_m} 1(y_i = k) \]
\(N_m\) = number of training observations in the node's region \(R_m\), \(1(\cdot)\) = indicator function
set.seed(13415)
start <- Sys.time()
modFitTree <- train(classe ~ .,
method = 'rpart',
data = training,
trControl = trainC)
Treetime <- Sys.time() - start
predTree <- predict(modFitTree, newdata = testing)
cmTree <- confusionMatrix(predTree, factor(testing$classe))
cmTree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2038 639 659 574 207
## B 42 517 47 239 206
## C 147 362 662 473 372
## D 0 0 0 0 0
## E 5 0 0 0 657
##
## Overall Statistics
##
## Accuracy : 0.4938
## 95% CI : (0.4826, 0.5049)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3378
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9131 0.34058 0.48392 0.0000 0.45562
## Specificity 0.6297 0.91561 0.79098 1.0000 0.99922
## Pos Pred Value 0.4950 0.49191 0.32837 NaN 0.99245
## Neg Pred Value 0.9480 0.85269 0.87890 0.8361 0.89073
## Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Detection Rate 0.2598 0.06589 0.08437 0.0000 0.08374
## Detection Prevalence 0.5247 0.13395 0.25695 0.0000 0.08437
## Balanced Accuracy 0.7714 0.62810 0.63745 0.5000 0.72742
Treetime
## Time difference of 3.325004 secs
It is no surprise that the out of sample accuracy on the testing data is so low (49.4%), considering the model never predicts class 'D'.
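Since the rattle package is loaded, the fitted tree itself can also be plotted to see why class 'D' never appears as a prediction (an optional check on the model object above):
# Visualize the fitted rpart tree from the model above
fancyRpartPlot(modFitTree$finalModel)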
This model uses the bootstrap principle to build 25 tree models from bootstrap resamples of the training data, then aggregates them into a final model which should be more accurate.
set.seed(13415)
start <- Sys.time()
modFitTB <- train(classe ~ .,
method = 'treebag',
data = training,
trControl = trainC)
TBtime <- Sys.time() - start
predTB <- predict(modFitTB, newdata = testing)
cmTB <- confusionMatrix(predTB, factor(testing$classe))
cmTB
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2218 29 1 1 1
## B 6 1468 16 5 8
## C 8 12 1337 31 6
## D 0 8 14 1246 0
## E 0 1 0 3 1427
##
## Overall Statistics
##
## Accuracy : 0.9809
## 95% CI : (0.9776, 0.9838)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9758
##
## Mcnemar's Test P-Value : 2.474e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9937 0.9671 0.9773 0.9689 0.9896
## Specificity 0.9943 0.9945 0.9912 0.9966 0.9994
## Pos Pred Value 0.9858 0.9767 0.9591 0.9826 0.9972
## Neg Pred Value 0.9975 0.9921 0.9952 0.9939 0.9977
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2827 0.1871 0.1704 0.1588 0.1819
## Detection Prevalence 0.2868 0.1916 0.1777 0.1616 0.1824
## Balanced Accuracy 0.9940 0.9808 0.9843 0.9828 0.9945
TBtime
## Time difference of 33.166 secs
The 'D' outcome is now being predicted, and the out of sample accuracy has increased to 98.1%, at the cost of roughly 30 additional seconds of training time.
Random forest not only bootstraps the observations for each tree, but also randomly samples which variables are considered as split candidates at each node. The aggregated class probability is the average over the trees:
\[ p(c \mid v) = \frac{1}{T}\sum_{t=1}^{T} p_t(c \mid v) \]
\(t\) = tree, \(T\) = number of trees, \(c\) = outcome class, \(v\) = observation
set.seed(13415)
start <- Sys.time()
modFitRF <- train(classe ~ .,
                  method = 'rf',
                  ntree = 25,  # number of trees passed through to randomForest
                  data = training,
                  trControl = trainC)
RFtime <- Sys.time() - start
predRF <- predict(modFitRF, newdata = testing)
cmRF <- confusionMatrix(predRF, factor(testing$classe))
cmRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2226 18 0 0 0
## B 4 1489 21 0 0
## C 2 11 1343 29 1
## D 0 0 4 1255 1
## E 0 0 0 2 1440
##
## Overall Statistics
##
## Accuracy : 0.9881
## 95% CI : (0.9855, 0.9904)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.985
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9973 0.9809 0.9817 0.9759 0.9986
## Specificity 0.9968 0.9960 0.9934 0.9992 0.9997
## Pos Pred Value 0.9920 0.9835 0.9690 0.9960 0.9986
## Neg Pred Value 0.9989 0.9954 0.9961 0.9953 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2837 0.1898 0.1712 0.1600 0.1835
## Detection Prevalence 0.2860 0.1930 0.1767 0.1606 0.1838
## Balanced Accuracy 0.9971 0.9885 0.9875 0.9876 0.9992
RFtime
## Time difference of 4.8081 mins
The random forest model provides a further increase in out of sample accuracy (98.8%) with a time increase of ~4 minutes.
Boosting with trees is similar to bagging in the sense that we aggregate a large number of classifiers to build a stronger predictor. However, boosting re-weights each observation based on whether it was classified correctly in the previous iteration, then combines the weak classifiers \(h_t\) with weights \(\alpha_t\):
\[ f(x) = \operatorname{sgn}\left(\sum_{t = 1}^{T}\alpha_t h_t(x)\right) \]
set.seed(13415)
start <- Sys.time()
modFitGBM <- train(classe ~ .,
method = 'gbm',
data = training,
trControl = trainC)
GBMtime <- Sys.time() - start
predGBM <- predict(modFitGBM, newdata = testing)
cmGBM <- confusionMatrix(predGBM, factor(testing$classe))
cmGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2192 56 0 1 2
## B 28 1410 61 0 14
## C 8 46 1284 44 14
## D 3 3 19 1229 14
## E 1 3 4 12 1398
##
## Overall Statistics
##
## Accuracy : 0.9576
## 95% CI : (0.9529, 0.9619)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9463
##
## Mcnemar's Test P-Value : 1.156e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9821 0.9289 0.9386 0.9557 0.9695
## Specificity 0.9895 0.9837 0.9827 0.9941 0.9969
## Pos Pred Value 0.9738 0.9319 0.9198 0.9692 0.9859
## Neg Pred Value 0.9929 0.9829 0.9870 0.9913 0.9932
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2794 0.1797 0.1637 0.1566 0.1782
## Detection Prevalence 0.2869 0.1928 0.1779 0.1616 0.1807
## Balanced Accuracy 0.9858 0.9563 0.9607 0.9749 0.9832
GBMtime
## Time difference of 2.306483 mins
The gbm model resulted in a lower out of sample accuracy (95.8%) than the random forest and bagged trees, with a training time of ~2.3 minutes.
Comparing the out of sample performance of the models, the random forest model modFitRF is the best choice, with an accuracy of 98.8% and a kappa of 0.985.
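For a quick side-by-side view, the accuracies, kappas, and training times computed above can be gathered into a single table (a summary sketch using the confusion matrix and timing objects already created):
# Collect out of sample accuracy, kappa, and training time for each model
model_comparison <- data.frame(
  Model    = c('Decision tree', 'Bagged trees', 'Random forest', 'Boosted trees'),
  Accuracy = c(cmTree$overall['Accuracy'], cmTB$overall['Accuracy'],
               cmRF$overall['Accuracy'], cmGBM$overall['Accuracy']),
  Kappa    = c(cmTree$overall['Kappa'], cmTB$overall['Kappa'],
               cmRF$overall['Kappa'], cmGBM$overall['Kappa']),
  Seconds  = c(as.numeric(Treetime, units = 'secs'),
               as.numeric(TBtime, units = 'secs'),
               as.numeric(RFtime, units = 'secs'),
               as.numeric(GBMtime, units = 'secs'))
)
model_comparison %>%
  kbl(digits = 3) %>%
  kable_styling(bootstrap_options = c('striped', 'hover'), full_width = F)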
The top 20 most important variables can be viewed with varImp(modFitRF):
varImp(modFitRF)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## yaw_belt 73.10
## magnet_dumbbell_z 66.01
## pitch_forearm 59.96
## magnet_dumbbell_y 59.76
## pitch_belt 57.87
## roll_forearm 53.24
## magnet_dumbbell_x 49.71
## accel_belt_z 43.69
## magnet_belt_z 43.65
## accel_dumbbell_y 42.71
## magnet_belt_y 41.76
## roll_dumbbell 40.87
## accel_dumbbell_z 35.90
## accel_forearm_x 31.95
## roll_arm 31.60
## accel_arm_x 28.81
## accel_dumbbell_x 27.40
## yaw_dumbbell 27.14
## total_accel_dumbbell 27.11
It may be useful to determine which variables more strongly contribute to a successful prediction. This information can be used to prune the sensor apparatus to save cost.
For example, there are no ‘gyros’ variables in the top 20 important variables, indicating that this component of the sensor array may not be necessary for successful prediction of exercise type.
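That observation can be checked directly against the importance scores (a small sanity check using the varImp() output shown above):
# Top 20 variables by importance; check whether any gyroscope readings appear
imp <- varImp(modFitRF)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
any(grepl('^gyros_', top20))  # FALSE for this model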
imp_var <- as.data.frame(varImp(modFitRF)$importance)
imp_var <- tibble::rownames_to_column(imp_var) %>% arrange(desc(Overall))
# Pull the six most important predictors from the training data for plotting
imp_data <- training %>% select(all_of(imp_var[1:6, 1]))
transparentTheme(trans = 0.3)
featurePlot(x = imp_data,
y = factor(training$classe),
plot = 'box',
scales = list(x = list(relation = 'free'),
y = list(relation = 'free')),
auto.key = list(columns = 5)
)
Figure 1. Box plots of the top 6 predictors for the random forest model.
transparentTheme(trans = 0.3)
featurePlot(x = imp_data,
y = factor(training$classe),
plot = 'density',
scales = list(x = list(relation = 'free'),
y = list(relation = 'free')),
pch = '|',
auto.key = list(columns = 5)
)
Figure 2. Density plots of the top 6 predictors for the random forest model.
The box plots reveal noticeable differences between exercise types in the $pitch_forearm variable; however, there isn't much obvious separation between exercise types for the other variables. The density plots also show some visual differences between exercise types, though similar trends across variables lead to a lot of overlap. The 'pairs' plot did not reveal any interesting visual patterns.
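For reference, the pairs plot mentioned above can be generated with the same featurePlot() call by swapping the plot type (a sketch using the same six predictors):
transparentTheme(trans = 0.3)
featurePlot(x = imp_data,
            y = factor(training$classe),
            plot = 'pairs',
            auto.key = list(columns = 5))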
predCases <- predict(modFitRF, newdata = pml_testing_data[,-53])
predCases
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
In conclusion, the random forest model was found to perform the best, with an out of sample error rate of 1.2%. The same k-fold cross validation was used for each model; however, further tuning of the tree parameters for the bagged, random forest, and gbm models using a grid-search type of cross validation could result in higher accuracy (a sketch of the idea follows below). This would considerably increase the model creation time (hours per model with my hardware), so I chose not to pursue this optimization at this time.
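As an illustration of what that grid search could look like for the random forest (not run here; the grid values are arbitrary, and in caret the random forest's tunable parameter is mtry rather than the number of trees):
# Not run: example grid search over mtry for the random forest
rfGrid <- expand.grid(mtry = c(2, 7, 14, 27, 52))
modFitRFGrid <- train(classe ~ .,
                      method = 'rf',
                      data = training,
                      trControl = trainC,
                      tuneGrid = rfGrid)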