This post follows exercise 1 in Chapter 12 of Applied Predictive Modeling. Here I use the machine learning package caret in R to build classification models; in particular, the linear classification models discussed in Chapter 12.
The dataset in question concerns hepatic injury (liver damage). It includes a dataframe of biological predictors of liver damage, bio, a dataframe of chemical predictors, chem, and the response variable we are interested in, injury. If a model can be fitted adequately, it could potentially be used to screen for harmful compounds in the future.
library(caret)
library(pROC)
library(AppliedPredictiveModeling)
data(hepatic)
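For orientation, a quick look at the objects loaded by data(hepatic); this sketch only prints dimensions and class counts, nothing is modified.
##Quick orientation: sizes of the predictor frames and the raw class counts
dim(bio)
dim(chem)
table(injury)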
Before fitting a model, it's always necessary to preprocess the data. For linear classification algorithms, this means:
- Remove near-zero variance predictors
- Remove collinear variables
##remove problematic variables for 'bio' dataframe
bio_zerovar <- nearZeroVar(bio)
bio_collinear <- findLinearCombos(cov(bio))$remove
bio <- bio[, -c(bio_zerovar, bio_collinear)]
##remove problematic variables for 'chem' dataframe
chem_zerovar <- nearZeroVar(chem)
chem_collinear <- findLinearCombos(chem)$remove
chem <- chem[, -c(chem_zerovar, chem_collinear)]
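As a quick sanity check, re-running the near-zero variance screen on the filtered frames should now flag nothing; this sketch just repeats the filter above.
##Both calls should return zero flagged columns after filtering
length(nearZeroVar(bio))
length(nearZeroVar(chem))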
The injury response variable I am fitting to has three classes: "None", "Mild", and "Severe". If the response variable has too many classes, it can (somewhat subjectively) be collapsed. For expediency's sake, I decide to collapse injury into two classes, "Yes" or "No", where I count mild injuries as "No". (Warning: this might hurt predictive performance, so in the future I'll be sure to try multi-class probability predictions.)
##Collapse response into "Yes" or "No" values
lut <- c("None" = "No", "Mild" = "No", "Severe" = "Yes")
injury2 <- factor(unname(lut[injury]))
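A quick look at the class balance after collapsing makes the imbalance concrete; this is just a table of the new response.
##Check the class balance of the collapsed response
table(injury2)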
Now, I should consider a few other questions.
- How to partition data with unbalanced classes ("No" far outnumbers "Yes" in injury)
- How to validate model results
- Which metric to maximize for the best model (Accuracy, Sensitivity, or Specificity)
Thankfully, caret makes data partitioning easy with createDataPartition, which uses stratified sampling to help control for severe class imbalances. As for validation, I choose to cross-validate each model with repeated 10-fold cross validation, specified in the trainControl call. (Although my first model will not need cross validation, the others will, so I'll simply use the same control for each model.)
set.seed(1234)
##Partition the data with stratified sampling
training <- createDataPartition(injury2, p = .8, list = FALSE)
##Partition train and test sets for bio data
bio_train <- bio[training, ]
bio_test <- bio[-training, ]
##Partition for train and test sets for chemical data
chem_train <- chem[training, ]
chem_test <- chem[-training, ]
##Partition for the train and test sets for the response variable
injury_train <- injury2[training]
injury_test <- injury2[-training]
## Set training control for model building
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     summaryFunction = twoClassSummary, classProbs = TRUE,
                     savePredictions = TRUE)
It's important to decide which metric to train my models for: Accuracy, Sensitivity, or Specificity. For this, you also need to know which class counts as "positive" from caret's perspective. Caret treats the first factor level as the "positive" class, which corresponds to "No" in my injury vector. Therefore, Sensitivity here is the proportion of "No" values correctly predicted, whereas Specificity is the proportion of "Yes" values correctly predicted.
Deciding between those choices, it seems most important to build a model that can predict the "Yes" values, the cases that result in hepatic damage. A model that wrongly predicts no liver damage would be the costlier mistake, so we should do everything we can to capture the "Yes" values, even if it means sacrificing accuracy.
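A quick check confirms which level caret will treat as the first (and therefore "positive") class; this only inspects the factor created earlier.
##Confirm the factor level order; the first level ("No") is what caret treats as positive
levels(injury2)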
The first model is a classifier using linear discriminant analysis. I fit one model to the biological predictors and one to the chemical predictors, to see which set has better predictive power.
bio_lda <- train(bio_train, injury_train, method = "lda", trControl = ctrl, metric = "Spec")
chem_lda <- train(chem_train, injury_train, method = "lda", trControl = ctrl, metric = "Spec")
##Generate LDA model predictions for bio indicator test set
bio_lda_pred <- predict(bio_lda, bio_test)
##Generate confusion matrix to show results
confusionMatrix(bio_lda_pred, injury_test, positive = "No")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 44 3
## Yes 6 3
##
## Accuracy : 0.8393
## 95% CI : (0.7167, 0.9238)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.9278
##
## Kappa : 0.3115
## Mcnemar's Test P-Value : 0.5050
##
## Sensitivity : 0.8800
## Specificity : 0.5000
## Pos Pred Value : 0.9362
## Neg Pred Value : 0.3333
## Prevalence : 0.8929
## Detection Rate : 0.7857
## Detection Prevalence : 0.8393
## Balanced Accuracy : 0.6900
##
## 'Positive' Class : No
##
Note that the No Information Rate is .89, meaning that if we simply guessed "No" every time, we would be right 89% of the time. Accuracy here is .84, which actually falls below that baseline. But remember, accuracy isn't the goal; predicting the true "Yes" values correctly is.
For the biological predictors, we get .5 for Specificity, correctly identifying 3 "Yes" values but missing the other 3. I hope other models can do better.
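Before moving on, a quick sanity check that this Specificity number means what I claim: with "No" as the positive class, Specificity is just the fraction of actual "Yes" cases predicted "Yes", computed here directly from the predictions already made.
##Manual Specificity: true "Yes" predictions divided by all actual "Yes" cases
cm <- table(Prediction = bio_lda_pred, Reference = injury_test)
cm["Yes", "Yes"] / sum(cm[, "Yes"])
## [1] 0.5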
##Chem predictor LDA model
chem_lda_pred <- predict(chem_lda, chem_test)
##confusion matrix
confusionMatrix(chem_lda_pred, injury_test, positive = "No")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 41 5
## Yes 9 1
##
## Accuracy : 0.75
## 95% CI : (0.6163, 0.8561)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.9994
##
## Kappa : -0.0103
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.8200
## Specificity : 0.1667
## Pos Pred Value : 0.8913
## Neg Pred Value : 0.1000
## Prevalence : 0.8929
## Detection Rate : 0.7321
## Detection Prevalence : 0.8214
## Balanced Accuracy : 0.4933
##
## 'Positive' Class : No
##
LDA on the chemical predictors fares worse at predicting the true "Yes" values; here Specificity only reaches .17.
Now let's try some other models for comparison. The first will be a penalized logistic regression model fit with glmnet. With alpha = 1 the penalty is the lasso, with alpha = 0 it is ridge regression, and anything in between is an elastic net; lambda controls the overall strength of the penalty. Here I don't specify a tuning grid; I just let caret come up with a list of candidate parameters for me, with tuneLength = 20.
bio_plr <- train(bio_train, injury_train, method = "glmnet", family = "binomial", metric = "Spec", tuneLength = 20,
trControl = ctrl)
chem_plr <- train(chem_train, injury_train, method = "glmnet", family = "binomial", metric = "Spec", tuneLength = 20, trControl = ctrl)
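Since caret built the candidate grid itself, it's worth checking which alpha/lambda combination the repeated cross validation actually selected; this is just an inspection (output omitted).
##Inspect the penalty settings chosen by resampling
bio_plr$bestTune
chem_plr$bestTune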
bio_plr_pred <- predict(bio_plr, bio_test)
confusionMatrix(bio_plr_pred, injury_test, positive = "No")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 43 4
## Yes 7 2
##
## Accuracy : 0.8036
## 95% CI : (0.6757, 0.8977)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.9858
##
## Kappa : 0.1585
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.8600
## Specificity : 0.3333
## Pos Pred Value : 0.9149
## Neg Pred Value : 0.2222
## Prevalence : 0.8929
## Detection Rate : 0.7679
## Detection Prevalence : 0.8393
## Balanced Accuracy : 0.5967
##
## 'Positive' Class : No
##
This penalized logistic regression doesn’t perform as well as simple linear discriminant analysis for the bio predictors.
And now the chem predictors.
chem_plr_pred <- predict(chem_plr, chem_test)
confusionMatrix(chem_plr_pred, injury_test, positive = "No")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 44 6
## Yes 6 0
##
## Accuracy : 0.7857
## 95% CI : (0.6556, 0.8841)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.9945
##
## Kappa : -0.12
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.8800
## Specificity : 0.0000
## Pos Pred Value : 0.8800
## Neg Pred Value : 0.0000
## Prevalence : 0.8929
## Detection Rate : 0.7857
## Detection Prevalence : 0.8929
## Balanced Accuracy : 0.4400
##
## 'Positive' Class : No
##
Penalized logistic regression fares even worse for the chemical predictors, with a Specificity of 0. Clearly this model does not capture the pattern we're looking for in these data.
Now, to fit a partial least squares model (with a factor outcome, caret's "pls" method performs partial least squares discriminant analysis).
bio_pls <- train(bio_train, injury_train, method = "pls", trControl = ctrl, metric = "Spec", tuneLength = 20)
chem_pls <- train(chem_train, injury_train, method = "pls", trControl = ctrl, metric = "Spec", tuneLength = 20)
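Before looking at the test sets, the resampling profile over the number of PLS components can be inspected; this is only a quick look, and the plot itself isn't shown here.
##Cross-validated Specificity across the candidate numbers of components
plot(bio_pls, metric = "Spec")
##Number of components chosen by resampling
bio_pls$bestTune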
bio_pls_pred <- predict(bio_pls, bio_test)
confusionMatrix(bio_pls_pred, injury_test, positive = "No")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 45 5
## Yes 5 1
##
## Accuracy : 0.8214
## 95% CI : (0.696, 0.9109)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.9664
##
## Kappa : 0.0667
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9000
## Specificity : 0.1667
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.1667
## Prevalence : 0.8929
## Detection Rate : 0.8036
## Detection Prevalence : 0.8929
## Balanced Accuracy : 0.5333
##
## 'Positive' Class : No
##
Only .17 for Specificity achieved here for the biological predictors.
And for the chemical predictors, we get 0, as seen below. The chemical predictors are consistently faring worse than the biological predictors at predicting hepatic injury, and there is still no great model so far.
chem_pls_pred <- predict(chem_pls, chem_test)
confusionMatrix(chem_pls_pred, injury_test, positive = "No")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 47 6
## Yes 3 0
##
## Accuracy : 0.8393
## 95% CI : (0.7167, 0.9238)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.9278
##
## Kappa : -0.0769
## Mcnemar's Test P-Value : 0.5050
##
## Sensitivity : 0.9400
## Specificity : 0.0000
## Pos Pred Value : 0.8868
## Neg Pred Value : 0.0000
## Prevalence : 0.8929
## Detection Rate : 0.8393
## Detection Prevalence : 0.9464
## Balanced Accuracy : 0.4700
##
## 'Positive' Class : No
##
The last model is nearest shrunken centroids (caret's "pam" method), with the predictors centered and scaled before fitting.
bio_centroid <- train(bio_train, injury_train, method = "pam",
                      trControl = ctrl, preProcess = c("center", "scale"),
                      metric = "Spec", tuneLength = 20)
chem_centroid <- train(chem_train, injury_train, method = "pam",
trControl = ctrl, preProcess = c("center", "scale"), metric = "Spec", tuneLength = 20)
bio_centroid_pred <- predict(bio_centroid, bio_test)
chem_centroid_pred <- predict(chem_centroid, chem_test)
confusionMatrix(bio_centroid_pred, injury_test, positive = "No")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 50 6
## Yes 0 0
##
## Accuracy : 0.8929
## 95% CI : (0.7812, 0.9597)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.60647
##
## Kappa : 0
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8929
## Neg Pred Value : NaN
## Prevalence : 0.8929
## Detection Rate : 0.8929
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : No
##
Specificity of 0 here: the shrunken centroid model on the biological predictors simply predicts "No" for every case.
confusionMatrix(chem_centroid_pred, injury_test, positive = "No")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 45 5
## Yes 5 1
##
## Accuracy : 0.8214
## 95% CI : (0.696, 0.9109)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.9664
##
## Kappa : 0.0667
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9000
## Specificity : 0.1667
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.1667
## Prevalence : 0.8929
## Detection Rate : 0.8036
## Detection Prevalence : 0.8929
## Balanced Accuracy : 0.5333
##
## 'Positive' Class : No
##
And .17 for Specificity here; yawn.
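For a side-by-side view of the cross-validated estimates behind all of these fits, caret's resamples() helper collects the resampling results into one summary; a sketch only, and whether the resampled ranking agrees with the test-set ranking isn't shown here.
##Collect the cross-validation results from every fit for comparison
results <- resamples(list(bio_lda = bio_lda, chem_lda = chem_lda,
                          bio_plr = bio_plr, chem_plr = chem_plr,
                          bio_pls = bio_pls, chem_pls = chem_pls,
                          bio_pam = bio_centroid, chem_pam = chem_centroid))
summary(results)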
So the best model for predicting hepatic injury turns out to be the first fit, the LDA model on the biological predictors, with a test-set Specificity of .5. Here is its ROC curve on the test set:
predictions_bio_lda <- predict(bio_lda, bio_test, type = "prob")
pROC::plot.roc(injury_test, predictions_bio_lda$Yes)
However, the area under the curve is not as high as I'd like. Perhaps in the future I'll revisit these data and see what can be done differently to predict injury.
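For a number to go with the plot, pROC can also report the area under the curve directly, using the same test-set probabilities as above.
##Area under the ROC curve for the bio LDA model
pROC::auc(injury_test, predictions_bio_lda$Yes)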