Flu Analysis - Model Evaluation
Load Libraries
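The library chunk itself is not shown; a minimal sketch, assuming the packages used throughout this document (tidyverse, here, scales, tidymodels, and glmnet), would be:

library(tidyverse)   # data wrangling and plotting (dplyr, ggplot2, purrr, ...)
library(here)        # project-relative file paths
library(scales)      # scale and label helpers
library(tidymodels)  # modeling framework (rsample, recipes, parsnip, yardstick, ...)
library(glmnet)      # regularized regression (loads Matrix as a dependency)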
Load the data
# load the cleaned data RDS file from the project data folder
data = readRDS(here("fluanalysis", "data", "cleandata.rds"))
Check the data
#check the data to make sure it has loaded properly
head(data)
SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion CoughYN Sneeze
1 Yes No No No Yes No
2 Yes Yes No Yes Yes No
3 Yes Yes Yes Yes No Yes
4 Yes Yes Yes Yes Yes Yes
5 Yes No Yes No No No
6 No No Yes No Yes Yes
Fatigue SubjectiveFever Headache Weakness WeaknessYN CoughIntensity CoughYN2
1 Yes Yes Yes Mild Yes Severe Yes
2 Yes Yes Yes Severe Yes Severe Yes
3 Yes Yes Yes Severe Yes Mild Yes
4 Yes Yes Yes Severe Yes Moderate Yes
5 Yes Yes Yes Moderate Yes None No
6 Yes Yes Yes Moderate Yes Moderate Yes
Myalgia MyalgiaYN RunnyNose AbPain ChestPain Diarrhea EyePn Insomnia
1 Mild Yes No No No No No No
2 Severe Yes No No No No No No
3 Severe Yes Yes Yes Yes No No Yes
4 Severe Yes Yes No No No No Yes
5 Mild Yes No No No No Yes Yes
6 Moderate Yes No No Yes Yes No No
ItchyEye Nausea EarPn Hearing Pharyngitis Breathless ToothPn Vision Vomit
1 No No No No Yes No No No No
2 No No Yes Yes Yes No No No No
3 No Yes No No Yes Yes Yes No No
4 No Yes Yes No Yes No No No No
5 No Yes No No Yes No No No No
6 No Yes No No Yes Yes No No No
Wheeze BodyTemp
1 No 98.3
2 No 100.4
3 No 100.8
4 Yes 98.8
5 No 100.5
6 Yes 98.4
Data Splitting
# set seed
set.seed(456)
## Split the data into test and training
data_split = initial_split(data, strata = Nausea)
# create training and test sets
data_train = training(data_split)
data_test = testing(data_split)
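Since we stratified on Nausea, a quick sanity check (not in the original analysis) can confirm the outcome balance carried over to both sets:

# proportions of the outcome in each split should be similar
prop.table(table(data_train$Nausea))
prop.table(table(data_test$Nausea))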
Create Workflow and Fit Model
# logistic regression model, since the outcome is categorical
recipe_1 = recipe(Nausea ~ ., data = data_train)
log_mod = logistic_reg() %>%
  set_engine("glm")
# create workflow
workflow_1 = workflow() %>%
  add_model(log_mod) %>%
  add_recipe(recipe_1)
workflow_1
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
# fit the model with the training set
model_fit = workflow_1 %>%
  fit(data = data_train)
model_fit %>%
  extract_fit_parsnip() %>%
  tidy()
# A tibble: 38 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3.21 9.15 -0.351 0.726
2 SwollenLymphNodesYes -0.435 0.231 -1.88 0.0596
3 ChestCongestionYes 0.0827 0.250 0.330 0.741
4 ChillsSweatsYes 0.119 0.332 0.358 0.720
5 NasalCongestionYes 0.327 0.295 1.11 0.267
6 CoughYNYes -0.501 0.590 -0.849 0.396
7 SneezeYes 0.214 0.251 0.854 0.393
8 FatigueYes 0.391 0.463 0.845 0.398
9 SubjectiveFeverYes 0.423 0.260 1.62 0.104
10 HeadacheYes 0.415 0.342 1.22 0.224
# … with 28 more rows
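Since these estimates are on the log-odds scale, they can optionally be exponentiated into odds ratios; a small addition, not part of the original output:

# same coefficient table, but with estimates as odds ratios
model_fit %>%
  extract_fit_parsnip() %>%
  tidy(exponentiate = TRUE)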
# model evaluation: predicted classes on the training set
predict(model_fit, data_train)
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
# A tibble: 547 × 1
.pred_class
<fct>
1 No
2 No
3 No
4 No
5 No
6 No
7 No
8 No
9 No
10 No
# … with 537 more rows
model_train_eval = augment(model_fit, data_train)
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
# ROC curve to estimate the area under the curve
model_train_eval %>%
  roc_curve(truth = Nausea, .pred_No) %>%
  autoplot()
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
ℹ The deprecated feature was likely used in the yardstick package.
Please report the issue at <https://github.com/tidymodels/yardstick/issues>.
# AUC
model_train_eval %>%
  roc_auc(truth = Nausea, .pred_No)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.790
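ROC AUC is threshold-free; for a threshold-based complement, yardstick's accuracy() can be computed on the same augmented data (an addition, not in the original):

# proportion of training cases whose predicted class matches the truth
model_train_eval %>%
  accuracy(truth = Nausea, estimate = .pred_class)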
Predict on Test Data
# evaluate the model on the test set
predict(model_fit, data_test)
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
# A tibble: 183 × 1
.pred_class
<fct>
1 No
2 No
3 No
4 No
5 Yes
6 Yes
7 No
8 Yes
9 Yes
10 No
# … with 173 more rows
model_test_eval = augment(model_fit, data_test)
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
# ROC curve to estimate the area under the curve
model_test_eval %>%
  roc_curve(truth = Nausea, .pred_No) %>%
  autoplot()
# AUC
model_test_eval %>%
  roc_auc(truth = Nausea, .pred_No)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.707
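A confusion matrix gives a more granular view of where the test-set predictions go wrong (a supplementary check, not in the original):

# cross-tabulation of true vs. predicted classes on the test set
model_test_eval %>%
  conf_mat(truth = Nausea, estimate = .pred_class)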
Fit model with main predictor
set.seed(5678)
recipe_2 = recipe(Nausea ~ RunnyNose, data = data_train)

# logistic regression model
log_mod = logistic_reg() %>%
  set_engine("glm")
workflow_2 = workflow() %>%
  add_model(log_mod) %>%
  add_recipe(recipe_2)
workflow_2
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
model_2 = workflow_2 %>%
  fit(data = data_train)
model_2 %>%
  extract_fit_parsnip() %>%
  tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.625 0.170 -3.67 0.000242
2 RunnyNoseYes 0.00301 0.200 0.0150 0.988
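To see how little RunnyNose moves the prediction, the fitted log-odds can be converted to probabilities with base R's plogis() (a quick check, not in the original):

plogis(-0.625)           # P(Nausea) without runny nose: ~0.35
plogis(-0.625 + 0.00301) # P(Nausea) with runny nose:    ~0.35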
Use model to predict
# model evaluation: training-set predictions and ROC
predict(model_2, data_train)
# A tibble: 547 × 1
.pred_class
<fct>
1 No
2 No
3 No
4 No
5 No
6 No
7 No
8 No
9 No
10 No
# … with 537 more rows
model_train_eval_2 = augment(model_2, data_train)
model_train_eval_2 %>%
  roc_curve(truth = Nausea, .pred_No) %>%
  autoplot()
# AUC
model_train_eval_2 %>%
  roc_auc(truth = Nausea, .pred_No)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.500
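An AUC of 0.50 means the model ranks cases no better than chance. With a single binary predictor it can only emit two distinct probabilities, which a quick look confirms (not in the original):

# only two unique predicted probabilities, one per RunnyNose level
model_train_eval_2 %>%
  distinct(RunnyNose, .pred_No)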
# predictions and ROC for the test set
predict(model_2, data_test)
# A tibble: 183 × 1
.pred_class
<fct>
1 No
2 No
3 No
4 No
5 No
6 No
7 No
8 No
9 No
10 No
# … with 173 more rows
model_test_eval_2 = augment(model_2, data_test)
model_test_eval_2 %>%
  roc_curve(truth = Nausea, .pred_No) %>%
  autoplot()
# AUC
model_test_eval_2 %>%
  roc_auc(truth = Nausea, .pred_No)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.520
This section is added by Aidan Troha
We will be using the cleaned flu analysis data, so we first need to load it from the data folder.
dat <- readRDS(here::here("fluanalysis", "data", "cleandata.rds"))
Generating training and test data sets
We’ll then need to split off a held-out portion of the cleaned data, called the test data set, which we will use to assess the efficacy of the generated model. We will use the remaining data, the training data set, to fit the model.
To do this, we first set a seed with set.seed() so that the random split is reproducible. We then use initial_split() from the rsample package to generate a splitting rule for the training and test data sets.
set.seed(55555)
data_split <- rsample::initial_split(dat, prop = 7/10)
training_data <- training(data_split)
test_data <- testing(data_split)
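Unlike the first split, this one is not stratified. If desired, initial_split() also accepts a strata argument for numeric outcomes (binned into quartiles by default); a hedged variant, not used in the original:

# alternative split, stratified on the continuous outcome
data_split <- rsample::initial_split(dat, prop = 7/10, strata = BodyTemp)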
Generating a workflow
We intend to use the tidymodels workflow to generate our linear regression model. Within this workflow, we use recipe() and workflow() to identify the relationships of interest.
# Initialize the recipe with the relationship we are interested in
flu_line_rec <- recipe(BodyTemp ~ ., data = training_data)
# Initialize the linear regression model
line_mod <- linear_reg() %>%
  set_engine("lm")
# Initialize the workflow
flu_wflowP2 <-
  workflow() %>%
  add_model(line_mod) %>%
  add_recipe(flu_line_rec)
flu_wflowP2
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Now that we have generated the workflow, we can fit the model to the training and test data sets, respectively. Note that because test_fit is trained on the test data itself, its RMSE below reflects in-sample fit rather than true out-of-sample performance.
training_fit <- flu_wflowP2 %>%
  fit(data = training_data)
test_fit <- flu_wflowP2 %>%
  fit(data = test_data)
Fitting the model with the primary predictor
Now, let's use only one predictor instead of all of them.
flu_line_rec2 <- recipe(BodyTemp ~ RunnyNose, data = training_data)

flu_wflow2 <-
  workflow() %>%
  add_model(line_mod) %>%
  add_recipe(flu_line_rec2)

training_fit2 <- flu_wflow2 %>%
  fit(data = training_data)
test_fit2 <- flu_wflow2 %>%
  fit(data = test_data)
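As with the logistic models above, the fitted coefficient can be inspected via extract_fit_parsnip(), mirroring the earlier pattern (not shown in the original):

# coefficient table for the single-predictor linear model
training_fit2 %>%
  extract_fit_parsnip() %>%
  tidy()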
We now want to compare the predictions across both models for each data set. To do this, we use augment().
training_aug <- augment(training_fit, training_data)
Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response"): prediction from a rank-deficient fit may be misleading
test_aug <- augment(test_fit, test_data)
Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response"): prediction from a rank-deficient fit may be misleading
training_aug2 <- augment(training_fit2, training_data)
test_aug2 <- augment(test_fit2, test_data)
If we want to assess how well the model predicts a continuous outcome, we can evaluate this with the Root Mean Squared Error (RMSE). rmse() from the Metrics package will evaluate the fit of the model on the training_data and the test_data separately.
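For intuition, RMSE is simply the square root of the mean squared residual, so the call below can be reproduced by hand (a sanity check, not in the original):

# hand-rolled RMSE on the training predictions
sqrt(mean((training_aug$BodyTemp - training_aug$.pred)^2))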
# Model with all predictors
Metrics::rmse(actual = training_aug$BodyTemp, predicted = training_aug$.pred)
[1] 1.137446
Metrics::rmse(actual = test_aug$BodyTemp, predicted = test_aug$.pred)
[1] 0.9599012
# Model with main predictor alone
Metrics::rmse(actual = training_aug2$BodyTemp, predicted = training_aug2$.pred)
[1] 1.215498
Metrics::rmse(actual = test_aug2$BodyTemp, predicted = test_aug2$.pred)
[1] 1.120484
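Since tidymodels is already attached, yardstick's rmse() gives the same values in tibble form (an equivalent call, not in the original):

# yardstick equivalent of the Metrics calls above
training_aug %>% yardstick::rmse(truth = BodyTemp, estimate = .pred)
test_aug %>% yardstick::rmse(truth = BodyTemp, estimate = .pred)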
Conclusion
The results above show that the model with all available predictors fits the data best, as it yields the lower Root Mean Squared Error (RMSE) on both the training and test sets.