Flu Analysis - Wrangling
Data Loading and Processing
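Load Libraries
library(tidyverse) #data wrangling (dplyr, tidyr, ...)
library(here) #project-relative file paths
library(tidymodels) #recipes for preprocessing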
Load the data
raw_data = readRDS(here("fluanalysis", "data", "SympAct_Any_Pos.Rda")) #load RDS file
Data Contents
str(raw_data) #ensure the data is complete
'data.frame': 735 obs. of 63 variables:
$ DxName1 : Factor w/ 92 levels "Acute bronchitis, unspecified",..: 57 16 57 57 9 57 39 17 57 57 ...
$ DxName2 : Factor w/ 142 levels "14 weeks gestation of pregnancy",..: NA 69 11 129 69 NA 69 60 12 NA ...
$ DxName3 : Factor w/ 92 levels "Abnormal weight loss",..: NA NA NA NA NA NA NA NA 38 NA ...
$ DxName4 : Factor w/ 41 levels "Acute bronchitis, unspecified",..: NA NA NA NA NA NA NA NA 30 NA ...
$ DxName5 : Factor w/ 5 levels "Acute suppurative otitis media without spontaneous rupture of ear drum, right ear",..: NA NA NA NA NA NA NA NA 3 NA ...
$ Unique.Visit : chr "340_17632125" "340_17794836" "342_17737773" "342_17806002" ...
$ ActivityLevel : int 10 6 2 2 5 3 4 0 0 5 ...
..- attr(*, "label")= chr "Activity Level"
$ ActivityLevelF : Factor w/ 11 levels "0","1","2","3",..: 11 7 3 3 6 4 5 1 1 6 ...
$ SwollenLymphNodes: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 2 1 ...
..- attr(*, "label")= chr "Swollen Lymph Nodes"
$ ChestCongestion : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
..- attr(*, "label")= chr "Chest Congestion"
$ ChillsSweats : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 2 2 2 1 ...
..- attr(*, "label")= chr "Chills/Sweats"
$ NasalCongestion : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
..- attr(*, "label")= chr "Nasal Congestion"
$ CoughYN : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 2 2 2 ...
..- attr(*, "label")= chr "Cough"
$ Sneeze : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2 1 2 1 1 ...
..- attr(*, "label")= chr "Sneeze"
$ Fatigue : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "label")= chr "Fatigue"
$ SubjectiveFever : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
..- attr(*, "label")= chr "Subjective Fever"
$ Headache : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
..- attr(*, "label")= chr "Headache"
$ Weakness : Factor w/ 4 levels "None","Mild",..: 2 4 4 4 3 3 2 4 3 3 ...
$ WeaknessYN : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "label")= chr "Weakness"
$ CoughIntensity : Factor w/ 4 levels "None","Mild",..: 4 4 2 3 1 3 4 3 3 3 ...
..- attr(*, "label")= chr "Cough Severity"
$ CoughYN2 : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 2 ...
$ Myalgia : Factor w/ 4 levels "None","Mild",..: 2 4 4 4 2 3 2 4 3 2 ...
$ MyalgiaYN : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "label")= chr "Myalgia"
$ RunnyNose : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 2 2 2 2 ...
..- attr(*, "label")= chr "Runny Nose"
$ AbPain : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Abdominal Pain"
$ ChestPain : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 2 1 1 1 ...
..- attr(*, "label")= chr "Chest Pain"
$ Diarrhea : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
$ EyePn : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
..- attr(*, "label")= chr "Eye Pain"
$ Insomnia : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 1 2 2 2 ...
..- attr(*, "label")= chr "Sleeplessness"
$ ItchyEye : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Itchy Eyes"
$ Nausea : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 1 1 2 2 ...
$ EarPn : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Ear Pain"
$ Hearing : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Loss of Hearing"
$ Pharyngitis : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 1 1 ...
..- attr(*, "label")= chr "Sore Throat"
$ Breathless : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 2 ...
..- attr(*, "label")= chr "Breathlessness"
$ ToothPn : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 2 1 ...
..- attr(*, "label")= chr "Tooth Pain"
$ Vision : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Blurred Vision"
$ Vomit : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
..- attr(*, "label")= chr "Vomiting"
$ Wheeze : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 1 1 ...
..- attr(*, "label")= chr "Wheezing"
$ BodyTemp : num 98.3 100.4 100.8 98.8 100.5 ...
$ RapidFluA : Factor w/ 2 levels "Positive for Influenza A",..: 2 NA 2 2 NA NA NA 1 2 2 ...
$ RapidFluB : Factor w/ 2 levels "Positive for Influenza B",..: 2 NA 2 2 NA NA NA 2 2 2 ...
$ PCRFluA : Factor w/ 4 levels " Influenza A Detected",..: NA NA NA NA NA NA 2 NA NA NA ...
$ PCRFluB : Factor w/ 3 levels " Influenza B Detected",..: NA NA NA NA NA NA 2 NA NA NA ...
$ TransScore1 : num 1 3 4 5 0 2 2 5 4 4 ...
$ TransScore1F : Factor w/ 6 levels "0","1","2","3",..: 2 4 5 6 1 3 3 6 5 5 ...
..- attr(*, "label")= chr "Infectiousness Score"
$ TransScore2 : num 1 2 3 4 0 2 2 4 3 3 ...
$ TransScore2F : Factor w/ 5 levels "0","1","2","3",..: 2 3 4 5 1 3 3 5 4 4 ...
..- attr(*, "label")= chr "Infectiousness Score"
$ TransScore3 : num 1 1 2 3 0 2 2 3 2 2 ...
$ TransScore3F : Factor w/ 4 levels "0","1","2","3": 2 2 3 4 1 3 3 4 3 3 ...
..- attr(*, "label")= chr "Infectiousness Score"
$ TransScore4 : num 0 2 4 4 0 1 1 4 3 3 ...
$ TransScore4F : Factor w/ 5 levels "0","1","2","3",..: 1 3 5 5 1 2 2 5 4 4 ...
$ ImpactScore : int 7 8 14 12 11 12 8 7 10 7 ...
$ ImpactScore2 : int 6 7 13 11 10 11 7 6 9 6 ...
$ ImpactScore3 : int 3 4 9 7 6 7 3 3 6 4 ...
$ ImpactScoreF : Factor w/ 21 levels "0","1","2","3",..: 8 9 15 13 12 13 9 8 11 8 ...
..- attr(*, "label")= chr "Morbidity Score"
$ ImpactScore2F : Factor w/ 19 levels "0","1","2","3",..: 7 8 14 12 11 12 8 7 10 7 ...
..- attr(*, "label")= chr "Morbidity Score"
$ ImpactScore3F : Factor w/ 15 levels "0","1","2","3",..: 4 5 10 8 7 8 4 4 7 5 ...
..- attr(*, "label")= chr "Morbidity Score"
$ ImpactScoreFD : Factor w/ 17 levels "2","3","4","5",..: 6 7 13 11 10 11 7 6 9 6 ...
$ TotalSymp1 : num 8 11 18 17 11 14 10 12 14 11 ...
$ TotalSymp1F : Factor w/ 19 levels "5","6","7","8",..: 4 7 14 13 7 10 6 8 10 7 ...
$ TotalSymp2 : num 8 10 17 16 11 14 10 11 13 10 ...
$ TotalSymp3 : num 8 9 16 15 11 14 10 10 12 9 ...
Process data as follows:
1. Remove all variables that have Score or Total or FluA or FluB or Dxname or Activity or Unique.Visit
2. Remove all observations with NA
Don’t do this manually one by one; use R commands that remove the columns and rows efficiently.
clean_data = raw_data %>% #create new variable to ensure raw_data is not manipulated
  select(-contains(c("Score","Total","FluA","FluB","Dxname","Activity"))) %>% #select columns to remove
  select(-c('Unique.Visit')) %>% #remove column
  na.omit() #remove all observations with NA
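A short aside on why the pattern "Dxname" still removes the DxName1 to DxName5 columns: contains() ignores case by default (ignore.case = TRUE). The sketch below shows this on a small made-up data frame; `toy` and its values are hypothetical and are not taken from the flu data.

```r
library(dplyr)

# Hypothetical toy data mimicking a few of the column-name patterns
toy <- data.frame(
  DxName1      = c("a", "b", NA),
  TransScore1  = c(1, 2, 3),
  Unique.Visit = c("v1", "v2", "v3"),
  Fatigue      = c("Yes", "No", "Yes"),
  BodyTemp     = c(98.3, NA, 100.8)
)

toy_clean <- toy %>%
  select(-contains(c("Score", "Dxname"))) %>% # "Dxname" matches "DxName1" because case is ignored
  select(-Unique.Visit) %>%                   # drop the ID column by name
  na.omit()                                   # drop the row where BodyTemp is NA

names(toy_clean) # "Fatigue" "BodyTemp"
nrow(toy_clean)  # 2
```

The NA in DxName1 does not cost a row here, because that column is removed before na.omit() runs. The same ordering matters in the real pipeline: dropping the sparse diagnosis columns first keeps far more observations.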
Additional Processing (added for machine learning modeling Module 11)
remove yes/no versions for symptoms with multiple levels
clean_data_update = clean_data %>%
  select(!c(WeaknessYN, CoughYN, MyalgiaYN, CoughYN2))
categorical/ordinal predictors
The data contain both categorical and ordinal predictors. Code the categorical variables as unordered factors and the ordinal ones as ordered factors.
# categorical predictors
unorderedRecipe = recipe(~ SwollenLymphNodes + ChestCongestion + ChillsSweats + NasalCongestion + Sneeze + Fatigue + SubjectiveFever + Headache + RunnyNose + AbPain + ChestPain + Diarrhea + EyePn + Insomnia + ItchyEye + Nausea + EarPn + Pharyngitis + Breathless + ToothPn + Vomit + Wheeze, data = clean_data_update)
unorderedTest = unorderedRecipe %>%
  step_dummy(all_predictors()) %>%
  prep(training = clean_data_update)
categoricalTestData = bake(unorderedTest, new_data = NULL)
#ordinal predictor
#specify the ordinal levels as found in the data
ordinalLevels = c("None", "Mild", "Moderate", "Severe")
clean_data_update = clean_data_update %>%
  mutate(Weakness = ordered(Weakness),
         CoughIntensity = ordered(CoughIntensity),
         Myalgia = ordered(Myalgia))
#create recipe (like categorical above)
ordinalRecipe = recipe(~ Weakness + CoughIntensity + Myalgia, data = clean_data_update)
ordinalTest = ordinalRecipe %>%
  step_ordinalscore(all_predictors()) %>%
  prep(training = clean_data_update)
ordinalTestData <- bake(ordinalTest, new_data = NULL)
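To make the ordered-factor step concrete, here is a minimal base R sketch of what ordered() does to a factor whose levels are already in the intended order. The `severity` vector is hypothetical. My understanding is that step_ordinalscore() by default turns each ordered factor into these integer codes, but treat that as an assumption and check the recipes documentation.

```r
# Hypothetical severity values, not taken from the flu data
severity <- factor(
  c("Mild", "Severe", "None", "Moderate"),
  levels = c("None", "Mild", "Moderate", "Severe")
)

# ordered() keeps the existing level order and marks the factor as ordinal
severity_ord <- ordered(severity)

is.ordered(severity_ord)        # TRUE
severity_ord < "Moderate"       # TRUE FALSE TRUE FALSE
codes <- as.integer(severity_ord)
codes                           # 2 4 1 3
```

Because ordered() reuses the level order already stored in the factor, it is worth confirming with levels() that the order is None, Mild, Moderate, Severe before converting.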
identify and remove binary predictors that have <50 entries in one category
summary(clean_data_update)
SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion Sneeze
No :418 No :323 No :130 No :167 No :339
Yes:312 Yes:407 Yes:600 Yes:563 Yes:391
Fatigue SubjectiveFever Headache Weakness CoughIntensity
No : 64 No :230 No :115 None : 49 None : 47
Yes:666 Yes:500 Yes:615 Mild :223 Mild :154
Moderate:338 Moderate:357
Severe :120 Severe :172
Myalgia RunnyNose AbPain ChestPain Diarrhea EyePn Insomnia
None : 79 No :211 No :639 No :497 No :631 No :617 No :315
Mild :213 Yes:519 Yes: 91 Yes:233 Yes: 99 Yes:113 Yes:415
Moderate:325
Severe :113
ItchyEye Nausea EarPn Hearing Pharyngitis Breathless ToothPn
No :551 No :475 No :568 No :700 No :119 No :436 No :565
Yes:179 Yes:255 Yes:162 Yes: 30 Yes:611 Yes:294 Yes:165
Vision Vomit Wheeze BodyTemp
No :711 No :652 No :510 Min. : 97.20
Yes: 19 Yes: 78 Yes:220 1st Qu.: 98.20
Median : 98.50
Mean : 98.94
3rd Qu.: 99.30
Max. :103.10
# Vision and Hearing have fewer than 50 "Yes" entries
clean_data_update = clean_data_update %>%
  select(!c(Vision, Hearing))
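The sparse binary predictors were picked out above by reading the summary() table. As a rough alternative, the same check can be done programmatically. The sketch below uses a made-up data frame, and the helper `is_sparse_binary()` is a hypothetical name introduced only for this illustration; the threshold of 50 follows the rule stated above.

```r
# Hypothetical toy data with 200 rows; not the flu data
toy <- data.frame(
  Vision   = factor(rep(c("No", "Yes"), times = c(190, 10))),
  Sneeze   = factor(rep(c("No", "Yes"), times = c(90, 110))),
  Weakness = factor(rep(c("None", "Mild", "Moderate", "Severe"), each = 50)),
  BodyTemp = rep(c(98.2, 99.1), times = 100)
)

min_count <- 50 # minimum number of entries required in each category

# TRUE for two-level factors whose rarer level has fewer than min_count entries
is_sparse_binary <- function(x, min_count) {
  is.factor(x) && nlevels(x) == 2 && min(table(x)) < min_count
}

flags <- vapply(toy, is_sparse_binary, logical(1), min_count = min_count)
sparse_cols <- names(toy)[flags]
sparse_cols # "Vision"

# Drop the flagged columns
toy_reduced <- toy[, setdiff(names(toy), sparse_cols), drop = FALSE]
```

Applied to clean_data_update, a check like this should flag the same columns as the manual inspection, and it keeps working if the data change.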
Save RDS File
#save cleaned data
cleandata_location = here("fluanalysis", "data", "cleandata.rds")
saveRDS(clean_data, file = cleandata_location)
# save cleaned data after Module 11
cleandata_location2 = here("fluanalysis", "data", "cleandata2.rds")
saveRDS(clean_data_update, file = cleandata_location2)