Flu Analysis - Wrangling

Data Loading and Processing

Load Libraries
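
The loading code itself is not shown on this page. Judging only from the startup messages that follow (tidyverse, `here()`, tidymodels), it was probably close to this sketch; the exact calls and their order are an assumption.

```r
# Inferred from the attach messages (assumption), not copied from the source
library(tidyverse)  # dplyr, tidyr, readr, ... for data wrangling
library(here)       # builds file paths relative to the project root
library(tidymodels) # recipes and related modeling packages
```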

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
here() starts at /Users/deannalanier/Desktop/All_Classes_UGA/2023Spr_Classes/MADA/deannalanier-MADA-portfolio

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.4     ✔ yardstick    1.1.0
✔ recipes      1.0.5     

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org

Load the data

raw_data = readRDS(here("fluanalysis", "data", "SympAct_Any_Pos.Rda")) #load RDS file

Data Contents

str(raw_data) #inspect the structure to check that the data loaded completely
'data.frame':   735 obs. of  63 variables:
 $ DxName1          : Factor w/ 92 levels "Acute bronchitis, unspecified",..: 57 16 57 57 9 57 39 17 57 57 ...
 $ DxName2          : Factor w/ 142 levels "14 weeks gestation of pregnancy",..: NA 69 11 129 69 NA 69 60 12 NA ...
 $ DxName3          : Factor w/ 92 levels "Abnormal weight loss",..: NA NA NA NA NA NA NA NA 38 NA ...
 $ DxName4          : Factor w/ 41 levels "Acute bronchitis, unspecified",..: NA NA NA NA NA NA NA NA 30 NA ...
 $ DxName5          : Factor w/ 5 levels "Acute suppurative otitis media without spontaneous rupture of ear drum, right ear",..: NA NA NA NA NA NA NA NA 3 NA ...
 $ Unique.Visit     : chr  "340_17632125" "340_17794836" "342_17737773" "342_17806002" ...
 $ ActivityLevel    : int  10 6 2 2 5 3 4 0 0 5 ...
  ..- attr(*, "label")= chr "Activity Level"
 $ ActivityLevelF   : Factor w/ 11 levels "0","1","2","3",..: 11 7 3 3 6 4 5 1 1 6 ...
 $ SwollenLymphNodes: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 2 1 ...
  ..- attr(*, "label")= chr "Swollen Lymph Nodes"
 $ ChestCongestion  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
  ..- attr(*, "label")= chr "Chest Congestion"
 $ ChillsSweats     : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 2 2 2 1 ...
  ..- attr(*, "label")= chr "Chills/Sweats"
 $ NasalCongestion  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
  ..- attr(*, "label")= chr "Nasal Congestion"
 $ CoughYN          : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 2 2 2 ...
  ..- attr(*, "label")= chr "Cough"
 $ Sneeze           : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2 1 2 1 1 ...
  ..- attr(*, "label")= chr "Sneeze"
 $ Fatigue          : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "label")= chr "Fatigue"
 $ SubjectiveFever  : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
  ..- attr(*, "label")= chr "Subjective Fever"
 $ Headache         : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
  ..- attr(*, "label")= chr "Headache"
 $ Weakness         : Factor w/ 4 levels "None","Mild",..: 2 4 4 4 3 3 2 4 3 3 ...
 $ WeaknessYN       : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "label")= chr "Weakness"
 $ CoughIntensity   : Factor w/ 4 levels "None","Mild",..: 4 4 2 3 1 3 4 3 3 3 ...
  ..- attr(*, "label")= chr "Cough Severity"
 $ CoughYN2         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 2 ...
 $ Myalgia          : Factor w/ 4 levels "None","Mild",..: 2 4 4 4 2 3 2 4 3 2 ...
 $ MyalgiaYN        : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "label")= chr "Myalgia"
 $ RunnyNose        : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 2 2 2 2 ...
  ..- attr(*, "label")= chr "Runny Nose"
 $ AbPain           : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Abdominal Pain"
 $ ChestPain        : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 2 1 1 1 ...
  ..- attr(*, "label")= chr "Chest Pain"
 $ Diarrhea         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
 $ EyePn            : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Eye Pain"
 $ Insomnia         : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 1 2 2 2 ...
  ..- attr(*, "label")= chr "Sleeplessness"
 $ ItchyEye         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Itchy Eyes"
 $ Nausea           : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 1 1 2 2 ...
 $ EarPn            : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Ear Pain"
 $ Hearing          : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Loss of Hearing"
 $ Pharyngitis      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 1 1 ...
  ..- attr(*, "label")= chr "Sore Throat"
 $ Breathless       : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 2 ...
  ..- attr(*, "label")= chr "Breathlessness"
 $ ToothPn          : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 2 1 ...
  ..- attr(*, "label")= chr "Tooth Pain"
 $ Vision           : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Blurred Vision"
 $ Vomit            : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
  ..- attr(*, "label")= chr "Vomiting"
 $ Wheeze           : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 1 1 ...
  ..- attr(*, "label")= chr "Wheezing"
 $ BodyTemp         : num  98.3 100.4 100.8 98.8 100.5 ...
 $ RapidFluA        : Factor w/ 2 levels "Positive for Influenza A",..: 2 NA 2 2 NA NA NA 1 2 2 ...
 $ RapidFluB        : Factor w/ 2 levels "Positive for Influenza B",..: 2 NA 2 2 NA NA NA 2 2 2 ...
 $ PCRFluA          : Factor w/ 4 levels " Influenza A Detected",..: NA NA NA NA NA NA 2 NA NA NA ...
 $ PCRFluB          : Factor w/ 3 levels " Influenza B Detected",..: NA NA NA NA NA NA 2 NA NA NA ...
 $ TransScore1      : num  1 3 4 5 0 2 2 5 4 4 ...
 $ TransScore1F     : Factor w/ 6 levels "0","1","2","3",..: 2 4 5 6 1 3 3 6 5 5 ...
  ..- attr(*, "label")= chr "Infectiousness Score"
 $ TransScore2      : num  1 2 3 4 0 2 2 4 3 3 ...
 $ TransScore2F     : Factor w/ 5 levels "0","1","2","3",..: 2 3 4 5 1 3 3 5 4 4 ...
  ..- attr(*, "label")= chr "Infectiousness Score"
 $ TransScore3      : num  1 1 2 3 0 2 2 3 2 2 ...
 $ TransScore3F     : Factor w/ 4 levels "0","1","2","3": 2 2 3 4 1 3 3 4 3 3 ...
  ..- attr(*, "label")= chr "Infectiousness Score"
 $ TransScore4      : num  0 2 4 4 0 1 1 4 3 3 ...
 $ TransScore4F     : Factor w/ 5 levels "0","1","2","3",..: 1 3 5 5 1 2 2 5 4 4 ...
 $ ImpactScore      : int  7 8 14 12 11 12 8 7 10 7 ...
 $ ImpactScore2     : int  6 7 13 11 10 11 7 6 9 6 ...
 $ ImpactScore3     : int  3 4 9 7 6 7 3 3 6 4 ...
 $ ImpactScoreF     : Factor w/ 21 levels "0","1","2","3",..: 8 9 15 13 12 13 9 8 11 8 ...
  ..- attr(*, "label")= chr "Morbidity Score"
 $ ImpactScore2F    : Factor w/ 19 levels "0","1","2","3",..: 7 8 14 12 11 12 8 7 10 7 ...
  ..- attr(*, "label")= chr "Morbidity Score"
 $ ImpactScore3F    : Factor w/ 15 levels "0","1","2","3",..: 4 5 10 8 7 8 4 4 7 5 ...
  ..- attr(*, "label")= chr "Morbidity Score"
 $ ImpactScoreFD    : Factor w/ 17 levels "2","3","4","5",..: 6 7 13 11 10 11 7 6 9 6 ...
 $ TotalSymp1       : num  8 11 18 17 11 14 10 12 14 11 ...
 $ TotalSymp1F      : Factor w/ 19 levels "5","6","7","8",..: 4 7 14 13 7 10 6 8 10 7 ...
 $ TotalSymp2       : num  8 10 17 16 11 14 10 11 13 10 ...
 $ TotalSymp3       : num  8 9 16 15 11 14 10 10 12 9 ...

Process data as follows:

1. Remove all variables whose names contain Score, Total, FluA, FluB, Dxname, or Activity, as well as Unique.Visit

2. Remove all observations with NA

Don’t do this manually one by one; use R commands that remove things efficiently.

clean_data = raw_data %>% #assign to a new object so raw_data stays unmodified
  select(-contains(c("Score","Total","FluA","FluB","Dxname","Activity"))) %>% #drop columns whose names contain these strings (matching ignores case)
  select(-c('Unique.Visit')) %>% #remove column
  na.omit() #remove all observations with NA
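
Two details of this step are easy to miss: `contains()` ignores case by default, which is why "Dxname" still drops the `DxName1` to `DxName5` columns, and the columns are removed before `na.omit()`, so rows are not lost because of missing values in variables that are being discarded anyway. A minimal sketch on a made-up data frame (the `toy` object and its values are illustrative assumptions, not the study data):

```r
library(dplyr)

# Hypothetical toy data; column names mimic the real ones, values are invented
toy = data.frame(
  DxName1      = c("a", "b", "c"),
  Unique.Visit = c("1_1", "2_2", "3_3"),
  ImpactScore  = c(7L, 8L, 14L),
  RapidFluA    = c("Neg", NA, "Neg"),
  Sneeze       = c("No", "Yes", NA),
  BodyTemp     = c(98.3, 100.4, 100.8)
)

toy_clean = toy %>%
  # contains() ignores case by default, so "Dxname" matches "DxName1"
  select(-contains(c("Score", "Total", "FluA", "FluB", "Dxname", "Activity"))) %>%
  select(-Unique.Visit) %>%
  na.omit() # only the row with a missing Sneeze value is dropped

names(toy_clean)
nrow(toy_clean)
```

Here the second row survives even though `RapidFluA` is missing, because that column is gone by the time `na.omit()` runs.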

Additional Processing (added for machine learning modeling Module 11)

Remove the yes/no versions of symptoms that are also recorded with multiple levels

clean_data_update =
  clean_data %>%
  select(!c(WeaknessYN, CoughYN, MyalgiaYN, CoughYN2))

Categorical/ordinal predictors

The data contain both categorical and ordinal predictors. Code the categorical variables as unordered factors and the ordinal ones as ordered factors.

# categorical predictors
unorderedRecipe = recipe(~ SwollenLymphNodes + ChestCongestion + ChillsSweats + NasalCongestion + Sneeze + Fatigue + SubjectiveFever + Headache + RunnyNose + AbPain + ChestPain + Diarrhea + EyePn + Insomnia + ItchyEye + Nausea + EarPn + Pharyngitis + Breathless + ToothPn + Vomit + Wheeze, data = clean_data_update)

unorderedTest = unorderedRecipe %>%
  step_dummy(all_predictors()) %>%
  prep(training = clean_data_update)

categoricalTestData = bake(unorderedTest, new_data = NULL)
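
# ordinal predictors

#specify the ordinal levels as found in the data
ordinalLevels = c("None", "Mild", "Moderate", "Severe")
#apply the levels explicitly so the ordering does not depend on how the factors were stored
clean_data_update = clean_data_update %>%
  mutate(Weakness = ordered(Weakness, levels = ordinalLevels),
         CoughIntensity = ordered(CoughIntensity, levels = ordinalLevels),
         Myalgia = ordered(Myalgia, levels = ordinalLevels))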

#create recipe (like categorical above)
ordinalRecipe = recipe(~ Weakness + CoughIntensity + Myalgia, data = clean_data_update)
ordinalTest = ordinalRecipe %>%
  step_ordinalscore(all_predictors()) %>%
  prep(training = clean_data_update)
ordinalTestData = bake(ordinalTest, new_data = NULL)
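
To make the effect of the two recipe steps concrete, here is a small sketch on invented data (the `toy` data frame is an assumption for illustration). It shows `step_dummy()` turning a No/Yes factor into a 0/1 indicator column and `step_ordinalscore()` turning an ordered factor into numeric scores, using the default conversion.

```r
library(dplyr)
library(recipes)

# Hypothetical toy data (assumption): one binary symptom, one ordinal symptom
toy = data.frame(
  Sneeze   = factor(c("No", "Yes", "Yes", "No")),
  Weakness = ordered(c("None", "Mild", "Moderate", "Severe"),
                     levels = c("None", "Mild", "Moderate", "Severe"))
)

toy_baked = recipe(~ Sneeze + Weakness, data = toy) %>%
  step_dummy(Sneeze) %>%          # No/Yes becomes a 0/1 column named Sneeze_Yes
  step_ordinalscore(Weakness) %>% # ordered levels become the scores 1 to 4
  prep(training = toy) %>%
  bake(new_data = NULL)

toy_baked
```

`step_ordinalscore()` only accepts ordered factors, which is why the three severity variables are converted with `ordered()` before the recipe is built.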

Identify and remove binary predictors that have fewer than 50 entries in one category

summary(clean_data_update)
 SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion Sneeze   
 No :418           No :323         No :130      No :167         No :339  
 Yes:312           Yes:407         Yes:600      Yes:563         Yes:391  
                                                                         
                                                                         
                                                                         
                                                                         
 Fatigue   SubjectiveFever Headache      Weakness    CoughIntensity
 No : 64   No :230         No :115   None    : 49   None    : 47   
 Yes:666   Yes:500         Yes:615   Mild    :223   Mild    :154   
                                     Moderate:338   Moderate:357   
                                     Severe  :120   Severe  :172   
                                                                   
                                                                   
     Myalgia    RunnyNose AbPain    ChestPain Diarrhea  EyePn     Insomnia 
 None    : 79   No :211   No :639   No :497   No :631   No :617   No :315  
 Mild    :213   Yes:519   Yes: 91   Yes:233   Yes: 99   Yes:113   Yes:415  
 Moderate:325                                                              
 Severe  :113                                                              
                                                                           
                                                                           
 ItchyEye  Nausea    EarPn     Hearing   Pharyngitis Breathless ToothPn  
 No :551   No :475   No :568   No :700   No :119     No :436    No :565  
 Yes:179   Yes:255   Yes:162   Yes: 30   Yes:611     Yes:294    Yes:165  
                                                                         
                                                                         
                                                                         
                                                                         
 Vision    Vomit     Wheeze       BodyTemp     
 No :711   No :652   No :510   Min.   : 97.20  
 Yes: 19   Yes: 78   Yes:220   1st Qu.: 98.20  
                               Median : 98.50  
                               Mean   : 98.94  
                               3rd Qu.: 99.30  
                               Max.   :103.10  
# Vision and Hearing each have fewer than 50 "Yes" entries

clean_data_update = 
  clean_data_update %>%
  select(!c(Vision, Hearing))
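
The two variables above were picked out by reading the `summary()` output. As an alternative, the same check could be done programmatically; the following is a sketch under assumed toy data (`toy`, `rare_binary`, and `toy_reduced` are hypothetical names), not the approach used in this analysis.

```r
library(dplyr)

# Hypothetical toy data (assumption): 100 rows with made-up symptom counts
toy = data.frame(
  Vision   = factor(rep(c("No", "Yes"), times = c(95, 5))),
  Sneeze   = factor(rep(c("No", "Yes"), times = c(50, 50))),
  Weakness = factor(rep(c("None", "Mild", "Moderate", "Severe"), each = 25)),
  BodyTemp = rep(98.6, 100)
)

# Names of two-level factors whose rarer level has fewer than 50 rows
rare_binary = toy %>%
  select(where(~ is.factor(.x) && nlevels(.x) == 2 && min(table(.x)) < 50)) %>%
  names()

# Drop the flagged columns by name
toy_reduced = toy %>%
  select(-all_of(rare_binary))

rare_binary
names(toy_reduced)
```

The `nlevels(.x) == 2` condition matters for the real data: Weakness and CoughIntensity each have a level with fewer than 50 entries, but they are ordinal rather than binary and should be kept.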

Save RDS File

#save cleaned data
cleandata_location = here("fluanalysis", "data", "cleandata.rds")
saveRDS(clean_data, file = cleandata_location)
# save cleaned data after Module 11
cleandata_location2 = here("fluanalysis", "data", "cleandata2.rds")
saveRDS(clean_data_update, file = cleandata_location2)
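
One reason RDS suits this hand-off is that it keeps R-specific attributes such as factor levels and their ordering, which a plain CSV would lose. A quick round-trip check could look like the sketch below; it writes to a temporary file rather than the project paths above, and the `toy` data frame is a stand-in assumption.

```r
# Hypothetical stand-in for the cleaned data (assumption)
toy = data.frame(
  Weakness = ordered(c("Mild", "Severe"),
                     levels = c("None", "Mild", "Moderate", "Severe")),
  BodyTemp = c(98.3, 100.4)
)

toy_path = tempfile(fileext = ".rds") # temporary file, not the project path
saveRDS(toy, file = toy_path)
toy_back = readRDS(toy_path)

# The object read back should match the original, ordered factor included
stopifnot(identical(toy, toy_back))
stopifnot(is.ordered(toy_back$Weakness))
```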