Data Analysis Exercise

Data Information

This data set contains the deaths involving COVID-19, pneumonia, and influenza reported to NCHS by sex, age group, and jurisdiction

Data Source

This data was obtained from the Centers for Disease Control and Prevention (CDC)’s National Center for Health Statistics (NCHS).

Data Contents

This dataset contains 116,000 rows and 16 columns.

Data Loading and Processing

Load Libraries

Loading required package: ggplot2
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
✔ purrr   1.0.1      
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: 'gridExtra'


The following object is masked from 'package:dplyr':

    combine


Loading required package: timechange


Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Load Data

Raw_Data <- read_csv("dataanalysis-exercise/data/Provisional_COVID-19_Deaths_by_Sex_and_Age.csv")
Rows: 115668 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Data As Of, Start Date, End Date, Group, State, Sex, Age Group, Foo...
dbl (8): Year, Month, COVID-19 Deaths, Total Deaths, Pneumonia Deaths, Pneum...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Contents

str(Raw_Data)
spc_tbl_ [115,668 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Data As Of                              : chr [1:115668] "01/25/2023" "01/25/2023" "01/25/2023" "01/25/2023" ...
 $ Start Date                              : chr [1:115668] "01/01/2020" "01/01/2020" "01/01/2020" "01/01/2020" ...
 $ End Date                                : chr [1:115668] "01/21/2023" "01/21/2023" "01/21/2023" "01/21/2023" ...
 $ Group                                   : chr [1:115668] "By Total" "By Total" "By Total" "By Total" ...
 $ Year                                    : num [1:115668] NA NA NA NA NA NA NA NA NA NA ...
 $ Month                                   : num [1:115668] NA NA NA NA NA NA NA NA NA NA ...
 $ State                                   : chr [1:115668] "United States" "United States" "United States" "United States" ...
 $ Sex                                     : chr [1:115668] "All Sexes" "All Sexes" "All Sexes" "All Sexes" ...
 $ Age Group                               : chr [1:115668] "All Ages" "Under 1 year" "0-17 years" "1-4 years" ...
 $ COVID-19 Deaths                         : num [1:115668] 1098594 413 1446 227 444 ...
 $ Total Deaths                            : num [1:115668] 10215767 59245 105422 11382 17721 ...
 $ Pneumonia Deaths                        : num [1:115668] 1036445 798 2258 496 643 ...
 $ Pneumonia and COVID-19 Deaths           : num [1:115668] 551596 68 367 54 129 ...
 $ Influenza Deaths                        : num [1:115668] 19209 49 367 116 165 ...
 $ Pneumonia, Influenza, or COVID-19 Deaths: num [1:115668] 1600437 1190 3688 780 1115 ...
 $ Footnote                                : chr [1:115668] NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   `Data As Of` = col_character(),
  ..   `Start Date` = col_character(),
  ..   `End Date` = col_character(),
  ..   Group = col_character(),
  ..   Year = col_double(),
  ..   Month = col_double(),
  ..   State = col_character(),
  ..   Sex = col_character(),
  ..   `Age Group` = col_character(),
  ..   `COVID-19 Deaths` = col_double(),
  ..   `Total Deaths` = col_double(),
  ..   `Pneumonia Deaths` = col_double(),
  ..   `Pneumonia and COVID-19 Deaths` = col_double(),
  ..   `Influenza Deaths` = col_double(),
  ..   `Pneumonia, Influenza, or COVID-19 Deaths` = col_double(),
  ..   Footnote = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Cleaning Data of interest

Remove special characters from column names

# Save copy of the raw data to clean
data = Raw_Data

#remove special characters
names(data) = gsub(" ", "_",
           gsub("-", "_",
           gsub(",", "", names(data))))

Change data types where necessary.
character -> date
numeric -> character

data = data %>% 
  mutate_at(
    vars('Data_As_Of', 'Start_Date','End_Date'), 
    as_date,
    format = "%m-%d-%y"
    )%>% mutate(Year = as.character(Year))

Determine what data to keep

dfm = data %>% pivot_longer(cols=c('COVID_19_Deaths','Pneumonia_Deaths','Influenza_Deaths'),
                    names_to="variable",
                    values_to="value")
year_plot = ggplot(dfm,aes(x = Year,y = value)) + 
    geom_bar(aes(fill = variable),stat = "identity",position = "dodge")+theme(legend.position = "bottom",legend.key.width= unit(0.1, 'cm'), legend.title= element_blank(),axis.text.x=element_text(size=9))

age_plot = ggplot(dfm,aes(x = Age_Group,y = value)) + 
    geom_bar(aes(fill = variable),stat = "identity",position = "dodge")+theme(legend.position="none",axis.text.x = element_text(size = 10, angle=60, hjust=1))


year_plot
Warning: Removed 91149 rows containing missing values (`geom_bar()`).

age_plot
Warning: Removed 91149 rows containing missing values (`geom_bar()`).

According to the footnotes column, there are multiple data points that had been suppressed due to NCHS confidentiality standards. Separated by state and/or age have many points that have been suppressed. The best option is to subset the data by year, and keep only rows containing all ages, and states.

Subset to keep
collected in 2020 and 2021

clean_data = data %>% filter(State == "United States",Age_Group=="All Ages",Year == 2020 | Year == 2021, Sex == "All Sexes")%>%
  select(Start_Date, End_Date, COVID_19_Deaths,Pneumonia_Deaths,Influenza_Deaths, Group, Sex, Year)
clean_data
# A tibble: 26 × 8
   Start_Date End_Date   COVID_19_Deaths Pneumonia_D…¹ Influ…² Group Sex   Year 
   <date>     <date>               <dbl>         <dbl>   <dbl> <chr> <chr> <chr>
 1 2020-01-01 2020-12-31          385666        352010    8787 By Y… All … 2020 
 2 2021-01-01 2021-12-31          463199        412020    1092 By Y… All … 2021 
 3 2020-01-01 2020-01-31               6         17909    2125 By M… All … 2020 
 4 2020-02-01 2020-02-29              25         15740    2373 By M… All … 2020 
 5 2020-03-01 2020-03-31            7174         22479    2437 By M… All … 2020 
 6 2020-04-01 2020-04-30           65550         46427    1237 By M… All … 2020 
 7 2020-05-01 2020-05-31           38329         29010     126 By M… All … 2020 
 8 2020-06-01 2020-06-30           18026         19294      40 By M… All … 2020 
 9 2020-07-01 2020-07-31           31135         27121      50 By M… All … 2020 
10 2020-08-01 2020-08-31           29911         27358      43 By M… All … 2020 
# … with 16 more rows, and abbreviated variable names ¹​Pneumonia_Deaths,
#   ²​Influenza_Deaths

Save Data and Summary

str_data = str(clean_data)
tibble [26 × 8] (S3: tbl_df/tbl/data.frame)
 $ Start_Date      : Date[1:26], format: "2020-01-01" "2021-01-01" ...
 $ End_Date        : Date[1:26], format: "2020-12-31" "2021-12-31" ...
 $ COVID_19_Deaths : num [1:26] 385666 463199 6 25 7174 ...
 $ Pneumonia_Deaths: num [1:26] 352010 412020 17909 15740 22479 ...
 $ Influenza_Deaths: num [1:26] 8787 1092 2125 2373 2437 ...
 $ Group           : chr [1:26] "By Year" "By Year" "By Month" "By Month" ...
 $ Sex             : chr [1:26] "All Sexes" "All Sexes" "All Sexes" "All Sexes" ...
 $ Year            : chr [1:26] "2020" "2021" "2020" "2020" ...
summary_data = summary(clean_data)

summary_data
   Start_Date            End_Date          COVID_19_Deaths  Pneumonia_Deaths
 Min.   :2020-01-01   Min.   :2020-01-31   Min.   :     6   Min.   : 15624  
 1st Qu.:2020-06-08   1st Qu.:2020-08-07   1st Qu.: 18220   1st Qu.: 21174  
 Median :2020-12-16   Median :2021-01-15   Median : 31732   Median : 28184  
 Mean   :2020-12-03   Mean   :2021-01-27   Mean   : 65297   Mean   : 58772  
 3rd Qu.:2021-05-24   3rd Qu.:2021-07-23   3rd Qu.: 52139   3rd Qu.: 41722  
 Max.   :2021-12-01   Max.   :2021-12-31   Max.   :463199   Max.   :412020  
 Influenza_Deaths     Group               Sex                Year          
 Min.   :  28.00   Length:26          Length:26          Length:26         
 1st Qu.:  47.75   Class :character   Class :character   Class :character  
 Median :  81.50   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 759.92                                                           
 3rd Qu.: 303.75                                                           
 Max.   :8787.00                                                           

Save RDS File

saveRDS(clean_data, file = "dataanalysis-exercise/output/clean_data.rds")
saveRDS(summary_data, file = "dataanalysis-exercise/output/data_summary.rds")

#This section was added by Jacob———————————-

Examining the relationship between COVID-19 Deaths and Pneumonia Deaths

COVID_Deaths = clean_data$COVID_19_Deaths
Pneumonia_Deaths = clean_data$Pneumonia_Deaths
plot(COVID_Deaths, Pneumonia_Deaths)

It appears there a really strong postive correlation between Covid 19 deaths and pneumonia death. Lets calculate the correlation coefficient to see if it agrees with the scatter plot above.

cor(COVID_Deaths, Pneumonia_Deaths)
[1] 0.9944441

Given the correlation coefficient of 0.99 we can say that covid-19 deaths and pneumonia deaths have a strong postive correlation.