Air Pollution Data Analysis with R functions

Introduction
- Data
Loading packages
Part 1
- Pollutant mean function
- Example outputs
Part 2
- Complete function
- Example outputs
Part 3
- Example Outputs
Session info

Introduction

For this first programming assignment you will write three functions that are meant to interact with the dataset that accompanies this assignment. The dataset is contained in a zip file, specdata.zip, that you can download from the Coursera web site.

Data

The zip file containing the data can be downloaded here:

specdata.zip [2.4MB]

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables:

Date: the date of the observation in YYYY-MM-DD format (year-month-day)
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

For this programming assignment you will need to unzip this file and create the directory ‘specdata’. Once you have unzipped the zip file, do not make any modifications to the files in the ‘specdata’ directory. In each file you’ll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.
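To get a feel for this layout before writing any functions, the sketch below builds a small stand-in file in a temporary directory and counts the missing values per column. The values and the `toy_dir` name are invented for illustration and are not taken from the real data. The `ID` column is an assumption based on the functions later in this document, which group by it.

```r
# Build a tiny stand-in for one monitor file (values are invented, not real data)
toy_dir <- file.path(tempdir(), "toy_specdata")
dir.create(toy_dir, showWarnings = FALSE)

toy <- data.frame(
    Date    = c("2003-01-01", "2003-01-02", "2003-01-03", "2003-01-04"),
    sulfate = c(NA, 3.2, 4.1, NA),
    nitrate = c(NA, 0.9, NA, 1.4),
    ID      = 1L
)
write.csv(toy, file.path(toy_dir, "001.csv"), row.names = FALSE)

# Read it back; read.csv treats the string "NA" as missing by default
monitor <- read.csv(file.path(toy_dir, "001.csv"))

# Number of missing values in each column
na_counts <- colSums(is.na(monitor))
print(na_counts)

# Number of rows where both pollutants were observed
n_complete <- sum(complete.cases(monitor))
print(n_complete)
```

Running the same two checks on a real file such as "001.csv" shows how much of each monitor's record is usable.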

Loading packages

library("data.table")
library(dplyr)

Part 1

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector of monitor ID numbers, ‘pollutantmean’ reads those monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

Pollutant mean function

pollutantmean <- function(directory, pollutant, id = 1:332) {
    
    # Format number with fixed width and then append .csv to number
    fileNames <- paste0(directory, '/', formatC(id, width=3, flag="0"), ".csv" )
    
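    # Reading in all files and making a large data.table
    dt <- lapply(fileNames, data.table::fread) %>%
        rbindlist()

    # Pooled mean of the chosen pollutant over all rows, ignoring NA;
    # the result is a one-row data frame rather than a bare number
    dt %>% summarise_at(c(pollutant), mean, na.rm=TRUE)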
}
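Two details of this function are easy to miss: the zero-padded file names, and the fact that the mean is pooled over all rows rather than averaged monitor by monitor. The base R sketch below illustrates both on invented numbers; `monitor_a` and `monitor_b` are hypothetical vectors, not real monitor data.

```r
# Zero-pad monitor IDs to three digits so that 1 maps to "001.csv"
ids <- c(1L, 23L, 200L)
padded <- formatC(ids, width = 3, flag = "0")
print(padded)

# Two invented monitors with different numbers of observed values
monitor_a <- c(2, 4, NA)        # two observed values
monitor_b <- c(10, NA, NA, NA)  # one observed value

# Pooled mean: every observed value counts once, (2 + 4 + 10) / 3
pooled <- mean(c(monitor_a, monitor_b), na.rm = TRUE)

# Mean of per-monitor means: each monitor counts once, (3 + 10) / 2
mean_of_means <- mean(c(mean(monitor_a, na.rm = TRUE),
                        mean(monitor_b, na.rm = TRUE)))

print(c(pooled = pooled, mean_of_means = mean_of_means))
```

The two numbers differ whenever monitors contribute unequal numbers of observations. Binding all rows before averaging, as the function does, gives the pooled value.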

Example outputs

pollutantmean("specdata", "sulfate", 1:10)

##    sulfate
## 1 4.064128

pollutantmean("specdata", "nitrate", 70:72)

##    nitrate
## 1 1.706047

pollutantmean("specdata", "nitrate", 23)

##    nitrate
## 1 1.280833

Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases. One implementation of this function follows.

Complete function

complete <- function(directory, id=1:332) {
    
    # Format number with fixed width and then append .csv to number
    fileNames <- paste0(directory, '/', formatC(id, width=3, flag="0"), ".csv" )

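    # Reading in all files and making a large data.table
    df <- lapply(fileNames, data.table::fread) %>%
        rbindlist()

    # Keep only rows with no missing values, then count them per monitor.
    # The result is sorted by ID, and monitors with no complete rows are omitted.
    df %>%
        filter(complete.cases(df)) %>%
        group_by(ID) %>%
        summarise(nobs=n(), .groups="drop")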
}

Example outputs

complete("specdata", 1)

## # A tibble: 1 x 2
##      ID  nobs
##   <int> <int>
## 1     1   117

complete("specdata", c(2, 4, 8, 10, 12))

## # A tibble: 5 x 2
##      ID  nobs
##   <int> <int>
## 1     2  1041
## 2     4   474
## 3     8   192
## 4    10   148
## 5    12    96

complete("specdata", 30:25)

## # A tibble: 6 x 2
##      ID  nobs
##   <int> <int>
## 1    25   463
## 2    26   586
## 3    27   338
## 4    28   475
## 5    29   711
## 6    30   932

complete("specdata", 3)

## # A tibble: 1 x 2
##      ID  nobs
##   <int> <int>
## 1     3   243
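Because the dplyr version groups by `ID`, its results come back sorted by monitor ID (compare the `30:25` call above), and a monitor with no complete rows would not appear at all. If the order of `id` should be preserved and zero counts reported, a base R variant could look like the sketch below. `complete_ordered` and `write_toy` are hypothetical names, and the toy files are invented.

```r
# Hypothetical helper: count complete cases per monitor, keeping the order of `id`
complete_ordered <- function(directory, id = 1:332) {
    nobs <- vapply(id, function(i) {
        path <- file.path(directory, sprintf("%03d.csv", i))
        sum(complete.cases(read.csv(path)))
    }, integer(1))
    data.frame(id = id, nobs = nobs)
}

# Invented toy files standing in for the real monitor data
toy_dir <- file.path(tempdir(), "toy_complete")
dir.create(toy_dir, showWarnings = FALSE)

write_toy <- function(i, sulfate, nitrate) {
    df <- data.frame(
        Date = format(as.Date("2003-01-01") + seq_along(sulfate) - 1),
        sulfate = sulfate,
        nitrate = nitrate,
        ID = i
    )
    write.csv(df, file.path(toy_dir, sprintf("%03d.csv", i)), row.names = FALSE)
}

write_toy(1L, c(1.0, NA, 2.0), c(0.5, 0.4, 0.3))   # 2 complete rows
write_toy(2L, c(NA, NA), c(0.1, NA))               # 0 complete rows
write_toy(3L, c(3.0, 4.0, 5.0), c(1.0, 2.0, 3.0))  # 3 complete rows

# Rows come back in the requested order 3, 1, 2, including the zero count
res <- complete_ordered(toy_dir, c(3, 1, 2))
print(res)
```

Reading one file at a time also avoids holding every monitor in memory at once, at the cost of losing the single grouped summary.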

Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. One implementation of this function follows.

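corr <- function(directory, threshold=0) {
    # Read every CSV file in the directory; 'pattern' is a regular expression,
    # so the dot is escaped and the match is anchored to the end of the name
    lst <- lapply(list.files(path=directory, pattern="\\.csv$", full.names=TRUE), data.table::fread)

    # bind all files by rows
    dt <- lst %>%
        rbindlist()

    # Keep complete rows, drop monitors at or below the threshold,
    # then compute one sulfate/nitrate correlation per remaining monitor
    dt %>%
        filter(complete.cases(dt)) %>%
        group_by(ID) %>%
        mutate(nobs=n()) %>%
        filter(nobs > threshold) %>%
        summarise(corr = cor(x=sulfate, y=nitrate), .groups="drop") %>%
        select(corr) %>%
        as.matrix() %>%
        c()  # flatten the one-column matrix into a plain vector
}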

Example Outputs

cr <- corr("specdata", 150)
head(cr)

## [1] -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667 -0.07588814

summary(cr)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.21057 -0.04999  0.09463  0.12525  0.26844  0.76313

cr <- corr("specdata", 400)
head(cr)

## [1] -0.01895754 -0.04389737 -0.06815956 -0.07588814  0.76312884 -0.15782860

summary(cr)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.17623 -0.03109  0.10021  0.13969  0.26849  0.76313

cr <- corr("specdata", 5000)
summary(cr)

##    Mode 
## logical

length(cr)

## [1] 0

cr <- corr("specdata")
summary(cr)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000 -0.05282  0.10718  0.13684  0.27831  1.00000

length(cr)

## [1] 323
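In the `corr("specdata", 5000)` call above, `summary(cr)` reports a logical mode, so the empty result is not strictly the numeric vector the task asks for. One way to guarantee a numeric return type is `vapply` with a `numeric(1)` template, sketched below in base R. `corr_numeric` and `write_toy` are hypothetical names, the toy files are invented, and this is an alternative illustration rather than the method used in this document.

```r
# Hypothetical variant: always returns a numeric vector, even when empty
corr_numeric <- function(directory, threshold = 0) {
    paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)

    # Read each monitor and keep only its complete rows
    monitors <- lapply(paths, function(p) {
        df <- read.csv(p)
        df[complete.cases(df), ]
    })

    # Keep monitors whose complete-case count exceeds the threshold
    keep <- vapply(monitors, nrow, integer(1)) > threshold

    # numeric(1) makes vapply return numeric(0) when nothing is kept
    vapply(monitors[keep], function(df) cor(df$sulfate, df$nitrate), numeric(1))
}

# Invented toy files standing in for the real monitor data
toy_dir <- file.path(tempdir(), "toy_corr")
dir.create(toy_dir, showWarnings = FALSE)

write_toy <- function(i, sulfate, nitrate) {
    df <- data.frame(sulfate = sulfate, nitrate = nitrate, ID = i)
    write.csv(df, file.path(toy_dir, sprintf("%03d.csv", i)), row.names = FALSE)
}

write_toy(1L, c(1, 2, 3, NA), c(2, 4, 6, 1))  # 3 complete rows, rising together
write_toy(2L, c(1, 2, 3, 4), c(4, 3, 2, 1))   # 4 complete rows, moving oppositely
write_toy(3L, c(1, NA), c(NA, 2))             # 0 complete rows

cr <- corr_numeric(toy_dir, 2)        # monitors 1 and 2 pass the threshold
cr_none <- corr_numeric(toy_dir, 10)  # no monitor passes the threshold

print(cr)
print(is.numeric(cr_none))
print(length(cr_none))
```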

Session info

sessionInfo()

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS  10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.0.2       data.table_1.13.0
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.30       magrittr_2.0.1   tidyselect_1.1.0 R6_2.5.0        
##  [5] rlang_0.4.8      fansi_0.4.1      stringr_1.4.0    tools_4.0.2     
##  [9] xfun_0.19        utf8_1.1.4       cli_2.2.0        htmltools_0.5.0 
## [13] ellipsis_0.3.1   yaml_2.2.1       digest_0.6.27    assertthat_0.2.1
## [17] tibble_3.0.4     lifecycle_0.2.0  crayon_1.3.4     purrr_0.3.4     
## [21] vctrs_0.3.5      glue_1.4.2       evaluate_0.14    rmarkdown_2.5   
## [25] stringi_1.5.3    compiler_4.0.2   pillar_1.4.7     generics_0.1.0  
## [29] pkgconfig_2.0.3

Author: Benedict Neo