Coursera课程 R-Programming Week1 编程练习代码

For this first programming assignment you will write three functions that are meant to interact with dataset that accompanies this assignment. The dataset is contained in a zip file specdata.zip that you can download from the Coursera web site.

这篇文章里我分析并编写了三个作业要求的数据处理函数，完成相应功能：

作业1 完成pollutantmean()函数，统计指定数据均值

数据源：点击这里下载

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables:

这个压缩文件包含了332个独立的csv文件，反应了美国332个地区的硫、氮等污染物的数据信息，按照编号001-332.csv命名。每个文件包含以下三个字段：

Date: the date of the observation in YYYY-MM-DD format (year-month-day)
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

Date: 该数据监测到的时间
sulfate: 硫污染物指数
nitrate: 氮污染物指数

数据格式大概是这样，还算简单吧：

这里写图片描述

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. A prototype of the function is as follows

编写一个叫做pollutantmean的函数，统计出指定文件，指定污染物的平均值。

扫描二维码关注公众号，回复： 3760273 查看本文章

该函数其中包含三个参数:

directory: 数据源文件夹的地址
pollutant: 污染物名字(sulfate 或者 nitrate)
id: 一个向量集，包括所有需要输入的文件名

函数的第一行如下：

pollutantmean <- function(directory, pollutant = "sulfate", id = 1:332) {...}

得到的结果应该是这样的：

# example
pollutantmean("specdata", "sulfate", 1:10)
# [1] 4.064
pollutantmean("specdata", "nitrate", 70:72)
# [2] 1.706
pollutantmean("specdata", "nitrate", 23)
# [3] 1.281

我的思路：

指定目录文件夹的位置，构造目录字符串，读取332个csv文件。
根据id号，依次处理每一个文件中的有效值（无效的NA值需要删掉），新建一个sum向量，将其存入累计的sum向量中
最后计算sum向量中的均值，返回该均值。

代码：

pollutantmean <- function(directory, pollutant = "sulfate", id = 1:332) {
  if(grep("specdata", directory) == 1) {
    directory <- ("./specdata/") # 需要在该目录下放进刚才下载的文件夹
  }
  # 初始化sum向量
  sum_vector <- c()
  # 找到并读取所有csv文件
  all_files <- as.character( list.files(directory) )
  file_paths <- paste(directory, all_files, sep="")

  for(i in id) {
    current_file <- read.csv(file_paths[i], header=T, sep=",")
    head(current_file)
    pollutant
    na_removed <- current_file[!is.na(current_file[, pollutant]), pollutant]
    sum_vector <- c(sum_vector, na_removed)
  }

  result <- mean(sum_vector)
  return(round(result, 3)) 
}

作业2 完成Compelete函数，统计有效数据的个数

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases. A prototype of this function follows

完成Complete()函数，读取指定文件中有效的数据个数，该函数包含两个参数：

directory: 数据源文件夹的地址
id: 一个向量集，包括所有需要输入的文件名

输出一个带有id和nob值的表格，具体格式参考样例：

> complete("specdata", 1)
  id nobs
1  1  117
> complete("specdata", c(2, 4, 8, 10, 12))
  id nobs
1  2 1041
2  4  474
3  8  192
4 10  148
5 12   96
> complete("specdata", 30:25)
  id nobs
1 30  932
2 29  711
3 28  475
4 27  338
5 26  586
6 25  463
> complete("specdata", 3)
  id nobs
1  3  243

思路：

设置文件目录
找到需要统计的所有文件
计算每个文件中完整的数据个数
构造需要输出的表格，返回结果

complete <- function(directory, id = 1:332) {
  # 设置目录
  if(grep("specdata", directory) == 1) {
    directory <- ("specdata/")
  }

  # 得到需要统计的文件个数
  id_len <- length(id)
  complete_data <- rep(0, id_len)

  # 找到需要统计的文件
  all_files <- as.character( list.files(directory) )
  file_paths <- paste(directory, all_files, sep="")
  j <- 1 

  # 计算每个文件中完整的数据个数
  for (i in id) {
    current_file <- read.csv(file_paths[i], header=T, sep=",")
    complete_data[j] <- sum(complete.cases(current_file))
    j <- j + 1
  }
  result <- data.frame(id = id, nobs = complete_data)
  return(result)
}

作业3 完成Corr函数，统计某个阈值以上硫污染物和氮污染物数据的相关性

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows

输入一个阈值，表示在某个数据量各自的文件中，所有有效数据的二者污染物相关性比较。
例如

corr("specdata", 800)
# 读取specdata文件夹下，有效数据量大于800的所有文件
# 在每个文件中单独统计两种污染物的相关系数，并返回一个相关系数值。
# 该条命令下，函数返回一个长度为31的向量，即表示在332个文件中，有31个文件满足数据量的要求。
# 每个向量值代表文件内两列数据的相关系数。

# 返回值：
# -0.01895754 -0.15782860  0.25905718  0.05774168 -0.03309130 
# -0.14620214 -0.04253357  0.09807386  0.03666592  0.34742157  
#  0.23198440  0.27590363  0.08521742  0.04191777  0.42479896  
#  0.36172843 -0.03509034 -0.05300116  0.29579370  0.01653564
# -0.07449532  0.19014198  0.10955732  0.04621127 -0.06080823
#  0.16086505  0.59834333  0.19183481  0.51882017  0.25397808  
#  0.26878050

```R
corr <- function(directory, threshold = 0) {

  if(grep("specdata", directory) == 1) {
    directory <- ("./specdata/")
  }
  #计算整个表格内部的数据
  complete_table <- complete("specdata", 1:332)
  nobs <- complete_table$nobs
  # 过滤无用NA值
  ids <- complete_table$id[nobs > threshold]
  # get the length of ids vector
  # 计算有效数据的个数
  corr_vector <- rep(0, id_len)
  # 遍历所有文件，找到符合的数据
  all_files <- as.character( list.files(directory) )
  file_paths <- paste(directory, all_files, sep="")
  j <- 1
  # 计算相关系数
  for(i in ids) {
    current_file <- read.csv(file_paths[i], header=T, sep=",")
    corr_vector[j] <- cor(current_file$sulfate, current_file$nitrate, use="complete.obs")
    j <- j + 1
  }
  result <- corr_vector
  return(result)   
}