Data analysis with R

1. Sorting

sort(x, decreasing = FALSE): returns the sorted data
order(x, decreasing = FALSE): returns the indices that would sort the data

Example:

v = c(2, 9, 1, 45, -3, 19, -5, 6)

sort(v) # returns v sorted in increasing order (the default)
Result:
# [1] -5 -3 1 2 6 9 19 45
sort(v, decreasing = FALSE) # same as the default: increasing order
Result:
# [1] -5 -3 1 2 6 9 19 45

order(v) # returns the indices of v in sorted order
Result:
# [1] 7 5 3 1 8 2 6 4
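
The link between sort() and order() is easy to miss: indexing a vector with the result of order() gives the same values as sort(). A minimal sketch using the same v (the names idx and sorted_by_index are only illustrative):

```r
v <- c(2, 9, 1, 45, -3, 19, -5, 6)

idx <- order(v)            # positions of the elements, smallest value first
sorted_by_index <- v[idx]  # indexing with those positions sorts the vector

identical(sorted_by_index, sort(v))                                    # TRUE
identical(v[order(v, decreasing = TRUE)], sort(v, decreasing = TRUE))  # TRUE
```

This is why order() is the tool for sorting data frames: the same index vector can reorder every column at once.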

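order() can be combined with more complex data structures to handle more complex cases.
For example (ir is never defined in these notes; judging by the columns used, it appears to be a data frame with the iris columns plus an extra indoor column):

# Imagine that you just want to access Sepal Length and Species.
# You can access those values in different ways: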
ir[order(ir$Sepal.Length, decreasing = TRUE),c("Sepal.Length", "Species")][1:5,]

ir[order(ir$Sepal.Length, decreasing = TRUE),][1:5, c("Sepal.Length", "Species")]
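
Since the notes do not show how ir is built, the following is only a sketch under a stated assumption: ir is a copy of the built-in iris data set with a logical indoor column added. The real contents of indoor are unknown, so the alternating pattern below is purely a placeholder.

```r
# Assumption: ir = iris + a hypothetical logical column `indoor`.
# The alternating TRUE/FALSE values are invented for illustration only.
ir <- iris
ir$indoor <- rep(c(TRUE, FALSE), length.out = nrow(ir))

# Five rows with the largest Sepal.Length, keeping two columns
top5 <- ir[order(ir$Sepal.Length, decreasing = TRUE),
           c("Sepal.Length", "Species")][1:5, ]
top5
```

With a stand-in like this, the snippets in the rest of these notes that refer to ir can be run as written.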

2. aggregate()

2.1 Processing data by group

aggregate(x, by, FUN, ..., simplify = TRUE)

x is an R object (commonly a data frame).
by is a list of the elements by which you will be grouping your data.
FUN is the function that will be applied to each subset.
simplify is a logical value that indicates whether results should be simplified into a vector or a matrix where possible.

Examples:

One variable, one grouping factor
aggregate(ir$Sepal.Length, by = list(ir$Species), FUN = mean)
aggregate(ir$Sepal.Width, by = list(ir$Species), FUN = summary)

Several variables, one grouping factor
aggregate(ir[, c("Sepal.Length", "Sepal.Width")], by = list(ir$Species), FUN = mean)

One variable, several grouping factors
mean_by_sp_ind = aggregate(ir$Sepal.Length, by = list(ir$Species, ir$indoor), FUN = mean)

2.2 The formula interface

aggregate(formula, data, FUN, ..., subset)

formula has the form y ~ x or cbind(y1, y2) ~ x1 + x2, where y and cbind(y1, y2) are the numeric data to be split and x or x1 + x2 are the grouping variables.
data is the data frame containing the variables used in the formula.
FUN is the function that will be applied to each subset.
subset is an optional vector specifying a subset of observations to be used.

Examples:

aggregate(Sepal.Length ~ Species, data = ir, FUN = mean)
One variable, one grouping factor; equivalent to:
aggregate(ir$Sepal.Length, by = list(ir$Species), FUN = mean)

aggregate(Sepal.Length ~ Species + indoor, data = ir, FUN = mean)
One variable, several grouping factors; equivalent to:
mean_by_sp_ind = aggregate(ir$Sepal.Length, by = list(ir$Species, ir$indoor), FUN = mean)

aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, data = ir, FUN = mean)
Several variables, one grouping factor; equivalent to:
aggregate(ir[, c("Sepal.Length", "Sepal.Width")], by = list(ir$Species), FUN = mean)

All remaining variables, several grouping factors
aggregate(. ~ Species + indoor, data = ir, FUN = mean)

Using subset to keep only the rows that satisfy a condition
aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, data = ir, subset = Petal.Width > 0.6, FUN = mean)
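
The two interfaces compute the same numbers but label the result differently. The sketch below uses the built-in iris data set (rather than ir) so that it runs on its own; by_form and formula_form are illustrative names.

```r
by_form      <- aggregate(iris$Sepal.Length, by = list(iris$Species), FUN = mean)
formula_form <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

names(by_form)       # "Group.1" "x"
names(formula_form)  # "Species" "Sepal.Length"

all.equal(by_form$x, formula_form$Sepal.Length)  # TRUE

# Naming the list element gives the by-interface a readable group column
names(aggregate(iris$Sepal.Length, by = list(Species = iris$Species), FUN = mean))
# "Species" "x"
```

One practical difference to keep in mind: the formula interface drops rows with missing values by default (na.action = na.omit), so the two forms can disagree on data that contains NA.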

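2.3 Custom functions

A function can be defined once and then reused many times.

Example:

# mean of the n largest values of vec
meanX = function(vec, n) { mean(head(vec[order(-vec)], n)) }
aggregate(ir$Sepal.Length, by = list(ir$Species), FUN = function(x) meanX(x, 5))

An anonymous function can also be written inline.

Example:

aggregate(ir$Sepal.Length, by = list(ir$Species), FUN = function(x) mean(head(x[order(-x)], 5)))

Default values and extra arguments

Examples:

# n gets its default value inside the function
aggregate(ir$Sepal.Length, by = list(ir$Species),
          FUN = function(x, n = 5) mean(head(x[order(-x)], n)))

# n is handed to FUN through aggregate()'s ... argument
aggregate(ir$Sepal.Length, n = 5, by = list(ir$Species),
          FUN = function(x, n) mean(head(x[order(-x)], n)))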
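
To check that a custom function does what is intended, it helps to try it on a tiny vector first. In this sketch, mean_top_n is a hypothetical stand-in for meanX, and iris replaces ir so the code is self-contained.

```r
# Mean of the n largest values of vec (same idea as meanX above)
mean_top_n <- function(vec, n = 5) mean(head(sort(vec, decreasing = TRUE), n))

mean_top_n(c(1, 10, 3, 8, 5), n = 2)  # (10 + 8) / 2 = 9

# Extra arguments after FUN are forwarded to it, here n = 3
res <- aggregate(iris$Sepal.Length, by = list(Species = iris$Species),
                 FUN = mean_top_n, n = 3)
res
```

Passing the named function directly (FUN = mean_top_n, n = 3) avoids wrapping it in another anonymous function.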

3. Basic data analysis

3.2 Inspecting a sample of the data

View(dataset) shows the whole dataset in a new window.
tail(dataset, x) shows the last x instances of dataset in the console.
head(dataset, x) shows the first x instances of dataset in the console.
names(dataset) returns the names of the attributes in the dataset.
str(dataset) returns a summary of the type of each attribute and its first few values.

3.3 Measures of central tendency

Mean: mean(), colMeans(), rowMeans()
Median: median()
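
A small sketch of how these functions differ, using made-up numbers: mean() collapses everything into one value, while colMeans() and rowMeans() work per column and per row. The second part shows why the median is less sensitive to an outlier than the mean.

```r
m <- matrix(1:6, nrow = 2)  # filled by column: rows are (1, 3, 5) and (2, 4, 6)

mean(m)      # 3.5 (all six values)
colMeans(m)  # 1.5 3.5 5.5
rowMeans(m)  # 3 4

x <- c(1, 2, 3, 4, 100)  # one extreme value
mean(x)    # 22
median(x)  # 3
```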

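3.4 Measures of dispersion

Standard deviation: sd()
Range: range()
Interquartile range: IQR() = Q3 − Q1
Quantiles are the values found at the equal-division points once all observations are sorted by size. Splitting the data into two equal parts gives the median; splitting it into four equal parts gives the quartiles.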
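
The definitions above can be checked on a vector small enough to verify by hand. This sketch relies on R's default quantile method (type = 7); other types can return slightly different quartiles.

```r
x <- 1:9

median(x)                           # 5
quantile(x, probs = c(0.25, 0.75))  # 25%: 3, 75%: 7
IQR(x)                              # 4, i.e. Q3 - Q1 = 7 - 3
range(x)                            # 1 9
sd(x)                               # sqrt(7.5), about 2.74
```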

3.5 Correlation

Correlation coefficient: cor()
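
As a quick illustration, cor() returns the Pearson coefficient by default, which is 1 or -1 for exact linear relationships. The vectors below are made up for the purpose.

```r
x <- 1:10

cor(x, 2 * x + 1)  # 1: perfect positive linear relationship
cor(x, -x)         # -1: perfect negative linear relationship

# Given a numeric data frame, cor() returns the full correlation matrix
cm <- cor(iris[, 1:4])
dim(cm)            # 4 4
```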

4. Advanced data analysis with dplyr

The main dplyr functions covered here are: filter(), arrange(), select(), mutate(), summarize(), sample_n(), sample_frac(), and group_by().

4.1 filter

filter(data frame, condition)
Returns the rows that satisfy the condition.

filter(iris, Species == "setosa") # using dplyr
is equivalent to
iris[iris$Species == "setosa", ] # using base R
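
filter() also accepts several conditions separated by commas, which are combined with a logical AND. A minimal sketch comparing it with the base R form (a and b are illustrative names; dplyr is assumed to be installed):

```r
library(dplyr)

# Comma-separated conditions are combined with AND
a <- filter(iris, Species == "setosa", Sepal.Length > 5.5)
b <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5.5, ]

nrow(a) == nrow(b)                     # TRUE
all(a$Sepal.Length == b$Sepal.Length)  # TRUE
```

The two forms differ when the condition evaluates to NA: filter() drops such rows, whereas bracket indexing returns a row of NA values for each of them.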

4.2 arrange

arrange(data frame, attributes)
Sorts the data frame by the given attributes.

Ascending:
arrange(iris, Sepal.Length) # dplyr
iris[order(iris$Sepal.Length, decreasing = FALSE),] # base R
iris[order(iris$Sepal.Length),] # base R

Ascending by several attributes:
arrange(iris, Sepal.Length, Sepal.Width)[1:5,]

Descending:
arrange(iris, desc(Sepal.Length)) # dplyr
iris[order(iris$Sepal.Length, decreasing = TRUE),] # base R
iris[order(-iris$Sepal.Length),] # base R

Mixing ascending and descending:
arrange(iris, Sepal.Length, desc(Sepal.Width))[1:5,] # dplyr
iris[order(iris$Sepal.Length, -iris$Sepal.Width),][1:5,] # base R
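
Mixed sort directions are easier to follow on a tiny table. The data frame df below is invented for illustration; dplyr is assumed to be installed.

```r
library(dplyr)

df <- data.frame(g = c(2, 1, 2, 1), v = c(10, 20, 30, 40))

# g ascending, then v descending within each g
arrange(df, g, desc(v))
#   g  v
# 1 1 40
# 2 1 20
# 3 2 30
# 4 2 10

# Base R counterpart; the original row names are kept
base_sorted <- df[order(df$g, -df$v), ]
base_sorted$v  # 40 20 30 10
```

Negating a column inside order() only works for numeric columns, while desc() also handles character and factor columns.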

4.3 select

select(data frame, var1, ..., varX)
Keeps the named columns, or drops some of them.

select(ir, Petal.Width, Species) # dplyr
ir[, c("Petal.Width", "Species")] # base R

select(ir, -Species) # dplyr
ir[, -c(5)] # base R, assuming Species is the fifth column

select also supports wildcard-like helpers:

select(ir, starts_with("Petal")) # Petal.Length and Petal.Width
select(ir, ends_with("Length")) # Sepal.Length and Petal.Length
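
The helpers can be checked quickly by looking only at the resulting column names. This sketch uses iris so it runs on its own; contains() is another selection helper not mentioned in the notes, and the grepl() line is one possible base R counterpart.

```r
library(dplyr)

names(select(iris, starts_with("Petal")))  # "Petal.Length" "Petal.Width"
names(select(iris, ends_with("Length")))   # "Sepal.Length" "Petal.Length"
names(select(iris, contains("Wid")))       # "Sepal.Width"  "Petal.Width"
names(select(iris, -Species))              # the four numeric columns

# Base R: pick columns whose name matches a regular expression
names(iris)[grepl("^Petal", names(iris))]  # "Petal.Length" "Petal.Width"
```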

4.4 mutate

mutate(data frame, expression)
Adds new columns to the data frame.

ir = mutate(ir, DoubleSepalL = Sepal.Length*2,
            PetalRatio = Petal.Length/Petal.Width) # dplyr
is equivalent to
ir$DoubleSepalL = ir$Sepal.Length*2 # base R
ir$PetalRatio = ir$Petal.Length/ir$Petal.Width # base R

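4.5 summarize, summarise_all

summarize(data frame, function(var1, ..., varX))
summarize and summarise are two spellings of the same function.

Several summary functions can be called at once, including: sd(), min(), max(), median(), sum(), cor() (correlation), n() (number of rows), first() (first value), last() (last value) and n_distinct() (number of distinct values in a vector).

summarise(ir, avg = mean(Sepal.Length), std = sd(Sepal.Length), total = n())
Here n() counts the rows.

Base R's summary() gives a quick per-column overview:
summary(ir)

summarise_all applies a function to every column:

Mean
summarise_all(ir[,1:4], mean)
Third quartile (75th percentile)
summarise_all(ir[,1:4], quantile, probs = 0.75)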

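4.6 sample_n

Shuffles the rows and takes a fixed number of them as a sample.

sample_n(iris, 5) # dplyr
is equivalent to
iris[sample(1:nrow(iris)),][1:5,] # base R

4.7 sample_frac

Shuffles the rows and takes a given fraction of them as a sample.

sample_frac(iris, 0.01) # takes 1% of the rows
is roughly equivalent to:
iris[sample(1:nrow(iris)),][1:ceiling(nrow(iris)*0.01),] # base R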
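
Because sampling is random, two runs normally give different rows. A common way to make results reproducible is to fix the seed with set.seed() first; the seed value 42 below is arbitrary.

```r
library(dplyr)

set.seed(42)             # fix the random number generator state
s1 <- sample_n(iris, 5)

set.seed(42)             # same seed, same draw
s2 <- sample_n(iris, 5)

identical(s1, s2)             # TRUE
nrow(sample_frac(iris, 0.1))  # 15 rows, i.e. 10% of 150
```

Recent dplyr versions mark sample_n() and sample_frac() as superseded by slice_sample(), though both still work.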

4.8 group_by

group_by(data frame, variable)
Usually combined with other functions, such as summarize.

summarize(group_by(ir, Species), sd(Petal.Width)) # dplyr
is equivalent to:
aggregate(ir$Petal.Width, by=list(ir$Species), FUN=sd) # base R

Correlation coefficient per group:
summarize(group_by(ir, Species), r=cor(Sepal.Length, Sepal.Width))
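
The equivalence between the dplyr and base R forms can be verified directly. The sketch below uses iris and gives the summary column an explicit name (sd_pw, chosen here for illustration) so it is easy to refer to afterwards.

```r
library(dplyr)

dp <- summarise(group_by(iris, Species), sd_pw = sd(Petal.Width))
ag <- aggregate(iris$Petal.Width, by = list(Species = iris$Species), FUN = sd)

all.equal(dp$sd_pw, ag$x)  # TRUE
class(dp)                  # "tbl_df" "tbl" "data.frame" (a tibble)
class(ag)                  # "data.frame"
```

The values match; the main visible difference is that dplyr returns a tibble while aggregate() returns a plain data frame.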

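5. Piping with dplyr

The pipe operator %>% passes the output of one function as the input of the next.

group_by(ir, Species) %>% summarise(avg = mean(Petal.Length))
is equivalent to:
summarise(group_by(ir, Species), avg = mean(Petal.Length))

Piping lets several functions be combined directly, which keeps the code readable. The three versions below all count the non-virginica plants whose petal length is above 3.5.

#A. Using base R
non_virg = ir[ir$Species!="virginica", c("Petal.Length")]
sum(non_virg>3.5)
## [1] 45

#B. Using dplyr with no piping
summarise(filter(ir,Species!="virginica",Petal.Length>3.5), n())
## n()
## 1 45

#C. Using dplyr with piping
ir %>% filter(Species!="virginica", Petal.Length>3.5) %>% nrow()
## [1] 45

Note: the pipe only chains function calls, so base R bracket indexing such as ir[rows, cols] does not slot into a pipeline directly.

ir %>%
  mutate(petal_w_l = Petal.Width/Petal.Length) %>%
  arrange(desc(petal_w_l)) %>%
  head(3) %>%
  select(Species, petal_w_l)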
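
To make the mechanics concrete: x %>% f(y) is evaluated as f(x, y), so a pipeline is just the nested call read from left to right. A minimal sketch on iris (nested and piped are illustrative names):

```r
library(dplyr)

# The same computation written inside-out and as a pipeline
nested <- head(arrange(iris, desc(Petal.Length)), 3)
piped  <- iris %>% arrange(desc(Petal.Length)) %>% head(3)

identical(nested, piped)  # TRUE

# Ordinary functions such as sqrt() and sum() can be piped as well
c(1, 4, 9) %>% sqrt() %>% sum()  # 1 + 2 + 3 = 6
```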

6. Exercises

1. Combining functions, piping and which

Note: the underscore in which.max(by_spc$Sepal.Length_mean) is simply part of a column name in by_spc; it has no special meaning.

summ = list(min = min, max = max, mean = mean, median = median, q1 = function(x) quantile(x, 0.25), q3 = function(x) quantile(x, 0.75))
by_spc = group_by(ir, Species) %>% summarise_all(summ)

by_spc
# A tibble: 3 x 25
  Species Sepal.Length_min Sepal.Width_min Petal.Length_min
  <fct>              <dbl>           <dbl>            <dbl>
1 setosa               4.3             2.3              1  
2 versic~              4.9             2                3  
3 virgin~              4.9             2.2             4.5
......

a. Which plants have the highest mean sepal length?
by_spc[which.max(by_spc$Sepal.Length_mean),]$Species

b. Which plants have the sample with the smallest petal width?
by_spc[which.min(by_spc$Petal.Width_min),]$Species

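2. Getting the data type of every attribute in a data frame

Use sapply with class (choco is the data frame used in this exercise):

sapply(choco, class)

3. Getting the mode of a data frame (the most frequent category)

Obtain the mode from all of the nominal attributes in the dataset.

Note the use of factor:

A factor represents the categories in a set of data; it records the category names (levels) and how many categories there are.
Factors make it possible to group the objects under study and compute results per category.
Reference: https://www.zhihu.com/question/48472404

Example:

choco_nominal = choco[sapply(choco, is.factor)] # keep only the factor columns
sapply(choco_nominal, function(x) names(which.max(table(x)))) # most frequent level of each column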
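
Since choco is not included in these notes, here is the same idea on a small invented data frame (df and its columns are hypothetical stand-ins):

```r
# Hypothetical toy data standing in for choco
df <- data.frame(
  company = factor(c("A", "B", "A", "C", "A")),
  origin  = factor(c("x", "y", "y", "y", "x")),
  rating  = c(3.5, 4.0, 2.75, 3.0, 3.25)
)

nominal <- df[sapply(df, is.factor)]  # keep only the factor columns
sapply(nominal, function(x) names(which.max(table(x))))
# company  origin
#     "A"     "y"
```

table() counts each level and which.max() picks the first maximum, so when two levels tie only the first one is reported.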

4. Number of categories

How many distinct companies have been considered?

length(unique(choco$Company))

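5. Converting a factor to a numeric type

Remove the characters that R cannot parse as part of a number with gsub() (for example, turn 2,345 into 2345 and 30% into 30).
Convert the cleaned text to a number with as.numeric().

For example:

Circulation2004 = as.numeric(gsub(pattern = ",", replacement = "", x = books$Daily.Circulation..2004))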
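
A common pitfall with this conversion is calling as.numeric() directly on a factor, which returns the internal level codes rather than the values. The sketch below uses invented values (f, circ and pct are hypothetical) to show the difference and the cleaning step.

```r
f <- factor(c("10", "20", "5"))

as.numeric(f)                # 1 2 3 (level codes, not the values)
as.numeric(as.character(f))  # 10 20 5

# Strip thousands separators before converting
circ <- factor(c("2,345", "30", "1,000"))
as.numeric(gsub(",", "", as.character(circ)))  # 2345 30 1000

# Strip percent signs; fixed = TRUE treats the pattern as plain text
pct <- c("30%", "7.5%")
as.numeric(gsub("%", "", pct, fixed = TRUE))   # 30 7.5
```

gsub() returns character vectors, which is why the original one-liner works even when the column is a factor.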
