Data analysis with R
Vol_0: Numerical characteristics and correlation analysis of data
Import Data
Import text table data
Year Nationwide Rural Urban
1978 184 138 405
1979 207 158 434
1980 236 178 496
1981 262 199 562
1982 284 221 576
1983 311 246 603
1984 354 283 662
1985 437 347 802
...
R code:
data <- read.table("./data.txt", header=TRUE)
data
result:
Import CSV data
序号,省市区,11月,1~11月
1,北京,35.22,499.8
2,天津,10.41,161.37
3,河北,17.22,273.29
4,山西,10.7,134.79
5,内蒙古,10.29,90.92
...
R code:
data <- read.csv("./data.csv")
Note: The header row of this file contains Chinese characters and special symbols, which R automatically converts into syntactically valid column names:
> cat(names(data))
X.序号 省市区 X11月 X1.11月
You can rename the columns manually:
data <- data[-1] # remove "序号" col
names(data) <- c("Province", "X1", "X2")
data
result:
Note: The examples below use one or the other of these two imported data sets.
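If you would rather keep the header text exactly as it appears in the file, read.csv has a check.names argument. The following is only a minimal sketch: the inline CSV and its ASCII header names (id, 11m, 1~11m) are made up for illustration and stand in for the Chinese headers above.

```r
# Hypothetical three-column CSV; the header names are illustrative only.
csv_text <- "id,11m,1~11m
1,35.22,499.8
2,10.41,161.37"

# Default behaviour: header names are converted into valid R names.
fixed <- read.csv(text = csv_text)
print(names(fixed))

# check.names = FALSE keeps the header text as written in the file.
raw <- read.csv(text = csv_text, check.names = FALSE)
print(names(raw))

# Non-syntactic names then have to be accessed with [[ ]] or backticks.
print(raw[["1~11m"]])
```

Keeping the original names is convenient for display, while the converted names are easier to type; renaming by hand, as above, gives the most control.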
attach
To make it easier to refer to the columns of a data.frame, we can attach it:
attach(data)
Then you can directly refer to a column of data with the column name, for example:
print(X1)
There is no longer any need to index into data:
print(data[2])
After use, remember to detach:
detach(data)
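attach() can mask other objects that happen to share a column name, so for short snippets with() or $ are common alternatives. A small sketch, assuming a data.frame with the same column names as the renamed CSV data (only the first three rows shown above, with the province names romanized):

```r
# First rows of the CSV example, after the renaming step shown earlier.
data <- data.frame(
  Province = c("Beijing", "Tianjin", "Hebei"),
  X1 = c(35.22, 10.41, 17.22),
  X2 = c(499.8, 161.37, 273.29)
)

# with() evaluates an expression with the columns visible as variables,
# without touching the search path, so nothing has to be detached later.
m1 <- with(data, mean(X1))

# $ extracts a single column as a plain vector.
m2 <- mean(data$X1)

print(c(m1, m2))
```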
Mean, variance, standard deviation, coefficient of variation, skewness, kurtosis
Univariate
Data: one variable, i.e. one "column" of data
x <- c(1, 2, 3, 4, 5)
Mean:
$$\overline x = \frac{1}{n}\sum_{i=1}^n x_i$$
mean(x)
Variance:
$$s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\overline x)^2$$
var(x)
Standard deviation:
$$s = \sqrt{s^2}=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i-\overline x)^2}$$
sd(x)
Coefficient of variation:
$$CV = \frac{s}{\overline x}$$
cv <- function(x) sd(x)/mean(x)
cv(x)
Note: The textbook expresses it as a percentage: $CV = 100 \times \frac{s}{\overline x}\ (\%)$.
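As a quick check of the definitions so far, the same quantities can be computed directly from the formulas and compared with the built-in functions. This is just a sketch; the variable names x_bar, s2, s and cv_pct are chosen here for illustration.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

x_bar  <- sum(x) / n                    # mean
s2     <- sum((x - x_bar)^2) / (n - 1)  # sample variance (divisor n - 1)
s      <- sqrt(s2)                      # standard deviation
cv_pct <- 100 * s / x_bar               # coefficient of variation, in percent

print(c(mean = x_bar, var = s2, sd = s, cv_pct = cv_pct))

# The built-in functions use the same definitions.
print(c(mean(x), var(x), sd(x)))
```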
Skewness:
$$g_1=\frac{n}{(n-1)(n-2)}\frac{1}{s^3}\sum_{i=1}^n(x_i-\overline x)^3$$
g1 <- function(x) {
  n <- length(x)
  A <- n / ((n-1) * (n-2))
  B <- 1 / sd(x)^3
  S <- sum((x - mean(x))^3)
  A * B * S
}
g1(x)
Kurtosis:
$$g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\frac{1}{s^4}\sum_{i=1}^n(x_i-\overline x)^4-\frac{3(n-1)^2}{(n-2)(n-3)}$$
g2 <- function(x) {  # kurtosis
  n <- length(x)
  A <- (n * (n+1)) / ((n-1) * (n-2) * (n-3))
  B <- 1 / sd(x)^4
  S <- sum((x - mean(x))^4)
  C <- (3 * (n-1)^2) / ((n-2) * (n-3))
  A * B * S - C
}
g2(x)
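A quick sanity check helps confirm that the two functions behave as expected. The sketch below restates g1 and g2 compactly so that it runs on its own; the vectors sym and skewed are made-up test inputs. Note that g1 needs at least 3 observations and g2 at least 4, otherwise the denominators are zero.

```r
g1 <- function(x) {
  n <- length(x)
  n / ((n - 1) * (n - 2)) * sum((x - mean(x))^3) / sd(x)^3
}
g2 <- function(x) {
  n <- length(x)
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum((x - mean(x))^4) / sd(x)^4 -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}

sym    <- c(1, 2, 3, 4, 5)   # symmetric around its mean
skewed <- c(1, 2, 3, 4, 20)  # one large value stretches the right tail

print(g1(sym))     # symmetric data: the cubed deviations cancel
print(g2(sym))     # flatter than a normal distribution, so negative
print(g1(skewed))  # long right tail, so positive
```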
Act on data.frame
All the data we imported are data.frames, and a single column can be extracted and treated just like x above:
x <- data[[2]]  # take the second column of data
mean(x)
But calling the function on each column separately is tedious, so here is a way to apply mean (or any other function) to every column of a data.frame at once (using the data imported from the first table above as an example):
apply(data[-1], MARGIN=2, FUN=mean)
result:
Explanation:
- The first column of data is only a label (the year, or the province name in the CSV data), so its mean is meaningless. Use data[-N] to remove the Nth column (R indexes from 1).
- The second argument, MARGIN=2, means the function is applied column by column.
- The third argument, FUN, is the function to apply, here mean. The variance works the same way: just change this argument to FUN=var, and so on.
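The column-wise call can be tried on the rows of the first table that are shown above. This sketch uses only those eight rows (the original file has more), and also shows sapply and colMeans as equivalent alternatives; unlike apply, sapply works on the data.frame columns directly instead of first converting the whole thing to a matrix.

```r
# Only the eight rows printed in the text table above.
txt <- "Year Nationwide Rural Urban
1978 184 138 405
1979 207 158 434
1980 236 178 496
1981 262 199 562
1982 284 221 576
1983 311 246 603
1984 354 283 662
1985 437 347 802"
data <- read.table(text = txt, header = TRUE)

by_apply  <- apply(data[-1], MARGIN = 2, FUN = mean)  # column by column
by_sapply <- sapply(data[-1], mean)                   # same result, per column
by_colmns <- colMeans(data[-1])                       # dedicated helper for means

print(by_apply)
print(sapply(data[-1], var))  # swap in another function, e.g. the variance
```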
all in one
For convenience, we can wrap all of these calculations in a single function:
describes <- function(df) {
  # TODO(CDFMLR): avoid the repeated computation
  cv <- function(x) sd(x)/mean(x)  # coefficient of variation
  g1 <- function(x) {  # skewness
    n <- length(x)
    A <- n / ((n-1) * (n-2))
    B <- 1 / sd(x)^3
    S <- sum((x - mean(x))^3)
    A * B * S
  }
  g2 <- function(x) {  # kurtosis
    n <- length(x)
    A <- (n * (n+1)) / ((n-1) * (n-2) * (n-3))
    B <- 1 / sd(x)^4
    S <- sum((x - mean(x))^4)
    C <- (3 * (n-1)^2) / ((n-2) * (n-3))
    A * B * S - C
  }
  # row labels for the six statistics, in the order they are computed below
  itm <- matrix(c("mean", "variance", "sd", "cv", "skewness", "kurtosis"), 6, 1)
  res <- apply(df, 2,
               function(x) c(mean(x), var(x), sd(x), cv(x), g1(x), g2(x)))
  cbind(itm, res)
}
The argument is a data.frame; the function computes the mean, variance, etc. of each column (for example, for the CSV data imported earlier):
describes(data[-1])
result:
It comes out all at once, which is very convenient.
Using a package instead
Of course, these operations are already implemented by third-party packages, such as psych.
Install the package (run this directly in R):
install.packages("psych")
Load the package:
library(psych)
Then you can use what the package provides.
It provides skewness and kurtosis, the two that were hardest to write by hand:
# g1 and g2 above correspond to type=2: see help(skew)
g1 <- function(x) skew(x, type=2)
g2 <- function(x) kurtosi(x, type=2)  # help(kurtosi)
The package also provides a describe function that computes most of the previous values at once (similar to our hand-written describes):
describe(data[-1], type=2)
result:
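To see whether the hand-written formulas and the package agree, the two can be compared on the same vector. The sketch below uses the eight Nationwide values shown in the first table and only runs the comparison if psych is installed; the compact g1 and g2 are restated so the snippet is self-contained.

```r
g1 <- function(x) {
  n <- length(x)
  n / ((n - 1) * (n - 2)) * sum((x - mean(x))^3) / sd(x)^3
}
g2 <- function(x) {
  n <- length(x)
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum((x - mean(x))^4) / sd(x)^4 -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}

x <- c(184, 207, 236, 262, 284, 311, 354, 437)
hand <- c(g1(x), g2(x))
print(hand)

if (requireNamespace("psych", quietly = TRUE)) {
  pkg <- c(as.numeric(psych::skew(x, type = 2)),
           as.numeric(psych::kurtosi(x, type = 2)))
  print(all.equal(hand, pkg))  # TRUE when both implementations agree
} else {
  message("psych is not installed; skipping the comparison")
}
```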
Median, upper and lower quartiles, interquartile range
First compute the five-number summary: minimum, lower quartile, median, upper quartile, maximum:
fn <- apply(data[-1], 2, fivenum)
fn
Interquartile range:
R1 <- function(Q3, Q1) Q3 - Q1
R1(Q3=fn[4,], Q1=fn[2,])
[Math Time]
p-quantile (where $x_{(k)}$ is the $k$-th smallest observation and $[np]$ is the integer part of $np$):
$$M_p=\left\{\begin{array}{ll} x_{([np]+1)}, & np \textrm{ is not an integer}\\ \frac{1}{2}\left(x_{(np)}+x_{(np+1)}\right), & np \textrm{ is an integer} \end{array}\right.$$
Upper and lower quartiles:
$$Q_3=M_{0.75}, \qquad Q_1=M_{0.25}$$
Interquartile range:
$$R_1=Q_3-Q_1$$
Note: in R, quantiles are computed with quantile(); see help(quantile) for details.
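The piecewise definition of the p-quantile can be written out directly. The helper m_p below is a hypothetical name introduced for this sketch, valid for 0 < p < 1; as far as I can tell, the definition corresponds to type = 2 of quantile(), whereas the default is type = 7, so the default result can differ.

```r
# p-quantile following the piecewise definition above (0 < p < 1).
m_p <- function(x, p) {
  xs <- sort(x)
  n  <- length(xs)
  np <- n * p
  if (abs(np - round(np)) < 1e-9) {
    k <- round(np)               # np is an integer: average two neighbours
    (xs[k] + xs[k + 1]) / 2
  } else {
    xs[floor(np) + 1]            # otherwise take the ([np] + 1)-th smallest value
  }
}

x  <- c(184, 207, 236, 262, 284, 311, 354, 437)  # Nationwide values shown above
q1 <- m_p(x, 0.25)
q3 <- m_p(x, 0.75)
print(c(Q1 = q1, Q3 = q3, R1 = q3 - q1))

print(quantile(x, c(0.25, 0.75), type = 2))  # same definition
print(quantile(x, c(0.25, 0.75)))            # default type = 7, interpolates
```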
With the quartiles and the interquartile range, you can identify outliers.
Definition: the lower and upper cutoff points are
$$Q_1-1.5R_1,\qquad Q_3+1.5R_1$$
Data greater than the upper cutoff or smaller than the lower cutoff are considered outliers:
abnormal <- function(x) {
  fn <- fivenum(x)
  Q1 <- fn[2]; Q3 <- fn[4]
  R1 <- Q3 - Q1
  QD <- Q1 - 1.5 * R1  # lower cutoff
  QU <- Q3 + 1.5 * R1  # upper cutoff
  x[(x < QD) | (x > QU)]
}
apply(data[-1], 2, abnormal)
# an empty result means there are no outliers
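A small made-up example shows what the function returns. The vectors below are hypothetical test data; boxplot.stats() from base R uses the same hinges and the same 1.5 factor by default, so it serves as a cross-check. Because columns can have different numbers of outliers, lapply (which always returns a list) is used here instead of apply.

```r
abnormal <- function(x) {
  fn <- fivenum(x)
  Q1 <- fn[2]; Q3 <- fn[4]
  R1 <- Q3 - Q1
  x[(x < Q1 - 1.5 * R1) | (x > Q3 + 1.5 * R1)]
}

x <- c(1, 2, 3, 4, 5, 100)    # 100 is far above the upper cutoff
print(abnormal(x))
print(boxplot.stats(x)$out)   # base R cross-check

df <- data.frame(a = c(1, 2, 3, 4, 5, 100), b = c(1, 2, 3, 4, 5, 6))
print(lapply(df, abnormal))   # column b has no outliers: numeric(0)
```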
Data Distribution Chart
stem and leaf diagram
stem(Nationwide)
histogram
The easiest way is to call hist(x) directly, but we can make it look better.
Wrap it in a function:
histogram <- function(x, xname="x") {
  hist(x, prob=TRUE, main=paste("Histogram of", xname))
  lines(density(x))
  rug(x)  # show the actual data points
}
Call it:
histogram(X1, "X1")
result:
Empirical Distribution Function Plot
Wrap it in a function:
plot_ecdf <- function(x, xname="x") {
  plot(ecdf(x), do.points=FALSE, verticals=TRUE, main=paste("ecdf(", xname, ")"))
  xs <- seq(min(x), max(x), 1/sqrt(length(x)))
  lines(xs, pnorm(xs, mean=mean(x), sd=sd(x)), lty=3, col="red")
}
Note: for xs I chose a step of $\frac{1}{\sqrt{n}}$, which suits my data (the grid is neither too coarse nor too dense); this can be changed freely.
Call it:
plot_ecdf(X1, "X1")
result:
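The plot compares two curves that can also be inspected numerically: ecdf(x) returns a step function, and the dashed line is the normal CDF with the sample mean and standard deviation. A sketch using the eight Nationwide values shown earlier (the name gap is chosen here for illustration):

```r
x <- c(184, 207, 236, 262, 284, 311, 354, 437)

Fn <- ecdf(x)    # Fn(t) = proportion of observations <= t
print(Fn(262))   # 4 of the 8 values are <= 262

# Normal CDF fitted with the sample mean and standard deviation,
# evaluated at the sorted data points.
xs <- sort(x)
fitted <- pnorm(xs, mean = mean(x), sd = sd(x))

# Largest vertical gap between the two curves at the data points.
gap <- max(abs(Fn(xs) - fitted))
print(gap)
```

A small gap suggests the normal curve tracks the empirical distribution closely; with only eight points this is a rough impression, not a test.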
Normal QQ plot
qqnorm(X1)
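What qqnorm draws can be looked at as plain numbers by turning the plotting off. This sketch, again on the eight Nationwide values, prints the coordinates of the points; the correlation between the two coordinates is only an informal indication of how close the points are to a straight line.

```r
x <- c(184, 207, 236, 262, 284, 311, 354, 437)

# plot.it = FALSE returns the coordinates without drawing anything.
qq <- qqnorm(x, plot.it = FALSE)
print(data.frame(theoretical = qq$x, sample = qq$y))

# Points close to a straight line give a correlation close to 1.
print(cor(qq$x, qq$y))

# In an interactive session, qqnorm(x) followed by qqline(x) draws the
# plot together with a reference line through the quartiles.
```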
Pearson and Spearman correlation coefficient
Pearson correlation coefficient
Two-dimensional population: $(X,Y)^T$
Observed data: $(x_1,y_1)^T,(x_2,y_2)^T,\cdots,(x_n,y_n)^T$
Let: $\overline x=\frac{1}{n}\sum_{i=1}^nx_i,\quad \overline y=\frac{1}{n}\sum_{i=1}^ny_i$
Then the variances of the observed data of $X$ and $Y$ are:
$$s_{xx}=\frac{1}{n-1}\sum_{i=1}^n(x_i-\overline x)^2 \quad s_{yy}=\frac{1}{n-1}\sum_{i=1}^n(y_i-\overline y)^2$$
The covariance of the observed data of $X$ and $Y$ is:
$$s_{xy}=\frac{1}{n-1}\sum_{i=1}^n(x_i-\overline x)(y_i-\overline y)$$
(Note: the covariance matrix is $S=\left[\begin{matrix}s_{xx} & s_{xy} \\ s_{yx} & s_{yy}\end{matrix}\right]$, where $s_{yx}=s_{xy}$.)
Pearson correlation coefficient:
$$r_{xy}=\frac{s_{xy}}{\sqrt{s_{xx}}\sqrt{s_{yy}}}$$
This value satisfies $|r_{xy}|\le1$ and measures the degree of linear correlation between $X$ and $Y$:
- $r_{xy}\rightarrow 1$: positive correlation
- $r_{xy}\rightarrow 0$: no linear correlation
- $r_{xy}\rightarrow -1$: negative correlation
To calculate the correlation coefficient in R, use cor(x, y, method="pearson"), which returns the value of $r_{xy}$ directly. You can also use the following function, which outputs more information:
cor.test(X1, X2, method="pearson")
Output: ($\textrm{cor}=r_{xy}$)
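The definitions can be checked by computing the Pearson coefficient step by step and comparing it with cor(). A minimal sketch using the Rural and Urban columns of the eight rows shown in the first table; the names s_xx, s_yy, s_xy and r_xy simply mirror the symbols in the definitions.

```r
x <- c(138, 158, 178, 199, 221, 246, 283, 347)  # Rural
y <- c(405, 434, 496, 562, 576, 603, 662, 802)  # Urban
n <- length(x)

s_xx <- sum((x - mean(x))^2) / (n - 1)                # variance of x
s_yy <- sum((y - mean(y))^2) / (n - 1)                # variance of y
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # covariance

r_xy <- s_xy / (sqrt(s_xx) * sqrt(s_yy))

print(r_xy)
print(cor(x, y, method = "pearson"))  # should print the same value
```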
[Math Time] The hypothesis test behind the output above:
Let the distribution function of the two-dimensional population $(X,Y)^T$ be $F(x,y)$.
Population correlation coefficient: $\rho_{_{XY}}=\frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}}$.
When $n$ is sufficiently large, $\rho_{_{XY}} \approx r_{xy}$.
The problem is:
- $r_{xy}$ can always be computed from any observed data, and it is generally not 0.
- But if $X$ and $Y$ are uncorrelated in the population ($\rho_{_{XY}}=0$), then using $r_{xy}$ to measure the correlation between $X$ and $Y$ is meaningless.
So we do a hypothesis test:
$$H_0:\rho_{_{XY}}=0 \quad \leftrightarrow \quad H_1:\rho_{_{XY}} \ne0$$
If the population is bivariate normal, then when $H_0$ is true the statistic satisfies
$$t=\frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}} \sim t(n-2)$$
Denote the $t$ value calculated from the observed data by $t_0$; the $p$-value of the test is then:
$$p=P_{H_0}(|t|>|t_0|)=P(|t(n-2)|\ge|t_0|)$$
Given a significance level $\alpha$, reject $H_0$ when $p<\alpha$ and conclude that $X$ and $Y$ are correlated; $r_{xy}$ can then be used to measure the correlation.
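The statistic and the p-value can be reproduced by hand and compared with what cor.test reports. A sketch on the Rural and Urban columns of the eight rows shown in the first table; t0 and p are names chosen here to match the symbols in the text.

```r
x <- c(138, 158, 178, 199, 221, 246, 283, 347)  # Rural
y <- c(405, 434, 496, 562, 576, 603, 662, 802)  # Urban
n <- length(x)

r  <- cor(x, y)
t0 <- r * sqrt(n - 2) / sqrt(1 - r^2)  # observed value of the statistic
p  <- 2 * pt(-abs(t0), df = n - 2)     # two-sided p-value from t(n - 2)

ct <- cor.test(x, y, method = "pearson")

print(c(t0 = t0, p = p))
print(c(t = unname(ct$statistic), p = ct$p.value))  # values from cor.test

alpha <- 0.05
print(p < alpha)  # TRUE means H0 is rejected at this level
```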
Spearman correlation coefficient
Spearman's coefficient is a rank correlation coefficient.
Sample rank: sort the observations $x_1,x_2,\cdots,x_n$ from smallest to largest; the rank $R_i$ of $x_i$ is its position in that ordering.
e.g.
$$\begin{array}{rrrrr} x_i: & 7 & -3 & -1 & 5 \\ R_i: & 4 & 1 & 2 & 3 \end{array}$$
Let:
- the ranks of $x_1,x_2,\cdots,x_n$ be $R_1,R_2,\cdots,R_n$
- $\overline R=\frac{1}{n}\sum_{i=1}^nR_i=\frac{1}{n}\sum_{i=1}^n i=\frac{n+1}{2}$
- the ranks of $y_1,y_2,\cdots,y_n$ be $S_1,S_2,\cdots,S_n$
- $\overline S=\frac{1}{n}\sum_{i=1}^nS_i=\frac{n+1}{2}$
Then the Spearman correlation coefficient is defined as:
$$q_{xy}=\frac{\sum_{i=1}^n(R_i-\overline R)(S_i-\overline S)}{\sqrt{\sum_{i=1}^n(R_i-\overline R)^2}\sqrt{\sum_{i=1}^n(S_i-\overline S)^2}}$$
Calculate it in R ($\textrm{rho}=q_{xy}$ in the output):
cor.test(X1, X2, method="spearman")
As before, the output comes with a hypothesis test:
$$H_0:\rho_{_{XY}}=0 \quad \leftrightarrow \quad H_1:\rho_{_{XY}} \ne0$$
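The rank-based definition can be followed step by step with rank(). This sketch uses the five CSV rows shown near the top (X1 is the November column, X2 the January to November column) and compares the hand-computed value with cor(..., method="spearman"); R, S and q_xy mirror the symbols above.

```r
# The ranking example from the text.
print(rank(c(7, -3, -1, 5)))  # 4 1 2 3

x <- c(35.22, 10.41, 17.22, 10.7, 10.29)      # X1, five rows shown above
y <- c(499.8, 161.37, 273.29, 134.79, 90.92)  # X2, five rows shown above

R <- rank(x)  # ranks of the x values
S <- rank(y)  # ranks of the y values

# Pearson's formula applied to the ranks instead of the raw values.
q_xy <- sum((R - mean(R)) * (S - mean(S))) /
  (sqrt(sum((R - mean(R))^2)) * sqrt(sum((S - mean(S))^2)))

print(q_xy)
print(cor(x, y, method = "spearman"))  # should print the same value
```

Because only the ordering matters, any monotone transformation of x or y leaves the Spearman coefficient unchanged, which is the main practical difference from Pearson's coefficient.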
【EOF】
That's all for now. I have been busy lately; if I find time later, I may write about regression analysis, analysis of variance, and so on, which together make up a complete set.
CDFMLR 2021.06.07
[PS 2021.07.15] I really don't have time to write, so this one is left unfinished as well.
(If anyone wants to see the follow-up, let me know.)
In fact, most of the code for the whole data analysis is available at github.com/cdfmlr/daex if you need it for reference.