Data analysis with R

Vol_0: Numerical characteristics and correlation analysis of data

Import Data

Import text table data

Year  Nationwide  Rural  Urban
1978  184         138    405
1979  207         158    434
1980  236         178    496
1981  262         199    562
1982  284         221    576
1983  311         246    603
1984  354         283    662
1985  437         347    802
...

R code:

data <- read.table("./data.txt", header=TRUE)
data

Result:

[Figure: read_table]
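To try the import without creating a file, a minimal sketch can pass a few rows through the text argument of read.table. The variable name sample_txt is made up for illustration, and only the first three rows of the table above are included.

```r
# Hypothetical inline copy of the first rows of data.txt (illustration only)
sample_txt <- "Year  Nationwide  Rural  Urban
1978  184         138    405
1979  207         158    434
1980  236         178    496"

# header = TRUE takes the first line as column names;
# whitespace of any width separates the fields
data <- read.table(text = sample_txt, header = TRUE)

str(data)         # four columns, all parsed as integers
data$Nationwide   # one column, as a plain vector
```

Reading from a file works the same way; only the first argument changes.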

Import CSV data

 序号,省市区,11月,1~11月
 1,北京,35.22,499.8
 2,天津,10.41,161.37
 3,河北,17.22,273.29
 4,山西,10.7,134.79
 5,内蒙古,10.29,90.92
 ...

R code:

data <- read.csv("./data.csv")

Note: the column headers in this file contain Chinese characters and special symbols, which R automatically converts into syntactically valid names:

> cat(names(data))
X.序号 省市区 X11月 X1.11月

You can change it manually:

data <- data[-1]  # drop the first column ("序号", the row number)
names(data) <- c("Province", "X1", "X2")
data

Result:

[Figure: read_csv]
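The automatic renaming can be reproduced and switched off. The sketch below uses a made-up ASCII stand-in for data.csv (the names csv_txt, fixed and raw are illustrative), so the result does not depend on the locale.

```r
# The renaming rule is make.names(): invalid characters become "."
# and a name that would otherwise be invalid (for example, one that
# starts with a digit) gets an "X" prefix
make.names(c("11-month", "1~11"))

# Hypothetical ASCII stand-in for data.csv (illustration only)
csv_txt <- "id,region,11-month,1~11
1,A,35.22,499.8
2,B,10.41,161.37"

fixed <- read.csv(text = csv_txt)                       # names are repaired
raw   <- read.csv(text = csv_txt, check.names = FALSE)  # names kept verbatim

names(fixed)
names(raw)
raw[["1~11"]]   # non-syntactic names need [[ ]] or backticks
```

Keeping the original names is possible, but renaming them by hand, as done above, is usually easier to work with.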

Note: the examples below use one or the other of these two imported data sets, whichever is convenient.

attach

To make the columns of a data.frame easier to refer to, we can attach it:

attach(data)

Then you can directly refer to a column of data with the column name, for example:

print(X1)

There is no longer any need to index into data:

print(data[2])

After use, remember to detach:

detach(data)
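If a forgotten detach is a concern, with() is a common alternative that needs no cleanup. This is only a sketch: the data frame df and the variable ratio are made-up stand-ins for the imported data.

```r
# Hypothetical small data frame standing in for the imported data
df <- data.frame(Province = c("A", "B", "C"),
                 X1 = c(35.22, 10.41, 17.22),
                 X2 = c(499.8, 161.37, 273.29))

# with() evaluates the expression with the columns visible as variables,
# without changing the search path the way attach() does
ratio <- with(df, X1 / X2)

# Explicit forms of column access
df$X1        # a numeric vector
df[["X1"]]   # the same vector
df["X1"]     # a one-column data.frame, like data[2] above
```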

Mean, variance, standard deviation, coefficient of variation, skewness, kurtosis

Univariate

data: one variable, one "column" of data

x <- c(1, 2, 3, 4, 5)

Mean:

$$\overline{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

mean(x)

Variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})^2$$

var(x)

Standard deviation:
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})^2}$$

sd(x)

Coefficient of variation:
$$CV = \frac{s}{\overline{x}}$$

cv <- function(x) sd(x)/mean(x)

cv(x)

Note: the textbook expresses the coefficient of variation as a percentage: $CV = 100 \times \frac{s}{\overline{x}}\ (\%)$.
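As a quick sanity check, the formulas can be evaluated by hand and compared with the built-in functions. The names x_bar, s2, s and cv_x below are chosen only for this sketch.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

x_bar <- sum(x) / n                      # mean: 15 / 5 = 3
s2    <- sum((x - x_bar)^2) / (n - 1)    # variance: 10 / 4 = 2.5
s     <- sqrt(s2)                        # standard deviation
cv_x  <- s / x_bar                       # coefficient of variation

# The hand-written values agree with the built-in functions
all.equal(c(x_bar, s2, s), c(mean(x), var(x), sd(x)))

100 * cv_x   # the percentage form
```

Note the divisor n - 1: var and sd compute the sample (not population) variance and standard deviation.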

Skewness:
$$g_1 = \frac{n}{(n-1)(n-2)}\frac{1}{s^3}\sum_{i=1}^{n}(x_i - \overline{x})^3$$

g1 <- function(x) {
    n <- length(x)
    A <- n / ((n-1) * (n-2))
    B <- 1 / sd(x)^3
    S <- sum((x - mean(x))^3)
    A * B * S
}

g1(x)
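The sign of the skewness is easiest to see on small made-up vectors. The sketch below repeats g1 in compact form so that it runs on its own; sym and right are illustrative names.

```r
g1 <- function(x) {
    n <- length(x)
    n / ((n - 1) * (n - 2)) * sum((x - mean(x))^3) / sd(x)^3
}

sym   <- c(1, 2, 3, 4, 5)    # symmetric about its mean
right <- c(1, 2, 3, 4, 10)   # one large value stretches the right tail

g1(sym)      # 0: the cubed deviations cancel
g1(right)    # positive: right-skewed
g1(-right)   # negative: mirroring the data flips the sign
```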

Kurtosis:
$$g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\frac{1}{s^4}\sum_{i=1}^{n}(x_i - \overline{x})^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

g2 <- function(x) {  # kurtosis
    n <- length(x)
    A <- (n * (n+1)) / ((n-1) * (n-2) * (n-3))
    B <- 1 / sd(x)^4
    S <- sum((x - mean(x))^4)
    C <- (3 * (n-1)^2) / ((n-2) * (n-3))
    A * B * S - C
}

g2(x)
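A worked value helps to check the formula. The sketch below restates g2 compactly and adds a length guard, which is an addition for illustration and not part of the function above.

```r
g2 <- function(x) {
    n <- length(x)
    # (n - 3) appears in the denominators, so at least 4 values are needed
    stopifnot(n >= 4)
    A <- (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3))
    C <- (3 * (n - 1)^2) / ((n - 2) * (n - 3))
    A * sum((x - mean(x))^4) / sd(x)^4 - C
}

# For 1..5: A = 30/24 = 1.25, sum of 4th powers = 34, s^4 = 6.25, C = 8,
# so g2 = 1.25 * 34 / 6.25 - 8 = -1.2
g2(c(1, 2, 3, 4, 5))
```

A negative value such as this one indicates lighter tails than a normal distribution.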

Applying to a data.frame

Everything we imported is a data.frame, and a single column can be extracted from it to get a vector like x above:

x <- data[[2]]  # extract the second column of data

mean(x)

Calling this for every column one at a time is tedious, so here is a way to apply mean (or any other function) to every column of a data.frame at once, using the data imported from the first table above as an example:

apply(data[-1], MARGIN=2, FUN=mean)

Result:

[Figure: apply]

Explanation:

  • The first column of data is a label rather than a measurement (a character column in the CSV data), so its mean is meaningless: data[-N] removes the Nth column (R indexes from 1).
  • The second argument, MARGIN=2, processes the data column by column.
  • The third argument, FUN, is the function to apply, here mean. Variance works the same way: just change it to FUN=var, and so on.
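The arguments described above can be tried on a small stand-in table. The data frame df below copies the first four rows of the first table; sapply is shown as an alternative, not as something the original code uses.

```r
# Stand-in for the first imported table (first four rows only)
df <- data.frame(Year = 1978:1981,
                 Nationwide = c(184, 207, 236, 262),
                 Rural = c(138, 158, 178, 199),
                 Urban = c(405, 434, 496, 562))

num <- df[-1]   # drop the first column (Year)

apply(num, MARGIN = 2, FUN = mean)   # column means
apply(num, MARGIN = 2, FUN = var)    # column variances
apply(num, MARGIN = 1, FUN = mean)   # MARGIN = 1 works row by row instead

# sapply() loops over the columns of a data.frame directly
sapply(num, mean)
```

apply first converts the data.frame to a matrix, which is one more reason to drop non-numeric columns before calling it.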

All in one

For convenience, we can wrap all of these calculations in a single function:

describes <- function(df) {
    # TODO(CDFMLR): avoid repeated computation
    cv <- function(x) sd(x)/mean(x)  # coefficient of variation

    g1 <- function(x) {  # skewness
        n <- length(x)
        A <- n / ((n-1) * (n-2))
        B <- 1 / sd(x)^3
        S <- sum((x - mean(x))^3)
        A * B * S
    }

    g2 <- function(x) {  # kurtosis
        n <- length(x)
        A <- (n * (n+1)) / ((n-1) * (n-2) * (n-3))
        B <- 1 / sd(x)^4
        S <- sum((x - mean(x))^4)
        C <- (3 * (n-1)^2) / ((n-2) * (n-3))
        A * B * S - C
    }

    # one column of results per column of df
    res <- apply(df, 2,
                 function(x) c(mean(x), var(x), sd(x), cv(x), g1(x), g2(x)))

    # label the rows via row names; cbind()-ing a character column
    # would turn every number into a string
    rownames(res) <- c("mean", "variance", "sd", "cv", "skewness", "kurtosis")
    res
}

The argument is a data.frame; the function computes the mean, variance, and so on for each column (here, for the CSV data imported earlier):

describes(data[-1])

Result:

[Figure: apply_result]

Everything comes out at once, which is very convenient.

Using a package

Of course, these operations are already implemented by third-party packages, such as psych.

Install the package (run this directly in R):

install.packages("psych")

Load the package:

library(psych)

Then you can use what the package provides.

The package provides skewness and kurtosis, which were the hardest to write by hand above:

# the g1 and g2 above correspond to type=2: see help(skew)
g1 <- function(x) skew(x, type=2)
g2 <- function(x) kurtosi(x, type=2)  # help(kurtosi)

The package also provides a describe function that computes most of the earlier values at once (similar to our hand-written describes):

describe(data[-1], type=2)

Result:

[Figure: psych_describe]
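Whether the hand-written skewness really matches type=2 can be checked directly. The sketch below is guarded so that it does nothing when psych is not installed; g1_manual, x and same are names made up for this check.

```r
# Hand-written skewness, restated so the block runs on its own
g1_manual <- function(x) {
    n <- length(x)
    n / ((n - 1) * (n - 2)) * sum((x - mean(x))^3) / sd(x)^3
}

x <- c(2, 3, 5, 8, 13, 21)   # arbitrary test values

# Only run the comparison when psych is actually installed
if (requireNamespace("psych", quietly = TRUE)) {
    same <- isTRUE(all.equal(g1_manual(x), psych::skew(x, type = 2)))
    print(same)
}
```

The same pattern works for g2 against psych::kurtosi(x, type = 2).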

Median, upper and lower quartiles, quartile range

You can first compute the five-number summary: minimum, lower quartile, median, upper quartile, maximum:

fn <- apply(data[-1], 2, fivenum)
fn

[Figure: five]

Interquartile range:

R1 <- function(Q3, Q1) Q3 - Q1

R1(Q3=fn[4,], Q1=fn[2,])

[Math Time]

p-quantile:

$$M_p=\left\{\begin{array}{ll} x_{([np]+1)}, & np \textrm{ is not an integer}\\ \frac{1}{2}\left(x_{(np)}+x_{(np+1)}\right), & np \textrm{ is an integer} \end{array}\right.$$
where $x_{(k)}$ is the k-th smallest observation and $[np]$ is the integer part of $np$.

Upper and lower quartiles:
$$Q_3=M_{0.75}, \qquad Q_1=M_{0.25}$$
Interquartile range:
$$R_1=Q_3-Q_1$$

Note: in R, quantiles are computed with quantile; see help(quantile) for details.
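The definition of $M_p$ translates almost line by line into R. The function Mp below is a sketch written for this article, not a built-in; the tolerance 1e-8 for deciding whether np is an integer and the clamping at the ends are assumptions of the sketch.

```r
# p-quantile following the definition of M_p above
Mp <- function(x, p) {
    xs <- sort(x)            # order statistics x_(1) <= ... <= x_(n)
    n  <- length(xs)
    np <- n * p
    if (abs(np - round(np)) < 1e-8) {
        # np is an integer: average x_(np) and x_(np+1), staying inside 1..n
        k <- round(np)
        (xs[max(k, 1)] + xs[min(k + 1, n)]) / 2
    } else {
        # np is not an integer: take x_([np]+1)
        xs[floor(np) + 1]
    }
}

x <- c(7, 1, 5, 3, 8, 2, 6, 4)
Mp(x, 0.25)   # np = 2, so (x_(2) + x_(3)) / 2 = 2.5
Mp(x, 0.30)   # np = 2.4, so x_(3) = 3

# quantile() offers nine algorithms; type = 2 is the one that
# averages at the jumps in the same way
quantile(x, c(0.25, 0.5, 0.75), type = 2, names = FALSE)
```

The default type = 7 of quantile interpolates differently, so its results can differ slightly from this definition and from fivenum.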

With the upper and lower quartiles and the interquartile range, you can identify outliers.

Definition: the lower and upper cutoff points are

$$Q_1-1.5R_1,\qquad Q_3+1.5R_1$$

Data greater than the upper cutoff or smaller than the lower cutoff are considered outliers:

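abnormal <- function(x) {
    fn <- fivenum(x)
    Q1 <- fn[2]  # lower quartile
    Q3 <- fn[4]  # upper quartile

    R1 <- Q3 - Q1  # interquartile range

    QD <- Q1 - 1.5 * R1  # lower cutoff
    QU <- Q3 + 1.5 * R1  # upper cutoff

    x[(x < QD) | (x > QU)]
}
apply(data[-1], 2, abnormal)
# an empty result means there are no outliers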
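A small made-up vector shows the rule at work. The sketch restates abnormal compactly so that it runs on its own, and cross-checks it against boxplot.stats from base R.

```r
abnormal <- function(x) {
    fn <- fivenum(x)
    R1 <- fn[4] - fn[2]
    x[(x < fn[2] - 1.5 * R1) | (x > fn[4] + 1.5 * R1)]
}

x <- c(10, 12, 11, 13, 12, 11, 50)   # hypothetical data with one extreme value

fivenum(x)    # 10.0 11.0 12.0 12.5 50.0
abnormal(x)   # 50: above the upper cutoff 12.5 + 1.5 * 1.5 = 14.75

# boxplot.stats() applies the same 1.5 rule to the same hinges
boxplot.stats(x)$out
```

When columns contain different numbers of outliers, apply returns a list rather than a matrix, since the results no longer have equal length.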

Data distribution plots

Stem-and-leaf plot

stem(Nationwide)

[Figure: stem]

Histogram

The easiest way is to call hist(x) directly, but we can make it look better.

Wrap it in a function:

histogram <- function(x, xname="x") {
    hist(x, prob=TRUE, main=paste("Histogram of" , xname))
    lines(density(x))
    rug(x) # show the actual data points
}

Call it:

histogram(X1, "X1")

Result:

[Figure: hist]
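The prob=TRUE argument matters because the density curve is drawn on a scale where the total area is 1, and the bars have to use that scale too. A sketch with simulated data (the sample x and the seed are arbitrary) makes this visible without drawing anything:

```r
set.seed(1)                           # arbitrary seed, for reproducibility only
x <- rnorm(200, mean = 50, sd = 10)   # hypothetical sample

# plot = FALSE returns the bin information without drawing
h <- hist(x, plot = FALSE)

sum(h$counts)                      # all 200 observations are binned
sum(h$density * diff(h$breaks))    # bar areas add up to 1
```

With the default frequency scale the bars would be counts, and the curve from density(x) would sit almost flat along the bottom of the plot.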

Empirical Distribution Function Plot

Wrap it in a function:

plot_ecdf <- function(x, xname="x") {
    # empirical distribution function as a step plot
    plot(ecdf(x), do.points=FALSE, verticals=TRUE, main=paste("ecdf(" , xname, ")"))

    # overlay the normal CDF with the same mean and standard deviation
    xs <- seq(min(x), max(x), 1/sqrt(length(x)))
    lines(xs, pnorm(xs, mean=mean(x), sd=sd(x)), lty=3, col="red")
}

Note: for xs I chose a step of $\frac{1}{\sqrt{n}}$. This spacing suits my data (the curve is neither too sparse nor too dense), and it can be changed freely.

Call it:

plot_ecdf(X1, "X1")

Result:

[Figure: ecdf]
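It may help to see what ecdf returns before plotting it: the result is itself a function giving the proportion of observations less than or equal to its argument. The vector x and the name Fn are illustrative.

```r
x <- c(3, 1, 4, 1, 5)
Fn <- ecdf(x)   # Fn is itself a function

Fn(1)     # 2 of 5 values are <= 1, so 0.4
Fn(3.5)   # 3 of 5 values are <= 3.5, so 0.6
Fn(5)     # 1 at and beyond the maximum

# The same value straight from the definition
mean(x <= 3.5)
```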

Normal QQ plot

qqnorm(X1)

[Figure: qqnorm]
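A QQ plot is read by checking how closely the points follow a straight line. The sketch below uses simulated normal data (an assumption, chosen so that the points should be nearly linear) and extracts the plotted coordinates instead of drawing them.

```r
set.seed(1)                           # arbitrary seed
x <- rnorm(100, mean = 50, sd = 10)   # hypothetical sample

# plot.it = FALSE returns the plotted coordinates instead of drawing
qq <- qqnorm(x, plot.it = FALSE)

# For normal data the points lie close to a straight line, so the
# theoretical and sample quantiles are very strongly correlated
cor(qq$x, qq$y)

# When drawing, qqline() adds a reference line through the quartiles:
# qqnorm(x); qqline(x, col = "red")
```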

Pearson and Spearman correlation coefficient

Pearson correlation coefficient

2D population: $(X,Y)^T$

Observed data: $(x_1,y_1)^T,(x_2,y_2)^T,\cdots,(x_n,y_n)^T$

Let: $\overline{x}=\frac{1}{n}\sum_{i=1}^{n}x_i,\quad \overline{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$

Variances of the observed data of X and Y:
$$s_{xx}=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\overline{x})^2 \quad s_{yy}=\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\overline{y})^2$$
Covariance of the observed data of X and Y:
$$s_{xy}=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})$$
(Note: the covariance matrix is $S=\left[\begin{matrix}s_{xx} & s_{xy} \\ s_{yx} & s_{yy}\end{matrix}\right]$, where $s_{yx}=s_{xy}$.)

Pearson correlation coefficient:
$$r_{xy}=\frac{s_{xy}}{\sqrt{s_{xx}}\sqrt{s_{yy}}}$$

This value satisfies $|r_{xy}|\le 1$ and measures the degree of linear correlation between X and Y:

  • $r_{xy}\rightarrow 1$: positive correlation
  • $r_{xy}\rightarrow 0$: no linear correlation
  • $r_{xy}\rightarrow -1$: negative correlation

To compute the correlation coefficient in R, call cor(x, y, method="pearson") to get the value of $r_{xy}$ directly. You can also use the following function, which outputs more information:

cor.test(X1, X2, method="pearson")

Output ($\textrm{cor}=r_{xy}$):

[Figure: pearson]
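The definitions of $s_{xx}$, $s_{yy}$, $s_{xy}$ and $r_{xy}$ can be checked against cov and cor. The vectors below are the first five Nationwide and Rural values from the first table; the variable names are chosen for this sketch.

```r
# First five Nationwide / Rural values from the first table
x <- c(184, 207, 236, 262, 284)
y <- c(138, 158, 178, 199, 221)
n <- length(x)

s_xx <- sum((x - mean(x))^2) / (n - 1)
s_yy <- sum((y - mean(y))^2) / (n - 1)
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

r_xy <- s_xy / (sqrt(s_xx) * sqrt(s_yy))

# Cross-checks against the built-in functions
all.equal(s_xy, cov(x, y))
all.equal(r_xy, cor(x, y, method = "pearson"))

cov(cbind(x, y))   # the covariance matrix S
```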

[Math Time] The hypothesis test behind the output above:

Let the distribution function of the two-dimensional population $(X,Y)^T$ be $F(x,y)$.

Population correlation coefficient: $\rho_{XY}=\frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}}$.

When $n$ is sufficiently large, $\rho_{XY} \approx r_{xy}$.

The problem is:

  • $r_{xy}$ can always be computed from any observed data, and it is generally not 0.
  • But if X and Y are uncorrelated in the population ($\rho_{XY}=0$), then using $r_{xy}$ to measure the correlation between X and Y is meaningless.

So we run a hypothesis test:
$$H_0:\rho_{XY}=0 \quad \leftrightarrow \quad H_1:\rho_{XY}\ne 0$$
If the population is bivariate normal, then when $H_0$ is true the statistic
$$t=\frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}} \sim t(n-2)$$
Denote the value of $t$ computed from the observed data by $t_0$. The p-value of the test is then
$$p=P_{H_0}(|t|\ge|t_0|)=P(|t(n-2)|\ge|t_0|)$$
Given a significance level $\alpha$, if $p<\alpha$ we reject $H_0$ and conclude that X and Y are correlated, so $r_{xy}$ can be used to measure the correlation.
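The statistic and the p-value can be recomputed by hand and compared with what cor.test reports. The data below are made up for illustration, and 0.05 is just an example significance level.

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 1, 4, 3, 7, 8, 6, 9)
n <- length(x)

r  <- cor(x, y)
t0 <- r * sqrt(n - 2) / sqrt(1 - r^2)                    # observed statistic
p  <- 2 * pt(abs(t0), df = n - 2, lower.tail = FALSE)    # two-sided p-value

res <- cor.test(x, y, method = "pearson")
all.equal(unname(res$statistic), t0)
all.equal(res$p.value, p)

p < 0.05   # TRUE means H0 is rejected at the 5% level
```

The factor 2 reflects the two-sided alternative $H_1:\rho_{XY}\ne 0$.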

Spearman correlation coefficient

The Spearman coefficient is a rank correlation coefficient.

Sample rank: sort the observations $x_1,x_2,\cdots,x_n$ from smallest to largest; the rank $R_i$ of $x_i$ is its position in the sorted order.

e.g.
$$\begin{array}{rrrrr} x_i: & 7 & -3 & -1 & 5 \\ R_i: & 4 & 1 & 2 & 3 \end{array}$$

Let:

  • the ranks of $x_1,x_2,\cdots,x_n$ be $R_1,R_2,\cdots,R_n$
  • $\overline{R}=\frac{1}{n}\sum_{i=1}^{n}R_i=\frac{1}{n}\sum_{i=1}^{n}i=\frac{n+1}{2}$
  • the ranks of $y_1,y_2,\cdots,y_n$ be $S_1,S_2,\cdots,S_n$
  • $\overline{S}=\frac{1}{n}\sum_{i=1}^{n}S_i=\frac{n+1}{2}$

Then the Spearman correlation coefficient is defined as:
$$q_{xy}=\frac{\sum_{i=1}^{n}(R_i-\overline{R})(S_i-\overline{S})}{\sqrt{\sum_{i=1}^{n}(R_i-\overline{R})^2}\sqrt{\sum_{i=1}^{n}(S_i-\overline{S})^2}}$$
Compute it in R ($\textrm{rho}=q_{xy}$ in the output):

cor.test(X1, X2, method="spearman")

[Figure: spearman]
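The rank-based definition can be verified step by step with rank. The x and y values are the first five rows of the CSV data above; R, S and q_xy follow the notation of the formula.

```r
rank(c(7, -3, -1, 5))   # 4 1 2 3, as in the example above

# First five rows of the CSV data
x <- c(35.22, 10.41, 17.22, 10.70, 10.29)
y <- c(499.80, 161.37, 273.29, 134.79, 90.92)

R <- rank(x)
S <- rank(y)

q_xy <- sum((R - mean(R)) * (S - mean(S))) /
    (sqrt(sum((R - mean(R))^2)) * sqrt(sum((S - mean(S))^2)))

all.equal(q_xy, cor(x, y, method = "spearman"))
all.equal(q_xy, cor(R, S))   # Spearman is Pearson applied to the ranks
```

Because only the ranks enter the formula, the coefficient is unchanged by any monotonically increasing transformation of x or y.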

As before, this comes with a hypothesis test:
$$H_0:\rho_{XY}=0 \quad \leftrightarrow \quad H_1:\rho_{XY}\ne 0$$


【EOF】

That's all for now. I've been busy lately; if I have time later, I may write about regression analysis, analysis of variance, and so on, which would make a complete set.

CDFMLR 2021.06.07


[PS 2021.07.15] I really don't have time to write, so this feels like yet another unfinished series.

(If anyone wants to see the follow-up, leave a comment.)

In fact, the code for the whole data analysis is basically all in github.com/cdfmlr/daex, which you can refer to if needed.

Origin blog.csdn.net/u012419550/article/details/118760927