R language - a painstaking summary of basic knowledge

Plan

I plan to cover the R language in three parts.

Preface

Since this is my first time learning R, there will inevitably be shortcomings; please bear with me. This document will be updated over time, and some personal thoughts will be added.

If you have any questions, you can comment directly or contact me. Of course, the prerequisite for a reply is that you give this post a like!

In fact, R is not the focus, let alone the end goal. R is much friendlier than Python, C++, and Java, right? There is no exam for R; it just asks you to solve problems.

The point is, you have to have an idea: you need to know what problem you are facing and look for solutions with that problem in mind. For example:

If you want to process Excel files, you first have to open them, so you can search Baidu for "how to open Excel in R". If you run into problems, you can solve them one by one. There are many methods online.

The correct solution: Aaron, why are the csv files garbled when I write them? (Wow, Aaron can handle another ERROR)

Wrong solution: Aaron, I don’t know how to do this homework. Please help me write it. . . (No one calls Aaron unless you give money❤️)

Special thanks to: Teacher Yang Qingyong from the School of Information

Reference: Teacher Yang Qingyong PPT

What can R language do?

Ask this, and most people just say: drawing plots. Some people will also say: analyzing data.

Here comes the question: do you think the R course is an easy filler course? I used to think so, and felt that R was not important, but later I discovered that it was fun.

In fact, it comes down to the teacher: follow the right person and you will do the right things. If you follow the wrong person, it will be difficult to do the right thing.

First of all, R is not limited to plotting. Of course, it is great for plotting, at least much better than Python.

Secondly, there are many packages in R, and there are many written functions that we can call, which is very convenient for data analysis and even machine learning. It is also very convenient for later testing and calling models.

I think R is not the key. The key is theoretical knowledge, especially the knowledge of statistical analysis. R is just a tool, and the problem is the most fundamental. There is a famous saying in the computer industry:

Language is just a tool, the problem is the most fundamental.

Basic operations

How to install and load packages in R

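1. Install a package: enter this at the command line

install.packages("package_name")
# There are also tool-based ways to install packages. This command installs most packages; for the few that fail, a solution can usually be found on Baidu.

2. Load the package: enter this at the command line or in your code

library(package_name)  # no quotes needed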

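Viewing help

?solve  # a single "?" opens the help page for a function
example(solve)  # example() runs the usage examples for the function
help(solve)  # opens the help page for solve()

Other

help.start()  # browse all help documentation, or go to http://127.0.0.1:27003/doc/html/index.html (the port may differ on your machine)
??solve  # two question marks "??" search the documentation for a keyword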

output

print() cannot take a sep argument, but cat() can.

print("我爱帅帅龙")
cat("我爱", "帅帅龙", sep = "love")  # prints 我爱love帅帅龙

String concatenation

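paste() joins its arguments and returns the concatenated string. A separator can be set with sep (the default is a single space), and non-string values are automatically converted to strings.

a = paste("我爱", "帅帅龙", 1, "万年")  # gives "我爱 帅帅龙 1 万年"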
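As a small sketch of the separator behaviour described above (the variable names s1 to s4 are made up for illustration), here is how sep, paste0() and collapse affect the result:

```r
# Default separator is a single space; the number 4 is coerced to a string
s1 <- paste("R", 4, "ever")

# A custom separator between the arguments
s2 <- paste("a", "b", "c", sep = "-")

# paste0() is paste() with sep = "", and it is vectorized over 1:3
s3 <- paste0("X", 1:3)

# collapse joins the elements of one vector into a single string
s4 <- paste(c("a", "b"), collapse = "+")

print(s1)  # "R 4 ever"
print(s2)  # "a-b-c"
print(s3)  # "X1" "X2" "X3"
print(s4)  # "a+b"
```

Note that sep goes between different arguments, while collapse goes between the elements of the result vector.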

Other common operations

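# Ctrl+L (keyboard shortcut) clears the console
rm(list = ls())  # remove all objects from the workspace
getwd()  # show the working directory
setwd("path/to/dir")  # set the working directory (replace with your own path)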

test code

draw chinese heart

library("fun")
library("rgl")

demo("ChinaHeart2D")
demo("ChinaHeart3D")

word cloud

library(wordcloud2)
wordcloud2(demoFreq)
wordcloud2(demoFreqC)

basic grammar

Assignment

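a = 10  # my personal preference; recent versions handle it fine, so don't overthink it, just go for it
b <- 10  # the conventional assignment; the arrow shows data flowing into the variable, and it can also be written as 10 -> b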

Create irregular vectors

Don’t worry about what a vector is, just treat it as a container, similar to python’s list

a = c("我","爱","帅帅龙")

Create vectors with certain rules

seq generates a sequence; rep stands for repeat.

x <- seq(1, 10, by = 0.5)  # gives 1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0
x <- seq(1, 10, length.out = 21)  # 21 equally spaced numbers from 1 to 10
x <- rep(2:5, 2)  # gives 2 3 4 5 2 3 4 5
x <- rep(2:5, rep(2, 4))  # gives 2 2 3 3 4 4 5 5

Create a continuous vector of numbers

a = c(1:5)  # gives 1 2 3 4 5

operator

In fact, there is nothing interesting to read from here on. It is recommended not to read this part. Just search it again when you encounter problems or look back.

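Arithmetic operators
+  -  *  /
^  # exponentiation
%%  # remainder (modulo)
%/%  # integer division

Relational operators
>  <  ==  !=  >=  <=

Logical operators
&  |  !  # element-wise, for vectors
&&  ||  # for single values, as in if conditions

Other operators
:  # colon operator, creates a vector of consecutive numbers
%in%  # tests whether elements are in a vector; returns TRUE if so, FALSE otherwise
%*%  # matrix multiplication, e.g. A %*% t(A) multiplies a matrix by its transpose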

Math functions

Some common mathematical functions are:

Function    Description
sqrt(n)     square root of n
exp(n)      the natural constant e raised to the nth power
log(m, n)   logarithm of m to base n, i.e. the power to which n must be raised to get m
log10(m)    equivalent to log(m, 10)

Note that round() in R does not always round a 5 up. It follows the "round half to even" rule: when the digit before the 5 is even, the 5 is dropped, so round(2.5) gives 2 while round(3.5) gives 4. Floating-point representation can also affect the result.

Function       Meaning
round(n)       round n to the nearest integer
round(n, m)    round n to m decimal places
ceiling(n)     round n up
floor(n)       round n down
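To make the rounding rule concrete, here is a minimal sketch (the variable names are only for illustration) comparing round(), ceiling() and floor() on a few values:

```r
# Halves are rounded to the nearest even integer
r_half <- round(c(0.5, 1.5, 2.5, 3.5))

# Round to two decimal places
r_dec <- round(3.14159, 2)

# Always up / always down
up <- ceiling(2.1)
down <- floor(2.9)

print(r_half)  # 0 2 2 4
print(r_dec)   # 3.14
print(up)      # 3
print(down)    # 2
```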

Trigonometric functions

Omitted.

Generate uppercase and lowercase letters

a = letters[1:4]  # letters is the built-in vector of lowercase letters; this takes the first four
b = LETTERS[1:4]  # LETTERS is the built-in vector of uppercase letters

Convert missing values to 0

x[is.na(x)] = 0  # is.na(x) selects the missing positions, which are then overwritten with 0

Common constants

  • 26 uppercase letters LETTERS
  • 26 lowercase letters letters
  • Month abbreviation month.abb
  • Month name month.name
  • π value pi

Interchange numbers and strings

Convert string to numeric type

as.integer("12.3")  # string to integer (the fractional part is dropped), gives 12
as.double("11.666")  # string to double, gives 11.666

Convert numeric type to text

Use paste()

a = paste(1)  # gives "1"

formatC() formats values as strings

formatC(1/3, format = "e", digits = 4)  # digits is the number of digits after the decimal point; gives "3.3333e-01"
formatC(1/3, format = "f", digits = 4)  # gives "0.3333"

as.character()

a=as.character(66)

process control

if statement

x <- 50L
if(is.integer(x)) {
   print("x is an integer")
} else {
   print("x is not an integer")
}

switch is not covered here.

while loop

a = 1
while (a < 5) {
    print('hello')
    a = a + 1  # update the counter, otherwise the loop never ends
}

for loop

I find for loops in R particularly awkward to work with. They just are, and I can't say why.

a = c(1:4)

for(i in a){
    print(i)
}

repeat loop

repeat keeps looping until a break is reached.

a = 1
sum = 0
repeat{
  if(sum>10){
    break  # break ends the loop; next skips to the next iteration, like continue in C++ or Python
  }
  sum=sum+a
  a=a+1
}
print(sum)

Common data structures

Brother Meng, let’s get to the point! ! !

Vector:c()

So what exactly is a vector? It can be loosely thought of as a Python list. However, R also has its own list type, called list, which can store elements of different types.

Features:

  • Only one type of element can be stored. If there are numbers and strings, they will be automatically converted to strings.
  • You can use index to get elements (index starts from 1)
  • You can slice out a fragment with a range such as x[2:4]; both endpoints are inclusive.

Create vector

The basic grammar has already been mentioned before.

Add value using append

good_sample_p <- append(good_sample_p, p)  # append p to the end of the existing vector good_sample_p

Vector addition, subtraction, multiplication and division operations

One interesting thing is the recycling rule for vectors. Take a=c(1,2,3) and b=c(4,5): computing a+b produces a warning (the longer length is not a multiple of the shorter one), but not an error.

Take a+b as an example: it is actually (1+4, 2+5, 3+4), because the shorter vector b is reused from its start. That is what recycling means.
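The recycling rule can be checked directly. This is a small sketch using the same a and b as above; the second pair of vectors is my own example, chosen so that the lengths divide evenly:

```r
a <- c(1, 2, 3)
b <- c(4, 5)

# Lengths 3 and 2 do not divide evenly, so R warns; the warning is
# suppressed here only to keep the output clean
s <- suppressWarnings(a + b)

# Lengths 4 and 2 divide evenly, so the short vector is recycled silently
t2 <- c(1, 2, 3, 4) + c(10, 20)

print(s)   # 5 7 7
print(t2)  # 11 22 13 24
```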

Some commonly used functions

  • sqrt(x), log(x), exp(x), sin(x), cos(x), tan(x), abs(x) compute the square root, natural logarithm, exponential, trigonometric functions and absolute value, element by element.
  • sort(x, decreasing=FALSE) returns the elements of x sorted from small to large.
  • order(x) returns the vector of subscripts that arranges x from small to large
  • sort(x) is equivalent to x[order(x)]
  • numeric(n): a zero vector of length n
  • all(log(10 * x) > x): tests whether all elements of a logical vector are TRUE
  • any(log(10 * x) > x): tests whether at least one element is TRUE
  • is.na(c(1, 2, NA)): tests whether each element is a missing value

logical vector

Vectors can take on logical values, such as

y <- c(TRUE, TRUE, FALSE)
x = c(1, 4, 6.25)
y = x > 3
# the value of y is
[1] FALSE  TRUE  TRUE

Two vectors can also be compared

x = c(1, 4, 6.25)
log(10 * x)
[1] 2.302585 3.688879 4.135167
log(10 * x) > x
[1]  TRUE FALSE FALSE
Comparison operators: <, <=, >, >=, == (equal), != (not equal)
Logical vectors can be combined with and (&) [both conditions hold] and or (|) [either one holds].

You can also force the logical value to be converted into an integer value, such as: TRUE becomes 1, FALSE becomes 0,

x = c(1, 4, 6.25)

c(0, 1)[(x > 3) + 1]  # this line is explained below
[1] 0 1 1

(x>3)+1
[1] 1 2 2

I will explain this line of code c(0, 1)[(x > 3) + 1] here.

(x > 3) will get the logical vector [F ,T ,T]

(x>3)+1 will force the logical value to an integer value to get [1,2,2]

Then use it as the index of the previous vector, c(0,1), to get [0,1,1]

character vector

That is, there are characters in the vector (Isn’t this explanation very straightforward?)

a = c("我爱",'帅帅龙')  # if strings and numbers appear together, the numbers are converted to strings

The paste function concatenates its arguments into a string, with the given separator in between. String concatenation was already covered above.

complex vector

It’s not used much, so why don’t we just,,, not introduce it, okay?

vector index

Brother cute! Wake up, this is very important!

Vector subscripts in R start from 1, which is consistent with common statistical and mathematical software. In C, Python and many other programming languages, subscripts start from 0!

Don't be misled by Python: in R, a negative index says which element to leave out.

x = c(42, 7, 64, 9)
x[2]  # access the 2nd element
x[3] = -1  # change the value of the 3rd element
x[-4]  # everything except the 4th element (x itself is not changed)
x[x < 10]  # select the elements of x that are less than 10
x[c(1, 4)]  # index with a vector: the 1st and 4th elements. Isn't that magical?

When defining a vector, you can add names to the elements.

ages <- c(Li = 23, Zhang = 33, Wang = 45)
# ages is
   Li Zhang  Wang 
   23    33    45 
# elements can be accessed the usual way, or by name
ages["Zhang"]
# names can also be added after the vector is defined
age1 = c(21, 34, 56)
names(age1) = c("Zhang", "Ding", "Liu")
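Here is a short sketch of the indexing rules above, using the same x and ages vectors. It shows in particular that a negative index returns a shortened copy and leaves the original untouched (the names y, small and picked are only for illustration):

```r
x <- c(42, 7, 64, 9)

# Negative index: a copy without the 4th element; x keeps all 4 values
y <- x[-4]

# Logical index: keep only the elements that satisfy the condition
small <- x[x < 10]

# Named vector: index by one or several names
ages <- c(Li = 23, Zhang = 33, Wang = 45)
picked <- ages[c("Li", "Wang")]

print(y)          # 42 7 64
print(length(x))  # 4
print(small)      # 7 9
print(picked)     # Li 23, Wang 45
```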

matrix:matrix

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

  • data is the data of the matrix, usually a vector
  • nrow is the number of rows, ncol is the number of columns
  • byrow = TRUE fills the matrix row by row (1, 2, 3, 4 run across the first row); otherwise it is filled column by column (1, 2, 3 run down the first column).

Create matrix

matrix(1:12,ncol=4,byrow=TRUE)
# result
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

Commonly used functions

  • head(a,10): view the first 10 rows of the matrix

  • tail(a,10): view the last 10 rows of the matrix

  • cbind(): combine by columns (side by side, left and right)

  • rbind(): combine by rows (stacked, up and down)

  • c(A): flattens A into a vector, column by column

  • det(A): the determinant

  • solve(A): the inverse

  • eigen(A): eigenvalues and eigenvectors

B=rbind(c(1,2),c(3,4))
C=cbind(c(11,12),c(13,14))
D=rbind(B,C)
E=cbind(B,C)
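To see what these calls produce, here is a minimal sketch that reuses B, C, D and E from above and checks their shapes, along with det() and solve() (B_inv is a name I introduce for illustration):

```r
B <- rbind(c(1, 2), c(3, 4))      # rows are (1,2) and (3,4)
C <- cbind(c(11, 12), c(13, 14))  # columns are (11,12) and (13,14)

D <- rbind(B, C)  # stacked vertically: 4 rows, 2 columns
E <- cbind(B, C)  # placed side by side: 2 rows, 4 columns

d <- det(B)        # 1*4 - 2*3
B_inv <- solve(B)  # inverse of B

print(dim(D))  # 4 2
print(dim(E))  # 2 4
print(d)       # -2
print(B %*% B_inv)  # the identity matrix, up to floating-point rounding
```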

Matrix Operations

Matrix arithmetic works like vector arithmetic: +, -, * and / are applied element by element.

The operands generally have the same shape. A vector and a matrix of different shapes can also be combined with the four arithmetic operations: the matrix is treated as a vector straightened by columns, and the operation is applied to corresponding elements (recycling the vector if needed).
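A small sketch of the matrix-plus-vector rule, with a matrix and vector I made up for illustration. Because the matrix is straightened by columns, a vector whose length equals the number of rows ends up being applied row by row:

```r
A <- matrix(1:6, nrow = 2)  # columns are (1,2), (3,4), (5,6)
v <- c(10, 20)

# A by columns is 1 2 3 4 5 6; v is recycled to 10 20 10 20 10 20
S <- A + v

# Element-wise product, not matrix multiplication (that would be %*%)
P <- A * A

print(S)
#      [,1] [,2] [,3]
# [1,]   11   13   15
# [2,]   22   24   26
```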

Access matrix elements and submatrices

  • A[2,3]  # accesses the (2,3) element, which is 7 for the matrix created above
  • A[i,]  # accesses the i-th row; A[,j]  # accesses the j-th column
  • A[,c(1,2,3)]  # the first three columns
  • A[,c('name1','name2')]  # columns selected by name

Rename the row and column labels of the matrix.

rownames(A)  <- c("a", "b", "c")
colnames(A) <- paste("X", 1:4, sep="")

apply function

If you want to perform some calculation on a certain row (column) of a matrix, you can use the apply function: apply(x, margin, fun, …)

x represents the matrix, margin=1 represents calculation for each row, margin=2 represents calculation for each column, and fun is the function used for calculation.

apply(A, 1, sum)
apply(A, 2, mean)
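As a runnable sketch of apply(), assuming A is the 3x4 matrix created earlier with matrix(1:12, ncol=4, byrow=TRUE); the last call passes a custom function, which is my own addition:

```r
A <- matrix(1:12, ncol = 4, byrow = TRUE)

row_sums <- apply(A, 1, sum)    # one value per row
col_means <- apply(A, 2, mean)  # one value per column

# fun can also be a function you write yourself: here, the range of each row
row_range <- apply(A, 1, function(r) max(r) - min(r))

print(row_sums)   # 10 26 42
print(col_means)  # 5 6 7 8
print(row_range)  # 3 3 3
```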

factor: factor

factor(x, levels = sort(unique(x), na.last = TRUE), labels, exclude = NA, ordered = FALSE)

Used to encode a vector into a factor

Create factors

sex = c("M","F","M","M","F")
sexf = factor(sex);sexf

Commonly used functions

  • is.factor() checks whether the object is a factor
  • as.factor() converts vectors into factors
  • levels(x) can get the levels of the factors
  • table(x) counts the frequency of various types of data

tapply() function

tapply(x, INDEX, FUN=NULL,…,simplify=TRUE)

  • x is an object, usually a vector
  • INDEX is a factor with the same length as X
  • FUN is the function to be calculated

Given the gender and the height of 5 students, find the average height of each group.

sex = c("M","F","M","M","F")
height = c(174, 165, 180, 171, 160)
tapply(height, sex, mean)  # gives F 162.5 and M 175.0

gl() function

gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)

gl() can be used to conveniently generate factors

  • n is the number of levels
  • k is the number of repetitions
  • length is the length of the result
  • labels is an n-dimensional vector representing factor levels
  • ordered is a logical variable indicating whether it is an ordered factor. The default value is FALSE.
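A minimal sketch of gl() with the parameters listed above; the labels "low" and "high" are made up for illustration:

```r
# 2 levels, each repeated 3 times in a row, with custom labels
f1 <- gl(2, 3, labels = c("low", "high"))

# 2 levels, each repeated once, cycled until the result has length 6
f2 <- gl(2, 1, length = 6)

print(f1)  # low low low high high high
print(f2)  # 1 2 1 2 1 2
```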

List: list

Create list

rec <- list(name="黎明", age=30, scores=c(85,76,90));rec
# result
$name
[1] "黎明"

$age
[1] 30

$scores
[1] 85 76 90

Quoting and modifying lists

List elements can be referenced with "list name[[subscript]]". Unlike vectors, only one element can be referenced at a time this way. For example, rec[[1:2]] does not return the first two elements.

rec <- list(name="黎明", age=30, scores=c(85,76,90));rec
rec[[2]]  # gives 30
rec[[3]][2]  # the second value of the third element, i.e. 76
# if the elements have names, a name can also be used as the subscript
rec$age
rec[["age"]]
rec[[2]]=11  # change 30 to 11

Note: "list name[subscript]" and "list name[subscript range]" are also allowed, but the meaning differs from the above: the result is still a list.
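The difference between single and double brackets is easy to check. Here is a small sketch using the same rec list (as_list, value and first_two are names I introduce for illustration):

```r
rec <- list(name = "黎明", age = 30, scores = c(85, 76, 90))

as_list <- rec[2]      # single brackets: a list of length 1
value <- rec[[2]]      # double brackets: the element itself
first_two <- rec[1:2]  # a sub-list with the first two elements

print(class(as_list))    # "list"
print(class(value))      # "numeric"
print(names(first_two))  # "name" "age"
```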

Data frame: data.frame

Many important points

A data frame looks like a matrix of data, but unlike a matrix, its columns can be of different types. Each column of the data frame is a variable and each row is an observation.

This is similar to the DataFrame in python's pandas.

Generate data frame

d = data.frame(name=c('黎明','周杰伦','刘德华'),age=c(30,35,28),height=c(180,175,173))
# the value of d
    name age height
1   黎明  30    180
2 周杰伦  35    175
3 刘德华  28    173

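as.data.frame(x) converts a list x into a data frame. The list elements become the columns, so they should have the same length, and it is best to give them names because a list does not have to have any.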

Data frame reference

d = data.frame(name=c('黎明','周杰伦','刘德华'),age=c(30,35,28),height=c(180,175,173))
d[1:2, 2:3]  # the first two rows, columns 2 and 3
d[["age"]]  # the age column
# this is equivalent to d$age; likewise
d$height  # the height column
rownames(d) = c("one", "two", "three")  # rows can be given names too (a row index)

Modify value

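d$name[1] = "我爱你"  # set the first value of name to "我爱你" (commonly used)
d[1,2] = "女"  # set row 1, column 2 to "女" (commonly used); note that this turns the numeric age column into strings
d[[1]][2] = "我爱你"  # set the second value of the first column to "我爱你"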

Add and delete rows and columns

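d = df1[-2,]  # drop row 2 (df1 is an existing data frame)
d = df1[,-3]  # drop column 3
d = df1[-c(1,3),]  # drop rows 1 and 3
d$r = d$age/d$height  # add a column r, here computed from the age and height columns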

attach() function

R provides the function attach(), which adds a data frame to the search path so that its columns can be used like ordinary variables. Instead of writing d$height or d[["age"]], you can just write age directly. Isn't it cool?

d = data.frame(name=c('黎明','周杰伦','刘德华'),age=c(30,35,28),height=c(180,175,173))
attach(d)
r <- age/height  # modifying r does not affect the data in d
# the value of r
[1] 0.1666667 0.2000000 0.1618497
detach(d)  # detach it again

merge()

Merge multiple data frames into one data frame

merge(data1, data2, by='ID')
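A minimal sketch of merge() on two small data frames I made up for illustration (the ID, name and score columns are hypothetical). By default only the IDs present in both frames are kept; all.x = TRUE keeps every row of the first one:

```r
data1 <- data.frame(ID = c(1, 2, 3), name = c("A", "B", "C"))
data2 <- data.frame(ID = c(2, 3, 4), score = c(90, 85, 70))

# Inner join: only IDs 2 and 3 appear in both
inner <- merge(data1, data2, by = "ID")

# Left join: all rows of data1; score is NA where data2 has no match
left <- merge(data1, data2, by = "ID", all.x = TRUE)

print(inner)
print(left)
```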

Exception handling tryCatch()

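It works much like in most programming languages, so I won't go into details here. What you need to know is that there is another function called withCallingHandlers().

withCallingHandlers() is a variant of tryCatch() that runs its handlers in a different context: the handler is called at the point where the condition was signaled, so execution can continue afterwards, whereas tryCatch() exits the failing code. It is rarely used, but it is very useful.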
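Since the details are skipped above, here is a rough sketch of both functions. The helper safe_log() is a hypothetical example of mine, not something from the course material:

```r
# tryCatch: on a warning or an error, leave the expression and
# return the handler's value instead
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,  # e.g. log(-1) warns "NaNs produced"
    error = function(e) NA_real_     # e.g. log("a") is an error
  )
}

# withCallingHandlers: handle the warning where it occurs, then resume
res <- withCallingHandlers(
  {
    warning("careful")
    "finished"  # still reached, because the handler resumes execution
  },
  warning = function(w) invokeRestart("muffleWarning")
)

print(safe_log(10))
print(safe_log(-1))   # NA
print(safe_log("a"))  # NA
print(res)            # "finished"
```

With tryCatch() in place of withCallingHandlers(), the second expression would stop at the warning and never reach "finished".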

Reading and saving data

Read txt: read.table()

read.table("filename.txt")

Read xlsx: read.xlsx()

You need to install the xlsx package first, and then load it with library(xlsx)

data <- read.xlsx("filename.xlsx", n)  # n is the index of the sheet to read

Save csv: write.csv()

write.csv(data, file = "filename.csv")

Save xlsx: write.xlsx()

write.xlsx(data, "data.xlsx", sheetName = "sheet1")

Save as image or pdf file in R

Take png as an example

png(file="myplot.png", bg="transparent")  # with no directory given, the file goes into getwd()

# your plotting code goes here

dev.off()  # remember to close the device

# a complete example
png(file="myplot.png")
plot(1:10)
rect(1, 5, 3, 7)
dev.off()

To save as jpeg or pdf instead, just replace png with jpeg or pdf.

Some common functions of R

Brother Meng, get ready. Remember these things first. We are going to start drawing pictures and analyzing statistics.

Glossary

mean, median, mode

Surely there isn't really anyone who doesn't know what these three words mean...

variance

The variance measures the spread of the data: it is the average of the squared differences between each value and the mean. For the sample variance, the sum of squared differences is divided by n-1 rather than n.

standard deviation

The standard deviation is the square root of the variance. Note that var() and sd() in R compute the sample versions (dividing by n-1), not the population versions.
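To make the two definitions concrete, here is a small sketch that computes the variance by hand and compares it with R's built-in var() and sd(), which divide by n-1. The variable names are only for illustration:

```r
a <- c(1, 2, 3, 4, 5)
n <- length(a)
sq_diff <- (a - mean(a))^2  # squared differences from the mean

sample_var <- sum(sq_diff) / (n - 1)  # what var(a) returns
pop_var <- sum(sq_diff) / n           # population version, dividing by n

print(sample_var)        # 2.5
print(var(a))            # 2.5
print(pop_var)           # 2
print(sqrt(sample_var))  # same as sd(a)
```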

normal distribution

The normal curve is bell-shaped, low at both ends, high in the middle, and symmetrical. Because the curve is bell-shaped, people often call it a bell-shaped curve.

If the random variable X follows a normal distribution with expectation μ and variance σ^2, this is written X ~ N(μ, σ^2). (For a discrete variable, the expectation is Σ x_n*p_n, where x_n is a possible value and p_n is its probability.)

The graph of its probability density function is the bell curve. The expected value μ determines its position, and the standard deviation σ determines how spread out the distribution is. The normal distribution with μ = 0, σ = 1 is the standard normal distribution.


mean: get the mean

a=c(1:6)
mean(a)

median: get the median

a=c(1:6)
median(a)

Get the mode

R has no built-in function for the mode, so you have to write one by hand

# define the function
getmode = function(v) {
   uniqv = unique(v)  # unique() returns the vector, data frame or array with duplicate elements or rows removed
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
 
# create a vector
v = c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
 
# call the function
result = getmode(v)
print(result)  # 2 (2 and 3 both occur four times; the first one found is returned)

quantile(): quantiles; by default it returns five (0%, 25%, 50%, 75%, 100%)

a=c(1:6)
quantile(a)
# result
> quantile(a)
  0%  25%  50%  75% 100% 
1.00 2.25 3.50 4.75 6.00 

summary(): Descriptive statistics

summary(): Obtains descriptive statistics. It gives the minimum, maximum, quartiles and mean of numeric variables, as well as frequency counts for factors and logical vectors.

The results are interpreted as follows:

a=c(1:6)
summary(a)
# result
> summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    2.25    3.50    3.50    4.75    6.00 

var(): Calculate variance

a = c(1:5)
var(a)

sd(): standard deviation

a = c(1:5)
sd(a)

coefficient of variation

Standard deviation divided by the mean

When you need to compare the degree of dispersion of two sets of data, if the measurement scales of the two sets of data are too different, or the data dimensions are different, it is not appropriate to directly use the standard deviation for comparison. You can use the coefficient of variation.
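R has no built-in function for the coefficient of variation, so here is a minimal sketch with a hand-written helper cv() (a hypothetical name). The heights are the ones from the tapply example, once in centimetres and once in metres:

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

heights_cm <- c(174, 165, 180, 171, 160)
heights_m <- heights_cm / 100  # the same data on a different scale

# The standard deviations differ by a factor of 100 ...
print(sd(heights_cm))
print(sd(heights_m))

# ... but the coefficient of variation is the same for both
print(cv(heights_cm))
print(cv(heights_m))
```

This is why the coefficient of variation is suitable for comparing dispersion across different scales or units.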

sort, order: sort, specify sorting rules

x = c(1,7,5,4,4,6,9)
x = sort(x,decreasing=FALSE)  # returns x in ascending order; decreasing=TRUE gives descending order
# or
x_order = order(x,decreasing=FALSE)  # returns the subscripts that put x in ascending order; decreasing=TRUE gives descending order
x = x[x_order]

To sort a matrix

x[order(x[,1],x[,2]),]

Note: For descending order, just add decreasing=TRUE.

Handling missing values

When the data contains NA values, some calculations go wrong (the result is NA). To ignore the NA values, add the parameter na.rm=TRUE, for example

mean(height,na.rm=TRUE)
[1] 5.855
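A small self-contained sketch of the effect of na.rm, with a made-up vector h (not the height data used above):

```r
h <- c(1.70, NA, 1.80)

m_raw <- mean(h)                  # NA, because one value is missing
m_clean <- mean(h, na.rm = TRUE)  # the mean of the two known values
n_missing <- sum(is.na(h))        # how many values are missing

print(m_raw)      # NA
print(m_clean)    # 1.75
print(n_missing)  # 1
```

The same na.rm argument is accepted by sum(), median(), var(), sd() and similar functions.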

cor(): Calculate the correlation coefficient between two variables (optional)

cor(height,log(height))

cov(): covariance between two variables (optional)

cov(height,log(height))

shapiro.test(): Determine whether the data satisfies the normal distribution

Generally speaking, when the returned p-value is greater than 0.05, normality cannot be rejected, so the data is treated as satisfying the normal distribution.

Of course there are special cases, haha: in some scientific research 0.05 is not considered rigorous enough, and the threshold may be set to 0.01. If the question does not specify alpha, just default to 0.05.
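As a rough sketch of how the test is used in practice, the following runs shapiro.test() on simulated data. The samples, the seed and the variable names are my own choices for illustration:

```r
set.seed(42)  # fixed seed so the simulated data is reproducible

normal_sample <- rnorm(200)  # drawn from a normal distribution
skewed_sample <- rexp(200)   # drawn from a strongly skewed distribution

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

alpha <- 0.05
print(p_normal)          # compare this with alpha yourself
print(p_skewed < alpha)  # TRUE: normality is rejected for the skewed data
```

A p-value above alpha does not prove that the data is normal; it only means the test found no evidence against normality.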


Origin blog.csdn.net/m0_46521785/article/details/109089346