Data acquisition in R

The articles and code are archived in the GitHub repository https://github.com/timerring/dive-into-AI and can also be obtained by replying "R language" to the public account AIShareLab.

R ships with a large number of built-in datasets that can be used for analysis and practice, and we can also create data in R that simulate specific distributions. In actual work, however, data analysts usually face external data from various sources, that is, data files with different extensions such as .txt, .csv, .xlsx, and .xls. Different extensions indicate different file formats, which can be confusing.

R provides a wide range of data import tools.

1. Get the built-in dataset

The built-in datasets in R are distributed with packages. The base package datasets contains only datasets and no functions; it provides nearly 100 datasets covering fields such as medicine, nature, and sociology.

You can check with the following command:

data(package = "datasets")

To load a dataset, use the function data( ). Running the following command loads the dataset iris into the workspace.

data(iris)
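
Once loaded, the dataset behaves like any other data frame. The following is a minimal sketch of a first inspection using only base R functions; the variable names dims and vars are chosen here for illustration.

```r
data(iris)                 # load iris into the workspace
dims <- dim(iris)          # number of rows and columns
vars <- names(iris)        # variable names
str(iris)                  # compact overview of every variable
print(head(iris, n = 5))   # first 5 rows
```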

Besides the datasets package, many other R packages also come with datasets. Unless a package is one of the base packages loaded automatically when R starts, we need to install and load it before using its data. The following uses the dataset bacteria in the MASS package as an example:

library(MASS)
data(bacteria)
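
If only the dataset is needed, it can also be loaded without attaching the whole package. A short sketch, assuming MASS is installed (it is normally shipped with R as a recommended package); n_rows is a name introduced here for illustration.

```r
# Load one dataset from MASS without calling library(MASS)
data(bacteria, package = "MASS")
n_rows <- nrow(bacteria)   # number of observations
str(bacteria)              # overview of the variables
```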

2. Simulate data with a specific distribution

R provides a series of functions for numerical simulation. The names of these functions start with r; commonly used ones include rnorm( ), runif( ), rbinom( ), and rpois( ). For example:

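# Histograms are covered in detail in the later visualization section
r1 <- rnorm(n = 100, mean = 0, sd = 1)
# head(r1) # look at the first few values (head() returns 6 by default)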
hist(r1)

r2 <- runif(n = 10000, min = 0, max = 100)
hist(r2)

r3 <- rbinom(n = 80, size = 100, prob = 0.1)
hist(r3)

r4 <- rpois(n = 50, lambda = 1)
hist(r4)
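
Because these functions draw random numbers, each run gives different values. A minimal sketch of making a simulation reproducible with set.seed( ); the seed value 123 and the variable names are arbitrary choices for illustration.

```r
# The same seed yields the same random draws
set.seed(123)
a <- rnorm(n = 5, mean = 0, sd = 1)
set.seed(123)
b <- rnorm(n = 5, mean = 0, sd = 1)
same <- identical(a, b)   # TRUE: both vectors contain the same values

# With a large sample, the sample mean should be close to lambda
set.seed(123)
x <- rpois(n = 10000, lambda = 1)
print(mean(x))
```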

3. Get data in other formats

3.1 txt and csv format

If the data source is an ASCII file created with Windows Notepad or another plain text editor, we can use the function read.table( ) to read the data; it returns a data frame.

For example, assuming the data frame patients has been saved as the file patients.txt in the current working directory, we can read the data with the following commands:

# getwd() # get the current working directory
# Create the patients.txt data file for this example
ID <- 1:5
sex <- c("male", "female", "male", "female", "male")
age <- c(25, 34, 38, 28, 52)
pain <- c(1, 3, 2, 2, 3)      
pain.f <- factor(pain, levels = 1:3, labels = c("mild", "medium", "severe"))   
patients <- data.frame(ID, sex, age, pain.f)
write.table(patients, "patients.txt", row.names = FALSE)

patients.data <- read.table("patients.txt", header = TRUE)
patients.data
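
By default read.table( ) splits fields on whitespace, so values that contain spaces need a different delimiter. The sketch below writes and reads a tab-delimited file via the sep argument; the data frame df and the temporary file are made up for illustration.

```r
# A small example data frame whose text values contain a space
df <- data.frame(ID = 1:3, city = c("New York", "Paris", "Rome"))

# Write a tab-delimited file to a temporary location
tab_file <- tempfile(fileext = ".txt")
write.table(df, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back, telling read.table() that fields are separated by tabs
df_read <- read.table(tab_file, header = TRUE, sep = "\t")
print(df_read)
```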

Delimited text files are often generated by spreadsheet and database applications; .csv files contain comma-separated values (Comma Separated Values). The function read.csv( ) is a variant of the function read.table( ) designed for reading .csv files.

The functions read.table( ) and read.csv( ) have different default values for some parameters.
In read.table( ), the default value of the parameter header is FALSE, that is, the first line of the file is treated as data rather than variable names.
In read.csv( ), the default value of the parameter header is TRUE. Therefore, before reading in the data, it is recommended to open the original file, look at it, and then set the parameters accordingly so that the data are read correctly.

write.csv(patients, "patients.csv", row.names=FALSE)
patients.data <- read.csv("patients.csv")
patients.data
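
The effect of the header default can be seen by reading the same .csv file with read.table( ) twice. This is a small sketch with a made-up two-column file; the names csv_file, wrong, and right are chosen for illustration.

```r
# Write a small .csv file to a temporary location
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:3, age = c(25, 34, 38)), csv_file, row.names = FALSE)

# header is FALSE here, so the header line is read as a row of data
wrong <- read.table(csv_file, sep = ",")
# With header = TRUE the first line supplies the variable names
right <- read.table(csv_file, sep = ",", header = TRUE)

print(names(wrong))   # automatic names V1, V2
print(nrow(wrong))    # 4 rows: the header line is counted as data
print(names(right))   # ID, age
print(nrow(right))    # 3 rows
```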

3.2 xls or xlsx format

There are many ways to read spreadsheet data. The easiest is to save the data file as a comma-delimited (.csv) file in Excel and then read it into R using the method described above for .csv files. Data files in xlsx or xls format can also be read directly with third-party packages such as openxlsx, readxl, and gdata.

Take the openxlsx package as an example:

library(openxlsx)
write.xlsx(patients, "patients.xlsx")
patients.data <- read.xlsx("patients.xlsx", sheet = 1)
patients.data
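
The readxl package mentioned above can be used in much the same way. The following is a sketch only, assuming both openxlsx and readxl are installed (the code is skipped otherwise); the file and the data frame are made up for illustration.

```r
# Run only when both packages are available
if (requireNamespace("openxlsx", quietly = TRUE) &&
    requireNamespace("readxl", quietly = TRUE)) {
  xlsx_file <- tempfile(fileext = ".xlsx")
  openxlsx::write.xlsx(data.frame(ID = 1:3, age = c(25, 34, 38)), xlsx_file)
  # read_excel() returns a tibble; convert it if a plain data frame is preferred
  xl <- as.data.frame(readxl::read_excel(xlsx_file, sheet = 1))
  print(xl)
}
```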

3.3 Import data from other statistical software

Sometimes we need to read data generated by other statistical software, such as SPSS, SAS, Stata, or Minitab. One way is to export the data from that software as a text file and then read it into R with the function read.table( ) or read.csv( ). Another is to use an extension package such as foreign, whose main purpose is to read and write data from other statistical software.

The following is an example of importing SPSS data files.

Assuming the data file patients.sav is stored in the current working directory, we can use the following commands to read the dataset into R:

# To keep the number of attachments down, download the file directly into the working directory
URL <- "http://download.kesci.com/qlhatmok4/patients.sav"
download.file(URL, destfile = "./patients.sav", method="curl")

library(foreign)
# The parameter `to.data.frame` of `read.spss( )` defaults to FALSE; unless it is set to TRUE, the data are returned as a list.
patients.data <- read.spss("patients.sav" , to.data.frame = TRUE)
patients.data

Importing data files from software such as SAS and Stata with the foreign package works in a similar way; please refer to the package documentation for details.
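
As an illustration for Stata, the sketch below writes a small made-up data frame to a .dta file with write.dta( ) and reads it back with read.dta( ), both from the foreign package. Note that read.dta( ) only supports files from older Stata versions (5 to 12 according to its documentation), so files saved by newer Stata releases may need a different package.

```r
library(foreign)

# A small example data frame (variable names without dots,
# since Stata does not allow dots in variable names)
df <- data.frame(id = 1:3,
                 sex = c("male", "female", "male"),
                 age = c(25, 34, 38))

# Write a Stata file to a temporary location, then read it back
dta_file <- tempfile(fileext = ".dta")
write.dta(df, dta_file)
stata.data <- read.dta(dta_file)
print(stata.data)
```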

4. Data entry

Data can be entered directly in R, but once the dataset grows beyond a small size (more than 10 columns or more than 30 rows), R is no longer the best choice for data entry. For small-scale data like this, spreadsheet software such as Excel is a convenient alternative.
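
For a handful of values, one way to enter data directly in R is to type it as text and parse it with the text argument of read.table( ). A minimal sketch with made-up values; small.data is a name chosen for illustration.

```r
# Type a small table inline and parse it into a data frame
small.data <- read.table(header = TRUE, text = "
ID sex    age
1  male   25
2  female 34
3  male   38
")
print(small.data)
```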

However, if the amount of data is large, manual entry in spreadsheet software is also error-prone. In that case, software designed specifically for data entry is more suitable, such as the free software EpiData. It not only makes it easy to set constraints on data entry, such as range checking and word wrapping, but can also add labels to each variable and variable value.

The function read.epiinfo( ) in the foreign package can read the .rec files generated by EpiData directly, but it is recommended to first export the entered data from EpiData as a Stata data file and then read it in R with the function read.dta( ). The advantage is that variable attributes preset in EpiData, such as variable labels and descriptions, are preserved.

Origin blog.csdn.net/m0_52316372/article/details/132439380