Clinical Database Mining Series 3: Cleaning SEER Database Data with R

After downloading data from the SEER database, many people feel at a loss. The raw export has not been cleaned or organized, so statistical software cannot read it in a usable form and it cannot be analyzed. Today we show how to clean SEER data with R so that it is ready for analysis.
First, load the R packages we need: foreign, car, and stringr. All three must be installed beforehand.
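The image that showed the loading code did not survive here. As a minimal sketch, assuming the three packages named above are attached with library(), the following checks which of them are installed and attaches those that are. The names `pkgs`, `installed`, and `missing_pkgs` are illustrative only.

```r
# Packages named in the text.
pkgs <- c("foreign", "car", "stringr")

# Check which ones are installed; nothing is installed automatically here.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Install first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
}

# Attach the packages that are available.
for (p in pkgs[installed]) library(p, character.only = TRUE)
```

Reporting missing packages instead of installing them silently keeps the script safe to run on a machine without internet access.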
Next, import the downloaded data into R. There are more than 200,000 records, far too many to edit by hand.

be <- read.csv("E:/r/test/seer4.csv", sep = ",", header = TRUE,
               stringsAsFactors = FALSE)  # keep text columns as character, not factor

Look at the first few rows and the variable names:
head(be)
names(be)
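
Beyond head() and names(), it can help to check column types and count missing values before changing anything. The sketch below uses a small invented data frame (`toy` is hypothetical, not the SEER export) and assumes missing entries appear either as NA or as the placeholder string "Blank(s)".

```r
# Toy stand-in for the imported data; the values are invented for illustration.
toy <- data.frame(
  Sex = c("Female", "Male", "Female"),
  Nodes = c("2", "Blank(s)", "0"),
  Invasion = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

str(toy)                                             # column types at a glance
n_na    <- colSums(is.na(toy))                       # true NA values per column
n_blank <- colSums(toy == "Blank(s)", na.rm = TRUE)  # placeholder strings per column
print(rbind(n_na, n_blank))
```

A column whose NA count equals the number of rows carries no information and is a candidate for removal.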
The names look messy and some are very long, so we rename all the columns:

colnames(be)<-c("sex","time","rezult","rezult1","status","race","Subtype","nodes","Lymph.Invasion",
                "tumor.size","extension","lymph.nodes","age","ajcc")# the original names are too long, so rename them

Look at the dataset again; it is much cleaner now.
There are 14 variables in total. Lymph.Invasion is missing for every record, so it cannot be analyzed and has to be dropped. This is a common frustration with public databases.

be <- be[, -9]  # drop column 9 (Lymph.Invasion) because every value is missing
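
Dropping by position works only while the column order stays the same. As a hedged alternative, the sketch below finds columns that are entirely NA and drops them by name. The `toy` data frame is invented, and it assumes the missing values are real NA values rather than placeholder strings.

```r
# Toy data frame; values are invented for illustration.
toy <- data.frame(
  sex = c("Female", "Male", "Female"),
  Lymph.Invasion = c(NA, NA, NA),
  age = c("45 years", "60 years", "71 years"),
  stringsAsFactors = FALSE
)

# Columns in which every value is NA.
all_missing <- names(toy)[colSums(!is.na(toy)) == 0]
print(all_missing)

# Drop them by name rather than by position.
toy <- toy[, !(names(toy) %in% all_missing), drop = FALSE]
```

If the export codes missing values as text, convert those placeholders to NA first, otherwise the column will not be detected.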

Many variables are stored as strings, which the analysis cannot use directly, so we convert them to numeric codes:

be$sex<-ifelse(be$sex=="Female",1,ifelse(be$sex=="Male",2,NA))# code sex as 1 and 2, NA for anything else; the lines below follow the same pattern
be$rezult1 <-ifelse(be$rezult1 =="Alive or dead due to cancer",1,
                    ifelse(be$rezult1 =="Dead (attributable to causes other than this cancer dx)",
                           2,NA))
be$status<-ifelse(be$status=="Alive",0,ifelse(be$status=="Dead",1,NA))
be$race<-ifelse(be$race=="White",1,ifelse(be$race=="Black",2,3))# every other non-missing value is coded 3
be$Subtype<-car::recode(be$Subtype,"'HR-/HER2- (Triple Negative)'=1;
       'HR-/HER2+ (HER2 enriched)'=2;'HR+/HER2- (Luminal A)'=3;
       'HR+/HER2+ (Luminal B)'=4;else=NA")# four categories: nested ifelse calls get cumbersome, so use recode from the car package
be$nodes[be$nodes=="Blank(s)"]<-NA# turn "Blank(s)" into a missing value; same for the lines below
be$tumor.size[be$tumor.size=="Blank(s)"]<-NA
be$extension[be$extension=="Blank(s)"]<-NA
be$lymph.nodes[be$lymph.nodes=="Blank(s)"]<-NA
be$age<-as.numeric(str_extract(be$age, "\\d+"))# extract the digits from the age label and store them as a number
be$ajcc[be$ajcc=="Blank(s)"]<-NA
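
For variables with several categories, a named lookup vector is another way to write the same recoding, and it runs in base R without extra packages. The sketch below is an illustration on an invented `toy` data frame, not the author's method. The age labels and the names `subtype_codes` and `Subtype_num` are assumptions made for the example.

```r
# Toy data frame; values are invented for illustration.
toy <- data.frame(
  Subtype = c("HR+/HER2- (Luminal A)", "HR-/HER2- (Triple Negative)", "Unknown"),
  nodes = c("3", "Blank(s)", "0"),
  age = c("45 years", "60 years", "85+ years"),
  stringsAsFactors = FALSE
)

# Lookup table: labels on the left, codes on the right (same codes as above).
subtype_codes <- c(
  "HR-/HER2- (Triple Negative)" = 1,
  "HR-/HER2+ (HER2 enriched)"   = 2,
  "HR+/HER2- (Luminal A)"       = 3,
  "HR+/HER2+ (Luminal B)"       = 4
)
# Indexing by label returns the code; labels not in the table become NA.
toy$Subtype_num <- unname(subtype_codes[toy$Subtype])

# Placeholder to NA, then convert the remaining digit strings to numbers.
toy$nodes[toy$nodes == "Blank(s)"] <- NA
toy$nodes <- as.numeric(toy$nodes)

# Base R way to keep the first run of digits from the age label.
toy$age <- as.numeric(sub("^\\D*(\\d+).*$", "\\1", toy$age))

print(toy)
```

Writing the result to a new column first makes it easy to compare old and new values with table() before overwriting the original.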

OK, the conversion is almost done. rezult is not needed; we ignore it for now and delete it later. What we need is rezult1.
We did not convert ajcc because we do not need it yet. We will come back to it when we discuss the analysis of interaction effects, so ignore it for now. If leaving it unconverted bothers you, you can convert it following the code above.
Are we done? Not yet. One important variable still has to be generated: the competing-risks outcome, coded 0 for patients who are alive, 1 for death due to this cancer, and 2 for death from other causes.
Let's generate it now.

be$status1<-ifelse(be$status==0,0,ifelse(be$rezult1==1,1,2))
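
To see how this nested ifelse behaves, here is the same expression applied to a small invented data frame (`toy` is hypothetical). It also shows that a dead patient with a missing rezult1 ends up as NA rather than being assigned to either cause.

```r
# Toy data frame; values are invented for illustration.
toy <- data.frame(
  status  = c(0, 1, 1, 1),   # 0 = alive, 1 = dead
  rezult1 = c(1, 1, 2, NA)   # 1 = alive or dead due to cancer, 2 = dead of other causes
)

# Same rule as above: alive -> 0, cancer death -> 1, other-cause death -> 2.
toy$status1 <- ifelse(toy$status == 0, 0, ifelse(toy$rezult1 == 1, 1, 2))

# Cross-tabulate to confirm each combination maps to the intended code.
print(table(status = toy$status, status1 = toy$status1, useNA = "ifany"))
```

Running the same table() call on the full dataset is a quick check that no unexpected combinations slipped through.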

Now the data is ready. Write it out as 1.csv:

write.csv(be, file = "1.csv", row.names = FALSE)  # row.names = FALSE avoids an extra index column
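
By default, write.csv also writes the row names as an unnamed first column, which then has to be removed by hand. The sketch below, on an invented `toy` data frame and a temporary file, writes without row names and reads the file back to confirm that nothing was lost. The names `out_file` and `check` are illustrative.

```r
# Toy data frame; values are invented for illustration.
toy <- data.frame(sex = c(1, 2), status1 = c(0, 2))

out_file <- tempfile(fileext = ".csv")   # temporary path, used only for this sketch
write.csv(toy, file = out_file, row.names = FALSE)

# Read the file back and confirm the shape and column names survived.
check <- read.csv(out_file)
stopifnot(identical(names(check), names(toy)), nrow(check) == nrow(toy))
```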

Finally, open 1.csv and tidy it up. This is the dataset we will use for publication.
With more than 200,000 records, publishing in a Chinese core journal or a low-impact SCI journal should be well within reach.
If you want to learn more about the data mining process, please follow my scientific research tutorials. For more articles, follow the public account: zero-based research

Origin blog.csdn.net/dege857/article/details/112795092