After downloading data from the SEER database, many people feel at a loss. The raw export has not been cleaned or organized, so statistical software cannot read it in a usable form and it cannot be analyzed. Today we will show how to clean SEER data in R so that it is ready for analysis.
First, load the R packages we need: foreign, car, and stringr. They must be installed beforehand.
library(foreign)
library(car)
library(stringr)
Then import the downloaded data into R. There are more than 200,000 records, far too many to edit by hand.
be<-read.csv("E:/r/test/seer4.csv",sep=',',header=TRUE,stringsAsFactors=FALSE)# keep text columns as character strings
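As an optional variation, read.csv can turn placeholder strings into NA at import time through its na.strings argument. The sketch below uses a tiny inline table instead of the real file; the column names and values are invented for illustration only.

```r
# Toy stand-in for the exported file (values are made up for illustration)
csv_text <- "sex,nodes,age
Female,3,45 years
Male,Blank(s),60 years
Female,Blank(s),71 years"

toy <- read.csv(text = csv_text,
                header = TRUE,
                stringsAsFactors = FALSE,          # keep strings as character
                na.strings = c("NA", "Blank(s)"))  # treat Blank(s) as missing

str(toy)               # nodes is read as a numeric column
sum(is.na(toy$nodes))  # number of missing node counts
```

Whether this is preferable to replacing the placeholders column by column depends on whether "Blank(s)" means the same thing in every column of your export.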
Look at the structure of the data and the variable names.
head(be)
names(be)
The output looks messy and some of the names are very long, so rename all the columns.
colnames(be)<-c("sex","time","rezult","rezult1","status","race","Subtype","nodes","Lymph.Invasion",
"tumor.size","extension","lymph.nodes","age","ajcc")# the original names are too long, so rename them
Let’s take a look at the data set again. This time it is much cleaner.
There are 14 variables in total. Lymph.Invasion contains only missing data, so it cannot be analyzed at all and can only be deleted. This is one of the frustrations of public databases.
be<-be[,-9]# drop column 9, Lymph.Invasion, because every value is missing
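Dropping a column by position breaks silently if the column order ever changes. As a hedged alternative, the sketch below finds columns that are entirely NA on a toy data frame and drops them by name. The toy values are illustrative, and in a real export the missing entries may be stored as strings such as "Blank(s)" rather than NA, in which case they would need converting first.

```r
# Toy data frame; column 'b' is entirely missing (values are illustrative only)
toy <- data.frame(a = c(1, 2, 3),
                  b = c(NA, NA, NA),
                  c = c("x", NA, "z"),
                  stringsAsFactors = FALSE)

# Fraction of missing values in each column
na_share <- colMeans(is.na(toy))

# Names of the columns that have no usable values at all
empty_cols <- names(na_share)[na_share == 1]

# Drop those columns by name rather than by position
toy_clean <- toy[, !(names(toy) %in% empty_cols), drop = FALSE]
names(toy_clean)
```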
Many variables in the data are stored as strings, which the analysis cannot use directly, so we have to turn them into numbers.
be$sex<-ifelse(be$sex=="Female",1,ifelse(be$sex=="Male",2,NA))# recode sex as 1 and 2, with NA for anything else; the following lines work the same way
be$rezult1 <-ifelse(be$rezult1 =="Alive or dead due to cancer",1,
ifelse(be$rezult1 =="Dead (attributable to causes other than this cancer dx)",
2,NA))
be$status<-ifelse(be$status=="Alive",0,ifelse(be$status=="Dead",1,NA))
be$race<-ifelse(be$race=="White",1,ifelse(be$race=="Black",2,3))
be$Subtype<-recode(be$Subtype,"'HR-/HER2- (Triple Negative)'=1;
'HR-/HER2+ (HER2 enriched)'=2;'HR+/HER2- (Luminal A)'=3;
'HR+/HER2+ (Luminal B)'=4;else=NA")# four categories here; nesting ifelse gets cumbersome, so use recode() from the car package instead
be$nodes[be$nodes=="Blank(s)"]=NA# turn Blank(s) into missing values; the lines below do the same
be$tumor.size[be$tumor.size=="Blank(s)"]=NA
be$extension[be$extension=="Blank(s)"]=NA
be$lymph.nodes[be$lymph.nodes=="Blank(s)"]=NA
be$age<-as.numeric(str_extract(be$age, "\\d+"))# extract the digits from the age string and store them as a number
be$ajcc[be$ajcc=="Blank(s)"]=NA
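The repeated "Blank(s)" replacements above can also be written as a loop over column names. The sketch below shows the idea on a toy data frame and additionally converts the columns to numeric. It assumes that every remaining value is a plain digit string; real SEER columns may contain other codes, so check them with table() before converting.

```r
# Toy columns mimicking the pattern above (values are illustrative only)
toy <- data.frame(nodes      = c("3", "Blank(s)", "0"),
                  tumor.size = c("Blank(s)", "25", "40"),
                  stringsAsFactors = FALSE)

cols <- c("nodes", "tumor.size")
for (col in cols) {
  x <- toy[[col]]
  x[x == "Blank(s)"] <- NA     # placeholder string becomes a missing value
  toy[[col]] <- as.numeric(x)  # remaining digit strings become numbers
}

str(toy)
```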
OK, the conversion is almost done. Let’s take a look. The rezult column is not needed; we ignore it for now and delete it later. What we need is rezult1.
We did not convert ajcc because we do not need it yet. We will come back to it when we discuss exploring interaction effects, so ignore it for now. If leaving it unconverted bothers you, you can convert it following the code above.
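If you do want to convert a variable with many categories, a named vector used as a lookup table is one base R option that avoids nested ifelse calls. The stage labels below are hypothetical and chosen only for illustration; inspect the real values with table(be$ajcc) before writing your own mapping.

```r
# Hypothetical stage labels, NOT taken from the real data
stage <- c("I", "III", "Blank(s)", "II", "IV")

# Named vector acting as a lookup table: label -> numeric code
stage_codes <- c("I" = 1, "II" = 2, "III" = 3, "IV" = 4)

# Labels that are absent from the table (such as Blank(s)) come back as NA
stage_num <- unname(stage_codes[stage])
stage_num
```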
OK, is it done now? No. One important variable has not been generated yet: the outcome for the competing-risks analysis.
Let’s generate it now.
be$status1<-ifelse(be$status==0,0,ifelse(be$rezult1==1,1,2))
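To make the coding explicit, the same expression can be run on a small toy table that covers each case. Given the recoding of status and rezult1 above, 0 means alive (censored), 1 means death from the cancer, and 2 means death from other causes. The toy rows are invented for illustration.

```r
# One toy row per case (values are illustrative only)
toy <- data.frame(status  = c(0, 1, 1, 1),   # 0 = alive, 1 = dead
                  rezult1 = c(1, 1, 2, NA))  # 1 = alive or cancer death, 2 = other-cause death

# Same logic as above: censored first, then split deaths by cause
toy$status1 <- ifelse(toy$status == 0, 0,
                      ifelse(toy$rezult1 == 1, 1, 2))

# Cross-tabulate to check the result, keeping missing values visible
table(toy$status, toy$status1, useNA = "ifany")
```

Note that a death with an unknown cause stays NA rather than being assigned to either event type.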
The data is finally ready. Write it out as 1.csv.
write.csv(be,file = "1.csv")
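By default write.csv also writes the row names as an extra first column. If you do not want that column, row.names = FALSE suppresses it. The sketch below writes a toy data frame to a temporary file and reads it back to confirm the columns; the data and the temporary path are stand-ins for the real ones.

```r
# Toy result table (values are illustrative only)
toy <- data.frame(sex = c(1, 2), status1 = c(0, 2))
out_file <- tempfile(fileext = ".csv")  # temporary path for the demo

write.csv(toy, file = out_file, row.names = FALSE)  # skip the row-name column
check <- read.csv(out_file)
names(check)
```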
Finally, open 1.csv and tidy it up. This is the data set we will use for the paper.
With more than 200,000 records, getting a paper into a Chinese core journal or a low-impact-factor SCI journal should not be hard.