Subsetting
- Using indices
> setwd(dir='E:/Rworking')
> car32<-read.csv('mtcars.csv',sep=',',header = T,nrows = 10)
> car_1<-car32[c(1:4),c(2,4,8)]
> car_1
mpg disp qsec
1 21.0 160 16.46
2 21.0 160 17.02
3 22.8 108 18.61
4 21.4 258 19.44
- Logical conditions can also be used for indexing
> car_2<-car32[which(car32$mpg==21),]
> car_2
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
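# which() returns the indices of the elements that satisfy the condition. In a data frame, a single index without a comma selects columns, so the trailing comma is needed to select rows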
> car_3<-car32[which(car32$mpg>18 & car32$mpg<=21),]
> car_3
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
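The two indexing styles above can be sketched as follows. This is only an illustration under one assumption: it uses R's built-in mtcars dataset instead of the mtcars.csv file read in the text, so there is no X column. It also shows a case where which() and a bare logical vector behave differently, namely when the condition contains a missing value.

```r
# Built-in mtcars stands in for the CSV file used in the text (assumption).
cars <- head(mtcars, 10)

# Row selection by a logical condition, with and without which().
cond <- cars$mpg > 18 & cars$mpg <= 21
by_which   <- cars[which(cond), ]
by_logical <- cars[cond, ]

# Without the comma, the index selects columns instead of rows.
two_cols <- cars[c(1, 3)]

# With a missing value the two forms differ: which() drops the NA,
# while a bare logical index returns an extra row filled with NA.
cars_na <- cars
cars_na$mpg[3] <- NA
cond_na <- cars_na$mpg > 18 & cars_na$mpg <= 21
n_which   <- nrow(cars_na[which(cond_na), ])
n_logical <- nrow(cars_na[cond_na, ])
```

When the data may contain NA, wrapping the condition in which() is therefore the safer habit.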
- The subset() function makes subsetting easy
subset(x, ...)
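x | the object to be subsetted; it can be a vector, a matrix, or a data frame
subset | logical expression indicating the elements or rows to keep; missing values are taken as FALSE
select | expression indicating the columns to select from a data frame
drop | passed on to the [ indexing operator
... | further arguments to be passed to or from other methods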
> car_4<-subset(car32,mpg>18 & mpg<=21)
> car_4
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
>
> car_4<-subset(car32,select = 1:4)
> car_4
X mpg cyl disp
1 Mazda RX4 21.0 6 160.0
2 Mazda RX4 Wag 21.0 6 160.0
3 Datsun 710 22.8 4 108.0
4 Hornet 4 Drive 21.4 6 258.0
5 Hornet Sportabout 18.7 8 360.0
6 Valiant 18.1 6 225.0
7 Duster 360 14.3 8 360.0
8 Merc 240D 24.4 4 146.7
9 Merc 230 22.8 4 140.8
10 Merc 280 19.2 6 167.6
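The subset and select arguments can also be combined in a single call. The sketch below assumes the built-in mtcars dataset rather than the CSV file used in the text; the variable names are illustrative only.

```r
# Built-in mtcars stands in for the CSV file used in the text (assumption).
cars <- head(mtcars, 10)

# Column names can be used directly inside subset(); no cars$ prefix is needed.
mid_mpg <- subset(cars, mpg > 18 & mpg <= 21, select = c(mpg, cyl, disp))

# select also accepts a range of column names and negative selections.
first_three <- subset(cars, select = mpg:disp)
without_mpg <- subset(cars, select = -mpg)
```

Because subset() evaluates column names inside the data frame, it is convenient for interactive work; inside functions, plain bracket indexing is usually the more predictable choice.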
Random sampling (the sample function)
sample(x, size, replace = FALSE, prob = NULL)
sample.int(n, size = n, replace = FALSE, prob = NULL,
useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e7))
Arguments
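x | either a vector of elements to sample from, or a positive integer
n | a positive number, the number of items to choose from
size | a non-negative integer giving the number of items to choose
replace | whether to sample with replacement; the default is without replacement
prob | a vector of probability weights for obtaining the elements of the vector being sampled
useHash | logical; whether to use the hash-based version of the algorithm (the default condition is shown in the usage above)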
> x<-1:100
> sample(x,size=30)
[1] 17 63 73 68 37 9 3 2 32 64 15 57 69 82 54 62 77
[18] 79 98 87 34 24 7 20 46 40 27 30 33 13
> sort(sample(x,size=30,replace = T))
[1] 2 4 5 5 10 11 17 17 19 23 26 35 35 42 52 54 56
[18] 56 57 60 63 69 70 71 71 72 91 93 94 97
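# The following errors help clarify how sample.int works
# In the first call n and size are matched by name, so the unnamed x is matched to the next argument, replace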
> sample.int(x,n=3,size=30)
Error in sample.int(x, n = 3, size = 30) : invalid 'replace' argument
> sample.int(n=3,size=30)
Error in sample.int(n = 3, size = 30) :
cannot take a sample larger than the population when 'replace = FALSE'
> sample.int(n=3,size=30,replace = T)
[1] 2 1 3 3 3 2 2 3 1 2 2 1 1 2 1 2 1 3 2 2 1 1 2 1 1 1 1 2 2 3
> sample.int(n=80,size=30,replace = F)
[1] 66 13 3 27 43 48 47 39 74 59 19 25 7 52 75 58 10 38 46 78 68 2 45
[24] 9 50 23 64 32 12 54
>
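Because sample() draws at random, the outputs above will differ on every run. A minimal sketch of making a draw reproducible with set.seed(), plus a weighted draw using prob, is shown below; the seed value and the variable names are arbitrary choices for illustration.

```r
# Setting the same seed before each call reproduces the same draw.
set.seed(123)
first_draw <- sample(1:100, size = 5)
set.seed(123)
second_draw <- sample(1:100, size = 5)
same_draw <- identical(first_draw, second_draw)

# Weighted sampling with replacement: "H" is drawn with probability 0.9.
coin <- sample(c("H", "T"), size = 20, replace = TRUE, prob = c(0.9, 0.1))

# A single positive number x is treated as 1:x, so this is a permutation of 1:5.
perm <- sample(5)
```

The last line is worth keeping in mind: passing a numeric vector that happens to have length one samples from 1:x rather than from that single value.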
To randomly sample rows from a data frame:
# who holds the complete mtcars dataset (read in from the CSV file)
> a<-sample(rownames(who),size=5,replace = F)
> a
[1] "22" "25" "31" "23" "17"
> who[a,]
X mpg cyl disp hp drat wt qsec vs am gear carb
22 Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2
25 Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
31 Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
23 AMC Javelin 15.2 8 304 150 3.15 3.435 17.30 0 0 3 2
17 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
>
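An alternative sketch samples row positions with nrow() instead of row names, which also makes it easy to obtain the rows that were not drawn. It assumes the built-in mtcars dataset in place of who; the seed is arbitrary.

```r
# Built-in mtcars stands in for the who data frame from the text (assumption).
set.seed(1)

# Draw 5 distinct row positions out of 1..nrow(mtcars).
idx <- sample(nrow(mtcars), size = 5)

picked <- mtcars[idx, ]   # the sampled rows
rest   <- mtcars[-idx, ]  # everything that was not sampled
```

Integer positions work with negative indexing, whereas character row names do not, which is why this form is handy for splitting data into two parts.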
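Deleting specified rows and columns
Negative indices are a convenient way to do this
mt<-mtcars[-1:-5,]  # delete rows 1 to 5
mt<-mtcars[-1:-5]  # delete columns 1 to 5
A column can also be removed by assigning NULL to it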
> mtcars$mpg=NULL
> head(mtcars)
cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 6 225 105 2.76 3.460 20.22 1 0 3 1
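The deletion techniques above can be sketched on a copy of the data, so that the built-in mtcars is left unchanged. The data frame name df and the chosen columns are illustrative assumptions. The last line shows a common slip: -1:5 is parsed as (-1):5, which mixes negative and positive indices and fails.

```r
# Work on a copy so the built-in dataset stays intact.
df <- head(mtcars, 8)

# Negative indices drop rows; -(1:5) is the same as -1:-5.
no_rows <- df[-(1:5), ]

# Drop columns by name using a logical mask over names(df).
no_cols <- df[, !(names(df) %in% c("mpg", "cyl"))]

# Assigning NULL removes a single column in place.
df$hp <- NULL

# -1:5 expands to -1, 0, 1, ..., 5 and cannot be used as an index.
mixed <- tryCatch(df[-1:5, ], error = function(e) "error")
```

Dropping by name is more robust than dropping by position when the column order may change.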
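Merging data frames
# USArrests: arrests per 100,000 residents for murder, assault, and rape in each US state, plus the percent urban population
# state.division: the geographic division each US state belongs to
# The simplest way to combine them is data.frame()
> a<-head(USArrests)
> b<-state.division[1:6]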
> data.frame(a,b)
Murder Assault UrbanPop Rape b
Alabama 13.2 236 58 21.2 East South Central
Alaska 10.0 263 48 44.5 Pacific
Arizona 8.1 294 80 31.0 Mountain
Arkansas 8.8 190 50 19.5 West South Central
California 9.0 276 91 40.6 Pacific
Colorado 7.9 204 78 38.7 Mountain
> cbind(a,b)
Murder Assault UrbanPop Rape b
Alabama 13.2 236 58 21.2 East South Central
Alaska 10.0 263 48 44.5 Pacific
Arizona 8.1 294 80 31.0 Mountain
Arkansas 8.8 190 50 19.5 West South Central
California 9.0 276 91 40.6 Pacific
Colorado 7.9 204 78 38.7 Mountain
>
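Column binding pairs values purely by position, so it relies on both objects being in the same order. The sketch below gives the new column an explicit name and, as an alternative, looks the division up by state name through state.name (the built-in vector of state names that parallels state.division). The column name division is an illustrative choice.

```r
a <- head(USArrests)

# Name the new column explicitly instead of inheriting the variable name b.
named <- data.frame(a, division = state.division[1:6])

# Look the division up by state name rather than relying on position.
pos <- match(rownames(a), state.name)
by_name <- cbind(a, division = state.division[pos])
```

Both results have the same content here; the second form keeps working if the rows of a are reordered or filtered.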
Rows can be bound together as well, but the two objects being bound must have the same column names
> c<-head(USArrests)
> d<-tail(USArrests)
> rbind(c,d)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Vermont 2.2 48 32 11.2
Virginia 8.5 156 63 20.7
Washington 4.0 145 73 26.2
West Virginia 5.7 81 39 9.3
Wisconsin 2.6 53 66 10.8
Wyoming 6.8 161 60 15.6
>
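The column-name requirement can be checked with a small sketch. For data frames, rbind() matches columns by name, so a different column order is accepted, while a column name that does not match stops with an error. The name Homicide below is a made-up column name used only to trigger the mismatch.

```r
top    <- head(USArrests, 3)
bottom <- tail(USArrests, 3)

# Columns are matched by name, so a reversed column order still binds correctly.
reordered <- rbind(top, bottom[, 4:1])

# A column name that does not match makes rbind() stop with an error.
renamed <- bottom
names(renamed)[1] <- "Homicide"  # hypothetical name, for illustration only
outcome <- tryCatch(rbind(top, renamed), error = function(e) "error")
```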
In the following case, however,
> x<-USArrests
> x=head(x,5)
> y=head(x,5)
> z=tail(x,5)
> rbind(y,z)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Alabama1 13.2 236 58 21.2
Alaska1 10.0 263 48 44.5
Arizona1 8.1 294 80 31.0
Arkansas1 8.8 190 50 19.5
California1 9.0 276 91 40.6
>
the rows are simply stacked and duplicates are not removed.
Excel has a feature for removing duplicates; the same can be done in R
> data1<-rbind(y,z)
> rownames(data1)  # get the row names
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Alabama1"
[7] "Alaska1" "Arizona1" "Arkansas1" "California1"
> duplicated(rownames(data1))
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> duplicated(data1)
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
> data1[duplicated(data1),]  # extract the duplicated rows
Murder Assault UrbanPop Rape
Alabama1 13.2 236 58 21.2
Alaska1 10.0 263 48 44.5
Arizona1 8.1 294 80 31.0
Arkansas1 8.8 190 50 19.5
California1 9.0 276 91 40.6
> data1[!duplicated(data1),]  # extract the non-duplicated rows
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
This can also be done in one step
> unique(data1)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
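duplicated() can also be applied to a single column, which is useful when rows should be unique by a key rather than by every value. The small data frame below is made up for illustration; it is not from the text.

```r
# Made-up example data: id 1 appears twice with identical values,
# id 2 appears twice with different scores.
df <- data.frame(id = c(1, 2, 1, 3, 2), score = c(10, 20, 10, 30, 25))

# Whole-row comparison: only the exact repeat (row 3) is flagged.
dup_rows <- duplicated(df)

# Key-based comparison: every repeated id is flagged.
dup_ids <- duplicated(df$id)

# Keep the first or the last row for each id.
first_per_id <- df[!duplicated(df$id), ]
last_per_id  <- df[!duplicated(df$id, fromLast = TRUE), ]

# unique() on a data frame gives the same rows as !duplicated().
same <- isTRUE(all.equal(unique(df), df[!duplicated(df), ]))
```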
cbind() and rbind() can of course also be used to combine matrices
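A minimal sketch with two small matrices, whose names and values are chosen only for illustration:

```r
m1 <- matrix(1:4, nrow = 2)  # columns: (1, 2) and (3, 4)
m2 <- matrix(5:8, nrow = 2)  # columns: (5, 6) and (7, 8)

# cbind() needs the same number of rows and gives a 2 x 4 matrix.
side_by_side <- cbind(m1, m2)

# rbind() needs the same number of columns and gives a 4 x 2 matrix.
stacked <- rbind(m1, m2)
```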