R for Beginners: Data Transformation (Part 2)

Subsetting

  • Using indices
> setwd(dir='E:/Rworking')
> car32<-read.csv('mtcars.csv',sep=',',header = T,nrows = 10)
> car_1<-car32[c(1:4),c(2,4,8)]
> car_1
   mpg disp  qsec
1 21.0  160 16.46
2 21.0  160 17.02
3 22.8  108 18.61
4 21.4  258 19.44
  • Logical indexing also works
> car_2<-car32[which(car32$mpg==21),]
> car_2
              X mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4  21   6  160 110  3.9 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
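# which() returns the indices of the elements that satisfy the condition; a single index on a data frame selects columns, so the trailing comma is needed to select rows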
> car_3<-car32[which(car32$mpg>18 & car32$mpg<=21),]
> car_3
                   X  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1          Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2      Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
5  Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
6            Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
10          Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
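One reason to wrap the condition in which() shows up once the column contains missing values. The sketch below is only an illustration: it uses the built-in mtcars dataset instead of the CSV file, and the NA is introduced by hand.

```r
# Built-in mtcars is used here instead of the CSV file (an assumption, to keep the example self-contained)
cars <- head(mtcars, 10)
cars$mpg[3] <- NA                        # introduce a missing value for illustration

keep <- cars$mpg > 18 & cars$mpg <= 21   # logical vector, NA where mpg is NA
by_logical <- cars[keep, ]               # the NA yields an extra row filled with NA
by_which   <- cars[which(keep), ]        # which() keeps only the TRUE positions

nrow(by_logical)   # 6
nrow(by_which)     # 5
```

With a plain logical index, the NA in the condition produces a row of NA values; which() silently drops it.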
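  • The subset() function makes subsetting easy
subset(x, ...)

x: the object to be subsetted; it can be a matrix or a data frame.
subset: a logical expression indicating the elements or rows to keep; missing values are taken as false.
select: an expression indicating the columns to select from a data frame.
drop: passed on to the [ indexing operator.
...: further arguments to be passed to or from other methods.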

> car_4<-subset(car32,(car32$mpg>18 & car32$mpg<=21))
> car_4
                   X  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1          Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2      Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
5  Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
6            Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
10          Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
> 
> car_4<-subset(car32,select = 1:4)
> car_4
                   X  mpg cyl  disp
1          Mazda RX4 21.0   6 160.0
2      Mazda RX4 Wag 21.0   6 160.0
3         Datsun 710 22.8   4 108.0
4     Hornet 4 Drive 21.4   6 258.0
5  Hornet Sportabout 18.7   8 360.0
6            Valiant 18.1   6 225.0
7         Duster 360 14.3   8 360.0
8          Merc 240D 24.4   4 146.7
9           Merc 230 22.8   4 140.8
10          Merc 280 19.2   6 167.6
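The two calls above can be combined into one, and inside subset() the columns can be referred to by bare name. A minimal sketch, again assuming the built-in mtcars dataset rather than the CSV file:

```r
# Built-in mtcars stands in for the CSV file here (an assumption)
cars <- head(mtcars, 10)

# Inside subset(), column names can be used directly, without the cars$ prefix;
# the second argument filters rows, select picks columns
small <- subset(cars, mpg > 18 & mpg <= 21, select = c(mpg, cyl, disp))
small
```

This keeps the same five cars as car_3 above, restricted to three columns.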

Random sampling (the sample function)

sample(x, size, replace = FALSE, prob = NULL)

sample.int(n, size = n, replace = FALSE, prob = NULL,
           useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e7))

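Arguments

x: either a vector of one or more elements from which to choose, or a positive integer.
n: a positive number, the number of items to choose from.
size: a non-negative integer giving the number of items to choose.
replace: whether to sample with replacement; the default is without replacement.
prob: a vector of probability weights for obtaining the elements of the vector being sampled.
useHash: logical indicating if the hash-version of the algorithm should be used. Can only be used for replace = FALSE, prob = NULL, and size <= n/2, and really should be used for large n, as useHash = FALSE will use memory proportional to n.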

> x<-1:100
> sample(x,size=30)
 [1] 17 63 73 68 37  9  3  2 32 64 15 57 69 82 54 62 77
[18] 79 98 87 34 24  7 20 46 40 27 30 33 13

> sort(sample(x,size=30,replace = T))
 [1]  2  4  5  5 10 11 17 17 19 23 26 35 35 42 52 54 56
[18] 56 57 60 63 69 70 71 71 72 91 93 94 97

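# The following errors help in understanding sample.int
# In the first call n and size are already named, so x is matched positionally to replace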
> sample.int(x,n=3,size=30)
Error in sample.int(x, n = 3, size = 30) : invalid 'replace' argument
> sample.int(n=3,size=30)
Error in sample.int(n = 3, size = 30) : 
  cannot take a sample larger than the population when 'replace = FALSE'

> sample.int(n=3,size=30,replace = T)
 [1] 2 1 3 3 3 2 2 3 1 2 2 1 1 2 1 2 1 3 2 2 1 1 2 1 1 1 1 2 2 3
> sample.int(n=80,size=30,replace = F)
 [1] 66 13  3 27 43 48 47 39 74 59 19 25  7 52 75 58 10 38 46 78 68  2 45
[24]  9 50 23 64 32 12 54
> 
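Every call to sample() returns a different draw, which is why the numbers above will not match a rerun. When a reproducible result is wanted, the usual approach is to fix the seed first. A short sketch; the seed value 42 is arbitrary:

```r
set.seed(42)                        # fix the random number generator state
first <- sample(1:100, size = 5)

set.seed(42)                        # same seed again
second <- sample(1:100, size = 5)

identical(first, second)            # TRUE: same seed, same draw
```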

To take a random sample of rows from a data frame:

# who holds the full mtcars dataset, imported the same way as above
> a<-sample(rownames(who),size=5,replace = F)
> a
[1] "22" "25" "31" "23" "17"
> who[a,]
                   X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
22  Dodge Challenger 15.5   8  318 150 2.76 3.520 16.87  0  0    3    2
25  Pontiac Firebird 19.2   8  400 175 3.08 3.845 17.05  0  0    3    2
31     Maserati Bora 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
23       AMC Javelin 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
17 Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
> 
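Sampling row positions instead of row names gives the same result and also works when the row names are not simple numbers. A minimal sketch, assuming the built-in mtcars dataset in place of the imported who data frame:

```r
set.seed(1)                                       # arbitrary seed, for reproducibility
idx <- sample(seq_len(nrow(mtcars)), size = 5)    # 5 distinct row positions
picked <- mtcars[idx, ]                           # the comma selects rows
picked
```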

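Deleting specified rows and columns

Negative indexing is a convenient way to do this

mt<-mtcars[-(1:5),]# delete rows 1 to 5
mt<-mtcars[-(1:5)]# delete columns 1 to 5

A column can also be removed by assigning NULL to it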

> mtcars$mpg=NULL
> head(mtcars)
                  cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant             6  225 105 2.76 3.460 20.22  1  0    3    1
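Negative indices only work with numeric positions, so dropping columns by name needs a different idiom. One common option, sketched here on a copy of the built-in mtcars dataset (the name drop_cols is just an illustrative choice):

```r
cars <- mtcars                                    # work on a copy
drop_cols <- c("mpg", "cyl")                      # columns to remove, by name
cars <- cars[, setdiff(names(cars), drop_cols)]   # keep every column not in drop_cols
names(cars)
```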

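Merging data frames

# USArrests: arrest rates for three violent crimes, plus the urban population percentage, for each US state
# state.division: the geographic division of each US state

# The simplest way to combine columns is to call data.frame directly
> a<-head(USArrests)
> b<-state.division[1:6]
> data.frame(a,b)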
           Murder Assault UrbanPop Rape                  b
Alabama      13.2     236       58 21.2 East South Central
Alaska       10.0     263       48 44.5            Pacific
Arizona       8.1     294       80 31.0           Mountain
Arkansas      8.8     190       50 19.5 West South Central
California    9.0     276       91 40.6            Pacific
Colorado      7.9     204       78 38.7           Mountain
> cbind(a,b)
           Murder Assault UrbanPop Rape                  b
Alabama      13.2     236       58 21.2 East South Central
Alaska       10.0     263       48 44.5            Pacific
Arizona       8.1     294       80 31.0           Mountain
Arkansas      8.8     190       50 19.5 West South Central
California    9.0     276       91 40.6            Pacific
Colorado      7.9     204       78 38.7           Mountain
> 
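In both results the new column simply inherits the variable name b. Passing the vector as a named argument gives it a more descriptive name. A small sketch; the column name division is an illustrative choice:

```r
a <- head(USArrests)
b <- state.division[1:6]

combined <- cbind(a, division = b)   # name the new column explicitly
names(combined)
```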

Rows can be combined as well, but for row binding the two objects must have the same column names

> c<-head(USArrests)
> d<-tail(USArrests)
> rbind(c,d)
              Murder Assault UrbanPop Rape
Alabama         13.2     236       58 21.2
Alaska          10.0     263       48 44.5
Arizona          8.1     294       80 31.0
Arkansas         8.8     190       50 19.5
California       9.0     276       91 40.6
Colorado         7.9     204       78 38.7
Vermont          2.2      48       32 11.2
Virginia         8.5     156       63 20.7
Washington       4.0     145       73 26.2
West Virginia    5.7      81       39  9.3
Wisconsin        2.6      53       66 10.8
Wyoming          6.8     161       60 15.6
> 
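To see the column-name requirement in action, the sketch below renames one column on purpose and catches the resulting error. The names c1 and d1 and the replacement name Homicide are made up for the illustration:

```r
c1 <- head(USArrests, 3)
d1 <- tail(USArrests, 3)
names(d1)[1] <- "Homicide"            # hypothetical rename to force a mismatch

# rbind() on data frames stops with an error when the names differ
result <- tryCatch(rbind(c1, d1), error = function(e) conditionMessage(e))
result                                # an error message, not a data frame

names(d1)[1] <- "Murder"              # restore the matching name
nrow(rbind(c1, d1))                   # 6
```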

However, consider this case:

> x<-USArrests
> x=head(x,5)
> y=head(x,5)
> z=tail(x,5)
> rbind(y,z)
            Murder Assault UrbanPop Rape
Alabama       13.2     236       58 21.2
Alaska        10.0     263       48 44.5
Arizona        8.1     294       80 31.0
Arkansas       8.8     190       50 19.5
California     9.0     276       91 40.6
Alabama1      13.2     236       58 21.2
Alaska1       10.0     263       48 44.5
Arizona1       8.1     294       80 31.0
Arkansas1      8.8     190       50 19.5
California1    9.0     276       91 40.6
> 

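The two pieces are combined in full and the duplicates are not removed; rbind only appends a suffix to keep the row names unique.

Excel has a feature for removing duplicates; the same can be done in R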

> data1<-rbind(y,z)
> rownames(data1)# get the row names
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"  "Alabama1"   
 [7] "Alaska1"     "Arizona1"    "Arkansas1"   "California1"
> duplicated(rownames(data1))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> duplicated(data1)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
> data1[duplicated(data1),]# extract the duplicated rows
            Murder Assault UrbanPop Rape
Alabama1      13.2     236       58 21.2
Alaska1       10.0     263       48 44.5
Arizona1       8.1     294       80 31.0
Arkansas1      8.8     190       50 19.5
California1    9.0     276       91 40.6
> data1[!duplicated(data1),]# extract the non-duplicated rows
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6

This can also be done in a single step

> unique(data1)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
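The transcript above shows that duplicated() on a data frame compares the row contents and ignores the row names, which is why Alabama1 counts as a duplicate of Alabama. The following sketch rebuilds the same situation in a self-contained way and checks that unique() agrees with the manual filter:

```r
y <- head(USArrests, 5)
data1 <- rbind(y, y)                  # row names are made unique, values repeat

dup <- duplicated(data1)              # compares row contents, not row names
sum(dup)                              # 5

deduped <- data1[!dup, ]              # manual filter
same <- isTRUE(all.equal(unique(data1), deduped))
same
```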

cbind() and rbind() can also be used to combine matrices
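A quick illustration with two small matrices (m1 and m2 are made-up examples): cbind() places them side by side, rbind() stacks them.

```r
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)

wide <- cbind(m1, m2)   # 2 rows, 4 columns
tall <- rbind(m1, m2)   # 4 rows, 2 columns

dim(wide)
dim(tall)
```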


Reposted from blog.csdn.net/qq_43264642/article/details/88419576