R for Beginners: Data Transformation (Part 4)

First, load the built-in WorldPhones dataset and convert it from a matrix to a data frame.

> WorldPhones
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076
> x<-as.data.frame(WorldPhones)

Next, compute the row sums and the column means.

> rs<-rowSums(x)
> rs
  1951   1956   1957   1958   1959   1960   1961 
 74494 102199 110001 118399 124801 133709 141700 
> mean<-colMeans(x)
> mean
    N.Amer     Europe       Asia     S.Amer    Oceania     Africa   Mid.Amer 
66747.5714 34343.4286  6229.2857  2772.2857  2625.0000  1484.0000   841.7143

Then combine them with the original data using cbind() and rbind().

> total<-cbind(x,rs)
> total_1<-rbind(total,mean)
> total_1
       N.Amer   Europe     Asia   S.Amer Oceania Africa  Mid.Amer        rs
1951 45939.00 21574.00 2876.000 1815.000    1646     89  555.0000  74494.00
1956 60423.00 29990.00 4708.000 2568.000    2366   1411  733.0000 102199.00
1957 64721.00 32510.00 5230.000 2695.000    2526   1546  773.0000 110001.00
1958 68484.00 35218.00 6662.000 2845.000    2691   1663  836.0000 118399.00
1959 71799.00 37598.00 6856.000 3000.000    2868   1769  911.0000 124801.00
1960 76036.00 40341.00 8220.000 3145.000    3054   1905 1008.0000 133709.00
1961 79831.00 43173.00 9053.000 3338.000    3224   2005 1076.0000 141700.00
8    66747.57 34343.43 6229.286 2772.286    2625   1484  841.7143  66747.57

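Notice that the last row holds no real mean for the rs column. The vector mean has only 7 values while total has 8 columns, so the missing eighth value was filled by recycling the first one (66747.57).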
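One way to get a correct bottom row is to take the column means after the rs column has been attached, so the vector has one value per column. This is a minimal sketch; the names col_means and total_fixed and the row label "mean" are illustrative and do not come from the original session.

```r
# Rebuild the table from the built-in WorldPhones matrix
x <- as.data.frame(WorldPhones)
total <- cbind(x, rs = rowSums(x))   # 7 region columns + 1 row-sum column

# Take the means over all 8 columns so nothing has to be recycled
col_means <- colMeans(total)
total_fixed <- rbind(total, col_means)

# Give the new row a readable label instead of an automatic one
rownames(total_fixed)[nrow(total_fixed)] <- "mean"

total_fixed
```

Storing the result as col_means rather than mean also avoids creating a variable with the same name as the base function mean().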

R provides the apply family of functions for this kind of row-wise and column-wise work.

apply(X, MARGIN, FUN, ...)

Arguments

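`X`	an array, including a matrix
`MARGIN`	the subscript(s) of the dimension to operate over; for a matrix, 1 means rows and 2 means columns
`FUN`	the function to apply to each row or column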
`...`	optional arguments to `FUN`.

> apply(x,MARGIN = 1,FUN = sum)
  1951   1956   1957   1958   1959   1960   1961 
 74494 102199 110001 118399 124801 133709 141700 
> apply(x,MARGIN = 2,FUN=mean)
    N.Amer     Europe       Asia     S.Amer    Oceania     Africa 
66747.5714 34343.4286  6229.2857  2772.2857  2625.0000  1484.0000 
  Mid.Amer 
  841.7143
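As a quick check, assuming only the built-in WorldPhones data, the two apply() calls above should agree with the dedicated rowSums() and colMeans() helpers. The sketch also applies a function that returns more than one value per column, in which case apply() returns a matrix. The variable names are illustrative.

```r
m <- WorldPhones  # built-in numeric matrix: 7 years by 7 regions

# MARGIN = 1 works over rows, MARGIN = 2 over columns
row_totals <- apply(m, MARGIN = 1, FUN = sum)
col_avgs   <- apply(m, MARGIN = 2, FUN = mean)

# Compare with the specialised helpers
all.equal(row_totals, rowSums(m))
all.equal(col_avgs, colMeans(m))

# range() returns two values (min, max), so the result has one column per region
col_ranges <- apply(m, MARGIN = 2, FUN = range)
dim(col_ranges)
```

For plain sums and means, rowSums() and colMeans() are usually the faster choice; apply() is more useful when the function is arbitrary.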

lapply() always returns a list; sapply() returns a vector or matrix when the results can be simplified.

lapply(X, FUN, ...)

sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)

replicate(n, expr, simplify = "array")

simplify2array(x, higher = TRUE)

Arguments

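`X`	a vector (atomic or list) or an expression object
`FUN`	the function to be applied to each element of `X`
`...`	optional arguments to `FUN`.
`simplify`	logical or character string; should the result be simplified to a vector, matrix or higher dimensional array if possible? For `sapply` it must be named and not abbreviated. The default value, `TRUE`, returns a vector or matrix if appropriate, whereas if `simplify = "array"` the result may be an array of “rank” (=`length(dim(.))`) one higher than the result of `FUN(X[[i]])`.
`USE.NAMES`	logical; if `TRUE` and if `X` is character, use `X` as names for the result unless it had names already.
`FUN.VALUE`	a (generalized) vector; a template for the return value from FUN. See ‘Details’.
`n`	integer: the number of replications.
`expr`	the expression (a language object, usually a call) to evaluate repeatedly.
`x`	a list, typically one returned by `lapply()`
`higher`	logical; if true, `simplify2array()` will produce a higher rank array when appropriate, whereas `higher = FALSE` returns only a matrix (or vector).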
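The signatures for replicate() and simplify2array() are listed above but not demonstrated, so here is a small sketch with made-up inputs. The expression and the helper function are purely illustrative.

```r
# replicate() evaluates an expression n times and simplifies the results;
# three length-2 vectors become a matrix with one column per replication
r <- replicate(3, c(a = 1, b = 2))
dim(r)

# simplify2array() turns a list of equal-length vectors into a matrix,
# which is the same simplification sapply() performs
pieces <- lapply(1:3, function(i) c(i, i^2))
arr <- simplify2array(pieces)
dim(arr)
all.equal(arr, sapply(1:3, function(i) c(i, i^2)))
```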

For example:

> state.center
$`x`
 [1]  -86.7509 -127.2500 -111.6250  -92.2992 -119.7730
 [6] -105.5130  -72.3573  -74.9841  -81.6850  -83.3736
[11] -126.2500 -113.9300  -89.3776  -86.0808  -93.3714
[16]  -98.1156  -84.7674  -92.2724  -68.9801  -76.6459
[21]  -71.5800  -84.6870  -94.6043  -89.8065  -92.5137
[26] -109.3200  -99.5898 -116.8510  -71.3924  -74.2336
[31] -105.9420  -75.1449  -78.4686 -100.0990  -82.5963
[36]  -97.1239 -120.0680  -77.4500  -71.1244  -80.5056
[41]  -99.7238  -86.4560  -98.7857 -111.3300  -72.5450
[46]  -78.2005 -119.7460  -80.6665  -89.9941 -107.2560

$y
 [1] 32.5901 49.2500 34.2192 34.7336 36.5341 38.6777 41.5928
 [8] 38.6777 27.8744 32.3329 31.7500 43.5648 40.0495 40.0495
[15] 41.9358 38.4204 37.3915 30.6181 45.6226 39.2778 42.3645
[22] 43.1361 46.3943 32.6758 38.3347 46.8230 41.3356 39.1063
[29] 43.3934 39.9637 34.4764 43.1361 35.4195 47.2517 40.2210
[36] 35.5053 43.9078 40.9069 41.5928 33.6190 44.3365 35.6767
[43] 31.3897 39.1063 44.2508 37.5630 47.4231 38.4204 44.5937
[50] 43.0504

> lapply(state.center,FUN=length)
$`x`
[1] 50

$y
[1] 50
> class(lapply(state.center,FUN=length))
[1] "list"
> sapply(state.center,FUN=length)
 x  y 
50 50 
> class(sapply(state.center,FUN=length))
[1] "integer"
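vapply() appears in the signatures above; a short sketch of how it differs from sapply() may help. It takes a template (FUN.VALUE) describing what each call must return, and it raises an error if a result does not match. The second part shows sapply() returning a matrix when each call yields two values. Variable names are illustrative.

```r
# Each call to length() must return exactly one integer
lens <- vapply(state.center, FUN = length, FUN.VALUE = integer(1))
lens

# Same values as sapply(), but with the return type checked
identical(lens, sapply(state.center, FUN = length))

# range() returns two numbers per element, so sapply() builds a matrix
rng <- sapply(state.center, FUN = range)
dim(rng)
```

Because the return shape is fixed in advance, vapply() tends to be the safer choice inside functions and scripts, while sapply() is convenient at the console.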

tapply() applies a function to groups of values defined by a factor.

tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)

Arguments

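`X`	the data to be grouped; typically a vector-like object that can be subset with `[`
`INDEX`	a factor, or a list of factors, each the same length as `X`, used to split `X` into groups; other values are coerced with `as.factor`
`FUN`	the function to apply to each group
`...`	optional arguments to `FUN`; see the Note section of the help page.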
`default`	(only in the case of simplification to an array) the value with which the array is initialized as `array(default, dim = ..)`. Before R 3.4.0, this was hard coded to `array()`'s default `NA`. If it is `NA` (the default), the missing value of the answer type, e.g. `NA_real_`, is chosen (`as.raw(0)` for `"raw"`). In a numerical case, it may be set, e.g., to `FUN(integer(0))`, e.g., in the case of `FUN = sum` to `0` or `0L`.
`simplify`	logical; if `FALSE`, `tapply` always returns an array of mode `"list"`; in other words, a `list` with a `dim` attribute. If `TRUE` (the default), then if `FUN` always returns a scalar, `tapply` returns an array with the mode of the scalar.

> state.name
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
[49] "Wisconsin"      "Wyoming"       
> state.division
 [1] East South Central Pacific            Mountain           West South Central
 [5] Pacific            Mountain           New England        South Atlantic    
 [9] South Atlantic     South Atlantic     Pacific            Mountain          
[13] East North Central East North Central West North Central West North Central
[17] East South Central West South Central New England        South Atlantic    
[21] New England        East North Central West North Central East South Central
[25] West North Central Mountain           West North Central Mountain          
[29] New England        Middle Atlantic    Mountain           Middle Atlantic   
[33] South Atlantic     West North Central East North Central West South Central
[37] Pacific            Middle Atlantic    New England        South Atlantic    
[41] West North Central East South Central West South Central Mountain          
[45] New England        South Atlantic     Pacific            South Atlantic    
[49] East North Central Mountain          
9 Levels: New England Middle Atlantic South Atlantic ... Pacific

Use the first vector together with the second factor to count how many states each U.S. census division contains.

> tapply(state.name,state.division,FUN=length)
       New England    Middle Atlantic     South Atlantic East South Central 
                 6                  3                  8                  4 
West South Central East North Central West North Central           Mountain 
                 4                  5                  7                  8 
           Pacific 
                 5
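Counting is only one use of tapply(). As a hedged sketch using the built-in state.x77 matrix (not part of the original session), the same grouping factor can summarise a numeric column, here the Income column. The variable names are illustrative.

```r
# Counts per division, as in the transcript above
counts <- tapply(state.name, state.division, FUN = length)

# table() gives the same counts directly from the factor
all(as.vector(table(state.division)) == as.vector(counts))

# Group a numeric vector by the same factor and average each group
income <- state.x77[, "Income"]
mean_income <- tapply(income, state.division, FUN = mean)
round(mean_income)
```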

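Centering and standardizing in R

Centering: subtract the column mean from each value.

Standardizing: after centering, divide each value by the column's standard deviation.

Use the scale() function, here on h, the first six rows of the built-in state.x77 matrix:

> h<-head(state.x77)
> scale(h,center=T,scale=T)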
           Population     Income  Illiteracy     Life Exp      Murder    HS Grad      Frost        Area
Alabama    -0.2200732 -0.9501180  1.09913001 -1.236374542  1.66820651 -1.1946743 -0.7660111 -0.62635214
Alaska     -0.6346639  1.5643230 -0.03140371 -1.023017873  0.36563430  0.9548932  1.1417902  1.99851088
Arizona    -0.3990488 -0.1035615  0.53386315 -0.005470684 -0.83410325  0.2270869 -0.8382763 -0.30718426
Arkansas   -0.4120606 -1.1799777  0.72228544  0.084795599 -0.04570429 -1.3131544 -0.1156243 -0.62005622
California  2.0229261  0.4421218 -0.78509287  0.946428300  0.02285214  0.6079157 -0.7660111 -0.08861363
Colorado   -0.3570795  0.2272123 -1.53878202  1.233639200 -1.17688541  0.7179330  1.3441327 -0.35630463
attr(,"scaled:center")
  Population       Income   Illiteracy     Life Exp       Murder      HS Grad        Frost         Area 
5.340167e+03 4.640833e+03 1.516667e+00 7.055667e+01 1.023333e+01 5.541667e+01 7.300000e+01 1.737715e+05 
attr(,"scaled:scale")
  Population       Income   Illiteracy     Life Exp       Murder      HS Grad        Frost         Area 
7.839057e+03 1.070218e+03 5.307228e-01 1.218617e+00 2.917305e+00 1.181633e+01 6.918959e+01 1.964765e+05
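To make the two definitions concrete, the following sketch reproduces the result of scale() by hand: subtract each column's mean, then divide by each column's standard deviation. It assumes h is the first six rows of state.x77; the names z and manual are illustrative.

```r
h <- head(state.x77)                       # first six states
z <- scale(h, center = TRUE, scale = TRUE)

# Step 1: centering, subtract each column's mean
centered <- sweep(h, MARGIN = 2, STATS = colMeans(h), FUN = "-")

# Step 2: scaling, divide each column by its standard deviation
manual <- sweep(centered, MARGIN = 2, STATS = apply(h, 2, sd), FUN = "/")

# Same numbers as scale(), ignoring the extra "scaled:*" attributes
all.equal(manual, z, check.attributes = FALSE)

# After standardizing, every column has mean 0 and standard deviation 1
round(colMeans(z), 10)
apply(z, 2, sd)
```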
