Pandas study notes (third bomb)

NOTE: This tutorial tutorials This section connected in front of the first elastic
jump topic addresses to
jump to the previous chapter
This chapter describes some of the data acquisition mode in the subset of pandas

14 Select a subset of data

14.1 Select the row of Series

14.1.1 Get a column from the DataFrame as Series

city = college_data["CITY"]
print(city)
print("<"+"="*75+">")
print("类型为:",type(city))
INSTNM
Alabama A & M University                                            Normal
University of Alabama at Birmingham                             Birmingham
Amridge University                                              Montgomery
University of Alabama in Huntsville                             Huntsville
                                                                ...       
Rasmussen College - Overland Park                            Overland Park
National Personal Training Institute of Cleveland         Highland Heights
Bay Area Medical Academy - San Jose Satellite Location            San Jose
Excel Learning Center-San Antonio South                        San Antonio
Name: CITY, Length: 7535, dtype: object
<===========================================================================>
类型为: <class 'pandas.core.series.Series'>

14.1.2 iloc usage

14.1.2.1 Select an incoming integer index

city.iloc[0]
'Normal'

14.1.2.2 Incoming integer list to select a new Series

# 当传入列表时发现获取出来的数据结构依然是一个Series
city.iloc[[0,1,2,3]]
INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Name: CITY, dtype: object

14.1.2.3 Shard acquisition

# 获取整数索引 [0,10) 步长为2,这样选出的依然为Series
city[0:10:2]
INSTNM
Alabama A & M University                     Normal
Amridge University                       Montgomery
Alabama State University                 Montgomery
Central Alabama Community College    Alexander City
Auburn University at Montgomery          Montgomery
Name: CITY, dtype: object

14.1.3 loc usage

Pass in an index to select one

city["Alabama A & M University"]
'Normal'

14.1.3.1 Select multiple lines from the tag list

# 这样通过传入的标签索引列表选取多行,返回的依然是Series
city[["Alabama A & M University","Amridge University"]]
INSTNM
Alabama A & M University        Normal
Amridge University          Montgomery
Name: CITY, dtype: object

14.1.3.2 Shard selection

# 选取标签索引 [start_target,end_target] 步长为1的行,返回的是Series,注意这里是端点值都能取到
city["Alabama A & M University":"University of Alabama in Huntsville":1]
INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Name: CITY, dtype: object

14.2 Select rows of DataFrame

14.2.1 iloc usage

14.2.1.1 Pass in an integer index value to get a row of data (return type is Series)

college_data.iloc[0]
CITY                  Normal
STABBR                    AL
HBCU                       1
MENONLY                    0
                       ...  
PCTFLOAN              0.8284
UG25ABV               0.1049
MD_EARN_WNE_P10        30300
GRAD_DEBT_MDN_SUPP     33888
Name: Alabama A & M University, Length: 26, dtype: object

14.2.1.2 Pass in an integer index list and return multiple rows of data of type (DataFrame)

college_data.iloc[[1,3,5,7,9]]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
University of Alabama in Huntsville Huntsville AL 0.0 0.0 ... 0.4596 0.2640 45500 24097
The University of Alabama Tuscaloosa AL 0.0 0.0 ... 0.4010 0.0853 41900 23750
Athens State University Athens AL 0.0 0.0 ... 0.6296 0.6410 39000 18595
Auburn University Auburn AL 0.0 0.0 ... 0.3494 0.0415 45700 21831

5 rows × 26 columns

14.2.1.3 Shard Acquisition

# 获取索引为 [1,10) 步长为2 中的数据行,返回为DataFrame
college_data.iloc[1:10:2]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
University of Alabama in Huntsville Huntsville AL 0.0 0.0 ... 0.4596 0.2640 45500 24097
The University of Alabama Tuscaloosa AL 0.0 0.0 ... 0.4010 0.0853 41900 23750
Athens State University Athens AL 0.0 0.0 ... 0.6296 0.6410 39000 18595
Auburn University Auburn AL 0.0 0.0 ... 0.3494 0.0415 45700 21831

5 rows × 26 columns

14.2.2 loc usage

14.2.2.1 Passing a label to get a row

# 获取标签索引对应的数据行,返回类型为Series
college_data.loc["University of Alabama at Birmingham"]
CITY                  Birmingham
STABBR                        AL
HBCU                           0
MENONLY                        0
                         ...    
PCTFLOAN                  0.5214
UG25ABV                   0.2422
MD_EARN_WNE_P10            39700
GRAD_DEBT_MDN_SUPP       21941.5
Name: University of Alabama at Birmingham, Length: 26, dtype: object

14.2.2.2 传入一个标签列表获取多行

# 根据传入的标签列表返回相应的数据行,返回类型为DataFrame
college_data.loc[["University of Alabama at Birmingham","The University of Alabama"]]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
The University of Alabama Tuscaloosa AL 0.0 0.0 ... 0.4010 0.0853 41900 23750

2 rows × 26 columns

14.2.2.3 分片获取

# 获取[start_target,end_tartget] 步长为 1 的数据行,返回为DataFrame
college_data.loc["University of Alabama at Birmingham":"University of Alabama in Huntsville":1]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
Amridge University Montgomery AL 0.0 0.0 ... 0.7795 0.8540 40100 23370
University of Alabama in Huntsville Huntsville AL 0.0 0.0 ... 0.4596 0.2640 45500 24097

3 rows × 26 columns

14.3 同时选取DataFrame的行和列

14.3.1 获取前n行m列

14.3.1.1 用 iloc 方法实现

# 获取前面两行三列数据
college_data.iloc[:2,:3]
CITY STABBR HBCU
INSTNM
Alabama A & M University Normal AL 1.0
University of Alabama at Birmingham Birmingham AL 0.0

14.3.1.2 用 loc方法 实现

# 获取行索引从[start_target,end_target]的行,和列索引为[start,end]的列
college_data.loc[:"University of Alabama at Birmingham",:"HBCU"]
CITY STABBR HBCU
INSTNM
Alabama A & M University Normal AL 1.0
University of Alabama at Birmingham Birmingham AL 0.0

14.3.2 获取全部行中的前n列

14.3.2.1 用 iloc 方法实现

college_data.iloc[:,:2]
CITY STABBR
INSTNM
Alabama A & M University Normal AL
University of Alabama at Birmingham Birmingham AL
Amridge University Montgomery AL
University of Alabama in Huntsville Huntsville AL
... ... ...
Rasmussen College - Overland Park Overland Park KS
National Personal Training Institute of Cleveland Highland Heights OH
Bay Area Medical Academy - San Jose Satellite Location San Jose CA
Excel Learning Center-San Antonio South San Antonio TX

7535 rows × 2 columns

14.3.2.2 用 loc 方法实现

college_data.loc[:,:"STABBR"]
CITY STABBR
INSTNM
Alabama A & M University Normal AL
University of Alabama at Birmingham Birmingham AL
Amridge University Montgomery AL
University of Alabama in Huntsville Huntsville AL
... ... ...
Rasmussen College - Overland Park Overland Park KS
National Personal Training Institute of Cleveland Highland Heights OH
Bay Area Medical Academy - San Jose Satellite Location San Jose CA
Excel Learning Center-San Antonio South San Antonio TX

7535 rows × 2 columns

14.3.3 选取不连续的行和列

14.3.3.1 用 iloc 方法实现

college_data.iloc[[1,3,5,7],[2,4,6,8]]
HBCU WOMENONLY SATVRMID DISTANCEONLY
INSTNM
University of Alabama at Birmingham 0.0 0.0 570.0 0.0
University of Alabama in Huntsville 0.0 0.0 595.0 0.0
The University of Alabama 0.0 0.0 555.0 0.0
Athens State University 0.0 0.0 NaN 0.0

14.3.3.2 用 loc 方法实现

# 这里实现的需求同上
college_data.loc[["University of Alabama at Birmingham","University of Alabama in Huntsville","The University of Alabama","Athens State University"],
                ["HBCU","WOMENONLY","SATVRMID","DISTANCEONLY"]]
HBCU WOMENONLY SATVRMID DISTANCEONLY
INSTNM
University of Alabama at Birmingham 0.0 0.0 570.0 0.0
University of Alabama in Huntsville 0.0 0.0 595.0 0.0
The University of Alabama 0.0 0.0 555.0 0.0
Athens State University 0.0 0.0 NaN 0.0

14.3.4 选取某一个标量的值

14.3.4.1 用 iloc方法实现

# 选取第四行四列的值
college_data.iloc[3,3]
0.0

14.3.4.2 用 loc 方法实现

# 实现的需求同上
college_data.loc["Athens State University","MENONLY"]
0.0

14.3.4.3 使用 iat 快速获取标量

%timeit college_data.iloc[1000,3]
# 可以看到使用iat方法,时间上大概节约了一半
%timeit college_data.iat[1000,3]
7.95 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.74 µs ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

14.3.4.4 使用at快速获取标量

%timeit college_data.loc["Rasmussen College - Overland Park","CITY"]
# 同样发现使用at方法比loc的速度也快
%timeit college_data.at["Rasmussen College - Overland Park","CITY"]
6.41 µs ± 58.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.16 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

14.4 补充

14.4.1 惰性切片

# 同样试用于Series
college_data[2:10:2]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
Amridge University Montgomery AL 0.0 0.0 ... 0.7795 0.8540 40100 23370
Alabama State University Montgomery AL 1.0 0.0 ... 0.7554 0.1270 26600 33118.5
Central Alabama Community College Alexander City AL 0.0 0.0 ... 0.3977 0.3153 27500 16127
Auburn University at Montgomery Montgomery AL 0.0 0.0 ... 0.5803 0.2930 35000 21335

4 rows × 26 columns

# 利用标签索引获取
college_data[:"Central Alabama Community College"]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
Alabama A & M University Normal AL 1.0 0.0 ... 0.8284 0.1049 30300 33888
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
Amridge University Montgomery AL 0.0 0.0 ... 0.7795 0.8540 40100 23370
University of Alabama in Huntsville Huntsville AL 0.0 0.0 ... 0.4596 0.2640 45500 24097
Alabama State University Montgomery AL 1.0 0.0 ... 0.7554 0.1270 26600 33118.5
The University of Alabama Tuscaloosa AL 0.0 0.0 ... 0.4010 0.0853 41900 23750
Central Alabama Community College Alexander City AL 0.0 0.0 ... 0.3977 0.3153 27500 16127

7 rows × 26 columns

14.4.2 按照字母分片

# 按照字母分片必须先对标签索引进行排序
college_data.sort_index(ascending=True)["A":"E"]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
A & W Healthcare Educators New Orleans LA 0.0 0.0 ... 0.8596 0.6667 NaN 19022.5
A T Still University of Health Sciences Kirksville MO 0.0 0.0 ... NaN NaN 219800 PrivacySuppressed
ABC Beauty Academy Garland TX 0.0 0.0 ... 0.0000 0.8286 NaN PrivacySuppressed
ABC Beauty College Inc Arkadelphia AR 0.0 0.0 ... 1.0000 0.4688 PrivacySuppressed 16500
... ... ... ... ... ... ... ... ... ...
Durham Technical Community College Durham NC 0.0 0.0 ... 0.1796 0.5961 27200 11069.5
Dutchess BOCES-Practical Nursing Program Poughkeepsie NY 0.0 0.0 ... 0.6275 0.5430 36500 9500
Dutchess Community College Poughkeepsie NY 0.0 0.0 ... 0.1936 0.1806 32500 10250
Dyersburg State Community College Dyersburg TN 0.0 0.0 ... 0.2493 0.3097 26800 7475

1900 rows × 26 columns

# 当然还可以反向获取
college_data.sort_index()["E":"F"]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
E Q School of Hair Design Council Bluffs IA 0.0 0.0 ... 0.6737 0.1471 18100 7830
ECPI University Virginia Beach VA 0.0 0.0 ... 0.5001 0.6633 37000 20000
ECPI University-Charleston North Charleston SC NaN NaN ... NaN NaN NaN 20000
ECPI University-Charlotte Charlotte NC NaN NaN ... NaN NaN NaN 20000
... ... ... ... ... ... ... ... ... ...
Excelsior College Albany NY 0.0 0.0 ... 0.0800 0.9337 PrivacySuppressed 11010
Expertise Cosmetology Institute Las Vegas NV 0.0 0.0 ... 1.0000 0.4828 PrivacySuppressed 8450
Exposito School of Hair Design Amarillo TX 0.0 0.0 ... 0.6267 0.3966 15100 PrivacySuppressed
Expression College for Digital Arts Emeryville CA 0.0 0.0 ... 0.7736 0.3955 PrivacySuppressed 35662

381 rows × 26 columns

14.4.3 更换索引

college_data.set_index("CITY")
STABBR HBCU MENONLY WOMENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
CITY
Normal AL 1.0 0.0 0.0 ... 0.8284 0.1049 30300 33888
Birmingham AL 0.0 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
Montgomery AL 0.0 0.0 0.0 ... 0.7795 0.8540 40100 23370
Huntsville AL 0.0 0.0 0.0 ... 0.4596 0.2640 45500 24097
... ... ... ... ... ... ... ... ... ...
Overland Park KS NaN NaN NaN ... NaN NaN NaN 21163
Highland Heights OH NaN NaN NaN ... NaN NaN NaN 6333
San Jose CA NaN NaN NaN ... NaN NaN NaN PrivacySuppressed
San Antonio TX NaN NaN NaN ... NaN NaN NaN 12125

7535 rows × 25 columns

14.4.4 复原索引

college_data.reset_index()
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5
2 Amridge University Montgomery AL 0.0 ... 0.7795 0.8540 40100 23370
3 University of Alabama in Huntsville Huntsville AL 0.0 ... 0.4596 0.2640 45500 24097
... ... ... ... ... ... ... ... ... ...
7531 Rasmussen College - Overland Park Overland Park KS NaN ... NaN NaN NaN 21163
7532 National Personal Training Institute of Cleveland Highland Heights OH NaN ... NaN NaN NaN 6333
7533 Bay Area Medical Academy - San Jose Satellite ... San Jose CA NaN ... NaN NaN NaN PrivacySuppressed
7534 Excel Learning Center-San Antonio South San Antonio TX NaN ... NaN NaN NaN 12125

7535 rows × 27 columns

发布了35 篇原创文章 · 获赞 78 · 访问量 1万+

Guess you like

Origin blog.csdn.net/qq_42359956/article/details/105510244