python数据清洗工具、方法、过程整理归纳(四、数据清洗之数据转换——日期格式数据处理、高阶函数数据处理、字符串数据处理)

5 数据转换

5.1 日期格式数据处理

  • pandas中使用to_datetime()方法将文本格式转换为日期格式
  • dataframe数据类型如果为datatime64,可以使用dt方法取出年月日等
  • 对于时间差数据,可以使用datedelta函数将其转化为指定时间单位的数值
  • 时间差数据,可以使用dt方法访问其常用属性
import numpy as np

import pandas as pd

import os 

os.chdir(r'D:\code\jupyter\course\代码和数据')

df = pd.read_csv('baby_trade_history.csv', encoding = 'utf-8', dtype = {'user_id':str})

 df.head()

	user_id 	auction_id 	cat_id 	cat1 	property 	buy_mount 	day
0 	786295544 	41098319944 	50014866 	50022520 	21458:86755362;13023209:3593274;10984217:21985... 	2 	20140919
1 	532110457 	17916191097 	50011993 	28 	21458:11399317;1628862:3251296;21475:137325;16... 	1 	20131011
2 	249013725 	21896936223 	50012461 	50014815 	21458:30992;1628665:92012;1628665:3233938;1628... 	1 	20131011
3 	917056007 	12515996043 	50018831 	50014815 	21458:15841995;21956:3494076;27000458:59723383... 	2 	20141023
4 	444069173 	20487688075 	50013636 	50008168 	21458:30992;13658074:3323064;1628665:3233941;1... 	1 	20141103

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29971 entries, 0 to 29970
Data columns (total 7 columns):
user_id       29971 non-null object
auction_id    29971 non-null int64
cat_id        29971 non-null int64
cat1          29971 non-null int64
property      29827 non-null object
buy_mount     29971 non-null int64
day           29971 non-null int64
dtypes: int64(5), object(2)
memory usage: 1.6+ MB

df['buy_date'] = pd.to_datetime(df['day'], format = '%Y%m%d', errors = 'coerce')#新建一列(buy_date),内容为:把day这一列从整数型转换为时间格式,指定读取的格式为年月日,如果不是这个格式,则自动填为空值,不会报错

​

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29971 entries, 0 to 29970
Data columns (total 8 columns):
user_id       29971 non-null object
auction_id    29971 non-null int64
cat_id        29971 non-null int64
cat1          29971 non-null int64
property      29827 non-null object
buy_mount     29971 non-null int64
day           29971 non-null int64
buy_date      29971 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(2)
memory usage: 1.8+ MB

df.head()

	user_id 	auction_id 	cat_id 	cat1 	property 	buy_mount 	day 	buy_date
0 	786295544 	41098319944 	50014866 	50022520 	21458:86755362;13023209:3593274;10984217:21985... 	2 	20140919 	2014-09-19
1 	532110457 	17916191097 	50011993 	28 	21458:11399317;1628862:3251296;21475:137325;16... 	1 	20131011 	2013-10-11
2 	249013725 	21896936223 	50012461 	50014815 	21458:30992;1628665:92012;1628665:3233938;1628... 	1 	20131011 	2013-10-11
3 	917056007 	12515996043 	50018831 	50014815 	21458:15841995;21956:3494076;27000458:59723383... 	2 	20141023 	2014-10-23
4 	444069173 	20487688075 	50013636 	50008168 	21458:30992;13658074:3323064;1628665:3233941;1... 	1 	20141103 	2014-11-03

df['buy_date'].dt.year#提取年份

0        2014
1        2013
2        2013
3        2014
4        2014
5        2014
6        2012
7        2012
8        2012
9        2012
10       2012
11       2012
12       2012
13       2012
14       2013
15       2013
16       2013
17       2013
18       2013
19       2014
20       2014
21       2014
22       2014
23       2014
24       2014
25       2013
26       2014
27       2015
28       2014
29       2014
         ... 
29941    2013
29942    2013
29943    2013
29944    2014
29945    2014
29946    2014
29947    2014
29948    2013
29949    2013
29950    2013
29951    2014
29952    2014
29953    2015
29954    2015
29955    2014
29956    2014
29957    2014
29958    2014
29959    2014
29960    2013
29961    2013
29962    2013
29963    2012
29964    2012
29965    2012
29966    2014
29967    2014
29968    2013
29969    2013
29970    2013
Name: buy_date, Length: 29971, dtype: int64

df['buy_date'].dt.month

0         9
1        10
2        10
3        10
4        11
5        11
6        12
7        12
8        11
9        11
10       11
11       11
12       11
13       11
14       10
15       10
16       11
17       11
18       11
19        2
20        2
21        5
22        5
23        8
24        8
25        2
26        9
27        2
28       10
29       10
         ..
29941     1
29942     1
29943     1
29944    12
29945    12
29946    10
29947    10
29948     7
29949     7
29950     7
29951     3
29952     3
29953     1
29954     1
29955     8
29956     8
29957    10
29958    11
29959    11
29960     9
29961     9
29962     9
29963    12
29964    12
29965    12
29966     1
29967     1
29968    10
29969    10
29970    10
Name: buy_date, Length: 29971, dtype: int64

df['diff_day'] = pd.datetime.now() - df['buy_date']

df.head()

	user_id 	auction_id 	cat_id 	cat1 	property 	buy_mount 	day 	buy_date 	diff_day
0 	786295544 	41098319944 	50014866 	50022520 	21458:86755362;13023209:3593274;10984217:21985... 	2 	20140919 	2014-09-19 	1974 days 14:12:47.176183
1 	532110457 	17916191097 	50011993 	28 	21458:11399317;1628862:3251296;21475:137325;16... 	1 	20131011 	2013-10-11 	2317 days 14:12:47.176183
2 	249013725 	21896936223 	50012461 	50014815 	21458:30992;1628665:92012;1628665:3233938;1628... 	1 	20131011 	2013-10-11 	2317 days 14:12:47.176183
3 	917056007 	12515996043 	50018831 	50014815 	21458:15841995;21956:3494076;27000458:59723383... 	2 	20141023 	2014-10-23 	1940 days 14:12:47.176183
4 	444069173 	20487688075 	50013636 	50008168 	21458:30992;13658074:3323064;1628665:3233941;1... 	1 	20141103 	2014-11-03 	1929 days 14:12:47.176183

df.dtypes

user_id                object
auction_id              int64
cat_id                  int64
cat1                    int64
property               object
buy_mount               int64
day                     int64
buy_date       datetime64[ns]
diff_day      timedelta64[ns]
dtype: object

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29971 entries, 0 to 29970
Data columns (total 9 columns):
user_id       29971 non-null object
auction_id    29971 non-null int64
cat_id        29971 non-null int64
cat1          29971 non-null int64
property      29827 non-null object
buy_mount     29971 non-null int64
day           29971 non-null int64
buy_date      29971 non-null datetime64[ns]
diff_day      29971 non-null timedelta64[ns]
dtypes: datetime64[ns](1), int64(5), object(2), timedelta64[ns](1)
memory usage: 2.1+ MB

df['diff_day'].dt.days

0        1974
1        2317
2        2317
3        1940
4        1929
5        1929
6        2620
7        2620
8        2661
9        2661
10       2661
11       2639
12       2639
13       2639
14       2323
15       2323
16       2268
17       2268
18       2268
19       2195
20       2195
21       2114
22       2114
23       2020
24       2020
25       2544
26       1975
27       1839
28       1954
29       1954
         ... 
29941    2594
29942    2594
29943    2594
29944    1871
29945    1871
29946    1947
29947    1947
29948    2391
29949    2391
29950    2391
29951    2155
29952    2155
29953    1860
29954    1860
29955    2009
29956    2009
29957    1936
29958    1928
29959    1928
29960    2346
29961    2346
29962    2346
29963    2626
29964    2626
29965    2626
29966    2227
29967    2227
29968    2320
29969    2320
29970    2320
Name: diff_day, Length: 29971, dtype: int64

df['diff_day'].head()

0   1974 days 14:12:47.176183
1   2317 days 14:12:47.176183
2   2317 days 14:12:47.176183
3   1940 days 14:12:47.176183
4   1929 days 14:12:47.176183
Name: diff_day, dtype: timedelta64[ns]

df['diff_day'].dt.seconds

0        51167
1        51167
2        51167
3        51167
4        51167
5        51167
6        51167
7        51167
8        51167
9        51167
10       51167
11       51167
12       51167
13       51167
14       51167
15       51167
16       51167
17       51167
18       51167
19       51167
20       51167
21       51167
22       51167
23       51167
24       51167
25       51167
26       51167
27       51167
28       51167
29       51167
         ...  
29941    51167
29942    51167
29943    51167
29944    51167
29945    51167
29946    51167
29947    51167
29948    51167
29949    51167
29950    51167
29951    51167
29952    51167
29953    51167
29954    51167
29955    51167
29956    51167
29957    51167
29958    51167
29959    51167
29960    51167
29961    51167
29962    51167
29963    51167
29964    51167
29965    51167
29966    51167
29967    51167
29968    51167
29969    51167
29970    51167
Name: diff_day, Length: 29971, dtype: int64

14*3600+12*60+47

51167

df['diff_day'].dt.microseconds

0        176183
1        176183
2        176183
3        176183
4        176183
5        176183
6        176183
7        176183
8        176183
9        176183
10       176183
11       176183
12       176183
13       176183
14       176183
15       176183
16       176183
17       176183
18       176183
19       176183
20       176183
21       176183
22       176183
23       176183
24       176183
25       176183
26       176183
27       176183
28       176183
29       176183
          ...  
29941    176183
29942    176183
29943    176183
29944    176183
29945    176183
29946    176183
29947    176183
29948    176183
29949    176183
29950    176183
29951    176183
29952    176183
29953    176183
29954    176183
29955    176183
29956    176183
29957    176183
29958    176183
29959    176183
29960    176183
29961    176183
29962    176183
29963    176183
29964    176183
29965    176183
29966    176183
29967    176183
29968    176183
29969    176183
29970    176183
Name: diff_day, Length: 29971, dtype: int64

df['时间差'] = df['diff_day'] / pd.Timedelta('1 D')

df['时间差'].head()

0    1974.592213
1    2317.592213
2    2317.592213
3    1940.592213
4    1929.592213
Name: 时间差, dtype: float64

df['时间差'].round(decimals = 3)

0        1974.592
1        2317.592
2        2317.592
3        1940.592
4        1929.592
5        1929.592
6        2620.592
7        2620.592
8        2661.592
9        2661.592
10       2661.592
11       2639.592
12       2639.592
13       2639.592
14       2323.592
15       2323.592
16       2268.592
17       2268.592
18       2268.592
19       2195.592
20       2195.592
21       2114.592
22       2114.592
23       2020.592
24       2020.592
25       2544.592
26       1975.592
27       1839.592
28       1954.592
29       1954.592
           ...   
29941    2594.592
29942    2594.592
29943    2594.592
29944    1871.592
29945    1871.592
29946    1947.592
29947    1947.592
29948    2391.592
29949    2391.592
29950    2391.592
29951    2155.592
29952    2155.592
29953    1860.592
29954    1860.592
29955    2009.592
29956    2009.592
29957    1936.592
29958    1928.592
29959    1928.592
29960    2346.592
29961    2346.592
29962    2346.592
29963    2626.592
29964    2626.592
29965    2626.592
29966    2227.592
29967    2227.592
29968    2320.592
29969    2320.592
29970    2320.592
Name: 时间差, Length: 29971, dtype: float64

df['diff_day'].astype('timedelta64[D]')#转化为整数天(D)

0        1974.0
1        2317.0
2        2317.0
3        1940.0
4        1929.0
5        1929.0
6        2620.0
7        2620.0
8        2661.0
9        2661.0
10       2661.0
11       2639.0
12       2639.0
13       2639.0
14       2323.0
15       2323.0
16       2268.0
17       2268.0
18       2268.0
19       2195.0
20       2195.0
21       2114.0
22       2114.0
23       2020.0
24       2020.0
25       2544.0
26       1975.0
27       1839.0
28       1954.0
29       1954.0
          ...  
29941    2594.0
29942    2594.0
29943    2594.0
29944    1871.0
29945    1871.0
29946    1947.0
29947    1947.0
29948    2391.0
29949    2391.0
29950    2391.0
29951    2155.0
29952    2155.0
29953    1860.0
29954    1860.0
29955    2009.0
29956    2009.0
29957    1936.0
29958    1928.0
29959    1928.0
29960    2346.0
29961    2346.0
29962    2346.0
29963    2626.0
29964    2626.0
29965    2626.0
29966    2227.0
29967    2227.0
29968    2320.0
29969    2320.0
29970    2320.0
Name: diff_day, Length: 29971, dtype: float64

5.2 高阶函数数据处理

  • 在dataframe中使用apply方法,调用自定义函数对数据进行处理
  • 函数apply,axis = 0表示对行进行操作,1表示对列进行操作
  • 可以使用astype函数对数据进行转换
  • 可以使用map函数进行数据转换
df2 = pd.read_csv('sam_tianchi_mum_baby.csv', dtype = str, encoding = 'utf-8')

df2.head()

	user_id 	birthday 	gender
0 	2757 	20130311 	1
1 	415971 	20121111 	0
2 	1372572 	20120130 	1
3 	10339332 	20110910 	0
4 	10642245 	20130213 	0

def f(x):

    if '0' in str(x):

        return '女'

    elif '1' in str(x):

        return '男'

    else:

        return '未知'

df2['性别'] = df2['gender'].apply(f)

df2.head()

	user_id 	birthday 	gender 	性别
0 	2757 	20130311 	11 	415971 	20121111 	02 	1372572 	20120130 	13 	10339332 	20110910 	04 	10642245 	20130213 	0 	女

df2[df2['gender'] == '2']

	user_id 	birthday 	gender 	性别
46 	49167150 	20130818 	2 	未知
47 	49983255 	20140206 	2 	未知
51 	52529655 	20130611 	2 	未知
58 	57711375 	20130420 	2 	未知
106 	99665637 	20130926 	2 	未知
132 	151742451 	20140121 	2 	未知
163 	277629531 	20130830 	2 	未知
235 	690267186 	20140104 	2 	未知
283 	902549448 	20130615 	2 	未知
284 	914074800 	20140219 	2 	未知
292 	1060518588 	20130815 	2 	未知
333 	17830874 	20130927 	2 	未知
611 	1134007073 	20131231 	2 	未知
646 	16033918 	20131024 	2 	未知
673 	45002392 	20131212 	2 	未知
701 	64581913 	20130827 	2 	未知
718 	80273467 	20150125 	2 	未知
735 	101254996 	20130717 	2 	未知
744 	115378612 	20140412 	2 	未知
754 	133546192 	20130707 	2 	未知
761 	144253378 	20131021 	2 	未知
764 	156156175 	20130824 	2 	未知
778 	198627397 	20130617 	2 	未知
814 	363860560 	20140216 	2 	未知
917 	930683689 	20131013 	2 	未知
931 	1609674769 	20140703 	2 	未知

del df2['性别']

df2['性别'] = df2['gender'].map({'0':'女','1':'男','2':'未知'})#map映射

df2.head()

	user_id 	birthday 	gender 	性别
0 	2757 	20130311 	11 	415971 	20121111 	02 	1372572 	20120130 	13 	10339332 	20110910 	04 	10642245 	20130213 	0del df2['性别']

df2['性别'] = df2['gender'].map(f)#也可以用自己定义的函数

df2.head()

	user_id 	birthday 	gender 	性别
0 	2757 	20130311 	11 	415971 	20121111 	02 	1372572 	20120130 	13 	10339332 	20110910 	04 	10642245 	20130213 	0 	女

df2['user_id'].apply(lambda x:str(x).replace(x[1:3],'**'))#对user_id的第2位到第3位做脱敏处理

0            2**7
1          4**971
2         1**2572
3        1**39332
4        1**42245
5        1**23201
6        1**68880
7        1**19465
8        1**50574
9        1**35440
10       1**10892
11       1**05422
12       1**86531
13       1**65490
14       1**31245
15       1**90851
16       2**87991
17       2**70454
18       2**37271
19       2**15917
20       2**87268
21       2**02471
22       2**08537
23       2**27133
24       2**29944
25       2**07593
26       2**29842
27       2**13666
28       3**95206
29       3**35454
          ...    
923    1**3363684
924    1**0284684
925    1**1311286
926    1**7643545
927    1**4274418
928    1**3802043
929    1**3628211
930    1**758**31
931    1**9674769
932    1**5254677
933    1**9930323
934    1**1493753
935    1**3328510
936    1**2276954
937    1**1109878
938    1**0825303
939    1**4594929
940    1**2586633
941    1**0529477
942    1**6568285
943    1**2594497
944    1**0948217
945    1**1509424
946    1**3092345
947    1**3235563
948    2**0957900
949    2**0304899
950    2**4469016
951    2**6831536
952    2**4611367
Name: user_id, Length: 953, dtype: object

df2['birthday'].apply(lambda x: x[4:8])

0      0311
1      1111
2      0130
3      0910
4      0213
5      0830
6      0107
7      0705
8      0708
9      0323
10     0812
11     0429
12     0922
13     1209
14     0115
15     0101
16     0808
17     1017
18     0204
19     0801
20     0526
21     0601
22     0416
23     1029
24     0826
25     1122
26     1124
27     0413
28     0918
29     0319
       ... 
923    1114
924    1229
925    1031
926    0629
927    0803
928    0301
929    0903
930    0321
931    0703
932    0123
933    1004
934    1222
935    0817
936    0821
937    0823
938    0721
939    0826
940    0309
941    0617
942    1123
943    0105
944    0917
945    0531
946    0303
947    0709
948    0430
949    0713
950    0416
951    0519
952    1031
Name: birthday, Length: 953, dtype: object

5.3 字符串数据处理

  • pandas中提供了字符串的函数,但只能对字符型变量使用
  • 通过str方法访问相关属性
  • 可以使用字符串的相关方法进行数据处理
df1 = pd.read_csv('MotorcycleData.csv',encoding = 'gbk')

df1.head()

	Condition 	Condition_Desc 	Price 	Location 	Model_Year 	Mileage 	Exterior_Color 	Make 	Warranty 	Model 	... 	Vehicle_Title 	OBO 	Feedback_Perc 	Watch_Count 	N_Reviews 	Seller_Status 	Vehicle_Tile 	Auction 	Buy_Now 	Bid_Count
0 	Used 	mint!!! very low miles 	$11,412 	McHenry, Illinois, United States 	2013.0 	16,000 	Black 	Harley-Davidson 	Unspecified 	Touring 	... 	NaN 	FALSE 	8.1 	NaN 	2427 	Private Seller 	Clear 	True 	FALSE 	28.0
1 	Used 	Perfect condition 	$17,200 	Fort Recovery, Ohio, United States 	2016.0 	60 	Black 	Harley-Davidson 	Vehicle has an existing warranty 	Touring 	... 	NaN 	FALSE 	100 	17 	657 	Private Seller 	Clear 	True 	TRUE 	0.0
2 	Used 	NaN 	$3,872 	Chicago, Illinois, United States 	1970.0 	25,763 	Silver/Blue 	BMW 	Vehicle does NOT have an existing warranty 	R-Series 	... 	NaN 	FALSE 	100 	NaN 	136 	NaN 	Clear 	True 	FALSE 	26.0
3 	Used 	CLEAN TITLE READY TO RIDE HOME 	$6,575 	Green Bay, Wisconsin, United States 	2009.0 	33,142 	Red 	Harley-Davidson 	NaN 	Touring 	... 	NaN 	FALSE 	100 	NaN 	2920 	Dealer 	Clear 	True 	FALSE 	11.0
4 	Used 	NaN 	$10,000 	West Bend, Wisconsin, United States 	2012.0 	17,800 	Blue 	Harley-Davidson 	NO WARRANTY 	Touring 	... 	NaN 	FALSE 	100 	13 	271 	OWNER 	Clear 	True 	TRUE 	0.0

5 rows × 22 columns

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 22 columns):
Condition         7493 non-null object
Condition_Desc    1656 non-null object
Price             7493 non-null object
Location          7491 non-null object
Model_Year        7489 non-null float64
Mileage           7468 non-null object
Exterior_Color    6778 non-null object
Make              7489 non-null object
Warranty          5109 non-null object
Model             7370 non-null object
Sub_Model         2426 non-null object
Type              6011 non-null object
Vehicle_Title     268 non-null object
OBO               7427 non-null object
Feedback_Perc     6611 non-null object
Watch_Count       3517 non-null object
N_Reviews         7487 non-null object
Seller_Status     6868 non-null object
Vehicle_Tile      7439 non-null object
Auction           7476 non-null object
Buy_Now           7256 non-null object
Bid_Count         2190 non-null float64
dtypes: float64(2), object(20)
memory usage: 1.3+ MB

df1['Price'].head()

0    $11,412 
1    $17,200 
2     $3,872 
3     $6,575 
4    $10,000 
Name: Price, dtype: object

df1['价格'] = df1['Price'].str.strip('$')

df1['价格']

0       11,412 
1       17,200 
2        3,872 
3        6,575 
4       10,000 
5        1,500 
6       24,900 
7        1,400 
8        5,100 
9        8,000 
10       2,125 
11      11,100 
12       1,125 
13       3,550 
14       5,500 
15       9,000 
16       8,100 
17      14,000 
18      20,000 
19      13,000 
20      10,900 
21       9,999 
22       5,700 
23       7,400 
24      26,500 
25       9,850 
26      45,900 
27      10,600 
28      12,000 
29      20,000 
         ...   
7463     7,800 
7464     4,400 
7465     7,900 
7466     6,500 
7467     7,400 
7468     8,800 
7469    13,570 
7470     7,900 
7471     6,200 
7472     7,500 
7473     5,500 
7474     3,400 
7475     3,900 
7476     8,500 
7477     4,900 
7478     9,900 
7479     2,900 
7480     6,500 
7481    10,500 
7482     7,200 
7483     7,500 
7484     5,000 
7485     5,000 
7486     8,900 
7487     4,900 
7488     3,900 
7489     8,900 
7490     7,800 
7491     7,900 
7492    12,970 
Name: 价格, Length: 7493, dtype: object

df1['价格'] = df1['价格'].str.replace(',','')#把逗号转化为空字符串

df1['价格']

0       11412 
1       17200 
2        3872 
3        6575 
4       10000 
5        1500 
6       24900 
7        1400 
8        5100 
9        8000 
10       2125 
11      11100 
12       1125 
13       3550 
14       5500 
15       9000 
16       8100 
17      14000 
18      20000 
19      13000 
20      10900 
21       9999 
22       5700 
23       7400 
24      26500 
25       9850 
26      45900 
27      10600 
28      12000 
29      20000 
         ...  
7463     7800 
7464     4400 
7465     7900 
7466     6500 
7467     7400 
7468     8800 
7469    13570 
7470     7900 
7471     6200 
7472     7500 
7473     5500 
7474     3400 
7475     3900 
7476     8500 
7477     4900 
7478     9900 
7479     2900 
7480     6500 
7481    10500 
7482     7200 
7483     7500 
7484     5000 
7485     5000 
7486     8900 
7487     4900 
7488     3900 
7489     8900 
7490     7800 
7491     7900 
7492    12970 
Name: 价格, Length: 7493, dtype: object

df1['价格'] = df1['价格'].astype(int)

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 23 columns):
Condition         7493 non-null object
Condition_Desc    1656 non-null object
Price             7493 non-null object
Location          7491 non-null object
Model_Year        7489 non-null float64
Mileage           7468 non-null object
Exterior_Color    6778 non-null object
Make              7489 non-null object
Warranty          5109 non-null object
Model             7370 non-null object
Sub_Model         2426 non-null object
Type              6011 non-null object
Vehicle_Title     268 non-null object
OBO               7427 non-null object
Feedback_Perc     6611 non-null object
Watch_Count       3517 non-null object
N_Reviews         7487 non-null object
Seller_Status     6868 non-null object
Vehicle_Tile      7439 non-null object
Auction           7476 non-null object
Buy_Now           7256 non-null object
Bid_Count         2190 non-null float64
价格                7493 non-null int32
dtypes: float64(2), int32(1), object(20)
memory usage: 1.3+ MB

df1.head()

	Condition 	Condition_Desc 	Price 	Location 	Model_Year 	Mileage 	Exterior_Color 	Make 	Warranty 	Model 	... 	OBO 	Feedback_Perc 	Watch_Count 	N_Reviews 	Seller_Status 	Vehicle_Tile 	Auction 	Buy_Now 	Bid_Count 	价格
0 	Used 	mint!!! very low miles 	$11,412 	McHenry, Illinois, United States 	2013.0 	16,000 	Black 	Harley-Davidson 	Unspecified 	Touring 	... 	FALSE 	8.1 	NaN 	2427 	Private Seller 	Clear 	True 	FALSE 	28.0 	11412
1 	Used 	Perfect condition 	$17,200 	Fort Recovery, Ohio, United States 	2016.0 	60 	Black 	Harley-Davidson 	Vehicle has an existing warranty 	Touring 	... 	FALSE 	100 	17 	657 	Private Seller 	Clear 	True 	TRUE 	0.0 	17200
2 	Used 	NaN 	$3,872 	Chicago, Illinois, United States 	1970.0 	25,763 	Silver/Blue 	BMW 	Vehicle does NOT have an existing warranty 	R-Series 	... 	FALSE 	100 	NaN 	136 	NaN 	Clear 	True 	FALSE 	26.0 	3872
3 	Used 	CLEAN TITLE READY TO RIDE HOME 	$6,575 	Green Bay, Wisconsin, United States 	2009.0 	33,142 	Red 	Harley-Davidson 	NaN 	Touring 	... 	FALSE 	100 	NaN 	2920 	Dealer 	Clear 	True 	FALSE 	11.0 	6575
4 	Used 	NaN 	$10,000 	West Bend, Wisconsin, United States 	2012.0 	17,800 	Blue 	Harley-Davidson 	NO WARRANTY 	Touring 	... 	FALSE 	100 	13 	271 	OWNER 	Clear 	True 	TRUE 	0.0 	10000

5 rows × 23 columns

df1['Location'].str.split(',')#字符串的分割。df1['Location'].str.split(',').str[0]提取第一个元素

0             [McHenry,  Illinois,  United States]
1           [Fort Recovery,  Ohio,  United States]
2             [Chicago,  Illinois,  United States]
3          [Green Bay,  Wisconsin,  United States]
4          [West Bend,  Wisconsin,  United States]
5          [Watervliet,  Michigan,  United States]
6            [Sterling,  Illinois,  United States]
7         [Williamston,  Michigan,  United States]
8            [Palatine,  Illinois,  United States]
9             [Chicago,  Illinois,  United States]
10        [Williamston,  Michigan,  United States]
11           [Oak Park,  Illinois,  United States]
12          [Pewaukee,  Wisconsin,  United States]
13           [Madison,  Wisconsin,  United States]
14              [Davenport,  Iowa,  United States]
15            [Plymouth,  Indiana,  United States]
16              [Goshen,  Indiana,  United States]
17          [Plainfield,  Indiana,  United States]
18               [Bellevue,  Iowa,  United States]
19        [Des Plaines,  Illinois,  United States]
20             [Ottawa,  Illinois,  United States]
21        [Orland Park,  Illinois,  United States]
22      [Cross Plains,  Wisconsin,  United States]
23         [Watervliet,  Michigan,  United States]
24            [Kewanee,  Illinois,  United States]
25            [Chicago,  Illinois,  United States]
26        [South Haven,  Michigan,  United States]
27           [Wheeling,  Illinois,  United States]
28            [Chicago,  Illinois,  United States]
29            [Chicago,  Illinois,  United States]
                           ...                    
7463     [Raymond,  New Hampshire,  United States]
7464     [Raymond,  New Hampshire,  United States]
7465     [Raymond,  New Hampshire,  United States]
7466     [Raymond,  New Hampshire,  United States]
7467     [Raymond,  New Hampshire,  United States]
7468     [Raymond,  New Hampshire,  United States]
7469       [Scott City,  Missouri,  United States]
7470     [Raymond,  New Hampshire,  United States]
7471     [Raymond,  New Hampshire,  United States]
7472     [Raymond,  New Hampshire,  United States]
7473     [Raymond,  New Hampshire,  United States]
7474     [Raymond,  New Hampshire,  United States]
7475     [Raymond,  New Hampshire,  United States]
7476     [Raymond,  New Hampshire,  United States]
7477     [Raymond,  New Hampshire,  United States]
7478     [Raymond,  New Hampshire,  United States]
7479     [Raymond,  New Hampshire,  United States]
7480     [Raymond,  New Hampshire,  United States]
7481     [Raymond,  New Hampshire,  United States]
7482     [Raymond,  New Hampshire,  United States]
7483     [Raymond,  New Hampshire,  United States]
7484     [Raymond,  New Hampshire,  United States]
7485     [Raymond,  New Hampshire,  United States]
7486     [Raymond,  New Hampshire,  United States]
7487     [Raymond,  New Hampshire,  United States]
7488     [Raymond,  New Hampshire,  United States]
7489     [Raymond,  New Hampshire,  United States]
7490     [Raymond,  New Hampshire,  United States]
7491     [Raymond,  New Hampshire,  United States]
7492       [Scott City,  Missouri,  United States]
Name: Location, Length: 7493, dtype: object

df1['Location'].str.len()

0       32.0
1       34.0
2       32.0
3       35.0
4       35.0
5       35.0
6       33.0
7       36.0
8       33.0
9       32.0
10      36.0
11      33.0
12      34.0
13      33.0
14      30.0
15      32.0
16      30.0
17      34.0
18      29.0
19      36.0
20      31.0
21      36.0
22      38.0
23      35.0
24      32.0
25      32.0
26      36.0
27      33.0
28      32.0
29      32.0
        ... 
7463    37.0
7464    37.0
7465    37.0
7466    37.0
7467    37.0
7468    37.0
7469    35.0
7470    37.0
7471    37.0
7472    37.0
7473    37.0
7474    37.0
7475    37.0
7476    37.0
7477    37.0
7478    37.0
7479    37.0
7480    37.0
7481    37.0
7482    37.0
7483    37.0
7484    37.0
7485    37.0
7486    37.0
7487    37.0
7488    37.0
7489    37.0
7490    37.0
7491    37.0
7492    35.0
Name: Location, Length: 7493, dtype: float64

欢迎阅读数据清洗系列文章python数据清洗工具、方法、过程整理归纳

发布了5 篇原创文章 · 获赞 0 · 访问量 39

猜你喜欢

转载自blog.csdn.net/crystal_sugar/article/details/104382401