Data Cleaning: String Data Processing

String Data Processing

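  • Pandas provides string functions, but they work only on string (object-dtype) columns
  • These functions are accessed through the .str accessor
  • Through the accessor, the usual string methods can be applied to clean the data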
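
Function      Description
contains()    Returns a boolean for each string indicating whether it contains the given pattern
replace()     Replaces occurrences of a pattern or substring
lower()       Returns a copy of the string with all letters converted to lowercase
upper()       Returns a copy of the string with all letters converted to uppercase
split()       Splits each string on a delimiter (whitespace by default) and returns a list of pieces
strip()       Removes leading and trailing whitespace (or the specified characters)
join()        Returns a string that joins all the strings in the given sequence with a separator
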
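The methods listed above can be tried on a small Series before applying them to the real data. This is a minimal sketch; the sample values are hypothetical and are not taken from MotorcycleData.csv.

```python
import pandas as pd

# Hypothetical sample data (not from MotorcycleData.csv)
s = pd.Series(['  Harley-Davidson ', 'BMW', None])

cleaned = s.str.strip()                            # drop leading/trailing whitespace
lowered = cleaned.str.lower()                      # lowercase copy of each string
has_bmw = cleaned.str.contains('BMW', na=False)    # boolean mask; missing values count as False
parts = cleaned.str.split('-')                     # list of pieces per element
joined = parts.str.join(' ')                       # join each list back together with a space
replaced = cleaned.str.replace('-', ' ', regex=False)  # literal (non-regex) replacement

print(lowered.tolist())
print(has_bmw.tolist())
print(joined.tolist())
```

Missing values pass through every .str method as NaN, which is why contains() is given na=False here when a clean boolean mask is wanted.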
import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据转换'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
df = pd.read_csv('MotorcycleData.csv', encoding='gbk')
df.head(5)
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model ... Vehicle_Title OBO Feedback_Perc Watch_Count N_Reviews Seller_Status Vehicle_Tile Auction Buy_Now Bid_Count
0 Used mint!!! very low miles $11,412 McHenry, Illinois, United States 2013.0 16,000 Black Harley-Davidson Unspecified Touring ... NaN FALSE 8.1 NaN 2427 Private Seller Clear True FALSE 28.0
1 Used Perfect condition $17,200 Fort Recovery, Ohio, United States 2016.0 60 Black Harley-Davidson Vehicle has an existing warranty Touring ... NaN FALSE 100 17 657 Private Seller Clear True TRUE 0.0
2 Used NaN $3,872 Chicago, Illinois, United States 1970.0 25,763 Silver/Blue BMW Vehicle does NOT have an existing warranty R-Series ... NaN FALSE 100 NaN 136 NaN Clear True FALSE 26.0
3 Used CLEAN TITLE READY TO RIDE HOME $6,575 Green Bay, Wisconsin, United States 2009.0 33,142 Red Harley-Davidson NaN Touring ... NaN FALSE 100 NaN 2920 Dealer Clear True FALSE 11.0
4 Used NaN $10,000 West Bend, Wisconsin, United States 2012.0 17,800 Blue Harley-Davidson NO WARRANTY Touring ... NaN FALSE 100 13 271 OWNER Clear True TRUE 0.0

5 rows × 22 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 22 columns):
Condition         7493 non-null object
Condition_Desc    1656 non-null object
Price             7493 non-null object
Location          7491 non-null object
Model_Year        7489 non-null float64
Mileage           7468 non-null object
Exterior_Color    6778 non-null object
Make              7489 non-null object
Warranty          5109 non-null object
Model             7370 non-null object
Sub_Model         2426 non-null object
Type              6011 non-null object
Vehicle_Title     268 non-null object
OBO               7427 non-null object
Feedback_Perc     6611 non-null object
Watch_Count       3517 non-null object
N_Reviews         7487 non-null object
Seller_Status     6868 non-null object
Vehicle_Tile      7439 non-null object
Auction           7476 non-null object
Buy_Now           7256 non-null object
Bid_Count         2190 non-null float64
dtypes: float64(2), object(20)
memory usage: 1.3+ MB
# Price holds strings such as '$11,412', so it cannot be converted to float directly
# df['Price'].astype(float)
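To see why the direct conversion is commented out, the failure can be reproduced on a few values in the same format. The sample Series below is an assumption that mirrors the Price column; it is not read from the CSV file.

```python
import pandas as pd

# Hypothetical sample mirroring the Price format shown in df.head()
price = pd.Series(['$11,412 ', '$17,200 ', '$3,872 '], dtype=object)

try:
    price.astype(float)
    converted = True
except ValueError as exc:
    # '$' and ',' are not valid in a float literal, so pandas raises ValueError
    converted = False
    print('conversion failed:', exc)
```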
# The .str accessor also supports slicing to extract characters
df['Price'].str[1:3].head(5)
0    11
1    17
2    3,
3    6,
4    10
Name: Price, dtype: object
# Clean the string first: strip the leading '$' (the new column name 价格 means "price")
df['价格'] = df['Price'].str.strip('$')
df['价格'].head(5)
0    11,412 
1    17,200 
2     3,872 
3     6,575 
4    10,000 
Name: 价格, dtype: object
df['价格'] = df['价格'].str.replace(',', '')
df['价格'].head(5)
0    11412 
1    17200 
2     3872 
3     6575 
4    10000 
Name: 价格, dtype: object
df['价格'] = df['价格'].astype(float)
df['价格'].head(5)
0    11412.0
1    17200.0
2     3872.0
3     6575.0
4    10000.0
Name: 价格, dtype: float64
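The strip, replace, and astype steps above can also be collapsed into a single pass. The following is one possible sketch, not the author's method: it uses a regular expression to remove both characters at once and pd.to_numeric with errors='coerce' so that malformed entries become NaN instead of raising. The sample values, including the 'N/A' entry, are hypothetical.

```python
import pandas as pd

# Hypothetical sample mirroring the Price format; 'N/A' stands in for a malformed entry
price = pd.Series(['$11,412 ', '$17,200 ', 'N/A', None], dtype=object)

# Remove '$' and ',' in one regex pass, then trim surrounding whitespace
cleaned = price.str.replace(r'[$,]', '', regex=True).str.strip()

# errors='coerce' turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(cleaned, errors='coerce')
print(numeric.tolist())
```

Compared with astype(float), this variant trades strictness for robustness: bad rows are silently turned into NaN, so it is worth checking numeric.isna().sum() afterwards.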
df.dtypes
Condition          object
Condition_Desc     object
Price              object
Location           object
Model_Year        float64
Mileage            object
Exterior_Color     object
Make               object
Warranty           object
Model              object
Sub_Model          object
Type               object
Vehicle_Title      object
OBO                object
Feedback_Perc      object
Watch_Count        object
N_Reviews          object
Seller_Status      object
Vehicle_Tile       object
Auction            object
Buy_Now            object
Bid_Count         float64
价格                float64
dtype: object
# Split each string on commas and keep the first piece (the city)
df['Location'].str.split(',').str[0].head(5)
0          McHenry
1    Fort Recovery
2          Chicago
3        Green Bay
4        West Bend
Name: Location, dtype: object
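When all the pieces are needed rather than just the first, split() accepts expand=True and returns one column per piece. This is a minimal sketch on hypothetical values in the same "City, State, Country" layout as Location; the column names City, State, and Country are illustrative and do not exist in the original data.

```python
import pandas as pd

# Hypothetical sample in the same "City, State, Country" format as Location
location = pd.Series([
    'McHenry, Illinois, United States',
    'Chicago, Illinois, United States',
    None,
], dtype=object)

# expand=True returns a DataFrame with one column per piece
parts = location.str.split(',', expand=True)
parts.columns = ['City', 'State', 'Country']  # illustrative column names

# Splitting on ',' leaves a leading space on the later pieces, so strip each column
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```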
# Compute the length of each string (missing values yield NaN, hence the float64 result)
df['Location'].str.len().head(5)
0    32.0
1    34.0
2    32.0
3    35.0
4    35.0
Name: Location, dtype: float64
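The lengths above come back as floats because Location contains missing values, and NaN cannot be stored in a plain integer column. As a small sketch on hypothetical data, the result can be cast to pandas' nullable Int64 dtype if whole numbers are preferred.

```python
import pandas as pd

# Hypothetical sample: one Location-style string and one missing value
location = pd.Series(['McHenry, Illinois, United States', None], dtype=object)

lengths = location.str.len()      # the missing entry becomes NaN
print(lengths.dtype)

# Optional: the nullable integer dtype keeps whole-number lengths alongside <NA>
lengths_int = lengths.astype('Int64')
print(lengths_int.tolist())
```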
Reposted from blog.csdn.net/qq_29339467/article/details/105565319