Introduction

Pandas stores text in two main dtypes: `string` and `object`. Unless the dtype is explicitly specified as `string`, a text column generally defaults to `object`.
Text operations are mainly implemented through the `str` accessor. It is very powerful, but you need to pay attention to the following points before using it.
- The accessor can only be used on `Series` data structures. Besides regular column variables like `df.col`, it can also be used on the index types `df.index` and `df.columns`.
- Make sure the accessed object actually holds strings. If it does not, convert it first with `astype(str)`, otherwise an error will be raised.
- Accessor calls can be chained, for example `df.col.str.lower().str.upper()`, following the same principle as chained operations on a `DataFrame`.
The various text manipulation operations are introduced below; together they cover roughly 95% of daily data-cleaning needs, in 8 scenarios in total.
All of the following operations are based on this data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['jordon', 'MIKE', 'Kelvin', 'xiaoLi', 'qiqi', 'Amei'],
                   'Age':[18, 30, 45, 23, 45, 62],
                   'level':['high', 'Low', 'M', 'L', 'middle', np.nan],
                   'Email':['jordon@sohu.com', 'Mike@126.cn', 'KelvinChai@gmail.com',
                            'xiaoli@163.com', np.nan, 'amei@qq.com']})
--------------------------------------------
     name  Age   level                 Email
0  jordon   18    high       jordon@sohu.com
1    MIKE   30     Low           Mike@126.cn
2  Kelvin   45       M  KelvinChai@gmail.com
3  xiaoLi   23       L        xiaoli@163.com
4    qiqi   45  middle                   NaN
5    Amei   62     NaN           amei@qq.com
1. Text format
Case conversion
# convert all characters to lowercase
s.str.lower()
# convert all characters to uppercase
s.str.upper()
# capitalize the first letter of each word
s.str.title()
# capitalize only the first letter of the string
s.str.capitalize()
# swap upper and lower case
s.str.swapcase()
These methods are fairly simple, so instead of giving examples one by one, here is a single example that converts the column names to lower case.
df.columns.str.lower()
--------------------------------------------------------
Index(['name', 'age', 'level', 'email'], dtype='object')
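To make the remaining case methods concrete, here is a minimal sketch applying `title`, `capitalize`, and `swapcase` to a standalone Series that mirrors the article's `name` column:

```python
import pandas as pd

# standalone copy of the article's name column
name = pd.Series(['jordon', 'MIKE', 'Kelvin', 'xiaoLi', 'qiqi', 'Amei'])

print(name.str.title().tolist())     # ['Jordon', 'Mike', 'Kelvin', 'Xiaoli', 'Qiqi', 'Amei']
print(name.str.swapcase().tolist())  # ['JORDON', 'mike', 'kELVIN', 'XIAOlI', 'QIQI', 'aMEI']
```

Note that for single-word strings `title` and `capitalize` give the same result; they differ only on multi-word text.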
Format judgment
The following are all judgment operations, so they return Boolean values.
s.str.isalpha()    # whether all characters are alphabetic
s.str.isnumeric()  # whether all characters are numeric (0-9)
s.str.isalnum()    # whether all characters are alphanumeric
s.str.isupper()    # whether all cased characters are uppercase
s.str.islower()    # whether all cased characters are lowercase
s.str.isdigit()    # whether all characters are digits
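A minimal sketch of these predicates on a standalone copy of the article's `name` column:

```python
import pandas as pd

s = pd.Series(['jordon', 'MIKE', 'Kelvin', 'xiaoLi', 'qiqi', 'Amei'])

# each predicate returns a Boolean Series, element by element
print(s.str.isupper().tolist())  # [False, True, False, False, False, False]
print(s.str.islower().tolist())  # [True, False, False, False, True, False]
print(s.str.isalpha().tolist())  # [True, True, True, True, True, True]
```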
Alignment
# center-align to width 8, padding with '*'
s.str.center(8, fillchar='*')
# left-align to width 8, padding with '*'
s.str.ljust(8, fillchar='*')
# right-align to width 8, padding with '*'
s.str.rjust(8, fillchar='*')
# general form: width, alignment side and fill character are all configurable
s.str.pad(width=8, side='both', fillchar='*')
# example
df.name.str.center(8, fillchar='*')
-------------
0 *jordon*
1 **MIKE**
2 *Kelvin*
3 *xiaoLi*
4 **qiqi**
5 **Amei**
Counting and coding
s.str.count('b')       # number of occurrences of the given character in each string
s.str.len()            # string length
s.str.encode('utf-8')  # encode strings to bytes
s.str.decode('utf-8')  # decode bytes back to strings (works on a Series of bytes)
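A quick sketch of counting and length on a standalone copy of the `name` column (note that `count` is case sensitive, so the `I` in `MIKE` does not match `'i'`):

```python
import pandas as pd

s = pd.Series(['jordon', 'MIKE', 'Kelvin', 'xiaoLi', 'qiqi', 'Amei'])

print(s.str.len().tolist())       # [6, 4, 6, 6, 4, 4]
print(s.str.count('i').tolist())  # [0, 0, 1, 2, 2, 1]
```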
2. Text splitting
The `split` method splits text on a specified character. The `expand` parameter expands the split parts into separate columns, and the `n` parameter limits the number of splits, controlling how many columns are formed.
The following splits the `Email` variable on `@`.
# usage
s.str.split('x', expand=True, n=1)
# example
df.Email.str.split('@')
----------------------------
0 [jordon, sohu.com]
1 [Mike, 126.cn]
2 [KelvinChai, gmail.com]
3 [xiaoli, 163.com]
4 NaN
5 [amei, qq.com]
# expand expands the split parts into separate columns
df.Email.str.split('@', expand=True)
----------------------------
0 1
0 jordon sohu.com
1 Mike 126.cn
2 KelvinChai gmail.com
3 xiaoli 163.com
4 NaN NaN
5 amei qq.com
More complex splits can use a regular expression. For example, to split on both `@` and `.` at the same time, you can do this (note the raw string, and that the literal dot must be escaped):
df.Email.str.split(r'@|\.', expand=True)
----------------------------
0 1 2
0 jordon sohu com
1 Mike 126 cn
2 KelvinChai gmail com
3 xiaoli 163 com
4 NaN NaN NaN
5 amei qq com
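The `n` parameter mentioned above combines with a regex split; a minimal sketch, using a standalone Series mirroring the article's `Email` column, shows `n=1` stopping after the first delimiter so the domain keeps its dot:

```python
import pandas as pd
import numpy as np

email = pd.Series(['jordon@sohu.com', 'Mike@126.cn', 'KelvinChai@gmail.com',
                   'xiaoli@163.com', np.nan, 'amei@qq.com'])

# n=1 performs at most one split, producing exactly two columns
out = email.str.split(r'@|\.', expand=True, n=1)
print(out)
```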
3. Text replacement
There are several methods for text replacement: `replace`, `slice_replace`, and `repeat`.
replace
`replace` is the most commonly used replacement method. Its parameters are:
- pat: the content to be replaced; it can be a plain string or a regular expression
- repl: the new content; it can be a string or a callable
- regex: whether pat is treated as a regular expression. It defaulted to True in older pandas versions, but since pandas 2.0 the default is False, so pass regex=True explicitly when using a pattern
# replace every 'com' in Email with 'cn'
df.Email.str.replace('com', 'cn')
------------------------
0      jordon@sohu.cn
1         Mike@126.cn
2 KelvinChai@gmail.cn
3       xiaoli@163.cn
4                 NaN
5          amei@qq.cn
Things get a bit more complicated when the old content is written as a regular expression.
# replace the name before the @ with xxx
df.Email.str.replace('(.*?)@', 'xxx@', regex=True)
------------------
0  xxx@sohu.com
1    xxx@126.cn
2 xxx@gmail.com
3   xxx@163.com
4           NaN
5    xxx@qq.com
Or write the new content as a callable, which receives each match object.
df.Email.str.replace('(.*?)@', lambda x: x.group().upper(), regex=True)
-------------------------
0      JORDON@sohu.com
1          MIKE@126.cn
2 KELVINCHAI@gmail.com
3       XIAOLI@163.com
4                  NaN
5          AMEI@qq.com
Slice replacement
`slice_replace` performs replacement through slicing; specified characters can be kept or deleted by choosing the slice. The parameters are:
- start: start position
- stop: end position
- repl: the new content to substitute
The characters from position start up to (but not including) stop are replaced by repl. If stop is not set, everything from start onward is replaced; likewise, if start is not set, everything before stop is replaced.
df.Email.str.slice_replace(start=1,stop=2,repl='XX')
-------------------------
0      jXXrdon@sohu.com
1          MXXke@126.cn
2 KXXlvinChai@gmail.com
3       xXXaoli@163.com
4                   NaN
5          aXXei@qq.com
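An empty `repl` turns `slice_replace` into deletion, which is how the "delete specified characters" use mentioned above works. A small sketch on a few of the article's names:

```python
import pandas as pd

s = pd.Series(['jordon', 'MIKE', 'Kelvin'])

# replacing the slice [0, 2) with an empty string deletes the first two characters
print(s.str.slice_replace(start=0, stop=2, repl='').tolist())  # ['rdon', 'KE', 'lvin']
```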
Repeat replacement
`repeat` repeats each string; the repeats parameter sets the number of repetitions.
df.name.str.repeat(repeats=2)
-------------------------
0 jordonjordon
1 MIKEMIKE
2 KelvinKelvin
3 xiaoLixiaoLi
4 qiqiqiqi
5 AmeiAmei
4. Text splicing
Text splicing is achieved through the `cat` method. Parameters:
- others: the sequence(s) to concatenate with. If None (not set), the values of the current sequence are joined into a single string
- sep: the separator used when joining
- na_rep: replacement for missing values; by default missing values are not handled
- join: alignment direction, one of 'left', 'right', 'outer', 'inner'; the default is 'left'
There are mainly the following splicing methods.
1. Concatenate a single sequence into a complete string
As mentioned above, when the others parameter is not set, this method merges the current sequence into a single new string.
df.name.str.cat()
-------------------------------
'jordonMIKEKelvinxiaoLiqiqiAmei'
# set the separator to '-'
df.name.str.cat(sep='-')
-------------------------------
'jordon-MIKE-Kelvin-xiaoLi-qiqi-Amei'
# replace missing values with '*'
df.level.str.cat(sep='-',na_rep='*')
-----------------------
'high-Low-M-L-middle-*'
2. Splice a sequence with another list-like object into a new sequence
Next, first concatenate the name column with a column of `*`, and then concatenate the level column, forming a new sequence.
# multi-column splicing via chained str.cat calls
df.name.str.cat(['*']*6).str.cat(df.level)
----------------
0 jordon*high
1 MIKE*Low
2 Kelvin*M
3 xiaoLi*L
4 qiqi*middle
5 NaN
3. Splice a sequence with multiple objects into a new sequence
# multiple columns can also be spliced directly
df.name.str.cat([df.level, df.Email], na_rep='*')
--------------------------------
0 jordonhighjordon@sohu.com
1 MIKELowMike@126.cn
2 KelvinMKelvinChai@gmail.com
3 xiaoLiLxiaoli@163.com
4 qiqimiddle*
5 Amei*amei@qq.com
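The `join` parameter aligns `others` on the index before splicing. A minimal sketch with a hypothetical, partially overlapping second series shows `join='left'` keeping the caller's index and filling the gaps with `na_rep`:

```python
import pandas as pd

name = pd.Series(['jordon', 'MIKE', 'Kelvin', 'xiaoLi'])
tag = pd.Series(['-a', '-b'], index=[0, 2])  # only covers part of name's index

# join='left' keeps name's index; unmatched positions fall back to na_rep
out = name.str.cat(tag, na_rep='', join='left')
print(out.tolist())  # ['jordon-a', 'MIKE', 'Kelvin-b', 'xiaoLi']
```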
5. Text extraction
Text extraction is mainly achieved through `extract`.
extract parameters:
- pat: a regular expression whose capture groups define what to extract
- flags: flags from the regular-expression library re, such as re.IGNORECASE
- expand: when the regular expression has only one capture group, expand=True returns a DataFrame, otherwise a Series
# extract two parts from Email
df.Email.str.extract(pat=r'(.*?)@(.*)\.com')
--------------------
            0      1
0      jordon   sohu
1         NaN    NaN
2  KelvinChai  gmail
3      xiaoli    163
4         NaN    NaN
5        amei     qq
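Named capture groups can make the extracted columns self-describing; a sketch on a standalone copy of the `Email` column (rows that do not match the pattern, such as the `.cn` address, come back as NaN):

```python
import pandas as pd
import numpy as np

email = pd.Series(['jordon@sohu.com', 'Mike@126.cn', 'KelvinChai@gmail.com',
                   'xiaoli@163.com', np.nan, 'amei@qq.com'])

# named groups become the column names of the resulting DataFrame
out = email.str.extract(r'(?P<user>.*?)@(?P<domain>.*)\.com')
print(out)
```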
6. Text query
Text query is implemented through two methods: `find` and `findall`.
`find` has very simple parameters: pass the string to search for, and it returns the position of that string within each element, or -1 if it is not found. (Because of the missing value below, the result column is of float type.)
df['@position'] = df.Email.str.find('@')
df[['Email','@position']]
-------------------------------------
                  Email  @position
0       jordon@sohu.com        6.0
1           Mike@126.cn        4.0
2  KelvinChai@gmail.com       10.0
3        xiaoli@163.com        6.0
4                   NaN        NaN
5           amei@qq.com        4.0
The example above returns the position of `@` in the Email variable.
Another way to query is `findall`.
findall parameters:
- pat: the content to find; regular expressions are supported
- flags: flags from the regular-expression library re, such as re.IGNORECASE
The difference between `findall` and `find` is that findall supports regular expressions and returns the matched content itself. It is somewhat similar to `extract` and can also be used for extraction, although it is not as convenient as extract.
df.Email.str.findall(r'(.*?)@(.*)\.com')
--------------------------
0 [(jordon, sohu)]
1 []
2 [(KelvinChai, gmail)]
3 [(xiaoli, 163)]
4 NaN
5 [(amei, qq)]
The above example returns the two parts of the regular search as a list of tuples.
7. Text contains
Text inclusion is implemented through the `contains` method, which returns Boolean values. It is generally used together with `loc` to filter rows. Parameters:
- pat: the string to match; regular expressions are supported
- case: whether matching is case sensitive; the default True means case matters
- flags: flags from the regular-expression library re, such as re.IGNORECASE
- na: fill value for missing values
- regex: whether pat is treated as a regular expression; the default True means it is
df.Email.str.contains('jordon|com', na='*')
----------
0 True
1 False
2 True
3 True
4 *
5 True
# use together with loc to filter rows
df.loc[df.Email.str.contains('jordon|com', na=False)]
------------------------------------------
     name  Age level                 Email  @position
0  jordon   18  high       jordon@sohu.com        6.0
2  Kelvin   45     M  KelvinChai@gmail.com       10.0
3  xiaoLi   23     L        xiaoli@163.com        6.0
5    Amei   62   NaN           amei@qq.com        4.0
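The `case` parameter is also handy for filtering; a minimal sketch on a standalone copy of the `name` column shows a case-insensitive match:

```python
import pandas as pd

name = pd.Series(['jordon', 'MIKE', 'Kelvin', 'xiaoLi', 'qiqi', 'Amei'])

# case=False makes the match case-insensitive, so 'mike' matches 'MIKE'
print(name.str.contains('mike', case=False).tolist())
# [False, True, False, False, False, False]
```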
Note that when contains is used together with `loc`, missing values must be handled, otherwise an error will be raised. Setting na=False treats the missing values as non-matches so the query can complete.
8. Text dummy variables
get_dummies
`get_dummies` automatically turns a column variable into dummy variables. This method is often used in feature engineering.
df.name.str.get_dummies()
-------------------------------
Amei Kelvin MIKE jordon qiqi xiaoLi
0 0 0 0 1 0 0
1 0 0 1 0 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 1 0 0 0 0 0
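`get_dummies` also accepts a `sep` parameter for cells that hold several labels in one string; a sketch with a hypothetical multi-label column:

```python
import pandas as pd

# hypothetical multi-label column; sep splits each cell before one-hot encoding
s = pd.Series(['a|b', 'a', 'b|c'])
out = s.str.get_dummies(sep='|')
print(out)
```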