Pandas Advanced: Text Processing

Introduction

Pandas stores text in two main dtypes: string and object. Unless the dtype is explicitly set to string, text columns are generally stored as object.

Text operations are mainly implemented through the str accessor. It is very powerful, but you need to pay attention to the following points before using it.

  1. The accessor can only be used on Series objects.  Besides regular columns such as df.col, it also works on index objects such as df.index and df.columns

  2. Make sure the values being accessed are strings.  If not, convert them first with astype(str), otherwise an error will be raised (see the short sketch after this list)

  3. The accessor calls can be chained.  For example, df.col.str.lower().str.upper() follows the same method-chaining principle as other one-line DataFrame operations
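
A minimal sketch of points 2 and 3, using a made-up numeric Series that has to be converted with astype(str) before the str accessor (and chained str calls) can be used:

import pandas as pd

s = pd.Series([18, 30, 45])
# s.str would raise AttributeError here because the values are integers
s.astype(str).str.zfill(3).str.center(5, fillchar='*')  # convert, zero-pad to width 3, then center in width 5
-------------------------
0    *018*
1    *030*
2    *045*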

The various text manipulation operations are formally introduced below, organized into 8 scenarios that cover roughly 95% of daily data cleaning needs.

The operations that follow are all based on this sample data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'name':['jordon', 'MIKE', 'Kelvin', 'xiaoLi', 'qiqi','Amei'],
                   'Age':[18, 30, 45, 23, 45, 62],
                   'level':['high','Low','M','L','middle',np.nan],
                   'Email':['jordon@sohu.com','Mike@126.cn','KelvinChai@gmail.com','xiaoli@163.com',np.nan,'amei@qq.com']})
--------------------------------------------
   name    Age   level   Email
0  jordon  18    high    jordon@sohu.com
1  MIKE    30    Low     Mike@126.cn
2  Kelvin  45    M       KelvinChai@gmail.com
3  xiaoLi  23    L       xiaoli@163.com
4  qiqi    45    middle  NaN
5  Amei    62    NaN     amei@qq.com

1. Text format

Case conversion

# convert all characters to lower case
s.str.lower()
# convert all characters to upper case
s.str.upper()
# capitalize the first letter of each word
s.str.title()
# capitalize only the first letter of the string
s.str.capitalize()
# swap upper and lower case
s.str.swapcase()

The usage above is fairly simple, so instead of giving examples one by one, here is a single example that converts the column names to lower case.

df.columns.str.lower()
--------------------------------------------------------
Index(['name', 'age', 'level', 'email'], dtype='object')
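
Note that this returns a new Index and does not modify df; to actually rename the columns you would assign the result back (not done here, since the later examples keep the original capitalized column names):

df.columns = df.columns.str.lower()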

Format judgment

The following are all judgment operations, so they return Boolean values.

s.str.isalpha()   # consists only of letters?
s.str.isnumeric() # consists only of numeric characters (0-9)?
s.str.isalnum()   # consists only of letters and digits?
s.str.isupper()   # is entirely upper case?
s.str.islower()   # is entirely lower case?
s.str.isdigit()   # consists only of digits?
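
For instance, checking which of the sample names are entirely upper case:

df.name.str.isupper()
-------------
0    False
1     True
2    False
3    False
4    False
5    False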

Alignment

# center-align to width 8, padding with '*'
s.str.center(8, fillchar='*')
# left-align to width 8, padding with '*'
s.str.ljust(8, fillchar='*')
# right-align to width 8, padding with '*'
s.str.rjust(8, fillchar='*')
# custom alignment: width, side and fill character can all be adjusted
s.str.pad(width=8, side='both', fillchar='*')
# example
df.name.str.center(8, fillchar='*')
-------------
0    *jordon*
1    **MIKE**
2    *Kelvin*
3    *xiaoLi*
4    **qiqi**
5    **Amei**

Counting and coding

s.str.count('b') # 字符串种包括指定字母的数量
s.str.len() # 字符串长度
s.str.encode('utf-8') # 字符编码
s.str.decode('utf-8') # 字符解码
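
As a quick illustration on the sample data, counting the letter 'i' in each name and measuring the length of each name:

df.name.str.count('i')
-------------------------
0    0
1    0
2    1
3    2
4    2
5    1
df.name.str.len()
-------------------------
0    6
1    4
2    6
3    6
4    4
5    4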

2. Text splitting

Using the split method, text can be split on a specified character. The expand parameter expands the split pieces into separate columns, and the n parameter limits the number of splits, which controls how many columns are produced.

The following splits the Email column on @.

# usage
s.str.split('x', expand=True, n=1)
# example
df.Email.str.split('@')
----------------------------
0         [jordon, sohu.com]
1            [Mike, 126.cn]
2    [KelvinChai, gmail.com]
3          [xiaoli, 163.com]
4                        NaN
5             [amei, qq.com]
# expand turns the split pieces into separate columns
df.Email.str.split('@', expand=True)
----------------------------
   0          1
0  jordon      sohu.com
1  Mike        126.cn
2  KelvinChai  gmail.com
3  xiaoli      163.com
4  NaN         NaN
5  amei        qq.com

More complex splitting can use regular expressions. For example, if you want to split by @ and . at the same time, you can do it like this.

df.Email.str.split(r'@|\.', expand=True)
----------------------------
   0           1      2
0  jordon      sohu   com
1  Mike        126    cn
2  KelvinChai  gmail  com
3  xiaoli      163    com
4  NaN         NaN    NaN
5  amei        qq     com

3. Text replacement

There are several methods for text replacement: replace, slice_replace, and repeat.

Replace

replace is the most commonly used replacement method; its parameters are as follows:

  • pat: the content to be replaced, either a plain string or a regular expression

  • repl: the new content, either a string or a callable (function)

  • regex: whether pat is treated as a regular expression; older pandas versions defaulted to True, while pandas 2.0+ defaults to False, so pass regex=True explicitly when pat is a pattern

# replace every 'com' in Email with 'cn'
df.Email.str.replace('com','cn')
------------------------
0         jordon@sohu.cn
1            Mike@126.cn
2    KelvinChai@gmail.cn
3          xiaoli@163.cn
4                    NaN
5             amei@qq.cn

A bit more complicated: the old content can be written as a regular expression.

# replace everything before the @ with xxx
df.Email.str.replace('(.*?)@', 'xxx@', regex=True)
------------------
0     xxx@sohu.com
1       xxx@126.cn
2    xxx@gmail.com
3      xxx@163.com
4              NaN
5       xxx@qq.com

Or write the new content as a callable function.

df.Email.str.replace('(.*?)@', lambda x: x.group().upper(), regex=True)
-------------------------
0         JORDON@sohu.com
1             MIKE@126.cn
2    KELVINCHAI@gmail.com
3          XIAOLI@163.com
4                     NaN
5             AMEI@qq.com

Slice replacement

slice_replace performs replacement through slicing: the characters in a given slice can be replaced, kept, or removed. The parameters are as follows.

  • start: start position

  • stop: end position

  • repl: the new content to substitute in

The characters from position start up to (but not including) position stop are replaced. If stop is not set, everything after start is replaced; similarly, if start is not set, everything before stop is replaced.

df.Email.str.slice_replace(start=1,stop=2,repl='XX')
-------------------------
0         jXXrdon@sohu.com
1             MXXke@126.cn
2    KXXlvinChai@gmail.com
3          xXXaoli@163.com
4                      NaN
5             aXXei@qq.com
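
To illustrate the default behaviour described above, here is a small sketch on the name column: when only stop is given, everything before that position is replaced.

df.name.str.slice_replace(stop=2, repl='*')
-------------------------
0    *rdon
1      *KE
2    *lvin
3    *aoLi
4      *qi
5      *ei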

Repeat replacement

repeat repeats each string, and the repeats parameter sets the number of repetitions.

df.name.str.repeat(repeats=2)
-------------------------
0    jordonjordon
1        MIKEMIKE
2    KelvinKelvin
3    xiaoLixiaoLi
4        qiqiqiqi
5        AmeiAmei

4. Text splicing

Text splicing is achieved through the cat method; its parameters are:

  • others: the Series or list-like object to concatenate with; if left unset (None), the values of the current Series are joined into a single string

  • sep: the separator used when joining

  • na_rep: the representation for missing values; if not set, missing values are skipped (when others is None) or produce NaN in the result

  • join: how to align the indexes of the sequences, one of left, right, outer, inner; the default is left

There are mainly the following splicing methods.

1. Concatenate a single sequence into a complete string

As mentioned above, when the others parameter is not set, this method merges the values of the current sequence into a single string.

df.name.str.cat()
-------------------------------
'jordonMIKEKelvinxiaoLiqiqiAmei'
# set the separator sep to '-'
df.name.str.cat(sep='-')
-------------------------------
'jordon-MIKE-Kelvin-xiaoLi-qiqi-Amei'
# represent missing values as '*'
df.level.str.cat(sep='-',na_rep='*')
-----------------------
'high-Low-M-L-middle-*'

2. Splice sequences and other list-like objects into new sequences

Next, first concatenate the name column with a list of '*' characters, and then concatenate the level column to form a new sequence.

# chained str.cat calls concatenate multiple columns
df.name.str.cat(['*']*6).str.cat(df.level)
----------------
0    jordon*high
1       MIKE*Low
2       Kelvin*M
3       xiaoLi*L
4    qiqi*middle
5            NaN

3. Splice a sequence with multiple objects into a new sequence

# multiple columns can also be concatenated in a single call
df.name.str.cat([df.level,df.Email],na_rep='*')
--------------------------------
0      jordonhighjordon@sohu.com
1             MIKELowMike@126.cn
2    KelvinMKelvinChai@gmail.com
3          xiaoLiLxiaoli@163.com
4                    qiqimiddle*
5               Amei*amei@qq.com
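
The join parameter controls how the indexes are aligned when concatenating. A small illustrative sketch, using a made-up Series t whose index only overlaps with part of df:

t = pd.Series(['A', 'B'], index=[0, 5])
df.name.str.cat(t, join='left', na_rep='-')
-------------------------
0    jordonA
1      MIKE-
2    Kelvin-
3    xiaoLi-
4      qiqi-
5      AmeiB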

5. Text extraction

Text extraction is mainly achieved through extract.

extract parameters:

  • pat: a regular expression whose capture groups define what to extract

  • flags: flags from the re module, such as re.IGNORECASE

  • expand: when the regular expression has only one capture group, expand=True returns a DataFrame, otherwise a Series is returned

# extract two pieces of content from Email
df.Email.str.extract(pat='(.*?)@(.*).com')
--------------------
   0           1
0  jordon      sohu
1  NaN         NaN
2  KelvinChai  gmail
3  xiaoli      163
4  NaN         NaN
5  amei        qq
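
When the pattern contains a single capture group, the expand parameter decides the return type; for example, extracting only the user name before the @ (with expand=False the result is a Series):

df.Email.str.extract('(.*?)@', expand=False)
--------------------
0        jordon
1          Mike
2    KelvinChai
3        xiaoli
4           NaN
5          amei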

6. Text query

Text querying is implemented through two methods: find and findall.

The parameters of find are very simple: pass in the string to search for, and its position within each original string is returned; if it is not found, -1 is returned.

df['@position'] = df.Email.str.find('@')
df[['Email','@position']]
-------------------------------------
    Email                  @position
0   jordon@sohu.com        6.0
1   Mike@126.cn            4.0
2   KelvinChai@gmail.com   10.0
3   xiaoli@163.com         6.0
4   NaN                    NaN
5   amei@qq.com            4.0

The above example returns the position of @ in the email variable.

Another way to query is findall.

findall parameters:

  • pat: the content to find; regular expressions are supported

  • flags: flags from the re module, such as re.IGNORECASE

The difference between findall and find is that findall supports regular expressions and returns the matched content itself. In this sense it is similar to extract and can also be used for extraction, although it is not as convenient as extract.

df.Email.str.findall('(.*?)@(.*).com')
--------------------------
0         [(jordon, sohu)]
1                       []
2    [(KelvinChai, gmail)]
3          [(xiaoli, 163)]
4                      NaN
5             [(amei, qq)]

The above example returns the two parts of the regular search as a list of tuples.

7. Text contains

Text inclusion is implemented through the contains method, which returns a Boolean value. It is generally used in conjunction with the loc query function. Parameters:

  • pat: Matches strings, supports regular expressions

  • case: whether matching is case sensitive; the default True means case matters

  • flags: flags from the re module, such as re.IGNORECASE

  • na: the value used to fill missing entries

  • regex: whether pat is treated as a regular expression; the default is True

df.Email.str.contains('jordon|com',na='*')
----------
0     True
1    False
2     True
3     True
4        *
5     True
df.loc[df.Email.str.contains('jordon|com', na=False)]
------------------------------------------
   name    Age  level  Email                 @position
0  jordon  18   high   jordon@sohu.com       6.0
2  Kelvin  45   M      KelvinChai@gmail.com  10.0
3  xiaoLi  23   L      xiaoli@163.com        6.0
5  Amei    62   NaN    amei@qq.com           4.0

Note that when combined with loc, you must take care of missing values, otherwise an error will be raised; setting na=False lets the query treat missing values as non-matches.
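
The case parameter is not demonstrated above; here is a small sketch of a case-insensitive match on the name column:

df.name.str.contains('mike', case=False)
----------
0    False
1     True
2    False
3    False
4    False
5    False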

8. Text dummy variables

get_dummies automatically turns a text column into dummy (indicator) variables. This method is often used in feature engineering.

df.name.str.get_dummies()
-------------------------------
  Amei Kelvin MIKE jordon qiqi xiaoLi
0   0     0     0     1     0     0
1   0     0     1     0     0     0
2   0     1     0     0     0     0
3   0     0     0     0     0     1
4   0     0     0     0     1     0
5   1     0     0     0     0     0
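
get_dummies also takes a sep parameter (default '|') for values that encode several labels in one string; a small sketch with made-up multi-label data:

pd.Series(['high|low', 'low', 'middle|high']).str.get_dummies(sep='|')
-------------------------------
   high  low  middle
0     1    1       0
1     0    1       0
2     1    0       1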

Origin blog.csdn.net/qq_39312146/article/details/134700781