42_Pandas: extracting strings with regular expressions to generate new columns

This article explains how to generate a new column by extracting specific substrings with a regular expression from a pandas.DataFrame column or a pandas.Series whose elements are strings.

Use the following string methods.

  • str.extract(): extract only the first match
  • str.extractall(): extract all matches

To extract from a column of a pandas.DataFrame, select the column (which is a pandas.Series), for example df['column name'], and call str.extract() or str.extractall() on it.
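As a minimal sketch of that column-based usage: the DataFrame below and its column names ('name', 'email', 'local') are hypothetical and chosen only for illustration. The column is selected as a Series, the part before '@' is extracted, and the result is assigned back as a new column.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for this sketch.
df = pd.DataFrame({'name': ['Alice', 'Bob'],
                   'email': ['alice@example.com', 'bob@example.org']})

# Select the column (a Series), extract the group before '@',
# and store the resulting Series as a new column.
df['local'] = df['email'].str.extract(r'(.+)@', expand=False)
print(df)
```

With a single group, expand=False yields a Series, which can be assigned directly to a new column.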

str.extract(): extract only the first match

Take pandas.Series as an example.

import pandas as pd

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.net', 'ccc@zzz.co.jp'], index=['A', 'B', 'C'])
print(s_org)
# A      aaa@xxx.com
# B      bbb@yyy.net
# C    ccc@zzz.co.jp
# dtype: object

Use the str.extract() method to extract only the first match of the regular expression.

When a regular expression pattern is passed as the first argument to str.extract(), the substring matching the group enclosed in () is extracted.
If the parameter expand is True, the result is returned as a pandas.DataFrame; if it is False and the pattern has a single group, the result is returned as a pandas.Series.

df_single = s_org.str.extract('(.+)@', expand=True)
print(df_single)
print(type(df_single))
#      0
# A  aaa
# B  bbb
# C  ccc
# <class 'pandas.core.frame.DataFrame'>

s = s_org.str.extract('(.+)@', expand=False)
print(s)
print(type(s))
# A    aaa
# B    bbb
# C    ccc
# dtype: object
# <class 'pandas.core.series.Series'>

In pandas 0.22.0 the default is expand=False, but the FutureWarning below announces a change to expand=True, and the default has been expand=True since pandas 0.23.0. Since the result depends on the version, it is safest to specify expand explicitly.

FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame)
but in a future version of pandas this will be changed to expand=True (return DataFrame)

If you use a named group (?P<name>...) in the regex pattern, the group name is used as the column name as-is.

df_name = s_org.str.extract('(?P<local>.+)@', expand=True)
print(df_name)
print(type(df_name))
#   local
# A   aaa
# B   bbb
# C   ccc
# <class 'pandas.core.frame.DataFrame'>

If the pattern contains multiple groups enclosed in (), a pandas.DataFrame is returned with the extracted part of each group as a column. This holds regardless of whether expand is True or False. The column names are sequential numbers starting from 0 by default, or the group names if named groups (?P<name>...) are used.

print(s_org.str.extract('(.+)@(.+)'))
#      0          1
# A  aaa    xxx.com
# B  bbb    yyy.net
# C  ccc  zzz.co.jp

print(s_org.str.extract('(?P<local>.+)@(?P<domain>.+)'))
#   local     domain
# A   aaa    xxx.com
# B   bbb    yyy.net
# C   ccc  zzz.co.jp
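To turn the extracted groups into new columns of an existing DataFrame, one option is to join the result back on the index. The following is a sketch under assumptions: the DataFrame df and its column name 'email' are hypothetical, and join() is just one of several ways to combine the frames (pd.concat with axis=1 would also work).

```python
import pandas as pd

# Hypothetical DataFrame holding the same addresses as the example Series.
df = pd.DataFrame({'email': ['aaa@xxx.com', 'bbb@yyy.net', 'ccc@zzz.co.jp']},
                  index=['A', 'B', 'C'])

# Named groups become the column names of the extracted DataFrame.
parts = df['email'].str.extract(r'(?P<local>.+)@(?P<domain>.+)', expand=True)

# The extracted frame keeps the original index, so it can be joined directly.
df_joined = df.join(parts)
print(df_joined)
```

Because str.extract() preserves the index of the source Series, the join aligns each extracted row with its original row.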

Elements with no matching part become NaN.

print(s_org.str.extract('(a+)', expand=True))
#      0
# A  aaa
# B  NaN
# C  NaN
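Since unmatched elements become NaN, it is often useful to detect or replace them before using the result as a new column. The sketch below shows a few common options; the variable names are illustrative, and which handling is appropriate depends on the use case.

```python
import pandas as pd

s = pd.Series(['aaa@xxx.com', 'bbb@yyy.net', 'ccc@zzz.co.jp'], index=['A', 'B', 'C'])
extracted = s.str.extract(r'(a+)', expand=False)

# Boolean mask: True where the pattern matched.
matched = extracted.notna()

# Replace missing values with an empty string.
filled = extracted.fillna('')

# Keep only the elements that matched.
dropped = extracted.dropna()

print(matched)
print(filled)
print(dropped)
```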

str.extractall(): extract all matches

Take pandas.Series as an example.

s_org2 = pd.Series(['aaa@xxx.com, iii@xxx.com', 'bbb@yyy.net, jjj@yyy.net', 'ccc@zzz.co.jp'],
                   index=['A', 'B', 'C'])
print(s_org2)
# A    aaa@xxx.com, iii@xxx.com
# B    bbb@yyy.net, jjj@yyy.net
# C               ccc@zzz.co.jp
# dtype: object

Since str.extract() returns only the first matching part, the result is as follows.

print(s_org2.str.extract('([a-z]+)@([a-z.]+)', expand=True))
#      0          1
# A  aaa    xxx.com
# B  bbb    yyy.net
# C  ccc  zzz.co.jp

Use the str.extractall() method to extract all matches.

The result of str.extractall() is as follows. str.extractall() has no expand argument and always returns a pandas.DataFrame with a MultiIndex.

df_all = s_org2.str.extractall('([a-z]+)@([a-z.]+)')
print(df_all)
#            0          1
#   match                
# A 0      aaa    xxx.com
#   1      iii    xxx.com
# B 0      bbb    yyy.net
#   1      jjj    yyy.net
# C 0      ccc  zzz.co.jp

print(df_all.index)
# MultiIndex(levels=[['A', 'B', 'C'], [0, 1]],
#            labels=[[0, 0, 1, 1, 2], [0, 1, 0, 1, 0]],
#            names=[None, 'match'])

See the following article for specifying and selecting elements of a multi-indexed pandas.DataFrame.
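As a brief sketch of such selection on the result of str.extractall(), the example below rebuilds the same data and picks out rows and elements with loc and xs(). The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['aaa@xxx.com, iii@xxx.com', 'bbb@yyy.net, jjj@yyy.net', 'ccc@zzz.co.jp'],
              index=['A', 'B', 'C'])
df_all = s.str.extractall(r'([a-z]+)@([a-z.]+)')

# All matches found in the original row 'A' (indexed by match number).
rows_a = df_all.loc['A']

# The first match (match == 0) of every original row.
first = df_all.xs(0, level='match')

# A single element: row 'B', second match, group 0.
value = df_all.loc[('B', 1), 0]

print(rows_a)
print(first)
print(value)
```

Selecting match 0 with xs() gives one row per original element that had at least one match, comparable to what str.extract() returns for those elements.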

Note that the index is a MultiIndex even if each element has only one matching part. The following uses the Series from the str.extract() example.

print(s_org.str.extractall('([a-z]+)@([a-z.]+)'))
#            0          1
#   match                
# A 0      aaa    xxx.com
# B 0      bbb    yyy.net
# C 0      ccc  zzz.co.jp
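Because the result has one row per match, it needs to be reshaped before it can become a new column of the original data. The sketch below shows two possible approaches, collecting the matches of each row into a list and spreading them into wide columns. The DataFrame df and the column names 'emails' and 'locals' are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with several addresses per cell.
df = pd.DataFrame({'emails': ['aaa@xxx.com, iii@xxx.com',
                              'bbb@yyy.net, jjj@yyy.net',
                              'ccc@zzz.co.jp']},
                  index=['A', 'B', 'C'])

df_all = df['emails'].str.extractall(r'(?P<local>[a-z]+)@(?P<domain>[a-z.]+)')

# Approach 1: gather every match of each original row into a list.
# Assignment aligns on the outer index level ('A', 'B', 'C').
df['locals'] = df_all.groupby(level=0)['local'].agg(list)

# Approach 2: one column per match number; rows with fewer matches get NaN.
wide = df_all['local'].unstack('match')

print(df)
print(wide)
```

Rows without any match do not appear in the str.extractall() result at all, so with the first approach they would receive NaN in the new column after index alignment.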

Origin blog.csdn.net/qq_18351157/article/details/116301296