pandas text manipulation

1. Common text operations

Personally, I think there are three kinds of text operations: matching and replacing, matching and extracting, and splitting and concatenating. Let's go through them in turn.

1.1 Text replacement

This refers to replacing the parts of a string that match a certain pattern with specified content, for example replacing every letter with the symbol ※. Making such replacements requires some understanding of Python's regular expressions.

Definition: A regular expression is a special sequence of characters that makes it easy to check whether a string matches a certain pattern. Python has included the re module since version 1.5; it provides Perl-style regular expression patterns and gives Python full regular expression support.

A regular expression is written as a string and matches text through a combination of exact and fuzzy matching. Exact matching is expressed by the literal characters to be matched, such as 'py'; fuzzy matching is expressed by special sequences made of the escape character '\' followed by a letter. Common fuzzy matching symbols are as follows:

\w	Match a letter, digit, or underscore
\W	Match any character that is not a letter, digit, or underscore
\s	Match any whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text.
\S	Match any non-whitespace character
\d	Match any digit, equivalent to [0-9] for ASCII text.
\D	Match any non-digit
\A	Match the start of the string
\Z	Match the end of the string. In Perl-style engines, if the string ends with a newline, \Z also matches just before that newline; in Python's re module, \Z matches only at the very end.
\z	Match the end of the string (Perl-style; in Python's re module use \Z)
\G	Match the position where the last match finished (Perl-style; not supported by Python's re module)
\b	Match a word boundary, that is, the position between a word character and a non-word character such as a space. For example, 'er\b' can match the 'er' in "never" but not the 'er' in "verb"
\B	Match a non-word boundary. 'er\B' can match the 'er' in "verb" but not the 'er' in "never".
\n, \t, etc.	Match a newline character, a tab character, and so on.
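
To make the table more concrete, here is a minimal sketch that tries a few of these symbols with Python's standard re module. The sample strings and variable names are made up for illustration; raw strings (r'...') are used so that the backslashes reach the regular expression engine unchanged.

```python
import re

# \d matches a single digit; \w+ matches a run of letters, digits, or underscores.
digits = re.findall(r'\d', 'a1b22')
words = re.findall(r'\w+', 'foo_bar, baz-9')

# \s+ matches a run of whitespace, so it can split on any blank characters.
fields = re.split(r'\s+', 'a b\tc\nd')

# \b requires a word boundary right after 'er'; \B requires that there is none.
er_at_boundary = [w for w in ('never', 'verb') if re.search(r'er\b', w)]
er_inside_word = [w for w in ('never', 'verb') if re.search(r'er\B', w)]

# \A anchors the match at the start of the string.
starts_with_py = [w for w in ('python', 'happy') if re.search(r'\Apy', w)]

print(digits, words, fields)
print(er_at_boundary, er_inside_word, starts_with_py)
```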

Well, here are some common methods:

Common usage of str.replace

import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', '', np.nan,
               'CABA', 'dog', 'cat'], dtype="string")
s
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6    <NA>
7    CABA
8     dog
9     cat
dtype: string
# Replace a leading A or B with *** (^ anchors the match at the start of the string)
s.str.replace(r'^[AB]', '***', regex=True)
0       ***
1       ***
2         C
3    ***aba
4    ***aca
5          
6      <NA>
7      CABA
8       dog
9       cat
dtype: string

Detailed parameter explanation:
s.str.replace(pat, repl, n=-1, case=None, flags=0, regex=False)

pat	The pattern to match: a string or a compiled regular expression (the re module does not need to be imported for a string pattern)
repl	The replacement: a string, or a callable (such as a lambda) that receives the match object and returns the replacement string
n	Number of replacements to make, counted from the start of each string; the default -1 replaces all matches
case	Whether matching is case sensitive. True is case sensitive (the default when pat is a string), False is case insensitive. Cannot be set if pat is a compiled regular expression.
flags	Flags from the re module, such as re.IGNORECASE. Cannot be set if pat is a compiled regular expression. (Usually you can ignore this.)
regex	bool. If True, the pattern is treated as a regular expression; if False, it is treated as a literal string. The default was True in older pandas versions and has been False since pandas 2.0, so it is safest to pass it explicitly.
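
As a rough illustration of the repl, n, and case parameters, the sketch below uses small made-up Series. The variable names are hypothetical, and regex is always passed explicitly so the behavior does not depend on the pandas version's default.

```python
import pandas as pd

s = pd.Series(['foo 123', 'bar baz'], dtype='string')

# repl as a callable: it receives each match object and returns the replacement.
reversed_words = s.str.replace(r'[a-z]+', lambda m: m.group(0)[::-1], regex=True)

# n=1 replaces only the first occurrence in each string (literal pattern here).
first_only = pd.Series(['a-b-c'], dtype='string').str.replace('-', '+', n=1, regex=False)

# case=False makes a regular-expression match case insensitive.
any_case = pd.Series(['Apple apple'], dtype='string').str.replace(
    'apple', 'X', case=False, regex=True)

print(reversed_words.tolist(), first_only.tolist(), any_case.tolist())
```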

The following replaces 'f.' with 'ba'. With regex=False the pattern is matched literally, so the '.' matches only an actual period and 'fuz' is left unchanged.

>>> pd.Series(['f.o', 'fuz', np.nan]).str.replace('f.', 'ba', regex=False)
0    bao
1    fuz
2    NaN
dtype: object

str.replace vs. replace

First of all, it must be clear that str.replace and replace are not the same thing:

str.replace works on a Series of object or string type and replaces substrings within each element; whether the pattern is a regular expression is controlled by the regex parameter. It is not available on DataFrame.
replace works on a Series or DataFrame of any type and, by default, replaces whole values. To replace by regular expression, you need to set regex=True. This method also supports replacement across multiple columns through a dictionary. However, because the string type had only recently been introduced at the time of writing, there were some problems when using it with replace. These issues were expected to be fixed in later versions.
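
The difference can be sketched on a small made-up DataFrame. The column names and values are hypothetical, and object dtype is chosen here as an assumption of the sketch, to stay clear of the string-type problems mentioned above.

```python
import pandas as pd

# Object dtype is an assumption of this sketch (see the note on the string type).
df = pd.DataFrame({'a': ['x1', 'y2'], 'b': ['x3', 'z4']}, dtype=object)

# regex=True: in every column, the part of a cell that matches '^x' is replaced.
by_regex = df.replace(r'^x', 'X', regex=True)

# Nested dictionary: in column 'a' only, replace the whole value 'x1' with 'one'.
by_column = df.replace({'a': {'x1': 'one'}})

# Without regex=True, replace compares whole values, so the substring 'x' matches nothing.
unchanged = df.replace('x', 'X')

print(by_regex, by_column, unchanged, sep='\n\n')
```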

Other considerations
(a) The replacement value passed to str.replace must not be pd.NA.
This seems unreasonable: replacing strings that match a certain pattern with a missing value is a natural thing to want, yet passing pd.NA directly raises an error in the current version.

In [28]:
#pd.Series(['A','B'],dtype='string').str.replace(r'[A]',pd.NA) # raises an error
#pd.Series(['A','B'],dtype='O').str.replace(r'[A]',pd.NA) # raises an error
A roundabout workaround is to convert to object type first, use replace, and then convert back:
In [29]:
pd.Series(['A','B'],dtype='string').astype('O').replace(r'[A]',pd.NA,regex=True).astype('string')
Out[29]:
0    <NA>
1       B
dtype: string
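
Another possible route, which is not the method used in the original notes, is to build a boolean mask with str.contains and blank out the matching rows with mask. This is only a sketch under that assumption; the variable names are made up.

```python
import pandas as pd

s = pd.Series(['A', 'B'], dtype='string')

# Boolean mask of the elements that match the pattern;
# na=False keeps missing values out of the mask.
matches = s.str.contains(r'[A]', regex=True, na=False)

# mask() replaces the values where the condition is True with a missing value.
result = s.mask(matches)

print(result)
```

Unlike str.replace, this replaces the whole element rather than only the matched part, which is what is wanted when the result should be a missing value.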

(b) For a Series of string type, regular expressions could not be used with the replace function.
At the time of writing this bug had not been fixed; the outputs below come from that version.

In [30]:
pd.Series(['A','B'],dtype='string').replace(r'[A]','C',regex=True)
Out[30]:
0    A
1    B
dtype: string
In [31]:
pd.Series(['A','B'],dtype='O').replace(r'[A]','C',regex=True)
Out[31]:
0    C
1    B
dtype: object

In [32]:
#pd.Series(['A',np.nan],dtype='string').replace('A','B') # raises an error
In [33]:
pd.Series(['A',np.nan],dtype='string').str.replace('A','B')
Out[33]:
0       B
1    <NA>
dtype: string

In summary, unless the replacement value is a missing value, use str.replace for string replacement.

1.2 Specific string extraction

(a) The str.extract method
pd.Series.str.extract(pat, flags=0, expand=True)

example:

pd.Series(['10-87', '10-88', '10-89'],
dtype="string").str.extract(r'([\d]{2})-([\d]{2})')
Out[34]:
	0	1
0	10	87
1	10	88
2	10	89

This extracts the first two digits as one group and the last two digits as another. Note the concept of grouping in regular expressions: roughly speaking, each part enclosed in parentheses is a group.
Detailed parameters:

parameter	effect
pat	Regular expression pattern with capturing groups
flags	int, default 0 (no flags). Flags from the re module, such as re.IGNORECASE; you can usually ignore this
expand	bool, default True. If True, return a DataFrame with one column per group. If False and the pattern has a single group, return a Series; with more than one group the expand parameter has no effect and a DataFrame is returned.
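
The effect of expand, and of named groups, can be sketched as follows. The group names left and right are made up for illustration.

```python
import pandas as pd

s = pd.Series(['10-87', '10-88', '10-89'], dtype='string')

# One group with expand=False gives a Series; with expand=True, a one-column DataFrame.
as_series = s.str.extract(r'(\d{2})-', expand=False)
as_frame = s.str.extract(r'(\d{2})-', expand=True)

# Named groups (?P<name>...) become the column names of the result.
named = s.str.extract(r'(?P<left>\d{2})-(?P<right>\d{2})')

print(type(as_series).__name__, type(as_frame).__name__)
print(named)
```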

(b) The str.extractall method
Unlike extract, which only returns the first match in each string, extractall finds all matches and builds a multi-level index (even if only one match is found).

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],dtype="string")
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)

	letter	digit
A	a	1
B	b	1
C	c	1

s.str.extractall(two_groups)

		letter	digit
	match		
A	0	a	1
	1	a	2
B	0	b	1
C	0	c	1

To select only the i-th match of each string, you can use the xs method on the match level:

s = pd.Series(["a1a2", "b1b2", "c1c2"], 
index=["A", "B", "C"],dtype="string")
s.str.extractall(two_groups).xs(1,level='match')
Out[48]:
	letter	digit
A	a	2
B	b	2
C	c	2
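
Because the result of extractall is an ordinary DataFrame with a two-level index, the usual index tools apply to it. As a small sketch (the variable names are made up), the number of matches per original string can be counted by grouping on the outer level:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
matches = s.str.extractall(two_groups)

# Level 0 of the index is the original label, so grouping on it counts matches per string.
match_counts = matches.groupby(level=0).size()

# reset_index turns the (label, match) index into ordinary columns.
flat = matches.reset_index()

print(match_counts)
print(flat)
```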

1.3 Splitting and concatenation

The str.split method
(a) Separators and element selection with str

In [6]:
s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
s
Out[6]:
0    a_b_c
1    c_d_e
2     <NA>
3    f_g_h
dtype: string
str.split splits each string on a given separator; the default is whitespace.
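
The split call itself is not shown at this point in the notes. A minimal sketch of what it presumably looked like, using the same Series and '_' as the separator, is given here; the second Series is a made-up example for the whitespace default.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

# Split each element on '_'; every non-missing cell becomes a list of pieces.
parts = s.str.split('_')

# With no separator given, split breaks on runs of whitespace.
words = pd.Series(['a b  c'], dtype="string").str.split()

print(parts)
print(words)
```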

Note that the dtype after split is object, because the elements of the Series are now lists rather than strings, and the string type can only contain strings.
The str accessor can also be used to select elements. If a cell holds a list, str[i] takes out its i-th element. If a cell holds a single string, str[i] indexes it as if it were a list of characters and takes out the i-th character.

In [8]:
s.str.split('_').str[1]
Out[8]:
0       b
1       d
2    <NA>
3       g
dtype: object
In [9]:
pd.Series(['a_b_c', ['a','b','c']], dtype="object").str[1]
# the first element is treated as ['a','_','b','_','c']
Out[9]:
0    _
1    b
dtype: object

(b) Other parameters
The expand parameter controls whether the result is split into columns, and the n parameter sets the maximum number of splits.

s.str.split('_',expand=True)
	0	1	2
0	a	b	c
1	c	d	e
2	<NA>	<NA>	<NA>
3	f	g	h

s.str.split('_',n=1)

0    [a, b_c]
1    [c, d_e]
2        <NA>
3    [f, g_h]
dtype: object

s.str.split('_',expand=True,n=1)
	0	1
0	a	b_c
1	c	d_e
2	<NA>	<NA>
3	f	g_h
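
The n parameter counts separators from the left. As a sketch of the related str.rsplit method, which is not covered in the notes above and counts from the right instead, the two can be compared on the same made-up data:

```python
import pandas as pd

s = pd.Series(['a_b_c', 'c_d_e', 'f_g_h'], dtype="string")

# split counts separators from the left, rsplit from the right.
from_left = s.str.split('_', n=1, expand=True)
from_right = s.str.rsplit('_', n=1, expand=True)

print(from_left)
print(from_right)
```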

The str.cat method
(a) Concatenation modes for different objects
The cat method behaves differently depending on what it is given: a single column, two columns, or multiple columns.
① For a single Series, all elements are concatenated into one string, so the whole column becomes a single string.

s = pd.Series(['ab',None,'d'],dtype='string')
s

0      ab
1    <NA>
2       d
dtype: string

s.str.cat()

'abd'
# Optional parameters: the separator sep, and na_rep, the string substituted for missing values

s.str.cat(sep=',')

'ab,d'

s.str.cat(sep=',',na_rep='*')

'ab,*,d'
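
One way to read the single-Series case is as Python's own str.join applied to the column. The sketch below checks that reading on the same Series; the variable names are made up.

```python
import pandas as pd

s = pd.Series(['ab', None, 'd'], dtype='string')

# Without na_rep, missing values are skipped, like joining only the non-missing elements.
joined = s.str.cat(sep=',')
joined_plain = ','.join(s.dropna())

# With na_rep, missing values are replaced first, like filling them before joining.
joined_rep = s.str.cat(sep=',', na_rep='*')
joined_rep_plain = ','.join(s.fillna('*'))

print(joined, joined_plain, joined_rep, joined_rep_plain)
```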

② When two Series are concatenated, elements with the same index are joined.

In [17]:
s2 = pd.Series(['24',None,None],dtype='string')
s2
Out[17]:
0      24
1    <NA>
2    <NA>
dtype: string
In [18]:
s.str.cat(s2)
Out[18]:
0    ab24
1    <NA>
2    <NA>
dtype: string
The same parameters are available here. Note that when both values are missing, both are replaced.
In [19]:
s.str.cat(s2,sep=',',na_rep='*')
Out[19]:
0    ab,24
1      *,*
2      d,*
dtype: string

③ Multi-column concatenation can be divided into concatenation with a table and concatenation with multiple Series

Concatenation with a table
In [20]:
s.str.cat(pd.DataFrame({0: ['1', '3', '5'], 1: ['5', 'b', None]}, dtype='string'), na_rep='*')
Out[20]:
0    ab15
1     *3b
2     d5*
dtype: string

Concatenation with multiple Series

s.str.cat([s+'0',s*2])

0    abab0abab
1         <NA>
2        dd0dd
dtype: string
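
A typical use of the multi-Series form is building one key column out of several existing columns. The following is only a sketch on a made-up DataFrame; the column names year, month, and day are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'year': ['2020', '2021'],
                   'month': ['01', '12'],
                   'day': ['05', '31']}, dtype='string')

# Join the three columns row by row, with '-' between the pieces.
dates = df['year'].str.cat([df['month'], df['day']], sep='-')

print(dates)
```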

(b) Index alignment in cat
In the current version, if the indexes on the two sides differ and the join parameter is not specified, the default is a left join, equivalent to setting join='left'.

In [22]:
s2 = pd.Series(list('abc'),index=[1,2,3],dtype='string')
s2
Out[22]:
1    a
2    b
3    c
dtype: string
In [23]:
s.str.cat(s2,na_rep='*')
Out[23]:
0    ab*
1     *a
2     db
dtype: string
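
For comparison with the default left join, the other values of the join parameter can be sketched on the same two Series. The s defined here repeats the earlier Series ['ab', <NA>, 'd'] so that the block runs on its own.

```python
import pandas as pd

s = pd.Series(['ab', None, 'd'], dtype='string')
s2 = pd.Series(list('abc'), index=[1, 2, 3], dtype='string')

# outer keeps every label from either side; inner keeps only labels present in both.
outer = s.str.cat(s2, join='outer', na_rep='*')
inner = s.str.cat(s2, join='inner', na_rep='*')

# right keeps the labels of the other Series (s2).
right = s.str.cat(s2, join='right', na_rep='*')

print(outer, inner, right, sep='\n\n')
```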

reference

1.joyful-pandas
