Pandas study notes (7): text data

Preface


For more article code details, please check the blogger’s personal website: https://www.iwtmbtly.com/


Import the required libraries:

>>> import pandas as pd
>>> import numpy as np

1. The nature of the string type

(1) The difference between string and object

The string type differs from object in three ways:

  • String accessor methods (such as str.count) return the Nullable type matching the result for string data, whereas for object data the return type changes depending on whether missing values are present

  • Some accessor methods cannot be used on the string type, for example Series.str.decode(), because it stores strings rather than bytes

  • When the string type stores or operates on missing values, they are represented as pd.NA rather than the floating-point np.nan

Everything else behaves the same in the current version, but to follow the direction in which Pandas is developing, these notes use the string type for string operations
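
The first and third differences can be checked with a small sketch. The sample values and variable names below are illustrative assumptions, not taken from the original notes:

```python
import numpy as np
import pandas as pd

values = ["aa", np.nan, "ba"]
s_string = pd.Series(values, dtype="string")
s_object = pd.Series(values, dtype="object")

# string dtype: str.count returns the nullable integer dtype Int64
count_string = s_string.str.count("a")

# object dtype: a missing value forces a float result, while the
# same data without missing values gives a plain integer result
count_object_na = s_object.str.count("a")
count_object_full = s_object.dropna().str.count("a")

print(count_string.dtype, count_object_na.dtype, count_object_full.dtype)

# the missing value is stored as pd.NA in the string Series
print(s_string.iloc[1] is pd.NA)
```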

(2) Conversion of string type

Converting a container of another type directly to the string type may raise an error (this was the case in pandas 1.0; pandas 1.1 and later accept these conversions):

# pd.Series([1,'1.']).astype('string') # raises an error in pandas 1.0
# pd.Series([1,2]).astype('string') # raises an error in pandas 1.0
# pd.Series([True,False]).astype('string') # raises an error in pandas 1.0

A method that works regardless of version is to convert in two steps, first to str and then to the string type:

>>> pd.Series([1,'1.']).astype('str').astype('string')
0     1
1    1.
dtype: string
>>> pd.Series([1,2]).astype('str').astype('string')
0    1
1    2
dtype: string
>>> pd.Series([True,False]).astype('str').astype('string')
0     True
1    False
dtype: string
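
The two-step conversion can be wrapped in a small function. This is only a sketch: to_string_dtype is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def to_string_dtype(values):
    """Hypothetical helper: convert values to the string dtype in two steps."""
    # step 1: turn every element into a Python str
    # step 2: move the result to the dedicated string dtype
    return pd.Series(values).astype("str").astype("string")


mixed = to_string_dtype([1, "1.", True])
print(mixed.tolist())
print(mixed.dtype)
```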

2. Splitting and splicing

(1) str.split method

1. Separator and str position element selection

>>> s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
>>> s
0    a_b_c
1    c_d_e
2     <NA>
3    f_g_h
dtype: string

Split on a given separator; the default separator is whitespace

>>> s.str.split('_')
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

Note that the result of split has type object: the elements of the Series are now lists rather than strings, and the string type can only hold strings

The str accessor also supports element selection. If a cell holds a list, str[i] takes out its i-th element. If it holds a single string, the string is treated as a list of characters and str[i] takes out the i-th character.

>>> s.str.split('_').str[1]
0       b
1       d
2    <NA>
3       g
dtype: object

>>> pd.Series(['a_b_c', ['a','b','c']], dtype="object").str[1]  # the first element is treated as ['a','_','b','_','c']
0    _
1    b
dtype: object
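
As a further sketch of element selection after a split (the variable names are illustrative), negative positions and the str.get method work the same way as str[i]:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")
parts = s.str.split("_")

first = parts.str[0]       # first piece of every list
last = parts.str[-1]       # negative positions count from the end
second = parts.str.get(1)  # str.get(i) is the method form of str[i]

print(first.tolist(), last.tolist(), second.tolist())
```

The missing entry stays missing in all three results.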

2. Other parameters

The expand parameter controls whether the pieces are expanded into separate columns, and the n parameter sets the maximum number of splits

>>> s.str.split('_',expand=True)
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h
>>> s.str.split('_',n=1)
0    [a, b_c]
1    [c, d_e]
2        <NA>
3    [f, g_h]
dtype: object
>>> s.str.split('_',expand=True,n=1)
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h
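
A possible use of expand together with n is to build labeled columns. In this sketch the column names head, rest and tail are assumptions chosen for illustration; rsplit is the right-to-left counterpart of split:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

# split once from the left and label the two resulting columns
parts = s.str.split("_", n=1, expand=True)
parts.columns = ["head", "rest"]

# rsplit splits once from the right instead
tail = s.str.rsplit("_", n=1, expand=True)
tail.columns = ["rest", "tail"]

print(parts)
print(tail)
```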

(2) str.cat method

1. Splicing modes for different objects

The cat method behaves differently depending on what it is given: a single Series, two Series, or multiple columns

(a) For a single Series, all elements are merged into one string

>>> s = pd.Series(['ab',None,'d'],dtype='string')
>>> s
0      ab
1    <NA>
2       d
dtype: string
>>> s.str.cat()
'abd'

The optional sep parameter sets the separator, and the na_rep parameter sets the replacement for missing values:

>>> s.str.cat(sep=',')
'ab,d'
>>> s.str.cat(sep=',',na_rep='*')
'ab,*,d'

(b) For the merger of two Series, the elements of the corresponding index are merged

>>> s2 = pd.Series(['24',None,None],dtype='string')
>>> s2
0      24
1    <NA>
2    <NA>
dtype: string
>>> s.str.cat(s2)
0    ab24
1    <NA>
2    <NA>
dtype: string

The same parameters are available here. Note that missing values on both sides are replaced:

>>> s.str.cat(s2,sep=',',na_rep='*')
0    ab,24
1      *,*
2      d,*
dtype: string

(c) Multi-column splicing can be divided into table splicing and multi-Series splicing

Table splicing:

>>> s.str.cat(pd.DataFrame({0:['1','3','5'],1:['5','b',None]},dtype='string'),na_rep='*')
0    ab15
1     *3b
2     d5*
dtype: string

Multiple Series splicing:

>>> s.str.cat([s+'0',s*2])
0    abab0abab
1         <NA>
2        dd0dd
dtype: string

2. Index alignment in cat

In the current version, if the indexes on the two sides differ and the join parameter is not specified, the default is a left join, that is, join='left'

>>> s2 = pd.Series(list('abc'),index=[1,2,3],dtype='string')
>>> s2
1    a
2    b
3    c
dtype: string
>>> s.str.cat(s2,na_rep='*')
0    ab*
1     *a
2     db
dtype: string
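
The other values of join can be compared side by side. This is a minimal sketch that reuses the two Series from above; the result names are illustrative:

```python
import pandas as pd

s = pd.Series(["ab", None, "d"], dtype="string")
s2 = pd.Series(list("abc"), index=[1, 2, 3], dtype="string")

left = s.str.cat(s2, join="left", na_rep="*")    # keeps the index of s
outer = s.str.cat(s2, join="outer", na_rep="*")  # union of both indexes
inner = s.str.cat(s2, join="inner", na_rep="*")  # labels present in both

print(left)
print(outer)
print(inner)
```

With join='outer', label 3 exists only in s2, so the left-hand side is filled with na_rep before concatenation.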

3. Replacement

Replacement in the broad sense refers to the str.replace function; fillna handles the replacement of missing values and was covered in the previous chapter

Replacement inevitably involves regular expressions. This section assumes the reader is familiar with common regular-expression syntax; if not, it is worth reviewing first.

(1) Common usage of str.replace

>>> s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'],dtype="string")
>>> s
0       A
1       B
2       C
3    Aaba
4    Baca
5
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

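The first argument is the regular expression, written as a raw string with the r prefix, and the second is the replacement string. Passing regex=True makes the regular-expression behavior explicit, since the default changed to regex=False in pandas 2.0

>>> s.str.replace(r'^[AB]','***',regex=True)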
0       ***
1       ***
2         C
3    ***aba
4    ***aca
5
6      <NA>
7      CABA
8       dog
9       cat
dtype: string
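
The effect of the regex parameter can be seen by running the same pattern both ways. The sample data here is an assumption made for illustration:

```python
import pandas as pd

words = pd.Series(["a.c", "abc"], dtype="string")

# regex=False: "a.c" is matched literally, so only the first entry changes
literal = words.str.replace("a.c", "X", regex=False)

# regex=True: "." matches any character, so both entries change
pattern = words.str.replace("a.c", "X", regex=True)

print(literal.tolist())
print(pattern.tolist())
```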

(2) Subgroup and function replacement

Subgroups are referenced by non-negative integers: group(0) returns the whole match, and subgroups are numbered from 1. A function can be passed as the replacement, which requires regex=True

>>> s.str.replace(r'([ABC])(\w+)',lambda x:x.group(2)[1:]+'*',regex=True)
0       A
1       B
2       C
3     ba*
4     ca*
5
6    <NA>
7     BA*
8     dog
9     cat
dtype: string

Use the ?P<…> expression to name subgroups and refer to them by name

>>> s.str.replace(r'(?P<one>[ABC])(?P<two>\w+)',lambda x:x.group('two')[1:]+'*',regex=True)
0       A
1       B
2       C
3     ba*
4     ca*
5
6    <NA>
7     BA*
8     dog
9     cat
dtype: string
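
When the replacement only rearranges the subgroups, a replacement string with backreferences can be used instead of a function. The following is a sketch with made-up sample data:

```python
import pandas as pd

codes = pd.Series(["10-87", "10-88", None], dtype="string")

# \1 and \2 in the replacement refer to the first and second subgroup
swapped = codes.str.replace(r"(\d+)-(\d+)", r"\2-\1", regex=True)

print(swapped)
```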

(3) Notes on str.replace

First, note that str.replace and replace are not the same thing:

str.replace works on the object type or the string type. In pandas versions before 2.0 the pattern is treated as a regular expression by default; from pandas 2.0 on the default is regex=False, so pass regex=True when a regular expression is intended. It is currently not available on a DataFrame

replace works on a Series or DataFrame of any type. To replace with a regular expression, set regex=True. This method supports multi-column replacement through a dictionary

Because the string type was only recently introduced, there are some problems in usage, which are expected to be fixed in future versions

1. The replacement argument of str.replace cannot be pd.NA

This seems unreasonable. For example, replacing strings that match a regular expression with a missing value is a natural thing to do, but passing a missing value directly raises an error in the current version.

# pd.Series(['A','B'],dtype='string').str.replace(r'[A]',pd.NA) # raises an error
# pd.Series(['A','B'],dtype='O').str.replace(r'[A]',pd.NA) # raises an error

In that case, take a roundabout route: convert to the object type first, replace, and then convert back:

>>> pd.Series(['A','B'],dtype='string').astype('O').replace(r'[A]',pd.NA,regex=True).astype('string')
0    <NA>
1       B
dtype: string
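
Another route that avoids the round trip through object is to build a boolean mask with str.contains and blank out the matching entries with mask. This is an alternative sketch, not the method used in the original notes:

```python
import pandas as pd

s = pd.Series(["A", "B", None], dtype="string")

# True where the pattern is found; missing entries count as no match
hit = s.str.contains(r"[A]", na=False).astype(bool)

# mask replaces the entries where hit is True with a missing value
masked = s.mask(hit)

print(masked)
```

The dtype stays string throughout, so no conversion back is needed.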

As for why replace with regex=True was not applied directly to the string type Series (non-regex replacement with replace does work on the string type), the reason is as follows

2. For a string type Series, regular expressions could not be used with the replace function (this bug has since been fixed):

# >>> pd.Series(['A','B'],dtype='string').replace(r'[A]','C',regex=True)
# 0    A
# 1    B
# dtype: string

>>> pd.Series(['A','B'],dtype='string').replace(r'[A]','C',regex=True)
0    C
1    B
dtype: string

>>> pd.Series(['A','B'],dtype='O').replace(r'[A]','C',regex=True)
0    C
1    B
dtype: object

3. If a string type Series contains missing values, both replace and str.replace still work on it

>>> pd.Series(['A',np.nan],dtype='string').replace('A','B')
0       B
1    <NA>
dtype: string
>>> pd.Series(['A',np.nan],dtype='string').str.replace('A','B')
0       B
1    <NA>
dtype: string

To sum up, use the str.replace method in general; the exception is when the replacement value must be a missing value, in which case convert to object, replace, and convert back

4. Substring matching and extraction

(1) str.extract method

1. Common usage

>>> pd.Series(['10-87', '10-88', '10-89'],dtype="string").str.extract(r'([\d]{2})-([\d]{2})')
    0   1
0  10  87
1  10  88
2  10  89

Use subgroup names as column names:

>>> pd.Series(['10-87', '10-88', '-89'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})-(?P<name_2>[\d]{2})')
  name_1 name_2
0     10     87
1     10     88
2   <NA>   <NA>

Use the ? quantifier to make a subgroup optional, so that a partial match is still extracted

>>> pd.Series(['10-87', '10-88', '-89'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})?-(?P<name_2>[\d]{2})')
  name_1 name_2
0     10     87
1     10     88
2   <NA>     89
>>> pd.Series(['10-87', '10-88', '10-'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})-(?P<name_2>[\d]{2})?')
  name_1 name_2
0     10     87
1     10     88
2     10   <NA>
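
The extracted columns are still strings. A sketch of converting them to nullable integers follows; it assumes the extracted columns come back with the string dtype, as in the outputs above:

```python
import pandas as pd

s = pd.Series(["10-87", "10-88", "-89"], dtype="string")
parts = s.str.extract(r"(?P<name_1>\d{2})?-(?P<name_2>\d{2})")

# Int64 is the nullable integer dtype, so unmatched subgroups stay missing
numbers = parts.astype("Int64")

print(numbers)
print(numbers["name_2"].sum())
```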

2. expand parameter (default is True)

For a Series and a pattern with a single subgroup, expand=False returns a Series; with more than one subgroup the expand parameter has no effect and a DataFrame is always returned

For an Index and a pattern with a single subgroup, expand=False returns the extracted Index; with more than one subgroup and expand=False, an error is raised

>>> s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")
>>> s.index
Index(['A11', 'B22', 'C33'], dtype='object')
>>> s.str.extract(r'([\w])')
     0
A11  a
B22  b
C33  c
>>> s.str.extract(r'([\w])',expand=False)
A11    a
B22    b
C33    c
dtype: string
>>> s.index.str.extract(r'([\w])')
   0
0  A
1  B
2  C
>>> s.index.str.extract(r'([\w])',expand=False)
Index(['A', 'B', 'C'], dtype='object')
>>> s.index.str.extract(r'([\w])([\d])')
   0  1
0  A  1
1  B  2
2  C  3
# s.index.str.extract(r'([\w])([\d])',expand=False) # raises an error
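
The return types described above can be checked directly. This is a minimal sketch; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], index=["A11", "B22", "C33"], dtype="string")

one_group_frame = s.str.extract(r"(\w)")                 # expand=True by default
one_group_series = s.str.extract(r"(\w)", expand=False)  # a Series
two_groups = s.str.extract(r"(\w)(\d)", expand=False)    # expand is ignored

# on an Index, two subgroups with expand=False are rejected
try:
    s.index.str.extract(r"(\w)(\d)", expand=False)
    index_error = None
except ValueError as exc:
    index_error = exc

print(type(one_group_frame).__name__, type(one_group_series).__name__)
print(type(two_groups).__name__, index_error)
```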

(2) str.extractall method

Unlike extract, which returns only the first match, extractall finds every match in each string and builds a multi-level index (even if only one match is found)

>>> s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],dtype="string")
>>> two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
>>> s.str.extract(two_groups, expand=True)
  letter digit
A      a     1
B      b     1
C      c     1
>>> s.str.extractall(two_groups)
        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1
>>> s['A']='a1'
>>> s.str.extractall(two_groups)
        letter digit
  match
A 0          a     1
B 0          b     1
C 0          c     1

To view the i-th match of every row, use the xs method:

>>> s = pd.Series(["a1a2", "b1b2", "c1c2"], index=["A", "B", "C"],dtype="string")
>>> s.str.extractall(two_groups).xs(1,level='match')
  letter digit
A      a     2
B      b     2
C      c     2
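
Because the result of extractall has one row per match, the number of matches per original row can be counted by grouping on the outer index level. A sketch under that assumption:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

matches = s.str.extractall(two_groups)

# level 0 is the original index; each row below it is one match
per_row = matches.groupby(level=0).size()

# xs(0, level='match') keeps only the first match of every row
first_only = matches.xs(0, level="match")

print(per_row)
print(first_only)
```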

(3) str.contains and str.match

The role of the former is to detect whether it contains a certain regular pattern:

>>> pd.Series(['1', None, '3a', '3b', '03c'], dtype="string").str.contains(r'[0-9][a-z]')
0    False
1     <NA>
2     True
3     True
4     True
dtype: boolean

The optional na argument sets the value returned for missing entries:

>>> pd.Series(['1', None, '3a', '3b', '03c'], dtype="string").str.contains('a', na=False)
0    False
1    False
2     True
3    False
4    False
dtype: boolean

str.match differs in that it relies on Python's re.match, so the pattern must match starting at the beginning of the string:

>>> pd.Series(['1', None, '3a_', '3b', '03c'], dtype="string").str.match(r'[0-9][a-z]',na=False)
0    False
1    False
2     True
3     True
4    False
dtype: boolean
>>> pd.Series(['1', None, '_3a', '3b', '03c'], dtype="string").str.match(r'[0-9][a-z]',na=False)
0    False
1    False
2    False
3     True
4    False
dtype: boolean
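
The difference can be laid out in a single table. The sketch below also includes str.fullmatch, which is not covered in the notes above and requires the whole string to match (available from pandas 1.1):

```python
import pandas as pd

s = pd.Series(["3a_", "_3a", "3b", None], dtype="string")
pat = r"[0-9][a-z]"

summary = pd.DataFrame({
    "contains": s.str.contains(pat, na=False),    # pattern anywhere
    "match": s.str.match(pat, na=False),          # pattern at the start
    "fullmatch": s.str.fullmatch(pat, na=False),  # pattern is the whole string
})

print(summary)
```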

5. Common string methods

(1) Filtering method

1. str.strip

Commonly used to filter spaces:

>>> pd.Series(list('abc'),index=[' space1  ','space2  ','  space3'],dtype="string").index.str.strip()
Index(['space1', 'space2', 'space3'], dtype='object')
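
The one-sided variants and a custom character set follow the same pattern. A short sketch with made-up labels:

```python
import pandas as pd

labels = pd.Series(["  space1  ", "--space2--"], dtype="string")

left = labels.str.lstrip()      # strip whitespace on the left only
right = labels.str.rstrip()     # strip whitespace on the right only
dashes = labels.str.strip("-")  # strip the given characters instead of whitespace

print(left.tolist(), right.tolist(), dashes.tolist())
```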

2. str.lower and str.upper

>>> pd.Series('A',dtype="string").str.lower()
0    a
dtype: string
>>> pd.Series('a',dtype="string").str.upper()
0    A
dtype: string

3. str.swapcase and str.capitalize

These swap the case of every letter and capitalize the first letter (lowercasing the rest), respectively:

>>> pd.Series('abCD',dtype="string").str.swapcase()
0    ABcd
dtype: string
>>> pd.Series('abCD',dtype="string").str.capitalize()
0    Abcd
dtype: string

(2) isnumeric method

Check whether every character of each string is numeric:

>>> pd.Series(['1.2','1','-0.3','a',np.nan],dtype="string").str.isnumeric()
0    False
1     True
2    False
3    False
4     <NA>
dtype: boolean
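
As the output shows, isnumeric rejects '1.2' and '-0.3' because the dot and the minus sign are not numeric characters. To test whether a string parses as a number, one option is pd.to_numeric with errors='coerce'. This is a sketch of an alternative, not part of the original notes:

```python
import pandas as pd

s = pd.Series(["1.2", "1", "-0.3", "a"])

only_digits = s.str.isnumeric()

# strings that cannot be parsed become missing values
parsed = pd.to_numeric(s, errors="coerce")
looks_like_number = parsed.notna()

print(only_digits.tolist())
print(looks_like_number.tolist())
```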

Origin blog.csdn.net/qq_43300880/article/details/125020302