String Manipulation with pandas

String object methods

import pandas as pd
import numpy as np
val='a,b, guido'
val.split(',') # normal python built-in method split
['a', 'b', ' guido']
pieces=[x.strip() for x in val.split(',')];pieces  # strip whitespace
['a', 'b', 'guido']
'::'.join(pieces)
'a::b::guido'
val.count(',')
2
val.count('guido')
1
val.replace(',',':')
'a:b: guido'
val.swapcase()
'A,B, GUIDO'
val[::-1]
'odiug ,b,a'

Regular expressions

The re module functions fall into three categories: pattern matching, substitution, and splitting.

import re
text='foo   bar\t baz  \t qux'
re.split(r'\s+',text)  # raw string, so \s reaches the regex engine unchanged
['foo', 'bar', 'baz', 'qux']
regex=re.compile(r'\s+')  # compile once to reuse the same pattern
regex.split(text)
['foo', 'bar', 'baz', 'qux']
regex.findall(text)
['   ', '\t ', '  \t ']
  • To avoid unwanted escaping with \ in a regular expression, use raw string literals, for example r'C:\x' instead of the equivalent 'C:\\x'.
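As a small illustration of the raw-string point above, the sketch below compares a normal and a raw literal. The sample path is made up for this example and is not from the original post.

```python
import re

# In a normal literal, '\t' is a single tab character;
# in a raw literal, r'\t' is two characters: a backslash and 't'.
print(len('\t'), len(r'\t'))

# Matching one literal backslash takes four backslashes in a normal
# literal, but only two in a raw literal.
path = 'C:\\temp\\data'          # the string C:\temp\data
plain = re.split('\\\\', path)   # regex engine receives \\
raw = re.split(r'\\', path)      # same pattern, easier to read
print(plain)
print(raw)
```

Both calls split on the same pattern; the raw form simply avoids the double layer of escaping.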
text="""Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern=r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex=re.compile(pattern,re.I)

Using findall() returns a list of all the email addresses found in the text.

regex.findall(text)
['[email protected]', '[email protected]', '[email protected]', '[email protected]']
regex.findall(r' [email protected]')
['[email protected]']

search() returns a match object for the first email address in the text only.

m=regex.search(text)
m
<re.Match object; span=(5, 20), match='[email protected]'>
regex.match(text)
text[m.start():m.end()]
'[email protected]'

regex.match(text) returns None, because match() only succeeds if the pattern occurs at the start of the string.
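To make the difference between anchoring at the start and scanning the whole string concrete, here is a minimal sketch. It reuses the email pattern from the text; the sample line and its address are placeholders invented for this example.

```python
import re

regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.I)

# Hypothetical sample line; the address is a placeholder.
line = 'Dave dave@example.com'

at_start = regex.match(line)     # None: the line begins with 'Dave ', not an address
anywhere = regex.search(line)    # scans forward and finds the first address
whole = regex.fullmatch('dave@example.com')  # requires the entire string to match

print(at_start)
print(anywhere.group(), anywhere.span())
print(whole is not None)
```

match() is the right choice when the string is expected to begin with the pattern; search() is the one to use when the pattern may appear anywhere.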

sub() returns a new string in which each occurrence of the pattern is replaced by the given string.

print(regex.sub('REDACTED',text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
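sub() can also reuse parts of each match in the replacement. The following is a sketch under assumptions, not the author's code: it adds capture groups to the pattern from the text and uses placeholder addresses.

```python
import re

# Same pattern as in the text, with parentheses added around the
# username, the domain name, and the domain suffix.
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.I)

# Hypothetical sample text; the addresses are placeholders.
text = "Dave dave@example.com\nRob rob@example.org\n"

m = regex.search(text)
print(m.groups())   # tuple with the three captured pieces of the first match

# \1, \2 and \3 in the replacement refer to the captured groups.
masked = regex.sub(r'\1 at \2 dot \3', text)
print(masked)
```

The replacement string is a raw literal for the same reason as the pattern: \1 must reach the re module as a backslash followed by a digit.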

Vectorized string functions in pandas

data={'Dave':'[email protected]','Steve':'[email protected]','Rob':'[email protected]','Wes':np.nan}
data=pd.Series(data);data
Dave     [email protected]
Steve    [email protected]
Rob        [email protected]
Wes                  NaN
dtype: object
data.isnull()
Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool
data.str.contains('gmail')
Dave     False
Steve    False
Rob       True
Wes        NaN
dtype: object
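Because the missing entry comes back as NaN, the result above cannot be used directly as a boolean filter. A minimal sketch of one way around this, using the na parameter of str.contains and placeholder data invented for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the addresses are placeholders.
s = pd.Series({'Dave': 'dave@example.com',
               'Rob': 'rob@example.org',
               'Wes': np.nan})

# na=False treats missing entries as "no match", so the mask has no NaN.
# regex=False makes the '.' in the search string a literal dot.
mask = s.str.contains('example.org', regex=False, na=False)
print(mask)
print(s[mask])
```

Whether a missing value should count as False or be kept as missing depends on the analysis; na=False is just the convenient choice for filtering.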
data
Dave     [email protected]
Steve    [email protected]
Rob        [email protected]
Wes                  NaN
dtype: object
data.map(lambda x:x[:2],na_action='ignore')  # x is each value in data; the returned Series has the same index as the caller (data)
Dave      da
Steve     st
Rob       ro
Wes      NaN
dtype: object
help(data.map)
Help on method map in module pandas.core.series:

map(arg, na_action=None) method of pandas.core.series.Series instance
    Map values of Series using input correspondence (a dict, Series, or
    function).
    
    Parameters
    ----------
    arg : function, dict, or Series
        Mapping correspondence.
    na_action : {None, 'ignore'}
        If 'ignore', propagate NA values, without passing them to the
        mapping correspondence.
    
    Returns
    -------
    y : Series
        Same index as caller.
    
    Examples
    --------
    
    Map inputs to outputs (both of type `Series`):
    
    >>> x = pd.Series([1,2,3], index=['one', 'two', 'three'])
    >>> x
    one      1
    two      2
    three    3
    dtype: int64
    
    >>> y = pd.Series(['foo', 'bar', 'baz'], index=[1,2,3])
    >>> y
    1    foo
    2    bar
    3    baz
    
    >>> x.map(y)
    one   foo
    two   bar
    three baz
    
    If `arg` is a dictionary, return a new Series with values converted
    according to the dictionary's mapping:
    
    >>> z = {1: 'A', 2: 'B', 3: 'C'}
    
    >>> x.map(z)
    one   A
    two   B
    three C
    
    Use na_action to control whether NA values are affected by the mapping
    function.
    
    >>> s = pd.Series([1, 2, 3, np.nan])
    
    >>> s2 = s.map('this is a string {}'.format, na_action=None)
    0    this is a string 1.0
    1    this is a string 2.0
    2    this is a string 3.0
    3    this is a string nan
    dtype: object
    
    >>> s3 = s.map('this is a string {}'.format, na_action='ignore')
    0    this is a string 1.0
    1    this is a string 2.0
    2    this is a string 3.0
    3                     NaN
    dtype: object
    
    See Also
    --------
    Series.apply : For applying more complex functions on a Series.
    DataFrame.apply : Apply a function row-/column-wise.
    DataFrame.applymap : Apply a function elementwise on a whole DataFrame.
    
    Notes
    -----
    When `arg` is a dictionary, values in Series that are not in the
    dictionary (as keys) are converted to ``NaN``. However, if the
    dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
    provides a method for default values), then this default is used
    rather than ``NaN``:
    
    >>> from collections import Counter
    >>> counter = Counter()
    >>> counter['bar'] += 1
    >>> y.map(counter)
    1    0
    2    1
    3    0
    dtype: int64
pattern
'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'
data.str.findall(pattern,flags=re.I)
Dave     [[email protected]]
Steve    [[email protected]]
Rob        [[email protected]]
Wes                    NaN
dtype: object
matches=data.str.match(pattern,flags=re.I);matches
Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
matches.str.get(1)
Dave    NaN
Steve   NaN
Rob     NaN
Wes     NaN
dtype: float64
matches.str[0]
Dave    NaN
Steve   NaN
Rob     NaN
Wes     NaN
dtype: float64
data.str[:5]
Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object
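Since str.match returns booleans, as its output above shows, indexing into matches with str.get(1) or str[0] does not recover any part of the address. One way to get the pieces, sketched here as an assumption rather than the author's method, is to add capture groups to the pattern and pass it to Series.str.extract. The data below is made up for this example.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical data; the addresses are placeholders.
s = pd.Series({'Dave': 'dave@example.com',
               'Rob': 'rob@example.org',
               'Wes': np.nan})

# Same pattern as in the text, with three capture groups added.
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; rows with missing input stay missing.
parts = s.str.extract(pattern, flags=re.I)
print(parts)
```

The result is a DataFrame with columns 0, 1 and 2 holding the username, domain name, and suffix, indexed like the original Series.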

Reposted from www.cnblogs.com/johnyang/p/12715387.html