Pandas text data processing

 

Initializing the data

import pandas as pd
import numpy as np
index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
data = {
    "age": [18, 30, np.nan, 40, np.nan, 30],
    "city": ["Bei Jing", "Shang Hai", "Guang Zhou", "Shen Zhen", np.nan, " "],
    "sex": [None, "male", "female", "male", np.nan, "unknown"],
    "birth": ["2000-02-10", "1988-10-17", None, "1978-08-08", np.nan, "1988-10-17"]
}
user_info = pd.DataFrame(data=data, index=index)
# Convert the birth column from strings to datetimes
user_info["birth"] = pd.to_datetime(user_info.birth)
user_info

 

Why use the str accessor

  Text data is what we usually call strings. Pandas gives every Series a str accessor, which makes it easy to operate on each element. As covered earlier, we can also process each element of a Series with the map or apply methods.

# Convert every city to lowercase:
user_info.city.map(lambda x: x.lower())

  What? It raises an error. The reason is that an object of type float has no lower attribute, and the missing values (np.nan) are floats.
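The failure can be reproduced on a small sample. This is only a sketch: the `cities` Series is illustrative data, and `dtype=object` is set explicitly (an assumption made here so that the missing value is stored as a plain float NaN regardless of the pandas version). It also shows that `map` accepts `na_action="ignore"`, which is one possible way to skip missing values.

```python
import numpy as np
import pandas as pd

# Illustrative sample; dtype=object keeps the missing value as a float NaN.
cities = pd.Series(["Bei Jing", "Shen Zhen", np.nan], dtype=object)

# Calling a string method on every element fails on the NaN entry.
try:
    cities.map(lambda x: x.lower())
    map_error = None
except AttributeError as exc:
    map_error = str(exc)  # the message names the missing 'lower' attribute

# map skips missing values when na_action="ignore" is passed.
lowered_with_map = cities.map(lambda x: x.lower(), na_action="ignore")

print(map_error)
print(lowered_with_map)
```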

  This is where the str accessor comes in. Let's see how to use it.

# Convert the text to lowercase
user_info.city.str.lower()
# Count the length of each string
user_info.city.str.len()
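As a small self-contained check of how the accessor treats missing values, the sketch below uses an illustrative `cities` Series (not the `user_info` frame above). Missing entries should simply stay missing instead of raising an error.

```python
import numpy as np
import pandas as pd

# Illustrative sample with one missing value and one blank entry.
cities = pd.Series(["Bei Jing", "Shang Hai", np.nan, " "],
                   index=["Tom", "Bob", "Andy", "Alice"])

lowered = cities.str.lower()   # the missing value stays missing
lengths = cities.str.len()     # the missing value has no length

print(lowered)
print(lengths)
```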

Replacing and splitting

Replace operation

# Replace spaces with underscores:
user_info.city.str.replace(" ", "_")
# Use a regular expression to replace every city that begins with S with a single space
# (regex=True is needed because recent pandas versions treat the pattern as a literal by default):
user_info.city.str.replace(r"^S.*", " ", regex=True)
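The difference between a literal and a regular-expression replacement can be sketched on a small illustrative Series. Passing `regex` explicitly is a choice made here to keep the behavior the same across pandas versions, since the default for this parameter has changed over time.

```python
import numpy as np
import pandas as pd

# Illustrative sample, not the user_info frame.
cities = pd.Series(["Bei Jing", "Shang Hai", "Shen Zhen", np.nan])

# Literal replacement: every space becomes an underscore.
underscored = cities.str.replace(" ", "_", regex=False)

# Regex replacement: a value starting with "S" is replaced entirely by one space.
blanked = cities.str.replace(r"^S.*", " ", regex=True)

print(underscored.tolist())
print(blanked.tolist())
```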

Split operation

# Split the column on spaces:
user_info.city.str.split(" ")
"""
name
Tom        [Bei, Jing]
Bob       [Shang, Hai]
Mary     [Guang, Zhou]
James     [Shen, Zhen]
Andy               NaN
Alice             [, ]
Name: city, dtype: object
"""

# The elements of the resulting lists can be accessed with get or the [] operator:
user_info.city.str.split(" ").str.get(0)
"""
name
Tom        Bei
Bob      Shang
Mary     Guang
James     Shen
Andy       NaN
Alice
Name: city, dtype: object
"""

user_info.city.str.split(" ").str[1]
"""
name
Tom      Jing
Bob       Hai
Mary     Zhou
James    Zhen
Andy      NaN
Alice
Name: city, dtype: object
"""

# Setting expand=True expands the result into a DataFrame:
user_info.city.str.split(" ", expand=True)
"""
           0     1
name
Tom      Bei  Jing
Bob    Shang   Hai
Mary   Guang  Zhou
James   Shen  Zhen
Andy     NaN   NaN
Alice
"""

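The split steps above can be sketched end to end on a small illustrative Series. The column names `first_word` and `second_word` are hypothetical labels chosen for this example; they are not part of the original data.

```python
import numpy as np
import pandas as pd

# Illustrative sample, not the user_info frame.
cities = pd.Series(["Bei Jing", "Guang Zhou", np.nan],
                   index=["Tom", "Mary", "Andy"])

pieces = cities.str.split(" ")   # each non-missing element becomes a list
first = pieces.str.get(0)        # same result as pieces.str[0]
second = pieces.str[1]

# expand=True spreads the pieces over columns; the names are hypothetical.
parts = cities.str.split(" ", expand=True)
parts.columns = ["first_word", "second_word"]

print(parts)
```

Renaming the columns right after expanding is only a convenience: it avoids carrying the integer labels 0 and 1 into later steps.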
Extracting substrings

  Extracting substrings from a longer string.

Extract the letters before the first whitespace match

user_info.city.str.extract(r"(\w+)\s+", expand=True)
extract returns only the first matching substring. It accepts a regular expression that contains at least one capture group; specifying expand=True ensures that a DataFrame is returned every time.
 

\s+ : one or more whitespace characters
(\w+) : a capture group that matches one or more word characters
(\w+)\s+ : captures the word characters that come immediately before one or more whitespace characters
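To see what these pattern pieces do in isolation, here is a minimal sketch using Python's standard `re` module on two illustrative strings; pandas is not involved.

```python
import re

pattern = re.compile(r"(\w+)\s+")

match = pattern.search("Guang Zhou")
captured = match.group(1)   # the text captured by (\w+)
whole = match.group(0)      # the capture plus the whitespace that follows it

# Without any whitespace in the string there is nothing to match.
no_match = pattern.search("GuangZhou")

print(captured, repr(whole), no_match)
```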

Extract the letters before and after the first whitespace match

user_info.city.str.extract(r"(\w+)\s+(\w+)", expand=True)
If the regular expression contains several capture groups, a DataFrame is returned with one column per group.
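A small sketch of both cases, one capture group and several, on illustrative data. The named groups `first` and `second` are hypothetical names added for this example; with named groups, extract uses the group names as column labels.

```python
import numpy as np
import pandas as pd

# Illustrative sample; "ShenZhen" has no whitespace on purpose.
cities = pd.Series(["Bei Jing", "Guang Zhou", "ShenZhen", np.nan],
                   index=["Tom", "Mary", "James", "Andy"])

# One capture group: a one-column DataFrame because expand=True.
before = cities.str.extract(r"(\w+)\s+", expand=True)

# Two named groups: the group names become the column names.
both = cities.str.extract(r"(?P<first>\w+)\s+(?P<second>\w+)", expand=True)

print(before)
print(both)
```

Rows where the pattern does not match, such as the entry without whitespace and the missing value, come back as missing values.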

Using extractall to get every match

Extract every group of letters that is followed by whitespace:
user_info.city.str.extractall(r"(\w+)\s+")
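Unlike extract, extractall returns one row per match, indexed by the original label plus a `match` level. The sketch below uses illustrative data; the trailing space in the first value is added deliberately so that the string produces two matches.

```python
import numpy as np
import pandas as pd

# Illustrative sample; "Bei Jing " ends with a space so both words match.
cities = pd.Series(["Bei Jing ", "Guang Zhou", np.nan],
                   index=pd.Index(["Tom", "Mary", "Andy"], name="name"))

matches = cities.str.extractall(r"(\w+)\s+")

print(matches)
```

Entries with no match at all, including missing values, do not appear in the result.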

Testing whether a string contains a substring

Use contains to test whether each string contains a given substring.
Test whether the city contains the substring "Zh":
user_info.city.str.contains("Zh")
Test whether the city begins with the letter "S":
user_info.city.str.contains("^S")
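Because missing values stay missing in the result of contains, the result cannot always be used directly as a filter. One possible approach, sketched here on illustrative data, is to pass `na=False` so that missing entries count as "no match".

```python
import numpy as np
import pandas as pd

# Illustrative sample, not the user_info frame.
cities = pd.Series(["Bei Jing", "Guang Zhou", "Shen Zhen", np.nan, " "],
                   index=["Tom", "Mary", "James", "Andy", "Alice"])

# na=False turns missing entries into False, giving a plain boolean mask.
has_zh = cities.str.contains("Zh", na=False)
starts_with_s = cities.str.contains("^S", na=False)

print(cities[has_zh])
print(cities[starts_with_s])
```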

Method Summary

 

 

 

 

 

 

 


Origin www.cnblogs.com/zry-yt/p/11803278.html