Initializing the data
import pandas as pd
import numpy as np

index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
data = {
    "age": [18, 30, np.nan, 40, np.nan, 30],
    "city": ["Bei Jing", "Shang Hai", "Guang Zhou", "Shen Zhen", np.nan, " "],
    "sex": [None, "male", "female", "male", np.nan, "unknown"],
    "birth": ["2000-02-10", "1988-10-17", None, "1978-08-08", np.nan, "1988-10-17"]
}
user_info = pd.DataFrame(data=data, index=index)

# Convert the birth column from strings to datetimes
user_info["birth"] = pd.to_datetime(user_info.birth)
user_info
Why use the str accessor
Text data is what we usually call strings. pandas provides the str accessor on Series, which makes it easy to operate on each element. As we learned earlier, when processing each element of a Series we can also use the map or apply methods.
# Convert each city to lowercase:
user_info.city.map(lambda x: x.lower())
What? It raises an error, because a float object has no lower attribute. The cause is that the missing values (np.nan) are floats.
This is where the str accessor comes in. Let's see how to use it.
# Convert the text to lowercase
user_info.city.str.lower()
# Count the length of each string
user_info.city.str.len()
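To make the difference concrete, here is a minimal sketch on a small, hypothetical Series (the sample values are assumptions, not the article's data). It shows that map fails on the missing value while the str accessor simply passes it through as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
city = pd.Series(["Bei Jing", np.nan, "Shen Zhen"], name="city")

# Calling a string method on every element fails on the missing value,
# because np.nan is a float and has no lower() method
try:
    city.map(lambda x: x.lower())
    map_failed = False
except AttributeError:
    map_failed = True

# The str accessor skips missing values and keeps them as NaN
lowered = city.str.lower()
lengths = city.str.len()

print(map_failed)
print(lowered.tolist())
print(lengths.tolist())
```

If map has to be used anyway, one option is to guard the function, for example by checking isinstance(x, str) before calling the string method.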
Replacing and splitting
Replace operation
# Replace spaces with underscores:
user_info.city.str.replace(" ", "_")
# Use a regular expression to replace every city that begins with S with an empty string
# (recent pandas versions need regex=True for the pattern to be treated as a regex):
user_info.city.str.replace("^S.*", "", regex=True)
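The following sketch contrasts literal and regex replacement on a small, hypothetical Series (the values are assumptions for illustration). Passing regex explicitly avoids relying on the default, which has changed between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
city = pd.Series(["Bei Jing", "Shang Hai", np.nan, "Shen Zhen"])

# Literal replacement: every space becomes an underscore
underscored = city.str.replace(" ", "_", regex=False)

# Regex replacement: a value that starts with "S" is replaced entirely by ""
without_s = city.str.replace(r"^S.*", "", regex=True)

print(underscored.tolist())
print(without_s.tolist())
```

Missing values are left untouched by both calls.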
Split operation
# Split the column on a space:
user_info.city.str.split(" ")
"""
name
Tom        [Bei, Jing]
Bob       [Shang, Hai]
Mary     [Guang, Zhou]
James     [Shen, Zhen]
Andy               NaN
Alice             [, ]
Name: city, dtype: object
"""

# Elements of the split lists can be accessed with get or []:
user_info.city.str.split(" ").str.get(0)
"""
name
Tom        Bei
Bob      Shang
Mary     Guang
James     Shen
Andy       NaN
Alice
Name: city, dtype: object
"""

user_info.city.str.split(" ").str[1]
"""
name
Tom      Jing
Bob       Hai
Mary     Zhou
James    Zhen
Andy      NaN
Alice
Name: city, dtype: object
"""

# Setting expand=True expands the result into a DataFrame:
user_info.city.str.split(" ", expand=True)
"""
           0     1
name
Tom      Bei  Jing
Bob    Shang   Hai
Mary   Guang  Zhou
James   Shen  Zhen
Andy     NaN   NaN
Alice
"""
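As a small sketch of how split behaves when the strings contain different numbers of pieces, the example below uses hypothetical values (an assumption, not the article's data). It also uses the n parameter of split, which limits the number of splits and is not covered in the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two pieces, three pieces, and a missing value
city = pd.Series(["Bei Jing", "Shen Zhen Shi", np.nan],
                 index=["Tom", "James", "Andy"])

parts = city.str.split(" ")      # each element becomes a list of pieces
first = parts.str.get(0)         # first piece of each list
last = parts.str[-1]             # last piece of each list

# expand=True pads shorter rows with missing values
wide = city.str.split(" ", expand=True)

# n=1 splits at most once, so the remainder stays together
limited = city.str.split(" ", n=1, expand=True)

print(wide)
print(limited)
```

A row that was missing in the input stays missing in every output column.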
Extract a substring
Extract substrings from a longer string.
Extract the word characters before the first whitespace match
user_info.city.str.extract(r"(\w+)\s+", expand=True)
extract returns only the first match in each string. It takes a regular expression that contains at least one capture group. Passing expand=True ensures that a DataFrame is always returned.
\s+ : one or more whitespace characters
(\w+) : a capture group that matches one or more word characters
(\w+)\s+ : captures the word characters that come right before one or more whitespace characters
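The pattern pieces above can be tried out with a short sketch. The sample values are hypothetical, and the second one deliberately contains no whitespace, so it does not match.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one match, one non-match, one missing value
city = pd.Series(["Bei Jing", "ShenZhen", np.nan])

# One capture group with expand=True: a DataFrame with a single column 0
first_word = city.str.extract(r"(\w+)\s+", expand=True)

# The same pattern with expand=False returns a Series instead
first_word_series = city.str.extract(r"(\w+)\s+", expand=False)

print(first_word)
print(first_word_series)
```

Rows that do not match the pattern, as well as missing rows, come back as NaN.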
Extract the word characters before and after the first whitespace match
user_info.city.str.extract(r"(\w+)\s+(\w+)", expand=True)
If the regular expression contains several capture groups, extract returns a DataFrame with one column per group.
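A small sketch of the multi-group case, again on hypothetical values. It also uses named groups, (?P<name>...), which are an addition to the text above: pandas uses the group names as column names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one match, one non-match, one missing value
city = pd.Series(["Bei Jing", "ShenZhen", np.nan])

# Two unnamed groups: the columns are numbered 0 and 1
numbered = city.str.extract(r"(\w+)\s+(\w+)", expand=True)

# Two named groups: the columns take the group names
named = city.str.extract(r"(?P<first>\w+)\s+(?P<second>\w+)", expand=True)

print(numbered)
print(named)
```

Named groups can make the result easier to read when the pattern has more than one group.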
Use extractall to match all substrings
Extract every run of word characters that is followed by whitespace:
user_info.city.str.extractall(r"(\w+)\s+")
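The shape of the extractall result is easier to see on a tiny example. The values below are hypothetical. The result has one row per match, indexed by the original label plus a match level, and strings without any match produce no rows at all.

```python
import pandas as pd

# Hypothetical sample data: two matches, one match, and no match
city = pd.Series(["Bei Jing Shi", "Shen Zhen", "Wuhan"])

# Every run of word characters that is followed by whitespace
matches = city.str.extractall(r"(\w+)\s+")

print(matches)
print(matches.index.names)
```

To get back to one row per original element, the match level can be unstacked, for example with matches[0].unstack("match").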
Testing whether a string contains a substring
Use contains to test whether each element contains a given substring.
Test whether the city contains the substring "Zh":
user_info.city.str.contains("Zh")
Test whether the city begins with the letter "S":
user_info.city.str.contains("^S")
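When the result of contains is used to filter rows, missing values get in the way, because they come back as missing rather than True or False. The sketch below, on hypothetical values, uses the na parameter to treat missing values as False. It also shows startswith as an alternative to the "^S" pattern; both are additions to the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
city = pd.Series(["Bei Jing", "Guang Zhou", "Shen Zhen", np.nan])

# na=False turns missing values into False, giving a clean boolean mask
has_zh = city.str.contains("Zh", na=False)

# Two ways to test for a leading "S": a regex anchor, or startswith
starts_s_regex = city.str.contains("^S", regex=True, na=False)
starts_s = city.str.startswith("S", na=False)

# The mask can be used directly for filtering
print(city[has_zh].tolist())
print(starts_s.tolist())
```

startswith takes a plain string rather than a regular expression, so no characters need escaping.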
Method Summary