Have you ever wondered why the data you are handed is always so messy?

Data cleaning is an essential part of a data analyst's work. When the data is very messy, cleaning it can take up a lot of time, so every extra cleaning method you master pays off quickly.

With that in mind, this article covers the very handy vectorized string functions that Pandas exposes through the str accessor. After learning them, I felt my data cleaning skills improve right away.

1 data set, 16 Pandas functions

The data set was made up by Huang purely for teaching purposes. It is as follows:

import pandas as pd

df ={'姓名':[' 黄同学','黄至尊','黄老邪 ','陈大美','孙尚香'],
     '英文名':['Huang tong_xue','huang zhi_zun','Huang Lao_xie','Chen Da_mei','sun shang_xiang'],
     '性别':['男','women','men','女','男'],
     '身份证':['463895200003128433','429475199912122345','420934199110102311','431085200005230122','420953199509082345'],
     '身高':['mid:175_good','low:165_bad','low:159_bad','high:180_verygood','low:172_bad'],
     '家庭住址':['湖北广水','河南信阳','广西桂林','湖北孝感','广东广州'],
     '电话号码':['13434813546','19748672895','16728613064','14561586431','19384683910'],
     '收入':['1.1万','8.5千','0.9万','6.5千','2.0万']}
df = pd.DataFrame(df)
df

The result is a DataFrame with 5 rows and 8 columns, all of which hold strings.

Looking at the data above, it is clearly messy. Next, we will use 16 Pandas string functions to clean it.

① cat function: used for string concatenation

df["姓名"].str.cat(df["家庭住址"],sep='-'*3)

The result is a Series of joined strings, for example ' 黄同学---湖北广水' for the first row and '黄至尊---河南信阳' for the second.
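Two behaviours of cat are worth knowing when the data has gaps. The following is a minimal sketch on a small made-up sample (not the data set above): a missing value on either side makes the joined value missing unless na_rep is given, and calling cat without another Series joins the whole column into one string.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
names = pd.Series(["黄同学", "黄至尊", "陈大美"])
cities = pd.Series(["湖北广水", None, "湖北孝感"])

# A missing value on either side makes the joined value missing.
joined = names.str.cat(cities, sep="-")

# na_rep supplies a placeholder for missing values instead.
joined_filled = names.str.cat(cities, sep="-", na_rep="未知")

# With no other Series, cat joins all elements into a single string.
all_names = names.str.cat(sep=",")

print(joined.tolist())
print(joined_filled.tolist())
print(all_names)
```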

② contains: Determine whether a string contains a given substring or pattern

df["家庭住址"].str.contains("广")

The result is a boolean Series: True, False, True, False, True.
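By default contains treats its argument as a regular expression, which matters as soon as the search text has characters such as a dot in it. A small sketch on made-up values shows the difference between regex and literal matching, and how na=False turns missing values into "no match" so the result can be used directly as a filter:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
s = pd.Series(["a.b", "axb", None, "湖北广水"])

# The pattern is a regular expression by default, so "." matches any character.
as_regex = s.str.contains("a.b", na=False)

# regex=False treats the pattern as a literal substring.
as_literal = s.str.contains("a.b", regex=False, na=False)

# With na=False the result is a clean boolean mask that can filter rows.
filtered = s[as_literal]

print(as_regex.tolist())
print(as_literal.tolist())
print(filtered.tolist())
```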

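③ startswith/endswith: Determine whether a string starts/ends with a given string

# The name in the first row, " 黄同学", starts with a space
df["姓名"].str.startswith("黄") 
df["英文名"].str.endswith("e")

startswith returns False, True, True, False, False (the first name begins with a space, so it does not match), and endswith returns True, False, True, False, False.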
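The leading space is a typical reason for a surprising False. As a minimal sketch using the same names, stripping whitespace before the test gives the intended answer:

```python
import pandas as pd

names = pd.Series([" 黄同学", "黄至尊", "黄老邪 ", "陈大美", "孙尚香"])

# startswith compares literally, so a leading space prevents a match.
raw = names.str.startswith("黄")

# Stripping surrounding whitespace first gives the intended result.
cleaned = names.str.strip().str.startswith("黄")

print(raw.tolist())
print(cleaned.tolist())
```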

④ count: Count the number of times a given character or pattern appears in the string

df["电话号码"].str.count("3")

The result is 3, 0, 1, 1, 2.
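The argument to count is a regular expression, so special characters need escaping. A short sketch on made-up income strings illustrates what happens with and without the escape:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
income = pd.Series(["1.1万", "8.5千", "10万"])

# "." is a regex wildcard here, so this counts every character.
any_char = income.str.count(".")

# Escaping the dot counts literal decimal points only.
dots = income.str.count(r"\.")

# A character class such as \d counts digits.
digits = income.str.count(r"\d")

print(any_char.tolist())
print(dots.tolist())
print(digits.tolist())
```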

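⑤ get: Get the element at the specified position

df["姓名"].str.get(-1)
df["身高"].str.split(":")
df["身高"].str.split(":").str.get(0)

get(-1) returns the last character of each name (for '黄老邪 ' that is the trailing space), split(":") returns lists such as ['mid', '175_good'], and get(0) on those lists returns mid, low, low, high, low.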
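get returns a single element. For a range of positions, pandas also offers str.slice, which is not one of the 16 functions covered here. The sketch below assumes that characters 7 to 14 of an 18-digit ID number hold the birth date as YYYYMMDD, which matches the sample IDs above:

```python
import pandas as pd

ids = pd.Series(["463895200003128433", "429475199912122345"])

# Characters at index 6 up to, but not including, 14 (assumed to be YYYYMMDD).
birth_text = ids.str.slice(6, 14)

# Convert to real dates so they can be compared or used in date arithmetic.
birth_date = pd.to_datetime(birth_text, format="%Y%m%d")

print(birth_text.tolist())
print(birth_date.dt.year.tolist())
```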

⑥ len: Calculate the length of the string

df["性别"].str.len()

The result is 1, 5, 3, 1, 1.

⑦ upper/lower: English case conversion

df["英文名"].str.upper()
df["英文名"].str.lower()

upper returns strings such as 'HUANG TONG_XUE', and lower returns strings such as 'huang tong_xue'.

⑧ pad + side parameter/center: add a given character to the left side, the right side, or both sides of the string

df["家庭住址"].str.pad(10,fillchar="*")      # side defaults to "left": pads on the left, equivalent to rjust()
df["家庭住址"].str.pad(10,side="right",fillchar="*")    # pads on the right, equivalent to ljust()
df["家庭住址"].str.center(10,fillchar="*")

For '湖北广水' the three calls return '******湖北广水', '湖北广水******' and '***湖北广水***'.
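The naming is easy to mix up: padding on the left pushes the text to the right, so side="left" corresponds to rjust, not ljust. A minimal sketch that checks each pad call against its dedicated equivalent:

```python
import pandas as pd

addr = pd.Series(["湖北广水", "河南信阳"])

left_padded = addr.str.pad(10, side="left", fillchar="*")    # fill on the left
right_padded = addr.str.pad(10, side="right", fillchar="*")  # fill on the right
both_padded = addr.str.pad(10, side="both", fillchar="*")    # fill on both sides

# Compare each pad call with rjust, ljust and center respectively.
same_as_rjust = left_padded.equals(addr.str.rjust(10, fillchar="*"))
same_as_ljust = right_padded.equals(addr.str.ljust(10, fillchar="*"))
same_as_center = both_padded.equals(addr.str.center(10, fillchar="*"))

print(left_padded[0], right_padded[0], both_padded[0])
print(same_as_rjust, same_as_ljust, same_as_center)
```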

⑨ repeat: repeat the string several times

df["性别"].str.repeat(3)

Each value is repeated three times, for example '男男男' and 'womenwomenwomen'.

⑩ slice_replace: Use the given string to replace the characters in the specified slice

df["电话号码"].str.slice_replace(4,8,"*"*4)

The characters at index 4 to 7 are replaced, so '13434813546' becomes '1343****546'.
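The start and stop arguments follow Python slicing rules, so stop is exclusive. The sketch below masks the fourth to seventh digits of a phone number (a common convention, assumed here rather than taken from the article) and shows two variations:

```python
import pandas as pd

phones = pd.Series(["13434813546", "19748672895"])

# Index 3 up to, but not including, 7 is replaced.
masked = phones.str.slice_replace(3, 7, "****")

# The replacement need not have the same length as the slice it replaces.
shortened = phones.str.slice_replace(3, 7, "-")

# Omitting stop replaces everything from start to the end of the string.
prefix_only = phones.str.slice_replace(3, repl="...")

print(masked.tolist())
print(shortened.tolist())
print(prefix_only.tolist())
```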

⑪ replace: replace a matching substring with the given string

df["身高"].str.replace(":","-")

Every colon becomes a hyphen, so 'mid:175_good' becomes 'mid-175_good'.

⑫ replace: replace a matching substring with the given string (regular expressions are accepted)

Using regular expressions in replace is easy.
Don't worry about whether the following example is practical; the point is to see how convenient regular expressions are for data cleaning.

# regex=True is required since pandas 2.0, where the default became regex=False
df["收入"].str.replace(r"\d+\.\d+","正则",regex=True)

The numeric part of each value is replaced, giving '正则万', '正则千', '正则万', '正则千', '正则万'.
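Whether the pattern is read literally or as a regular expression changes the result completely. A small sketch on two of the income values makes that visible, and also shows capture groups being reused in the replacement text:

```python
import pandas as pd

income = pd.Series(["1.1万", "8.5千"])

# Literal replacement: only the dot character is removed.
literal = income.str.replace(".", "", regex=False)

# Regex replacement: "." matches every character, so nothing is left.
wildcard = income.str.replace(".", "", regex=True)

# Capture groups can be reused in the replacement as \1, \2, ...
spaced = income.str.replace(r"(\d+\.\d+)(.)", r"\1 \2", regex=True)

print(literal.tolist())
print(wildcard.tolist())
print(spaced.tolist())
```

Passing regex explicitly, as done here, keeps the behaviour the same across pandas versions.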

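⑬ split method + expand parameter, combined with the join method: a very powerful pairing

# Basic usage
df["身高"].str.split(":")
# split with the expand parameter
df[["身高描述","final身高"]] = df["身高"].str.split(":",expand=True)
df
# split combined with join
df["身高"].str.split(":").str.join("?"*5)

Plain split returns lists such as ['mid', '175_good']. With expand=True the two parts become the new columns 身高描述 and final身高. Split followed by join returns strings such as 'mid?????175_good'.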
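The height column actually has two delimiters, a colon and an underscore. As a rough sketch, a character class splits on both at once, and n limits the number of splits. The column names level, cm and rating are hypothetical, chosen only for this example:

```python
import pandas as pd

height = pd.Series(["mid:175_good", "high:180_verygood"])

# n=1 splits at the first occurrence of the delimiter only.
first_split = height.str.split("_", n=1, expand=True)

# A pattern longer than one character is treated as a regular expression,
# so "[:_]" splits on either delimiter.
parts = height.str.split("[:_]", expand=True)
parts.columns = ["level", "cm", "rating"]  # hypothetical column names

# The pieces are still strings; convert the numeric one explicitly.
parts["cm"] = parts["cm"].astype(int)

print(first_split)
print(parts)
```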

⑭ strip/rstrip/lstrip: remove leading and trailing whitespace, including newlines

df["姓名"].str.len()
df["姓名"] = df["姓名"].str.strip()
df["姓名"].str.len()

Before stripping, the lengths are 4, 3, 4, 3, 3; afterwards they are all 3.

⑮ findall: Use regular expressions to match in the string and return a list of search results

findall brings regular expressions to data cleaning, which is really handy!

df["身高"]
df["身高"].str.findall("[a-zA-Z]+")

Each row becomes a list of its alphabetic runs, for example ['mid', 'good'] and ['high', 'verygood'].
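Because findall returns a list per row, the str accessor can be applied again to those lists. A minimal sketch on three of the height values, picking out individual matches and converting the digits to integers:

```python
import pandas as pd

height = pd.Series(["mid:175_good", "low:165_bad", "high:180_verygood"])

words = height.str.findall("[a-zA-Z]+")

# The str accessor also works element-wise on lists.
n_words = words.str.len()    # number of matches per row
rating = words.str.get(-1)   # last match per row

# Digits can be pulled out the same way and converted to integers.
cm = height.str.findall(r"\d+").str.get(0).astype(int)

print(n_words.tolist())
print(rating.tolist())
print(cm.tolist())
```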

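⑯ extract/extractall: accept regular expressions and extract matching strings (the pattern must contain capture groups in parentheses)

df["身高"].str.extract("([a-zA-Z]+)")
# extractall returns a result with a MultiIndex (row, match number)
df["身高"].str.extractall("([a-zA-Z]+)")
# extract with the expand parameter
df["身高"].str.extract("([a-zA-Z]+).*?([a-zA-Z]+)",expand=True)

extract returns the first match per row (mid, low, low, high, low). extractall returns every match, indexed by row and match number. The two-group pattern returns two columns, for example mid and good for the first row.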
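As one possible application, extract can turn the income column into plain numbers. This is only a sketch: the group names value and unit are made up for the example, and it assumes every value is a decimal number followed by 万 (ten thousand) or 千 (one thousand), as in the sample data.

```python
import pandas as pd

income = pd.Series(["1.1万", "8.5千", "0.9万", "6.5千", "2.0万"])

# Named groups become column names in the result.
parts = income.str.extract(r"(?P<value>\d+\.\d+)(?P<unit>[万千])")

# Map each unit to its multiplier: 万 is ten thousand, 千 is one thousand.
multiplier = parts["unit"].map({"万": 10_000, "千": 1_000})

# Round before casting, since float products such as 1.1 * 10000 are inexact.
amount = (parts["value"].astype(float) * multiplier).round().astype(int)

print(parts)
print(amount.tolist())
```

Rows that do not match the pattern would come back as missing values, so real data may need a check before the final cast to int.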

That is all for today's article. I hope you find it helpful.

Explain 16 Pandas functions in detail to improve your "data cleaning" ability
