pandas text deduplication: deleting duplicate rows based on a column
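Method 1: The unique() function, which returns the unique values of a Series object. Note that the unique values are shorter than the original column, so writing them back into the DataFrame realigns them by position: in the example below, '123' ends up next to name c even though it originally belonged to name d. Use this method only when the column's values are needed on their own, not when each value must stay with its row.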
import pandas as pd
dic = {'name': ['a', 'b', 'c', 'd'], 'comment': ['abc', '真棒', '真棒', '123']}
df = pd.DataFrame(dic)
df
Out[4]:
name comment
0 a abc
1 b 真棒
2 c 真棒
3 d 123
df['comment'] = pd.DataFrame(df['comment'].unique())
df
Out[5]:
name comment
0 a abc
1 b 真棒
2 c 123
3 d NaN
# Drop the rows whose comment is null
df = df.dropna(subset=['comment'])
df
Out[6]:
name comment
0 a abc
1 b 真棒
2 c 123
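The following is a minimal sketch of why Method 1 should be used with care. It rebuilds the same example data, collects the unique comments on their own, and then shows what happens when they are written back. The names unique_comments and shifted are illustrative and do not come from the original session.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'comment': ['abc', '真棒', '真棒', '123'],
})

# unique() returns the distinct values in order of first appearance.
unique_comments = df['comment'].unique()
print(list(unique_comments))

# Writing the shorter result back aligns it on the index 0, 1, 2,
# so row 2 (name 'c') receives '123' and row 3 is left as NaN.
shifted = df.copy()
shifted['comment'] = pd.Series(unique_comments)
print(shifted)
```

If only the list of distinct comments is needed, unique_comments is enough and nothing has to be written back.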
Method 2: The drop_duplicates() function, which removes duplicate rows from a DataFrame.
df.drop_duplicates(subset=['comment'], keep='first', inplace=True)
subset: the column name, or list of column names, to check for duplicates. The default is None, which means all columns are compared.
keep: accepts one of three values, 'first', 'last', or False; the default is 'first'.
(1) 'first': keep the first occurrence of each duplicated row and delete the later ones.
(2) 'last': keep the last occurrence of each duplicated row and delete the earlier ones.
(3) False: delete every row that has a duplicate, keeping none of them.
inplace: the default is False, which leaves the original data unchanged and returns a deduplicated copy. With True, the duplicates are deleted directly from the original data.
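As a rough illustration of the parameters described above, the sketch below applies each keep option to the same example data and then contrasts the default inplace=False with inplace=True. The variable names first, last, and none_kept are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'comment': ['abc', '真棒', '真棒', '123'],
})

# Each call returns a new DataFrame because inplace defaults to False.
first = df.drop_duplicates(subset=['comment'], keep='first')
last = df.drop_duplicates(subset=['comment'], keep='last')
none_kept = df.drop_duplicates(subset=['comment'], keep=False)

print(first['name'].tolist())      # ['a', 'b', 'd']
print(last['name'].tolist())       # ['a', 'c', 'd']
print(none_kept['name'].tolist())  # ['a', 'd']

# The original is still intact at this point.
rows_before = len(df)

# With inplace=True the original DataFrame itself is modified.
df.drop_duplicates(subset=['comment'], keep='first', inplace=True)
rows_after = len(df)
print(rows_before, rows_after)     # 4 3
```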
import pandas as pd
dic = {'name': ['a', 'b', 'c', 'd'], 'comment': ['abc', '真棒', '真棒', '123']}
df = pd.DataFrame(dic)
df
Out[6]:
name comment
0 a abc
1 b 真棒
2 c 真棒
3 d 123
df.drop_duplicates(keep='first', inplace=True)
df
Out[14]:
name comment
0 a abc
1 b 真棒
2 c 真棒
3 d 123
subset is None by default, so all columns are compared. Although rows 1 and 2 have the same comment, their names differ, so neither is treated as a duplicate and both are kept. Choose subset according to your specific situation.
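One way to check in advance which rows would be dropped is the companion method duplicated(), which returns a boolean Series instead of removing anything. The sketch below is an illustration on the same example data, not part of the original session; all_cols and by_comment are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'comment': ['abc', '真棒', '真棒', '123'],
})

# Comparing all columns: no row is a full duplicate of an earlier row.
all_cols = df.duplicated()
print(all_cols.tolist())    # [False, False, False, False]

# Comparing only 'comment': row 2 repeats the comment of row 1.
by_comment = df.duplicated(subset=['comment'])
print(by_comment.tolist())  # [False, False, True, False]
```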
df.drop_duplicates(subset=['comment'], keep='first', inplace=True)
df
Out[16]:
name comment
0 a abc
1 b 真棒
3 d 123
Setting subset to ['comment'] deletes the rows whose value in this column repeats an earlier one. The index is not reset by this operation, so it now runs 0, 1, 3. If necessary, reset it as follows.
df.reset_index(drop=True, inplace=True)
df
Out[18]:
name comment
0 a abc
1 b 真棒
2 d 123
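As an alternative sketch, assuming pandas 1.0 or later, drop_duplicates() accepts an ignore_index parameter that renumbers the result in the same call, so a separate reset_index() step is not needed. The name deduped is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'comment': ['abc', '真棒', '真棒', '123'],
})

# ignore_index=True relabels the remaining rows 0, 1, ..., n-1.
deduped = df.drop_duplicates(subset=['comment'], keep='first', ignore_index=True)
print(deduped.index.tolist())    # [0, 1, 2]
print(deduped['name'].tolist())  # ['a', 'b', 'd']
```

This version returns a new DataFrame and leaves df unchanged, which is often easier to reason about than inplace=True.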