pandas (text deduplication): deleting duplicate rows based on a column

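Method 1: the unique() function, which returns the unique values of a Series object in order of first appearance. Note that writing those values back into the column realigns them by position, so this approach deduplicates the column's values but does not keep each value paired with its original row (in the output below, 'c' ends up with '123', which originally belonged to 'd').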

import pandas as pd

dic = {
    'name': ['a', 'b', 'c', 'd'],
    'comment': ['abc', '真棒', '真棒', '123'],
}

df = pd.DataFrame(dic)

df
Out[4]: 
  name comment
0    a     abc
1    b      真棒
2    c      真棒
3    d     123

df['comment'] = pd.DataFrame(df['comment'].unique())

df
Out[5]: 
  name comment
0    a     abc
1    b      真棒
2    c     123
3    d     NaN

# Drop the rows whose comment is NaN
df = df.dropna(subset=['comment'])

df
Out[6]: 
  name comment
0    a     abc
1    b      真棒
2    c     123
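
If each remaining comment needs to stay paired with its original name, one possible alternative is to filter rows with a boolean mask instead of overwriting the column. The following is a minimal sketch, not part of the original method: it assumes the same sample data as above and uses Series.duplicated(), which marks every repeat of a value after its first occurrence. The names unique_comments, mask, and deduped are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'comment': ['abc', '真棒', '真棒', '123'],
})

# unique() is handy for inspecting the distinct values of one column.
unique_comments = list(df['comment'].unique())
print(unique_comments)

# duplicated() returns True for every repeat after the first occurrence.
mask = df['comment'].duplicated()

# Keep only the rows that are not repeats; each name stays with its own comment.
deduped = df[~mask]
print(deduped)
```

With this approach row 'd' keeps its own comment '123', and the original index labels (0, 1, 3) are preserved.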

Method 2: the drop_duplicates() function.

drop_duplicates(subset=['comment'], keep='first', inplace=True)
subset: the name(s) of the column(s) to check for duplicates, given as a list. The default is None, which means all columns are compared.
keep: one of three values, 'first', 'last', or False; the default is 'first'.
(1) 'first': keep the first occurrence of each duplicated row and delete the later ones.
(2) 'last': keep the last occurrence and delete the earlier ones.
(3) False: delete every row that has a duplicate, keeping none of them.
inplace: the default is False, which returns a new DataFrame with the duplicates removed and leaves the original unchanged. With True, the duplicates are deleted directly from the original data and None is returned.
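
The effect of the three keep values can be compared side by side. This is a small sketch using the same sample data as the rest of the post; the variable names kept_first, kept_last, and kept_none are illustrative only. inplace is left at its default so that each call returns a new DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'comment': ['abc', '真棒', '真棒', '123'],
})

# Rows 'b' and 'c' share the same comment, so they are duplicates on that column.
kept_first = df.drop_duplicates(subset=['comment'], keep='first')  # keeps 'b'
kept_last = df.drop_duplicates(subset=['comment'], keep='last')    # keeps 'c'
kept_none = df.drop_duplicates(subset=['comment'], keep=False)     # drops both

print(kept_first['name'].tolist())
print(kept_last['name'].tolist())
print(kept_none['name'].tolist())
```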

import pandas as pd

dic = {
    'name': ['a', 'b', 'c', 'd'],
    'comment': ['abc', '真棒', '真棒', '123'],
}

df = pd.DataFrame(dic)

df
Out[6]: 
  name comment
0    a     abc
1    b      真棒
2    c      真棒
3    d     123

df.drop_duplicates(keep='first', inplace=True)

df
Out[14]: 
  name comment
0    a     abc
1    b      真棒
2    c      真棒
3    d     123

subset is None by default, so all columns are compared. Although rows 1 and 2 have the same comment, their names differ, so neither counts as a duplicate and both are kept. Choose subset according to your specific situation.

df.drop_duplicates(subset=['comment'], keep='first', inplace=True)

df
Out[16]: 
  name comment
0    a     abc
1    b      真棒
3    d     123

Setting subset to ['comment'] deletes the rows with duplicate values in that column. The index is not reset at this point. If necessary, it can be reset as follows.

df.reset_index(drop=True, inplace=True)

df
Out[18]: 
  name comment
0    a     abc
1    b      真棒
2    d     123
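
The deduplication and the index reset can also be combined in one call. The sketch below assumes pandas 1.0 or later, where drop_duplicates() accepts an ignore_index parameter; it returns a new DataFrame rather than modifying df in place. The name result is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'comment': ['abc', '真棒', '真棒', '123'],
})

# ignore_index=True relabels the remaining rows 0, 1, 2, ...
# (assumes pandas >= 1.0, where this parameter is available)
result = df.drop_duplicates(subset=['comment'], keep='first', ignore_index=True)

print(result)
```

Leaving inplace at its default keeps the original df untouched, which can make it easier to compare the data before and after deduplication.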

Origin blog.csdn.net/qq_43965708/article/details/109892053