(Pandas) Comment data cleaning

1. Handling null values

# Drop rows where the comment column is null (empty strings are not removed)
df = df.dropna(subset=['comment'])
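
As a quick check of the note in the comment above, here is a minimal sketch showing that dropna() removes None/NaN rows but keeps empty strings, which is why empty strings have to be handled separately later. The sample data is made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'comment': ['good', None, '', np.nan],
})

cleaned = df.dropna(subset=['comment'])

# Rows holding None / NaN are dropped; the empty string survives.
remaining = cleaned['comment'].tolist()
print(remaining)
```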

2. Data deduplication

It is best to use several columns as the key when removing duplicates. Deduplicating on the comment column alone would also delete identical comments written by different people.

# Use user_id and comment together as the key: if both match, keep only the first occurrence.
df.drop_duplicates(subset=['user_id', 'comment'], keep='first', inplace=True)
# Reset the index
df.reset_index(drop=True, inplace=True)
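
To illustrate why the key matters, the following sketch compares deduplicating on comment alone with deduplicating on user_id plus comment. The data frame is a hypothetical example, not data from the original post.

```python
import pandas as pd

# Hypothetical sample data: users 1 and 2 wrote the same text,
# and user 1 posted that text twice.
df = pd.DataFrame({
    'user_id': [1, 2, 1],
    'comment': ['nice', 'nice', 'nice'],
})

by_comment_only = df.drop_duplicates(subset=['comment'], keep='first')
by_user_and_comment = df.drop_duplicates(subset=['user_id', 'comment'], keep='first')

print(len(by_comment_only))      # user 2's comment is lost as well
print(len(by_user_and_comment))  # only user 1's repeated post is removed
```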

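3. Targeted removal of useless comments

1. Remove purely numeric comments: convert them to empty strings first, then handle all empty strings together at the end.

# Replace purely numeric comments (e.g. '123') with an empty string ('')
# regex=True is required since pandas 2.0, where str.replace defaults to literal matching
df['comment'] = df['comment'].str.replace(r'^[0-9]*$', '', regex=True)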

2. Remove comments that consist of a single repeated character

# Replace comments such as '111', 'aaa', '....' with an empty string ('')
df['comment'] = df['comment'].str.replace(r'^(.)\1*$', '', regex=True)

3. Replace timestamps inside comments with an empty string

# Replace timestamps such as '2020/11/20 20:00:00' with an empty string ('')
df['comment'] = df['comment'].str.replace(r'\d+/\d+/\d+ \d+:\d+:\d+', '', regex=True)
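
The three patterns above can be tried on individual strings before running them on a whole column. The sketch below uses Python's standard re module, which follows the same regex syntax. The helper name strip_useless and the sample strings are assumptions made for illustration.

```python
import re

# The three patterns from the steps above.
PURE_DIGITS = r'^[0-9]*$'
SINGLE_REPEATED_CHAR = r'^(.)\1*$'
TIMESTAMP = r'\d+/\d+/\d+ \d+:\d+:\d+'

def strip_useless(comment):
    """Apply the three replacements in order (hypothetical helper)."""
    for pattern in (PURE_DIGITS, SINGLE_REPEATED_CHAR, TIMESTAMP):
        comment = re.sub(pattern, '', comment)
    return comment

samples = ['123', 'aaa', '....', 'great phone', 'bought 2020/11/20 20:00:00 ok']
results = [strip_useless(s) for s in samples]
print(results)
```

Note that r'^(.)\1*$' also matches a comment that is exactly one character long, so single-character comments are blanked too. Removing a timestamp leaves the surrounding spaces in place.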

4. Compress a run of repeated characters at the beginning or end of a comment.
Effect: 'aaabdc' -> 'abdc', 'Very good, good, good' -> 'Very good'

# Collapse a run of the same character at the start into one character (e.g. 'aaabdc' -> 'abdc')
df['comment'] = df['comment'].str.replace(r'^(.)\1+', r'\1', regex=True)
# Collapse a run of the same character at the end into one character (e.g. 'bdcaaa' -> 'bdca')
df['comment'] = df['comment'].str.replace(r'(.)\1+$', r'\1', regex=True)
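
The intended effect of this step can be checked on single strings. The following plain-Python sketch uses a backreference in the replacement so that one character of the run is kept. The function name compress_edges is hypothetical, and this is one possible way to express the step, not necessarily the original author's approach.

```python
import re

def compress_edges(comment):
    """Collapse a repeated-character run at the start and at the end (hypothetical helper)."""
    comment = re.sub(r'^(.)\1+', r'\1', comment)  # leading run -> one character
    comment = re.sub(r'(.)\1+$', r'\1', comment)  # trailing run -> one character
    return comment

compressed = [compress_edges(s) for s in ['aaabdc', 'bdcaaa', 'aaabdccc', 'abc']]
print(compressed)
```

Because the two substitutions run one after the other, a comment with repeats at both ends (such as 'aaabdccc') is compressed on both sides.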

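Finally, convert the empty strings to np.nan and delete them with dropna().

import numpy as np

# Convert empty or whitespace-only strings to np.nan (NaN) so the next step can drop them.
# The result is assigned back, because calling replace(..., inplace=True) on a selected column may not update df.
df['comment'] = df['comment'].replace(to_replace=r'^\s*$', value=np.nan, regex=True)
# Drop the rows whose comment is null, then reset the index
df = df.dropna(subset=['comment'])
df.reset_index(drop=True, inplace=True)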
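
As a small end-of-pipeline check, the sketch below runs this final step on made-up data in which earlier steps have already blanked some comments. The data frame contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two comments were blanked by earlier steps.
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'comment': ['good', '', '   ', 'fast delivery'],
})

# Turn empty / whitespace-only strings into NaN, assigning the result back.
df['comment'] = df['comment'].replace(to_replace=r'^\s*$', value=np.nan, regex=True)
# Drop the NaN rows and rebuild a clean 0..n-1 index.
df = df.dropna(subset=['comment']).reset_index(drop=True)

kept = df['comment'].tolist()
print(kept)
```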

Origin blog.csdn.net/qq_43965708/article/details/110884444