import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Use a CJK-capable font (SimHei) and render minus signs correctly in matplotlib
plt.rcParams["font.family"]="SimHei"
plt.rcParams["axes.unicode_minus"]=False
data=pd.read_csv('./data/douyin_dataset.csv')
data
Field meanings
The first column is unnamed. Its values increase but are not contiguous, so the data set may have been filtered or sampled from a larger one.
uid: user id
user_city: the city where the user is located
item_id: work id
author_id: author id
item_city: city of the work
channel: source through which the user viewed the work
finish: whether the user finished viewing the work
like: whether the user liked the work
music_id: music id
duration_time: duration of the work (seconds)
real_time: work release time
H: current hour
date: current date
2. Data cleaning
data.info()
2.1 Missing values
# Count missing values in each column
data.isnull().sum()
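Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, not part of the original analysis: `missing_summary` is a hypothetical helper, and the toy frame stands in for the real data set with made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and the fraction of rows missing."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),   # number of missing cells per column
        "share": df.isnull().mean(),    # fraction of rows missing per column
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for the real data set (values are made up).
toy = pd.DataFrame({"uid": [1, 2, 3, 4], "like": [0, np.nan, 1, np.nan]})
summary = missing_summary(toy)
print(summary)
```

On the real data the same helper could be called as `missing_summary(data)`.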
2.2 Duplicate data
# Check for duplicate rows
data.duplicated().sum()
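One caveat: if the unnamed first column is a unique row identifier (an assumption, not verified here), a full-row check can report zero duplicates even when every other field repeats. The sketch below, on a made-up toy frame, compares a full-row check with one that ignores that column.

```python
import pandas as pd

# Toy frame standing in for the real data set (values are made up).
toy = pd.DataFrame({
    "Unnamed: 0": [3, 7, 9],   # assumed to behave like a row identifier
    "uid": [1, 1, 2],
    "item_id": [10, 10, 20],
    "like": [0, 0, 1],
})

# Full-row check: the identifier column makes every row look unique.
full_row_dupes = int(toy.duplicated().sum())

# Content-only check: ignore the identifier column.
content_cols = [c for c in toy.columns if c != "Unnamed: 0"]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

# Keep the first occurrence of each repeated record.
deduped = toy.drop_duplicates(subset=content_cols, keep="first").reset_index(drop=True)
print(full_row_dupes, content_dupes, len(deduped))
```

Whether repeated records should be dropped depends on the data: the same user viewing the same work twice may be a legitimate event rather than a duplicate.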
2.3 Modify column names
colNameDict = {
    'Unnamed: 0': 'ID',
    'uid': '用户id',
    'user_city': '用户所在城市',
    'item_id': '作品id',
    'author_id': '作者id',
    'item_city': '作品城市',
    'channel': '观看到该作品的来源',
    'finish': '是否浏览完作品',
    'like': '是否对作品点赞',
    'music_id': '音乐id',
    'duration_time': '作品时长 s',
    'real_time': '作品发布时间',
    'H': '当前小时',
    'date': '当前日期',
}
data = data.rename(columns=colNameDict)
data
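By default, `DataFrame.rename` silently skips mapping keys that do not match any column, so a typo in the dictionary goes unnoticed. A small sketch of how `errors="raise"` surfaces such a mismatch, using a made-up toy frame and a deliberately misspelled key:

```python
import pandas as pd

# Toy frame standing in for the real data set (values are made up).
toy = pd.DataFrame({"uid": [1], "like": [0]})

# "likes" is a deliberate typo for "like".
mapping = {"uid": "用户id", "likes": "是否对作品点赞"}

# Default behaviour: the unmatched key is ignored and "like" keeps its name.
silently_renamed = toy.rename(columns=mapping)
print(list(silently_renamed.columns))

# With errors="raise", the unmatched key triggers a KeyError instead.
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```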
2.4 Convert data types
data['作品发布时间']=pd.to_datetime(data['作品发布时间'])
data['当前日期']=pd.to_datetime(data['当前日期'])
data
data.info()
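`pd.to_datetime` raises an error on values it cannot parse. If the columns might contain malformed entries (an assumption; the real data may be clean), one option is to coerce failures to `NaT` and count them. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy column standing in for the real data (values are made up).
toy = pd.DataFrame({"当前日期": ["2019-10-01", "2019-10-02", "not a date"]})

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(toy["当前日期"], errors="coerce")

# Count the entries that failed to parse.
n_bad = int(parsed.isna().sum())
print(parsed.dtype, n_bad)
```

A non-zero count is a prompt to inspect the offending rows before any date-based grouping.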
3. Data analysis and visualization
3.1 Daily views, users, authors, and works