Data analysis in Python mainly relies on a few libraries:
NumPy, pandas, seaborn, matplotlib
Data analysis process
1. View data
Common functions:
df.head()      # first few rows
df.info()      # column dtypes and non-null counts
df.describe()  # basic summary statistics
These functions give a first impression of what the data looks like, whether it contains missing values, and its basic statistics.
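As a minimal sketch of this first look, the snippet below applies the three functions to a small made-up DataFrame. The column names and values are invented for illustration and are not taken from the Titanic data used later.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data; one Age value is deliberately missing
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
})

first_rows = df.head(2)   # first two rows
summary = df.describe()   # count, mean, std, min, quartiles, max of numeric columns

# info() prints to stdout by default; capture it in a buffer to inspect it as text
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

print(first_rows)
print(info_text)
print(summary)
```

The non-null count reported by info() and the count row of describe() are the quickest hints that a column has missing values.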
2. Handle missing data
Common functions:
df.isna()        # flag missing values
df.isnull()      # alias of isna()
df.duplicated()  # flag duplicate rows
df.fillna()      # fill missing values
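A short sketch of how these calls are typically combined, assuming a made-up DataFrame (the data and the fill value 0 are placeholders chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values and one duplicated row
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, np.nan, 35.0, np.nan, 22.0],
})

missing_per_column = df.isna().sum()     # number of missing values in each column
duplicate_rows = df.duplicated().sum()   # rows identical to an earlier row
filled = df.fillna({"Age": 0})           # fill only the Age column (0 is an arbitrary placeholder)

print(missing_per_column)
print(duplicate_rows)
```

Because isna() and duplicated() return boolean results, summing them counts the True entries.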
Note:
(1) fillna() can be called on a single column to fill only that column's missing values
df['col_name'] = df['col_name'].fillna(value)
(2) When the fill values are looked up by index label, the DataFrame's index must correspond to the index of the fill values,
so set the index before filling and call reset_index() afterwards
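# Use Sex as the index so that fillna can align the fill values on it
titanic_df.set_index('Sex', inplace=True)
titanic_df
# Fill missing values with fillna, matched by index label
# (age_median2 is presumably a Series indexed by Sex)
titanic_df['Age'] = titanic_df['Age'].fillna(age_median2)
# Reset the index, i.e. turn Sex back into an ordinary column
titanic_df.reset_index(inplace=True)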
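The notes do not show how age_median2 is built. One plausible construction, sketched here as an assumption, is the median age per sex obtained with groupby; the small DataFrame is a hypothetical stand-in for titanic_df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_df
titanic_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Assumed definition: median age per sex, a Series indexed by Sex
age_median2 = titanic_df.groupby("Sex")["Age"].median()

# Put the frame on the same index, fill by label, then restore the default index
titanic_df = titanic_df.set_index("Sex")
titanic_df["Age"] = titanic_df["Age"].fillna(age_median2)
titanic_df = titanic_df.reset_index()

print(titanic_df)
```

An alternative that avoids changing the index is to fill with titanic_df.groupby("Sex")["Age"].transform("median"), which already has the same index as the original rows.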
3. Data transformation, grouping and aggregation
Common functions:
df.groupby(by = , sort=True).agg(func)
df.groupby(by = , sort=True).func
df.groupby(by = , sort=True)['col_names'].agg(func)
df.groupby(by = , sort=True)['col_names'].func
df.sort_index()   # sort by index
df.sort_values()  # sort by values
Note: sort_values() can sort by several columns at once, with a separate sort direction for each
titanic_df.sort_values(by = ['Pclass', 'Age'], ascending=[True, False],inplace=True)
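The groupby and sorting patterns above can be sketched on a small made-up table. The column names mimic the Titanic data, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Age": [38.0, 22.0, 30.0, 26.0, 54.0, 14.0],
    "Survived": [1, 0, 1, 1, 0, 1],
})

# Group by class, then aggregate one column with several functions
age_stats = df.groupby("Pclass", sort=True)["Age"].agg(["mean", "max"])

# Shortcut form: call the aggregation method directly on the grouped column
survival_rate = df.groupby("Pclass")["Survived"].mean()

# Sort by class ascending and, within each class, by age descending
ordered = df.sort_values(by=["Pclass", "Age"], ascending=[True, False])

print(age_stats)
print(survival_rate)
print(ordered)
```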
4. Data Visualization
(1) Built-in pandas plotting
df.plot(kind=...)
kind: 'box' # box plot
'line' # line chart
'pie' # pie chart
'bar' # bar chart
'hist' # histogram
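To illustrate the difference between 'hist' and 'bar', here is a minimal sketch on made-up data. It assumes matplotlib is installed and uses the non-interactive Agg backend so that it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 3],
    "Age": [38.0, 22.0, 30.0, 26.0, 54.0, 19.0],
})

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# 'hist' bins a numeric column into intervals
df["Age"].plot(kind="hist", bins=5, ax=ax_hist, title="Age distribution")

# 'bar' draws one bar per index label, here the number of passengers per class
counts = df["Pclass"].value_counts().sort_index()
counts.plot(kind="bar", ax=ax_bar, title="Passengers per class")

fig.tight_layout()
```

Passing ax explicitly keeps the two plots on separate axes; without it, successive Series plots may be drawn onto the same current axes.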
(2) seaborn
Common plotting functions:
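# Scatter plots: analyze the relationship between two variables
sns.lmplot(x="total_bill", y="tip", data=tips)  # scatter plot plus a fitted regression line
sns.jointplot(x="x", y="y", data=df)
# Bar plot; hue names the column used to split the bars by category
sns.barplot(x="sex", y="survived", hue="class", data=titanic)
# Box plot by category
sns.boxplot(x="day", y="total_bill", hue="time", data=tips)
# Histogram (distplot is deprecated in recent seaborn; histplot replaces it)
sns.histplot(x, kde=True, bins=20)  # bins controls the number of bins
# Violin plot: a box plot combined with a KDE (density estimate)
sns.violinplot(x="total_bill", y="day", hue="time", data=tips)
# Count plot: the plotting counterpart of value_counts in pandas
sns.countplot(x="deck", data=titanic)
# Point plot
sns.pointplot(x="sex", y="survived", hue="class", data=titanic)
# Categorical subplots; x and y can cover several groups of data (factorplot was renamed catplot in seaborn 0.9)
sns.catplot(x="day", y="total_bill", hue="smoker", col="time", data=tips, kind="swarm")
# Pairwise relationships between the variables
sns.pairplot(iris)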
Key point:
Plotting function for multiple categories (one subplot per combination of categories):
sns.FacetGrid()
For example:
sns.FacetGrid(data = titanic_df, row='Embarked', col='Sex',aspect=1.5) \
.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', hue_order=['male', 'female'], palette='deep', ci=None)
row, col, hue : strings
Variables that define subsets of the data, which are drawn on separate facets of the grid.
map(func, *args, **kwargs)
Applies a plotting function to each facet's subset of the data.
{row,col,hue}_order : lists, optional
Order for the levels of the faceting variables. By default, this is the order in which the levels appear in the data or, if the variables are pandas categoricals, the category order.
Addendum:
1. pivot_table function
A pivot table, roughly equivalent to groupby + agg
titanic_df.pivot_table(values='Survived', index='AgeBand', columns=['Sex', 'Pclass'], aggfunc='mean')
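The correspondence with groupby + agg can be checked on a small made-up table. AgeBand is assumed here to be a precomputed age-bucket column, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; AgeBand stands in for a precomputed age bucket
df = pd.DataFrame({
    "AgeBand": ["young", "young", "old", "old", "young", "old"],
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1, 1],
})

# Pivot table: one row per AgeBand, one column per Sex, cells hold the mean of Survived
pivot = df.pivot_table(values="Survived", index="AgeBand", columns="Sex", aggfunc="mean")

# The same numbers via groupby + agg, reshaped into columns with unstack()
grouped = df.groupby(["AgeBand", "Sex"])["Survived"].agg("mean").unstack("Sex")

print(pivot)
print(grouped)
```

Both results have AgeBand as the index and Sex as the columns, so pivot_table can be read as a groupby followed by a reshape.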