Data Analysis + Visualization (1)

Data analysis mainly relies on a few libraries:
numpy, pandas, seaborn, matplotlib

Data analysis process

1. View data

Common functions:
df.head()
df.info()
df.describe()

These functions give a first impression of the data: what it looks like, whether any values are missing, and its basic statistics.
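As a minimal sketch of this first look, the snippet below runs the three calls on a small made-up frame. The column names and values are hypothetical stand-ins for a real data set.

```python
import pandas as pd

# Hypothetical toy data standing in for a real data set.
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Age": [38.0, None, 27.0, 22.0],
    "Sex": ["female", "male", "female", "male"],
})

first_rows = df.head(2)   # first 2 rows (the default is 5)
df.info()                 # prints dtypes and non-null counts per column
stats = df.describe()     # summary statistics for the numeric columns

print(first_rows)
print(stats)
```

The non-null count reported by info() and the count row of describe() are the quickest hints that a column has missing values.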

2. Handle missing data

Common functions:
df.isna() # check for missing values
df.isnull() # alias of isna()
df.duplicated() # flag duplicate rows
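A short sketch of how these checks are typically summarized, using a made-up frame (the data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing value per column and one repeated row.
df = pd.DataFrame({
    "Age": [22.0, None, 22.0, 35.0],
    "Sex": ["male", "female", "male", None],
})

missing_per_column = df.isna().sum()     # isnull() gives the same result
duplicate_rows = df.duplicated().sum()   # rows identical to an earlier row

print(missing_per_column)
print(duplicate_rows)
```

Summing the boolean results turns the element-wise masks into counts, which is usually easier to read than the masks themselves.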

df.fillna() # fill missing values
Note:
(1) fillna() can also fill the missing values of a single column:

   df['col_name'] = df['col_name'].fillna(value)

(2) To fill by index value, the index of the fill values must correspond to the index of the original data set; call reset_index() after filling:

titanic_df.set_index('Sex', inplace=True)
titanic_df
# Fill missing values with fillna; fill values are matched by index
titanic_df['Age'] = titanic_df['Age'].fillna(age_median2)
# Reset the index, i.e. drop Sex as the index
titanic_df.reset_index(inplace=True)
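The snippet above does not show where age_median2 comes from. The sketch below assumes it is the median age per sex, computed with groupby, and uses a made-up miniature of the Titanic data; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the Titanic data.
titanic_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male"],
    "Age": [20.0, 28.0, None, None, 40.0],
})

# Assumption: age_median2 is the median age per sex, a Series indexed by Sex.
age_median2 = titanic_df.groupby("Sex")["Age"].median()

# Make Sex the index so fillna can match each row to its group's median.
titanic_df = titanic_df.set_index("Sex")
titanic_df["Age"] = titanic_df["Age"].fillna(age_median2)

# Restore Sex as an ordinary column.
titanic_df = titanic_df.reset_index()
print(titanic_df)
```

Each missing age is replaced by the median of its own group rather than by a single global value.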

3. Data transformation and group aggregation

Common functions:
df.groupby(by=..., sort=True).agg(func)
df.groupby(by=..., sort=True).func()
df.groupby(by=..., sort=True)['col_names'].agg(func)
df.groupby(by=..., sort=True)['col_names'].func()
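The patterns above can be sketched as follows. The frame is a hypothetical miniature of the Titanic data, and mean and max are just example aggregations.

```python
import pandas as pd

# Hypothetical miniature of the Titanic data.
titanic_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Survived": [0, 1, 1, 1],
    "Age": [22.0, 38.0, 54.0, 27.0],
})

# One column, one aggregation called directly as a method.
rate_by_sex = titanic_df.groupby(by="Sex", sort=True)["Survived"].mean()

# Several columns, several aggregations through agg().
summary = titanic_df.groupby(by="Sex", sort=True)[["Survived", "Age"]].agg(["mean", "max"])

print(rate_by_sex)
print(summary)
```

With a list of functions, agg() returns two-level column labels such as ("Age", "max").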

df.sort_index() # sort by index
df.sort_values() # sort by values
Note: sort_values() can sort by several columns at once, each with its own sort order

titanic_df.sort_values(by = ['Pclass', 'Age'], ascending=[True, False],inplace=True)
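To make the effect of the per-column sort order concrete, here is a small sketch on made-up data: Pclass ascending, and within each class Age descending.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 35.0, 54.0],
})

sorted_df = df.sort_values(by=["Pclass", "Age"], ascending=[True, False])
print(sorted_df)

# sort_index() brings the rows back into their original order.
restored = sorted_df.sort_index()
```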

4. Data Visualization

(1) pandas' built-in plotting functions
df.plot(kind=...)
kind: 'box' # box plot
'line' # line chart
'pie' # pie chart
'bar' # bar chart
'hist' # histogram
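A minimal sketch of three of these kinds side by side, assuming matplotlib is installed. The data is made up, and the Agg backend is chosen only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df["Age"].plot(kind="hist", bins=5, ax=axes[0], title="hist")  # distribution of one column
df.plot(kind="box", ax=axes[1], title="box")                   # one box per numeric column
df["Fare"].plot(kind="bar", ax=axes[2], title="bar")           # one bar per row
fig.tight_layout()
```

Passing ax= places each plot on an existing axes, which is handy when comparing several kinds for the same data.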

(2) seaborn
Common plotting functions:


# Scatter plots: analyze the relationship between two variables
sns.lmplot(x="total_bill", y="tip", data=tips) # scatter plot + fitted regression line
sns.jointplot(x="x", y="y", data=df)

# Bar plot; hue names the column that splits the bars into categories
sns.barplot(x="sex", y="survived", hue="class", data=titanic)

# Grouped box plot
sns.boxplot(x="day", y="total_bill", hue="time", data=tips)

# Histogram (distplot is deprecated since seaborn 0.11; histplot replaces it)
sns.histplot(x, kde=True, bins=20) # bins sets the number of bins

# Violin plot: a box plot combined with a KDE (density estimate)
sns.violinplot(x="total_bill", y="day", hue="time", data=tips)

# Counts per category, like value_counts in pandas
sns.countplot(x="deck", data=titanic) 

# Point plot
sns.pointplot(x="sex", y="survived", hue="class", data=titanic)

# Categorical plot split into subplots (factorplot was renamed catplot in seaborn 0.9)
sns.catplot(x="day", y="total_bill", hue="smoker", col="time", data=tips, kind="swarm")

# Pairwise relationships between the variables
sns.pairplot(iris)

Key point:
Multi-facet plotting function:
sns.FacetGrid()
For example:

sns.FacetGrid(data = titanic_df, row='Embarked', col='Sex',aspect=1.5) \
   .map(sns.pointplot, 'Pclass', 'Survived', 'Sex', hue_order=['male', 'female'],  palette='deep', ci=None)

row, col, hue: strings
Variables that define subsets of the data, which are drawn on separate facets of the grid

map(func, *args, **kwargs)
Applies a plotting function to each facet's subset of the data

{row,col,hue}_order: lists, optional

Order for the levels of the faceting variables. By default, this is the order in which the levels appear in the data or, if the variables are pandas categoricals, the category order.

Addendum:
1. pivot_table function
A pivot table; it does roughly the same job as groupby + agg

titanic_df.pivot_table(values='Survived',index='AgeBand', columns=['Sex', 'Pclass'], aggfunc=np.mean)
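To illustrate the comparison with groupby, the sketch below builds the same table both ways on a hypothetical miniature of the Titanic data. It indexes by Sex instead of AgeBand only to keep the toy data small.

```python
import pandas as pd

# Hypothetical miniature of the Titanic data.
titanic_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Pclass": [1, 1, 3, 3, 3, 1],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Survival rate per (Sex, Pclass) as a pivot table.
pivot = titanic_df.pivot_table(values="Survived", index="Sex", columns="Pclass", aggfunc="mean")

# The same numbers via groupby, with Pclass moved into the columns by unstack().
grouped = titanic_df.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")

print(pivot)
print(grouped)
```

pivot_table saves the unstack step and reads more naturally when the goal is a two-way table.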

Origin blog.csdn.net/c2250645962/article/details/96738644