A Summary of Basic pandas Operations

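Contents

- Basic DataFrame methods
  - Creating data
  - Sample data
  - Viewing the first 5 rows
  - info()
  - value_counts()
  - groupby
  - Detecting null values
  - Handling null values
  - Replacing values
- Plotting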

Basic DataFrame Methods

In [137]:

import pandas as pd
import numpy as np

Creating Data

Create a DataFrame from a list of dictionaries, where each dictionary is one row:

In [138]:

dic_data = [{'a': 1, 'b': 5},
            {'a': 2, 'b': 6},
            {'a': 3, 'b': 7}]

In [139]:

dic_data

Out[139]:

[{'a': 1, 'b': 5}, {'a': 2, 'b': 6}, {'a': 3, 'b': 7}]

In [140]:

test2 = pd.DataFrame(dic_data)

In [141]:

test2

Out[141]:

	a	b
0	1	5
1	2	6
2	3	7
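
The same frame can also be built column by column. The following is a minimal sketch rather than part of the original notebook; the variable names `records`, `columns`, `from_records` and `from_columns` are chosen here for illustration only.

```python
import pandas as pd

# Row-oriented input: one dictionary per row (the form used above).
records = [{'a': 1, 'b': 5}, {'a': 2, 'b': 6}, {'a': 3, 'b': 7}]
from_records = pd.DataFrame(records)

# Column-oriented input: one list per column.
columns = {'a': [1, 2, 3], 'b': [5, 6, 7]}
from_columns = pd.DataFrame(columns)

# Both constructions describe the same 3-row, 2-column table.
print(from_records)
print(from_columns)
```

The row-oriented form is convenient when records arrive one at a time; the column-oriented form is usually shorter when whole columns are already available.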

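Sample Data

The examples below use a dataset that ships with seaborn, a Python plotting library. The dataset, called tips, records what customers at a restaurant spent and how much they tipped, along with the payer's sex, smoking status, the day of the week, and so on. The columns are:

Column	Meaning
total_bill	Total bill amount
tip	Tip amount
sex	Sex of the payer
smoker	Smoker or not
day	Day of the week
time	Meal time (Lunch or Dinner)
size	Party size

Load the data with seaborn's load_dataset (the function downloads the file on first use, so it needs network access unless the data is already cached):

In [142]:

import seaborn as sns

In [143]:

data = sns.load_dataset('tips')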

Viewing the First 5 Rows

In [144]:

data.head()

Out[144]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

info()

In [145]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 244 non-null float64
sex 244 non-null category
smoker 244 non-null category
day 244 non-null category
time 244 non-null category
size 244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.2 KB

info() shows that the data has 244 rows and 7 columns, and that every column has 244 non-null values.
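
When only the dimensions or the column types are needed, `shape` and `dtypes` return them as values instead of printing a report. The sketch below uses a hypothetical three-row frame named `mini`, built from the first rows shown by `head()`, so that it runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical miniature of the tips data (values copied from head() above).
mini = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
    'sex': pd.Categorical(['Female', 'Male', 'Male']),
    'size': [2, 3, 3],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = mini.shape
print(n_rows, n_cols)

# dtypes lists the type of each column, as info() does.
print(mini.dtypes)
```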

value_counts()

Use value_counts() to see how the values of a column are distributed, for example the day of the week.

In [146]:

data['day'].value_counts()

Out[146]:

Sat 87
Sun 76
Thur 62
Fri 19
Name: day, dtype: int64

As shown above, Sat (Saturday) appears 87 times in the day column. Each row is one bill, so 87 of the recorded bills were paid on a Saturday.

In [147]:

data['sex'].value_counts()

Out[147]:

Male 157
Female 87
Name: sex, dtype: int64

The same method works for sex: the payer is male (Male) in 157 records and female (Female) in 87.
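
Proportions are often easier to compare than raw counts. As a minimal sketch on a hypothetical six-element series named `days` (not the real tips data), `value_counts(normalize=True)` returns the share of each value instead of its count:

```python
import pandas as pd

# Hypothetical sample of the day column.
days = pd.Series(['Sat', 'Sun', 'Sat', 'Thur', 'Sat', 'Sun'], name='day')

# Absolute counts, sorted from most to least frequent.
counts = days.value_counts()
print(counts)

# Relative frequencies; the values sum to 1.
shares = days.value_counts(normalize=True)
print(shares)
```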

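groupby

groupby splits the data into groups and computes a statistic for each group, for example:

The sum of total_bill (the bill amount) for each sex.

In [148]:

data.groupby(['sex'])['total_bill'].sum()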

Out[148]:

sex
Male 3256.82
Female 1570.95
Name: total_bill, dtype: float64

The mean of total_bill (the bill amount) for each sex.

In [149]:

data.groupby(['sex'])['total_bill'].mean()

Out[149]:

sex
Male 20.744076
Female 18.056897
Name: total_bill, dtype: float64
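
The sum and the mean can also be computed in one pass with `agg`. This is a sketch on a hypothetical five-row frame named `df`, not on the tips data, so the numbers differ from the outputs above.

```python
import pandas as pd

# Hypothetical bills, small enough to check by hand.
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female', 'Male'],
    'total_bill': [20.0, 30.0, 10.0, 15.0, 5.0],
})

# One row per group, one column per aggregation.
summary = df.groupby('sex')['total_bill'].agg(['sum', 'mean'])
print(summary)
```

The result is a DataFrame indexed by sex with the columns `sum` and `mean`, which avoids running two separate groupby calls.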

It is also possible to group by two variables and then sum tip (the tip amount).

In [150]:

data.groupby(['sex', 'day'])['tip'].sum()

Out[150]:

sex     day
Male    Thur     89.41
        Fri      26.93
        Sat     181.95
        Sun     186.78
Female  Thur     82.42
        Fri      25.03
        Sat      78.45
        Sun      60.61
Name: tip, dtype: float64

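Using the same two-variable grouping, this time summing total_bill, appending unstack() moves the inner index level (day) into the columns and returns a DataFrame.

In [151]:

data.groupby(['sex', 'day'])['total_bill'].sum().unstack()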

Out[151]:

day	Thur	Fri	Sat	Sun
sex
Male	561.44	198.57	1227.35	1269.46
Female	534.89	127.31	551.05	357.70
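
For comparison, `pivot_table` produces the same kind of wide table in a single call. The sketch below assumes a hypothetical five-row frame named `df` and checks both routes against each other; it is an illustration, not the notebook's own method.

```python
import pandas as pd

# Hypothetical bills, small enough to check by hand.
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female', 'Male'],
    'day': ['Sat', 'Sun', 'Sat', 'Sun', 'Sat'],
    'total_bill': [20.0, 30.0, 10.0, 15.0, 5.0],
})

# Route 1: group by two keys, sum, then move day into the columns.
wide = df.groupby(['sex', 'day'])['total_bill'].sum().unstack()

# Route 2: let pivot_table do the grouping and reshaping together.
pivot = df.pivot_table(index='sex', columns='day',
                       values='total_bill', aggfunc='sum')

print(wide)
print(pivot)
```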

Detecting Null Values

Use isnull().sum() to count the null values in each column:

In [152]:

data.isnull().sum()

Out[152]:

total_bill 0
tip 0
sex 0
smoker 0
day 0
time 0
size 0
dtype: int64

As shown above, the null count of every column is 0, so the data contains no null values.
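
Because the tips data has no nulls, the check is easier to follow on a frame that does. The following sketch uses a hypothetical frame named `df` with one missing value per column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
df = pd.DataFrame({
    'tip': [1.01, np.nan, 3.50],
    'size': [2, 3, np.nan],
    'day': ['Sun', 'Sun', None],
})

# Number of nulls per column.
null_counts = df.isnull().sum()
print(null_counts)

# Rows that contain at least one null.
rows_with_null = df[df.isnull().any(axis=1)]
print(rows_with_null)
```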

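Handling Null Values

If the data contains null values, the following approaches are available:

Dropping rows: remove the rows that have a null in a given column. For example, drop the rows where size is null:

In [153]:

data.dropna(subset=['size'], inplace=True)

Dropping a column (left commented out because later cells still use size; without columns= or axis=1, drop() looks for a row labelled 'size' and raises a KeyError):

In [154]:

# data.drop(columns=['size'], inplace=True)

Filling a column: for example, fill the nulls in size with 'unknown'. The result is assigned back because calling fillna(..., inplace=True) on a selected column is not guaranteed to modify data in recent pandas versions:

In [155]:

data['size'] = data['size'].fillna('unknown')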
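
The three approaches can be tried side by side on a frame that actually has nulls. This sketch assumes a hypothetical frame named `df` and returns new objects instead of using `inplace=True`; filling with the column median is an illustrative choice made here, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one null in each column.
df = pd.DataFrame({'tip': [1.0, 2.0, np.nan],
                   'size': [2.0, np.nan, 4.0]})

# Drop the rows where size is null; other nulls are left alone.
dropped = df.dropna(subset=['size'])

# Drop the size column entirely.
no_size = df.drop(columns=['size'])

# Fill nulls in size only, here with the column median (an assumption).
filled = df.fillna({'size': df['size'].median()})

print(dropped)
print(no_size)
print(filled)
```

Returning new frames keeps `df` unchanged, which makes it easy to compare the alternatives before committing to one.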

Replacing Values

In the size column, replace 2 with 50.

In [156]:

# Data before the replacement
data.head()

Out[156]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

In [157]:

data['size'] = data['size'].replace(2, 50)

In [158]:

# Data after the replacement
data.head()

Out[158]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	50
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	50
4	24.59	3.61	Female	No	Sun	Dinner	4
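
Several values can be replaced at once by passing a dictionary that maps old values to new ones. A minimal sketch on a hypothetical series named `size`; the mapping `{2: 50, 4: 40}` is made up for illustration:

```python
import pandas as pd

# Hypothetical party sizes.
size = pd.Series([2, 3, 3, 2, 4], name='size')

# Keys are the values to look for, values are their replacements.
replaced = size.replace({2: 50, 4: 40})
print(replaced.tolist())
```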

Plotting

Plotting with pandas

In [159]:

data['day'].value_counts()

Out[159]:

Sat 87
Sun 76
Thur 62
Fri 19
Name: day, dtype: int64

In [160]:

data['day'].value_counts().plot(kind='bar', rot=40)

Out[160]:

<matplotlib.axes._subplots.AxesSubplot at 0x1586738fef0>

Adding labels and annotations to the plot

In [161]:

import matplotlib.pyplot as plt

In [162]:

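# Enable Chinese text before drawing (assumes the SimHei font is installed)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.family'] = ['sans-serif']

fig = plt.figure(figsize=(6, 4))

data['day'].value_counts().plot(kind='bar', rot=40)

# Show the count above each bar
counts = data['day'].value_counts().values
for index, item in zip([0, 1, 2, 3], counts):
    plt.text(index, item, item, ha="center", va="bottom", fontsize=12)

# Set the bar names (Sat, Sun, Thur, Fri), in the order returned by value_counts()
plt.xticks([0, 1, 2, 3], ['周六', '周日', '周四', '周五'])

# Set the x- and y-axis labels ("day of week", "count")
plt.xlabel('星期')
plt.ylabel('人数')

# Set the title ("Distribution of bills by day") and its font size
plt.title('消费日期分布图', fontsize=11)

plt.show()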
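
The cell above hard-codes the bar positions `[0, 1, 2, 3]`, which only fits a column with exactly four distinct values. The sketch below derives the positions from the counts instead. It assumes a hypothetical series named `days`, English labels, and the non-interactive `Agg` backend so that it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; an assumption made for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of the day column.
days = pd.Series(['Sat', 'Sun', 'Sat', 'Thur', 'Sat', 'Sun', 'Fri'])
counts = days.value_counts()

fig, ax = plt.subplots(figsize=(6, 4))
counts.plot(kind='bar', rot=40, ax=ax)

# Bar i is drawn at x = i, so enumerate() yields the positions
# for any number of categories.
for position, value in enumerate(counts.values):
    ax.text(position, value, str(value),
            ha='center', va='bottom', fontsize=12)

ax.set_xlabel('day')
ax.set_ylabel('count')
```

The tick labels are taken from the index of `counts`, so they stay aligned with the bars even if the order of the categories changes.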

seaborn: bar charts

Method 1: sns.countplot

In [163]:

sns.countplot(data=data, x='sex')

Out[163]:

<matplotlib.axes._subplots.AxesSubplot at 0x15872ced390>

In [164]:

sns.countplot(data=data, x='sex', hue='time')

Out[164]:

<matplotlib.axes._subplots.AxesSubplot at 0x158672c7c88>

Method 2: sns.barplot()

In [165]:

data['sex'].value_counts()

Out[165]:

Male 157
Female 87
Name: sex, dtype: int64

In [166]:

sns.barplot(x=[0, 1], y=data['sex'].value_counts())

Out[166]:

<matplotlib.axes._subplots.AxesSubplot at 0x15872c4a0f0>

seaborn: KDE plot

In [167]:

# shade was renamed to fill in seaborn 0.11; on newer versions use fill=False
sns.kdeplot(data['tip'], shade=False)

Out[167]:

<matplotlib.axes._subplots.AxesSubplot at 0x15872b8f860>

seaborn: regplot (regression plot)

In [168]:

sns.regplot(data=data, x='total_bill', y='tip')

Out[168]:

<matplotlib.axes._subplots.AxesSubplot at 0x15866fe6ef0>

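seaborn: distplot (histogram)

In [169]:

# distplot is deprecated since seaborn 0.11; histplot(data['tip'], kde=True) is the closest replacement
sns.distplot(data['tip'], color='red')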

Out[169]:

<matplotlib.axes._subplots.AxesSubplot at 0x15867004a58>

seaborn: boxplot (box plot)

In [173]:

sns.boxplot(data["tip"], orient="v")

Out[173]:

<matplotlib.axes._subplots.AxesSubplot at 0x15866f2a390>

seaborn: multiple subplots

In [170]:

fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 4))

# Draw the first plot on ax1
sns.countplot(data=data, x='sex', ax=ax1)

# Draw the second plot on ax2
sns.distplot(data['tip'], color='red', ax=ax2)

Out[170]:

<matplotlib.axes._subplots.AxesSubplot at 0x1586a4e37b8>

In [171]:

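# Enable Chinese text before drawing (assumes the SimHei font is installed)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.family'] = ['sans-serif']

fig, [ax1, ax2, ax3] = plt.subplots(1, 3, figsize=(16, 4))

# Plot 1: number of bills by sex, split by time
sns.countplot(data=data, x='sex', hue='time', ax=ax1)
# Show the exact count above each bar.
# Order of counts (see Out[172]): Male/Dinner, Male/Lunch, Female/Dinner, Female/Lunch
counts = data['time'].groupby(data['sex']).value_counts().values
count1 = counts[[1, 3]]  # Lunch counts (left bar of each pair)
count2 = counts[[0, 2]]  # Dinner counts (right bar of each pair)
for index, item1, item2 in zip([0, 1], count1, count2):
    ax1.text(index - 0.2, item1 + 0.05, '%.0f' % item1, ha="center", va="bottom", fontsize=12)
    ax1.text(index + 0.2, item2 + 0.05, '%.0f' % item2, ha="center", va="bottom", fontsize=12)

# Plot 2: number of bills by sex
sns.barplot(x=[0, 1], y=data['sex'].value_counts(), ax=ax2)

# Plot 3: distribution of tips
sns.distplot(data['tip'], color='red', ax=ax3)

# Set the bar names ("male", "female")
ax1.set_xticklabels(['男性', '女性'])
ax2.set_xticklabels(['男性', '女性'])

# Set the legend labels
ax1.legend(labels=['lunch', 'dinner'])

# Set the x- and y-axis labels ("sex, plot 1", "count, plot 1", and likewise for plot 2)
ax1.set_xlabel('性别图1')
ax1.set_ylabel('人数图1')

ax2.set_xlabel('性别图2')
ax2.set_ylabel('人数图2')

# Set the titles ("sex by meal time", "sex", "tips") and their font size
ax1.set_title('性别和时段分布图', size=11)
ax2.set_title('性别分布图', size=11)
ax3.set_title('小费分布图', size=11)

plt.show()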

In [172]:

data['time'].groupby(data['sex']).value_counts()

Out[172]:

sex     time
Male    Dinner    124
        Lunch      33
Female  Dinner     52
        Lunch      35
Name: time, dtype: int64
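
In cell In [171] the lunch and dinner counts are picked out of this flat result by position (`counts[[1, 3]]` and `counts[[0, 2]]`), which silently breaks if the row order changes. One alternative, sketched here on a hypothetical five-row frame named `df`, is to build a sex-by-time table with `pd.crosstab` and select the counts by label:

```python
import pandas as pd

# Hypothetical sample of the sex and time columns.
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Female'],
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner'],
})

# Rows are sexes, columns are meal times, cells are counts.
table = pd.crosstab(df['sex'], df['time'])
print(table)

# Select by label instead of by position.
lunch = table['Lunch']
dinner = table['Dinner']
print(lunch['Male'], dinner['Male'])
```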