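Python 3: Basic Pandas Usage, Part 3 (DataFrame operations: assignment, arithmetic, statistics, plotting, and I/O)

1. Preface

This post covers arithmetic and statistical operations on a pandas DataFrame, followed by plotting and file I/O. It serves as my own notes for review.

2. A demo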

## Arithmetic on a DataFrame with pandas
import numpy as np
import pandas as pd

# Row labels 第1天..第5天 are days 1 to 5; column labels 第1家店..第5家店 are shops 1 to 5
shops = pd.DataFrame(np.random.randint(1, 10, (5, 5)), index=["第{}天".format(i + 1) for i in range(5)],
                     columns=["第{}家店".format(i + 1) for i in range(5)])
print("Sales data:\n{}".format(shops))

print("Result of dividing by 2:\n{}".format(shops / 2))
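# Shop 1 had a sales boom, so its figures become 150% of the original
# shops["第1家店"] = shops["第1家店"] * 1.5  # this column assignment is valid on its own
# shops.loc["第3天"]["第4家店"] = 15  # but this chained assignment afterwards triggers:
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame

print("Sales data:\n{}".format(shops))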

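# The day-3 figure for shop 4 was recorded incorrectly; it should be 15
shops.loc["第3天", "第4家店"] = 15
print("Sales data:\n{}".format(shops))
# The chained form shops.loc["第3天"]["第4家店"] = 17 assigns to an intermediate object
# that may be a copy, so the single .loc[row, column] call is used here as well
shops.loc["第3天", "第4家店"] = 17
print("Sales data:\n{}".format(shops))

# Assignments should consistently use the .loc[row, column] form; mixing in chained indexing leads to warnings or lost updates
# Assigning a scalar to a row label sets the whole row
shops.loc["第3天"] = 13
print("Sales data:\n{}".format(shops))

# shops.loc["第4家店"] = 14
# print("Sales data:\n{}".format(shops))
# Pitfall: "第4家店" is a column label, not a row label, so .loc appends a new row with index "第4家店" filled with 14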

# Total profit of shop 1
print("Total profit of shop 1: {}".format(shops["第1家店"].sum()))

# Daily sum of the profits of shop 1 and shop 3
one_shop = shops["第1家店"]
two_shop = shops["第3家店"]
print("Daily profit sum of shop 1 and shop 3:\n{}".format(one_shop.add(two_shop)))
print("Sales data:\n{}".format(shops))
# Daily difference between the profits of shop 1 and shop 3
print("Daily profit difference of shop 1 and shop 3:\n{}".format(one_shop.sub(two_shop)))

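print("Which profits are greater than 4:\n{}".format(shops > 4))
# The result is a boolean DataFrame,
# which can be used as a mask in the same way as in numpy
print("Profits greater than 4:\n{}".format(shops[shops > 4]))
# Every value that fails the condition becomes NaN
# Set every value less than or equal to 4 to 4
shops[shops <= 4] = 4
print("Sales data after raising values <= 4 to 4:\n{}".format(shops))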

# Output the profits of shop 2 that are greater than 5 and less than 15
print("============== Shop 2 profits greater than 5 and less than 15 ===============")
two_shop = shops["第2家店"]
print("Shop 2 data:\n{}".format(two_shop))
boolean_array = ((shops["第2家店"] > 5) & (shops["第2家店"] < 15))
print(boolean_array)
print("=================================================")
print("Shop 2 profits greater than 5 and less than 15:\n{}".format(two_shop[boolean_array]))

# Use the query method for the same kind of logical filtering (note the lower bound here is 4)
print("Result of the query:\n{}".format(shops.query("第2家店>4 & 第2家店<15")))
# indexs = [i for i in shops.index]
# print(indexs)
## Output the top sales figures of the first 2 days

# isin checks, element by element, whether each value is one of 5, 7, 9
print("Which sales figures are equal to 5, 7 or 9:\n{}".format(shops.isin([5, 7, 9])))
print("Sales figures equal to 5, 7 or 9:\n{}".format(shops[shops.isin([5, 7, 9])]))

# Take the first two rows
two_days_records = shops.head(2)
print("Sales of the first two days:\n{}".format(two_days_records))
# print("Top 2 data:\n{}".format(two_days_records.sort_index(axis=1,ascending=False)))
# print("Top 2 data:\n{}".format(two_days_records.sort_index(axis=0,ascending=False)))

# Output the maximum value of each of the first two days
print("Maximum per day for the first two days:\n{}".format(two_days_records.max(axis=1)))

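# Use the describe method
print("Calling describe =============")
print(shops.describe())
# describe returns count together with other summary statistics (mean, std, min, quartiles, max)

# Output the index labels of the maximum and the minimum values
print(shops)
print("Index of the maximum:\n{}".format(shops.idxmax()))
print("Index of the minimum:\n{}".format(shops.idxmin()))
# The result is a Series: its index holds the column labels, and each value is the row label where that column's maximum (or minimum) occurs

# Cumulative sum of one column
print(shops["第3家店"].cumsum())  # the running total of this shop's profit, day by day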

# Plot the data; pandas plotting is built on top of matplotlib
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['KaiTi']  # so that Chinese labels render correctly
plt.rcParams['axes.unicode_minus'] = False
# Because pandas integrates matplotlib, Series and DataFrame objects can be plotted directly
shops["第3家店"].cumsum().plot()
shops.plot()
plt.show()

# Custom computation: use apply() with a function, here a lambda
print("Custom computation with apply (max - min per column):\n{}".format(shops.apply(lambda x: x.max() - x.min())))
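To make the shape of these statistics results easier to see, here is a minimal sketch on a small fixed table. The labels day_1..day_3 and shop_1..shop_3 and the numbers are made up for illustration; they are not the random data of the demo.

```python
import pandas as pd

# Hypothetical, fixed sales table (not the random data used in the demo).
sales = pd.DataFrame(
    [[4, 6, 7], [9, 5, 7], [9, 2, 6]],
    index=["day_1", "day_2", "day_3"],
    columns=["shop_1", "shop_2", "shop_3"],
)

# idxmax returns a Series with one entry per column; each value is the
# row label of that column's first maximum.
best_day = sales.idxmax()
print(best_day)

# cumsum keeps the length of the column and accumulates from top to bottom.
running_total = sales["shop_3"].cumsum()
print(running_total)

# apply with the default axis=0 calls the function once per column.
spread = sales.apply(lambda col: col.max() - col.min())
print(spread)
```

Passing axis=1 to apply would call the function once per row instead, which gives the spread per day rather than per shop.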

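One problem came up here:

shops["第1家店"] = shops["第1家店"] * 1.5
shops.loc["第3天"]["第4家店"] = 15
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

The column assignment on the first line is not the error. The warning comes from the chained indexing on the second line: once the first column holds floats, the DataFrame has mixed dtypes, shops.loc["第3天"] typically returns a copy of the row, and the value is then set on that copy instead of on shops. Writing the assignment as a single call, shops.loc["第3天", "第4家店"] = 15, avoids the problem.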
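The safe pattern can be sketched as follows. The table, its labels (day_1, shop_1, and so on) and its numbers are hypothetical and only stand in for the demo data.

```python
import pandas as pd

# Hypothetical, fixed sales table (not the random data used in the demo).
sales = pd.DataFrame(
    [[4, 6, 7], [9, 5, 7], [9, 2, 6]],
    index=["day_1", "day_2", "day_3"],
    columns=["shop_1", "shop_2", "shop_3"],
)

# Replacing a whole column is a plain assignment; the column becomes float64,
# so the frame now has mixed dtypes.
sales["shop_1"] = sales["shop_1"] * 1.5

# One .loc call with both labels writes into `sales` itself, so there is
# no intermediate row object that could be a copy.
sales.loc["day_3", "shop_2"] = 15

print(sales)
```

The single .loc call does not depend on whether the frame has one dtype or several, which is why it is the form to prefer over sales.loc["day_3"]["shop_2"] = 15.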


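Summary:

1. A DataFrame supports arithmetic and logical operations directly, and the usage is largely the same as in numpy

2. A row, a column, or a single cell can be assigned with = (for a single cell, use one .loc[row, column] call)

3. DataFrame.sub(), add(), and the related methods provide the same calculations in method form

4. query() can simplify filtering data

5. DataFrame.isin() checks whether values belong to a given set

6. Calling plot on a pandas object draws the chart directly

7. Custom computations are possible with DataFrame.apply()

8. Statistical operations like those in numpy (sum, max, cumsum, and so on) can be used directly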
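As a compact illustration of points 1 and 3 to 5, here is a small sketch on a fixed table. The labels and numbers are hypothetical, chosen so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical, fixed sales table (not the random data used in the demo).
sales = pd.DataFrame(
    [[4, 6, 7], [9, 5, 7], [9, 2, 6]],
    index=["day_1", "day_2", "day_3"],
    columns=["shop_1", "shop_2", "shop_3"],
)

# Boolean mask combined with &; each comparison needs its own parentheses.
mask = (sales["shop_2"] > 4) & (sales["shop_2"] < 15)
by_mask = sales[mask]

# The same filter written as a query string.
by_query = sales.query("shop_2 > 4 and shop_2 < 15")

# isin gives a boolean table; summing it twice counts the matching cells.
hits = int(sales.isin([5, 7, 9]).sum().sum())

# add is the method form of +.
total = sales["shop_1"].add(sales["shop_3"])

print(by_query)
print(hits)
print(total)
```

The mask and the query string select the same rows; the query form is mostly a readability choice when several conditions are combined.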

3. Reading and writing files

3.1 Reading a CSV file

A file test.csv exists with the following content:

,第1家店,第2家店,第3家店,第4家店,第5家店
第1天,4,6,7,3,7
第2天,9,5,7,6,8
第3天,9,2,6,7,5
第4天,1,1,7,3,4
第5天,4,3,5,4,3

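# Read a CSV file with pandas
import pandas as pd

# When read with the defaults, the label column is loaded as an ordinary column named "Unnamed: 0" and the rows get a default integer index 0-4
# index_col=False tells pandas not to use the first column as the index; index_col=0 uses the first column of the file as the index
# usecols=[...] restricts reading to the listed columns
pd_DataFrame = pd.read_csv("test.csv", index_col=0)
print(pd_DataFrame)

1. When reading, index_col selects the index column and usecols selects which columns to read

2. CSV files are read with pd.read_csv()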
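The effect of index_col and usecols can be sketched without a file on disk by reading from an in-memory string. The CSV text below is a hypothetical, shortened stand-in for test.csv with English labels.

```python
import io

import pandas as pd

# Hypothetical CSV text with an unnamed first column, shaped like test.csv.
csv_text = ",shop_1,shop_2,shop_3\nday_1,4,6,7\nday_2,9,5,7\n"

# Defaults: the label column becomes a data column called "Unnamed: 0"
# and the rows get the integer index 0, 1.
default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column of the file becomes the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# usecols: only the listed columns are read.
two_cols = pd.read_csv(io.StringIO(csv_text), usecols=["shop_1", "shop_3"])

print(default)
print(with_index)
print(two_cols)
```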

3.2 Writing a CSV file

# Write a CSV file
import numpy as np
import pandas as pd

indexs = ["第{}天".format(i + 1) for i in range(5)]
columns = ["第{}家店".format(i + 1) for i in range(5)]
pd_dataFrame = pd.DataFrame(np.random.randint(1, 10, (5, 5)), index=indexs, columns=columns)
print("Generated two-dimensional data:\n{}".format(pd_dataFrame))

# to_csv saves the data; the index and header parameters control whether row and column labels are written, and the columns parameter selects which columns to save
# The mode parameter chooses between write mode ("w") and append mode ("a"); with header=False a second, appended write does not repeat the header row
pd_dataFrame.to_csv("test.csv")

Data is written to a file with DataFrame.to_csv()
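The append mode mentioned in the comments can be tried out as follows. This is only a sketch: the frame, the file name sales.csv and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small frame used only for this example.
frame = pd.DataFrame({"shop_1": [4, 9], "shop_2": [6, 5]}, index=["day_1", "day_2"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sales.csv")
    # First write: default mode "w", header row included.
    frame.to_csv(path)
    # Second write: append the same rows without repeating the header.
    frame.to_csv(path, mode="a", header=False)
    combined = pd.read_csv(path, index_col=0)

# Without a path, to_csv returns the CSV text; index=False drops the row labels.
no_index_text = frame.to_csv(index=False)

print(combined)
print(no_index_text)
```

If header=False were left out of the second write, the header line would appear again in the middle of the file and be read back as a data row.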

3.3 Reading and writing JSON files

The script below generates the test file test.json and then reads it back.

# Store the data as a JSON file
import pandas as pd
import numpy as np

np_random = np.random.randint(1, 100, (10000, 5))
columns = ["{}_shop".format(i + 1) for i in range(5)]
indexs = ["{}_good".format(i + 1) for i in range(10000)]
pd_dataFrame = pd.DataFrame(np_random, index=indexs, columns=columns)
# pd_dataFrame.to_excel("test.xlsx")  # writing Excel requires an additional Excel module
# pd_dataFrame.to_json("test.json",orient="records",lines=True)
pd_dataFrame.to_json("test.json",orient="records")
# lines defaults to False; with orient="records" the file is then a single JSON array of comma-separated records
# With lines=True each record is written on its own line, with no enclosing array and no commas between records
# The index given above does not appear in the output, because orient="records" does not write it
print(pd_dataFrame.head(5))
# Finally, read the file back
pd_json = pd.read_json("test.json",orient="records",lines=False)
print(pd_json.head(5))

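1. lines controls whether records are read or written one per line, and it must match between writing and reading; orient needs to be specified consistently as well

2. Reading uses pd.read_json() and writing uses DataFrame.to_json()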
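The difference between the two lines settings, and the fate of the index, can be checked on a tiny frame. The frame and its labels are hypothetical, and the JSON is kept in memory instead of in test.json.

```python
import io
import json

import pandas as pd

# Hypothetical small frame used only for this example.
frame = pd.DataFrame({"shop_1": [4, 9], "shop_2": [6, 5]}, index=["good_1", "good_2"])

# orient="records": one JSON array of row objects; the index is not written.
as_array = frame.to_json(orient="records")

# lines=True: one JSON object per line, no enclosing array.
as_lines = frame.to_json(orient="records", lines=True)

# orient="index": the row labels become the top-level keys, so they survive.
with_index = frame.to_json(orient="index")

# Reading back must use the same orient and lines settings as writing.
from_lines = pd.read_json(io.StringIO(as_lines), orient="records", lines=True)
from_index = pd.read_json(io.StringIO(with_index), orient="index")

print(as_array)
print(as_lines)
print(from_index)
```

When the row labels matter, orient="index" (or writing the index as a regular column first) is one way to keep them; orient="records" always drops them.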

3.5 Reading and writing HDF5 files

1. Reading an HDF5 file

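# Read the previously written HDF5 file, test.h5
import pandas as pd

pd_dataFrame = pd.read_hdf("test.h5")
print(pd_dataFrame)

# Read the file again, this time specifying the key
pd_dataFrame2 = pd.read_hdf("test.h5", key="sells")
print(pd_dataFrame2)
# Both reads succeed
# An HDF5 file stores data as key: value pairs; when the file contains only one key, the key does not have to be given when reading

# If the HDF5 file contains several keys, reading without a key raises an error
# Write a second key into test.h5
pd_dataFrame2.to_hdf("test.h5",key="sells2")
# Reading the file again without a key now raises an error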

2. Writing an HDF5 file

# Write an HDF5 file with pandas. HDF5 (Hierarchical Data Format, version 5) is a binary file format for storing large datasets
# Reading and writing HDF5 data requires the tables (PyTables) library; without it an error is raised
# The tables version used here is 3.6.1
import pandas as pd
import numpy as np

columns = ["第{}家".format(i + 1) for i in range(10)]
indexs = ["第{}家".format(i + 1) for i in range(100)]
pd_dataFrame = pd.DataFrame(np.random.randint(1, 100, (100, 10)), index=indexs, columns=columns)
# print(pd_dataFrame.head(10))
# The data was generated successfully
# Now save the data
# pd_dataFrame.to_hdf("test.h5")
# TypeError: to_hdf() missing 1 required positional argument: 'key'  (the key must be given when saving)
pd_dataFrame.to_hdf("test.h5",key="sells")

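Summary:

1. When reading, the key can be omitted if the HDF5 file contains only one key; if it contains more than one, the key must be given, otherwise an error is raised

2. When writing an HDF5 file the key must always be given, otherwise an error is raised

3. HDF5 files are read with pd.read_hdf() and written with DataFrame.to_hdf()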
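The key behaviour summarised above can be reproduced in one short script. This is a sketch under a few assumptions: the frame, the file name sales.h5 and the helper hdf_roundtrip are made up for the example, and the demo only runs when the optional tables (PyTables) package is installed.

```python
import importlib.util
import os
import tempfile

import pandas as pd


def hdf_roundtrip():
    """Write two keys into one HDF5 file and read them back (hypothetical helper)."""
    frame = pd.DataFrame({"shop_1": [4, 9], "shop_2": [6, 5]})
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sales.h5")
        # mode="w" creates a fresh file that holds a single key.
        frame.to_hdf(path, key="sells", mode="w")
        # With only one key in the file, the key may be omitted when reading.
        single = pd.read_hdf(path)
        # The default mode "a" adds a second key to the same file.
        frame.to_hdf(path, key="sells2")
        try:
            pd.read_hdf(path)
            needs_key = False
        except ValueError:
            # With several keys, pandas refuses to guess which one to read.
            needs_key = True
        second = pd.read_hdf(path, key="sells2")
        # HDFStore lists the keys stored in the file.
        with pd.HDFStore(path, mode="r") as store:
            keys = sorted(store.keys())
    return single, second, needs_key, keys


# PyTables is an optional dependency, so the demo runs only when it is available.
if importlib.util.find_spec("tables") is not None:
    single, second, needs_key, keys = hdf_roundtrip()
    print(keys, needs_key)
```

Listing the keys through pd.HDFStore before reading is a simple way to find out which key to pass to pd.read_hdf() when a file comes from somewhere else.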

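4. Summary

1. Take care with the assignment pitfalls described above

2. The operators used on numpy arrays work the same way on a pandas DataFrame

3. plot draws charts directly, but matplotlib is needed to show them

4. For I/O, pay attention to how each format is read and written; every format is different

The above is purely my personal understanding; if you find any problems, please contact me!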
