Using the pandas module in Python

pandas is a Python library built on NumPy, used mainly for data analysis and related tasks. It can read data, transform data, and even draw plots.
This article mainly records commonly used operations. The data set used in this article comes from the Hejing Community:
https://www.kesci.com/mw/dataset/5ee30becb772f5002d75a965/file

Read data

pandas can read many different types of data.

data = pd.read_csv("path")  # read a CSV file from the given path
data = pd.read_csv("path")[["line1","line2", ...]]  # keep only the specified columns

pd.read_json("path")

Reads a JSON file. If the file stores one JSON object per line (JSON Lines format), pass lines=True.
Example: pd.read_json("../data/user.json",lines=True)

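pd.read_table("path")

Reads a general delimited text file; the default separator is "\t".
Example: pd.read_table("../data/text.txt")
pd.read_table("../data/text.txt",header=0,names=["a"])
header gives the row number that holds the column names, and names sets the column names explicitly. When both are passed, the file's own header row is replaced by names.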
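
The three readers above can be tried without any files on disk. The following is a minimal sketch that assumes small in-memory strings stand in for the files; the column names and values are invented for illustration and are not from the article's data set.

```python
import io

import pandas as pd

# Hypothetical file contents, kept in memory so the example is self-contained.
csv_text = "name,age,city\nAnn,18,Paris\nBob,25,Rome\n"
json_lines_text = '{"name": "Ann", "age": 18}\n{"name": "Bob", "age": 25}\n'
tsv_text = "Ann\t18\nBob\t25\n"

# usecols loads only the listed columns instead of reading all and subsetting later.
csv_df = pd.read_csv(io.StringIO(csv_text), usecols=["name", "age"])

# lines=True parses one JSON object per line (JSON Lines).
json_df = pd.read_json(io.StringIO(json_lines_text), lines=True)

# header=None says the file has no header row; names supplies the column names.
table_df = pd.read_table(io.StringIO(tsv_text), header=None, names=["name", "age"])

print(csv_df)
print(json_df)
print(table_df)
```

Passing usecols is an alternative to the read-then-subset pattern shown above; it avoids loading columns that are discarded immediately.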

data.loc[row range, column range]

Selects rows and columns by label; data2 here is a DataFrame that was read earlier. Both ends of a label slice are included.
data2.loc[0:3,"朝向":"建筑年代"]

data.iloc[row index, column index]

Works like loc, but selects rows and columns by integer position; the end of a position slice is excluded.
data2.iloc[0:1,:]
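
The difference between label slices and position slices is easy to miss, so here is a small sketch on an invented DataFrame (the column names a, b, c are placeholders, not columns from the article's data).

```python
import pandas as pd

# Toy data with the default integer index 0..4.
df = pd.DataFrame(
    {"a": [10, 20, 30, 40, 50], "b": [1, 2, 3, 4, 5], "c": list("vwxyz")}
)

# loc slices by label and includes both ends: rows 0, 1, 2, 3 and columns a, b.
by_label = df.loc[0:3, "a":"b"]

# iloc slices by position and excludes the end: rows 0, 1, 2 and columns 0, 1.
by_position = df.iloc[0:3, 0:2]

print(by_label)
print(by_position)
```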

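data.query("expression")

Filters rows with a boolean expression.
data2.query("楼层 == '高层'")
Comparison operators such as <, > and == can be combined with the logical operators | (or) and & (and).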
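
As an illustration of combining conditions in query, the sketch below uses an invented table with hypothetical columns floor and price. The @ prefix, which refers to a local Python variable inside the expression, is a pandas feature not mentioned in the original text.

```python
import pandas as pd

# Invented listings; the column names are placeholders for illustration.
houses = pd.DataFrame(
    {"floor": ["high", "mid", "low", "high"], "price": [300, 250, 180, 420]}
)

min_price = 200

# & means "and"; @min_price reads the local variable defined above.
high_and_pricey = houses.query("(floor == 'high') & (price > @min_price)")

# | means "or".
low_or_expensive = houses.query("(floor == 'low') | (price > 400)")

print(high_and_pricey)
print(low_or_expensive)
```

The parentheses around each condition are optional in query, but they keep the expression valid if it is later copied into ordinary boolean indexing.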

data.isin([value1,value2])

Tests whether each value appears in the given list; use the result as a mask to keep the matching rows.
data2[data2["楼层"].isin(["高层","中层"])]
data[data["age"].isin([18])].head()
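
A short sketch of isin as a boolean mask, on invented data. Negating the mask with ~ to keep the non-matching rows is an addition not shown in the original.

```python
import pandas as pd

# Invented data for illustration.
people = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cat", "Dan"], "age": [18, 25, 18, 40]}
)

# isin returns a boolean Series: True where the value is in the list.
mask = people["age"].isin([18, 40])

selected = people[mask]    # rows whose age is 18 or 40
excluded = people[~mask]   # ~ inverts the mask, keeping all other rows

print(selected)
print(excluded)
```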

Data processing

After the data is read, it usually needs a series of cleaning and conversion steps before it can be analyzed. This is the key part of the workflow.

Data processing includes conversion, modification, and merging.

Conversion

Conversion here mainly means changing data types and data formats.

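pd.DataFrame(data)

Converts a data structure (for example a dict or a list of records) into a DataFrame.

df.dtypes / data.info()

Shows the data type of each column / a detailed summary of the data.
name    object		= str
age      int64		= int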
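
To see the two steps together, the following sketch builds a DataFrame from a list of invented records and then inspects it. The exact type shown for text columns can differ between pandas versions, so only the numeric columns are commented on here.

```python
import pandas as pd

# Invented records; each dict becomes one row.
records = [
    {"name": "Ann", "age": 18, "height": 1.62},
    {"name": "Bob", "age": 25, "height": 1.80},
]

df = pd.DataFrame(records)

# dtypes lists one type per column: age is an integer type, height a float type.
print(df.dtypes)

# info() prints the row count, column names, non-null counts, and memory usage.
df.info()
```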

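astype("type")

Converts to the given data type. Note that it returns a new object and does not modify the original in place.
data["age"].astype("object")
To change the original, assign the result back:
data["age"] = data["age"].astype("object")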
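
The point that astype returns a new object can be checked directly. This is a minimal sketch on invented data; converting several columns at once with a dict is an extra usage not covered in the original.

```python
import pandas as pd

# Invented data: age was read as text, score as float64.
df = pd.DataFrame({"age": ["18", "25", "40"], "score": [1.5, 2.0, 3.5]})

# astype returns a new Series; df itself is untouched at this point.
converted = df["age"].astype("int64")
first_age_before = df["age"].iloc[0]   # still the text "18"

# Assign the result back to change the column in df.
df["age"] = converted

# A dict converts selected columns of the whole DataFrame in one call.
df = df.astype({"score": "float32"})

print(df.dtypes)
```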

groupby("line1")

Groups rows by a column.
food.groupby("星级")
Note that grouping alone shows no result; it needs to be combined with an aggregation function to produce output.

print(food.groupby("星级").size())
星级
五星商户      33
准五星商户    132
准四星商户      3
四星商户     131
dtype: int64

To process each group separately, iterate over the groupby object:
for index, group in food.groupby(by='星级'):
    # print each star-rating group in turn
    print(index)
    print(group)
    print('\n')
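
Besides size(), other aggregations can be applied to the grouped data. The sketch below uses an invented table with hypothetical rating and price columns standing in for the article's "星级" and "人均价格".

```python
import pandas as pd

# Invented shop data for illustration.
shops = pd.DataFrame(
    {
        "rating": ["5-star", "4-star", "5-star", "4-star", "4-star"],
        "price": [120, 60, 200, 80, 70],
    }
)

# Number of rows in each group.
counts = shops.groupby("rating").size()

# Mean of one column per group.
mean_price = shops.groupby("rating")["price"].mean()

# Several aggregations at once; the result has one column per function.
summary = shops.groupby("rating")["price"].agg(["min", "max", "mean"])

print(counts)
print(mean_price)
print(summary)
```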

Modification

To modify a single column, assign to it directly:

data.column_name = value

To modify every value there are several approaches. The first is to traverse the data and update it row by row.

iterrows() iterates over the DataFrame as (index, Series) pairs
for index, row in food.iterrows():
    food.loc[index,"人均价格"] = row["人均价格"].replace("元","")

apply(f) runs the given function over the data and collects the function's return values into a new object. Set axis=1 so that each call receives one row.

Define a function that takes a row, strips the "元" suffix from the column value, and returns it as an integer
def f1(row):
    return int(row["人均价格"].replace("元",""))

apply returns the new values; food itself is not changed
data = food.apply(f1,axis=1)
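
The effect of the axis argument is easier to see on a tiny table. This sketch reuses the "人均价格" column name from the article with invented values; the reviews column and the helper names are hypothetical.

```python
import pandas as pd

# Invented rows; "人均价格" (average price per person) carries a "元" suffix.
food = pd.DataFrame(
    {"name": ["A", "B"], "人均价格": ["35元", "120元"], "reviews": [10, 4]}
)


def strip_unit(row):
    """Receive one row (a Series) and return its price as an int."""
    return int(row["人均价格"].replace("元", ""))


# axis=1: the function is called once per row.
food["price_num"] = food.apply(strip_unit, axis=1)

# axis=0 (the default): the function is called once per column.
column_max = food[["price_num", "reviews"]].apply(lambda col: col.max(), axis=0)

print(food)
print(column_max)
```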

map(f) is similar to apply, except that it works on a single column (a Series) and calls the function on each value.

def f2(value):
    return int(value.replace("元",""))

data = food["人均价格"].map(f2)
The same thing can be written more briefly with a lambda
data = food["人均价格"].map(lambda x:int(x.replace("元","")))

There are many other ways to do this, which are not covered one by one here.
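
One such alternative is the vectorized string accessor, which handles the whole column at once without an explicit loop or function. This is a sketch with invented values; the choice of pd.to_numeric with errors="coerce" for messy input is a suggestion, not something the original article uses.

```python
import pandas as pd

# Invented prices with the "元" suffix, as in the article's column.
food = pd.DataFrame({"人均价格": ["35元", "120元", "8元"]})

# .str.replace works on every value; regex=False treats "元" as plain text.
food["price_num"] = (
    food["人均价格"].str.replace("元", "", regex=False).astype("int64")
)

# If some values are not numeric, to_numeric with errors="coerce"
# turns them into NaN instead of raising an error.
messy = pd.Series(["35元", "unknown", "8元"])
parsed = pd.to_numeric(messy.str.replace("元", "", regex=False), errors="coerce")

print(food)
print(parsed)
```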

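Merging

Merging combines two data sources based on a shared column.

merge(df1,df2)

pd.merge(df1, df2, on="column name", how="left/right")
on: the name of the column to join on; it must exist in both data sources (use left_on and right_on when the names differ)
how: the join type (inner by default)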

Both data sources contain the column "店铺ID"
food = pd.read_csv("../../data/A.csv")
address = pd.read_csv("../../data/address.csv")

Inner join
result = pd.merge(food,address,on="店铺ID")
Right join
result = pd.merge(food,address,how="right",on="店铺ID")
Left join
result = pd.merge(food,address,how="left",on="店铺ID")
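
The join types differ only in which keys survive, which a tiny example makes visible. The sketch below assumes two invented tables sharing the "店铺ID" (shop ID) column; the other column names, the shop_id rename, and the use of indicator=True are illustrative additions.

```python
import pandas as pd

# Invented tables: IDs 2 and 3 appear in both, 1 only in food, 4 only in address.
food = pd.DataFrame({"店铺ID": [1, 2, 3], "name": ["A", "B", "C"]})
address = pd.DataFrame({"店铺ID": [2, 3, 4], "district": ["North", "South", "East"]})

inner = pd.merge(food, address, on="店铺ID")               # keys in both tables
left = pd.merge(food, address, on="店铺ID", how="left")    # every key from food
right = pd.merge(food, address, on="店铺ID", how="right")  # every key from address

# indicator=True adds a _merge column saying where each row came from.
outer = pd.merge(food, address, on="店铺ID", how="outer", indicator=True)

# When the key columns have different names, use left_on and right_on.
address_renamed = address.rename(columns={"店铺ID": "shop_id"})
by_different_names = pd.merge(
    food, address_renamed, left_on="店铺ID", right_on="shop_id"
)

print(inner)
print(left)
print(right)
print(outer)
```

Rows without a match on the other side get NaN in the columns that come from that side, as the left and right results show.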

Data visualization

pandas wraps matplotlib to provide built-in methods for data visualization.

plot() is the plotting method in pandas.

Common parameters:
x : column to use for the x axis
y : column to use for the y axis
kind : type of plot
    'line' : line plot (default)
    'bar' : vertical bar plot
    'barh' : horizontal bar plot
    'hist' : histogram
    'box' : box plot
    'kde' : kernel density estimation plot, a smoothed density curve that is often overlaid on a histogram
    'density' : same as 'kde'
    'area' : area plot
    'pie' : pie plot
    'scatter' : scatter plot (DataFrame only; requires both x and y column names)
    'hexbin' : hexagonal binning plot (DataFrame only)
grid : whether to show grid lines

Examples
	food["人均价格"].plot(kind="line")
	food["人均价格"][0:10].plot(kind="line")
	The remaining variations only change the data being plotted:
	food.sort_values(by="人均价格",ascending=False)["人均价格"][0:10].plot(kind="bar",grid=True)
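
For a version that runs end to end, here is a sketch that assumes matplotlib is installed and uses invented data with English column names. Chinese labels such as "人均价格" may render as empty boxes unless a font with CJK glyphs is configured. The Agg backend and the in-memory buffer are choices made so the example runs without a display or output file.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window is opened

import matplotlib.pyplot as plt
import pandas as pd

# Invented shop prices, already cleaned to numbers.
food = pd.DataFrame({"name": list("ABCDE"), "price": [35, 120, 8, 64, 90]})

# Sort by price and keep the three most expensive shops.
top3 = food.sort_values(by="price", ascending=False).head(3)

# On a DataFrame, x and y pick the columns for the two axes.
ax = top3.plot(kind="bar", x="name", y="price", grid=True, legend=False)
ax.set_ylabel("price")

# Render to an in-memory PNG; pass a file name instead to save to disk.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
plt.close(ax.figure)
```

plot() returns a matplotlib Axes object, so labels, titles, and limits can be adjusted with the usual matplotlib calls before saving.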

Origin blog.csdn.net/lihao1107156171/article/details/112384789