In the previous article, based on my day-to-day work, I wrote a simple program that reads CSV files and writes them to Excel files after some light processing. Next, I want to go deeper into data analysis with pandas. What I lacked most was a variety of datasets. While searching, I found the data analysis exercises on PaddlePaddle (飞桨, "Flying Paddle") AI Studio, which also provide a large number of datasets. Following "Ten sets of exercises to teach you how to use Pandas to do data analysis - PaddlePaddle AI Studio", let's try them out. Some problems still came up along the way, so I have collected them here.
1. Read data
import pandas as pd
path1 = "exercise_data/chipotle.tsv"
data = pd.read_csv(path1)
print(data.head(10))
Every beginning is hard. Reading the data fails right away with a parser error, which means: expected 1 field on line 6, but found 5.
Looking at line 6 of the dataset, nothing seemed wrong at first, but the line is fairly long and contains several "," characters, which looked like the cause. The file is separated by tabs, while read_csv splits on "," by default. There are three solutions:
1. Open the tsv file and export it as a standard csv file
path1 = "exercise_data/chipotle.csv"
2. Tell read_csv that the file is tab-separated by passing sep="\t"
data = pd.read_csv(path1, sep="\t")
3. Skip the lines that fail to parse (not recommended when many lines are affected, because their data is silently dropped; on_bad_lines requires pandas 1.3 or later)
data = pd.read_csv(path1, on_bad_lines='skip')
I recommend the first option. The complete code then becomes:
import pandas as pd
path1 = "exercise_data/chipotle.csv"
data = pd.read_csv(path1)
print(data.head(10))
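To see why the separator matters, here is a minimal sketch that reproduces the situation in memory. The sample rows are hypothetical (they are not taken from chipotle.tsv); they only imitate a tab-separated file whose fields contain commas.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample; one field in the last row contains commas
sample = (
    "order_id\tquantity\titem_name\titem_price\n"
    "1\t1\tIzze\t$3.39 \n"
    "2\t2\tChicken Bowl [Rice, Beans, Salsa]\t$16.98 \n"
)

# Default separator ",": the header looks like 1 field, the last row like 3
try:
    pd.read_csv(io.StringIO(sample))
    default_sep_failed = False
except pd.errors.ParserError as exc:
    default_sep_failed = True
    print("default separator failed:", exc)

# Option 2: declare the tab separator, all four columns are parsed
tab_df = pd.read_csv(io.StringIO(sample), sep="\t")
print(tab_df.shape)

# Option 3: skipping bad lines "works", but the row with commas is lost
# and the remaining data is still a single unsplit column
skipped = pd.read_csv(io.StringIO(sample), on_bad_lines="skip")
print(skipped.shape)
```

In this sketch, skipping bad lines hides the real problem instead of fixing it, which is why it is the least attractive of the three options.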
2. Set the display length
The data is displayed, but part of it is replaced by "...". The rows are too wide for pandas' default display settings, which can be changed:
# Widen the display so that wide rows are not cut off with an ellipsis
pd.set_option('display.width', 1000)
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
With these options set, the output is no longer truncated.
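pd.set_option changes the settings for the whole session. If that is not wanted, one possible alternative is pd.option_context, which applies the options only inside a with block. The sketch below uses a hypothetical 50-column frame to compare a truncated and a full printout.

```python
import pandas as pd

# Hypothetical wide frame: one row, 50 columns named col_0 ... col_49
wide = pd.DataFrame({f"col_{i}": [i] for i in range(50)})

# With a small column limit, the middle columns are replaced by "..."
with pd.option_context("display.max_columns", 4):
    truncated = repr(wide)

# Same options as in the article, but only active inside this block
with pd.option_context("display.width", 1000, "display.max_columns", None):
    full = repr(wide)

print(truncated)
print(full)
```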
3. Display the number of rows and columns, column names and indexes
# Print the number of rows and columns
print('rows: ' + str(data.shape[0]) + '; ' + 'columns: ' + str(data.shape[1]))
# Print all column names
print(data.columns)
# Print the index of the dataset
print(data.index)
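For reference, a small sketch of what these three attributes return, using a hypothetical three-row frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical miniature of the orders table
df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "quantity": [1, 2, 1],
    "item_name": ["Izze", "Chicken Bowl", "Izze"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# columns is an Index of column names; list() turns it into a plain list
column_names = list(df.columns)
print(column_names)

# Without an explicit index, pandas creates a RangeIndex starting at 0
print(df.index)
```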
4. Count quantities
Group by item_name, sum the quantity of each item, and then sort from large to small:
c = data[['item_name', 'quantity']].groupby(['item_name'], as_index=False).agg({'quantity': 'sum'})
c1 = c.sort_values(['quantity'], ascending=False)
print(c1.head())
The main parameters of the sort_values method are by (the column or columns to sort on) and ascending (True sorts from small to large, False from large to small).
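A minimal sketch of the most commonly used sort_values parameters (by, ascending, na_position, ignore_index), tried on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a tie and a missing value
df = pd.DataFrame({
    "item_name": ["Izze", "Chicken Bowl", "Chips", "Steak Burrito"],
    "quantity": [3, 10, np.nan, 10],
})

# by: the column to sort on; ascending=False sorts from large to small
desc = df.sort_values(by="quantity", ascending=False)

# Several keys, each with its own direction; item_name breaks the tie
multi = df.sort_values(by=["quantity", "item_name"], ascending=[False, True])

# na_position="first" puts missing values at the top;
# ignore_index=True renumbers the result 0..n-1
na_first = df.sort_values(by="quantity", na_position="first", ignore_index=True)

print(desc)
print(multi)
print(na_first)
```

By default, missing values are placed last regardless of the sort direction.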
5. Deduplication
# Print the number of distinct values in the item_name column
print(data['item_name'].nunique())
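nunique only returns a count. As a rough sketch on a hypothetical Series, the related methods unique and drop_duplicates return the distinct values themselves, and the three differ in how they treat missing values:

```python
import pandas as pd

# Hypothetical column with a repeated value and a missing value
items = pd.Series(["Izze", "Chicken Bowl", "Izze", None])

# nunique counts distinct values and ignores missing ones by default
n_unique = items.nunique()

# dropna=False counts the missing value as one more distinct value
n_unique_with_na = items.nunique(dropna=False)

# unique returns the distinct values as an array, missing value included
distinct_values = items.unique()

# drop_duplicates keeps the first occurrence of each value as a Series
deduplicated = items.drop_duplicates()

print(n_unique, n_unique_with_na)
print(distinct_values)
print(deduplicated)
```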
6. Summation
# Sum the quantity column
print(data['quantity'].sum())
7. Complex calculations
# Convert item_price to float; x[1:-1] drops the first and last character,
# so it assumes values like "$2.39 " (leading "$", one trailing space)
dollarizer = lambda x: float(x[1:-1])
data['item_price'] = data['item_price'].apply(dollarizer)
print(data.head())
# Compute the revenue per row, rounded to an integer
data['sub_total'] = round(data['item_price'] * data['quantity'])
print(data['sub_total'].sum())
# Number of orders
print(data['order_id'].nunique())
# Average total per order: sum sub_total within each order, then take the mean
c2 = data.groupby('order_id')['sub_total'].sum().mean()
print(round(c2))
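The slice x[1:-1] depends on every price having exactly one trailing character. As a hedged alternative, the sketch below strips whitespace and the "$" sign explicitly, so it works whether or not the trailing space survived the CSV export. The data is hypothetical; the steps otherwise mirror the ones above, without the rounding.

```python
import pandas as pd

# Hypothetical orders; note that one price has no trailing space
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "quantity": [1, 2, 1, 1],
    "item_price": ["$2.39 ", "$16.98 ", "$3.39", "$10.00 "],
})

# Remove surrounding whitespace, then the leading "$", then convert to float
orders["item_price"] = (
    orders["item_price"].str.strip().str.lstrip("$").astype(float)
)

# Revenue per row, as in the article
orders["sub_total"] = orders["item_price"] * orders["quantity"]

# Total per order, then the mean over all orders
order_totals = orders.groupby("order_id")["sub_total"].sum()
average_order = order_totals.mean()

print(orders)
print(order_totals)
print(round(average_order, 2))
```

Using the vectorized .str methods instead of apply with a lambda keeps the conversion in one readable chain.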
8. The complete code
import pandas as pd
# Widen the display so that wide rows are not cut off with an ellipsis
pd.set_option('display.width', 1000)
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Read the csv file
path1 = "exercise_data/chipotle.csv"
data = pd.read_csv(path1)
# Print the first five rows
print(data.head())
# Print the number of rows and columns
print('rows: ' + str(data.shape[0]) + '; ' + 'columns: ' + str(data.shape[1]))
# Print all column names
print(data.columns)
# Print the index of the dataset
print(data.index)
# Group by item_name and sum quantity
c = data[['item_name', 'quantity']].groupby(['item_name'], as_index=False).agg({'quantity': 'sum'})
# Sort by quantity from large to small
c1 = c.sort_values(['quantity'], ascending=False)
print(c1.head())
# Print the number of distinct values in the item_name column
print(data['item_name'].nunique())
# Sum the quantity column
print(data['quantity'].sum())
# Convert item_price to float; x[1:-1] drops the first and last character,
# so it assumes values like "$2.39 " (leading "$", one trailing space)
dollarizer = lambda x: float(x[1:-1])
data['item_price'] = data['item_price'].apply(dollarizer)
print(data.head())
# Compute the revenue per row, rounded to an integer
data['sub_total'] = round(data['item_price'] * data['quantity'])
print(data['sub_total'].sum())
# Number of orders
print(data['order_id'].nunique())
# Average total per order: sum sub_total within each order, then take the mean
c2 = data.groupby('order_id')['sub_total'].sum().mean()
print(round(c2))
This is the first article in which we follow the PaddlePaddle (Flying Paddle) practice exercises to do data analysis. In my years as a developer, I have often found that when following exercises from a website, a video, or a book, there are always some pitfalls. Following these exercises therefore serves two purposes: making use of their exercises and datasets, and helping you avoid those pitfalls. I hope it helps.