pandas data analysis (2): getting to know the data

    In the previous article, based on my actual work, I wrote a simple program that reads a csv file, does some simple processing, and writes the result to an Excel file. Next I want to use pandas for deeper data analysis. What I lack most is a variety of data sets. While searching, I found the data analysis exercises from PaddlePaddle, which also provide a large number of data sets. Following the article "Ten sets of exercises to teach you how to use Pandas for data analysis" on PaddlePaddle AI Studio, let's give it a try. Some problems do come up along the way, so I'm writing them up here.

Table of contents

1. Read data

2. Set the display length    

3. Display the number of rows and columns, column names and indexes

4. Count statistics

5. Deduplication

6. Summation

7. Complex calculations

8. The complete code


 

1. Read data

import pandas as pd

path1 = "exercise_data/chipotle.tsv"
data = pd.read_csv(path1)
print(data.head(10))

    Everything is hard at the beginning. Reading the data immediately runs into a problem, and an error is reported:


     The message means that pandas expected 1 field on line 6 but found 5, so let's look at line 6 of the dataset.


     Nothing obviously wrong stood out, except that the line is fairly long and contains several "," characters, which felt like the source of the problem. The file is actually tab-separated (it is a .tsv), while read_csv splits on "," by default. There are three solutions:

1. Open the tsv file and export it as a standard csv file, then read that instead (a minimal conversion sketch follows this list)

path1 = "exercise_data/chipotle.csv"

2. Tell read_csv explicitly which separator the file uses (tab, in this case)

data = pd.read_csv(path1, sep="\t")

3. Skip the lines that fail to parse (not recommended if a large amount of data is affected)

data = pd.read_csv(path1, on_bad_lines='skip')
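A minimal sketch of option 1 done in Python rather than by hand in a spreadsheet tool, assuming the paths used above: read the tab-separated original once, then write it back out as a standard comma-separated file.

import pandas as pd

# Read the tab-separated original, then write it out as a standard csv
raw = pd.read_csv("exercise_data/chipotle.tsv", sep="\t")
raw.to_csv("exercise_data/chipotle.csv", index=False)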

The first option is recommended, and the complete code modified accordingly is:

import pandas as pd

path1 = "exercise_data/chipotle.csv"
data = pd.read_csv(path1)
print(data.head(10))


2. Set the display length    

The data now prints, but parts of it are replaced by "..." because the rows are wider than pandas' default display width. You can change the display options:

# Set the maximum display width and related options to remove the ellipses from truncated output
pd.set_option('display.width', 1000)
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)

 After setting these options, the output is displayed in full.

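If you only need the wider display for a single print rather than globally, pandas also provides option_context, which restores the previous settings once the with block ends:

# Temporarily widen the display for one print, then revert automatically
with pd.option_context('display.width', 1000, 'display.max_columns', None):
    print(data.head(10))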

3. Display the number of rows and columns, column names and indexes

# Print the number of rows and columns
print('Rows: ' + str(data.shape[0]) + '; ' + 'Columns: ' + str(data.shape[1]))
# Print all column names
print(data.columns)
# Print the index of the data set
print(data.index)

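A related one-liner worth knowing is DataFrame.info(), which prints the row count, column names, non-null counts and dtypes in a single overview:

# Row count, column names, non-null counts and dtypes in one call
data.info()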

4. Count statistics

    Sum the quantity column grouped by item_name, then sort from largest to smallest:

c = data[['item_name', 'quantity']].groupby(['item_name'], as_index=False).agg({'quantity': 'sum'})
c1 = c.sort_values(['quantity'], ascending=False)
print(c1.head())


 The main parameters of sort_values are by (the column or list of columns to sort on), ascending (True for ascending, False for descending), inplace (whether to sort in place), and na_position (where to put NaN values).
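For reference, the same top result can also be obtained in a single chained expression (a sketch, not part of the original exercise):

# Equivalent one-liner: sum quantity per item_name and take the five largest
print(data.groupby('item_name')['quantity'].sum().nlargest(5))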

5. Deduplication

# Print the number of distinct values in the item_name column
print(data['item_name'].nunique())
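nunique() only counts the distinct values; if you also want to see them, unique() returns them as an array:

# The distinct item names themselves (first ten shown)
print(data['item_name'].unique()[:10])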

6. Summation

# Sum the quantity column
print(data['quantity'].sum())
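Beyond the plain total, Series.describe() gives a quick statistical summary (count, mean, std, min, quartiles, max) of the same column:

# Quick statistical summary of the quantity column
print(data['quantity'].describe())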

7. Complex calculations

# Convert item_price to a float
dollarizer = lambda x: float(x[1:-1])
data['item_price'] = data['item_price'].apply(dollarizer)
print(data.head())
# Calculate revenue, rounded to an integer
data['sub_total'] = round(data['item_price'] * data['quantity'])
print(data['sub_total'].sum())
# Number of orders
print(data['order_id'].nunique())
# Calculate the average price per order
c2 = data[['order_id', 'sub_total']].groupby(by=['order_id']
                                             ).agg({'sub_total': 'sum'})['sub_total'].mean()
print(round(c2))

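A slightly more defensive way to do the same price conversion is to strip the leading "$" and surrounding whitespace with the string accessor and cast the whole column at once (a sketch, assuming item_price still holds the raw strings):

# Alternative: strip "$" and whitespace, then cast the column to float
data['item_price'] = data['item_price'].str.strip().str.lstrip('$').astype(float)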

8. The complete code

import pandas as pd

# Set the maximum display width and related options to remove the ellipses from truncated output
pd.set_option('display.width', 1000)
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Read the csv file
path1 = "exercise_data/chipotle.csv"
data = pd.read_csv(path1)
# Print the first five rows
print(data.head())
# Print the number of rows and columns
print('Rows: ' + str(data.shape[0]) + '; ' + 'Columns: ' + str(data.shape[1]))
# Print all column names
print(data.columns)
# Print the index of the data set
print(data.index)
# Sum quantity grouped by item_name
c = data[['item_name', 'quantity']].groupby(['item_name'], as_index=False).agg({'quantity': 'sum'})
# Sort by quantity from largest to smallest
c1 = c.sort_values(['quantity'], ascending=False)
print(c1.head())
# Print the number of distinct values in the item_name column
print(data['item_name'].nunique())
# Sum the quantity column
print(data['quantity'].sum())
# Convert item_price to a float
dollarizer = lambda x: float(x[1:-1])
data['item_price'] = data['item_price'].apply(dollarizer)
print(data.head())
# Calculate revenue, rounded to an integer
data['sub_total'] = round(data['item_price'] * data['quantity'])
print(data['sub_total'].sum())
# Number of orders
print(data['order_id'].nunique())
# Calculate the average price per order
c2 = data[['order_id', 'sub_total']].groupby(by=['order_id']
                                             ).agg({'sub_total': 'sum'})['sub_total'].mean()
print(round(c2))

     This article is my first pass at following the PaddlePaddle practice exercises for data analysis. After many years of development work, I often run into the same situation: when following exercises from a website, a video, or a book, there are always a few pitfalls. This article follows the PaddlePaddle exercises partly to make use of its exercises and data sets, and partly to help you avoid those pitfalls. I hope it helps.

 
