Visual Analysis Experiment
- Store traffic data visualization
- Data Sources
- Experimental requirements:
- Plot the October traffic line graph for all convenience stores.
- Draw a line chart of the daily average passenger flow of each type of business in October.
- Select a business, count the total monthly traffic, and draw a bar chart.
- Select a merchant, count the daily average passenger flow from Monday to Sunday in a certain month, and draw a bar chart.
- Select a business and plot a traffic histogram.
- Select a business and draw a traffic density plot.
- Calculate the proportion of the total customer traffic of each category of stores in a month to the total customer traffic of the month, and draw a pie chart.
- Pima Indian Diabetes Data Visualization
Store traffic data visualization
Data Sources
The store data comes from the Tianchi Koubei Merchant Traffic Prediction Competition; only a subset of the data is used here. The meaning of each field in "shop_payNum_new.csv" is shown in the table below:
Experimental requirements:
Referring to Case 1, select five of the following tasks and draw the corresponding charts:
Plot the October traffic line graph for all convenience stores.
【Code】
import pandas as pd
import matplotlib.pyplot as plt

# Parse the first column as dates and use it as the index.
data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)
# Keep only the October rows.
data = data_total[data_total.index.month == 10]
# Draw one line chart per store.
for shop_id, shop_data in data.groupby('shop_id'):
    shop_data.plot(y=['pay_num'], title='customer flow of shop ' + str(shop_id))
plt.show()
【Analysis】
First, pandas.read_csv loads all store data, with the date column parsed and used as the index. Because only the October customer flow is needed, a boolean mask on the month of the index keeps the October rows. The filtered data is then grouped by shop_id with groupby; iterating over the groups yields each store id together with that store's rows, and plot draws a line chart for each. Note that this code draws a chart for every store in the file and does not filter by store category.
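The task asks for convenience stores specifically, while the steps above keep every store. A minimal sketch of adding a category filter, using a small synthetic frame in place of the CSV. The label 'convenience store' and the assumption that cate_2_name holds it are hypothetical, since the actual category values are not shown here:

```python
import pandas as pd

# Synthetic stand-in for shop_payNum_new.csv (all values are made up).
dates = pd.to_datetime(["2016-09-30", "2016-10-01", "2016-10-02", "2016-10-01"])
df = pd.DataFrame(
    {
        "shop_id": [1, 1, 1, 2],
        "cate_2_name": ["convenience store"] * 3 + ["fast food"],
        "pay_num": [10, 12, 15, 40],
    },
    index=dates,
)

# Combine the month mask with a category mask (hypothetical label).
is_october = df.index.month == 10
is_convenience = df["cate_2_name"] == "convenience store"
october_convenience = df[is_october & is_convenience]

# Iterating over a groupby yields (key, sub-frame) pairs, one per store.
per_shop = {
    int(shop_id): group["pay_num"].tolist()
    for shop_id, group in october_convenience.groupby("shop_id")
}
print(per_shop)
```

Only shop 1 survives both masks in this toy data; in the real script each sub-frame would be passed to plot instead of being collected in a dict.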
【Run】
Because the code produces many figures, only some of them are shown.
Draw a line chart of the daily average passenger flow of each type of business in October.
【Code】
import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)
# Keep only the October rows.
data = data_total[data_total.index.month == 10]
# For each business category, average pay_num over its stores for each day of the month.
for cate, cate_data in data.groupby('cate_2_name'):
    daily_mean = cate_data.groupby(cate_data.index.day)[['pay_num']].mean()
    daily_mean.plot(y=['pay_num'], kind='line', title=cate)
plt.show()
【Analysis】
First, pandas.read_csv loads all store data. Because the chart covers only October, a boolean mask on the month of the index keeps the October rows. The filtered data is grouped by cate_2_name to obtain one group per business category. For each category, the rows are grouped again by day of the month and pay_num is averaged, which gives the daily average customer flow across the stores of that category. Selecting pay_num before calling mean avoids averaging non-numeric columns. Finally, plot draws one line chart per category.
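As an alternative to looping over categories, the daily averages can be computed in one pass and arranged with one column per category, so that a single plot call draws all categories on shared axes. A minimal sketch on synthetic data (the category labels and values are made up):

```python
import pandas as pd

dates = pd.to_datetime(
    ["2016-10-01", "2016-10-01", "2016-10-02", "2016-10-01", "2016-10-02"]
)
df = pd.DataFrame(
    {
        "shop_id": [1, 2, 1, 3, 3],
        "cate_2_name": ["A", "A", "A", "B", "B"],
        "pay_num": [10, 20, 30, 5, 7],
    },
    index=dates,
)

# Mean pay_num for every (day of month, category) pair.
daily_mean = df.groupby([df.index.day, "cate_2_name"])["pay_num"].mean()

# Pivot categories into columns: rows are days, one column per category.
table = daily_mean.unstack("cate_2_name")
print(table)
# table.plot(kind="line") would then draw one line per category.
```

Whether one shared chart or one chart per category reads better depends on how many categories there are and how different their scales are.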
【Run】
Some of the results are shown below.
Select a business, count the total monthly traffic, and draw a bar chart.
【Code】
import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)
# Keep only the rows for shop 14.
data_14 = data_total[data_total['shop_id'] == 14]
# Sum pay_num within each calendar month.
data_14_id = data_14.groupby(data_14.index.month)[['pay_num']].sum()
data_14_id.plot(kind='bar', y=['pay_num'], title='total customer flow of shop 14')
plt.xlabel('month')
plt.show()
【Analysis】
First, pandas.read_csv loads all store data. Because the chart shows the monthly totals of a single merchant, the rows with shop_id 14 are selected first. The selected rows are grouped by the month of the index and pay_num is summed within each group. Finally, plot is called with kind='bar' to draw the totals as a bar chart.
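Grouping by the month number alone merges the same calendar month from different years into one bar. If the data spans more than one year (an assumption, since the date range is not stated here), grouping by year-month periods keeps them apart. A minimal sketch on synthetic data:

```python
import pandas as pd

dates = pd.to_datetime(["2015-12-30", "2015-12-31", "2016-01-01", "2016-12-01"])
pay_num = pd.Series([3, 4, 5, 6], index=dates, name="pay_num")

# Month number only: December 2015 and December 2016 land in the same group.
by_month = pay_num.groupby(pay_num.index.month).sum()

# Year-month periods: each calendar month of each year is its own group.
by_period = pay_num.groupby(pay_num.index.to_period("M")).sum()

print(by_month)
print(by_period)
```

The first result has two groups and the second has three, because the two Decembers are no longer combined.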
【Run】
Select a merchant, count the daily average passenger flow from Monday to Sunday in a certain month, and draw a bar chart.
【Code】
import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)
# Keep only the January rows for shop 14.
data_14 = data_total[(data_total['shop_id'] == 14) & (data_total.index.month == 1)]
# Group by day of the week (0 = Monday, ..., 6 = Sunday) and average pay_num.
data_14_id = data_14.groupby(data_14.index.dayofweek)[['pay_num']].mean()
data_14_id.plot(y=['pay_num'], kind='bar', title='Average customer flow of shop 14 in January')
plt.xlabel('day of week (0 = Monday)')
plt.show()
【Analysis】
First, pandas.read_csv loads all store data. Because the chart shows the average customer flow of a single merchant in a single month, the January rows with shop_id 14 are selected first. The rows are then grouped by the day of the week of the index and pay_num is averaged within each group. Grouping on index.dayofweek numbers the days from 0 (Monday) to 6 (Sunday), so the bars appear in Monday-to-Sunday order as the task requires; strftime('%w') would instead start the week on Sunday. Finally, plot draws the averages as a bar chart.
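Weekday numbering conventions differ: strftime('%w') counts from 0 = Sunday, while index.dayofweek counts from 0 = Monday. A minimal sketch, on synthetic data, of averaging by dayofweek and attaching day names so that the bars read Monday to Sunday:

```python
import pandas as pd

# Two full weeks of made-up daily counts; 2016-01-04 is a Monday.
dates = pd.date_range("2016-01-04", periods=14, freq="D")
pay_num = pd.Series(range(14), index=dates, name="pay_num")

# dayofweek runs from 0 (Monday) to 6 (Sunday), so the sorted keys are already in order.
weekday_mean = pay_num.groupby(pay_num.index.dayofweek).mean()

# Replace the numeric keys with readable labels.
day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
weekday_mean.index = [day_names[i] for i in weekday_mean.index]
print(weekday_mean)
```

Calling weekday_mean.plot(kind='bar') on this result would label the bars with the day names rather than digits.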
【Run】
Select a business and plot a traffic histogram.
【Code】
import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)
# Keep only the rows for shop 14.
data_14 = data_total[data_total['shop_id'] == 14]
# A histogram shows how often each range of daily pay_num values occurs.
data_14.plot(kind='hist', y=['pay_num'], title='shop-14-histogram')
plt.show()
【Analysis】
First, pandas.read_csv reads all store data, and the rows are filtered by shop_id. Once the data for the chosen store has been selected, plot is called directly with kind='hist' to draw a histogram of the daily customer flow.
【Run】
Select a business and draw a traffic density plot.
【Code】
import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)
# Keep only the rows for shop 14.
data_14 = data_total[data_total['shop_id'] == 14]
# kind='kde' draws a kernel density estimate; pandas needs SciPy installed for it.
data_14.plot(kind='kde', y=['pay_num'], title='shop-14-density')
plt.show()
【Analysis】
First, pandas.read_csv reads all store data, and the rows are filtered by shop_id. Once the data for the chosen store has been selected, plot is called directly with kind='kde' to draw a kernel density estimate, a smoothed version of the histogram of daily customer flow.
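pandas delegates kind='kde' to SciPy, which uses Gaussian kernels. To make clear what the curve represents, here is a minimal pure-Python sketch of a Gaussian kernel density estimate. The sample values and the bandwidth h are made up, and SciPy chooses its bandwidth automatically rather than taking a fixed h like this:

```python
import math

def kde_gaussian(samples, h):
    """Return f(x) = (1 / (n*h)) * sum(K((x - x_i) / h)) with a standard normal kernel K."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)

    return density

# Made-up daily customer counts for one store.
pay_num = [48, 52, 55, 60, 61, 63, 70, 95]
density = kde_gaussian(pay_num, h=5.0)

# The estimate is higher where observations cluster than in sparse regions.
print(density(60.0), density(85.0))

# A density integrates to 1; check numerically with a Riemann sum over [0, 200).
step = 0.1
area = sum(density(i * step) * step for i in range(2000))
print(round(area, 3))
```

A smaller h follows the individual observations more closely, while a larger h gives a smoother curve.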
【Run】
Calculate the proportion of the total customer traffic of each category of stores in a month to the total customer traffic of the month, and draw a pie chart.
【Code】
import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)
# Keep only the January rows.
data_month1 = data_total[data_total.index.month == 1]
# Share of each category: category total divided by the overall January total.
data_month1_rate = data_month1.groupby('cate_2_name')['pay_num'].sum() / data_month1['pay_num'].sum()
data_month1_rate.plot(kind='pie', autopct='%.2f%%')
plt.ylabel('')
plt.title('January')
plt.show()
【Analysis】
First, pandas.read_csv reads all store data, and the rows are filtered to January. groupby and sum then give the total traffic of each category, and a second sum gives the total traffic of all stores. Dividing the category totals by the overall total yields each category's proportion, and plot with kind='pie' draws the proportions as a pie chart. Selecting pay_num before summing keeps the division restricted to the traffic counts.
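The proportion calculation can be checked independently of the plot: dividing each category total by the grand total must give shares that sum to 1. A minimal sketch on synthetic data (category labels and counts are made up):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cate_2_name": ["A", "A", "B", "C"],
        "pay_num": [30, 20, 25, 25],
    }
)

# Total traffic per category divided by the overall total.
category_total = df.groupby("cate_2_name")["pay_num"].sum()
share = category_total / category_total.sum()
print(share)
# share.plot(kind="pie", autopct="%.2f%%") would label each wedge with its percentage.
```

The pie chart itself does not need normalised input, since matplotlib scales the wedges to the total, but computing the shares explicitly makes the numbers available for inspection.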
【Run】
Pima Indian Diabetes Data Visualization
Data source: http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes. The first 9 fields of "pima.csv" have the following meanings:
(1) Number of times pregnant
(2) Plasma glucose concentration at 2 hours in an oral glucose tolerance test
(3) Diastolic blood pressure (mm Hg)
(4) Triceps skin fold thickness (mm)
(5) 2-Hour serum insulin (mu U/ml)
(6) Body mass index (weight in kg/(height in m)^2)
(7) Diabetes pedigree function
(8) Age (years)
(9) Class variable (0 or 1)
Experimental requirements:
Refer to Case 2 to complete the following tasks:
Draw a scatterplot for any two fields.
【Code】
import pandas as pd
import matplotlib.pyplot as plt

# pima.csv has no header row, so the columns are named explicitly.
close_px_all = pd.read_csv('dataset/pima.csv', index_col=None, header=None)
close_px_all.columns = ['Number of times pregnant',
                        'Plasma glucose concentration a 2 hours in an oral glucosetolerancetest',
                        'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
                        '2-Hour serum insulin (mu U/ml)', 'Body mass index', 'Diabetes pedigree function',
                        'Age (years)', 'Class variable']
# Draw a scatter plot of two chosen fields, coloured by class.
pregnant_age = close_px_all[['Number of times pregnant', 'Age (years)', 'Class variable']]
ax = pregnant_age[pregnant_age['Class variable'] == 0].plot(
    kind='scatter', x='Age (years)', y='Number of times pregnant', c='red',
    label='Class 0', title='Number of times pregnant-Age')
# Reuse the same axes so both classes appear in one figure.
pregnant_age[pregnant_age['Class variable'] == 1].plot(
    kind='scatter', x='Age (years)', y='Number of times pregnant', c='blue',
    label='Class 1', ax=ax)
plt.show()
【Analysis】
First, the data is read with pandas.read_csv and each column is named for easier handling. Because the chart shows the relationship between Number of times pregnant and Age, only these two columns and the class variable are kept. The rows are then split by class: class 0 is drawn in red and class 1 in blue, both with kind='scatter' and with Age on the horizontal axis and Number of times pregnant on the vertical axis. Passing the axes returned by the first call as ax to the second call puts both classes on the same chart.
【Run】
Draw a scatterplot using all or some of the features.
【Code】
import pandas as pd
import matplotlib.pyplot as plt

# pima.csv has no header row, so the columns are named explicitly.
close_px_all = pd.read_csv('dataset/pima.csv', index_col=None, header=None)
close_px_all.columns = ['Number of times pregnant',
                        'Plasma glucose concentration a 2 hours in an oral glucosetolerancetest',
                        'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
                        '2-Hour serum insulin (mu U/ml)', 'Body mass index', 'Diabetes pedigree function',
                        'Age (years)', 'Class variable']
# Draw a scatter matrix using all or some of the features, coloured by class.
color = {1: 'red', 0: 'blue'}
pd.plotting.scatter_matrix(
    close_px_all.iloc[:, [0, 3, 4]], figsize=(9, 9), diagonal='kde', s=40, alpha=0.6,
    c=close_px_all['Class variable'].map(color))
plt.show()
【Analysis】
Number of times pregnant, Triceps skin fold thickness and 2-Hour serum insulin are selected to examine how they relate to the class variable. First, the data is read with pandas.read_csv and each column is named for easier handling. pd.plotting.scatter_matrix is then applied to columns 0, 3 and 4: it draws a scatter plot for every pair of these columns and, with diagonal='kde', a density curve for each column on the diagonal. Each point is coloured by its class, red for 1 and blue for 0.
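The per-point colours come from translating each class label into a colour name. A minimal sketch of that lookup using Series.map with a dict, which does the same job as apply with a lambda; the labels below are made up:

```python
import pandas as pd

color = {1: "red", 0: "blue"}
labels = pd.Series([0, 1, 1, 0], name="Class variable")

# map looks each label up in the dict and returns a Series of colour names.
point_colors = labels.map(color)
print(point_colors.tolist())
```

A label missing from the dict would become NaN with map, whereas the lambda version would raise a KeyError, so the two differ only when the data contains unexpected classes.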
【Run】
Draw an Andrews curves plot (harmonic curves).
【Code】
import pandas as pd
import matplotlib.pyplot as plt

# pima.csv has no header row, so the columns are named explicitly.
close_px_all = pd.read_csv('dataset/pima.csv', index_col=None, header=None)
close_px_all.columns = ['Number of times pregnant',
                        'Plasma glucose concentration a 2 hours in an oral glucosetolerancetest',
                        'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
                        '2-Hour serum insulin (mu U/ml)', 'Body mass index', 'Diabetes pedigree function',
                        'Age (years)', 'Class variable']
# Draw the Andrews curves, one curve per row, coloured by class.
pd.plotting.andrews_curves(close_px_all, 'Class variable', color=['red', 'blue'])
plt.show()
【Analysis】
First, the data is read with pandas.read_csv and each column is named for easier handling. pd.plotting.andrews_curves is then called with the DataFrame and the name of the class column; it draws one curve per row and colours the curves by class.
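An Andrews curve turns each row into a finite Fourier series, so rows with similar feature values produce similar curves. For a row x = (x1, x2, ...), the function evaluated for t between -pi and pi is f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... A minimal pure-Python sketch of that formula, with a made-up sample row:

```python
import math

def andrews_curve(row, t):
    """Evaluate f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..."""
    value = row[0] / math.sqrt(2.0)
    for i, x in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            value += x * math.sin(harmonic * t)
        else:
            value += x * math.cos(harmonic * t)
    return value

row = [1.0, 2.0, 3.0, 4.0]  # made-up feature values
ts = [-math.pi + k * (2 * math.pi / 4) for k in range(5)]
curve = [andrews_curve(row, t) for t in ts]
print(curve)
```

Because the raw feature values are the coefficients, features with large numeric ranges dominate the shape of the curves, which is worth keeping in mind when reading the plot.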
【Run】