Introduction to Data Mining - Visual Analysis Experiment

Store traffic data visualization

Data Sources

The store data comes from the Tianchi Koubei Merchant Traffic Prediction Competition; only a subset of the data is used here. The meaning of each field in "shop_payNum_new.csv" is shown in the table below:
[Image: table of field descriptions for shop_payNum_new.csv]

Experimental requirements:

Referring to Case 1, select 5 of the following tasks and draw a different type of chart for each:

Plot a line chart of the October customer flow for all convenience stores.

【Code】

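import pandas as pd
import matplotlib.pyplot as plt

# Read the data; the first column (the date) becomes a DatetimeIndex
data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)

# Keep only the October rows (boolean mask on the month of the index)
data = data_total.iloc[data_total.index.month == 10]
# Note: this draws one figure for every shop_id in the file. To restrict the
# plots to convenience stores, filter on the category column (cate_2_name)
# first; the label to match depends on the dataset.
data_id = data.groupby('shop_id')
for key in data_id.groups.keys():
    data_id.get_group(key).plot(y=['pay_num'], title='customer flow of shop '+str(key))
plt.show()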

【Analysis】
First read all store data with pandas.read_csv. Because only the October customer flow is needed, build a boolean mask from the month of the date index and pass it to iloc to filter the rows. Then group the filtered data by shop_id with groupby to obtain the id key of each store. For each key, use get_group to fetch that store's data in turn and draw it with plot.
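
The filtering and grouping steps can be tried without the competition file. The sketch below is only an illustration: the frame `demo` and its values are made up, and it assumes the same layout as the original data (a date index plus `shop_id` and `pay_num` columns).

```python
import pandas as pd

# Synthetic stand-in for shop_payNum_new.csv (all values are made up)
dates = pd.to_datetime(['2016-09-30', '2016-10-01', '2016-10-02',
                        '2016-10-01', '2016-10-02', '2016-11-01'])
demo = pd.DataFrame({'shop_id': [1, 1, 1, 2, 2, 2],
                     'pay_num': [10, 12, 15, 30, 28, 40]}, index=dates)

# The boolean mask on the DatetimeIndex keeps only the October rows
october = demo[demo.index.month == 10]

# One pay_num series per shop, ready to be plotted one at a time
per_shop = {shop_id: group['pay_num'] for shop_id, group in october.groupby('shop_id')}
```

Each value in `per_shop` is a date-indexed series, so calling `.plot()` on it would give the same kind of line chart as the loop above.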

【Run】
The code produces one figure per store, so only some of them are shown here.
[Images: October customer-flow line charts for individual stores]

Draw a line chart of the daily average customer flow of each business category in October.

【Code】

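import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)

# Keep only the October rows, then group them by business category
data = data_total.iloc[data_total.index.month == 10]
data_id = data.groupby('cate_2_name')
for key in data_id.groups.keys():
    group = data_id.get_group(key)
    # Average pay_num over the category's shops for each day of the month;
    # selecting pay_num first keeps mean() away from the text columns,
    # which raise a TypeError in pandas 2.x
    group[['pay_num']].groupby(group.index.day).mean().plot(y=['pay_num'], kind='line', title=key)
plt.show()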

【Analysis】
First read all store data with pandas.read_csv. Because the task needs the daily average customer flow of each business category in October, use iloc to keep only the October rows, then group them by cate_2_name with groupby to obtain the key of each category. Loop over the keys; for each category, group its rows by day of the month, average pay_num within each day, and draw the result as a line chart with plot.
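
To check what the nested grouping computes, here is a small sketch on made-up data. The category labels 'A' and 'B' and all numbers are hypothetical; only the column names follow the original file.

```python
import pandas as pd

# Synthetic data: category A has two shops, category B has one (values made up)
dates = pd.to_datetime(['2016-10-01', '2016-10-01', '2016-10-02', '2016-10-02',
                        '2016-10-01', '2016-10-02'])
demo = pd.DataFrame({
    'shop_id': [1, 2, 1, 2, 3, 3],
    'cate_2_name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'pay_num': [10, 20, 30, 50, 7, 9],
}, index=dates)

# Outer grouping: one group per category.
# Inner grouping: average pay_num over that category's shops for each day.
daily_avg = {}
for cate, group in demo.groupby('cate_2_name'):
    daily_avg[cate] = group['pay_num'].groupby(group.index.day).mean()
```

For category A the two shops are averaged day by day, which is the quantity the line chart displays.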

【Run】
Some of the results are shown below.
[Images: line charts of daily average October customer flow for individual categories]

Select a store, compute its total customer flow for each month, and draw a bar chart.

【Code】

import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)

# Keep only shop 14
data_14 = data_total[data_total['shop_id'] == 14]
# Total pay_num for each calendar month (only the numeric column is summed)
data_14_id = data_14[['pay_num']].groupby(data_14.index.month).sum()
data_14_id.plot(kind='bar', y=['pay_num'], title='total customer flow of shop 14')
plt.xlabel('month')
plt.show()

【Analysis】
First read all store data with pandas.read_csv. Because the task needs the total customer flow of a single store in each month, first keep only the rows whose shop_id is 14. Then group by month with groupby and total each group with sum. Finally set kind to 'bar' to draw a bar chart.
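
A minimal sketch of the monthly aggregation, assuming the same date index and column names as the original file. The rows themselves are invented for illustration.

```python
import pandas as pd

# Synthetic data: four rows for shop 14 and one row for another shop
dates = pd.to_datetime(['2016-01-05', '2016-01-20', '2016-02-03',
                        '2016-02-04', '2016-01-05'])
demo = pd.DataFrame({'shop_id': [14, 14, 14, 14, 7],
                     'pay_num': [100, 50, 30, 20, 999]}, index=dates)

# Select the shop, then total pay_num per calendar month
shop = demo[demo['shop_id'] == 14]
monthly_total = shop['pay_num'].groupby(shop.index.month).sum()
```

The index of `monthly_total` holds the month numbers, which become the bar labels when the series is plotted with `kind='bar'`.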

【Run】
[Image: bar chart of the monthly total customer flow of shop 14]

Select a store, compute its average daily customer flow for each day of the week (Monday to Sunday) in a given month, and draw a bar chart.

【Code】

import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)

# Keep only shop 14 in January
data_14 = data_total[(data_total['shop_id'] == 14) & (data_total.index.month == 1)]
# Group by weekday; '%w' labels the days '0' (Sunday) to '6' (Saturday).
# Selecting pay_num first keeps mean() away from the text columns.
data_14_id = data_14[['pay_num']].groupby(data_14.index.strftime('%w'))
data_14_id.mean().plot(y=['pay_num'], kind='bar', title='Average customer flow of shop 14 in January')
plt.xlabel('day of week (0 = Sunday)')
plt.show()

【Analysis】
First read all store data with pandas.read_csv. Because the task needs the average customer flow of a single store within a single month, first keep only the January rows whose shop_id is 14. Then use groupby together with strftime to group the rows by day of the week and take the mean of each group, and draw the result directly as a bar chart. Note that strftime('%w') numbers the days from 0 (Sunday) to 6 (Saturday), so the first bar is Sunday.
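
If the bars should run from Monday to Sunday, one option is to group by `DatetimeIndex.dayofweek`, which numbers Monday as 0. The sketch below compares both groupings on a made-up two-week series; the values are illustrative only.

```python
import pandas as pd

# Two full weeks starting on Monday 2016-01-04 (values are made up)
dates = pd.date_range('2016-01-04', periods=14, freq='D')
demo = pd.Series(range(14), index=dates, name='pay_num')

# dayofweek: 0 = Monday ... 6 = Sunday
by_dayofweek = demo.groupby(demo.index.dayofweek).mean()

# strftime('%w'): '0' = Sunday ... '6' = Saturday
by_strftime = demo.groupby(demo.index.strftime('%w')).mean()
```

Both results contain the same seven averages. Only the order differs: Sunday comes last with `dayofweek` and first with `'%w'`.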

【Run】
[Image: bar chart of the average customer flow of shop 14 by day of the week in January]

Select a store and plot a histogram of its customer flow.

【Code】

import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)

# Keep only shop 14
data_14 = data_total[data_total['shop_id'] == 14]
# Histogram of the daily pay_num values (10 bins by default)
data_14.plot(kind='hist', y=['pay_num'], title='shop-14-block')
plt.show()

【Analysis】
First read all store data with pandas.read_csv, then keep only the rows with the chosen shop_id. With that store's data selected, call plot directly and set kind to 'hist' to draw a histogram.
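
To see the numbers behind such a histogram, the binning can be reproduced with NumPy. This is a sketch on invented values; it assumes the default of 10 equal-width bins that pandas uses for `kind='hist'` when `bins` is not given.

```python
import numpy as np

# Made-up daily customer counts for one shop
pay_num = np.array([12, 15, 15, 18, 20, 22, 22, 23, 30, 41])

# 10 equal-width bins between the minimum and maximum value
counts, edges = np.histogram(pay_num, bins=10)
```

`counts[i]` is the number of days whose value falls between `edges[i]` and `edges[i + 1]`, which is the height of the i-th bar.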

【Run】
[Image: histogram of the customer flow of shop 14]

Select a store and draw a density plot of its customer flow.

【Code】

import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)

# Keep only shop 14
data_14 = data_total[data_total['shop_id'] == 14]
# Kernel density estimate of the daily pay_num values (kind='kde' requires SciPy)
data_14.plot(kind='kde', y=['pay_num'], title='shop-14-density')
plt.show()

【Analysis】
First read all store data with pandas.read_csv, then keep only the rows with the chosen shop_id. With that store's data selected, call plot directly and set kind to 'kde' to draw a density plot.
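
The density curve is a Gaussian kernel density estimate. The following NumPy sketch shows the idea on made-up values. The helper `gaussian_kde_1d` is hypothetical (written for this illustration, not a pandas or SciPy function), and it assumes Scott's rule for the bandwidth, which is the default of the SciPy estimator that pandas relies on.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Hypothetical helper: 1-D Gaussian KDE with a Scott's-rule bandwidth."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    # Scott's rule in one dimension: h = sample std * n^(-1/5)
    h = samples.std(ddof=1) * n ** (-1 / 5)
    # Place one Gaussian bump on every sample and average the bumps
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# Made-up daily customer counts for one shop
pay_num = np.array([12, 15, 15, 18, 20, 22, 22, 23, 30, 41])
grid = np.linspace(pay_num.min() - 40, pay_num.max() + 40, 801)
density = gaussian_kde_1d(pay_num, grid)

# Trapezoidal rule: the area under a density curve should be close to 1
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
```

Unlike a histogram, the curve does not depend on bin edges, only on the bandwidth `h`.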

【Run】
[Image: density plot of the customer flow of shop 14]

For a given month, calculate each store category's share of that month's total customer flow, and draw a pie chart.

【Code】

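import pandas as pd
import matplotlib.pyplot as plt

data_total = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)

# Keep only the January rows
data_month1 = data_total[data_total.index.month == 1]
# Each category's January total divided by the January total of all stores;
# selecting pay_num first avoids dividing the summed text columns by a number
data_month1_rate = data_month1.groupby('cate_2_name')[['pay_num']].sum() / data_month1['pay_num'].sum()
# autopct prints each wedge's percentage with two decimals
data_month1_rate['pay_num'].plot(kind='pie', autopct='%.2f%%')
plt.ylabel('')
plt.title('January')
plt.show()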

【Analysis】
First read all store data with pandas.read_csv, then keep only the January rows. Use groupby and sum to total the customer flow of each category, and use sum again to total the customer flow of all stores. Dividing the first result by the second gives each category's share, from which the pie chart is drawn.
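
The share calculation can be checked on a tiny made-up frame. The category labels 'A' and 'B' and the numbers are hypothetical; the column names follow the original file.

```python
import pandas as pd

# Synthetic data: two categories in January plus one February row
dates = pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-01',
                        '2016-01-02', '2016-02-01'])
demo = pd.DataFrame({'cate_2_name': ['A', 'A', 'B', 'B', 'A'],
                     'pay_num': [30, 30, 20, 20, 500]}, index=dates)

# Keep January only, then divide each category total by the grand total
january = demo[demo.index.month == 1]
share = january.groupby('cate_2_name')['pay_num'].sum() / january['pay_num'].sum()
```

The shares add up to 1, so they can be passed directly to a pie chart.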

【Run】
[Image: pie chart of each category's share of the January customer flow]

Pima Indian Diabetes Data Visualization

Data source: http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes. The meaning of the 9 fields of "pima.csv":

(1) Number of times pregnant
(2) Plasma glucose concentration a 2 hours in an oral glucose tolerance test
(3) Diastolic blood pressure (mm Hg)
(4) Triceps skin fold thickness (mm)
(5) 2-Hour serum insulin (mu U/ml)
(6) Body mass index (weight in kg/(height in m)^2)
(7) Diabetes pedigree function
(8) Age (years)
(9) Class variable (0 or 1)

Experimental requirements:

Refer to Case 2 to complete the following tasks:

Draw a scatterplot for any two fields.

【Code】

import pandas as pd
import matplotlib.pyplot as plt

# pima.csv has no header row, so the 9 columns are named explicitly
close_px_all = pd.read_csv('dataset/pima.csv', index_col=None, header=None)
close_px_all.columns = ['Number of times pregnant',
                        'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
                        'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
                        '2-Hour serum insulin (mu U/ml)', 'Body mass index', 'Diabetes pedigree function',
                        'Age (years)', 'Class variable']
# print(close_px_all.head())

# Draw a scatter plot of two chosen fields, colored by class
pregnant_age = close_px_all[['Number of times pregnant', 'Age (years)', 'Class variable']]
# Class 0 in red; plot() returns the Axes that the second call reuses
ax = pregnant_age[pregnant_age['Class variable'] == 0].plot(kind='scatter', y='Number of times pregnant', c='red',
                                                            x='Age (years)', title='Number of times pregnant-Age',
                                                            ax=None)
# Class 1 in blue, drawn on the same Axes
pregnant_age[pregnant_age['Class variable'] == 1].plot(kind='scatter', y='Number of times pregnant', c='blue',
                                                       x='Age (years)', title='Number of times pregnant-Age', ax=ax)
plt.show()

【Analysis】
First read the data with pandas.read_csv, then name each column for easier processing. Since the goal is to show the relationship between Number of times pregnant and Age, keep only these two columns together with Class variable, which is used to color the points. Then draw the two classes on the same axes with plot, setting kind to 'scatter' and specifying the x and y columns.
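
The same two-class overlay can be written as a loop over the class labels. This is a sketch on six invented rows, not the original data; the `Agg` backend is selected only so that it runs without a display, and the legend labels are an addition for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Six made-up rows with the three columns used in the scatter plot
demo = pd.DataFrame({'Age (years)': [21, 35, 50, 29, 44, 61],
                     'Number of times pregnant': [0, 3, 6, 1, 4, 8],
                     'Class variable': [0, 0, 1, 0, 1, 1]})

# Create the Axes once and draw every class onto it
fig, ax = plt.subplots()
for label, color in {0: 'red', 1: 'blue'}.items():
    subset = demo[demo['Class variable'] == label]
    subset.plot(kind='scatter', x='Age (years)', y='Number of times pregnant',
                color=color, label=f'class {label}', ax=ax)
ax.set_title('Number of times pregnant-Age')
```

Passing the same `ax` to every call is what puts both classes in one figure; without it, each call to `plot` would open a new figure.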

【Run】
[Image: scatter plot of Number of times pregnant against Age, colored by class]

Draw a scatter matrix using all or some of the features.

【Code】

import pandas as pd
import matplotlib.pyplot as plt

# pima.csv has no header row, so the 9 columns are named explicitly
close_px_all = pd.read_csv('dataset/pima.csv', index_col=None, header=None)
close_px_all.columns = ['Number of times pregnant',
                        'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
                        'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
                        '2-Hour serum insulin (mu U/ml)', 'Body mass index', 'Diabetes pedigree function',
                        'Age (years)', 'Class variable']

# Draw a scatter matrix using all or some of the features
# Map each class label to a point color
color = {1: 'red', 0: 'blue'}
# Columns 0, 3 and 4; diagonal='kde' draws density curves on the diagonal (requires SciPy)
pd.plotting.scatter_matrix(close_px_all.iloc[:, [0, 3, 4]], figsize=(9, 9), diagonal='kde', s=40, alpha=0.6,
                           c=close_px_all['Class variable'].map(color))
plt.show()

【Analysis】
Number of times pregnant, Triceps skin fold thickness and 2-Hour serum insulin were selected to see how they relate to the class variable. First read the data with pandas.read_csv, then name each column for easier processing. Pass columns 0, 3 and 4 to scatter_matrix, coloring each point by its class, and display the resulting scatter matrix.
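
The two inputs of the scatter matrix, the selected feature columns and the per-row colors, can be inspected on their own. The sketch below uses three invented rows and a shortened, hypothetical name for the glucose column; no plotting is involved.

```python
import pandas as pd

# Three made-up rows with a subset of the pima columns
demo = pd.DataFrame({'Number of times pregnant': [0, 3, 6],
                     'Plasma glucose': [90, 120, 160],
                     'Diastolic blood pressure (mm Hg)': [70, 80, 85],
                     'Triceps skin fold thickness (mm)': [20, 30, 35],
                     '2-Hour serum insulin (mu U/ml)': [80, 0, 200],
                     'Class variable': [0, 0, 1]})

# Positional selection of columns 0, 3 and 4
features = demo.iloc[:, [0, 3, 4]]

# One color per row, looked up from the class label
colors = demo['Class variable'].map({1: 'red', 0: 'blue'})
```

`features` has one column per panel row of the matrix, and `colors` has one entry per plotted point.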

【Run】
[Image: scatter matrix of the three selected features, colored by class]

Draw a harmonic curve plot (Andrews curves).

【Code】

import pandas as pd
import matplotlib.pyplot as plt

# pima.csv has no header row, so the 9 columns are named explicitly
close_px_all = pd.read_csv('dataset/pima.csv', index_col=None, header=None)
close_px_all.columns = ['Number of times pregnant',
                        'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
                        'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
                        '2-Hour serum insulin (mu U/ml)', 'Body mass index', 'Diabetes pedigree function',
                        'Age (years)', 'Class variable']

# Draw Andrews curves (harmonic curves): one curve per row, colored by class
pd.plotting.andrews_curves(close_px_all, 'Class variable', color=['red', 'blue'])
plt.show()

【Analysis】
First read the data with pandas.read_csv, then name each column for easier processing. Then call pandas.plotting.andrews_curves directly, passing 'Class variable' as the class column, to draw the curves.
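
An Andrews curve turns one data row (x1, x2, x3, ...) into the function f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for t between -pi and pi. The sketch below evaluates this formula with NumPy. The helper `andrews_curve` is hypothetical and written for illustration; it is not the pandas implementation.

```python
import numpy as np

def andrews_curve(x, t):
    """Hypothetical helper: evaluate the Andrews curve of row x at the points t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # Constant term x1 / sqrt(2)
    result = np.full_like(t, x[0] / np.sqrt(2.0))
    for i, coef in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # harmonic number: 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += coef * np.sin(k * t)  # x2, x4, ... multiply sines
        else:
            result += coef * np.cos(k * t)  # x3, x5, ... multiply cosines
    return result

t = np.linspace(-np.pi, np.pi, 5)
# A row with only the first feature set gives a flat curve
flat = andrews_curve([np.sqrt(2.0), 0.0, 0.0], t)
```

Rows with similar feature values give curves of similar shape, which is why curves of the same class tend to bundle together in the plot.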

【Run】
[Image: Andrews curves of the pima data, colored by class]

Origin blog.csdn.net/m0_46326495/article/details/123691554