Python comprehensive case: data analysis of the tips dataset (detailed walkthrough + source code analysis)

Table of contents

1. Import the required modules and load the data. Read the data file tips.xls and display the first 5 rows.

2. Analyze the data

3. Add a "per capita consumption" column

4. Query the data of male smokers whose per capita consumption is greater than 5

5. Analyze the relationship between tip amount and total consumption: is there a positive correlation? Plot and observe.

6. Analyze which is more generous, men or women: group by gender and compare the average tip

7. Analyze the relationship between the day of the week and the tip by drawing a histogram

8. Graphically analyze the effect of the combination of gender + smoking on generosity

9. Plot and analyze the relationship between the meal time period and the tip amount

Summary



This exercise analyzes and visualizes the tip data. The data used is stored in the file tips.xls.


1. Import the required modules and load the data. Read the data file tips.xls and display the first 5 rows.

# Import the required modules
import pandas as pd
import matplotlib.pyplot as plt

# Load the data and display the first 5 rows
# (reading the legacy .xls format requires the xlrd package to be installed)
tips_data = pd.read_excel('tips.xls')
print(tips_data.head())


2. Analyze the data

1. View the descriptive statistics of the data.

2. Rename the columns to Chinese names (total_bill: total consumption, tip: tip, sex: gender, smoker: whether smoking, day: day of the week, time: meal time period, size: number of people), and display the first 5 rows of data.

# Display descriptive statistics for the numeric columns
print(tips_data.describe())

# Rename the columns (the list must match the original column order) and display the first 5 rows
tips_data.columns = ['消费总额', '小费', '性别', '是否抽烟', '星期', '聚餐时间段', '人数']
print(tips_data.head())
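
Assigning a list to `columns` relies on the file's columns being in exactly the expected order. As a hedged alternative, the renaming can be sketched with an explicit old-name to new-name mapping. The two rows below are made-up values used only so the snippet runs on its own; they are not taken from tips.xls.

```python
import pandas as pd

# Toy frame with the original English headers (values are invented for illustration).
df = pd.DataFrame({
    "total_bill": [16.99, 10.34],
    "tip": [1.01, 1.66],
    "sex": ["Female", "Male"],
    "smoker": ["No", "No"],
    "day": ["Sun", "Sun"],
    "time": ["Dinner", "Dinner"],
    "size": [2, 3],
})

# Map each old name to its new name, so the result does not depend on column order.
column_map = {
    "total_bill": "消费总额",
    "tip": "小费",
    "sex": "性别",
    "smoker": "是否抽烟",
    "day": "星期",
    "time": "聚餐时间段",
    "size": "人数",
}

# rename() returns a new DataFrame; the original df keeps its English headers.
renamed = df.rename(columns=column_map)
print(renamed.head())
```

A mapping also ignores keys that are not present in the frame, so it is worth checking `renamed.columns` afterwards to confirm every column was actually renamed.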


3. Add a "per capita consumption" column

# Add the "per capita consumption" column: total consumption divided by the number of people
tips_data['人均消费'] = tips_data['消费总额'] / tips_data['人数']
print(tips_data.head())


4. Query the data of male smokers whose per capita consumption is greater than 5

# First select male smokers, then keep the rows whose per capita consumption is greater than 5
smoking_male = tips_data[(tips_data['是否抽烟']=='Yes') & (tips_data['性别']=='Male')]
result = smoking_male[smoking_male['消费总额'] / smoking_male['人数'] > 5]
print(result)
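
The same query can also be written as a single boolean mask. The sketch below is one possible form, not the author's code; it uses four invented rows and assumes the values 'Yes' and 'Male' are spelled as in the code above.

```python
import pandas as pd

# Invented rows for illustration only; not taken from tips.xls.
toy = pd.DataFrame({
    "消费总额": [20.0, 8.0, 30.0, 12.0],
    "性别": ["Male", "Male", "Female", "Male"],
    "是否抽烟": ["Yes", "Yes", "Yes", "No"],
    "人数": [2, 4, 2, 2],
})
toy["人均消费"] = toy["消费总额"] / toy["人数"]

# Combine all three conditions with &; each comparison needs its own parentheses.
mask = (
    (toy["是否抽烟"] == "Yes")
    & (toy["性别"] == "Male")
    & (toy["人均消费"] > 5)
)
result = toy.loc[mask]
print(result)
```

Only the first toy row satisfies all three conditions: the second fails the per capita threshold, the third is female, and the fourth is a non-smoker.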


5. Analyze the relationship between tip amount and total consumption: is there a positive correlation? Plot and observe.

# Draw a scatter plot of tip amount against total consumption
x = tips_data['消费总额']
y = tips_data['小费']
plt.scatter(x, y)
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show()

It can be seen that the tip amount tends to increase as the total consumption increases, which indicates a positive correlation between the two, although the spread of the points shows that it is not a very strong one.
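
The visual impression from a scatter plot can be backed up with a number. A minimal sketch, assuming the Pearson coefficient is an acceptable measure here: `Series.corr` computes it between two columns. The four data points are invented, so the resulting value says nothing about tips.xls itself.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "消费总额": [10.0, 20.0, 30.0, 40.0],
    "小费": [1.0, 3.0, 2.0, 5.0],
})

# Pearson correlation coefficient: +1 is a perfect increasing line,
# 0 means no linear relationship, -1 is a perfect decreasing line.
r = toy["消费总额"].corr(toy["小费"])
print(f"Pearson r = {r:.3f}")
```

On the real data, the same call would be `tips_data['消费总额'].corr(tips_data['小费'])`; a value clearly above 0 but well below 1 would match the description of a moderate positive correlation.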


6. Analyze which is more generous, men or women: group by gender and compare the average tip

# Compute the average tip for male and female customers
gender_tip_mean = tips_data.groupby('性别')['小费'].mean()
print(gender_tip_mean)

It can be seen that in this dataset, male customers tip slightly more than female customers on average. Judging from this data alone, male customers may be more generous.
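
The average tip amount also depends on how large the bills are, so one hedged refinement is to compare the tip as a fraction of the bill. The column name `小费比例` (tip ratio) is introduced here for illustration and does not appear in the original code; the rows are invented and were chosen so that the two measures disagree.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "消费总额": [10.0, 20.0, 10.0, 40.0],
    "小费": [2.0, 4.0, 1.5, 6.0],
    "性别": ["Female", "Female", "Male", "Male"],
})

# Hypothetical helper column: tip as a fraction of the total bill.
toy["小费比例"] = toy["小费"] / toy["消费总额"]

# Average tip amount and average tip ratio per gender, side by side.
summary = toy.groupby("性别")[["小费", "小费比例"]].mean()
print(summary)
```

In this toy data the male rows have the higher average tip but the lower average ratio, which shows why the two measures can lead to different conclusions about generosity.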


7. Analyze the relationship between the day of the week and the tip by drawing a histogram

# Draw a stacked histogram of tip amounts, one series per day of the week
grouped = tips_data.groupby('星期')['小费']
days = list(grouped.groups)
hist_data = [grouped.get_group(day) for day in days]
# Pass the day names as labels so the legend entries match the plotted series
plt.hist(hist_data, bins=10, histtype='bar', stacked=True, label=days)
plt.legend()
plt.xlabel('Tip amount')
plt.ylabel('Frequency')
plt.show()
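
A stacked histogram shows the distribution but makes the days hard to compare directly. As a complementary sketch, a per-day summary table can be computed with `groupby` and `agg`. The day labels 'Thur', 'Fri', 'Sat', 'Sun' are an assumption about how the days are spelled in the file, and the rows are invented.

```python
import pandas as pd

# Invented rows for illustration only; the day labels are assumed.
toy = pd.DataFrame({
    "星期": ["Sun", "Thur", "Sat", "Sun", "Fri", "Sat"],
    "小费": [3.0, 2.0, 4.0, 5.0, 1.0, 2.0],
})

# An ordered categorical makes the groups appear in weekday order
# instead of alphabetical order.
day_order = ["Thur", "Fri", "Sat", "Sun"]
toy["星期"] = pd.Categorical(toy["星期"], categories=day_order, ordered=True)

# Number of bills, mean tip, and median tip for each day.
per_day = toy.groupby("星期", observed=True)["小费"].agg(["count", "mean", "median"])
print(per_day)
```

The `count` column is worth reading alongside the mean: a day with only a few bills can show an extreme average purely by chance.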

 


8. Graphically analyze the effect of the combination of gender + smoking on generosity

# Draw box plots of the tip amount for each gender + smoking combination
is_male = tips_data['性别'] == 'Male'
is_female = tips_data['性别'] == 'Female'
smokes = tips_data['是否抽烟'] == 'Yes'
no_smoke = tips_data['是否抽烟'] == 'No'
fig, ax = plt.subplots()
# Combine both conditions into one mask per group; chaining two boolean filters
# (df[mask1][mask2]) makes pandas warn about reindexing the second mask
ax.boxplot([tips_data.loc[is_male & smokes, '小费'],
            tips_data.loc[is_male & no_smoke, '小费'],
            tips_data.loc[is_female & smokes, '小费'],
            tips_data.loc[is_female & no_smoke, '小费']])
ax.set_xticklabels(['Male smoker', 'Male non-smoker', 'Female smoker', 'Female non-smoker'])
plt.xlabel('Gender and smoking')
plt.ylabel('Tip amount')
plt.title('Effect of gender and smoking on tipping behavior')
plt.show()

It can be seen that tips from male smokers are the highest of the four combinations, while tips from female non-smokers are the lowest. Thus, in this dataset, male smokers may be more generous, while female non-smokers may be less generous.
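
Reading the highest and lowest group off a box plot can be cross-checked numerically. A minimal sketch, using invented rows rather than the real data: group by both columns at once and compute the mean, median, and count of the tip for each combination.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "性别": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "是否抽烟": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "小费": [4.0, 3.0, 3.5, 2.0, 6.0, 3.0],
})

# One row per (gender, smoker) combination.
stats = toy.groupby(["性别", "是否抽烟"])["小费"].agg(["mean", "median", "count"])
print(stats)

# Reshape the means into a gender x smoker table for easier comparison.
mean_table = stats["mean"].unstack()
print(mean_table)
```

The median is the line drawn inside each box of a box plot, so the `median` column is the most direct numeric counterpart of the figure.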


9. Plot and analyze the relationship between the meal time period and the tip amount

# Draw a scatter plot of tip against total bill, with one color per meal time period
colors = ['blue', 'green', 'red', 'purple']
grouped = tips_data.groupby('聚餐时间段')
for i, (key, group) in enumerate(grouped):
    plt.scatter(group['消费总额'], group['小费'], label=key, color=colors[i])
plt.xlabel('Total bill amount')
plt.ylabel('Tip amount')
plt.title('Relationship between meal time and tipping behavior')
plt.legend()
plt.show()

It can be seen that for lunch and dinner the tip amount is roughly positively correlated with the total bill, while the points for breakfast and supper are relatively sparse, with no obvious correlation. So, from this data, it seems that lunch and dinner are more likely to receive higher tips.
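
Whether each time period shows a correlation can also be checked per group. The following is a sketch under stated assumptions: the period labels 'Lunch' and 'Dinner' and all values are invented, and the Pearson coefficient is used as the measure.

```python
import pandas as pd

# Invented rows for illustration only; the period labels are assumed.
toy = pd.DataFrame({
    "聚餐时间段": ["Lunch"] * 3 + ["Dinner"] * 3,
    "消费总额": [10.0, 20.0, 30.0, 15.0, 25.0, 35.0],
    "小费": [1.0, 2.0, 3.0, 3.0, 2.0, 4.0],
})

# Correlation between total bill and tip, computed separately for each period.
per_period = {
    period: group["消费总额"].corr(group["小费"])
    for period, group in toy.groupby("聚餐时间段")
}
summary = pd.Series(per_period, name="pearson_r")
print(summary)

# Group sizes matter: a coefficient from a handful of points is unreliable.
print(toy.groupby("聚餐时间段").size())
```

For sparse groups, such as the breakfast and supper points described above, the group size printed at the end is the first thing to check before interpreting the coefficient.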

Summary

This case walks through a complete data analysis and visualization process. Its main steps are as follows:

  1. Import the required modules, Pandas and Matplotlib.

  2. Use Pandas to read and process the dataset, including renaming columns, calculating per capita consumption, and querying data under specific conditions.

  3. Use Matplotlib to draw several types of charts, including scatter plots, histograms, and box plots, to explore how customer characteristics relate to the tip amount.

  4. Customize the charts by adding titles, axis labels, and legends.

  5. Consider edge cases in the data so that the code runs reliably and efficiently.

These data analysis and visualization techniques help us understand the data, discover patterns and trends in it, and provide a reference for further research and decision-making. It is also necessary to pay attention to data quality and code efficiency to avoid unexpected problems.

Source code download:

visualization.py Jiang Yanxi/Xiao Jiang’s CSDN - Gitee.com https://gitee.com/jiang-yanxi123/xiaojiangs---csdn/blob/master/visualization.py


Origin blog.csdn.net/m0_62338174/article/details/130080873