Data Analysis, Machine Learning, and Visualization of E-commerce User Behavior with Python

If you need the source code for this project, the full set of documents, and related resources, you can send the blogger a private message.

Driven by the rapid development of digitalization and Internet technology, consumers' purchasing power and consumption concepts are constantly upgrading and changing. The explosive growth of user consumption data provides us with opportunities to find potentially valuable information.

This research uses the Taobao user behavior dataset provided by Alibaba, which contains nearly 4 million records. We used Python to preprocess the abnormal records in the dataset and obtain clean, usable data. By structuring the user behavior and product information and exploring it visually, we carried out a detailed visual analysis of traffic indicators such as PV and UV, as well as user preferences for products and user behavior patterns. We adopted the main ideas of e-commerce analysis: the funnel model, daily ARPPU, daily ARPU, payment rate, repurchase rate, retention rate, and other key e-commerce indicators. Based on these results, we provide merchants and platforms with practical strategies to drive effective marketing campaigns.

We use K-Means clustering together with the RFM model to divide users into four categories: new customers, star customers, secondary customers, and lost customers. For each of these four types we propose a different marketing strategy to further optimize marketing plans and e-commerce solutions. Finally, we expanded the four user behaviors (click, favorite, add to shopping cart, and purchase) into four data indicators and used a logistic regression model to predict users' purchase behavior. The model reached an accuracy of 98%.

1.1 Research Significance

The continuous progress of Internet technology has driven the rapid development of global e-commerce, and this trend has also made e-commerce the main choice for people to purchase items. Taobao, as one of China's e-commerce giants, covers a wide range of people and has a large number of users. Therefore, the collection and analysis of its user behavior data is extremely important for improving Taobao's business decisions.

omitted here...

1.2 Research purpose

(1) Analyze Taobao user behavior data

(2) Explore the relationship between user behavior and product sales

(3) Provide in-depth analysis of user portraits and product sales trends

(4) Provide business decision support for Taobao

omitted here...

1.3 Research Significance

(1) Improve platform user experience

(2) Optimize product strategy

(3) Optimize the recommendation system

(4) Provide decision support for Taobao

(5) Promote the development of e-commerce industry

omitted here...

2. Research process

2.1 Overall research route

Figure 1 The overall research roadmap of this paper

2.2 Data introduction

The research data used in this paper is an open source dataset provided by the Alibaba Tianchi Competition. It can be used for research on visual analysis of big data user behavior, and also as practice data for structured big data analysis and data analysis algorithms.

Each record contains the following user behavior fields: user ID, product ID, product category ID, behavior type, and timestamp.

The dataset covers November 25, 2017 to December 3, 2017, but contains a small amount of dirty data. When designing a structured analysis system, the dataset must be further preprocessed and cleaned so that it can support the subsequent structured analysis and data analysis algorithms.

Table 1 Data field attribute introduction

Field | Description
User ID | Integer; serialized user ID
Product ID | Integer; serialized product ID
Product category ID | Integer; serialized ID of the category the product belongs to
Behavior type | String enumeration: 'pv', 'buy', 'cart', 'fav'
Timestamp | Time at which the behavior occurred

Behavior type | Description
pv | View of the product detail page, equivalent to a click
buy | Purchase of the product
cart | Product added to the shopping cart
fav | Product added to favorites

Traditional data analysis software and programming languages include Excel, SQL, R, SAS, and Python. Different tools and programming languages suit different business scenarios; omitted here...

2.3 Data preprocessing

Before data analysis, data preprocessing is usually required. Data preprocessing refers to the cleaning, conversion, integration, and reduction of raw data to make the data more suitable for subsequent analysis. Data preprocessing can eliminate errors, deletions, anomalies, and duplications in data, improve data quality, reduce errors, and provide a more reliable basis for subsequent data analysis.

omitted here...

Figure 2 Data missing value and outlier exploration

After checking the data for missing values and outliers, we need to constrain the data along the time dimension, because a central idea of this research is to use time-based indicators to explore behavior at different times, the distribution of traffic indicators, and so on.

Figure 3 Time Dimension Expansion Code Implementation

Use the to_datetime() method of Pandas to convert the timestamp into an actual time value. Because time is a field that carries several kinds of information, we then expand it into separate fields such as year, month, day, week, hour, and minute.

Then group by the year field, aggregate and count the user IDs to see if there is any offset in the data in the time dimension.
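The conversion and expansion steps above can be sketched as follows. This is a minimal illustration, not the code behind Figure 3: the column names and sample rows are made up, and it assumes the timestamp is stored as Unix seconds and is read as UTC with no timezone shift.

```python
import pandas as pd

# Tiny stand-in for the behavior log; column names and rows are assumptions.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "behavior_type": ["pv", "buy", "pv", "cart"],
    # Assumed to be Unix seconds: 2017-11-25, 2017-11-26, 2017-12-03, 2015-01-01 (UTC)
    "timestamp": [1511568000, 1511654400, 1512259200, 1420070400],
})

# Convert Unix seconds into datetime values.
df["time"] = pd.to_datetime(df["timestamp"], unit="s")

# Expand the time value into separate fields.
df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
df["day"] = df["time"].dt.day
df["weekday"] = df["time"].dt.dayofweek  # Monday = 0
df["hour"] = df["time"].dt.hour
df["date"] = df["time"].dt.date

# Count records per year to spot rows outside the expected period.
rows_per_year = df.groupby("year")["user_id"].count()
print(rows_per_year)
```

Any year other than 2017 in this count points to records that fall outside the expected period.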

Figure 4 Time Dimension Distribution Exploration

The data we selected is from 2017 and covers behavior from November 25 to December 3. At this point we found records whose time falls outside that range, so we apply a constraint ourselves and restrict the data to this period, which makes the subsequent analysis easier.
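One way to apply such a constraint is a simple range filter. The sketch below uses invented rows and an assumed `time` column that already holds datetime values; the boundaries are the ones stated in the text.

```python
import pandas as pd

# Illustrative rows; two of them fall just outside the study period.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "time": pd.to_datetime([
        "2017-11-24 23:59:59", "2017-11-25 00:00:00",
        "2017-12-03 23:59:59", "2017-12-04 00:00:00",
    ]),
})

start = pd.Timestamp("2017-11-25 00:00:00")
end = pd.Timestamp("2017-12-03 23:59:59")

# Series.between is inclusive on both ends by default.
in_range = df[df["time"].between(start, end)].copy()

# Check the distribution of records per day after filtering.
in_range["date"] = in_range["time"].dt.date
rows_per_day = in_range.groupby("date")["user_id"].count()
print(rows_per_day)
```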

Figure 5 Distribution of data days

After finding that everything is normal, we have completed a basic preprocessing of the data, which is conducive to the accuracy and interpretability of our subsequent analysis, and will not cause too much interference to our analysis process.

2.4 Analysis and realization of user data

2.4.1 Overall User Behavior Analysis

omitted here...

Figure 6 Visualization of overall user behavior analysis

omitted here...

2.4.2 Daily Behavior Analysis of Users

Analyze the daily behavior of users from 2017-11-25 00:00:00 to 2017-12-03 23:59:59. The indicators are PV, UV and their means, visits per user and its mean, payment rate and its mean, and purchases per user and its mean.

Figure 7 User daily behavior visualization

Since the start of December, the number of visits and the number of visitors have gradually increased, peaking on 12-02 with about 480,000 visits, about 35,000 visitors, and about 10,000 purchases; the data for these days is generally greater than or equal to the mean.

Payment rate = number of people paying / total number of people, omitted here...
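The daily indicators and the payment rate formula can be sketched with pandas group-bys. The rows and column names below are invented, and "total number of people" is read here as the number of distinct users with any recorded behavior that day, which is an assumption about the original definition.

```python
import pandas as pd

# Invented behavior log with a precomputed date column.
log = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 1, 2],
    "behavior_type": ["pv", "buy", "pv", "pv", "cart", "pv", "buy"],
    "date": ["2017-12-01"] * 5 + ["2017-12-02"] * 2,
})

views = log[log["behavior_type"] == "pv"]
purchases = log[log["behavior_type"] == "buy"]

daily = pd.DataFrame({
    "pv": views.groupby("date")["user_id"].count(),          # page views per day
    "uv": log.groupby("date")["user_id"].nunique(),          # distinct active users per day
    "buyers": purchases.groupby("date")["user_id"].nunique(),  # distinct paying users per day
}).fillna(0)

daily["visits_per_user"] = daily["pv"] / daily["uv"]
daily["payment_rate"] = daily["buyers"] / daily["uv"]  # paying users / total users
print(daily)
```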

Figure 8 The payment rate of clicks & purchases on the current day

During this stage, the user not only clicked and browsed the product, but also purchased the product, which is omitted here...

2.4.3 Analysis of Users' Hourly Behavior

Here we explore user behavior along the time dimension, aggregating the data by hour of the day and presenting the same indicators used in the daily analysis.

Figure 9 Visualization of user hourly behavior analysis

05-10 o'clock: people gradually wake up and start commuting to work, visiting the app during travel time, and the number of visits and visitors keeps increasing; 10-17 o'clock: people visit the app in their free time during working hours, and the number of visits and visitors (omitted here)......

2.4.3 User’s Choice of Commodity Category

According to the analysis of Taobao user behavior data, the number of views and purchases differs markedly between product categories. Some popular categories, such as clothing, shoes, bags and accessories, and mobile phones and digital products, have a high browsing rate; omitted here...

Figure 10 Distribution of user behaviors across product categories

2.4.4 Daily Distribution of User Behavior

Here, the favorite, add-to-cart, and purchase behaviors are selected and explored visually by their daily distribution.

Figure 11 Analysis of daily behavioral data [favorite, add to shopping cart, purchase]

The click behavior is not included in the comparison here, because the number of clicks is omitted here...

2.4.5 Comparison of total visits and total transaction volume (hourly)

In the time-change graph of traffic volume in hours, omitted here...

Figure 12 Visualization of total visits and total transaction volume comparison (hourly)

2.4.6 PV and UV changes within a week

The number of pv and uv increases from Monday to Thursday during the week, and the week is omitted here...

Figure 13 PV and UV visualization within a week

2.4.7 Daily ARPPU and daily ARPU

Daily ARPPU refers to the average revenue per paying user per day; omitted here...

Figure 14 Daily ARPPU, Daily ARPU

Figure 15 Daily payment rate visualization

2.4.8 Repurchase interval and frequency

Repurchase time, the interval between purchases, and the number of purchases are indicators used to assess an e-commerce business; omitted here...

Figure 16 Visualization of days between repurchases

Figure 17 Repurchase Frequency Visualization

From the visualizations above, users typically repurchase about 3 times, so precisely targeted marketing recommendations should go to users with a relatively low repurchase rate.

2.4.9 Retention rate indicators

Retention rate means (omitted here)...

Figure 18 Visualization of retention rate indicators

The retention rate is good: during this period it is almost always above 70%. There is little difference between the next-day retention rate and the retention rates of 25/26/30 days, and the Double Twelve event can bring a short-term rise in retention.

2.5 Commodity Preference Analysis

2.5.1 top10 products with different behaviors

Users browse a large number of products every day, and each product has an ID field. After visually analyzing the product IDs under the different behaviors, we can grasp (omitted here)...

Figure 19 Visualization of top10 product IDs under different behaviors

2.5.2 top20 commodity categories with different behaviors

Looking at the different product categories, omitted here......

Figure 20 Heat distribution of commodity categories under different behaviors

2.6 Exploration of Data Analysis Algorithms

2.6.1 Funnel Model

The funnel model is a data analysis technique used in e-commerce analysis. It aims to help e-commerce companies understand users' purchasing behavior and improve their websites and promotional activities.

Figure 21 Funnel Model Visualization of 4 Behaviors

The possible paths after a user clicks are: click -> add to cart, click -> favorite, add to cart -> pay, favorite -> pay. The user churn rate is clearly relatively high; omitted here......

Figure 22 Funnel model visualization under independent visitors

Here is the visualization of the funnel model under independent visitors. Through the results display, we can find that the conversion rate from clicking to adding shopping cart behavior is relatively high, and the conversion rate from adding shopping cart to payment behavior is also relatively high.
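A funnel over unique visitors can be sketched by counting distinct users per behavior and dividing each stage by the one before it. The rows below are invented, and the click -> cart -> buy ordering is only one of the paths listed above. Note that this simple version counts users per stage independently; it does not check that the same user actually passed through the earlier stage.

```python
import pandas as pd

# Invented behavior log.
log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 4],
    "behavior_type": ["pv", "cart", "buy", "pv", "fav", "pv", "cart", "pv"],
})

# Event-level counts (every record counts once).
event_counts = log["behavior_type"].value_counts()

# Unique-visitor counts (every user counts once per behavior).
user_counts = log.groupby("behavior_type")["user_id"].nunique()

# One funnel path: click -> add to cart -> purchase.
stages = ["pv", "cart", "buy"]
funnel = user_counts.reindex(stages, fill_value=0)

# Conversion rate of each stage relative to the previous stage.
step_conversion = funnel / funnel.shift(1)
print(pd.DataFrame({"users": funnel, "conversion": step_conversion}))
```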

2.6.2 RFM data analysis algorithm

The RFM algorithm is a method that, by analyzing the customer's consumption (omitted here)...

Figure 23 RFM algorithm user group label

Since this user behavior data contains no transaction amounts, RFM is reduced here to an RF calculation, that is, the monetary value is treated as fixed.
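An RF calculation of this kind can be sketched as below. The purchase rows, the reference day, and the choice to split each dimension at its mean are all assumptions made for illustration; the original scoring rules and group labels are not shown in the text, so the segment codes here are hypothetical.

```python
import pandas as pd

# Invented purchase records (behavior_type == 'buy' only).
buys = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3, 4],
    "date": pd.to_datetime([
        "2017-11-30", "2017-12-02", "2017-12-03",
        "2017-11-25", "2017-12-01", "2017-12-03", "2017-11-26",
    ]),
})

# Assumed reference point: the day after the observation window ends.
reference_day = pd.Timestamp("2017-12-04")

rf = buys.groupby("user_id").agg(
    last_buy=("date", "max"),    # most recent purchase date
    frequency=("date", "count"),  # F: number of purchases
)
rf["recency"] = (reference_day - rf["last_buy"]).dt.days  # R: days since last purchase

# Assumed rule: split each dimension at its mean.
rf["r_high"] = rf["recency"] <= rf["recency"].mean()      # recent buyer
rf["f_high"] = rf["frequency"] >= rf["frequency"].mean()  # frequent buyer

# Hypothetical two-digit segment code: first digit R, second digit F.
rf["segment"] = rf["r_high"].astype(int).astype(str) + rf["f_high"].astype(int).astype(str)
print(rf[["recency", "frequency", "segment"]])
```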

Figure 24 RF data analysis user division

The most important customers are important development customers, which are omitted here...

2.6.3 Clustering Algorithm to Realize User Hierarchy

From the users' behavior data and the new dimension fields derived from the pivoted data, such as the number of recent purchases and the time of the latest purchase, we can use the K-Means clustering algorithm to cluster the user groups. Given the number of clusters, the RFM model is then used to characterize each cluster in depth; omitted here...

Figure 25 Elbow method and silhouette coefficient values

Based on the silhouette coefficient, the elbow (inflection) point, and the number of user groups initially produced by RFM, the optimal number of clusters is determined to be 4.
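Choosing the number of clusters with the elbow method and the silhouette coefficient might look like the sketch below. It uses scikit-learn and synthetic two-feature data with four well-separated groups standing in for recency and frequency; the real features, scaling, and range of k used in the study are not given, so these are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-user features (for example recency and frequency).
rng = np.random.default_rng(0)
centers = np.array([[1, 1], [1, 8], [8, 1], [8, 8]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.4, size=(50, 2)) for c in centers])

# K-Means is distance based, so put the features on a comparable scale first.
X_scaled = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertia = model.inertia_  # within-cluster sum of squares, plotted for the elbow
    silhouette = silhouette_score(X_scaled, model.labels_)
    results[k] = (inertia, silhouette)
    print(k, round(inertia, 2), round(silhouette, 3))

# The silhouette coefficient is highest for the best-separated clustering.
best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

Plotting the inertia against k shows the elbow, and the silhouette values give a second, independent check on the same choice.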

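Figure 26 User clustering visualization

Users are divided into four classes, with consumption recency, consumption frequency, and the interval between purchases used to determine the different user types. Class 0 users have a relatively short purchase interval but a purchase frequency that is not very high, so they can be classified as important development customers; omitted here......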

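2.6.4 User Purchase Prediction Model

Logistic regression is a common classification algorithm. It builds on linear regression and uses the logistic function to turn a continuous output into a probability used for class prediction. In logistic regression, the input features are combined linearly with the weights, and the result is passed through the logistic function, which maps it to a probability between 0 and 1. This probability expresses how likely the input is to belong to a given class.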
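The mechanism described above, a linear combination passed through the logistic function, can be sketched with NumPy alone. The six samples, the feature meaning (for example click and add-to-cart counts), the learning rate, and the iteration count are all invented for illustration; this is not the model or data behind the reported accuracy.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Invented features per user and a 0/1 purchase label.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0],
              [5.0, 2.0], [6.0, 3.0], [8.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Add a bias column so the intercept is learned as an ordinary weight.
Xb = np.hstack([np.ones((len(X), 1)), X])
w = np.zeros(Xb.shape[1])

# Plain gradient descent on the mean log loss.
learning_rate = 0.1
for _ in range(5000):
    p = sigmoid(Xb @ w)               # linear combination -> probability
    gradient = Xb.T @ (p - y) / len(y)
    w -= learning_rate * gradient

prob = sigmoid(Xb @ w)                # predicted purchase probability
pred = (prob >= 0.5).astype(int)      # threshold at 0.5 for the class label
print(np.round(prob, 3), pred)
```

In practice a library implementation such as scikit-learn's LogisticRegression would normally replace the hand-written loop; the loop is shown only to make the mapping from weights to probabilities explicit.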

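omitted here...... widely used for classification tasks in many fields.

After the basic statistical analysis and the business analysis models built above, we take a machine learning approach here: the user's behaviors are expanded into multiple dimension fields, with purchase behavior as the target column to predict, and a logistic regression model is introduced for prediction. The final prediction accuracy reaches 98%.

Figure 27 Logistic regression accuracy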

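2.7 Visualization dashboard design and display

A visualization dashboard is a tool that helps people better understand and interpret data by presenting it visually on a large screen. Compared with traditional data rep (omitted here)......

Figure 28 Visualization dashboard display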

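3. Summary

3.1 Research features

This research selects a large dataset, omitted here......

3.2 Research limitations

Because open source data was used, both the dimensions and the volume of the data are limited. Follow-up research could capture user behavior with web crawlers, and could also add more time-dimension analysis to the business analysis models.

3.3 Future outlook

As the number of e-commerce platform users (omitted here)......

A thought for each post

When you feel lost, try letting go and setting sail again.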

Origin blog.csdn.net/weixin_47723732/article/details/131552141