Design and Implementation of Online Shopping Behavior Analysis Based on Hadoop

Based on the open-source Taobao user behavior data set released by Alibaba Tianchi, this study carries out multi-dimensional user behavior analysis on the Hadoop big data platform in order to provide actionable decision support for e-commerce sales.

This study uses the data from December 1 to 18, 2021, in which each row of the data set records one user action. The data set is first uploaded to HDFS, after which Hadoop's Flume component is configured to load it automatically into the Hive database for analysis. Common e-commerce indicators such as PV, UV, bounce rate, and repurchase rate are computed; user behavior and activity are examined from multiple perspectives along the time dimension; and hot-selling item IDs, hot-selling commodity categories, and user geographic locations are analyzed statistically. The resulting tables are stored in Hive and then exported automatically to the relational database MySQL with the Sqoop component, which simplifies storage, further analysis, and display.

The result tables are then visualized with Python's pyecharts library. Multi-dimensional charts are drawn from the data in MySQL so that the results are easy to understand and present. Finally, the charts are combined with pyecharts' Page layout and an HTML dashboard populated with static data to build an interactive large-screen visualization. Presenting the results through rich charts helps decision makers act quickly.

1.1 Research Background

In recent years, with the popularization of the Internet and the growth of e-commerce, more and more people shop online, and competition in the e-commerce industry has become increasingly fierce. To better understand consumers' shopping behavior and needs, e-commerce companies must extract valuable information through big data analysis. With the support of big data technology, large volumes of user behavior data can be analyzed to understand consumer demand and provide better decision support for enterprises.

As one of the most popular big data technologies, Hadoop has become the platform of choice for processing large-scale data. It handles massive data quickly and efficiently, partitioning data and computing in parallel automatically, which greatly improves processing speed. The Hadoop ecosystem also provides many components suited to big data analysis, such as Flume, Hive, and Sqoop, which work together to automate data processing and analysis.

Therefore, this study uses Hadoop to analyze an open-source Taobao user behavior data set covering December 1 to 18, 2021. The data from this period reflect consumers' shopping behavior and trends and can provide valuable decision support for e-commerce companies. Flume is used to load the data set into the Hive database, after which common e-commerce indicators such as PV, UV, bounce rate, and repurchase rate are computed. User behavior and activity are also analyzed from multiple perspectives to better understand users' shopping behavior and needs.

In addition, hot-selling item IDs, hot-selling commodity categories, and user geographic locations in the data are analyzed statistically; this information can inform product positioning and marketing strategy. Finally, Python's pyecharts library is used for front-end visualization. Multi-dimensional charts drawn from the data in MySQL let decision makers grasp the results intuitively, and an HTML large-screen dashboard makes the data even easier to observe and understand.

In short, the main purpose of this research is to analyze Taobao user behavior from multiple dimensions on the Hadoop big data platform and to provide valuable decision support for e-commerce companies. Statistical and pivot analysis of common e-commerce indicators and user behavior indicators clarifies users' shopping behavior and needs and guides product positioning and marketing strategy. The study also explores the application of Hadoop to big data analysis and, combined with Python's visualization library, realizes a visual presentation of the data, providing better tools for analysis and decision-making.

The background of this study, then, is that competition in the e-commerce industry is intensifying, and big data analysis is needed to understand consumers' needs and shopping behavior. Hadoop, one of the most popular big data technologies, coordinates multiple components to automate data processing and analysis. This study applies Hadoop to multi-dimensional analysis of Taobao user behavior and, combined with visual display, provides decision support for e-commerce companies. It also offers ideas and technical groundwork for big data analysis and visualization, which is meaningful for promoting the development and application of big data technology.

1.2 Analysis of current research status at home and abroad

In recent years, with the rapid development of Internet technology and e-commerce, big data analysis has been applied ever more widely in the e-commerce field. This section surveys the current application of Hadoop-based big data analysis in e-commerce at home and abroad.

Research at home and abroad shows that applying Hadoop-based big data analysis in e-commerce has become a clear trend. In China, Alibaba is a major contributor to the Hadoop ecosystem, and its large-scale data analysis platform MaxCompute is widely used across its e-commerce businesses, such as Taobao and Tmall. Baidu, Tencent, and other companies are likewise applying Hadoop technology to big data analysis in e-commerce.

Abroad, e-commerce giants such as Amazon and eBay have invested heavily in big data analysis, for example using Hadoop to analyze user behavior data to improve sales efficiency and user experience. Smaller e-commerce companies in the United States are also adopting big data analysis for the same purpose.

In terms of specific applications, Hadoop-based big data analysis technology is mainly used in the following aspects:

First, analyzing user behavior data enables behavior prediction and personalized recommendation. For example, a user's historical behavior can be used to predict products they may be interested in and to recommend related items.

Second, analyzing commodity sales data enables prediction of sales trends and popular products. For example, sales data can reveal which products are likely to become popular, so that marketing measures can be taken in time to increase sales.

In addition, big data analysis can help e-commerce companies improve operational efficiency: analyzing e-commerce data exposes sales bottlenecks and optimization points so that corrective measures can be taken promptly.

Finally, big data analysis can also help e-commerce companies control risk: analyzing the data reveals potential risk factors so that timely measures can be taken to reduce them.

To sum up, Hadoop-based big data analysis has achieved substantial results in e-commerce, and there is still much room for development. Future work may combine it with artificial intelligence to achieve more intelligent and precise marketing and operations, or with the Internet of Things and cloud computing to build more complete and efficient e-commerce platforms. As the technology matures, data security and privacy protection mechanisms must also be strengthened so that user data remains secure and private.

In short, Hadoop-based big data analysis has great potential in e-commerce. Analysis of user behavior, product sales, operational efficiency, and risk can help companies improve sales efficiency and user experience, and thereby achieve greater business and social value.

1.3 Research purpose

This paper aims to conduct in-depth research on Taobao users' online shopping behavior through Hadoop-based big data analysis technology, in order to provide feasible decision-making for e-commerce sales. The specific research purposes include the following aspects:

Collect and organize a big data sample of Taobao users' online shopping: the study selects the open-source data set released by Alibaba Tianchi, takes the Taobao user behavior data from December 1 to 18, 2021 as the sample, and extracts representative feature variables for subsequent analysis.

Conduct multi-dimensional user behavior analysis on the Hadoop platform: Hadoop's Flume component is configured to load data automatically, the data is uploaded to HDFS, and it is then loaded into the Hive database for analysis. Common e-commerce indicators such as PV, UV, bounce rate, and repurchase rate are computed, and user behavior and activity are examined from multiple perspectives along the time dimension to explore the behavioral characteristics of Taobao users' online shopping in depth.

Statistically analyze hot-selling item IDs, hot-selling commodity categories, and user geographic locations in the e-commerce data: by screening and classifying the analysis results, these three factors are analyzed statistically to gain a deep understanding of Taobao users' shopping behavior and consumption habits.

Use Python's pyecharts library for front-end visualization: based on the above analysis results, multi-dimensional chart types are drawn to present the characteristics of Taobao users' online shopping behavior clearly.

Design an interactive large-screen visualization: finally, the charts are combined with pyecharts' Page layout into a front-end interactive large-screen display, and an HTML dashboard populated with static data completes the visualization. The rich charts help decision makers quickly grasp Taobao users' online shopping behavior and make more accurate decisions.

To sum up, this study conducts in-depth analysis of Taobao users' online shopping behavior with Hadoop-based big data technology in order to support e-commerce sales decisions. It analyzes users' behavioral characteristics and consumption habits, hot-selling products, and geographic distribution across multiple dimensions, and presents the results through front-end charts and a large-screen display. The goal is to help e-commerce platforms better understand consumer needs and behavior, improve sales efficiency and competitiveness, and promote the sustainable development of the industry; the study also explores the application and prospects of Hadoop-based big data analysis in e-commerce and offers a reference for related research.

1.4 Research Significance

This paper conducts big data analysis on open-source Taobao user behavior data, focusing on the application of the Hadoop platform to e-commerce sales and the value of multi-dimensional user behavior analysis for e-commerce decision-making. The significance of this study lies in the following aspects.

First of all, this study provides a Hadoop-based big data analysis solution for e-commerce data analysis. With the continuous development of Internet technology, e-commerce platforms are becoming more and more popular, and the accumulation of massive user behavior data provides basic data for e-commerce decision-making. This paper uses Hadoop technology to analyze massive data, combined with multi-dimensional perspective analysis, to mine the potential value in the data, and provide more accurate and feasible solutions for e-commerce decision-making.

Second, this study is innovative in multi-dimensional user behavior analysis. Statistical analysis of common e-commerce indicators such as PV, UV, bounce rate, and repurchase rate, together with multi-perspective analysis of user behavior and activity along the time dimension, gives a more comprehensive picture of behavior patterns. The study also statistically analyzes hot-selling item IDs, hot-selling product categories, and user geographic locations, exploring behavior patterns from multiple angles to provide a more accurate basis for e-commerce decisions.

Third, this study is practical in terms of visual presentation. Python's pyecharts library is used for front-end visualization: multi-dimensional charts drawn from the data in MySQL are easy to understand and display, and an HTML large-screen dashboard populated with static data gives e-commerce decision makers a clearer, more intuitive view of the data.

Finally, this study has important implications in terms of e-commerce decision-making. With the continuous development of e-commerce platforms, more and more user behavior data are accumulated. How to tap the potential value of these data and improve sales performance has become an important issue for e-commerce platforms. The Hadoop-based big data online shopping behavior analysis scheme proposed in this paper can provide feasible decisions for e-commerce platforms, optimize marketing strategies, improve user conversion rates, and achieve better business value.

To sum up, this study has research significance and application value in e-commerce data analysis, multi-dimensional user behavior analysis, visual display, and e-commerce decision-making. In China, with the rapid rise of e-commerce platforms, e-commerce data analysis has attracted wide attention, and the Hadoop-based analysis scheme proposed here offers a useful reference for the data analysis and decision-making of domestic platforms.

At the same time, in the world, e-commerce data analysis is also a hot research field. In particular, some large foreign e-commerce platforms have achieved certain results in big data analysis and artificial intelligence technology, providing more accurate and efficient solutions for e-commerce decision-making. The Hadoop-based big data analysis solution proposed in this paper can also provide a feasible solution for the international e-commerce platform.

In short, this study proposes a Hadoop-based scheme for analyzing online shopping behavior, which provides a feasible basis for e-commerce decision-making through multi-dimensional user behavior analysis and visual display.

2 The overall design of the study

2.1 Overall research route

The main purpose of this research is to analyze open-source Taobao user behavior data in order to provide actionable decision support for e-commerce sales. To this end, the study selects the open-source data set released by Alibaba Tianchi and uploads it to Hadoop's HDFS. Hadoop's Flume component then loads the data automatically into the Hive database for analysis.

In the process of analysis, this study first conducts statistical analysis on common e-commerce indicators, such as PV, UV, bounce rate, repurchase rate, etc., to understand the basic situation of user behavior. Then, perform multi-dimensional perspective analysis on user behavior, activity and other indicators according to the time dimension to further understand the changing trends and rules of user behavior. In addition, this study also conducts statistical analysis on factors such as hot-selling IDs, hot-selling product categories, and user geographic locations in e-commerce data to understand the characteristics and preferences of user purchasing behavior.

To facilitate storage, analysis, and display, the result tables are stored in the Hive database and exported automatically to the relational database MySQL with the Sqoop component. On this basis, Python's pyecharts library draws multi-dimensional charts from the data in MySQL. Finally, the charts are combined with pyecharts' Page layout into a front-end interactive large-screen display, and an HTML dashboard populated with static data completes the visualization, helping decision makers act quickly.

To sum up, the overall research route is as follows. First, big data analysis is conducted on the open-source Taobao user behavior data, covering basic indicators, multi-dimensional pivot analysis, and the characteristics and preferences of purchasing behavior. The results are then stored in the Hive database and exported to MySQL for convenient storage and analysis. Finally, Python's pyecharts library draws multi-dimensional charts from the data in MySQL and assembles them into a large-screen visualization that makes the results easy to understand and helps decision makers act quickly.

The detailed description of the research route of this study is as follows:

(1) Data preparation and storage

This study takes the Taobao user behavior open-source data set released by Alibaba Tianchi as its research object. The data set is first uploaded to Hadoop's HDFS, after which Hadoop's Flume component is configured to load the data automatically into the Hive database for analysis.

(2) Data analysis and statistics

This study applies multi-dimensional data analysis to common e-commerce indicators such as PV, UV, bounce rate, and repurchase rate. User behavior and activity are analyzed from multiple perspectives along the time dimension to reveal trends and patterns, and hot-selling item IDs, hot-selling commodity categories, and user geographic locations are analyzed statistically to characterize users' purchasing behavior and preferences.
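The indicators above can be made concrete with a small sketch. The record layout below (user ID, item ID, behavior type, date, with behaviors pv/cart/fav/buy) mirrors the commonly published Tianchi UserBehavior schema but is an assumption here, as are the bounce-rate and repurchase-rate definitions, which vary between studies.

```python
from collections import Counter, defaultdict

# Toy records: (user_id, item_id, behavior, date); behavior is pv/cart/fav/buy.
records = [
    ("u1", "i1", "pv",  "2021-12-01"),
    ("u1", "i1", "buy", "2021-12-01"),
    ("u1", "i2", "buy", "2021-12-05"),
    ("u2", "i3", "pv",  "2021-12-02"),
    ("u3", "i1", "pv",  "2021-12-02"),
    ("u3", "i1", "buy", "2021-12-03"),
]

pv = sum(1 for _, _, b, _ in records if b == "pv")   # page views
uv = len({u for u, _, _, _ in records})              # unique visitors

# Repurchase rate: users with >= 2 purchases among users with >= 1 purchase.
buys_per_user = Counter(u for u, _, b, _ in records if b == "buy")
buyers = len(buys_per_user)
rebuyers = sum(1 for n in buys_per_user.values() if n >= 2)
repurchase_rate = rebuyers / buyers if buyers else 0.0

# Bounce rate here: share of visitors whose only action was a single pv.
actions = defaultdict(list)
for u, _, b, _ in records:
    actions[u].append(b)
bounced = sum(1 for a in actions.values() if a == ["pv"])
bounce_rate = bounced / uv

print(pv, uv, round(repurchase_rate, 2), round(bounce_rate, 2))  # → 3 3 0.5 0.33
```

In the actual pipeline these aggregations run as HiveQL over the full data set; the Python version only fixes the definitions.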

(3) Data storage and visual display

To facilitate storage, analysis, and display, the result tables are stored in the Hive database and exported automatically to MySQL with the Sqoop component. On this basis, Python's pyecharts library draws multi-dimensional charts from the data in MySQL; the charts are assembled with pyecharts' Page layout into a front-end interactive large-screen display, and an HTML dashboard populated with static data completes the visualization. The rich charts help decision makers act quickly.
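As a minimal sketch of the visualization step: a result table fetched from MySQL is reshaped into the x/y series a pyecharts bar chart expects. The table contents and column meanings are illustrative, and the pyecharts calls are shown in comments because rendering requires the pyecharts package and a live MySQL connection.

```python
# Rows as they might come back from the exported MySQL result table:
# (category name, sales count) -- purely illustrative values.
rows = [("Clothing", 1250), ("Electronics", 980), ("Books", 430)]

rows.sort(key=lambda r: r[1], reverse=True)   # hottest categories first
x_data = [name for name, _ in rows]
y_data = [count for _, count in rows]

# With pyecharts installed, the chart itself is then roughly:
#   from pyecharts.charts import Bar, Page
#   bar = Bar().add_xaxis(x_data).add_yaxis("sales", y_data)
#   Page().add(bar).render("dashboard.html")   # writes the HTML dashboard

print(x_data, y_data)
```

The Page object is what the study uses to combine several charts into one large-screen HTML page.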

(4) Result analysis and decision making

Finally, this study summarizes and analyzes the analysis results to form a feasible decision for e-commerce sales. According to the analysis results, decision makers can understand the characteristics and preferences of user behavior, and formulate marketing strategies and promotion plans in a targeted manner to improve sales results and customer satisfaction.

To sum up, this study uses the open source data of Taobao user behavior for big data analysis research, and conducts multi-dimensional user behavior analysis through the Hadoop big data analysis platform, and finally forms a feasible decision for e-commerce sales. This research route combines big data storage and processing technology, data analysis and statistical methods, and data visualization display technology to provide a strong support for e-commerce sales.

Figure 1 Research Roadmap

2.2 Hadoop environment introduction and deployment

Hadoop is a distributed big data processing framework characterized by high fault tolerance, scalability, and performance. It consists of HDFS, MapReduce, YARN, and Hadoop Common, and is widely applied as a big data processing platform. HDFS is the Hadoop distributed file system, used to store large-scale data; MapReduce is the distributed computing framework used to process it; YARN is the resource manager that allocates cluster resources; and Hadoop Common provides the shared libraries and utilities used by the other modules.

When deploying a Hadoop environment, the following aspects need to be considered:

(1) Hardware equipment

Hadoop needs to run on a group of networked computer clusters, so it is necessary to choose hardware devices with high performance and reliability. Hardware devices should have high-speed CPU, large-capacity memory and disk space to meet Hadoop's big data processing needs.

(2) Operating system and software environment

Hadoop runs on Linux, so a Linux distribution suited to Hadoop must be chosen. In addition, Java (a JDK), SSH, and utilities such as scp and wget need to be installed.

(3) Hadoop environment configuration

When deploying the Hadoop environment, Hadoop must be configured to meet the specific application's requirements. The main configuration items include the size of the cluster, the configuration of each node, the HDFS replication factor, and the number of MapReduce tasks.

(4) Component installation and configuration

Hadoop's ecosystem components, such as HDFS, Hive, Flume, Sqoop, and MySQL, must also be installed and configured, in an order that respects their dependencies, to ensure that Hadoop runs normally and can process data.

(5) Security and permission management

When deploying a Hadoop environment, security and permission management must also be considered, mainly user authentication, data encryption, data access control, and data backup and recovery.

In short, deploying a Hadoop environment requires attention to hardware, the operating system and software environment, Hadoop configuration, component installation, and security and permission management, so that Hadoop runs normally and processes data correctly. Hadoop should also be tuned according to the specific application's requirements.

2.3 Pre-knowledge preparation

2.3.1 Introduction to HDFS

HDFS, the Hadoop Distributed File System, is one of the core components of Hadoop. It is a distributed file system with high fault tolerance, high reliability and high scalability, which is widely used in big data processing and storage. HDFS can disperse and store data on multiple nodes of the cluster, provides a unified access interface, and has functions such as high-speed reading, writing, and data backup. The basic concepts and related knowledge of HDFS will be introduced below.

(1) block

HDFS divides files into fixed-size blocks (128MB by default), and stores each block on a different node to achieve decentralized data storage and high-speed read and write.

(2) Name Node (NameNode)

The NameNode is the master node of HDFS. It stores the file system's metadata, maintains the namespace, and records which DataNodes hold each block of each file.

(3) Data Node (DataNode)

Data nodes are working nodes of HDFS, which are responsible for storing actual data blocks and responding to read and write requests from clients. It regularly reports information about data blocks to the name node, and receives instructions from the name node to perform operations such as copying and deleting data blocks.

(4) Replication factor

HDFS adopts a multi-copy backup mechanism of data blocks to improve data reliability and fault tolerance. By default, each data block has 3 copies stored on different data nodes to prevent single point of failure and data loss.

(5) Security

To secure HDFS, Hadoop provides mechanisms including user authentication, access control, data encryption, and data backup and recovery. User authentication and access control are the most basic of these: users are authenticated before access, and access to data is then controlled according to their permissions.

(6) Access method

HDFS offers several access methods, including the command-line interface (CLI), the Java API, and the Hadoop FileSystem abstraction. The FileSystem interface is the most commonly used: it resembles a standard file system interface and is convenient to use in applications.

To sum up, HDFS, the Hadoop Distributed File System, is one of Hadoop's core components. It stores files as distributed blocks and uses multi-replica backup to ensure reliability and fault tolerance; the NameNode manages the file system's metadata, while the DataNodes store the blocks. HDFS also provides several security mechanisms to protect data security and privacy.
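The block and replication arithmetic above is easy to make concrete. The sketch below is only illustrative (it is not an HDFS API); it uses the defaults stated in this section, a 128 MB block size and a replication factor of 3, and relies on the fact that HDFS does not pad the last block of a file.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default replication factor

def hdfs_footprint(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, raw cluster bytes) for a file on HDFS."""
    blocks = math.ceil(file_bytes / block_size)
    # The last block occupies only its actual size, so raw usage is
    # simply the file size times the replication factor.
    return blocks, file_bytes * replication

# A 1 GB file: 8 blocks, and 3 GB of raw cluster storage.
blocks, raw = hdfs_footprint(1024 * 1024 * 1024)
print(blocks, raw)  # → 8 3221225472
```

This is why tuning the replication factor (configuration item from Section 2.2) trades storage cost against fault tolerance.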

2.3.2 Introduction to Flume

Flume is an open source data collection and aggregation system of the Apache Foundation, mainly used for high-speed transmission of large-scale data. The basic function of Flume is to collect and aggregate data from multiple data sources, and then send the data to the target system. The design philosophy of Flume is high reliability, high scalability and flexibility, and it can handle various types of data sources, including logs, events, messages, etc.

Flume's architecture comprises three components: the source, the channel, and the sink. The source collects data from a data source, the channel buffers and stores it, and the sink delivers it to the target system. Flume supports many source and sink types, including Avro, Thrift, and Kafka, to accommodate different kinds of data sources.

Flume's workflow is as follows: the source collects data and hands it to the channel for buffering; the sink then reads from the channel and delivers the data to the target system; finally, the channel discards data that has been delivered successfully, freeing buffer space. Flume also supports custom interceptors, attached to sources, which can process events before they reach the channel, for example formatting or filtering them.

Using Flume for data transfer requires configuration: the source, channel, and sink components, plus related items such as interceptors and the failure-handling policy. Agents are normally configured through a Java properties file.
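A minimal agent definition in the usual properties-file format might look like the following. The agent name, spool directory, and HDFS path are illustrative assumptions, not the project's actual configuration.

```properties
# One agent: watch a local directory, write events to HDFS.
a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /data/taobao/incoming
a1.sources.s1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/user/hive/warehouse/user_behavior
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```

The agent would then be started with something like `flume-ng agent --name a1 --conf-file taobao-flume.conf`, after which files dropped into the spool directory flow to HDFS automatically.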

In summary, Flume is an open-source data collection and aggregation system for high-throughput transfer of large-scale data. Its architecture consists of source, channel, and sink components, with many types of each available. Data flows from the source, through the channel's buffer, to the sink and on to the target system, and the required configuration covers those components and related settings. Flume is valuable in big data processing and analysis because it improves the efficiency and reliability of data transfer.

2.3.3 Introduction to Hive

Hive is a Hadoop-based data warehouse tool that maps structured data into database tables and supports an SQL-like query language, letting users operate on data much as they would with SQL. Hive keeps its metadata in a metastore modeled on a relational database management system (RDBMS), while the data itself resides in the Hadoop cluster's HDFS file system, giving Hive high scalability and fault tolerance.

In Hive, data is stored and managed through tables. A Hive table is composed of columns and rows, and each column has a corresponding data type and name. Hive also supports a variety of file formats, including text, CSV, Avro, Parquet, and more. In addition, Hive also supports partitioning and bucketing to improve query performance.

The core of Hive is its query engine, which translates SQL statements into MapReduce jobs, thereby running queries and data processing on the Hadoop cluster. The query engine can combine multiple query jobs to implement complex query operations. In addition, Hive supports UDFs (User Defined Functions), UDAFs (User Defined Aggregate Functions), and UDTFs (User Defined Table-Generating Functions) to extend its functionality.

In addition to querying data, Hive supports data loading and data export. Data can be loaded through HiveQL statements or with the LOAD DATA command, and data from other storage systems can be imported into Hive. For export, INSERT statements can write data from Hive tables out to other storage systems.
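As a quick illustration (the table names, path, and columns are made up for this sketch, not taken from the project), loading and "exporting" within HiveQL might look like:

```sql
-- Hypothetical example: load a local CSV into a Hive table,
-- then write an aggregate into a result table with INSERT.
CREATE TABLE IF NOT EXISTS demo_events (
  user_id STRING,
  item_id STRING,
  behavior_type INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load data from the local file system (the path is illustrative).
LOAD DATA LOCAL INPATH '/tmp/events.csv' INTO TABLE demo_events;

-- Export a query result into another table.
CREATE TABLE IF NOT EXISTS demo_counts (behavior_type INT, cnt BIGINT);
INSERT OVERWRITE TABLE demo_counts
SELECT behavior_type, COUNT(*) FROM demo_events GROUP BY behavior_type;
```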

The advantage of Hive is that it turns massive data on a Hadoop cluster into structured data that is easy to query and process, with SQL syntax that lets users work in a way similar to a traditional relational database. Hive also supports multiple file formats and partitioning techniques to improve query performance, and its high scalability and fault tolerance allow it to handle PB-scale data with ease.

In short, Hive is a Hadoop-based data warehouse tool that can map structured data into a database table and provide support for SQL query language. Hive can convert massive data on Hadoop clusters into structured data that is easy to query and process, and supports multiple file formats and partitioning techniques to improve query performance. Hive has high scalability and fault tolerance, and can easily process PB-level data, so it has a wide range of applications in the field of big data analysis and processing.

2.3.4 Introduction to Sqoop

Sqoop is a tool for data interaction between relational database and Hadoop. It supports importing data from relational database to Hadoop and exporting data from Hadoop to relational database. The full name of Sqoop is SQL-to-Hadoop, which is one of the important components in the Hadoop ecosystem.

The basic concepts and related knowledge of Sqoop mainly include the following aspects:

(1) The principle and characteristics of Sqoop

Sqoop is written based on Java and uses Hadoop's MapReduce framework to support importing data from relational databases to Hadoop and exporting data from Hadoop to relational databases. Sqoop supports multiple data sources and data formats for importing and exporting data, such as relational databases such as MySQL, Oracle, and SQL Server, and data formats such as CSV, Avro, and Parquet. Sqoop also supports parallel import and export of data, which can be partitioned and processed in batches as needed.

(2) How to use Sqoop and commands

The usage and commands of Sqoop are relatively simple, mainly divided into two operations: import and export. Among them, the command to import data is: sqoop import, and the command to export data is: sqoop export. These commands also support various parameter options, which can be configured as needed. For example, you can specify the connection string, user name, and password of the data source, specify the query statement, delimiter, and file format for importing data, specify the table name and column name for exporting data, and so on.
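For instance, a hypothetical import command (the connection string, database, table, and target directory below are illustrative, not from this project) could be:

```shell
# Sketch: import a MySQL table into HDFS with 4 parallel mappers.
sqoop import \
  --connect jdbc:mysql://localhost:3306/shop \
  --username root -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --fields-terminated-by ',' \
  -m 4
```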

(3) Integration of Sqoop and Hadoop

The integration of Sqoop and Hadoop is mainly based on the Hadoop MapReduce framework, and the MapReduce jobs generated by Sqoop are submitted to the Hadoop cluster for processing. During the integration of Sqoop and Hadoop, it is also necessary to configure Hadoop environment variables and configuration files so that Sqoop can correctly connect and operate the data in the Hadoop cluster.

(4) Sqoop optimization and performance tuning

Optimization and performance tuning of Sqoop mainly involve the following: optimizing the query statements against the data source, setting a reasonable degree of parallelism for import and export, choosing an appropriate partitioning strategy, setting a suitable buffer size, and reducing unnecessary serialization and deserialization of data. These measures can significantly improve Sqoop's performance and efficiency and speed up data import and export.

In short, Sqoop is an important component in the Hadoop ecosystem, which is used for data interaction between relational databases and Hadoop. The principles and characteristics of Sqoop, usage methods and commands, integration with Hadoop, optimization and performance tuning, etc. need to be mastered and studied in order to give full play to the role of Sqoop in big data processing.

2.3.5 Introduction to MySQL

In Hadoop, MySQL is widely used for data storage and management. MySQL is an open source relational database management system with high efficiency, stability, and ease of use. It is one of the commonly used database management systems in Hadoop.

MySQL in Hadoop is mainly used in the following aspects:

(1) Store analysis results

When performing big data analysis, the analysis results need to be stored in MySQL for subsequent query and analysis. MySQL can provide efficient data storage and management, and supports SQL query at the same time, which can meet the needs of data analysis and query.

(2) Data import and export

In Hadoop, the import and export of data is very important. You can use the sqoop tool to import data from Hadoop's distributed file system to MySQL, and you can also use sqoop to export data from MySQL to Hadoop's distributed file system.

(3) Data backup and recovery

In Hadoop deployments, data backup and recovery are also very important. The backup and recovery tools provided by MySQL can be used to back up and restore the data held in MySQL, so that if something unexpected happens to the data, it can be recovered quickly.

(4) Database optimization

In Hadoop, MySQL performance optimization is also very important. You can optimize MySQL's query performance and response speed by configuring MySQL's cache, index, and query statements.
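As a small sketch of the index side of such tuning, assuming a key/value result table like the one used in this project (the index name is made up):

```sql
-- Index the frequently filtered column, then inspect the query plan
-- to confirm the optimizer can use it.
CREATE INDEX idx_result_key ON taobao_result (`key`);
EXPLAIN SELECT `value` FROM taobao_result WHERE `key` = 'pv';
```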

It should be noted that when using MySQL in Hadoop, you need to pay attention to the version and configuration of MySQL. Usually, you need to use the MySQL version suitable for Hadoop, and make corresponding configurations at the same time to ensure the normal operation and performance of MySQL.

In short, in Hadoop, MySQL is a very important component for data storage and management. By using MySQL, the efficiency and accuracy of data analysis and query can be improved, and at the same time, data backup and recovery capabilities can be improved to ensure data security and reliability. When using MySQL, you need to pay attention to version and configuration issues to ensure the normal operation and performance of MySQL.

2.3.6 Introduction to Pyecharts

Pyecharts is a Python data visualization library that implements the full range of Echarts chart types and integrates with mainstream Python web frameworks such as Flask and Django. Pyecharts is highly extensible and customizable and can meet a wide variety of data visualization needs.

The following are some basic concepts and related knowledge about Pyecharts:

Echarts is an open-source JavaScript visualization library that supports many chart types, including line charts, bar charts, pie charts, and scatter charts. Pyecharts is a Python package built on Echarts that exposes Echarts' capabilities through the Python language.

(1) Visualization type

Pyecharts supports many types of charts, including line charts, histograms, pie charts, scatter plots, maps, etc. Each chart type has different optional parameters and properties, which can be flexibly customized according to data requirements.

(2) Theme style

Pyecharts supports a variety of theme styles, including light, dark, chalk, essos, etc., and can choose a suitable theme style for visual display according to different data requirements.

(3) Data format

Pyecharts supports multiple data formats, such as list, tuple, pandas DataFrame, numpy array, etc., and can perform flexible format conversion according to the data source.

(4) Other features

Pyecharts also provides a variety of other features, such as event monitoring, animation effects, chart dragging, chart linkage, etc., to meet more advanced data visualization needs.

To sum up, Pyecharts is a Python data visualization library that implements the full range of Echarts chart types and integrates with mainstream Python web frameworks such as Flask and Django. Its extensibility and customizability allow it to meet a wide variety of data visualization needs.

2.4 Dataset Introduction

This data set is selected from the open-source data of Alibaba Tianchi and covers the user behavior of one Taobao merchant from December 1 to December 18, 2021. It contains fields such as user ID, product ID, behavior type, user geographic location, product category, date, and hour, totalling tens of thousands of rows, and is a representative e-commerce user behavior data set.

In this data set, user_id represents the unique identifier of the user, item_id represents the unique identifier of the product, and behavior_type represents the user's behavior type of the product, including browsing, collection, adding to shopping cart and purchasing. user_geohash represents the geographical location information of the user, item_category represents the category information of the product, and date and hour represent the date and hour when the user behavior occurred, respectively.

Through the analysis of this data set, we can understand the purchase behavior and preferences of users on the e-commerce platform, understand the sales of goods and the geographical distribution characteristics of users, and provide reference for decision-making of the e-commerce platform. At the same time, this data set also has certain application value of data mining and machine learning, such as predicting user's purchase behavior and product sales trend.

2.5 Configure import data environment and load data

First, to upload the data set to the Hadoop platform, we need to configure the parameters of the Flume configuration file. The configuration file is as follows:

Figure 2 Flume configuration file

This Flume configuration file defines a data-collection agent named agent3 with three elements: a source, a channel, and a sink. source3 is a spooling-directory source that watches the data directory /home/hadoop/taobao/data and does not add file-header information to events. channel3 is a file channel whose checkpoint directory is /home/hadoop/taobao/tmp/point and whose data directory is /home/hadoop/taobao/tmp. sink3 is a Hive sink: the Hive metastore address is thrift://hadoop:9083, the database is taobao, the table is taobao_data, the data format is DELIMITED with a comma delimiter, the field names are user_id, item_id, behavior_type, user_geohash, item_category, date, and hour, and each batch submits 90 records.

Finally, the source, channel, and sink are wired together: data flows from the source into the channel and from the channel out through the sink, ending up in the Hive table. Throughout the process, Flume moves data automatically from source to channel and from channel to sink, which makes the ingestion and import pipeline both efficient and reliable.
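Based on the description above, the agent3 properties file might look roughly like the following reconstruction (a sketch from the described parameters, not the original Figure 2):

```properties
agent3.sources  = source3
agent3.channels = channel3
agent3.sinks    = sink3

# Source: watch a spooling directory for new data files
agent3.sources.source3.type       = spooldir
agent3.sources.source3.spoolDir   = /home/hadoop/taobao/data
agent3.sources.source3.fileHeader = false
agent3.sources.source3.channels   = channel3

# Channel: durable file channel with checkpoint and data directories
agent3.channels.channel3.type          = file
agent3.channels.channel3.checkpointDir = /home/hadoop/taobao/tmp/point
agent3.channels.channel3.dataDirs      = /home/hadoop/taobao/tmp

# Sink: write events into the Hive table via the metastore
agent3.sinks.sink3.type                  = hive
agent3.sinks.sink3.hive.metastore        = thrift://hadoop:9083
agent3.sinks.sink3.hive.database         = taobao
agent3.sinks.sink3.hive.table            = taobao_data
agent3.sinks.sink3.serializer            = DELIMITED
agent3.sinks.sink3.serializer.delimiter  = ","
agent3.sinks.sink3.serializer.fieldnames = user_id,item_id,behavior_type,user_geohash,item_category,date,hour
agent3.sinks.sink3.batchSize             = 90
agent3.sinks.sink3.channel               = channel3
```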

After creating the Flume configuration file, we create the spooling directory for the raw data files. Thereafter, each time data needs to be loaded, we only have to move the data files into this monitored directory and the target data is imported automatically.

Then start the cluster, turn on the Hive metastore service and Flume log monitoring, and finally use a shell script to move the data files automatically, completing the data loading.

2.6 Create data table and result table in Hive

Note that this step must actually be completed before the loading step above: create a database in Hive, then create the data-receiving table and the result tables. The receiving table takes the streaming data from Flume, while the result tables store the output of the Hive analysis.

Figure 3 Create table display in hive

Through these SQL statements, we can create multiple tables in Hive for storing analysis results. These tables include:

(1) taobao_data: stores the raw data, with fields for user ID, product ID, behavior type, user geographic location, product category, date, and hour. It is stored in ORC format with transaction management enabled.

(2) taobao_result: stores general statistical analysis results as key/value pairs, holding statistics of different dimensions.

(3) taobao_result_date: This table is used to store statistical results by date dimension, including date and value.

(4) taobao_result_hour: This table is used to store statistical results by hour, including hour and value.

(5) taobao_result_item_id: This table is used to store statistical results by product ID dimension, including product ID and value.

(6) taobao_result_user_geohash: This table is used to store statistical results based on the user's geographic location, including user geographic location information and numerical value.

(7) taobao_result_item_category: This table is used to store statistical results by item category, including item category and value.

Through the creation of these tables, the analysis results can be conveniently stored and queried, which can help us better understand user behavior and product sales, so as to support the business decision-making of the e-commerce platform. At the same time, the creation of these tables also provides convenience for data mining and machine learning, such as user portraits and recommendation algorithms based on these tables.

2.7 Big data analysis and sqoop export

After the tables are created and the data set has been imported and loaded, the next step is the big data analysis itself. Query statements are written in HiveQL, and each analysis result is inserted into the corresponding result table created earlier.

Figure 4 Big data analysis source code

After completing the above data analysis, many result tables exist in the Hive data warehouse. These result tables now need to be exported from Hadoop to the MySQL relational database, which has the following benefits:

(1) MySQL is a common relational database with a wide range of application scenarios and development tools, and it has good support for data storage and management. Although Hive offers an SQL-style query interface, its underlying storage and query engine differ from those of a relational database such as MySQL, so it is worth exporting the analysis result tables into MySQL tables for further data processing and visual display.

(2) MySQL has good performance and scalability and supports large-scale data storage and highly concurrent queries. Hive is better suited to big data processing and querying, but for low-frequency queries or small-scale data MySQL may be the better fit. Exporting the analysis result tables from Hive to MySQL therefore lets each database play to its strengths across different processing and query scenarios.

(3) MySQL can better support the use of front-end visualization tools, such as Tableau, PowerBI, Metabase, etc., and can directly perform data query and chart display by connecting to the MySQL database. Although Hive also has similar tool support, it requires additional configuration and deployment work, which is not as convenient and efficient as MySQL's direct support.

To sum up, exporting the analysis result table in Hive to MySQL can make better use of the advantages of the two databases, and at the same time facilitate data storage and query operations, as well as visual display of data.

Before exporting, however, a receiving table must be created in MySQL so that sqoop has a target to export into.

The statements above are MySQL DDL used to create the table named taobao_result (and its sibling result tables). Each table has two fields, key and value, both of type varchar(255), using the utf8 character set with utf8_general_ci collation so that Chinese and other multi-byte characters are supported. Both fields default to NULL.

In addition, this table uses the InnoDB engine to support functions such as transaction management and foreign key constraints. The ROW_FORMAT attribute is Dynamic, indicating that the row format is dynamic and can be dynamically adjusted according to the size of the row data to improve data storage efficiency.

In general, this DDL statement defines a basic table structure that can be used to store statistical results of different dimensions. If you need to store more fields or define more complex data types, you need to expand and modify this statement.
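Reconstructed from the description above (a sketch, not the original statement), the DDL would be roughly:

```sql
CREATE TABLE `taobao_result` (
  `key`   VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `value` VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL
) ENGINE = InnoDB          -- transactions, foreign-key support
  DEFAULT CHARSET = utf8
  COLLATE = utf8_general_ci
  ROW_FORMAT = DYNAMIC;    -- row format adjusts to the row data size
```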

The next step is to use the sqoop command to export the data.

Figure 5 sqoop export data source code

This is a command that uses Sqoop to export data. Its main function is to export the data in the taobao_result table in Hive to the taobao_result table in MySQL.

The specific command parameters are explained as follows:

(1) sqoop export: means to execute the export command.

(2) --connect jdbc:mysql://localhost:3306/taobao: Indicates connecting to the Taobao database of MySQL, the port is 3306.

(3) --username root -P: Indicates that the root user is used to log in, and the -P option indicates that a password is required.

(4) --table taobao_result: Indicates to export data to the taobao_result table in MySQL.

(5) --export-dir /user/hive/warehouse/taobao.db/taobao_result: means to export data from the taobao_result table in Hive, and its storage path is /user/hive/warehouse/taobao.db/taobao_result.

(6) -m 1: Indicates that a Mapper task is used for the export operation.

(7) --input-fields-terminated-by '\001': Indicates that the field separator of the input data is \001.
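Putting these parameters together, the full command presumably resembles the following (reconstructed from the parameter descriptions, not copied from Figure 5):

```shell
# Export the Hive-backed HDFS directory into the MySQL table,
# using one mapper and the \001 field delimiter.
sqoop export \
  --connect jdbc:mysql://localhost:3306/taobao \
  --username root -P \
  --table taobao_result \
  --export-dir /user/hive/warehouse/taobao.db/taobao_result \
  --input-fields-terminated-by '\001' \
  -m 1
```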

To sum up, this command exports the data in the taobao_result table in Hive to the taobao_result table in MySQL through Sqoop, which is convenient for subsequent data storage and query.

2.8 Data Analysis and Visualization

For the analyzed results, sqoop exports the result tables to MySQL. This makes it easier to manage the analyzed data and to visualize it, since most visualization software can connect directly to a relational database for research and display. For this visualization, we write the obtained results statically into the code and use pyecharts for the visual display.

2.8.1 Analysis of store sales

Figure 6 Taobao store data analysis

From this we can see that the store receives a fairly large number of visits, nearly 60,000 records, but a per-user breakdown finds only 981 distinct users, and a further analysis of purchases finds only 273 purchase records. These figures give us an overall picture of the store's traffic and total sales.

Figure 7 Analysis of user shopping situation at a certain moment

From this we can see the gap between the number of visitors and the number of purchasers: not every visitor to this store ends up shopping.

Figure 8 The ratio of the number of purchases greater than 2 to the total number of people

This analysis shows that the store still has room to improve its repurchase rate. The repurchase rate is the share of customers who buy from the store, or buy its products, more than once; it directly reflects the store's attractiveness and quality, i.e. its ability to bring previous customers back for further purchases.

Figure 9 The bounce rate of the store

Bounce rate refers to the proportion of visitors who leave the website directly after viewing a single page, that is, the ratio of single-page visits to the total number of visits to that page.

The bounce rate is an important indicator of a website's user experience and the quality of its page content. A high bounce rate usually signals problems with certain pages or content, such as unattractive content, slow loading, or unreasonable page layout; a low bounce rate indicates a better user experience, more engaging content, and a site that is more effective at holding on to its visitors.

For an e-commerce website the bounce rate is likewise an important indicator: it helps administrators understand users' interest in the products and their shopping experience, guiding page optimization and product recommendation and ultimately improving the site's conversion rate and user stickiness.

The bounce rate here suggests that the store's product quality and attractiveness still fall short of ideal; the store should keep playing to its strengths while continuing to optimize its quality and the quality of its product recommendations.
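The indicators discussed in this section (PV, UV, repurchase rate, bounce rate) can be sketched in plain Python over a toy list of (user_id, behavior) events; the data and the exact definitions below are illustrative, not the store's real figures or the project's HiveQL:

```python
from collections import Counter

# Toy event log: u1 purchases twice, u2 once, u3 only browses.
events = [
    ("u1", "pv"), ("u1", "buy"), ("u1", "buy"),
    ("u2", "pv"), ("u2", "buy"),
    ("u3", "pv"),
]

pv = sum(1 for _, b in events if b == "pv")          # page views
uv = len({u for u, _ in events})                     # unique visitors
buys = Counter(u for u, b in events if b == "buy")   # purchases per user
buyers = len(buys)

# Repurchase rate: share of buyers who bought at least twice.
repurchase_rate = sum(1 for c in buys.values() if c >= 2) / buyers

# Bounce rate analog: share of visitors with only a single event.
visit_counts = Counter(u for u, _ in events)
bounce_rate = sum(1 for c in visit_counts.values() if c == 1) / uv

print(pv, uv, buyers, repurchase_rate, bounce_rate)  # 3 3 2 0.5 0.333…
```

In the project itself these aggregates are computed in HiveQL over the full data set; the sketch only pins down the arithmetic behind each indicator.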

2.8.2 User Behavior Analysis

The analysis and visualization of Taobao users' purchase behavior has the following advantages:

(1) More intuitive: By visually displaying Taobao users' purchase behavior, decision makers can more intuitively understand users' purchasing habits, product preferences, purchase paths and other information, so as to better formulate marketing strategies and optimize website design.

(2) More accurate: By visually displaying Taobao users' purchase behavior, users' behavior data and trends can be captured more accurately, helping companies better understand user needs and behaviors.

(3) More efficient: By displaying the purchasing behavior of Taobao users in a visual way, it is possible to quickly identify abnormal points and key concerns in the data, improve decision-making efficiency, and optimize marketing strategies.

(4) More flexible: different visualization methods can be adopted for different analysis requirements and business scenarios, such as histograms, line charts, and pie charts, so as to better satisfy differing analysis and presentation needs.

(5) More real-time: Through the real-time visual display of Taobao users' purchasing behavior, it is possible to understand the latest behaviors and trends of users in a timely manner, so as to make better decisions and adjustments.

To sum up, analyzing and visualizing Taobao users' purchase behavior can help enterprises understand user behavior and trends more intuitively, accurately, efficiently, flexibly, and in real time, thereby optimizing marketing strategies, improving user experience and site conversion rate, and ultimately strengthening the enterprise's competitiveness and profitability.

Figure 10 Taobao user behavior analysis

From this we can see that the store's users favor collecting (favoriting) products, and also that purchases outnumber add-to-cart actions. The unique-visitor purchase count is computed per user, according to whether or not each user made a purchase.

Figure 11 Analysis of user shopping situation

From the analysis of user shopping behavior here, we can directly read off the share of each e-commerce indicator within overall user behavior. The statistics show that clicks and add-to-cart actions dominate, followed by collect (favorite) actions and finally purchases. Such an analysis shows which behaviors take up the largest share for this store and gives an overall picture.

Figure 12 Purchase situation of users based on geographic location

By counting and analyzing these data, we can learn in which regions the store is relatively popular. Combining this with local characteristics and customs then allows precise recommendation and marketing to users, with accurate recommendation as the end result.

2.8.3 Statistical Analysis of Best-selling Commodities

Statistical analysis and visual display of hot-selling products is an important means of data analysis. It helps merchants better understand product sales and trends, improves sales efficiency and economic returns, and also supports work such as product recommendation and the optimization of product strategy.

Help understand product sales: hot-selling products are those with high sales volume. Statistical analysis and visual display of them give a clearer view of sales and trends, help merchants better understand user demand and market changes, and support better adjustment and management of commodity strategy.

Improve sales efficiency: through statistical analysis of hot-selling products, merchants can see more precisely which products are popular and concentrate resources on them, raising sales efficiency and conversion rate while cutting the resources wasted on unpopular products, thus improving economic returns.

Optimizing product strategy: Through the statistical analysis of popular products, merchants can understand users' preferences and needs for products of different categories, brands, and price segments, thereby optimizing product positioning and strategies, and enhancing product competitiveness and market share.

Figure 13 Statistical analysis of Taobao’s hot-selling commodity IDs

Through the statistical analysis of the store's hot-selling products, we can find out which products are more popular, and then we can further expand and adjust some of the characteristics and marketing strategies of these products.

Figure 14 Taobao commodity category statistics

Through the visual analysis of Taobao's hot-selling product categories, we can find which categories in the store are more popular, then adopt centralized procurement and recommendation for those categories, finally achieving precise marketing of that class of goods.

2.8.4 Analysis of Daily Time Dimension of Stores

By visually displaying the time dimension, we can have a general understanding of the data of each day, so that we will have a better understanding of the data.

Figure 15 Analysis of user activity from December 1st to December 18th

The user-activity analysis shows that December 12 (the "Double 12" promotion day) is a peak period, i.e. a day with notably more shopping.

Figure 16 Analysis of Average Daily User Clicks

Figure 17 Analysis of average daily user purchases

Figure 18 Analysis of average daily users adding shopping carts

Figure 19 Analysis of Average Daily User Favorites

2.8.5 Hourly Dimension Analysis of Stores

Visual analysis of hourly user behavior and user activity can help e-commerce platforms understand user activity and preference changes, and can also reveal user shopping behavior and characteristics in different time periods. By presenting the analysis results in a visualized way, the decision makers of the e-commerce platform can understand the rules and trends of user behavior more intuitively, and adjust business strategies and marketing activities in time to improve user conversion rate and satisfaction. For example, if it is found that the activity of users in a certain period of time is low, you can increase the conversion rate and retention rate of users by launching promotional activities for this period of time or optimizing the design of related pages. Through visual analysis, users' needs and preferences can be better discovered, and e-commerce platforms can be helped to improve users' shopping experience and satisfaction, thereby maximizing business value.

Figure 20 Analysis of average daily user activity

From the findings here we can conclude that user activity is relatively high at 7 and 8 o'clock in the evening, so the page design and manual customer service can be adjusted for this time period.

Figure 21 Analysis of average user clicks per hour

Figure 22 Analysis of average hourly user purchases

Figure 23 Analysis of the amount of users adding shopping carts per hour

Figure 24 Analysis of average hourly user favorites

2.9 Large screen visual design

Based on the design and implementation of Hadoop-based online shopping behavior analysis, building a large visual screen through pyecharts can bring the following benefits:

(1) Improve the effect of data visualization: converting the data into intuitive forms such as charts and maps better reveals its characteristics and patterns, making it easier to understand and analyze. Showing the visualization results on a large screen makes the data more vivid and intuitive, helping decision makers grasp its meaning and spot business opportunities.

(2) Improve the efficiency of data analysis: Through data visualization, abnormalities and trends in the data can be quickly found, so that decisions can be made quickly. Displaying data visualization results on a large screen allows decision makers to display data analysis results in real time in team meetings and perform interactive operations to quickly make decisions and adjust business strategies.

(3) Facilitate data sharing: displaying the analysis results on a large screen can facilitate multiple decision makers to view and analyze data at the same time, discuss business problems and solutions together, and improve the efficiency of data sharing and collaborative work.

(4) Strengthen brand image: Displaying the analysis results on a large visual screen can improve the brand image and business level of the enterprise, thereby enhancing the competitiveness of the enterprise in the industry.

In summary, for the Hadoop-based analysis of online shopping behavior, using pyecharts to build a large visual screen improves data visualization and analysis efficiency, facilitates data sharing, and strengthens the brand image; it is an important means of raising the efficiency of data analysis and decision making.

Figure 25 Visual large screen 1

Figure 26 Large-screen visualization 2

Figure 27 Large screen visualization 3

Finally, data is statically written in HTML for large-screen visualization, and the results of building a Hadoop-based large-screen visualization are as follows:

Figure 28 Large-screen visualization

3 Summary and Analysis

3.1 The innovation of this research

The innovations of this study are mainly reflected in the following aspects:

(1) Comprehensive application of various big data analysis technologies: This research uses Hadoop, Flume, Hive, Sqoop and other big data analysis technologies to collect, store, analyze and visualize Taobao user behavior data. Compared with traditional data analysis methods, this study comprehensively applied a variety of technologies, fully utilized the advantages of high concurrency, high fault tolerance, and high performance of the big data platform, and improved the efficiency and accuracy of data analysis.

(2) In-depth analysis using multi-dimensional indicators: this study analyzes Taobao user behavior data along multiple dimensions, including time, geographic location, and commodity category, digging into the patterns and trends in the data from several perspectives. This helps e-commerce platforms better understand user needs and behavioral characteristics and provides actionable input for e-commerce sales decisions.

(3) Using pyecharts for visual display: this study uses Python's pyecharts visualization library to present the analysis results in multi-dimensional charts, making the data more intuitive and easier to understand and analyze. At the same time, the interactive visualization features of pyecharts, combined with the front-end large-screen display, help decision makers grasp the analysis results and adjust business strategies and marketing activities in a timely manner.

(4) Integration of data storage, analysis, and display: in this study, the analysis result tables are stored in the Hive database and then automatically exported to the relational database MySQL using the Sqoop component, integrating data storage with analysis and display and making it convenient for decision makers to view and analyze the data.

To sum up, this study comprehensively applies a variety of big data analysis technologies, conducts in-depth analysis with multi-dimensional indicators, uses pyecharts for visual display, and integrates data storage with analysis and display. It is both innovative and practical, and has important reference value for improving the efficiency and level of e-commerce sales.

3.2 Inadequacies of this study

This study conducts big data analysis research based on the open source Taobao user behavior data; by using the Hadoop big data analysis platform and the pyecharts visualization library for analysis and display, it provides feasible decision support for e-commerce sales. However, this study still has some shortcomings:

(1) The data time range is limited: the data selected in this study covers only December 1–18, 2021, so the research results may not be comprehensive or representative. In the future, the data sources and time range can be expanded to obtain more comprehensive data characteristics and patterns.

(2) Lack of practical application verification: the analysis results of this study are inferences and predictions based on historical data only, without verification in practice. In the future, the research results can be validated against actual e-commerce sales to increase the credibility and practical applicability of the research.

(3) Limited visual display: This study uses the pyecharts visualization library for results display, but the type and number of result charts are limited, which may not fully meet the needs of decision makers. In the future, other data visualization tools can be further studied and applied to meet different visualization needs.

(4) Insufficient data quality control: during data processing and analysis, this study lacks strict control and cleaning of data quality, so data errors and anomalies may exist. In the future, the data quality control process can be further refined to increase the accuracy and reliability of data processing and analysis.

(5) System performance bottlenecks: this study uses multiple components, including Hadoop, Flume, Hive, and Sqoop, in its processing and analysis pipeline, and system performance may be limited by bottlenecks among them. In the future, the system architecture can be further optimized and component configurations tuned to improve system performance and stability.

To sum up, this study still has deficiencies in data range, practical application verification, visual display, data quality control, and system performance. In the future, the relevant processes and methods can be further improved and optimized to increase the credibility and practical applicability of the research results.

3.3 Summary

This study carries out big data analysis research based on the open source Taobao user behavior data, performing multi-dimensional analysis of the December 1–18, 2021 user behavior data through the Hadoop big data analysis platform to provide a decision-making basis for e-commerce sales.

First, we load the data into the Hive database through the Flume component of Hadoop, then conduct statistical analysis of e-commerce indicators such as PV, UV, bounce rate, and repurchase rate, and perform multi-dimensional perspective analysis of user behavior, activity, and other indicators along the time dimension. At the same time, we also conduct statistical analysis of hot-selling IDs, hot-selling commodity categories, and user geographic locations, providing comprehensive data insights for the e-commerce platform.
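Among the indicators above, the repurchase rate is the share of purchasing users who bought more than once. The Python sketch below illustrates the definition on a few hypothetical purchase records; in the project the equivalent aggregation runs in Hive over the "buy" rows of the behavior data.

```python
from collections import Counter

# Hypothetical purchase records (user_id, item_id); in the real project
# these correspond to the "buy" behavior rows stored in Hive.
purchases = [
    ("u1", "i1"), ("u1", "i7"),  # u1 bought twice  -> repurchaser
    ("u2", "i3"),                # u2 bought once
    ("u3", "i2"), ("u3", "i2"),  # u3 bought twice  -> repurchaser
]

# Count purchases per user, then take the ratio of repeat buyers to all buyers.
buy_counts = Counter(user for user, _item in purchases)
buyers = len(buy_counts)                                    # users with >= 1 purchase
repurchasers = sum(1 for c in buy_counts.values() if c > 1) # users with >= 2 purchases
repurchase_rate = repurchasers / buyers
print(f"repurchase rate = {repurchase_rate:.2f}")  # 2 of 3 buyers -> 0.67
```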

Secondly, we store the analysis result tables in the Hive database, then use the Sqoop component to automatically export them to the relational database MySQL, which is convenient for data storage, analysis, and display. Afterwards, we use Python's pyecharts visualization library for front-end visualization: by querying the data in MySQL, we draw multi-dimensional visualization charts that are easy to understand and display.

Finally, we combine the Page method in pyecharts to design an interactive front-end large-screen display of these visualizations, and write static data into the HTML large-screen page to build an eye-catching visual dashboard. Through the analysis results displayed in rich charts, decision makers can more intuitively understand the patterns and trends of user behavior, formulate better business strategies and marketing activities, and improve user conversion rate and satisfaction.
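The "write static data into HTML" step can be sketched as embedding the exported result rows directly into the page at build time, so the large screen needs no live database connection. This is a minimal illustration only: the sample rows, template, and variable names are hypothetical, and the real page additionally includes the pyecharts/ECharts chart scripts.

```python
import json

# Hypothetical analysis results; in the real project these rows come from
# the result tables that Sqoop exported into MySQL.
daily_pv_uv = [
    {"date": "2021-12-01", "pv": 43210, "uv": 9876},
    {"date": "2021-12-02", "pv": 45102, "uv": 10231},
]

# A minimal HTML shell for the large screen. The data is written statically
# into the page instead of being fetched from a backend at runtime.
template = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Shopping Behavior Dashboard</title></head>
<body>
<script>
var DASHBOARD_DATA = __DATA__;
</script>
</body>
</html>"""

html = template.replace("__DATA__", json.dumps(daily_pv_uv))
with open("dashboard.html", "w", encoding="utf-8") as f:
    f.write(html)
```

Because the data is frozen into the file, the dashboard can be opened as a plain static page, at the cost of needing to be regenerated whenever the analysis results change.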

The results of this study show that a Hadoop-based big data analysis platform can conduct comprehensive, multi-dimensional analysis of e-commerce user behavior data, and that building a large-screen visual display with pyecharts provides e-commerce platforms with comprehensive data insights and a basis for decision-making. This analysis and display approach not only improves the visualization effect and analysis efficiency of the data, but also facilitates data sharing and collaborative work; it represents the future direction of data analysis in the e-commerce industry.

Origin blog.csdn.net/weixin_47723732/article/details/131424982