Hadoop-based MapReduce big data analysis of website logs (including MapReduce preprocessing program, HDFS, Flume, Sqoop, Hive, MySQL, HBase components, and ECharts)

If you need this project, you can private message the blogger!

This project includes: PPT, visualization code, project source code, a supporting Hadoop environment (decompress to use), shell scripts, MapReduce code, documentation and related tutorials, and a large dataset!

This paper introduces a method for big data analysis of website logs based on Hadoop. The project first uploads the website logs to the HDFS distributed file system and then uses MapReduce for data preprocessing. Hive is then used for big data analysis, enabling statistical analysis of important indicators such as website PV, independent IPs, the number of user registrations, and the number of bounced users. Finally, Sqoop is used to export the analysis results to a MySQL database, and Python is used to build a visualization interface so that users can understand the analysis results more intuitively.

By using the Hadoop distributed computing framework, this project can efficiently process a large amount of website log data. Using MapReduce for preprocessing effectively reduces the amount of data and performs preliminary data cleaning and filtering. When Hive is used for big data analysis, the required data can be obtained quickly by writing complex SQL queries, and in-depth statistical analysis can then be performed on the results.

Through this project, we can quickly and accurately obtain the key indicator data of the website, help companies better understand user behavior, optimize website operation strategies, and improve user experience. At the same time, the data export and visualization functions of this project also provide users with a more convenient and intuitive data display method, making the data analysis results easier to understand and use.

omitted here...

1.1 Research Background

With the development of Internet technology, more and more enterprises have transferred their business online. The website is an important platform for enterprises to display their brand image and provide products or services, and the website log is an important data source for recording website activities.

omitted here...

1.2 Research purpose

This paper aims to discuss the research purpose of big data analysis of website logs based on Hadoop. With the popularity of the Internet, website traffic is increasing day by day, and a large amount of website log data is generated. These data contain a lot of information, which can help website managers understand user behavior and needs, and provide valuable reference for website optimization and improvement.

omitted here...

1.3 Research Significance

This paper aims to discuss the research significance of big data analysis of website logs based on Hadoop. With the continuous development of the Internet, more and more website log data are generated, which contain a large amount of information and can provide valuable references for website optimization and improvement. Therefore, the significance of this study lies in:

omitted here...

1.4 Analysis of Research Status at Home and Abroad

With the advent of the era of big data, more and more companies have begun to pay attention to how to use big data for website log analysis in order to obtain business value from it. As a distributed computing framework, Hadoop can be used to process and analyze large-scale data. This paper will analyze the domestic and foreign research status of big data analysis of website logs based on Hadoop.

1. Domestic research status:

omitted here...

2. Research status abroad:

omitted here...

2 Research process

2.1 Overall research route

This paper uses Hadoop to perform offline analysis of a large website log dataset. First, a Hadoop distributed system must be built and the various components required by this research installed. After deploying Hadoop, the log data is uploaded to the HDFS distributed file system; then, following the MapReduce idea, Map and Reduce scripts are written in Python to clean the raw data.

After the website log data has been cleaned into structured data, it is saved back to HDFS. Tables are then created in Hive and the data is imported, and the big data analysis component Hive is used to perform statistical analysis and mine commonly used business indicators. Next, the Sqoop component in the Hadoop ecosystem is used to import the analysis result tables from Hive into MySQL (the results can also be stored in HBase). Finally, the statistics are visualized with Python's Pyecharts library, and the business indicators are displayed on a web page.

Figure 1.1 Overall Research Roadmap

As shown in the figure below, through this series of operations and processes, the results of the big data analysis can be presented to decision makers.

Figure 1.2 Technology development flow chart

2.2 Build Hadoop environment system

In this study, a Hadoop pseudo-distributed system is built for the big data analysis. Doing so makes it possible to learn the basic principles and architecture of Hadoop and gain a deeper understanding of its operating mechanism. Simulating a multi-node distributed environment on a single machine also makes it easier to test and develop distributed applications, while making full use of the available computing resources and improving data-processing efficiency.

2.2.1 Hadoop deployment and installation of various components

Since deploying Hadoop and installing the various components is relatively cumbersome, detailed installation and deployment instructions are not given here. The components were installed in the early stage of this research, and the results are shown below:

Figure 2.1 Hadoop installation display

Figure 2.2 Hadoop cluster startup and hive installation display

The characteristics and explanations of various nodes in Hadoop are as follows:

(1) NameNode: the master server in Hadoop; it manages the file system namespace and access to the files stored in the cluster.

(2) Secondary NameNode: an auxiliary background process that periodically checkpoints HDFS metadata by merging the namespace image with the edit log; despite its name, it is not a hot standby for the NameNode.

(3) DataNode: responsible for managing the storage attached to its node (a cluster can contain many such nodes). Each node that stores data runs a DataNode daemon.

(4) NodeManager: the per-node agent in YARN. It manages a single computing node in the Hadoop cluster, including maintaining communication with the ResourceManager, supervising the life cycle of Containers, monitoring each Container's resource usage (memory, CPU, etc.), tracking node health, and managing the logs and auxiliary services used by different applications.

(5) ResourceManager: in YARN, the ResourceManager is responsible for the unified management and allocation of all resources in the cluster. It receives resource report information from each node's NodeManager and allocates resources to each application (more precisely, to each application's ApplicationMaster) according to a certain policy. The RM works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).

Figure 2.3 MySQL and Sqoop installation display

Hadoop is a distributed computing framework that can store and process large-scale data sets. Sqoop and MySQL are two components commonly used alongside the Hadoop ecosystem.

Sqoop is a tool for transferring data between relational databases and the Hadoop ecosystem. It supports a variety of relational databases (such as MySQL, Oracle, and PostgreSQL) and can move data from relational databases into Hadoop storage systems such as HDFS, Hive, and HBase. Sqoop also supports incremental import and export as well as custom import queries.

MySQL is an open-source relational database management system that is widely used in web application development. In the Hadoop ecosystem, MySQL is commonly used to store metadata and other information related to Hadoop data. Data in MySQL can be processed by Hadoop MapReduce jobs, and MySQL is often used together with Sqoop to move relational data into the Hadoop ecosystem.

This completes the installation and deployment of the basic components required for this research, laying a good environmental foundation for the subsequent research process.

2.3 Dataset Introduction

The log data for this research comes from a technology learning forum in China. The forum is run by a training institution and brings together many technology learners, who post and reply every day. The open-source dataset contains the website logs for two days, 2013-05-30 and 2013-05-31. Each record consists of five parts: visitor IP, access time, accessed resource, access status (HTTP status code), and the traffic of the visit.

Figure 3.1 Log data display

The data fields are irregular, so before structured statistical analysis with Hive on Hadoop the data needs further preprocessing. The data volume is large: the two days of log files total about 200 MB, with roughly 550,000 records on the 30th and about 1.4 million on the 31st, around 2 million records in total. From a big data perspective this is sufficient for a simulated big data analysis, and traditional analysis software cannot process it efficiently.

MapReduce scripts are written in Python to process and clean the log data stream and resolve these irregularities.

2.4 MapReduce data preprocessing

2.4.1 Introduction to the Principle of MapReduce

MapReduce is a distributed computing model proposed by Google in 2004. It achieves efficient large-scale data processing by decomposing large data sets into small blocks and processing them in parallel on a distributed computing cluster. The core idea of the MapReduce model is to divide the data into small blocks and to split the computation into two stages, namely "mapping" and "reduction".

omitted here...
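To make the two stages concrete, here is a tiny, self-contained Python illustration (not Hadoop code) that counts words with an explicit map, shuffle, and reduce step; the input lines are made up for the example.

from itertools import groupby

lines = ["a b a", "b c"]

# "mapping": each input line is turned into (key, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: the pairs are sorted and grouped by key
mapped.sort(key=lambda pair: pair[0])

# "reduction": the values of each key are aggregated
reduced = {key: sum(count for _, count in group)
           for key, group in groupby(mapped, key=lambda pair: pair[0])}
print(reduced)   # {'a': 2, 'b': 2, 'c': 1}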

Figure 4.1 MapReduce programming model diagram

The main features of MapReduce include the following:

omitted here...

MapReduce is an efficient, stable, and scalable distributed computing model that has been widely used in various big data processing scenarios.

Figure 4.2 mapper.py display

The idea of the above code is to parse each line in the log file.

omitted here...
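Since the mapper code itself appears only as a screenshot, the following is a minimal sketch of what such a Hadoop Streaming mapper might look like. The regular expression, the filtering rules, and keeping only the ip, time, and url fields are assumptions based on the log format and the Hive table whw described in this article; this is not the project's original code.

#!/usr/bin/env python
# mapper.py -- a minimal sketch, not the project's original code.
# Assumed field layout per log line: ip, access time, resource, HTTP status, traffic.
import re
import sys

LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)[^"]*" (\d{3}) (\S+)')

for line in sys.stdin:
    match = LOG_PATTERN.match(line.strip())
    if not match:
        continue                          # drop malformed records
    ip, atime, url, status, traffic = match.groups()
    if not status.startswith('2'):        # keep only successful requests (assumption)
        continue
    if re.search(r'\.(gif|jpg|png|css|js|ico)$', url):
        continue                          # skip static resources (assumption)
    # emit the three fields kept by the Hive table whw(ip, atime, url),
    # comma-separated to match FIELDS TERMINATED BY ','
    print(','.join([ip, atime, url]))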

Figure 4.3 reducer.py display

This code is a Python implementation of Reducer in Hadoop.

omitted here...
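The reducer is likewise shown only as a screenshot. Below is a minimal sketch of one plausible implementation; treating the reducer as a simple de-duplication pass over the sorted mapper output is an assumption, and the project's real cleaning logic may differ.

#!/usr/bin/env python
# reducer.py -- a minimal sketch, not the project's original code.
# Hadoop Streaming delivers the mapper output sorted by key, so adjacent
# duplicate records can simply be collapsed here (an assumed role for the reducer).
import sys

previous = None
for line in sys.stdin:
    record = line.strip()
    if record and record != previous:
        print(record)
    previous = record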

Figure 4.4 Data preprocessing results

The processed data is used for the subsequent big data analysis after the corresponding shell script is executed.

omitted here...

Figure 4.5 MapReduce execution shell script display

Finally, the script file is executed using the source command or ./.

Figure 4.6 Display of MapReduce execution results

2.4 Hadoop basic components and their introduction

2.4.1 Basic concepts of Hive

omitted here...

2.4.2 Basic concepts of HDFS

omitted here...

2.4.3 Basic concepts of Sqoop

omitted here...

2.4.4 Basic concepts of MySQL

omitted here...

2.5 Create database table and import

Create a Hive database table based on the structure of the result file, and make it a partitioned table. First, put the cleaned file into the folder we set, then create an external partitioned table in Hive. The general form is: CREATE EXTERNAL TABLE table_name (field field_type, ...) PARTITIONED BY (partition_field field_type) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'separator' LOCATION 'ancestor folder of the data path' (the location is the parent folder, not the folder that directly stores the data).

The table creation statement is as follows:

CREATE EXTERNAL TABLE whw(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hadoop/data';

2.5.1 The concept of partition and bucket

When creating data tables in Hive, partitioned tables are often chosen in order to improve query efficiency and reduce query costs. A partitioned table partitions data according to a certain column and stores each group in a different folder or directory, enabling more efficient data query and processing.

The concept of partitioning is to group data according to the value of a certain column, and store data in different folders or directories to improve query efficiency. In Hive, common partition fields include date, time, region, city, gender, etc. For example, if the sales data is partitioned by date, the sales data of each day can be stored in different directories, so as to quickly query the sales status of each day.

In addition to partitioning, Hive also provides another data organization method, that is, bucketing. Bucketing is to group data according to the hash value of a certain column, and store data with the same hash value in the same file to achieve more efficient data query and processing. Compared with partitioning, bucketing is more suitable for scenarios where the amount of data is large and the data distribution is relatively uniform.

The advantage of partitioning and bucketing is that it can improve the efficiency of data query and processing and reduce query costs. By grouping and storing data according to a certain column, the amount of data that needs to be scanned during query can be reduced and the query speed can be improved. In addition, partitioning and bucketing can also be used to optimize data storage and compression, reducing storage and transmission costs. Partitions and buckets can be selected according to the characteristics of actual data to achieve more efficient data query and processing.

In this research, we are partitioning by date, which can ultimately improve our query efficiency.

2.5.2 Import of Partition Dataset

Create a partition statement:

ALTER TABLE table_name ADD PARTITION (partition_field = 'partition label') LOCATION 'data path (the parent folder of the data files)';

ALTER TABLE whw ADD PARTITION(logdate='2013_05_30') LOCATION '/user/hadoop/data/datas';

Figure 5.1 Display of execution results of partition table import

Figure 5.2 Query data import result display

Following the above ideas and steps, the data for both days is imported into Hive, after which the required data can be queried through Hive.

2.6 Hive Statistical Analysis

Hive is used to perform data analysis and statistics on the result table. Before that, we need to understand these web page indicators, clarify their meaning and significance, and propose corresponding measures for the optimization and construction of the website.

2.6.1 Introduction and statistics of PV indicators

PV (Page View) refers to the page views of the website, that is, the sum of the number of times all pages on the website are visited. In website analysis, PV is one of the most basic indicators, used to measure website traffic and audience size.

In the website scenario, the meaning of PV refers to the number of pages a user visits the website, and each page opened is counted as a PV. For example, when a user visits a certain website, he browses multiple pages such as the homepage, article list, article details, etc., and the total number of views of these pages is the PV.

 

Figure 6.1 PV index query statistics
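The query itself is executed in the hive CLI, as shown in Figure 6.1. Purely as an illustration of the idea, a daily PV count over the whw table could also be issued from Python through the PyHive library (using PyHive, the connection parameters, and the exact query text are assumptions, not the project's original code):

from pyhive import hive   # assumption: HiveServer2 is running on the default port

conn = hive.Connection(host='localhost', port=10000, username='hadoop')
cursor = conn.cursor()
# PV for one day = number of records in that day's partition of the whw table
cursor.execute("SELECT COUNT(1) FROM whw WHERE logdate = '2013_05_30'")
print(cursor.fetchone()[0])
conn.close()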

2.6.2 Introduction and Statistics of Indexes of Registered Users

omitted here...

Figure 6.2 Index query statistics of registered users

2.6.3 Introduction and Statistics of Independent IP Number Indicators

omitted here...

Figure 6.3 Independent IP number index query statistics

2.6.4 Introduction and statistics of the number of bounced users

The number of bounced users refers to the number of users who leave the website without continuing to visit other pages after visiting a certain page of the website. This metric is often used to measure the user experience and attractiveness of a website. If the number of bounced users is too high, it means that users are not interested in or satisfied with the content or experience of the website and need to be optimized.

omitted here...
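As a toy illustration of the definition (in the project this indicator is computed in Hive), the snippet below counts bounced users as visitors with exactly one cleaned page-view record; the sample records are made up:

from collections import Counter

# made-up cleaned records for one day: (ip, access time, url)
records = [
    ("1.1.1.1", "t1", "/index.php"),
    ("1.1.1.1", "t2", "/thread-1.html"),
    ("2.2.2.2", "t3", "/index.php"),     # this visitor viewed only one page
    ("3.3.3.3", "t4", "/forum.php"),     # so did this one
]

views_per_ip = Counter(ip for ip, _, _ in records)
jumpers = sum(1 for count in views_per_ip.values() if count == 1)
print(jumpers)   # 2 bounced users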

Figure 6.4 Query Statistics of Number of Bounced Users

2.6.5 Data Table Summary

Inner join means querying the intersection of two tables; with the ON condition set to 1=1, the join condition always holds. Here, all the query results are combined into one data table.

Figure 6.5 Data table summary operation display

2.7 Data export and data display

2.7.1 Create tables in MySQL

Log in with mysql -u root -p (a password needs to be entered and is not displayed). A database must be selected with USE before creating the table. The creation command is as follows:

create table whw_logs_stat(logdate varchar(10) primary key,pv int,reguser int,ip int,jumper int);

Figure 7.1 Creation of mysql data table

2.7.2 Sqoop imports hive table into mysql

Use Sqoop to import the result table from Hive into MySQL. The command has the form: sqoop export --connect jdbc:mysql://localhost:3306/database --username root -P --table <MySQL table name> --export-dir <HDFS location of the Hive result table> -m 1 --input-fields-terminated-by '\001'

Note that a new terminal window needs to be opened here, and then this command is used to import the Hive data table into MySQL. We also need to know in advance where the Hive data table is stored, that is, its location in HDFS.

Figure 7.2 The location of the hive data table

Figure 7.3 Display of sqoop execution results

Finally, we entered the mysql terminal interface to check that the data has been imported successfully.

Figure 7.4 MySQL final table display

2.7.3 Data Visualization

Using data visualization tools to convert data into visual forms such as charts, tables, and maps can make data more intuitive, easy to understand and analyze, and avoid the difficulties caused by relying only on numbers and words.

omitted here...

data visualization

The full visualization code is omitted; please private message the blogger!
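As a rough indication of what this step could look like, the following is a minimal sketch that reads the summary table created above from MySQL with pymysql and renders a bar chart with Pyecharts; the connection parameters and database name are placeholders, and this is not the project's original visualization code.

# visualize.py -- a minimal sketch, not the project's original code
import pymysql
from pyecharts.charts import Bar

# placeholder connection parameters; the database name is hypothetical
conn = pymysql.connect(host='localhost', user='root', password='123456',
                       database='logdb', charset='utf8')
cursor = conn.cursor()
cursor.execute("SELECT logdate, pv, reguser, ip, jumper FROM whw_logs_stat")
rows = cursor.fetchall()
conn.close()

dates = [row[0] for row in rows]
bar = (
    Bar()
    .add_xaxis(dates)
    .add_yaxis("PV", [row[1] for row in rows])
    .add_yaxis("Registered users", [row[2] for row in rows])
    .add_yaxis("Independent IPs", [row[3] for row in rows])
    .add_yaxis("Bounced users", [row[4] for row in rows])
)
bar.render("log_stat.html")   # generates an HTML page with the chart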

3 Summary and Analysis

3.1 The innovation of this research

omitted here...

3.2 What needs to be improved in this study

omitted here...

4 Conclusion

Based on the Hadoop platform, this project preprocesses the website log data with MapReduce and uses Hive for big data analysis, realizing statistical analysis of the website's PV, independent IPs, number of registered users, and number of bounced users. Finally, the statistical results are exported to the MySQL database through Sqoop, and Python is used to build a visualization platform to display the data analysis results.

omitted here...

