Review of e-commerce data warehouse system for big data projects

1. Training topics

E-commerce data warehouse system for big data projects

2. Purpose of training

Complete an e-commerce data warehouse project:

1. Set up the supporting environments: Hadoop, Kafka, Flume, MySQL, and ZooKeeper.
2. Install a Hive data warehouse environment compatible with Spark, and store the metastore in MySQL so that it can be shared with other clients (a configuration sketch follows this list).
3. Simulate message input on the Kafka producer side, verify that the messages are received normally on the Kafka consumer side, and enable Kafka monitoring.
4. Create a gmall database in the Hive data warehouse, use the Sqoop tool to import the MySQL data into HDFS on the Hadoop cluster, load it from HDFS into the gmall database in Hive, and finally move the data in gmall layer by layer from the ODS layer up to the ADS layer.
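As an illustration of point 2, here is a minimal hive-site.xml sketch for keeping the metastore in MySQL and sharing it through a metastore service; the host name, database name, user, and password below are placeholders chosen for this write-up, not values taken from the project.

<?xml version="1.0"?>
<configuration>
    <!-- Hive metastore stored in a MySQL database (host/db/user/password are placeholders) -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop105:3306/metastore?useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>your_password</value>
    </property>
    <!-- Expose the metastore service so other clients can share the metadata -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop105:9083</value>
    </property>
</configuration>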

3. Operating environment

1. Linux system: Centos 7.5
2. Hive (on Spark) version: apache-hive-3.1.2
3. Java version: 1.8.0_212
4. Kafka version: kafka_2.11-2.4.1
5. Flume version: apache-flume-1.9.0
6. Sqoop version: sqoop-1.4.6
7. ZooKeeper version: apache-zookeeper-3.5.7
8. MySQL version: mysql-5.7.28
9. Spark version: spark-3.0.0

Related technologies:

Hive: a Hadoop-based data warehouse tool that maps structured data files to database tables and converts SQL statements into MapReduce tasks for execution, making simple MapReduce statistics quick to implement.
Kafka: a high-throughput distributed publish-subscribe messaging system.
Flume: a distributed system for collecting, aggregating, and transporting massive volumes of log data; the data senders in the logging system can be customized for data collection.
Sqoop: a tool for transferring data between Hadoop/Hive and MySQL; it can import data from MySQL into HDFS and also load data from HDFS into the Hive database.
ZooKeeper: a reliable coordination service for large distributed systems, providing configuration maintenance, naming, distributed synchronization, group services, and so on.
Spark: a very popular open-source in-memory big data computing framework that can run computations over data stored on Hadoop.

4. Training process (training content and main modules)

1. Build the Hadoop cluster environment: quickly rebuild it on the basis of the Hadoop cluster configured in the earlier big data practice course.
2. Install a Hive data warehouse environment compatible with Spark.
3. Store Hive's metastore in MySQL so that the metadata can be shared with other clients.
4. Set up the Kafka, Flume, and ZooKeeper cluster environments.
5. Use Sqoop to import the MySQL data into HDFS, load it into the gmall database in Hive, and then move the data layer by layer from the ODS layer up to the ADS layer (a sketch of such a Sqoop import follows this list).
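As an illustration of step 5, below is a minimal sketch of the kind of Sqoop command involved; the connection string, table name, and target directory are placeholders, not the project's actual import script (the real scripts are linked in the custom-scripts section).

# Import one MySQL table into HDFS; host, database, table, and paths are placeholders
sqoop import \
  --connect jdbc:mysql://hadoop105:3306/gmall \
  --username root \
  --password your_password \
  --table order_info \
  --target-dir /origin_data/gmall/db/order_info/2020-06-14 \
  --delete-target-dir \
  --num-mappers 1 \
  --fields-terminated-by '\t'

The files written to HDFS are then loaded into the corresponding ODS table of the gmall database (for example with a LOAD DATA INPATH statement) before being processed upward through the DWD, DWS, DWT, and ADS layers.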

5. Course knowledge points used in practical training

  1. When the data warehouse is built, the data is compressed with LZO to reduce disk storage space; for example, 100 GB of data can be compressed to less than 10 GB.

  2. When the data warehouse is built, the data is stored in Parquet format, which is splittable, so no extra index needs to be created on the data. If the data were simply stored as plain text, the splittable lzop compression codec would have to be used and an index created.

  3. The time dimension table dwd_dim_date_info uses columnar storage + LZO compression. Importing date_info.txt directly into the target table does not convert the data to columnar storage + LZO compression. Instead, an ordinary temporary table dwd_dim_date_info_tmp must be created, date_info.txt is loaded into that temporary table, and the data is finally inserted into the target table by querying the temporary table (see the HiveQL sketch after this list).

  4. Scripts are used to quickly start the related service processes, import and export data, and so on. In these scripts, nohup means the command is not hung up, so it keeps running even after the terminal closes; /dev/null is a file in the Linux file system known as the black hole, and everything written to it is discarded; 2>&1 redirects standard error to standard output; & at the end of a command runs it in the background (see the shell sketch after this list).

  5. “select * from <table>” does not launch a MapReduce job; by default it uses the DeprecatedLzoTextInputFormat specified in the ods_log table creation statement, which recognizes lzo.index as an index file.

  6. “select count(*) from <table>” does launch a MapReduce job; the default CombineHiveInputFormat does not recognize lzo.index as an index file and treats it as an ordinary data file. Worse, this makes the LZO files unsplittable. Changing CombineHiveInputFormat to HiveInputFormat fixes the problem (see the setting after this list).
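For point 3, a minimal HiveQL sketch of the temporary-table pattern; the column list is deliberately reduced to three columns on both sides and the file path is a placeholder, so this is not the project's exact DDL.

-- plain-text temporary table that date_info.txt can be loaded into directly
CREATE TABLE IF NOT EXISTS dwd_dim_date_info_tmp (
    date_id string,
    week_id string,
    month   string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/opt/module/data/date_info.txt'
INTO TABLE dwd_dim_date_info_tmp;

-- inserting via a query converts the data into the target table's
-- columnar (parquet) + LZO layout
INSERT OVERWRITE TABLE dwd_dim_date_info
SELECT * FROM dwd_dim_date_info_tmp;

For point 4, the background-service pattern used by the scripts looks roughly like this (app.jar is a placeholder):

# keep the service running after the terminal closes and discard all of its output
nohup java -jar app.jar >/dev/null 2>&1 &

For point 6, the input format can be switched inside the Hive session before running the aggregation:

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(*) from ods_log;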

6. Problems encountered in training and solutions

Problems encountered:

1) Re-formatting the NameNode makes it impossible to start the DataNode process

Solution: formatting the NameNode generates a new cluster ID, so the cluster IDs of the NameNode and the DataNodes no longer match and the cluster cannot find its previous data. Either delete the data and logs directories on all machines and then format again (see the sketch below), or open the file that holds the NameNode's clusterID and copy that clusterID into the corresponding DataNode file.
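A minimal sketch of the first option, assuming the cluster is installed under /opt/module/hadoop-3.1.3 (the path is a placeholder for this cluster's layout):

# run on every node after stopping the cluster: remove the old data and logs
rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs

# then format the NameNode once, on the NameNode host only
hdfs namenode -format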

2) After Xshell connects to the virtual machine, numbers typed on the numeric keypad are ignored when entering commands

Solution: in Xshell, open the “Default Properties” dialog, select the “VT Modes” category, set the initial numeric keypad mode on the right to “Normal”, and click “OK”.

3) When multiple queues are configured, the queue used for loading data does not have enough capacity

(screenshot)

Solution: edit capacity-scheduler.xml in the Hadoop configuration directory and increase the capacity of the affected queue, as sketched below.

(screenshot)
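A hedged sketch of the kind of change involved; the queue name hive and the percentages are assumptions about the cluster layout, not values from this report. Note that the capacities of all queues under root must add up to 100.

<!-- capacity-scheduler.xml: give the hive queue a larger share of the cluster -->
<property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>50</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.hive.capacity</name>
    <value>50</value>
</property>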

4) When a MapReduce task runs, it exceeds the virtual memory limit and the process is killed

(screenshot)

Solution: increase yarn.nodemanager.vmem-pmem-ratio appropriately, so that more virtual memory is allowed for the given physical memory, as sketched below.

(screenshot)
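A hedged yarn-site.xml sketch; the value 4 is only an illustrative assumption (the Hadoop default ratio is 2.1).

<!-- yarn-site.xml: allow more virtual memory per MB of physical memory -->
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
</property>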

5) The Hive version is not compatible with the Spark version, so jar packages cannot be uploaded and data cannot be imported into HDFS

(screenshot)

Solution: use a Hive build that has been compiled to be compatible with this Spark version (i.e. the Hive on Spark build) and re-run the task to test it. A successful run looks like the figure below:

(screenshot)

7. Training experience and takeaways

  1. Through the two-week Big Data Project 5, I learned to use three virtual machines as servers to build the Hadoop, Kafka, Flume, MySQL, and ZooKeeper environments, to use the Sqoop tool to import MySQL data into HDFS on the Hadoop cluster and then load it into the gmall database in Hive, and to create and use scripts to move the data layer by layer from the ODS layer up to the ADS layer.

  2. When the environment is built correctly, the following processes should be present:

On the hadoop105 virtual machine:
RunJar, RunJar, QuorumPeerMain, Kafka, NameNode, DataNode, NodeManager, Application, JobHistoryServer;

On the hadoop106 virtual machine:
Application, QuorumPeerMain, Kafka, DataNode, ResourceManager, NodeManager;

On the hadoop107 virtual machine:
QuorumPeerMain, Kafka, Application, SecondaryNameNode, DataNode, NodeManager.

Besides the processes of the original Hadoop cluster: starting the metastore and starting hiveserver2 each account for one RunJar process on hadoop105; the log-collection Flume agents on hadoop105 and hadoop106 each account for one Application process, and the Flume consumer on hadoop107 accounts for another Application process; starting Kafka and ZooKeeper accounts for one Kafka process and one QuorumPeerMain process on each of the three virtual machines.

  3. After the Spark-compatible Hive environment is installed, Hive both stores the metadata and handles SQL parsing and optimization; the syntax is still HQL, but the execution engine becomes Spark, which carries out the execution using RDDs. MySQL can be used to store Hive's metastore so that the metadata can be shared with other clients.

  4. When the data warehouse is built, the data is compressed with LZO to reduce disk storage space; for example, 100 GB of data can be compressed to less than 10 GB. The data is stored in Parquet format, which is splittable, so no extra index needs to be created; if the data were simply stored as plain text, the splittable lzop codec plus an index would be needed.

  5. For database tables that use columnar storage + LZO compression, importing a txt file directly into the target table does not convert the data to columnar storage + LZO compression. An ordinary temporary table must be created, the txt file is loaded into that temporary table, and the data is finally inserted into the target table by querying the temporary table.

  6. Some students recorded the project operation process as videos and embedded them in their PPT for the demonstration, which reduces the number of slides needed and makes the presentation more vivid and dynamic; this is also a good choice.

8. Custom scripts

Custom scripts are stored in the bin directory of the custom user moyufeng. The files shown in green are scripts that have already been put into effect with source, and they are generally identified by the .sh suffix. For frequently used scripts, the .sh suffix can be removed by renaming with mv (more convenient); for less frequently used scripts, keeping the .sh suffix is recommended.

(screenshots)

Because there are many custom scripts and some of them are quite long, their full contents are collected in a separate article:
Big data e-commerce data warehouse related scripts
https://blog.csdn.net/m0_48170265/article/details/130376063

9. Project start

1. Configure the hosts file and the virtual machines' IP addresses

The following uses a VMnet8 NAT-mode network connection as an example (if a virtual machine in NAT mode cannot reach the Internet, switch it to bridged mode).

Check the local network IP:
Press Win + R, type cmd, and then run:

ipconfig

(screenshot)

Find the network segment (the first three octets) of the VMnet8 IPv4 address and its subnet mask (usually 255.255.255.0).

Here the network segment is: 192.168.123.*
Subnet mask: 255.255.255.0

Configure the virtual network: keep the first three octets of the virtual network's subnet IP the same as the segment above, and use the same subnet mask.

(screenshots)

For configuring the NAT network, see section 3 (“Set up a virtual network”) of https://blog.csdn.net/m0_48170265/article/details/129982752

Configure the Windows local hosts file:

C:\Windows\System32\drivers\etc\hosts

(screenshot)

Make sure the first three octets of each subnet IP address match the network segment the computer is connected to; the fourth octet can be chosen freely. Each subnet IP address corresponds to one custom host name, which lets the local browser resolve the custom host names to the matching IP addresses.

(screenshot)

Similarly, configure the hosts file on the three virtual machines hadoop105, hadoop106, and hadoop107 in turn (an example hosts file follows the screenshot below). If an xsync distribution script has been written, the file can be configured on one virtual machine and then distributed to the other two with that script.

vim /etc/hosts

(screenshot)
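An illustrative set of /etc/hosts entries, assuming the 192.168.123.x segment found above; the final octets 105-107 are assumptions chosen to match the host names.

192.168.123.105 hadoop105
192.168.123.106 hadoop106
192.168.123.107 hadoop107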

Configure the host names of the three virtual machines hadoop105, hadoop106, and hadoop107 in turn (so that other virtual machines can resolve and identify them according to the hosts file):

vim /etc/hostname

(screenshots)

A reboot is required for the change to take effect.

# Check the virtual machine's host name
hostname

# Modify the host name
vi /etc/hostname
# press i, change the name to xxx, then save and quit

# Reboot for the change to take effect
reboot

According to the hosts file, configure the IP addresses of the three virtual machines in turn; none of them can be skipped (an example configuration is sketched after the screenshot below).

# Modify the IP address
vi /etc/sysconfig/network-scripts/ifcfg-ens33

(screenshot)
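A hedged sketch of a static-IP ifcfg-ens33 using the 192.168.123.x segment from above; the gateway and DNS values are assumptions that depend on the actual VMnet8 NAT settings.

TYPE=Ethernet
BOOTPROTO=static          # static address instead of DHCP
NAME=ens33
DEVICE=ens33
ONBOOT=yes                # bring the interface up at boot
IPADDR=192.168.123.105    # this machine's address (hadoop105 here)
NETMASK=255.255.255.0
GATEWAY=192.168.123.2     # assumed NAT gateway
DNS1=192.168.123.2        # assumed DNS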

Check the IP address:

ip addr or ifconfig

If the IP address has been changed, the network service needs to be restarted:

systemctl restart network

2. Start the cluster and related services with custom scripts

2.1 Start the sc script

(screenshot)

At this point all the processes are as follows:

(screenshot)

2.2 Start Kafka monitoring with the ke.sh script

(screenshots)

Default account admin; password 123456

At this point all the processes are as follows:

(screenshots)

3. Storage location of the gmall data warehouse on HDFS

(screenshots)

4. Use DataGrip to connect to the hive database

Use DataGrip to connect to the hive database for easy data query

Preconditions for starting Hive:

1. Make sure HDFS and YARN are started.
2. Make sure MySQL, which stores Hive's metastore, is started (a quick connectivity check is sketched below).
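As an optional sanity check before configuring DataGrip, hiveserver2 can be tested with beeline; the host, the default port 10000, and the user name are assumptions based on a default hiveserver2 setup.

# connect to hiveserver2 (default port 10000); the user name is a placeholder
beeline -u jdbc:hive2://hadoop105:10000 -n moyufeng
# then, inside beeline, run: show databases;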

(screenshot)
Click Test and find that the connection is successful

(screenshots)

If the connection fails, the hiveserver2 service may need to be restarted (hiveserver2 is slow to finish starting and can take two or three minutes); wait until hiveserver2 has started completely before connecting to the Hive database. Enter the following startup command on hadoop105:

cd /opt/module/hive/bin/
nohup hive --service hiveserver2 1>/opt/module/hive/logs/hive.log 2>/opt/module/hive/logs/hive_err.log &

– nohup: placed at the beginning of the command, it keeps the process from being hung up, so it stays running even after the terminal is closed
– 1: stands for standard log output
– 2: stands for error log output
– /opt/module/hive/logs: here /opt/module/hive/ is the directory where Hive was unpacked; if it has no logs directory, create one with mkdir to hold the hive.log and hive_err.log files (the two log files themselves do not need to be created by hand; they are created automatically when the service runs)
– &: runs the command in the background

So the whole command can be read as: run the hiveserver2 service in the background, write the standard log to hive.log and the error log to hive_err.log, and keep it running even after the terminal (window) is closed.

(screenshot)

After entering the log directory with cd /opt/module/hive/logs, you can use tail -300f hive_err.log to follow the last 300 lines of the error log; the number 300 can be changed to 500 or 1000 as needed.

(screenshots)

Similarly, tail -300f hive.log follows the last 300 lines of the standard log output.

5. View the hive database locally on the virtual machine

(screenshot)

View the gmall database:

(screenshot)

10. Program List

Result demonstration:
1. Cluster.sh startup screenshot (full screen, with multiple names)

cluster.sh start

(screenshot)

2. After cluster.sh starts, take a screenshot of jpsall (full screen, with multiple names)

jpsall

(screenshot)
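The project's own jpsall script is in the linked scripts article; as a rough illustration of what such a script does, here is a minimal sketch, assuming passwordless SSH to the three hosts.

#!/bin/bash
# print the Java processes on every node of the cluster
for host in hadoop105 hadoop106 hadoop107
do
    echo "=============== $host ==============="
    ssh "$host" "jps"    # if jps is not found over ssh, use its full path under $JAVA_HOME/bin
done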

3. Results of table creation in gmall data warehouse

show tables;

(screenshots)

4. ODS layer order table data query (5 rows, with name abbreviation)

select * from ods_order_info limit 5;

(screenshot)

Querying the Hive data warehouse from DataGrip:

(screenshot)

7. DWD layer data query (5 rows, with name abbreviation)

7.1 View the region dimension table

select * from dwd_dim_base_province limit 12;

(screenshot)

7.2 View the time dimension table

select * from dwd_dim_date_info limit 15;

(screenshot)

8. DWS layer data query (5 rows, with name abbreviation)

8.1 View daily commodity behavior

select * from dws_sku_action_daycount where dt='2020-06-14' limit 15;

(screenshot)

8.2 View daily regional statistics

select * from dws_area_stats_daycount where dt='2020-06-15' limit 15;

(screenshot)

9. DWT layer data query (5 rows, with name abbreviation)

9.1 View the wide table of commodity topics

select * from dwt_sku_topic limit 15;

(screenshot)

9.2 View the regional theme wide table

select * from dwt_area_topic limit 15;

(screenshot)

10. ADS layer data query (5 rows, with name abbreviation)

10.1 View brand repurchase rate

select * from ads_sale_tm_category1_stat_mn;

(screenshot)
10.2 View regional theme information

select * from ads_area_topic;

(screenshots)

11. Kafka data collection

Produce messages with Kafka:

kafka-console-producer.sh --broker-list hadoop105:9092 --topic topic01

(screenshot)

Consume messages with Kafka:

kafka-console-consumer.sh --bootstrap-server hadoop105:9092 --from-beginning --topic topic01

(screenshot)
Kafka monitoring

First start the related services with ke.sh, then log in at http://hadoop105:8048/ke to view the monitoring information.

(screenshot)
12. Check the ods_log log

Application: use the DataGrip tool to connect to the local Hive database and check that the data in the database tables is consistent.

(screenshot)
