2. Taobao purchase behavior analysis project: Hive queries, Sqoop introduction and use, SQLyog installation and use, Superset overview, installation, and use

1. Top 10 best-selling products


Idea: to find the most popular products in the table, group by the product's item_id, then count how many user_id values appear in each group (the same user can purchase repeatedly, so there is no need to deduplicate), sort by that count in descending order, and take the first 10.

select item_id, count(user_id) sale_num
from to_user_log
group by item_id
order by sale_num desc
limit 10

**Query the current HiveSQL execution progress**

[root@node3 ~]# tail -f nohup.out
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1683334882009_0003, Tracking URL = http://node4:8088/proxy/application_1683334882009_0003/
Kill Command = /opt/hadoop-3.1.3/bin/mapred job  -kill job_1683334882009_0003
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 8
2023-05-06 09:43:30,458 Stage-1 map = 0%,  reduce = 0%
2023-05-06 09:46:00,536 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 11.11 sec
2023-05-06 09:46:07,886 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 399.97 sec
2023-05-06 09:46:11,326 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 402.3 sec

Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
**HiveSQL optimization:** `order by` performs a global sort, so the MapReduce job the HQL is converted into runs with only one reduce task. When the data volume is relatively large, use `order by` with caution: the single reducer may take a very long time to finish, or fail to finish at all, so the SQL needs to be optimized.

select item_id, count(user_id) sale_num
from to_user_log
where user_id is not null
group by item_id
distribute by sale_num -- distribute rows across reducers by sale_num
sort by sale_num desc  -- sort in descending order within each reducer
limit 10;
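One caveat: `sort by` only orders rows within each reducer, so the ten rows above are not guaranteed to be the global top 10. A commonly used pattern (a sketch against the same to_user_log table; whether the inner limit is pushed down into each reducer depends on the Hive version, so verify the plan with EXPLAIN) is to trim per reducer first, then finish with a cheap global order by over the small intermediate result:

```sql
select item_id, sale_num
from (
    select item_id, count(user_id) sale_num
    from to_user_log
    where user_id is not null
    group by item_id
    distribute by item_id
    sort by sale_num desc
    limit 100  -- keep a generous per-stage buffer, then pick 10 globally
) t
order by sale_num desc
limit 10;
```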

However, since our nodes are all on the same machine, the query may still take a while.
Create a result table for the hot products:

create table if not exists tm_hot_sale_product(
    item_id int comment "item id",
    sale_num int comment "sales volume",
    date_day string comment "analysis date"
)
row format delimited
fields terminated by ","
lines terminated by "\n";

Save the result to the Hive table:

-- insert the query result into the table
from to_user_log
insert into tm_hot_sale_product
select item_id, count(user_id) sale_num, '20230509'
where user_id is not null
group by item_id
order by sale_num desc
limit 10;

Next, import the data from the Hive table into a MySQL table.

2. Sqoop overview

Sqoop is a tool for transferring data between relational databases (Oracle, MySQL, SQL Server, etc.) and Hadoop, Hive, HBase, etc. A similar product is DataX (Alibaba's data exchange tool).
Official website: http://sqoop.apache.org/
Version introduction (the two versions are completely incompatible; Sqoop1 is the most widely used):

  1. Sqoop1: 1.4.x
  2. Sqoop2: 1.99.x

The Sqoop architecture is very simple; it is the simplest framework in the Hadoop ecosystem. The Sqoop1 client connects to Hadoop directly, and tasks are parsed and executed as the corresponding MapReduce jobs.

The input and output of the generated MapReduce job are configured through its InputFormat and OutputFormat.

3. Sqoop principle analysis

Import: load data from a relational database into HDFS.

Export: write data from HDFS back to a relational database.

4. Sqoop installation

We use version 1.4.7; the specific download address is:
http://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
We choose to install Sqoop on node3, the server where Hive is installed. The specific installation steps are as follows:
1 Upload: upload the Sqoop installation package to the /opt/apps directory on node3
2 Unzip and rename

[root@node3 apps]# tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /opt/
[root@node3 apps]# cd ../
[root@node3 opt]# mv sqoop-1.4.7.bin__hadoop-2.6.0/ sqoop-1.4.7

3 Configure environment variables

[root@node3 opt]# cd sqoop-1.4.7/
[root@node3 sqoop-1.4.7]# pwd
/opt/sqoop-1.4.7  # copy this path
[root@node3 sqoop-1.4.7]# vim /etc/profile
# Sqoop environment variables
export SQOOP_HOME=/opt/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin
[root@node3 sqoop-1.4.7]# source /etc/profile

4 Check whether the environment variable is valid

[root@node3 ~]# cd
[root@node3 ~]# sqoop version
Warning: /opt/sqoop-1.4.7/../hcatalog does
not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your
HCatalog installation.
Warning: /opt/sqoop-1.4.7/../accumulo does
not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of
your Accumulo installation.
......
INFO sqoop.Sqoop: Running Sqoop version:
1.4.7
Sqoop 1.4.7

5 Silence the Sqoop warnings
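The screenshot for this step is not reproduced here. The warnings printed by `sqoop version` come from checks in bin/configure-sqoop; one way to silence them (a sketch, assuming the install path used above) is to comment out the offending checks:

```shell
# Open the wrapper script that prints the warnings
vim /opt/sqoop-1.4.7/bin/configure-sqoop
# Inside, comment out the if-blocks that test HCAT_HOME and ACCUMULO_HOME
# (the blocks that echo the "does not exist!" warnings), then rerun
# `sqoop version` to confirm the warnings are gone.
```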
6 Configure sqoop-env.sh (usually no modifications are needed)

[root@node3 bin]# cd ../conf
[root@node3 conf]# ls
oraoop-site-template.xml  sqoop-env-template.sh  sqoop-site.xml
sqoop-env-template.cmd    sqoop-site-template.xml
[root@node3 conf]# cp sqoop-env-template.sh sqoop-env.sh
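For reference, sqoop-env.sh is where the component home directories can be declared explicitly when Sqoop cannot infer them from the environment. A sketch (the Hadoop path matches the Kill Command output earlier; the Hive path is an assumption, adjust it to your install):

```shell
# /opt/sqoop-1.4.7/conf/sqoop-env.sh
export HADOOP_COMMON_HOME=/opt/hadoop-3.1.3  # Hadoop path seen earlier in this document
export HADOOP_MAPRED_HOME=/opt/hadoop-3.1.3
export HIVE_HOME=/opt/hive                   # assumption: replace with your actual Hive home
```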

8 sqoop command help

[root@node3 ~]# sqoop help
......
usage: sqoop COMMAND [ARGS]
Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
See 'sqoop help COMMAND' for information on a specific command.


[root@node3 conf]# sqoop help import

9 Add the MySQL driver package mysql-connector-java-5.1.37.jar by uploading it to
node3:/opt/sqoop-1.4.7/lib
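With the driver jar in place, connectivity can be verified with list-databases before running any real job (a sketch; the host and credentials mirror the ones used in the export options file later and are assumptions about your environment):

```shell
sqoop list-databases \
  --connect jdbc:mysql://node1:3306 \
  --username root \
  --password 123456
# On success this prints the databases available on node1.
```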

5. SQLyog step-by-step installation

This software is similar to Navicat.
Check that the mysqld service is running:

[root@node1 ~]# systemctl status mysqld
● mysqld.service - MySQL Server

1 First decompress the software SQLyog.rar to a directory whose path contains no Chinese characters or spaces, for example:
D:\devsoft
2 Enter the decompressed directory and double-click SQLyog.exe
3 Enter the registration code entry interface
4 Find the Key.txt file in the directory where SQLyog.exe is located and open it
5 Copy the user name and registration code into the corresponding input boxes
6 Click the "Register" button; when the registration success dialog appears, the software is ready to use
Fill in the MySQL host name, user name, password, and other connection information, then click the Connect button to save the connection information and connect.

6. Sqoop export data to MySQL

First, connect to MySQL in SQLyog and create a database:

CREATE DATABASE taobao;

Create the table visually in SQLyog.
To export the Hive table to MySQL, refer to the official documentation.
Syntax

$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

Table 27. Common arguments
Table 29. Export control arguments:
Write an options file that exports the Hive table to the MySQL database:

[root@node3 ~]# cat export_tm_hot_sale_product.txt 
export 
--connect
jdbc:mysql://node1:3306/taobao
--username
root
--password
123456
-m
1
--table
tm_hot_sale_product
--columns
item_id,sale_num,date_day
--export-dir
/user/hive_remote/warehouse/taobao.db/tm_hot_sale_product

Execute the command with the options file:

[root@node3 ~]# sqoop --options-file export_tm_hot_sale_product.txt


7. Overview and installation of Superset

7.1 Superset overview

Superset is a "modern enterprise-level BI (business intelligence) web application" open-sourced by Airbnb. It provides a lightweight data query and visualization solution, supporting data analysis by creating and sharing dashboards.
Superset official website: https://superset.apache.org/
The Superset front end is built mainly with React and NVD3/D3, while the back end is based on Python's Flask framework and libraries such as Pandas and SQLAlchemy. It mainly provides the following features:
1 An integrated data query function that supports many databases, including MySQL, Oracle, SQL Server, SQLite, Spark SQL, Hive, and Kylin, with deep support for Druid. For more supported data sources, see https://superset.apache.org/docs/databases/installing-database-drivers/
2 A variety of predefined visualization charts built on NVD3/D3 that cover most data display needs. For anything else, you can develop additional chart types yourself or embed other JavaScript chart libraries (such as Highcharts or ECharts).
3 A fine-grained security model that enables access control at both the feature and data levels, with support for multiple authentication methods (such as database, OpenID, LDAP, OAuth, and REMOTE_USER).

7.2 Install the Python environment

Download Miniconda: https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Conda is an open-source package and environment manager that can install packages and their dependencies for different Python versions on the same machine, and switch between those Python environments. Anaconda bundles Conda, Python, and many preinstalled toolkits such as NumPy and pandas; Miniconda bundles only Conda and Python. Since we don't need that many toolkits, we choose Miniconda. Using the root user causes various problems later, so we operate as the itbaizhan user.
Create the new user:

[root@node4 ~]# useradd [username]
[root@node4 ~]# passwd [username]
Changing password for user [username].
passwd: all authentication tokens updated successfully.

Create the directory /opt/module and change its owning user and group:

[root@node4 ~]# mkdir /opt/module
[root@node4 ~]# chown -R itbaizhan:itbaizhan /opt/module/

Upload the Miniconda installation script to node4.
Execute the installation script:

[itbaizhan@node4 ~]$ bash Miniconda3-latest-Linux-x86_64.sh
In order to continue the installation
process, please review the license
agreement.
Please, press ENTER to continue
>>>  # press Enter, then page through the license with the space bar
Do you accept the license terms? [yes|no]
[no] >>> yes # type yes, then press Enter
Miniconda3 will now be installed into this
location:
/root/miniconda3
  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below
# enter the installation path
[/home/itbaizhan/miniconda3] >>>
/opt/module/miniconda3
Do you wish the installer to initialize
Miniconda3
by running conda init? [yes|no]
[no] >>> yes # type yes to initialize, then press Enter
Thank you for installing Miniconda3!

Load the environment variable configuration file to make it take effect:

[itbaizhan@node4 ~]$ source ~/.bashrc
(base) [itbaizhan@node4 ~]$

Deactivate the base environment
After Miniconda is installed, the base environment is activated by default every time you open a terminal. Disable this auto-activation with the following command:

(base) [itbaizhan@node4 ~]$ conda config --set auto_activate_base false
(base) [itbaizhan@node4 ~]$

Configure conda image

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
conda config --set show_channel_urls yes

Install the Python environment:

 [itbaizhan@node4 ~]$ conda create -n superset python=3.9.15
 Proceed ([y]/n)? y

Supplement: conda environment management commands
Create an environment: conda create -n env_name
List all environments: conda info -e (or conda info --envs)
Activate an environment: conda activate env_name
Deactivate the currently active environment: conda deactivate
Delete an environment: conda remove -n env_name --all

Activate the superset environment

[itbaizhan@node4 ~]$ conda activate superset
(superset) [itbaizhan@node4 ~]$ conda deactivate # deactivate

7.3 Linux virtual machine installation and configuration Superset

Add sudo permission for the itbaizhan user:

[root@node4 ~]# chmod u+w /etc/sudoers
[root@node4 ~]# vim /etc/sudoers
root    ALL=(ALL)       ALL
# add the following line
itbaizhan       ALL=(ALL)       ALL
[root@node4 ~]# chmod u-w /etc/sudoers

Install basic dependencies before installing Superset:

[root@node4 ~]# su itbaizhan
[itbaizhan@node4 ~]$ sudo yum install -y gcc gcc-c++ libffi-devel python-devel python-pip python-wheel python-setuptools openssl-devel cyrus-sasl-devel openldap-devel
[sudo] password for itbaizhan: itbaizhan
[itbaizhan@node4 ~]$ sudo yum install -y dnf
[itbaizhan@node4 ~]$ sudo dnf install -y gcc gcc-c++ libffi-devel python3-devel python3-pip python3-wheel openssl-devel cyrus-sasl-devel openldap-devel

Install/update setuptools and pip:

[itbaizhan@node4 ~]$ conda activate superset
(superset) [itbaizhan@node4 ~]$ pip install --upgrade setuptools pip -i https://pypi.tuna.tsinghua.edu.cn/simple
# check the setuptools version
(superset) [itbaizhan@node4 ~]$ pip list|grep setuptools
 setuptools             67.6.0
# If it is newer than 65.5.0, downgrade setuptools to 65.5.0 to avoid: cannot import name 'Log' from 'distutils.log'
(superset) [itbaizhan@node4 ~]$ pip install setuptools==65.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]$ pip list|grep setuptools
setuptools             65.5.0

Install Superset

(superset) [root@node4 ~]# pip install apache-superset==2.0.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

Add environment variables:

(superset) [itbaizhan@node4 ~]# export FLASK_APP=superset

Upgrade the Superset metadata database:

(superset) [itbaizhan@node4 ~]# superset db upgrade

An error message may appear: ModuleNotFoundError: No module named 'cryptography.hazmat.backends.openssl.x509'

(superset) [itbaizhan@node4 ~]$ pip list|grep cryptography 
cryptography           39.0.2

The existing cryptography version is not compatible; install 3.3.2 instead:

(superset) [itbaizhan@node4 ~]$ pip uninstall cryptography
(superset) [itbaizhan@node4 ~]$ pip install cryptography==3.3.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]$ pip list|grep cryptography
cryptography           3.3.2
(superset) [itbaizhan@node4 ~]# superset db upgrade

Error message: ModuleNotFoundError: No module named 'werkzeug.wrappers.etag'. This is a bug in the Superset 2.0 release; it is solved by downgrading Werkzeug (and the matching Flask):

(superset) [itbaizhan@node4 ~]$ pip list|grep Werkzeug
Werkzeug               2.2.3
(superset) [itbaizhan@node4 ~]$ pip uninstall -y Werkzeug
(superset) [itbaizhan@node4 ~]$ pip uninstall -y Flask
(superset) [itbaizhan@node4 ~]$ pip install Flask==2.0.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]$ pip install Werkzeug==2.0.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]# rm -f .superset/superset.db
(superset) [itbaizhan@node4 ~]# superset db upgrade

Error message: ModuleNotFoundError: No module named 'wtforms.ext'. The ext module was removed in WTForms 3.0, so downgrade WTForms to 2.3.3:

(superset) [itbaizhan@node4 ~]$ pip list|grep WTForms
WTForms                3.0.1
WTForms-JSON           0.3.5
(superset) [itbaizhan@node4 ~]$ pip uninstall -y WTForms
(superset) [itbaizhan@node4 ~]$ pip install WTForms==2.3.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]# rm -f .superset/superset.db
(superset) [itbaizhan@node4 ~]# superset db upgrade

Create an administrator account

(superset) [root@node4 ~]# superset fab create-admin
Username [admin]: itbaizhan
User first name [admin]: it
User last name [user]: baizhan
Email [[email protected]]: xflovejava@126.com
             
Password:   #itbaizhan
Repeat for confirmation:

Initialize Superset

 (superset) [root@node4 ~]# superset init

7.4 Start and stop superset

Install gunicorn

[root@node4 ~]# su itbaizhan # switch user
[itbaizhan@node4 root]$ cd
[itbaizhan@node4 ~]$ conda activate superset # enter the virtual environment
(superset) [itbaizhan@node4 ~]$ pip install gunicorn -i https://pypi.tuna.tsinghua.edu.cn/simple

Note: gunicorn is a Python web server, similar to Tomcat in the Java world.
Start Superset:

(superset) [itbaizhan@node4 ~]$ gunicorn --workers 5 --timeout 120 --bind node4:8787 "superset.app:create_app()" --daemon # 5 worker processes, 120 s timeout, bound to node4:8787
(superset) [itbaizhan@node4 ~]$ jps # check processes
65627 Jps

Parameter description

--workers: number of worker processes
--bind: bind address, which is also the Superset access address
--timeout: worker timeout in seconds; a worker that times out is restarted automatically
--daemon: run in the background


Stop Superset:

(superset) [itbaizhan@node4 ~]$ ps -ef | awk '/superset/ && !/awk/{print $2}' |xargs kill -9 # kill the superset processes
conda deactivate # exit the virtual environment
exit # exit the itbaizhan shell
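A note on the kill pipeline above: /superset/ matches the lines of interest, !/awk/ keeps the awk process itself out of the match, and {print $2} emits the PID column that xargs passes to kill. A minimal, safe demonstration on canned ps-style text (no real processes are touched; the PID is made up):

```shell
# Two fake `ps -ef` lines; only the gunicorn/superset one should match.
printf '%s\n' \
  'itbaizhan  4242     1  0 10:00 ?  00:00:03 gunicorn superset.app' \
  'root        101     1  0 09:00 ?  00:00:00 sshd' \
  | awk '/superset/ && !/awk/{print $2}'
# prints: 4242
```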

7.5 Superset start and stop script

(superset) [itbaizhan@node4 ~]$ conda deactivate
[itbaizhan@node4 ~]$ vim superset.sh
[itbaizhan@node4 ~]$ cat superset.sh
#!/bin/bash
superset_status(){
    # return 0 when Superset is stopped, 1 when it is running
    result=`ps -ef | awk '/gunicorn/ && !/awk/{print $2}' | wc -l`
    if [[ $result -eq 0 ]]; then
        return 0
    else
        return 1
    fi
}

superset_start(){
    source ~/.bashrc
    superset_status >/dev/null 2>&1
    if [[ $? -eq 0 ]]; then
        conda activate superset ; gunicorn --workers 5 --timeout 120 --bind node4:8787 --daemon 'superset.app:create_app()'
    else
        echo "superset running!!"
    fi
}

superset_stop(){
    superset_status >/dev/null 2>&1
    if [[ $? -eq 0 ]]; then
        echo "superset is stopped"
    else
        ps -ef | awk '/gunicorn/ && !/awk/{print $2}' | xargs kill -9
    fi
}

case $1 in
    start )
        echo "start Superset!!"
        superset_start
    ;;
    stop )
        echo "stop Superset!!"
        superset_stop
    ;;
    restart )
        echo "restart Superset!!"
        superset_stop
        superset_start
    ;;
    status )
        superset_status >/dev/null 2>&1
        if [[ $? -eq 0 ]]; then
            echo "superset is stopped"
        else
            echo "superset running"
        fi
    ;;
esac


Add execute permission to the script:

[itbaizhan@node4 ~]$ chmod +x superset.sh
[itbaizhan@node4 ~]$ ll
-rwxrwxr-x 1 itbaizhan itbaizhan     1141 Aug 29 19:50 superset.sh

Script usage

[itbaizhan@node4 ~]$ ./superset.sh start
start Superset!!
[itbaizhan@node4 ~]$ ./superset.sh status
superset running
[itbaizhan@node4 ~]$ ./superset.sh  stop
stop Superset!!

7.6 Superset integrates MySQL database

With Superset, data visualization can be completed without writing any code.
In the Superset UI, add a database connection and fill in the MySQL host name, password, and other information. Click TEST CONNECTION; the message "Connection looks good!" indicates the connection succeeded.

7.7 Visualization with Superset: Top 10 best-selling items

Add the tm_hot_sale_product table as a dataset, then click the edit icon behind it to modify the columns.
Select Columns, then click the small triangle icon in front of the column you want to modify.
Here you can give the column a label and description and set options such as whether it is temporal, filterable, or a dimension.
Select Edit under the METRICS tab to name the statistical metrics.
Click + Chart, choose Bar Chart, select the content to display, save the chart, and adjust it as needed.


Origin: blog.csdn.net/m0_63953077/article/details/130588546