1. Top 10 best-selling products
Idea: to find the most popular products in the table, group by the products' item_id and count how many user_ids appear in each group (the same user can purchase repeatedly, so there is no need to deduplicate), then sort in descending order and take the first 10.
select item_id, count(user_id) sale_num
from to_user_log
group by item_id
order by sale_num desc
limit 10
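The same group-count-sort-limit pipeline can be sketched outside Hive as a quick sanity check. A minimal shell version, assuming a hypothetical two-column text log (user_id, item_id) standing in for to_user_log; the sample rows are made up for illustration:

```shell
# Hypothetical sample of to_user_log: one "user_id item_id" pair per line
cat > /tmp/to_user_log.txt <<'EOF'
u1 i1
u2 i1
u1 i1
u3 i2
u4 i2
u5 i3
EOF

# count(user_id) per item_id (repeat purchases are not deduplicated),
# then sort by the count descending and keep the top 2
awk '{cnt[$2]++} END {for (i in cnt) print i, cnt[i]}' /tmp/to_user_log.txt \
  | sort -k2,2nr | head -n 2
```

Here i1 comes first with 3 sales, which mirrors what group by + order by sale_num desc + limit does at Hive scale.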
**Query the current HiveSQL execution progress**
[root@node3 ~]# tail -f nohup.out
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1683334882009_0003, Tracking URL = http://node4:8088/proxy/application_1683334882009_0003/
Kill Command = /opt/hadoop-3.1.3/bin/mapred job -kill job_1683334882009_0003
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 8
2023-05-06 09:43:30,458 Stage-1 map = 0%, reduce = 0%
2023-05-06 09:46:00,536 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 11.11 sec
2023-05-06 09:46:07,886 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 399.97 sec
2023-05-06 09:46:11,326 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 402.3 sec
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
**HiveSQL optimization:** order by is a global sort, so the MapReduce job generated from the HQL has only one reduce task. When the data volume is large, use order by with caution: the single reducer may take a very long time to finish, or fail to finish at all, so the SQL needs to be optimized.
select item_id, count(user_id) sale_num
from to_user_log
where user_id is not null
group by item_id
distribute by sale_num -- distribute rows across reducers by sale_num
sort by sale_num desc  -- sort within each reducer in descending order
limit 10
However, since our nodes all run on the same machine, it may still take a while.
Create a table of results for hot products
create table if not exists tm_hot_sale_product(
item_id int comment "item id",
sale_num int comment "sales count",
date_day string comment "analysis date"
)
row format delimited
fields terminated by ","
lines terminated by "\n";
Save the result to Hive table
-- insert the query result into the table
from to_user_log
insert into tm_hot_sale_product
select item_id, count(user_id) sale_num, '20230509'
where user_id is not null
group by item_id
order by sale_num desc
limit 10;
Import data from the Hive table into a MySQL table
2. Sqoop overview
Sqoop: a tool for transferring data between relational databases (Oracle, MySQL, SQL Server, etc.) and Hadoop, Hive, HBase, etc. A similar product is DataX (Alibaba's data exchange tool).
official website: http://sqoop.apache.org/
Version introduction (the two versions are completely incompatible; sqoop1 is the most widely used):
- sqoop1:1.4.x
- sqoop2: 1.99.x
The Sqoop architecture is very simple; it is one of the simplest frameworks in the Hadoop ecosystem.
Sqoop1 connects directly to Hadoop from the client side; commands are parsed and turned into the corresponding MapReduce jobs for execution.
The input and output of those MapReduce jobs are configured through InputFormat and OutputFormat.
3. Analysis of Sqoop principle
import: load data from a relational database into HDFS
export: write data from HDFS out to a relational database
4. Sqoop installation
We use version 1.4.7; the specific download address is as follows:
http://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
We choose to install Sqoop on node3, the server where Hive is installed. The specific installation steps are as follows:
1 Upload: Upload the sqoop installation package to the /opt/apps directory of node3
2 Unzip and rename
[root@node3 apps]# tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /opt/
[root@node3 apps]# cd ../
[root@node3 opt]# mv sqoop-1.4.7.bin__hadoop-2.6.0/ sqoop-1.4.7
3 Configure environment variables
[root@node3 opt]# cd sqoop-1.4.7/
[root@node3 sqoop-1.4.7]# pwd
/opt/sqoop-1.4.7 # copy this path
[root@node3 sqoop-1.4.7]# vim /etc/profile
# sqoop environment variables
export SQOOP_HOME=/opt/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin
[root@node3 sqoop-1.4.7]# source /etc/profile
4 Check whether the environment variable is valid
[root@node3 ~]# cd
[root@node3 ~]# sqoop version
Warning: /opt/sqoop-1.4.7/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/sqoop-1.4.7/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
......
INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
5 Close sqoop warning
6 Configure sqoop-env.sh (no modification needed)
[root@node3 bin]# cd ../conf
[root@node3 conf]# ls
oraoop-site-template.xml  sqoop-env-template.sh  sqoop-site.xml
sqoop-env-template.cmd    sqoop-site-template.xml
[root@node3 conf]# cp sqoop-env-template.sh sqoop-env.sh
8 sqoop command help
[root@node3 ~]# sqoop help
......
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
[root@node3 conf]# sqoop help import
9 Add the database driver package mysql-connector-java-5.1.37.jar by uploading it to node3:/opt/sqoop-1.4.7/lib
5. SQLyog step-by-step installation
This software is similar to Navicat.
Check the mysqld service status
[root@node1 ~]# systemctl status mysqld
● mysqld.service - MySQL Server
1 First decompress SQLyog.rar to a directory without Chinese characters or spaces, for example: D:\devsoft
2 Enter the decompressed directory and double-click SQLyog.exe
3 The registration code entry interface appears
4 Find the Key.txt file in the directory where SQLyog.exe is located and open it
5 Copy the user name and registration code into the corresponding input boxes
6 Click the "Register" button; the registration success screen appears and SQLyog is ready to use
Fill in the MySQL host name, password, and other information, then click the Connect button to save the connection information.
6. Sqoop export data to MySQL
First connect to MySQL in SQLyog to create a database
CREATE DATABASE taobao;
SQLyog visual creation table
Official documentation for exporting a Hive table to MySQL:
Syntax
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Table 27. Common arguments
Table 29. Export control arguments:
Write a script to export the hive database to the mysql database
[root@node3 ~]# cat export_tm_hot_sale_product.txt
export
--connect
jdbc:mysql://node1:3306/taobao
--username
root
--password
123456
-m
1
--table
tm_hot_sale_product
--columns
item_id,sale_num,date_day
--export-dir
/user/hive_remote/warehouse/taobao.db/tm_hot_sale_product
Execute the script
[root@node3 ~]# sqoop --options-file export_tm_hot_sale_product.txt
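An --options-file holds one command-line token per line, and Sqoop reads it as if those tokens had been typed on a single command line. A minimal sketch of that equivalence, using a shortened, hypothetical options file (only the joining is demonstrated; sqoop itself is not invoked here):

```shell
# Shortened options file: one option or value per line
cat > /tmp/export_opts.txt <<'EOF'
export
--connect
jdbc:mysql://node1:3306/taobao
--username
root
EOF

# Joining the lines with spaces yields the equivalent one-line invocation
echo "sqoop $(tr '\n' ' ' < /tmp/export_opts.txt)"
```

The printed string is the same command you would otherwise type directly at the shell.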
7. Overview and installation of Superset
7.1 Superset overview
Superset is a modern, enterprise-grade BI (business intelligence) web application open-sourced by Airbnb. It provides a lightweight data query and visualization solution, supporting data analysis through the creation and sharing of dashboards.
Superset official website: https://superset.apache.org/
The front-end of Superset mainly uses React and NVD3/D3, while the back-end is based on Python's Flask framework and dependent libraries such as Pandas and SQLAlchemy, which mainly provide the following functions:
1 Integrated data query function: supports multiple databases, including MySQL, Oracle, SQL Server, SQLite, SparkSQL, Hive, Kylin, etc., with deep support for Druid. For more data source support, see https://superset.apache.org/docs/databases/installing-database-drivers/
2 Predefines a variety of visualization charts through NVD3/D3 to meet most of the data display functions. If there are other needs, you can also develop more chart types by yourself, or embed other JavaScript chart libraries (such as HighCharts, ECharts).
3 Provides a fine-grained security model, enabling access control at the functional and data levels, and supports multiple authentication methods (such as database, OpenID, LDAP, OAuth, REMOTE_USER, etc.)
7.2 Install the Python environment
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Conda is an open-source package and environment manager that can install packages and their dependencies for different Python versions on the same machine, and can switch between Python environments. Anaconda includes Conda, Python, and many pre-installed toolkits such as numpy and pandas; Miniconda includes only Conda and Python. Since we don't need that many toolkits, we choose Miniconda. Using the root user leads to various problems later, so we operate as the itbaizhan user.
create new user
[root@node4 ~]# useradd [username]
[root@node4 ~]# passwd [username]
Changing password for user [username].
passwd: all authentication tokens updated successfully.
Create a directory /opt/module, and need to change the user group and user to which it belongs
[root@node4 ~]# mkdir /opt/module
[root@node4 ~]# chown -R itbaizhan:itbaizhan /opt/module/
Upload file
Execute installation script
[itbaizhan@node4 ~]# bash Miniconda3-latest-Linux-x86_64.sh
In order to continue the installation process, please review the license agreement.
Please, press ENTER to continue
>>> # press Enter, then page through the license with the space bar
Do you accept the license terms? [yes|no]
[no] >>> yes # type yes, then press Enter
Miniconda3 will now be installed into this location:
/root/miniconda3
  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below
# enter the install path
[/home/itbaizhan/miniconda3] >>> /opt/module/miniconda3
Do you wish the installer to initialize Miniconda3 by running conda init? [yes|no]
[no] >>> yes # type yes to initialize, then press Enter
Thank you for installing Miniconda3!
Load the environment variable configuration file to make it effective
[itbaizhan@node4 ~]$ source ~/.bashrc
(base) [itbaizhan@node4 ~]$
Deactivate the base environment
After the installation of Miniconda is complete, the default base environment will be activated every time you open the terminal. We can disable the activation of the default base environment through the following command.
(base) [itbaizhan@node4 ~]$ conda config --set auto_activate_base false
(base) [itbaizhan@node4 ~]$
Configure conda image
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
conda config --set show_channel_urls yes
Install the Python environment
[itbaizhan@node4 ~]# conda create -n superset python=3.9.15
Proceed ([y]/n)? y
Supplement: conda environment management commands
Create an environment: conda create -n env_name
List all environments: conda info -e / --envs
Activate an environment: conda activate env_name
Deactivate the current environment: conda deactivate
Delete an environment: conda remove -n env_name --all
Activate the superset environment
[itbaizhan@node4 ~]# conda activate superset
(superset) [itbaizhan@node4 ~]# conda deactivate # deactivate
7.3 Linux virtual machine installation and configuration Superset
Add sudo permission for itbaizhan user
[root@node4 ~]# chmod u+w /etc/sudoers
[root@node4 ~]# vim /etc/sudoers
root ALL=(ALL) ALL
# add
itbaizhan ALL=(ALL) ALL
[root@node4 ~]# chmod u-w /etc/sudoers
Install basic dependencies before installing SuperSet
[root@node4 ~]# su itbaizhan
[itbaizhan@node4 ~]# sudo yum install -y gcc gcc-c++ libffi-devel python-devel python-pip python-wheel python-setuptools openssl-devel cyrus-sasl-devel openldap-devel
[sudo] password for itbaizhan: itbaizhan
[itbaizhan@node4 ~]# sudo yum install -y dnf
[itbaizhan@node4 ~]# sudo dnf install -y gcc gcc-c++ libffi-devel python3-devel python3-pip python3-wheel openssl-devel cyrus-sasl-devel openldap-devel
Install/update setuptools and pip
[itbaizhan@node4 ~]# conda activate superset
(superset) [itbaizhan@node4 ~]# pip install --upgrade setuptools pip -i https://pypi.tuna.tsinghua.edu.cn/simple
# check the setuptools version
(superset) [itbaizhan@node4 ~]$ pip list|grep setuptools
setuptools 67.6.0
# if it is greater than 65.5.0, downgrade setuptools to 65.5.0 to avoid: cannot import name 'Log' from 'distutils.log'
(superset) [itbaizhan@node4 ~]# pip install setuptools==65.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]$ pip list|grep setuptools
setuptools 65.5.0
Install Superset
(superset) [root@node4 ~]# pip install apache-superset==2.0.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
Add environment variables:
(superset) [itbaizhan@node4 ~]# export FLASK_APP=superset
Initialize the superset
(superset) [itbaizhan@node4 ~]# superset db upgrade
An error message may appear: ModuleNotFoundError: No module named 'cryptography.hazmat.backends.openssl.x509'
(superset) [itbaizhan@node4 ~]$ pip list|grep cryptography
cryptography 39.0.2
The existing cryptography version is not compatible; install version 3.3.2 instead
(superset) [itbaizhan@node4 ~]$ pip uninstall cryptography
(superset) [itbaizhan@node4 ~]$ pip install cryptography==3.3.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]$ pip list|grep cryptography
cryptography 3.3.2
(superset) [itbaizhan@node4 ~]# superset db upgrade
Error message: ModuleNotFoundError: No module named 'werkzeug.wrappers.etag'. This is a bug in Superset 2.0 and is solved by downgrading Werkzeug
(superset) [itbaizhan@node4 ~]$ pip list|grep Werkzeug
Werkzeug 2.2.3
(superset) [itbaizhan@node4 ~]$ pip uninstall -y Werkzeug
(superset) [itbaizhan@node4 ~]$ pip uninstall -y Flask
(superset) [itbaizhan@node4 ~]$ pip install Flask==2.0.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]$ pip install Werkzeug==2.0.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]# rm -f .superset/superset.db
(superset) [itbaizhan@node4 ~]# superset db upgrade
Error message: ModuleNotFoundError: No module named 'wtforms.ext'. The ext module was removed in WTForms 3.0, so WTForms needs to be downgraded to 2.3.3
(superset) [itbaizhan@node4 ~]$ pip list|grep WTForms
WTForms 3.0.1
WTForms-JSON 0.3.5
(superset) [itbaizhan@node4 ~]$ pip uninstall -y WTForms
(superset) [itbaizhan@node4 ~]$ pip install WTForms==2.3.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
(superset) [itbaizhan@node4 ~]# rm -f .superset/superset.db
(superset) [itbaizhan@node4 ~]# superset db upgrade
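The individual downgrades above can be collected into a single constraints file so a fresh install applies all of the pins in one pass. A sketch (the file name and usage are illustrative; the versions are the ones this walkthrough arrived at):

```text
# constraints.txt
setuptools==65.5.0
cryptography==3.3.2
Flask==2.0.3
Werkzeug==2.0.3
WTForms==2.3.3
```

Running pip install -c constraints.txt apache-superset==2.0.0 then makes pip honor these pins during installation instead of fixing each conflict after the fact.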
Create an administrator account
(superset) [root@node4 ~]# superset fab create-admin
Username [admin]: itbaizhan
User first name [admin]: it
User last name [user]: baizhan
Email [[email protected]]: xflovejava@126.com
Password: #itbaizhan
Repeat for confirmation:
Initialize Superset
(superset) [root@node4 ~]# superset init
7.4 Start and stop superset
Install gunicorn
[root@node4 ~]# su itbaizhan # switch user
[itbaizhan@node4 root]$ cd
[itbaizhan@node4 ~]$ conda activate superset # enter the virtual environment
(superset) [itbaizhan@node4 ~]$ pip install gunicorn -i https://pypi.tuna.tsinghua.edu.cn/simple
Note: gunicorn is a Python web server, similar to Tomcat in the Java world.
Start superset
(superset) [itbaizhan@node4 ~]$ gunicorn --workers 5 --timeout 120 --bind node4:8787 "superset.app:create_app()" --daemon # 5 worker processes, 120s timeout, bound to node4:8787, run in the background
(superset) [itbaizhan@node4 ~]$ jps # check processes
65627 Jps
Parameter description
--workers: number of worker processes
--bind: bind address, i.e. the Superset access address
--timeout: worker timeout in seconds; a timed-out worker is restarted automatically
--daemon: run in the background
Stop superset
(superset) [itbaizhan@node4 ~]$ ps -ef | awk '/superset/ && !/awk/{print $2}' | xargs kill -9 # kill the superset processes
conda deactivate # leave the virtual environment
exit # exit the user
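The ps | awk | xargs pipeline above is worth unpacking. A small sketch of the awk filter run against a canned ps-style listing (made-up PIDs), so that nothing is actually killed:

```shell
# Fake `ps -ef` output: two gunicorn/superset workers plus an unrelated process
cat > /tmp/ps_sample.txt <<'EOF'
itbaizhan  65627      1  0 09:00 ?  gunicorn: master [superset.app]
itbaizhan  65630  65627  0 09:00 ?  gunicorn: worker [superset.app]
root        1234      1  0 08:00 ?  /usr/sbin/sshd
EOF

# /superset/ selects the matching lines; !/awk/ keeps the awk command itself
# out of the match when this runs against live `ps -ef` output; $2 is the PID
awk '/superset/ && !/awk/{print $2}' /tmp/ps_sample.txt
```

Piping the printed PIDs into xargs kill -9 is what actually stops the workers in the real command.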
7.5 Superset start and stop script
(superset) [itbaizhan@node4 ~]$ conda deactivate
[itbaizhan@node4 ~]$ vim superset.sh
[itbaizhan@node4 ~]$ cat superset.sh
#!/bin/bash
superset_status(){
    result=`ps -ef | awk '/gunicorn/ && !/awk/{print $2}' | wc -l`
    if [[ $result -eq 0 ]]; then
        return 0
    else
        return 1
    fi
}
superset_start(){
    source ~/.bashrc
    superset_status >/dev/null 2>&1
    if [[ $? -eq 0 ]]; then
        conda activate superset; gunicorn --workers 5 --timeout 120 --bind node4:8787 --daemon 'superset.app:create_app()'
    else
        echo "superset running!!"
    fi
}
superset_stop(){
    superset_status >/dev/null 2>&1
    if [[ $? -eq 0 ]]; then
        echo "superset is stop"
    else
        ps -ef | awk '/gunicorn/ && !/awk/{print $2}' | xargs kill -9
    fi
}
case $1 in
start )
    echo "start Superset!!"
    superset_start
    ;;
stop )
    echo "stop Superset!!"
    superset_stop
    ;;
restart )
    echo "restart Superset!!"
    superset_stop
    superset_start
    ;;
status )
    superset_status >/dev/null 2>&1
    if [[ $? -eq 0 ]]; then
        echo "superset is stop"
    else
        echo "superset running"
    fi
esac
Add execute permission to the script
[itbaizhan@node4 ~]$ chmod +x superset.sh
[itbaizhan@node4 ~]$ ll
-rwxrwxr-x 1 itbaizhan itbaizhan 1141 Aug 29 19:50 superset.sh
Script usage
[itbaizhan@node4 ~]$ ./superset.sh start
start Superset!!
[itbaizhan@node4 ~]$ ./superset.sh status
superset running
[itbaizhan@node4 ~]$ ./superset.sh stop
stop Superset!!
7.6 Superset integrates the MySQL database
With Superset, data visualization can be completed without writing code.
Click TEST CONNECTION; the message "Connection looks good!" indicates the connection is successful.
7.7 Visualizing with Superset: Top 10 best-selling products
Click the edit icon next to the created tm_hot_sale_product table to modify its columns.
Select the COLUMNS tab and click the small triangle icon in front of the column you want to modify.
You can set a label, description, whether the column is temporal, filterable, a dimension, etc.
Select Edit under the METRICS tab to name the statistical metric.
Click + CHART, choose Bar Chart, select the content, save the data, and adjust as needed.