The complete process of big data enterprise development

Big data enterprise development basic process

Linux commands

1 Hadoop (HDFS+Yarn) stand-alone environment construction

  • Hadoop is an open source distributed computing framework consisting of two core components: HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator).
  • HDFS is Hadoop's distributed file system for storing large-scale data sets and providing high fault tolerance and high throughput. It divides data into chunks and distributes those chunks across multiple compute nodes in the cluster. HDFS provides mechanisms for data redundancy and automatic failure recovery to ensure data reliability and availability.
  • YARN is Hadoop's cluster resource management system, which is used to allocate and manage computing resources in the cluster. It is responsible for receiving jobs submitted by users and assigning different tasks of the jobs to different computing nodes in the cluster for execution. YARN is also responsible for monitoring and managing the usage of cluster resources for resource scheduling and optimization.
  • HDFS and YARN are two core components of Hadoop that work closely together to enable distributed computing and storage. HDFS provides reliable data storage, while YARN manages computing resources and executes jobs. Users can store data on HDFS and use YARN to submit and execute jobs, enabling distributed data processing and analysis.
  • To sum up, HDFS provides distributed storage, while YARN provides distributed computing and resource management. Together they constitute Hadoop's distributed computing framework, giving users the ability to process large-scale data sets; a brief command-line illustration follows.
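As a rough illustration, once a cluster like the one built later in this article is running, the division of labor shows up directly on the command line (the paths below are examples):

# HDFS side: store and list files
hadoop fs -mkdir -p /demo
hadoop fs -put /opt/data/sample.txt /demo/
hadoop fs -ls /demo
# YARN side: list the applications currently known to the ResourceManager
yarn application -list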

Hadoop is developed in Java, so a JDK environment is required. Environment used in this article:

  • hadoop == 3.1.0
  • CentOS == 7.3
  • jdk == 1.8

Note: you can download the above software from the corresponding official websites and upload it to Linux. This article focuses on the concepts, so installation packages are not provided separately.

① Configure the java environment

# Create the tools directory for storing installation packages
mkdir /opt/tools
# Switch to the tools directory and upload the JDK package
# Create the server directory for the extracted JDK
mkdir /opt/server
# Extract the JDK into the server directory
tar -zvxf jdk-8u131-linux-x64.tar.gz -C /opt/server

# Configure environment variables
vim /etc/profile
# Append at the end of the file
export JAVA_HOME=/opt/server/jdk1.8.0_131
export PATH=${JAVA_HOME}/bin:$PATH
# Apply the configuration
source /etc/profile
# Check that the installation succeeded
java -version

Note: Hadoop components communicate with each other over SSH. After configuring passwordless login, you no longer need to enter a password each time.

# Configure the mapping between IP address and hostname
vim /etc/hosts
# Append at the end of the file
192.168.80.100 server

# Generate the key pair
ssh-keygen -t rsa
# Authorize: go to ~/.ssh, check the generated public/private keys, and append the public key to the authorization file
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
chmod 600 authorized_keys
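To verify, an SSH login to the host configured above should now succeed without a password prompt:

ssh server
exit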

② Configure Hadoop

  • Download and unzip hadoop

Visit http://archive.apache.org/dist/hadoop/core/hadoop-3.1.0/ to download Hadoop (.tar.gz package)

# Switch to the tools directory and upload the Hadoop package
# Extract Hadoop into the server directory
tar -zvxf hadoop-3.1.0.tar.gz -C /opt/server/
  • Modify the Hadoop configuration file and set the Java path
# Go to /opt/server/hadoop-3.1.0/etc/hadoop and modify the following configuration
# Edit hadoop-env.sh and set the JDK installation path
vim hadoop-env.sh
export JAVA_HOME=/opt/server/jdk1.8.0_131

Modify the core-site.xml file to specify the communication address of the hdfs protocol file system and
the directory where Hadoop stores temporary files (this directory does not need to be created manually)

<configuration>
  <property>
    <!-- Communication address of the NameNode's HDFS file system -->
    <name>fs.defaultFS</name>
    <value>hdfs://server:8020</value>
  </property>
  <property>
    <!-- Directory where Hadoop stores its data files -->
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/data</value>
  </property>
</configuration>
  • Modify the replication factor

Modify hdfs-site.xml to set the DFS replication factor

<configuration>
  <property>
    <!-- Since this is a single-node setup, set the DFS replication factor to 1 -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
  • Modify the workers file to configure all slave nodes
vim workers
# Configure the hostnames or IP addresses of all worker nodes; since this is a single-node setup, just specify this machine:
server

③Initialize and start HDFS

# Turn off the firewall; otherwise Hadoop's Web UI may not be reachable
# Check firewall status
sudo firewall-cmd --state
# Stop the firewall
sudo systemctl stop firewalld
# Disable it at boot
sudo systemctl disable firewalld
# Initialization: the first time Hadoop is started it must be formatted. Go to /opt/server/hadoop-3.1.0/bin and run:
cd /opt/server/hadoop-3.1.0/bin
./hdfs namenode -format


By default, Hadoop 3 does not allow the root user to start the cluster with the one-click scripts, so the startup users need to be configured:

cd /opt/server/hadoop-3.1.0/sbin/
# Edit start-dfs.sh and stop-dfs.sh and add the following at the top of both files
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

Go to the /opt/server/hadoop-3.1.0/sbin/ directory and start HDFS:

cd /opt/server/hadoop-3.1.0/sbin/
./start-dfs.sh

Verify that it is started:

  1. Execute jps to check whether the NameNode and DataNode services have started:
[root@server bin]# jps
41032 DataNode
41368 Jps
40862 NameNode
41246 SecondaryNameNode
  2. View the Web UI, which listens on port 9870:


# Optional: configure environment variables for easier startup later
vim /etc/profile
export HADOOP_HOME=/opt/server/hadoop-3.1.0
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
# Apply the configuration
source /etc/profile

④ Hadoop (YARN) environment construction

  1. Enter the /opt/server/hadoop-3.1.0/etc/hadoop directory and modify the following configuration:

Modify the mapred-site.xml file

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
</configuration>

Modify the yarn-site.xml file to configure the auxiliary services running on NodeManager

<configuration>
  <property>
    <!-- Auxiliary service running on the NodeManager. It must be set to mapreduce_shuffle before MapReduce programs can run on YARN. -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
  2. Start the service

As before, Hadoop 3 does not allow starting the cluster as root with the one-click scripts, so the startup users need to be configured:

# Add the following at the top of both start-yarn.sh and stop-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

Enter the ${HADOOP_HOME}/sbin/ directory and start YARN:

./start-yarn.sh
  3. Verify that the startup was successful
  • Execute the jps command to check whether the NodeManager and ResourceManager services have been started
  • View the Web UI interface, the port is 8088
  4. The hadoop-mapreduce-examples-x.jar that ships with Hadoop contains some sample programs, located in the
    ${HADOOP_HOME}/share/hadoop/mapreduce directory.
# Go to the ${HADOOP_HOME}/bin/ directory and run the following command
hadoop jar /opt/server/hadoop-3.1.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar pi 2 10

View the results: the end of the console output shows the estimated value of Pi.
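The same examples jar also contains other programs such as wordcount, which is a quick way to exercise HDFS and YARN together (the input/output HDFS paths are examples, and the output directory must not exist yet):

hadoop fs -mkdir -p /wc_in
hadoop fs -put /etc/profile /wc_in/
hadoop jar /opt/server/hadoop-3.1.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar wordcount /wc_in /wc_out
hadoop fs -cat /wc_out/part-r-00000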

2 Hive (hadoop-based data warehouse) installation and use

Since Hive is a data warehouse built on top of Hadoop, it is usually deployed and run on a Linux system, so make sure the server's basic environment and the Hadoop environment are running normally. Hive itself is not installed or run in a distributed fashion; its distributed capabilities, including distributed storage and distributed computing, are provided by Hadoop.

  • Hive allows its metadata to be stored in a local or remote external database, which makes it possible to support multi-session production environments. In this article, MySQL is used as Hive's metastore database.

2.1 Hive installation

As noted above, make sure the Hadoop environment is running normally before installing Hive.

# Create server-side directories for the Hive installation files
# For installation packages
mkdir /opt/tools
# For the extracted files
mkdir /opt/server

# Switch to /opt/tools and upload the Hive installation package
cd /opt/tools

A total of two installation packages are involved, namely apache-hive-3.1.2-bin.tar.gz and mysql-5.7.34-1.el7.x86_64.rpm-bundle.tar

2.1.1 Install MySQL

  1. Uninstall the MariaDB (a MySQL fork) that ships with CentOS 7
# Find it
rpm -qa|grep mariadb
# mariadb-libs-5.5.52-1.el7.x86_64
# Uninstall it
rpm -e mariadb-libs-5.5.52-1.el7.x86_64 --nodeps
  2. Extract MySQL
# Create a directory for the MySQL packages
mkdir /opt/server/mysql
# Extract
tar xvf mysql-5.7.34-1.el7.x86_64.rpm-bundle.tar -C /opt/server/mysql/
  3. Run the installation
# Install dependencies
yum -y install libaio
yum -y install libncurses*
yum -y install perl perl-devel
# Switch to the installation directory
cd /opt/server/mysql/
# Install
rpm -ivh mysql-community-common-5.7.34-1.el7.x86_64.rpm
rpm -ivh mysql-community-libs-5.7.34-1.el7.x86_64.rpm
rpm -ivh mysql-community-client-5.7.34-1.el7.x86_64.rpm
rpm -ivh mysql-community-server-5.7.34-1.el7.x86_64.rpm
  4. Start MySQL
# Start MySQL
systemctl start mysqld.service
# View the temporary root password generated in the log
cat /var/log/mysqld.log | grep password


  5. Change the initial password
# Log in to MySQL
mysql -u root -p
Enter password:   # enter the temporary password generated in the log
# Update the root password; here it is set to root
set global validate_password_policy=0;
set global validate_password_length=1;
set password=password('root');
  6. Allow remote connections to MySQL
grant all privileges on *.* to 'root'@'%' identified by 'root';
# Refresh privileges
flush privileges;
  7. Set the MySQL service to start at boot
# Start/stop MySQL and check its status
systemctl stop mysqld
systemctl status mysqld
systemctl start mysqld
# It is recommended to enable MySQL at boot
systemctl enable mysqld
# Check whether auto-start was enabled successfully
systemctl list-unit-files | grep mysqld

2.1.2 Hive installation and configuration

  1. Extract the installation package
# Switch to the package directory
cd /opt/tools
# Extract to the /opt/server directory
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/server/
  2. Add the MySQL driver to the lib directory of the Hive installation

Choose the driver that matches your MySQL version; check whether the hive/lib directory already contains one and add it if it does not.

# upload mysql-connector-java-5.1.38.jar into this directory
cd /opt/server/apache-hive-3.1.2-bin/lib
  3. Configure the Hive environment variables
cd /opt/server/apache-hive-3.1.2-bin/conf
cp hive-env.sh.template hive-env.sh
vim hive-env.sh
# Add the following
HADOOP_HOME=/opt/server/hadoop-3.1.0
  4. Create a new hive-site.xml file with the following content, mainly configuring the MySQL address, driver, user name, and password for the metastore
vim hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- MySQL metastore configuration; the hostname "server" is mapped in /etc/hosts -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://server:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>
</configuration>

Note:

  • Initialize the metastore database. With Hive 1.x this step can be skipped: Hive initializes the metastore automatically on first startup, but it only creates the necessary subset of metadata tables; the remaining tables are created automatically when they are first used.
  • With Hive 2.x and later, the metastore must be initialized manually. The initialization command:
cd /opt/server/apache-hive-3.1.2-bin/bin
./schematool -dbType mysql -initSchema


After a successful initialization, 74 tables are created in the hive database in MySQL.
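To double-check from the MySQL side, you can list the metastore tables (credentials as configured above):

mysql -u root -proot -e "use hive; show tables;"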

  5. Start Hive
# For convenience, add the hive command to the environment variables
vim /etc/profile
export HIVE_HOME=/opt/server/apache-hive-3.1.2-bin
export PATH=$HIVE_HOME/bin:$PATH
# Apply the configuration
source /etc/profile
# Start Hive
hive

Enter the show databases; command; if the default database is listed, the setup was successful.


  6. Some simple Hive CLI commands

Running the hive command without any parameters enters the interactive command line.

  • Execute SQL commands
hive -e 'select * from emp';
  • Execute SQL scripts (the script can live on the local file system or on HDFS)
# local file system
hive -f /usr/file/simple.sql;
# HDFS file system
hive -f hdfs://node01:8020/tmp/simple.sql;
  7. Basic usage

Create and switch databases in Hive, create a table, insert data, and finally query to confirm that the insert succeeded.

# Connect to Hive
hive
# Database operations
create database test;   -- create a database
show databases;         -- list all databases
use test;               -- switch to the database
# Table operations
-- create a table
create table t_student(id int,name varchar(255));
-- insert one row
insert into table t_student values(1,"potter");
-- query the table
select * from t_student;

When inserting the row you will notice that it is extremely slow: the SQL statement takes about 26 seconds, and the progress of a MapReduce job is displayed.


Check YARN and HDFS:

  • Log in to Hadoop YARN at http://192.168.40.100:8088 (replace with your own server IP) to check whether a MapReduce job is running.
  • The running task's name is the executed SQL statement, its type is MapReduce, and its final status is SUCCEEDED. Then browse the Hadoop HDFS file system at http://192.168.40.100:9870/ (replace with your own server IP): according to Hive's data model, the table's data is ultimately stored in HDFS under the folder corresponding to the table.

Small summary:

  • Hive SQL syntax is very similar to standard SQL, which greatly reduces the learning cost.
  • Under the hood, Hive performs the insert through a MapReduce job, which is why it is slow.
  • Inserting large data sets row by row is unrealistic and extremely costly.
  • Hive therefore has its own way of getting data into tables: structured files are mapped onto tables (see the load-based sketch below).
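A minimal sketch of that file-to-table mapping, reusing the test database above (the sample file path is an assumption; the file is expected to contain comma-separated id,name rows):

-- create a table whose row format matches a comma-separated file
create table t_student_csv(id int, name string)
row format delimited fields terminated by ',';
-- load (map) the file into the table; a plain load does not start a MapReduce job
load data local inpath '/opt/data/students.csv' into table t_student_csv;
select * from t_student_csv limit 5;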

2.2 Hive common operations (basic syntax, temporarily omitted)

This part is temporarily omitted; refer to the official documentation:

  • Hive:https://cwiki.apache.org/confluence/display/Hive/LanguageManual

3 Flume log collection tool

3.1 Overview

3.1.1 Introduction and operating mechanism

①Introduction:

  • Flume is a highly available and highly reliable distributed software for collecting, aggregating, and transporting massive amounts of log data.
  • The core of Flume is to collect data from a data source (source) and deliver it to a specified destination (sink). To guarantee delivery, the data is buffered in a channel before being sent to the sink; only after the data has actually reached the sink does Flume delete the buffered data.
  • Flume supports custom data senders for collecting various types of data, as well as custom data receivers for the final storage of the data. Typical collection requirements can be met with simple Flume configuration, and it also extends well for special scenarios, so Flume is suitable for most day-to-day data collection scenarios.

② Operating mechanism:

The core role in a Flume system is the agent. An agent is a Java process that generally runs on the log collection node.

Each agent is equivalent to a data transmitter, and there are three components inside:

  1. Source: collection source, used to connect with the data source to obtain data;
  2. Sink: the sink, i.e. the destination to which collected data is delivered, used to pass data to the next-level agent or to the final storage system;
  3. Channel: The data transmission channel inside the agent, used to transfer data from the source to the sink;
  • What flows through the whole transmission process is the event, the basic unit of data transfer inside Flume. An event encapsulates the transferred data (for a text file this is usually one line of text) and is also the basic unit of a transaction. An event travels from source to channel to sink; it is a byte array and can carry optional headers. An event represents the smallest complete unit of data on its way from an external source to an external destination.
  • A complete event consists of the event headers and the event body, where the body carries the log record collected by Flume.

3.1.2 Common collection structures

  • Simple structure: a single agent collects data from the source and delivers it to the destination (a minimal example is shown below).
  • Multi-level chaining: multiple agents are connected in series, with one agent's sink sending data to the next agent's source.
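As an illustration of the simple structure, a minimal single-agent configuration might look like the sketch below (the agent/component names and the netcat port are assumptions; the logger sink simply prints events to the console):

# example.conf - one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# netcat source: read lines arriving on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# in-memory channel that buffers events between source and sink
a1.channels.c1.type = memory
# logger sink: write events to the Flume log/console
a1.sinks.k1.type = logger
# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1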

3.1.3 Install and deploy Flume

  1. Upload the installation package to the appropriate directory on Linux
  2. Extract it
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/server
  3. Enter the Flume directory, modify conf/flume-env.sh, and configure JAVA_HOME
cd /opt/server/apache-flume-1.9.0-bin/conf
# First make a copy of the flume-env.sh.template file
cp flume-env.sh.template flume-env.sh
# Edit it
vim flume-env.sh
export JAVA_HOME=/opt/server/jdk1.8.0_131

3.2 Usage (collecting Nginx logs into HDFS)

3.2.1 Install Nginx

# Install Nginx
yum install epel-release
yum update
yum -y install nginx

# Manage the Nginx service
systemctl start nginx    # start nginx
systemctl stop nginx     # stop nginx
systemctl restart nginx  # restart nginx

Site log file location:

cd /var/log/nginx
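To generate a few access-log entries for testing, hit the default site a couple of times and check the log:

curl http://localhost/
curl http://localhost/
tail -n 5 /var/log/nginx/access.log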

3.2.2 Write the configuration file and start Flume

  1. Write a configuration file

To make Flume 1.9 compatible with Hadoop 3.1.0, delete guava-11.0.2.jar from the Flume lib folder and copy the Hadoop jars into it:

cp /opt/server/hadoop-3.1.0/share/hadoop/common/*.jar /opt/server/apache-flume-1.9.0-bin/lib
cp /opt/server/hadoop-3.1.0/share/hadoop/common/lib/*.jar /opt/server/apache-flume-1.9.0-bin/lib
cp /opt/server/hadoop-3.1.0/share/hadoop/hdfs/*.jar /opt/server/apache-flume-1.9.0-bin/lib
  2. Create a configuration file, taildir-hdfs.conf

Monitor log files in the /var/log/nginx directory

a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.filegroups = f1
# regular expressions are supported here
a3.sources.r3.filegroups.f1 = /var/log/nginx/access.log
# file used to record how far each monitored file has been read
a3.sources.r3.positionFile = /opt/server/apache-flume-1.9.0-bin/tail_dir.json
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://server:8020/user/tailDir
a3.sinks.k3.hdfs.fileType = DataStream
# roll the temporary file into a target file when it reaches about 128 MB (default: 1024 bytes); 0 disables size-based rolling
a3.sinks.k3.hdfs.rollSize = 134217700
# default: 10; roll the temporary file into a target file after this many events; 0 disables event-count-based rolling
a3.sinks.k3.hdfs.rollCount = 0
# time-based roll interval in seconds (default: 30)
a3.sinks.k3.hdfs.rollInterval = 10
# Flume rolls files automatically when it detects HDFS block replication, which makes the roll* settings ineffective; set this to 1 so that block replication does not trigger rolling
a3.sinks.k3.hdfs.minBlockReplicas = 1
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
  3. Start Flume
bin/flume-ng agent -c ./conf -f ./conf/taildir-hdfs.conf -n a3 -Dflume.root.logger=INFO,console
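Once Nginx receives requests and events flow through the agent, the collected data should appear under the HDFS path configured above:

hadoop fs -ls /user/tailDir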

4 Sqoop migration tool (between relational databases and the Hive data warehouse)

4.1 Introduction and installation

① Introduction

Sqoop is an Apache tool for transferring data between relational databases and Hadoop. In offline analysis it is used to move business data from MySQL into the Hive data warehouse; after the analysis in the warehouse produces results, Sqoop transfers them back to MySQL, where they can be displayed more intuitively as charts through a web front end with ECharts.

  • Sqoop has the concepts of import and export, with the Hadoop file system as the reference point. The relational database can be MySQL, Oracle, or DB2, and the Hadoop side can be HDFS, Hive, or HBase. Under the hood, Sqoop imports and exports are translated into MapReduce jobs for execution.


②Installation

Installing Sqoop requires that the Java and Hadoop environments are already in place.

  1. Upload the Sqoop installation package to the server (Sqoop 1.4.x, i.e. "sqoop1", is used here; the 1.99.x series is a separate branch)
  2. Extract it
tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /opt/server/
  3. Edit the configuration file
cd /opt/server/sqoop-1.4.7.bin__hadoop-2.6.0/conf
cp sqoop-env-template.sh sqoop-env.sh
vim sqoop-env.sh
# Add the following
export HADOOP_COMMON_HOME=/opt/server/hadoop-3.1.0
export HADOOP_MAPRED_HOME=/opt/server/hadoop-3.1.0
export HIVE_HOME=/opt/server/apache-hive-3.1.2-bin
  4. Add the MySQL JDBC driver jar
cd /opt/server/sqoop-1.4.7.bin__hadoop-2.6.0/lib
# upload mysql-connector-java-5.1.38.jar into this directory

4.2 Import data

# used to import data into HDFS
sqoop import (generic-args) (import-args)

Create the database mydb in MySQL and run the mydb.sql script to create the tables.

4.2.1 Full import

① Import a full MySQL table to HDFS
# The following command imports the emp table from the MySQL database server into HDFS
# --target-dir specifies the HDFS directory where the imported data is stored
./sqoop import \
--connect jdbc:mysql://192.168.80.100:3306/mydb \
--username root \
--password root \
--delete-target-dir \
--target-dir /sqoopresult \
--table emp --m 1

View imported data:

cd /opt/server/hadoop-3.1.0/bin
hadoop dfs -cat /sqoopresult/part-m-00000
② Use where to import a subset of data
# --where specifies a query condition for the import; matching rows are extracted from the database into HDFS.
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/sqoopdb \
--username root \
--password hadoop \
--where "city ='sec-bad'" \
--target-dir /wherequery \
--table emp_add \
--m 1
③ Use a select query to import a subset of data
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--target-dir /wherequery12 \
--query 'select id,name,deg from emp WHERE id>1203 and $CONDITIONS' \
--split-by id \
--fields-terminated-by '\t' \
--m 2

--split-by id is usually used together with the -m parameter: it specifies the column used to split the data and how many map tasks to start. Note the following when using a statement like the one above:

  • The --table parameter cannot be used together with a --query SQL statement
  • The query must contain a WHERE clause
  • The WHERE clause must include the $CONDITIONS token
  • The query string must be enclosed in single quotes

4.2.2 Incremental import

In practice you usually only need to import newly added data, so Sqoop supports incremental imports based on specified columns.

  • --check-column (col): specifies the column(s) used to detect incremental rows; an auto-increment field or a timestamp is usually chosen.
  • --incremental (mode): specifies how incremental data is determined:
    • append: used with --last-value; imports rows whose check-column value is greater than last-value
    • lastmodified: used with --last-value; appends rows modified after the date given by last-value
  • --last-value (value): the maximum value of the check column from the previous import; Sqoop jobs update this value automatically.
① Append mode incremental import

Incrementally import the rows whose empno is greater than 7934:

./sqoop import \
--connect jdbc:mysql://192.168.80.100:3306/mydb \
--username root  --password root \
--table emp --m 1 \
--target-dir /appendresult \
--incremental append \
--check-column empno \
--last-value  7934
②LastModified mode incremental import

Create example table:

create table test(
 id int,
 name varchar(20),
 last_mod timestamp default current_timestamp on update current_timestamp
);
-- last_mod is a timestamp column that defaults to the current time and is updated to the current time whenever the row is updated

Import incrementally using lastmodified mode:

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table customertest \
--target-dir /lastmodifiedresult \
--check-column last_mod \
--incremental lastmodified \
--last-value "2021-09-28 18:55:12" \
--m 1 \
--append

Note: in lastmodified mode, rows whose check column is greater than or equal to last-value are treated as incremental data.

  • When using lastmodified mode for incremental import, you need to specify whether the incremental data is imported in append mode (addition) or merge-key (merge) mode
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table customertest \
--target-dir /lastmodifiedresult \
--check-column last_mod \
--incremental lastmodified \
--last-value "2021-09-28 18:55:12" \
--m 1 \
--merge-key id

With merge-key, old rows that have changed are not imported as appended rows; they are imported as updates.

4.3 Export data

  1. When exporting data from the Hadoop ecosystem to the database, the database table must have been created.
  2. By default the export uses INSERT statements to add data to the target table; an update mode can also be specified, in which case Sqoop uses UPDATE statements to update the target rows.

4.3.1 Export by default

Because the default export uses INSERT statements, the export will fail if the target table has constraints (for example a primary key that must be unique) and those constraints are violated.

  • When exporting, you can specify all or some fields to be exported to the target table.
./sqoop export \
--connect jdbc:mysql://server:3306/userdb \
--username root \
--password root \
--table employee \
--export-dir /emp/emp_data
# export the data in the HDFS file /emp/emp_data into the MySQL table employee

When exporting, the following parameters can also be specified (an example combining them is sketched below):

  • --input-fields-terminated-by '\t': specifies the field separator of the HDFS files
  • --columns: without this parameter, the order and number of fields in the Hive table must match the MySQL table. If they differ, --columns controls which columns are exported; any column not listed after --columns must either have a default value or allow NULLs, otherwise the database will reject the data exported by Sqoop and the Sqoop job will fail.
  • --export-dir: the export directory; this parameter is mandatory for an export
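A sketch combining these parameters (the column list and separator are assumptions and must match your target table and HDFS files):

./sqoop export \
--connect jdbc:mysql://server:3306/userdb \
--username root \
--password root \
--table employee \
--columns "id,name,deg" \
--input-fields-terminated-by '\t' \
--export-dir /emp/emp_data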

4.3.2 updateonly mode

  • --update-key: the update key, i.e. the field (such as id) used to decide which rows to update; multiple key fields can be given, separated by commas.
  • --update-mode updateonly (the default mode): only existing records are updated; new records are not inserted.
bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root --password hadoop \
--table updateonly \
--export-dir /updateonly_2/ \
--update-key id \
--update-mode updateonly
# update existing rows by id; newly added rows are ignored

4.3.3 allowinsert mode

--update-mode allowinsert updates existing records and inserts new ones at the same time; it is essentially an insert-and-update (upsert) operation.

bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root --password hadoop \
--table allowinsert \
--export-dir /allowinsert_2/ \
--update-key id \
--update-mode allowinsert
# update existing rows by id and also export newly added rows

4.4 Sqoop Job

  • Create a job (--create):

Here we create a saved job that imports data from an RDBMS table into HDFS. The following command creates a job (named job_test1 in this example) that imports the emp table of the mydb database into HDFS.

./sqoop job --create job_test1 \
-- import \
--connect jdbc:mysql://192.168.80.100:3306/mydb \
--username root \
--password root \
--target-dir /sqoopresult333 \
--table emp --m 1
# there must be a space before "import" (i.e. between -- and import)
  2. Verify the job

The --list parameter lists saved jobs, and --delete removes one:

bin/sqoop job --list
bin/sqoop job --delete job_test6
  3. Execute the job
    The --exec option runs a saved job:
bin/sqoop job --exec myjob

5 Azkaban Scheduler


Azkaban is a batch workflow task scheduler open-sourced by LinkedIn, used to run a set of jobs and processes in a specific order within a workflow. Azkaban uses job configuration files to establish dependencies between tasks and provides an easy-to-use web UI to maintain and track your workflows.

5.1 Introduction and installation

5.1.1 Introduction


  • MySQL server: stores metadata such as project names, descriptions, permissions, task status, and SLA rules.
  • AzkabanWebServer: provides the external web service so that users can manage everything through web pages; its responsibilities include project management, authorization, task scheduling, and monitoring of executors.
  • AzkabanExecutorServer: Responsible for the submission and execution of specific workflows.

5.1.2 Installation

# Upload the installation package to the target directory and extract it
tar -zxvf azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C /opt/server
# Edit azkaban.properties in the conf directory and change the time zone to Asia/Shanghai
vim conf/azkaban.properties

default.timezone.id=Asia/Shanghai

Modify the commonprivate.properties file in the plugins/jobtypes directory and turn off the memory check. Azkaban requires 3 GB of memory by default, and an exception is thrown if there is not enough free memory.

vim plugins/jobtypes/commonprivate.properties
# Add
memCheck.enabled=false

Start and verify; the start/stop scripts must be run from the azkaban-solo-server-0.1.0-SNAPSHOT/ directory:

bin/start-solo.sh

The process is AzkabanSingleServer (in Azkaban solo-server mode, the Exec Server and the Web Server run in the same process).

Log in to the web console in a browser at http://server:8081; the default username and password are both azkaban.

5.2 Usage

5.2.1 Executing a single task

Create a job description file, name it first.job, and add the following content:

#first.job
type=command
command=echo 'hello world'

Package the job resource files into a zip file; Azkaban only accepts workflow uploads as zip files. The zip should contain the .job files needed to run the workflow, and each job name must be unique within the project.

Note: the zip package should not contain nested directories; it should only contain one or more xx.job files.
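For example, the package could be created like this (assuming the zip tool is installed; the archive name is arbitrary):

zip first.zip first.job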

  1. Create a project through Azkaban's web management console and upload the zip package of the job
  2. Click to execute the workflow
  3. View the execution results
  4. Click Details on the right to view the execution details

5.2.2 Executing multiple tasks

  1. Create multiple tasks with dependencies; first create start.job
#start.job
type=command
command=touch /opt/server/web.log
  2. Then create a.job, which depends on start.job
#a.job
type=command
dependencies=start
command=echo "hello a job"
  3. Create end.job, which depends on a.job and b.job (b.job is not shown here; a sketch follows after this list)
#end.job
type=command
dependencies=a,b
command=echo "end job"
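b.job is referenced by end.job but was not shown above; following the same pattern as a.job, a minimal version might be:

#b.job
type=command
dependencies=start
command=echo "hello b job"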

Put all job resource files into a zip package, create a project on the azkaban web management interface and upload the zip package
Execute the workflow from the web console and view the run results.

Azkaban can also be used for:

  1. Scheduling Java programs
  2. Scheduling HDFS and MR tasks
  3. Scheduled (timed) task scheduling, for example running Sqoop jobs (see the sketch below)
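For example, a hypothetical job file that triggers the Sqoop job created in section 4.4 might look like this (the Sqoop path and job name follow the earlier examples and are assumptions for your environment; note that a saved Sqoop job may prompt for the database password unless Sqoop is configured to store it):

#sqoop-import.job
type=command
command=/opt/server/sqoop-1.4.7.bin__hadoop-2.6.0/bin/sqoop job --exec job_test1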

6 Hands-On: Commodity Sales Data Analysis

6.1 Preparations

lab environment:

Hadoop == 3.1.0
CentOS == 8
Hive == 3.1.2

  1. Existing merchandise sales orders are as follows:


The fields from left to right are: order number, sale date, province, city, product number, quantity sold, and sales amount

  2. The product details are as follows:

The fields from left to right are: commodity number, commodity name, category number, category name, commodity price

6.2 Table creation and data loading

① Create the tables

Sales order form:

create table t_dml (
detail_id bigint,
sale_date date, 
province string,
city string,
product_id bigint,
cnt bigint,
amt double
)row format delimited
fields terminated by ',';

Commodity detail table:

create table t_product (
 product_id bigint,
 product_name string,
 category_id bigint,
 category_name string,
 price double
)row format delimited
fields terminated by ',';

② Load data

load data local inpath '/opt/data/t_dml.csv' into table t_dml;
load data local inpath '/opt/data/t_product.csv' into table t_product;
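A quick sanity check after loading (the exact counts depend on your data files):

select count(*) from t_dml;
select * from t_product limit 5;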

6.3 Sales Data Analysis

①Query the time period of sales records in t_dml:

select max(sale_date), min(sale_date) from t_dml;

②Query the total sales of each product category

select t.category_name, sum(t.amt) as total_money
from
( select a.product_id, a.amt, b.category_name
from t_dml a
join t_product b
on a.product_id=b.product_id
) t
group by t.category_name;

③Query the sales ranking list

The store owner wants to know which products sell best and how they rank. Query the top 10 products by quantity sold, showing the product name, quantity sold, and rank.

select a.product_name , t.cnt_total,
rank() over (order by t.cnt_total desc) as rk
from
( select product_id, sum(cnt) as cnt_total
from t_dml
group by product_id
order by cnt_total desc
limit 10
) t
join t_product a
on t.product_id=a.product_id;

6.4 Create an intermediate table

The store owner wants to know the purchasing power of each city and which products are most popular in each area; intermediate tables are created to optimize these queries.

① Create intermediate tables for storing the results

create table t_city_amt
( province string,
city string,
total_money double
);
create table t_city_prod
( province string,
city string,
product_id bigint,
product_name string,
cnt bigint
);

②Insert data

insert into t_city_amt
select province,city,sum(amt)
from t_dml group by province,city;

insert into t_city_prod
select t.province,t.city,t.product_id,t.product_name,sum(t.cnt) from
(
select a.product_id,b.product_name,a.cnt,a.province,a.city
from t_dml a join t_product b
on a.product_id = b.product_id
) t
group by t.province,t.city,t.product_id,t.product_name;

③ Optimization: use Hive's multi-insert syntax, which scans the source table only once while populating both result tables in a single job

from
( select a.*, b.product_name
from t_dml a
join t_product b
on a.product_id=b.product_id
) t
insert overwrite table t_city_amt
select province, city, sum(amt)
group by province, city
insert overwrite table t_city_prod
select province, city, product_id, product_name, sum(cnt)
group by province, city, product_id, product_name;

6.5 Statistical Indicators

①Statistics of regions with the strongest purchasing power in each province

select province, city, total_money
from
(
 select province, city, total_money,
dense_rank() over (partition by province order by total_money desc) as rk
 from t_city_amt
) t
where t.rk=1
order by total_money desc;

②Statistics of best-selling products in each region

select province, city, product_id, product_name
from
( select province, city, product_id, product_name,
dense_rank() over (partition by province order by cnt desc) as rk
from t_city_prod
) t
where t.rk=1
order by province, city;

6.6 Display the data with charting tools such as ECharts

Omitted; in general, the front end queries the results from the back-end database and renders them on the page with ECharts or another charting tool.


Hands-on: website access log analysis

Temporarily omitted.


Origin blog.csdn.net/weixin_45565886/article/details/131272686