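Contents

Data Interaction Tool -- HUE
Data Collection Tool -- Flume

Data Interaction Tool -- HUE

Part 1: Hue Overview

Hue (Hadoop User Experience) is an open-source web UI for Apache Hadoop. It evolved from Cloudera Desktop, was contributed to the open-source community by Cloudera, and is built on Django, the Python web framework. With Hue you can interact with a Hadoop cluster from a web console in the browser to analyze and process data, for example to work with data on HDFS or to run MapReduce jobs. Hue supports the following features:

Uses the lightweight SQLite database by default to manage session data, user authentication, and authorization; MySQL, PostgreSQL, or Oracle can be configured instead
Access to HDFS through the File Browser
A Hive editor for developing and running Hive queries
Solr-based search applications, with visual data views and dashboards
Interactive queries through Impala-based applications
Spark editor and dashboard
Pig editor, with the ability to submit script jobs
Oozie editor; Workflows, Coordinators, and Bundles can be submitted and monitored from the dashboard
HBase browser for visualizing, querying, and modifying HBase tables
Metastore browser for accessing Hive metadata and HCatalog
Job browser for accessing MapReduce jobs (MR1/MR2-YARN)
Job designer for creating MapReduce/Streaming/Java jobs
Sqoop 2 editor and dashboard
ZooKeeper browser and editor
Query editors for MySQL, PostgreSQL, SQLite, and Oracle databases

In short: Hue is a friendly UI integration framework. It brings together the frameworks covered so far and those still to come, so that all of them can be inspected and run from a single interface.
A similar product is Apache Zeppelin.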

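Part 2: Building and Installing Hue

Hue website: https://gethue.com/
Hue user manual: https://docs.gethue.com/
Official installation guide: https://docs.gethue.com/administrator/installation/install/
Hue downloads: https://docs.gethue.com/releases/

Installing Hue is not trivial. There is no official prebuilt package, so you need to download the source code from GitHub, install the dependencies, and then compile and install it. The steps below walk through downloading, building, and installing Hue.

Preferably install Hue on a node where MySQL has never been installed; otherwise version conflicts may occur. Here Hue is installed on linux122.

1. Download, upload, and extract the packages (hue-release-4.3.0.zip, apache-maven-3.6.3-bin.tar.gz)
2. Install the dependencies
3. Install Maven
4. Build Hue
5. Modify the Hadoop configuration
6. Modify the Hue configuration
7. Start the Hue service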

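2.1 Download the Package

Download hue-release-4.3.0.zip from the official website, upload it to the server, and extract it.

# Run in the directory that contains the package
yum install unzip
unzip hue-release-4.3.0.zip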

2.2 Install Dependencies

# Python is required (Python 2.7+ / Python 3.5+)
python --version

# Install the libraries needed to build Hue on CentOS
yum install ant asciidoc cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-plain gcc gcc-c++ krb5-devel libffi-devel libxml2-devel libxslt-devel make mysql mysql-devel openldap-devel python-devel sqlite-devel gmp-devel

yum install -y rsync

Notes:
The dependencies above apply only to CentOS/RHEL 7.x; for other systems see https://docs.gethue.com/administrator/installation/dependencies/
Preferably install Hue on a node where MySQL has never been installed; otherwise version conflicts may occur
The installation needs network access; a poor connection causes all sorts of odd problems

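2.3 Install Maven

Building Hue also requires Maven, so install it before compiling.

Download apache-maven-3.6.3-bin.tar.gz, upload it to the virtual machine, extract it, and add the environment variables

vi /etc/profile

# Add the environment variables
export MAVEN_HOME=/opt/lagou/servers/apache-maven-3.6.3
export PATH=$PATH:$MAVEN_HOME/bin

source /etc/profile

# Verify the installation
mvn --version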

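2.4 Build

# Enter the Hue source directory and build. PREFIX specifies where Hue is installed
cd /opt/lagou/software/hue-release-4.3.0
PREFIX=/opt/lagou/servers make install
cd /opt/lagou/servers

# If you later move Hue to a different location, you must run the commands below afterwards,
# because Hue uses absolute paths for some of its Python packages.
# Do not run them at this point.
rm app.reg
rm -r build
make apps

Note: this step takes a long time and downloads JARs from the internet, so network access is required.
If the build reports errors while downloading dependencies, switch Maven to the Aliyun mirror by adding the following inside <mirrors> in Maven's settings.xml:

<mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
</mirror>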

2.5 Modify the Hadoop Configuration Files

Add the following to hdfs-site.xml:

<!-- HUE -->
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>

Add the following to core-site.xml:

<!-- HUE -->
<property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hue.groups</name>
    <value>*</value>
</property>
 
<property>
    <name>hadoop.proxyuser.hdfs.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hdfs.groups</name>
    <value>*</value>
</property>

Create an httpfs-site.xml file with the following configuration:

<configuration>
    <!-- HUE -->
    <property>
        <name>httpfs.proxyuser.hue.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>httpfs.proxyuser.hue.groups</name>
        <value>*</value>
    </property>
</configuration>

Note: after changing the HDFS configuration, copy the files to every machine in the cluster with scp and restart the HDFS service.

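2.6 Configure Hue

# Enter the Hue installation directory
cd /opt/lagou/servers/hue

# Enter the configuration directory
cd desktop/conf

# Copy the Hue configuration template and edit the copy
cp pseudo-distributed.ini.tmpl pseudo-distributed.ini
vim pseudo-distributed.ini

Configure it as follows: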

# [desktop]
http_host=linux122
http_port=8000
is_hue_4=true
time_zone=Asia/Shanghai
dev=true
server_user=hue
server_group=hue
default_user=hue

 

# Around line 211: disable the Solr search app to avoid errors
app_blacklist=search 

 

# [[database]]: Hue stores its metadata in SQLite by default; switch to MySQL

engine=mysql
host=linux123
port=3306
user=hive
password=12345678
name=hue

# Around line 1003: path to the Hadoop configuration files
hadoop_conf_dir=/opt/lagou/servers/hadoop-2.9.2/etc/hadoop

# Create the hue database in MySQL to hold the metadata
mysql -uhive -p12345678
mysql> create database hue;

# Initialize the database; run from the Hue root directory
cd /opt/lagou/servers/hue
build/env/bin/hue syncdb
build/env/bin/hue migrate

# Check the data

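2.7 Start the Hue Service

# Add the hue user and group
groupadd hue
useradd -g hue hue

# Run from the Hue installation directory
build/env/bin/supervisor

Open linux122:8000 in a browser. If the Hue login page appears, the installation succeeded.

On first access you need to set the superuser name and password. Remember them (hue/12345678).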

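Part 3: Integrating Hue with Hadoop and Hive

Edit the configuration file /opt/lagou/servers/hue/desktop/conf/pseudo-distributed.ini

3.1 Integrating HDFS and YARN

# Line 211: Solr is not installed, so disable the search app; otherwise errors are reported continuously
app_blacklist=search

# [hadoop] -- [[hdfs_clusters]] -- [[[default]]]
# Mind the port number. Keep only one of the following two lines
# fs_defaultfs=hdfs://linux121:8020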
fs_defaultfs=hdfs://linux121:9000

webhdfs_url=http://linux121:50070/webhdfs/v1

 

# Around line 1003
hadoop_conf_dir=/opt/lagou/servers/hadoop-2.9.2/etc/hadoop

 

# [hadoop] -- [[yarn_clusters]] -- [[[default]]]

resourcemanager_host=linux123

resourcemanager_port=8032

submit_to=True

resourcemanager_api_url=http://linux123:8088

proxy_api_url=http://linux123:8088

history_server_api_url=http://linux123:19888

3.2 Integrating Hive

Hive integration requires the HiveServer2 service; start HiveServer2 on the linux123 node

# [beeswax]
hive_server_host=linux123
hive_server_port=10000
hive_conf_dir=/opt/lagou/servers/hive-2.3.7/conf

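3.3 Integrating MySQL

# [librdbms] -- [[databases]] -- [[[mysql]]]; around line 1639
# Note: the section header on line 1639 is commented out as ## [[[mysql]]]; remove the two # characters so that it reads [[[mysql]]]
[[[mysql]]]
nice_name="My SQL DB"
name=hue
engine=mysql
host=linux123
port=3306
user=hive
password=12345678

Note: name is the name of the database.

3.4 Restart the Hue Service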

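Data Collection Tool -- Flume

Part 1: Flume Overview

1. Overview (what it is, architecture, topologies, internals)
2. Installation and configuration
3. Applications (basic, advanced)

Whatever organization the data comes from and whatever its volume, deploying Flume helps ensure that the data reaches the big data platform safely and promptly, so that users can concentrate on extracting insight from it.

Section 1: Definition of Flume

Features:

Distributed: Flume is deployed as a distributed cluster and scales well
Reliable: when a node fails, logs can be delivered to other nodes without being lost
Ease of use: configuration is fairly involved and demands considerable technical skill from users
Real-time collection: Flume collects data in real time in a streaming fashion
Typical use: real-time collection of log files.

Other data collection tools include DataX, Kettle, Logstash, Scribe, and Sqoop.
DataX is Alibaba's open-source tool for offline synchronization between heterogeneous data sources. It provides stable, efficient synchronization among sources such as relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
Features:

Ease of use: no GUI; it runs as scripts, which demands more technical skill from users
Performance: high data extraction throughput
Deployment: can be deployed standalone
Typical use: high-speed data exchange between heterogeneous databases and file systems

Kettle is an open-source ETL tool. It extracts, transforms, and transfers data from databases, FTP, files, REST interfaces, HDFS, Hive, and other platforms. It is written in Java and is cross-platform, uses a client/server architecture, and does not support a browser mode.
Features:

Ease of use: a visual designer makes it simple to operate
Powerful: besides transferring data, it can clean and transform it along the way
Many sources: supports a wide range of databases, FTP, files, REST interfaces, HDFS, Hive, and more
Easy deployment: deployed standalone, with no dependency on third-party products
Typical use: modest data volumes and growth, fast-changing business rules, a need for visual operation, and a low technical barrier for staff.

Logstash is a platform for transporting, processing, managing, and searching application logs and events. It can be used to collect and manage application logs centrally, and it provides a web interface for queries and statistics.

Scribe is Facebook's open-source log collection system. It collects logs from a variety of sources and stores them in a central storage system (NFS, a distributed file system, and so on) for centralized statistics and analysis.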

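Section 2: Flume Architecture

Components in the Flume Architecture

Section 3: Flume Topologies

Serial mode

Multiple Flume agents are chained in sequence, from the first source to the storage system that the final sink delivers to.
Avoid chaining too many agents in this mode: more hops slow down the transfer, and if any agent in the chain goes down, the whole pipeline is affected.

Replication mode (single Source, multiple Channels and Sinks)

Events flow to one or more destinations. The source data is replicated into multiple channels, every channel holds the same data, and each sink can deliver to a different destination.

Load-balancing mode (single Source and Channel, multiple Sinks)

Multiple sinks are logically grouped into a sink group, and Flume distributes data across the sinks. This mainly addresses load balancing and failover.

Aggregation mode

This is the most common and most practical mode. Web applications typically run on hundreds of servers, and large ones on thousands or even tens of thousands, so the logs they produce are hard to handle. This topology solves the problem well: each server runs a Flume agent that collects its logs and forwards them to a central collecting agent, which then uploads them to HDFS, Hive, HBase, or a message queue.

Section 4: Flume Internals

Overall data flow: Source => Channel => Sink
Channel: processor, interceptor, selector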

Part 2: Installation and Deployment

Flume website: http://flume.apache.org/
Documentation: http://flume.apache.org/FlumeUserGuide.html
Downloads: http://archive.apache.org/dist/flume/
Version used here: 1.9.0

Installation steps:

1. Download apache-flume-1.9.0-bin.tar.gz and upload it to the /opt/lagou/software directory on linux123
2. Extract apache-flume-1.9.0-bin.tar.gz into /opt/lagou/servers/ and rename the directory to flume-1.9.0
3. Add the environment variables to /etc/profile, then run source /etc/profile to apply the change

export FLUME_HOME=/opt/lagou/servers/flume-1.9.0
export PATH=$PATH:$FLUME_HOME/bin

4. Rename flume-env.sh.template in $FLUME_HOME/conf to flume-env.sh and add the JAVA_HOME setting

cd $FLUME_HOME/conf
mv flume-env.sh.template flume-env.sh
vi flume-env.sh
export JAVA_HOME=/opt/lagou/servers/jdk1.8.0_231

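Part 3: Basic Applications

Flume supports many kinds of data sources, such as directories, HTTP, and Kafka. It provides the Source component to collect data from them.
Common Sources:
(1) avro source: listens on an Avro port and receives event streams from external Avro clients. It receives Avro-serialized data, deserializes it, and passes it on, so the incoming data must already be Avro-serialized. Avro sources make multi-hop, fan-out, and fan-in flows possible. They also receive log data sent by the Avro client that ships with Flume.

Avro is a data serialization system from the Hadoop ecosystem. It was developed by Doug Cutting, the creator of Hadoop (and of projects such as Lucene and Nutch), for applications that exchange large volumes of data. Its main features: it supports binary serialization, so large amounts of data can be processed conveniently and quickly; and it is friendly to dynamic languages, which can handle Avro data easily through the mechanisms Avro provides.

(2) exec source: uses the output of a command as the source, for example ping 192.168.234.163 or tail -f hive.log.
(3) netcat source: listens on a specified port and receives the data that arrives on it.
(4) spooling directory source: files are placed into a "spooling" directory. Flume keeps watching this directory and treats the files as the source. Note: once a file has been placed in the directory it must not be modified, or Flume reports an error. File names must also be unique.
(5) Taildir Source (since 1.7): watches multiple specified files and, as soon as new data is written to any of them, passes it on through the channel to the configured sink. This source is reliable and does not lose data. It does nothing to the files it tracks: no renaming, no deletion, no modification. It currently does not support Windows, cannot read binary files, and reads text files line by line.

Collected logs need to be buffered; Flume provides the Channel component for this.
Common Channels:
(1) memory channel: buffers in memory (the most commonly used)
(2) file channel: buffers in files
(3) JDBC channel: buffers in a relational database via JDBC
(4) kafka channel: buffers in Kafka

Buffered data eventually has to be stored; Flume provides the Sink component for this.
Common Sinks:
(1) logger sink: displays events on the console; mainly used for testing
(2) avro sink: takes events from the configured channel in batches of the configured size, converts them to Avro events, and sends them to the configured hostname/port
(3) null sink: discards all events it receives
(4) HDFS sink: writes events to HDFS. It can create text and sequence files, supports compression for both file types, and can roll files periodically based on elapsed time, data size, or number of events
(5) Hive sink: streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions; once a batch of events is committed to Hive, it is immediately visible to Hive queries
(6) HBase sink: stores events in HBase
(7) kafka sink: stores events in Kafka

Log collection comes down to choosing a suitable Source, Channel, and Sink for the business requirement and combining them.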

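Section 1: Getting-Started Example

Flume documentation in Chinese:
https://flume.liyifeng.org/
Requirement: listen on port 8888 of the local machine and have Flume display the received data on the console in real time
Analysis:

The telnet tool can send data to port 8888
To listen on a port, use the netcat source
For the channel, use memory
To display data in real time, use the logger sink

Steps:

1. Install telnet

yum install telnet

2. Check whether port 8888 is in use. If it is, pick another port for this exercise

lsof -i:8888

3. Create the Flume agent configuration file flume-netcat-logger.conf
in the /opt/lagou/servers/flume-1.9.0/conf directory

[root@linux123 conf]# vim flume-netcat-logger.conf

# a1 is the agent name; the source, channel, and sink are named r1, c1, and k1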
a1.sources = r1
a1.channels = c1
a1.sinks = k1
 
# source
a1.sources.r1.type = netcat
a1.sources.r1.bind = linux123
a1.sources.r1.port = 8888
 
# channel :
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100
 
# sink
a1.sinks.k1.type = logger
 
# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

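The Memory Channel is a Channel implementation that buffers Events in memory. It is fast, but its capacity is limited by the JVM heap size and it is not very reliable. It suits log collection workloads that can tolerate some data loss but need high performance.

4. Start the Flume agent

$FLUME_HOME/bin/flume-ng agent --name a1 \
--conf-file $FLUME_HOME/conf/flume-netcat-logger.conf \
-Dflume.root.logger=INFO,console

--name: the agent name; it must match the name used in the configuration file
--conf-file: the location of the configuration file
-D sets a Java system property at startup; here it overrides flume.root.logger so that logs are printed to the console at INFO level. Log levels include debug, info, warn, and error

5. Use telnet to send messages to port 8888 on the local machine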

telnet linux123 8888
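
If telnet is not available, a few lines of Python can send the same test data. This is a minimal sketch rather than part of the original walkthrough: the function name send_lines is hypothetical, and the host and port are simply whatever a1.sources.r1.bind and a1.sources.r1.port are set to.

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Send each string as one newline-terminated line over a TCP connection.

    The netcat source is expected to turn every received line into one event.
    Returns the number of lines sent.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for line in lines:
            conn.sendall(line.encode("utf-8") + b"\n")
    return len(lines)
```

With the agent from step 4 running, calling send_lines("linux123", 8888, ["hello world"]) should have the same effect as typing hello world into the telnet session.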

6. Check the received data in the console where the Flume agent is running

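Section 2: Collecting Log File Data to HDFS

Requirement: monitor a local log file and upload its content to HDFS in real time
Analysis:

The tail -F command picks up whatever is appended to the local log file
source: exec. The exec source runs a specified command and uses its output as the data source. If the agent process dies and is restarted, data may be lost
channel: memory
sink: HDFS

tail -f
Equivalent to --follow=descriptor: follows the file by file descriptor, so tracking stops when the file is renamed or deleted

tail -F
Equivalent to --follow=name --retry: follows the file by name and keeps retrying, so if the file is deleted or renamed and a file with the same name is created again, tracking continues

Steps:
1. Prepare the environment. To write to HDFS, Flume needs the relevant Hadoop JARs. Copy commons-configuration-1.6.jar, hadoop-auth-2.9.2.jar, hadoop-common-2.9.2.jar, hadoop-hdfs-2.9.2.jar, commons-io-2.4.jar, and htrace-core4-4.1.0-incubating.jar into $FLUME_HOME/lib

# These files can be found in $HADOOP_HOME/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib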
cd $HADOOP_HOME/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib
cp commons-configuration-1.6.jar $FLUME_HOME/lib/
cp hadoop-auth-2.9.2.jar $FLUME_HOME/lib/
cp hadoop-common-2.9.2.jar $FLUME_HOME/lib/
cp hadoop-hdfs-2.9.2.jar $FLUME_HOME/lib/
cp commons-io-2.4.jar $FLUME_HOME/lib/
cp htrace-core4-4.1.0-incubating.jar $FLUME_HOME/lib/

2. Create the configuration file.

flume-exec-hdfs.conf:

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2
# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /tmp/root/hive.log
# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 10000
a2.channels.c2.transactionCapacity = 500
# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://linux121:9000/flume/%Y%m%d/%H%M
# Prefix of the uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Use the local timestamp to resolve the time escapes in the path
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Flush to HDFS once 500 Events have accumulated
a2.sinks.k2.hdfs.batchSize = 500
# File type; compression is supported. DataStream means no compression
a2.sinks.k2.hdfs.fileType = DataStream
# Roll the file every 60 seconds
a2.sinks.k2.hdfs.rollInterval = 60
# Roll the file when it reaches roughly 128 MB
a2.sinks.k2.hdfs.rollSize = 134217700
# Do not roll based on the number of Events
a2.sinks.k2.hdfs.rollCount = 0
# Minimum number of block replicas
a2.sinks.k2.hdfs.minBlockReplicas = 1
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
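
To see which directory a given event ends up in, the time escapes in hdfs.path can be expanded by hand. The sketch below is only an illustration: resolve_hdfs_path is a hypothetical helper, and it relies on the fact that the five escapes used here (%Y, %m, %d, %H, %M) have the same meaning in Python's strftime. Other Flume escape sequences are not covered.

```python
from datetime import datetime


def resolve_hdfs_path(template, when):
    """Expand the %Y, %m, %d, %H and %M escapes for a given point in time."""
    return when.strftime(template)


path = "hdfs://linux121:9000/flume/%Y%m%d/%H%M"
print(resolve_hdfs_path(path, datetime(2020, 8, 1, 9, 5)))
# hdfs://linux121:9000/flume/20200801/0905
```

Because the path contains %M, events that arrive in different minutes land in different directories.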

3. Start the agent

$FLUME_HOME/bin/flume-ng agent --name a2 \
--conf-file ~/conf/flume-exec-hdfs.conf \
-Dflume.root.logger=INFO,console

4. Start Hadoop and Hive, and run Hive commands to generate log output

start-dfs.sh
start-yarn.sh
# Run several times on the command line
hive -e "show databases"

5. Check the files on HDFS

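Section 3: Collecting a Monitored Directory to HDFS

Requirement: monitor a directory and upload the collected data to HDFS
Analysis:

source: spooldir. spooldir guarantees that no data is lost and can resume where it left off, but its latency is higher and it cannot monitor in real time
channel: memory
sink: HDFS

The spooldir Source watches a directory: whenever a new file is added to it, the source picks the file up, parses its contents, and writes them to the channel. Once the file has been fully processed it is marked as done by appending the .COMPLETED suffix to its name. Although the whole directory is monitored automatically, only new files are detected: if content is appended to a file that has already been processed, the source does not notice it. Note that:

Files copied into the spool directory must not be opened and edited afterwards
Changes in subdirectories are not monitored
The monitored directory is scanned for changes every 500 milliseconds
It is suited to synchronizing new files, not to tailing files that are appended to in real time

1. Create the configuration file flume-spooldir-hdfs.conf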

# Name the components on this agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /root/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore files ending in .tmp; they are not uploaded
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 10000
a3.channels.c3.transactionCapacity = 500
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://linux121:9000/flume/upload/%Y%m%d/%H%M
# Prefix of the uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Use the local timestamp to resolve the time escapes in the path
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Flush to HDFS once 500 Events have accumulated
a3.sinks.k3.hdfs.batchSize = 500
# File type
a3.sinks.k3.hdfs.fileType = DataStream
# Roll the file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll the file when it reaches roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Do not roll based on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum number of block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
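
Before dropping files into /root/upload it can help to check which names the ignorePattern above would skip. The following is a rough sketch, assuming the pattern has to match the whole file name; is_picked_up is a hypothetical helper, not a Flume API.

```python
import re

IGNORE_PATTERN = re.compile(r"([^ ]*\.tmp)")   # same regex as ignorePattern
COMPLETED_SUFFIX = ".COMPLETED"                 # same value as fileSuffix


def is_picked_up(file_name):
    """Return True if a file in the spool directory would be ingested."""
    if file_name.endswith(COMPLETED_SUFFIX):
        return False  # already processed and renamed
    if IGNORE_PATTERN.fullmatch(file_name):
        return False  # matches the ignore pattern
    return True


for name in ["app.log", "app.log.tmp", "app.log.COMPLETED", "my file.tmp"]:
    print(name, is_picked_up(name))
```

Note that [^ ]* does not match spaces, so under this assumption a name such as "my file.tmp" is not ignored.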

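2. Start the agent

$FLUME_HOME/bin/flume-ng agent --name a3 \
--conf-file ~/conf/flume-spooldir-hdfs.conf \
-Dflume.root.logger=INFO,console

3. Add files to the upload directory
4. Check the data on HDFS

HDFS Sink
Other important settings:

To keep the HDFS sink from producing small files, consider the following parameter settings: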

a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.path=hdfs://linux121:9000/flume/events/%Y/%m/%d/%H/%M
a1.sinks.k1.hdfs.minBlockReplicas=1
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.idleTimeout=0

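Section 4: Collecting Log File Data to HDFS and the Local File System

Requirement: monitor log files and deliver the collected data to both HDFS and the local file system
Analysis:

Several agents need to be cascaded
source: taildir
channel: memory
final sinks: hdfs and file_roll

taildir Source: a Source added in Flume 1.7.0 that is roughly equivalent to spooldir source + exec source. It can watch multiple directories and uses regular expressions to match the file names in them for real-time collection. It tails a set of files in real time and records the latest read position of each one, so no data is lost when the agent process restarts.
It currently does not work on Windows. It does nothing to the files it tracks: no renaming, no deletion, no modification. It cannot read binary files and reads text files line by line.
Steps:
1. Create the first configuration file
flume-taildir-avro.conf contains:

1 taildir source
2 memory channels
2 avro sinks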
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
 
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating
 
# source
a1.sources.r1.type = taildir
 
# Records the latest read position of each file
a1.sources.r1.positionFile = /root/flume/taildir_position.json 
a1.sources.r1.filegroups = f1
 
# Note: .*log is a regular expression; writing *.log here would be wrong
a1.sources.r1.filegroups.f1 = /tmp/root/.*log
 
# sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux123 
a1.sinks.k1.port = 9091
 
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux123 
a1.sinks.k2.port = 9092
 
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 500
 
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 500
 
# Bind the source and sink to the channel 
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
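
The difference between the regular expression .*log and the shell glob *.log is easy to check outside Flume. The snippet below uses Python's re module as a stand-in for Java regular expressions, assuming the pattern is matched against the whole file name; matching_files and the sample names are made up for illustration.

```python
import re


def matching_files(names, regex):
    """Return the names that the regular expression matches in full."""
    pattern = re.compile(regex)
    return [name for name in names if pattern.fullmatch(name)]


names = ["hive.log", "hive.log.2020-08-01", "catalog", "notes.txt"]
print(matching_files(names, r".*log"))

try:
    matching_files(names, "*.log")
except re.error as exc:
    # A leading * has nothing to repeat, so the glob is not a valid regex.
    print("invalid regex:", exc)
```

Under this assumption .*log also matches a name such as catalog; a stricter pattern like .*\.log avoids that.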

2. Create the second configuration file
flume-avro-hdfs.conf contains:

1 avro source
1 memory channel
1 hdfs sink

# Name the components on this agent 
a2.sources = r1 
a2.sinks = k1 
a2.channels = c1
 
# Describe/configure the source 
a2.sources.r1.type = avro 
a2.sources.r1.bind = linux123 
a2.sources.r1.port = 9091
 
# Describe the channel 
a2.channels.c1.type = memory 
a2.channels.c1.capacity = 10000 
a2.channels.c1.transactionCapacity = 500
 
# Describe the sink 
a2.sinks.k1.type = hdfs 
a2.sinks.k1.hdfs.path = hdfs://linux121:9000/flume2/%Y%m%d/%H 
 
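# Prefix of the uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-

# Use the local timestamp to resolve the time escapes in the path
a2.sinks.k1.hdfs.useLocalTimeStamp = true

# Flush to HDFS once 500 Events have accumulated
a2.sinks.k1.hdfs.batchSize = 500

# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream

# Start a new file every 60 seconds
a2.sinks.k1.hdfs.rollInterval = 60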
a2.sinks.k1.hdfs.rollSize = 0 
a2.sinks.k1.hdfs.rollCount = 0 
a2.sinks.k1.hdfs.minBlockReplicas = 1
 
# Bind the source and sink to the channel 
a2.sources.r1.channels = c1 
a2.sinks.k1.channel = c1

3. Create the third configuration file
flume-avro-file.conf contains:

1 avro source
1 memory channel
1 file_roll sink

# Name the components on this agent 
a3.sources = r1 
a3.sinks = k1 
a3.channels = c2
 
# Describe/configure the source 
a3.sources.r1.type = avro 
a3.sources.r1.bind = linux123 
a3.sources.r1.port = 9092
 
# Describe the sink 
a3.sinks.k1.type = file_roll
 
# The directory must be created in advance
a3.sinks.k1.sink.directory = /root/flume/output
 
# Describe the channel 
a3.channels.c2.type = memory 
a3.channels.c2.capacity = 10000 
a3.channels.c2.transactionCapacity = 500
 
# Bind the source and sink to the channel 
a3.sources.r1.channels = c2 
a3.sinks.k1.channel = c2

4. Start the three agents (downstream agents first)

$FLUME_HOME/bin/flume-ng agent --name a3 \
--conf-file ~/conf/flume-avro-file.conf \
-Dflume.root.logger=INFO,console &
 
$FLUME_HOME/bin/flume-ng agent --name a2 \
--conf-file ~/conf/flume-avro-hdfs.conf \
-Dflume.root.logger=INFO,console &
 
$FLUME_HOME/bin/flume-ng agent --name a1 \
--conf-file ~/conf/flume-taildir-avro.conf \
-Dflume.root.logger=INFO,console &

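5. Run a hive command to generate log output

hive -e "show databases"

6. Check the HDFS files, the local files, and the position file

Comparison of the three Sources for monitoring log files:
exec Source: suited to tailing a file that is appended to in real time, but cannot guarantee that no data is lost;
spooldir Source: guarantees that no data is lost and can resume where it left off, but has higher latency and cannot monitor in real time;
taildir Source: can resume where it left off, guarantees that no data is lost, and monitors in real time.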

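Part 4: Advanced Features

Section 1: Interceptors

Flume can modify or drop events at runtime; this is done with interceptors.

An interceptor in Flume is a class that implements the org.apache.flume.interceptor.Interceptor interface.

Depending on its configuration, an interceptor can modify or even drop events.

Flume also supports chaining interceptors: simply configure several of them in the configuration file.

Interceptors run in the order in which they are configured, and each Event passes through them in that order.

Timestamp Interceptor

This interceptor adds a timestamp attribute to the header of every event. The key defaults to "timestamp" (it can be customized in the interceptor's configuration), and the value is the current time in milliseconds (obtained with System.currentTimeMillis()). If the event already has an attribute with the same name, you can choose whether to keep the original value.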
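
The behaviour described above fits in a few lines. This is a simplified Python model rather than Flume's Java implementation: events are plain dicts, and the names timestamp_interceptor, preserve_existing, and header are chosen here for illustration.

```python
import time


def timestamp_interceptor(event, preserve_existing=False, header="timestamp"):
    """Add the current time in milliseconds to the event headers."""
    headers = event["headers"]
    if preserve_existing and header in headers:
        return event  # keep the value that is already there
    headers[header] = str(int(time.time() * 1000))
    return event


event = {"headers": {}, "body": b"hello world"}
print(timestamp_interceptor(event)["headers"])
```

With preserve_existing=False, the default used in the configuration later in this section, an existing timestamp header is overwritten.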
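Timestamp interceptor test:
1. Run the getting-started example again and look at the event header information

$FLUME_HOME/bin/flume-ng agent --name a1 \
--conf-file $FLUME_HOME/conf/flume-netcat-logger.conf \
-Dflume.root.logger=INFO,console

telnet linux123 8888
# Type: hello world

2. Starting from the getting-started example, add the timestamp interceptor configuration to the file and save it as timestamp.conf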

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = linux123
a1.sources.r1.port = 8888
# Begin: timestamp interceptor (new in this file)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# Whether to keep an existing timestamp header of the same name; default false
a1.sources.r1.interceptors.i1.preserveExisting= false
# End: timestamp interceptor
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 500
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3. Start the agent, then start telnet and type some input

$FLUME_HOME/bin/flume-ng agent --name a1 \
--conf-file ~/conf/timestamp.conf \
-Dflume.root.logger=INFO,console
telnet linux123 8888
# Type: hello world

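Host Interceptor

This interceptor writes the hostname or IP address of the current agent into the Event header. The key defaults to "host" (it can be customized in the configuration), and the value can be either the hostname or the IP address.
Host interceptor test:
1. Starting from the timestamp interceptor example, add the host interceptor configuration to the file and save it as hostname.conf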

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = linux123
a1.sources.r1.port = 8888
# Begin: timestamp interceptor
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.preserveExisting= false
# End: timestamp interceptor
# Begin: host interceptor (new in this file)
a1.sources.r1.interceptors.i2.type = host
# Whether to keep an existing header of the same name
a1.sources.r1.interceptors.i2.preserveExisting= false
# true: use the IP address; false: use the hostname
a1.sources.r1.interceptors.i2.useIP = false
# End: host interceptor
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 500
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2. Start the agent, then start telnet and type some input

$FLUME_HOME/bin/flume-ng agent --name a1 \
--conf-file $FLUME_HOME/conf/hostname.conf \
-Dflume.root.logger=INFO,console
 
telnet linux123 8888
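# Type: hello world

Regex Filtering Interceptor

This interceptor treats the Event body as a string and matches it against the configured regular expression. You can configure whether the events that match are dropped or the events that do not match are dropped.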
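
The two modes can be modelled as follows. This is an illustrative Python sketch, not Flume code: regex_filter and exclude_events are names chosen here, events are plain dicts, and the pattern is assumed to match anywhere in the body.

```python
import re


def regex_filter(events, regex, exclude_events=False):
    """Keep or drop events depending on whether the body matches regex.

    exclude_events=False keeps only matching events;
    exclude_events=True drops matching events and keeps the rest.
    """
    pattern = re.compile(regex)
    kept = []
    for event in events:
        matched = pattern.search(event["body"].decode("utf-8")) is not None
        if matched != exclude_events:
            kept.append(event)
    return kept


events = [{"body": b"INFO start"}, {"body": b"DEBUG detail"}, {"body": b"ERROR boom"}]
print([e["body"] for e in regex_filter(events, r"^DEBUG", exclude_events=True)])
# [b'INFO start', b'ERROR boom']
```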

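Section 2: Channel Selectors

A source can write to several channels at once, which raises the question of how events are distributed among them:

replicating (the default): every event is sent in full to every channel;
multiplexing: events are routed according to configured rules;

Replicating Selector

This is the default selector.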

a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3

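In the example above, c3 is configured as optional, so a failure to write to c3 is ignored. c1 and c2 are not optional, so a failure to write to either of them causes the transaction to fail and roll back.

Multiplexing Selector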

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
# Use the value of the state header of each Event to choose the channel
a1.sources.r1.selector.header = state
# If state=CZ, choose channel c1
a1.sources.r1.selector.mapping.CZ = c1
# If state=US, choose channels c2 and c3
a1.sources.r1.selector.mapping.US = c2 c3
# Otherwise use channel c4
a1.sources.r1.selector.default = c4
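
The routing rule in this configuration amounts to a dictionary lookup with a fallback. A minimal sketch, with select_channels as a hypothetical function and events modelled as plain dicts:

```python
MAPPING = {"CZ": ["c1"], "US": ["c2", "c3"]}   # selector.mapping.*
DEFAULT = ["c4"]                                # selector.default


def select_channels(event, header="state", mapping=MAPPING, default=DEFAULT):
    """Return the channels an event is written to, based on one header value."""
    value = event["headers"].get(header)
    return mapping.get(value, default)


print(select_channels({"headers": {"state": "US"}}))   # ['c2', 'c3']
print(select_channels({"headers": {"state": "FR"}}))   # ['c4']
print(select_channels({"headers": {}}))                # ['c4']
```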

Custom Selector

A custom selector is a class that implements the org.apache.flume.ChannelSelector interface. The class and the JARs it depends on must be on Flume's classpath when the agent starts.

a1.sources = r1
a1.channels = c1
a1.sources.r1.selector.type = org.liyifeng.flume.channel.MyChannelSelector

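Section 3: Sink Group Processors

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance

Default
The default sink processor handles the case of a single sink, where there is no need to configure a sink group. All the earlier examples were one-to-one source - channel - sink pipelines with a single sink.

Failover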

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
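
In this configuration k2 has the higher priority (10 versus 5), so it is used first and k1 only takes over when k2 fails. The selection rule can be sketched as below; pick_sink is a hypothetical function, and the sketch deliberately ignores the back-off timer that maxpenalty (in milliseconds) puts on a failed sink.

```python
def pick_sink(priorities, failed):
    """Return the live sink with the highest priority, or None if all failed."""
    live = {name: prio for name, prio in priorities.items() if name not in failed}
    if not live:
        return None
    return max(live, key=live.get)


priorities = {"k1": 5, "k2": 10}
print(pick_sink(priorities, failed=set()))     # k2
print(pick_sink(priorities, failed={"k2"}))    # k1
```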

Load Balancing

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

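Section 4: Transactions and Reliability

Transactions are usually associated with relational databases. Their defining property is that a batch of operations is made atomic: either all of them succeed or all of them fail.
Flume has two kinds of transactions:

Put transaction: between the Source and the Channel
Take transaction: between the Channel and the Sink

On the way from the Source to the Channel, data is wrapped into Event objects. A batch of Events is placed in one transaction, and the transaction puts the whole batch into the Channel at once. Likewise, a Take transaction takes a batch of events out of the Channel as a unit and hands it to the Sink, which writes it to HDFS, for example.

The Put Transaction in Flume

The step most likely to fail is the one in which doCommit moves the staged data into the Channel. For example, if the Sink takes data slowly while the Source puts data quickly, data piles up in the Channel. What happens if the data in putList no longer fits?
In that case doRollback is called. It does two things: it clears putList and it throws a ChannelException. The source catches the exception, re-collects the batch it has just sent, and starts a new transaction. This is the transaction rollback.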
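
The put side can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described above, not Flume's actual implementation: MemoryChannelSketch, ChannelFullError, and deliver are hypothetical names, and a single list stands in for the channel queue.

```python
class ChannelFullError(Exception):
    """Stands in for Flume's ChannelException in this sketch."""


class MemoryChannelSketch:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []      # events committed to the channel
        self.put_list = []   # events staged by the open put transaction

    def do_put(self, event):
        self.put_list.append(event)

    def do_commit(self):
        if len(self.queue) + len(self.put_list) > self.capacity:
            self.do_rollback()           # raises, so nothing is committed
        self.queue.extend(self.put_list)
        self.put_list.clear()

    def do_rollback(self):
        self.put_list.clear()
        raise ChannelFullError("not enough space in the channel")


def deliver(channel, batch):
    """Stage a batch and commit it; return False if the batch must be re-sent."""
    for event in batch:
        channel.do_put(event)
    try:
        channel.do_commit()
        return True
    except ChannelFullError:
        return False


channel = MemoryChannelSketch(capacity=3)
print(deliver(channel, ["e1", "e2"]))   # True
print(deliver(channel, ["e3", "e4"]))   # False: the whole batch is rejected
print(channel.queue)                     # ['e1', 'e2']
```

The second batch is rejected as a whole, which is the atomicity the put transaction provides; the source is then expected to collect and send it again.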

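The Take Transaction in Flume

The Take transaction has a takeList as well. The HDFS sink has a batch size setting that determines how many events the Sink takes from the Channel at a time, so the batch size must not exceed the size of takeList, which in turn is determined by the channel's transaction capacity setting.

The step most likely to fail is the flush to HDFS. A flush can time out because of network problems and the transfer then fails. In that case doRollback is called to roll back. Because takeList still holds a copy of the data, rollback simply returns the contents of takeList to the channel unchanged, and the transaction is rolled back.
However, if the flush fails half-way, half of the data has already been written to HDFS. doRollback still has to be called, and there is no such thing as rolling back "half": it returns the entire takeList to the channel, and reading and writing then continue. As a result, the next transaction can write duplicate data.
When collecting and transferring data, Flume may therefore duplicate data, but it does not lose it.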
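
The duplication scenario can be reproduced with a small simulation. The code below is a sketch under simplifying assumptions: the channel and HDFS are plain Python lists, and take_and_write and fail_after are hypothetical names used only to inject a failure half-way through a batch.

```python
def take_and_write(channel, batch_size, write, fail_after=None):
    """Take up to batch_size events and write them; roll back on failure.

    write is any callable that stores one event; fail_after simulates an
    error after that many events of the batch have already been written.
    """
    take_list = [channel.pop(0) for _ in range(min(batch_size, len(channel)))]
    try:
        for written, event in enumerate(take_list):
            if fail_after is not None and written == fail_after:
                raise IOError("simulated HDFS timeout")
            write(event)
    except IOError:
        channel[:0] = take_list   # rollback: return the whole batch
        return False
    return True                    # commit: take_list is discarded


channel = ["e1", "e2", "e3", "e4"]
hdfs = []
take_and_write(channel, 4, hdfs.append, fail_after=2)   # fails half-way
take_and_write(channel, 4, hdfs.append)                 # retry succeeds
print(hdfs)
# ['e1', 'e2', 'e1', 'e2', 'e3', 'e4']
```

No event is lost, but e1 and e2 are written twice, so downstream consumers need to tolerate or remove duplicates.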
Whether Flume's data transfer is reliable also depends on the specific types of Source, Channel, and Sink in use.

Section 5: High-Availability Example

Example: agent failover
1. Prepare the environment
Deploy Flume on linux121 and linux122 and update the environment variables

# Run on linux123
cd /opt/lagou/servers
scp -r flume-1.9.0/ linux121:$PWD
scp -r flume-1.9.0/ linux122:$PWD
cd /etc
scp profile linux121:$PWD
scp profile linux122:$PWD

# Run on both linux121 and linux122
source /etc/profile

2. Configuration files
linux123: flume-taildir-avro2.conf

# agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
 
# source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /root/flume_log/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /tmp/root/.*log
a1.sources.r1.fileHeader = true
 
# interceptor
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Type
a1.sources.r1.interceptors.i1.value = LOGIN
# Add a timestamp to the event header
a1.sources.r1.interceptors.i2.type = timestamp
 
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 500
 
# sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
 
# set sink1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux121
a1.sinks.k1.port = 9999
 
# set sink2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux122
a1.sinks.k2.port = 9999
 
# set failover
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 100
a1.sinkgroups.g1.processor.priority.k2 = 60
a1.sinkgroups.g1.processor.maxpenalty = 10000
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

linux121: flume-avro-hdfs.conf

# set Agent name
a2.sources = r1
a2.channels = c1
a2.sinks = k1
 
# Source
a2.sources.r1.type = avro
a2.sources.r1.bind = linux121
a2.sources.r1.port = 9999
 
# interceptor
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
a2.sources.r1.interceptors.i1.value = linux121
 
# set channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 500
 
# HDFS Sink
a2.sinks.k1.type=hdfs
a2.sinks.k1.hdfs.path=hdfs://linux121:9000/flume/failover/
a2.sinks.k1.hdfs.fileType=DataStream
a2.sinks.k1.hdfs.writeFormat=TEXT
a2.sinks.k1.hdfs.rollInterval=60
a2.sinks.k1.hdfs.filePrefix=%Y-%m-%d
a2.sinks.k1.hdfs.minBlockReplicas=1
a2.sinks.k1.hdfs.rollSize=0
a2.sinks.k1.hdfs.rollCount=0
a2.sinks.k1.hdfs.idleTimeout=0
 
a2.sources.r1.channels = c1
a2.sinks.k1.channel=c1

linux122：flume-avro-hdfs.conf

# set Agent name
a3.sources = r1
a3.channels = c1
a3.sinks = k1
 
# Source
a3.sources.r1.type = avro
a3.sources.r1.bind = linux122
a3.sources.r1.port = 9999
 
# interceptor
a3.sources.r1.interceptors = i1
a3.sources.r1.interceptors.i1.type = static
a3.sources.r1.interceptors.i1.key = Collector
a3.sources.r1.interceptors.i1.value = linux122
 
# set channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 10000
a3.channels.c1.transactionCapacity = 500
 
# HDFS Sink
a3.sinks.k1.type=hdfs
a3.sinks.k1.hdfs.path = hdfs://linux121:9000/flume/failover/
a3.sinks.k1.hdfs.fileType=DataStream
a3.sinks.k1.hdfs.writeFormat=TEXT
a3.sinks.k1.hdfs.rollInterval=60
a3.sinks.k1.hdfs.filePrefix=%Y-%m-%d
a3.sinks.k1.hdfs.minBlockReplicas=1
a3.sinks.k1.hdfs.rollSize=0
a3.sinks.k1.hdfs.rollCount=0
a3.sinks.k1.hdfs.idleTimeout=0
 
a3.sources.r1.channels = c1
a3.sinks.k1.channel=c1

3. Start the corresponding agents on linux121, linux122, and linux123 (downstream agents first)

# linux121
flume-ng agent --name a2 --conf-file ~/conf/flume-avro-hdfs.conf

# linux122
flume-ng agent --name a3 --conf-file ~/conf/flume-avro-hdfs.conf

# linux123
flume-ng agent --name a1 --conf-file ~/conf/flume-taildir-avro2.conf

4. Write data to hive.log first, then check the HDFS directory
5. Kill one of the collector agents and check whether the other one takes over
