Flume (1)

1. Data flow from a traditional relational database

(1) RDBMS ==> Sqoop (extract) ==> Hadoop

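When the data volume is large and the data has to be collected, Flume comes in. It provides three things:

collecting: gathering the data, done by the source

aggregating: done by the channel (for now, think of it as a place that temporarily stores the collected data)

moving: delivering the data, done by the sink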

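2. What to focus on when learning Flume

(1) All we need to understand is how to write the configuration file that wires a source, a channel, and a sink together.

Agent: a source, a channel, and a sink combined into one unit.

Writing a Flume configuration file means describing how an Agent is composed.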

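3. Installing and deploying Flume

1) Download the tarball // the cdh5 download page sometimes hangs; wait a moment, or cancel and retry
2) Extract it to ~/app: tar -xzvf xxx -C xxx extracts the archive into the directory given after -C
3) Add Flume to the environment variables in ~/.bash_profile
export FLUME_HOME=/home/hadoop/app/apache-flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH
4) In $FLUME_HOME/conf/flume-env.sh (copy it from flume-env.sh.template if it does not exist yet), set JAVA_HOME

4. How to use Flume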

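(1) Configuration template

agent_name: the name of the agent being configured
a1: the agent name used in the example
a1, r1, k1, c1: the agent, source, sink, and channel names used in the example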

# Name the components on this agent

<agent_name>.sources = <source_name>
<agent_name>.sinks = <sink_name>
<agent_name>.channels = <channel_name>

<agent_name>.sources.<source_name>.type = xx
<agent_name>.sinks.<sink_name>.type = yyy
<agent_name>.channels.<channel_name>.type = zzz

<agent_name>.sources.<source_name>.channels = <channel_name>
<agent_name>.sinks.<sink_name>.channel = <channel_name>

(2) Configuration example

# a1 is the agent name; .sources lists the sources that belong to the agent, and r1 is the source name.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source (the available types and their properties are listed in the official user guide)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory

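# Bind the source and sink to the channel.
# The source property is "channels" (plural) because a source can write to several channels;
# a sink reads from exactly one channel, hence "channel" (singular).
a1.sources.r1.channels = c1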
a1.sinks.k1.channel = c1

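Launch command
# --name: the agent to start; must match the agent name in the configuration file (a1)
# --conf: Flume's global configuration directory
# --conf-file: path to your agent configuration file
# -Dflume.monitoring.type=http: expose monitoring metrics over HTTP
# -Dflume.monitoring.port: port the monitoring endpoint listens on
./flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/script/flume/simple-flume.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343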

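(3) Event

Concept: an Event is a single record, the smallest unit of data that Flume transfers.

Event structure: headers + body // in our example the headers are empty; in the logger output, the string of hex digits after body is the byte array, followed by the data itself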

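(4) Which sources, channels, and sinks does Flume support?

They are all listed on the official site; the commonly used ones are:

source // where the data is read from
   avro
   exec: tail -F xx.log // follows a single file
   Spooling Directory // watches a directory
   Taildir // the most common choice in production ****
   netcat

sink // where the data is written to
   HDFS
   logger
   avro: used together with an avro source
   kafka: for stream processing

channel
   memory: events are kept in memory
   file: events are kept on local disk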
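
To make the memory/file distinction concrete, here is a minimal sketch of a file channel for the a1/c1 names used earlier. The two directory paths are hypothetical placeholders, not values from these notes; checkpointDir and dataDirs are the properties the file channel uses to decide where it persists its state.

```properties
# Sketch only: switch channel c1 from memory to file so buffered events are kept on disk.
a1.channels.c1.type = file
# Hypothetical paths; point these at directories the Flume user can write to.
a1.channels.c1.checkpointDir = /home/hadoop/tmp/flume/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/tmp/flume/data
```

The trade-off is the usual one: the memory channel is faster but loses buffered events if the agent process dies, while the file channel is slower but keeps them on disk.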

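***** In production, if a Spooling Directory or exec source dies, it cannot pick up where it stopped, and re-collecting the data wastes a lot of time. Taildir has a positionFile: it records the read offset of each file in taildir_position.json (by default ~/.flume/taildir_position.json), so even after a crash it resumes from the recorded offset on the next start and finishes the job normally.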
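
The Taildir behaviour described above could be configured roughly as follows. This is a sketch, assuming a Flume build that includes the Taildir source; the group name f1 and both paths are hypothetical and only illustrate the shape of the configuration.

```properties
# Sketch only: a Taildir source for agent a1, reusing source r1 and channel c1.
a1.sources.r1.type = TAILDIR
# Where the per-file read offsets are stored (hypothetical path).
a1.sources.r1.positionFile = /home/hadoop/tmp/flume/taildir_position.json
# One file group named f1; the file name part of the path is a regular expression.
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/hadoop/data/logs/.*log.*
a1.sources.r1.channels = c1
```

Keeping positionFile outside the directory being tailed avoids the source picking up its own bookkeeping file.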

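5. How do we avoid producing too many small files?

Small files are what Hadoop fears most: every file, however small, occupies its own block, and every file and block costs metadata memory on the NameNode, which badly hurts performance. When Flume writes files to HDFS, three HDFS sink parameters decide when the current file is rolled, so make sure you understand the description of each one. In production, rollSize is usually set to about the size of one HDFS block, and the three parameters are generally used together: for example, with a roll interval of 10 minutes and a roll count of 100, the file is rolled as soon as either condition is met. Choosing these thresholds sensibly reduces the number of small files.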

Name               Default  Description
hdfs.rollInterval  30       Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollSize      1024     File size to trigger roll, in bytes (0: never roll based on file size)
hdfs.rollCount     10       Number of events written to file before it rolled (0 = never roll based on number of events)
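
As a sketch of how the three roll parameters might be combined on the k1 sink from the earlier example: the HDFS path, the assumed 128 MB block size, and the event count below are illustrative assumptions, not recommendations from these notes.

```properties
# Sketch only: an HDFS sink that rolls on whichever threshold is reached first.
a1.sinks.k1.type = hdfs
# Hypothetical NameNode address and target directory.
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
# Write plain text instead of the default SequenceFile.
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Roll after 10 minutes ...
a1.sinks.k1.hdfs.rollInterval = 600
# ... or once the file reaches 128 MB (134217728 bytes, assuming a 128 MB block size) ...
a1.sinks.k1.hdfs.rollSize = 134217728
# ... or after this many events (illustrative value; 0 would disable count-based rolling).
a1.sinks.k1.hdfs.rollCount = 1000000
a1.sinks.k1.channel = c1
```

With the defaults from the table (30 seconds, 1024 bytes, 10 events) a busy source would roll almost constantly, which is exactly how small files pile up.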