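Flume Topologies

Inside an agent, an event passes through the following stages: Source → Channel Processor → Interceptors → Channel Selector → Channels → Sink Processor → Sinks.
The hand-off from Source to Channel and the hand-off from Channel to Sink are each wrapped in a transaction (put and take, respectively).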

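Chaining Agents with Avro

To move data across multiple agents (hops), the sink of the upstream agent and the source of the downstream agent must both be of type avro, with the sink pointing at the hostname (or IP address) and port of the source. This is the building block for every more complex topology. Avoid chaining too many agents, though: each extra hop lowers throughput, and if any agent in the chain goes down, the whole pipeline is affected.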
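
As a minimal sketch of this wiring, the fragment below shows only the two groups of properties that have to agree. The agent names up and down, the host name collector-host, and port 4545 are placeholders chosen for illustration, not values from this article; each agent would still need its own channels and remaining components.

```properties
# Upstream agent (hypothetical name: up): the avro sink acts as the client
up.sinks.k1.type = avro
up.sinks.k1.hostname = collector-host
up.sinks.k1.port = 4545

# Downstream agent (hypothetical name: down): the avro source acts as the server
down.sources.r1.type = avro
down.sources.r1.bind = 0.0.0.0
down.sources.r1.port = 4545
```

The sink's hostname and port must match the address the downstream source is listening on; everything else in the two agents can be configured independently.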

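Replicating and Multiplexing

Flume can fan an event flow out to one or more destinations. This is done by defining a flow multiplexer that either replicates each event to, or selectively routes it to, one or more channels.
For example, the source of an agent named foo can fan its data flow out to three different channels. The Channel Selector that picks the channels can be either replicating or multiplexing.
With replicating, every event is sent to all configured channels. With multiplexing, an event is delivered to the channels mapped to the value of one of its headers. In the official example below, the selector looks at the header named state: if its value is CZ, the event goes to c1; if it is US, the event goes to c2 and c3; otherwise it goes to c4.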

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
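
The selector above only works if events actually carry a state header. One way to attach it, sketched here under the assumption that every event from this source belongs to the same state, is Flume's static interceptor. The interceptor name i1 and the value CZ are illustrative.

```properties
# Attach a fixed header state=CZ to every event from source r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = state
a1.sources.r1.interceptors.i1.value = CZ
```

With this in place, the multiplexing selector would route all events from r1 to c1.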

The Channel Selector defaults to the replicating strategy.
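
When the default is spelled out explicitly, a replicating selector can also mark some channels as optional, so that a failed write to them does not fail the source's transaction. A minimal sketch, with the channel names c1 to c3 chosen for illustration:

```properties
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
# c3 is optional: a failure to write to it is ignored
a1.sources.r1.selector.optional = c3
```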

Load Balancing and Failover

There are three Sink Processor types: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.
DefaultSinkProcessor handles a single sink, whereas LoadBalancingSinkProcessor and FailoverSinkProcessor operate on a sink group: the former balances load across the sinks in the group, and the latter fails over from one sink to another.

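Aggregation

This pattern is very common and very practical. A web application is often spread across hundreds or even thousands of servers, and the logs they produce are voluminous and awkward to handle. Flume addresses this well: each server runs an agent that collects its logs and forwards them to a central collector agent, which in turn writes to HDFS, Hive, HBase, and so on for log analysis.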

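Transaction Mechanism

Flume's transactions resemble database transactions. Flume uses two independent transactions: one for delivering events from Source to Channel, and one for delivering them from Channel to Sink. For example, the spooling directory source creates one event per line of a file; once every event in the transaction has been delivered to the channel and the commit succeeds, the source marks the file as completed. The Channel-to-Sink hand-off works the same way: if the events cannot be written out for some reason, the transaction is rolled back and all of its events stay in the channel, waiting to be redelivered.
Because both the Source-to-Channel and the Channel-to-Sink hand-offs are transactional, these two steps do not lose data on their own. Data can still be lost in two situations: the channel is a memory channel and the agent crashes, losing whatever was buffered in memory; or the channel is full, the source can no longer write to it, and the data that could not be written may be lost.
Flume can also deliver duplicates. For example, if a sink has sent a batch successfully but never receives the acknowledgement, it sends the batch again, so the same data may arrive twice.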
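
Where losing buffered events on an agent crash is not acceptable, the memory channel can be swapped for Flume's file channel, which persists events to disk at the cost of throughput. A minimal sketch; the two directory paths are placeholders, not paths used elsewhere in this article:

```properties
a1.channels = c1
a1.channels.c1.type = file
# Where the channel stores its checkpoint (placeholder path)
a1.channels.c1.checkpointDir = /data/flume/checkpoint
# Comma-separated list of directories for the event log files (placeholder path)
a1.channels.c1.dataDirs = /data/flume/data
```

This addresses only the crash case; duplicates caused by a sink resending a batch still have to be handled downstream.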

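Case 1: Single Source, Multiple Outputs

Case Analysis

Flume1 monitors a file for changes and passes the new content to Flume2, which stores it in HDFS. At the same time, Flume1 passes the same content to Flume3, which writes it to the local file system.

Case Steps

Create an empty file: touch date.txt
Start HDFS and YARN: start-dfs.sh, start-yarn.sh

Create three configuration files: flume1.conf, flume2.conf, and flume3.conf.
The first agent (flume1.conf, agent name a1) has one source r1, two channels c1 and c2, and two sinks k1 and k2. The source is of type TAILDIR and monitors the local file date.txt. Both sinks are of type avro and use different ports, one for each of the other two agents. The channels are of type memory.
A sink drains exactly one channel, so sending the data from one source to several destinations requires one channel and one sink per destination.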

 # Name the components on this agent
 a1.sources = r1
 a1.channels = c1 c2
 a1.sinks = k1 k2
 
 # Replicate every event to all channels (this is the default and may be omitted)
 a1.sources.r1.selector.type = replicating
 
 # Describe/configure the source
 a1.sources.r1.type = TAILDIR
 a1.sources.r1.filegroups = f1
 a1.sources.r1.filegroups.f1 = /opt/flume-1.9.0/date.txt
 a1.sources.r1.positionFile = /opt/flume-1.9.0/file/position.json
 
 # Describe the sink
 a1.sinks.k1.type = avro
 a1.sinks.k1.hostname = master
 a1.sinks.k1.port = 44444
 
 a1.sinks.k2.type = avro
 a1.sinks.k2.hostname = master
 a1.sinks.k2.port = 55555
 
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 a1.channels.c2.type = memory
 a1.channels.c2.capacity = 1000
 a1.channels.c2.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1 c2
 a1.sinks.k1.channel = c1
 a1.sinks.k2.channel = c2

The second agent (flume2.conf, agent name a2) uses an avro source to receive from the first agent, and an HDFS sink.

 # Name the components on this agent
 a2.sources = r1
 a2.channels = c1
 a2.sinks = k1
 
 # Describe/configure the source
 a2.sources.r1.type = avro
 a2.sources.r1.bind = master
 a2.sources.r1.port = 44444
 
 # Describe the sink
 a2.sinks.k1.type = hdfs
 a2.sinks.k1.hdfs.path = hdfs://master:9000/a/%Y%m%d/%H
 a2.sinks.k1.hdfs.filePrefix = logs
 a2.sinks.k1.hdfs.round = true
 a2.sinks.k1.hdfs.roundValue = 1
 a2.sinks.k1.hdfs.roundUnit = hour
 a2.sinks.k1.hdfs.useLocalTimeStamp = true
 a2.sinks.k1.hdfs.batchSize = 100
 a2.sinks.k1.hdfs.fileType = DataStream
 a2.sinks.k1.hdfs.rollInterval = 30
 a2.sinks.k1.hdfs.rollSize = 134217700
 a2.sinks.k1.hdfs.rollCount = 0
 
 # Use a channel which buffers events in memory
 a2.channels.c1.type = memory
 a2.channels.c1.capacity = 1000
 a2.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a2.sources.r1.channels = c1
 a2.sinks.k1.channel = c1

The third agent (flume3.conf, agent name a3) also uses an avro source to receive from the first agent; its sink is of type file_roll.

 # Name the components on this agent
 a3.sources = r1
 a3.channels = c1
 a3.sinks = k1
 
 # Describe/configure the source
 a3.sources.r1.type = avro
 a3.sources.r1.bind = master
 a3.sources.r1.port = 55555
 
 # Describe the sink
 a3.sinks.k1.type = file_roll
 a3.sinks.k1.sink.directory = /opt/flume-1.9.0/file
 
 # Use a channel which buffers events in memory
 a3.channels.c1.type = memory
 a3.channels.c1.capacity = 1000
 a3.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a3.sources.r1.channels = c1
 a3.sinks.k1.channel = c1

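Start flume2, flume3, and then flume1, in that order. flume1 goes last because the avro sources in the other two agents act as the servers that its avro sinks connect to.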

 bin/flume-ng agent -c conf -f flume2.conf -n a2 -Dflume.root.logger=INFO,console
 bin/flume-ng agent -c conf -f flume3.conf -n a3 -Dflume.root.logger=INFO,console
 bin/flume-ng agent -c conf -f flume1.conf -n a1 -Dflume.root.logger=INFO,console

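Modify the file by running date > date.txt

Case 2: Failover

Case Analysis

Flume1 listens on a port. The sinks in its sink group connect to Flume2 and Flume3 respectively, and a FailoverSinkProcessor provides failover between them.

Case Steps

Create three configuration files: flume1.conf, flume2.conf, and flume3.conf.
The first agent adds a sink group configured with the failover strategy. Note that k2 has a higher priority than k1, so the agent behind k2 (flume3) is the active one and the agent behind k1 (flume2) is the standby.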

 # Name the components on this agent
 a1.sources = r1
 a1.channels = c1
 a1.sinks = k1 k2
 a1.sinkgroups = g1
 
 # Describe/configure the source
 a1.sources.r1.type = netcat
 a1.sources.r1.bind = master
 a1.sources.r1.port = 33333
 
 # Describe the sink
 a1.sinks.k1.type = avro
 a1.sinks.k1.hostname = master
 a1.sinks.k1.port = 44444
 
 a1.sinks.k2.type = avro
 a1.sinks.k2.hostname = master
 a1.sinks.k2.port = 55555
 
 # Sink groups
 a1.sinkgroups.g1.sinks = k1 k2
 a1.sinkgroups.g1.processor.type = failover
 a1.sinkgroups.g1.processor.priority.k1 = 50
 a1.sinkgroups.g1.processor.priority.k2 = 100
 a1.sinkgroups.g1.processor.maxpenalty = 10000
 
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1
 a1.sinks.k2.channel = c1

The second agent uses a logger sink.

 # Name the components on this agent
 a2.sources = r1
 a2.channels = c1
 a2.sinks = k1
 
 # Describe/configure the source
 a2.sources.r1.type = avro
 a2.sources.r1.bind = master
 a2.sources.r1.port = 44444
 
 # Describe the sink
 a2.sinks.k1.type = logger
 
 # Use a channel which buffers events in memory
 a2.channels.c1.type = memory
 a2.channels.c1.capacity = 1000
 a2.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a2.sources.r1.channels = c1
 a2.sinks.k1.channel = c1

The third agent is configured like the second one; only the port differs.

 # Name the components on this agent
 a3.sources = r1
 a3.channels = c1
 a3.sinks = k1
 
 # Describe/configure the source
 a3.sources.r1.type = avro
 a3.sources.r1.bind = master
 a3.sources.r1.port = 55555
 
 # Describe the sink
 a3.sinks.k1.type = logger
 
 # Use a channel which buffers events in memory
 a3.channels.c1.type = memory
 a3.channels.c1.capacity = 1000
 a3.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a3.sources.r1.channels = c1
 a3.sinks.k1.channel = c1

Start flume2, flume3, and then flume1, in that order.

 bin/flume-ng agent -c conf -f flume2.conf -n a2 -Dflume.root.logger=INFO,console
 bin/flume-ng agent -c conf -f flume3.conf -n a3 -Dflume.root.logger=INFO,console
 bin/flume-ng agent -c conf -f flume1.conf -n a1 -Dflume.root.logger=INFO,console

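Open a new terminal, run nc master 33333, and type some input.
Because flume3 has a higher priority than flume2, flume3 is the active agent and receives the messages.

If flume3 then goes down, flume2 is promoted from standby to active and starts receiving the messages.

When flume3 comes back up, it receives the messages again, because its priority is higher than flume2's.

Case 3: Load Balancing

Case Steps

Create three configuration files: flume1.conf, flume2.conf, and flume3.conf. flume2.conf and flume3.conf are identical to those in Case 2; flume1.conf differs only in the sink group strategy. The flume1 configuration is: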

 # Name the components on this agent
 a1.sources = r1
 a1.channels = c1
 a1.sinks = k1 k2
 a1.sinkgroups = g1
 
 # Describe/configure the source
 a1.sources.r1.type = netcat
 a1.sources.r1.bind = master
 a1.sources.r1.port = 33333
 
 # Describe the sink
 a1.sinks.k1.type = avro
 a1.sinks.k1.hostname = master
 a1.sinks.k1.port = 44444
 
 a1.sinks.k2.type = avro
 a1.sinks.k2.hostname = master
 a1.sinks.k2.port = 55555
 
 # Sink groups
 a1.sinkgroups.g1.sinks = k1 k2
 a1.sinkgroups.g1.processor.type = load_balance
 a1.sinkgroups.g1.processor.backoff = true
 a1.sinkgroups.g1.processor.selector = random
 
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1
 a1.sinks.k2.channel = c1

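Start flume2, flume3, and then flume1, in that order. Open a new terminal, run nc master 33333, and type some input.
With the random selector, each batch of events goes to either flume2 or flume3, chosen at random.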
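
A round-robin variant of the sink group is sketched below for comparison. Relative to the flume1 configuration above, the selector line changes and one optional line is added; the maxTimeOut value is an illustrative choice, not a recommendation.

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
# Take turns between k1 and k2 instead of picking at random
a1.sinkgroups.g1.processor.selector = round_robin
# Upper bound on the exponential backoff for a failed sink, in milliseconds
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
```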

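Case 4: Aggregation

Case Analysis

Flume1 on slave1 listens on a port, and Flume2 on slave2 monitors the local file date.txt. Both send their data to Flume3 on master, which prints it to the console.

Case Steps

On slave1, create the configuration file flume1.conf. The source is of type netcat and listens on a port; the sink is of type avro and connects to flume3.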

 # Name the components on this agent
 a1.sources = r1
 a1.channels = c1
 a1.sinks = k1
 
 # Describe/configure the source
 a1.sources.r1.type = netcat
 a1.sources.r1.bind = localhost
 a1.sources.r1.port = 33333
 
 # Describe the sink
 a1.sinks.k1.type = avro
 a1.sinks.k1.hostname = master
 a1.sinks.k1.port = 44444
 
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

On slave2, create the configuration file flume2.conf. The source is of type exec and follows the file; the sink is of type avro and connects to flume3.

 # Name the components on this agent
 a2.sources = r1
 a2.channels = c1
 a2.sinks = k1
 
 # Describe/configure the source
 a2.sources.r1.type = exec
 a2.sources.r1.command = tail -F /opt/flume-1.9.0/date.txt
 
 # Describe the sink
 a2.sinks.k1.type = avro
 a2.sinks.k1.hostname = master
 a2.sinks.k1.port = 44444
 
 # Use a channel which buffers events in memory
 a2.channels.c1.type = memory
 a2.channels.c1.capacity = 1000
 a2.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a2.sources.r1.channels = c1
 a2.sinks.k1.channel = c1
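
The exec source with tail -F gives no delivery guarantee: if the agent restarts, it has no record of how far it had read. A hedged alternative is to reuse the TAILDIR source from Case 1, which records its read offset in a position file. In this sketch the position file path is an assumption; the rest of flume2.conf stays unchanged.

```properties
# Replaces the two exec source lines in flume2.conf
a2.sources.r1.type = TAILDIR
a2.sources.r1.filegroups = f1
a2.sources.r1.filegroups.f1 = /opt/flume-1.9.0/date.txt
# Offsets are persisted here so a restart resumes where it left off (assumed path)
a2.sources.r1.positionFile = /opt/flume-1.9.0/file/position.json
```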

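On master, create the configuration file flume3.conf. The source is of type avro and receives the data sent by flume1 and flume2; the sink is of type logger and writes the received data to the console.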

 # Name the components on this agent
 a3.sources = r1
 a3.channels = c1
 a3.sinks = k1
 
 # Describe/configure the source
 a3.sources.r1.type = avro
 a3.sources.r1.bind = master
 a3.sources.r1.port = 44444
 
 # Describe the sink
 a3.sinks.k1.type = logger
 
 # Use a channel which buffers events in memory
 a3.channels.c1.type = memory
 a3.channels.c1.capacity = 1000
 a3.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a3.sources.r1.channels = c1
 a3.sinks.k1.channel = c1

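Start the three agents, flume3 first, since it hosts the avro source that flume1 and flume2 connect to.
On slave1, run nc localhost 33333 and send some data.
On slave2, run date > date.txt
On master, the received data appears on the console.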
