Transferring Kafka data to HDFS and HBase through Flume


1. Overview

Kafka is a good choice for data forwarding: it buffers data in its message queue until other business scenarios are ready to consume it. Kafka's client API is rich and integrates with many storage systems, such as HDFS and HBase. If you don't want to write code against the Kafka API to consume topics, there are also off-the-shelf components that can do the consuming for you. In this article, I will show how to use Flume to quickly consume data from a Kafka topic and forward the consumed data to HDFS, and then how to do the same for HBase.

2. Content

Before implementing this scheme, take a look at the overall data flow, as shown in the following figure:

[Figure: data flow from the Kafka cluster through the Flume Source, Channel, and Sink components into HDFS]

Business data is written to the Kafka cluster in real time, the Flume Source component consumes the Kafka business topic in real time, and the Flume Sink component then sends the consumed data to HDFS for storage.

2.1 Prepare the basic environment

According to the data flow shown in the figure above, you need to prepare Kafka, Flume, and Hadoop (with HDFS available).

2.1.1 Start the Kafka cluster and create a topic

Kafka does not ship with a script for managing all brokers at once, but we can wrap the kafka-server-start.sh and kafka-server-stop.sh scripts ourselves. The code is as follows:

#! /bin/bash

# Kafka broker node addresses; if there are many nodes, they can be stored in a file instead
hosts=(dn1 dn2 dn3)

# Print a message when the distributed script starts
mill=`date "+%N"`
tdate=`date "+%Y-%m-%d %H:%M:%S,${mill:0:3}"`

echo [$tdate] INFO [Kafka Cluster] begins to execute the $1 operation.

# Run the start command on every broker
function start()
{
    for i in ${hosts[@]}
        do
            smill=`date "+%N"`
            stdate=`date "+%Y-%m-%d %H:%M:%S,${smill:0:3}"`
            ssh hadoop@$i "source /etc/profile;echo [$stdate] INFO [Kafka Broker $i] begins to execute the startup operation.;kafka-server-start.sh \$KAFKA_HOME/config/server.properties>/dev/null" &
            sleep 1
        done
}    

# Run the stop command on every broker
function stop()
{
    for i in ${hosts[@]}
        do
            smill=`date "+%N"`
            stdate=`date "+%Y-%m-%d %H:%M:%S,${smill:0:3}"`
            ssh hadoop@$i "source /etc/profile;echo [$stdate] INFO [Kafka Broker $i] begins to execute the shutdown operation.;kafka-server-stop.sh>/dev/null;" &
            sleep 1
        done
}

# Check the status of every Kafka broker node
function status()
{
    for i in ${hosts[@]}
        do
            smill=`date "+%N"`
            stdate=`date "+%Y-%m-%d %H:%M:%S,${smill:0:3}"`
            ssh hadoop@$i "source /etc/profile;echo [$stdate] INFO [Kafka Broker $i] status message is :;jps | grep Kafka;" &
            sleep 1
        done
}

# Validate the Kafka command-line argument
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status
        ;;
    *)
        echo "Usage: $0 {start|stop|status}"
        RETVAL=1
esac
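
Save the script under a name of your choice (kafka-cluster.sh below is only an example name), distribute it to a node that can SSH to all brokers without a password, and call it like this:

# Start, check the status of, and stop all brokers in one go
chmod +x kafka-cluster.sh
./kafka-cluster.sh start
./kafka-cluster.sh status
./kafka-cluster.sh stop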


After starting the Kafka cluster and confirming it is available, create the business topic with the following command:

# Create a topic named flume_collector_data
kafka-topics.sh --create --zookeeper dn1:2181,dn2:2181,dn3:2181 --replication-factor 3 --partitions 6 --topic flume_collector_data
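
You can verify that the topic was created with the expected partitions and replicas before moving on; kafka-topics.sh can also describe it:

# Describe the new topic to confirm partitions and replication
kafka-topics.sh --describe --zookeeper dn1:2181,dn2:2181,dn3:2181 --topic flume_collector_data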


2.2 Configure the Flume Agent

Next, configure the Flume Agent so that Flume reads data from the flume_collector_data topic of the Kafka cluster and writes it to HDFS. The configuration is as follows:

# ------------------- define data source ----------------------
# source alias
agent.sources = source_from_kafka
# channels alias
agent.channels = mem_channel
# sink alias
agent.sinks = hdfs_sink

# define kafka source
agent.sources.source_from_kafka.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.source_from_kafka.channels = mem_channel
agent.sources.source_from_kafka.batchSize = 5000

# set kafka broker address
agent.sources.source_from_kafka.kafka.bootstrap.servers = dn1:9092,dn2:9092,dn3:9092

# set kafka topic
agent.sources.source_from_kafka.kafka.topics = flume_collector_data

# set kafka groupid
agent.sources.source_from_kafka.kafka.consumer.group.id = flume_test_id

# define hdfs sink
agent.sinks.hdfs_sink.type = hdfs

# specify the channel the sink should use
agent.sinks.hdfs_sink.channel = mem_channel

# set store hdfs path
agent.sinks.hdfs_sink.hdfs.path = /data/flume/kafka/%Y%m%d

# set file size to trigger roll
agent.sinks.hdfs_sink.hdfs.rollSize = 0
agent.sinks.hdfs_sink.hdfs.rollCount = 0
agent.sinks.hdfs_sink.hdfs.rollInterval = 3600
agent.sinks.hdfs_sink.hdfs.threadsPoolSize = 30
agent.sinks.hdfs_sink.hdfs.fileType = DataStream
agent.sinks.hdfs_sink.hdfs.writeFormat = Text

# define channel from kafka source to hdfs sink
agent.channels.mem_channel.type = memory

# channel store size
agent.channels.mem_channel.capacity = 100000
# transaction size
agent.channels.mem_channel.transactionCapacity = 10000
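
One thing to double-check: hdfs.path above uses the time escape sequence %Y%m%d, which the HDFS sink resolves from a timestamp header on each event. Whether the Kafka source adds such a header depends on the Flume version; if the agent complains about a missing timestamp, the optional property below (a standard HDFS sink setting, not part of the original configuration) tells the sink to fall back to the local clock:

# Resolve %Y%m%d from the local clock when events carry no timestamp header
agent.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true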


Then start the Flume Agent with the following command:

# Run the command in the background on Linux
flume-ng agent -n agent -f $FLUME_HOME/conf/kafka2hdfs.properties &
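
While testing, it can be handy to run the agent in the foreground with console logging instead; the options below are standard flume-ng flags:

# Run in the foreground with console logging for easier debugging
flume-ng agent -n agent -c $FLUME_HOME/conf -f $FLUME_HOME/conf/kafka2hdfs.properties -Dflume.root.logger=INFO,console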


2.3 Send data to the Kafka topic

Start the Kafka Eagle monitoring system (by executing the ke.sh start command), then fill in and send test data to the topic, as shown below:

[Figure: filling in and sending test data to the flume_collector_data topic from the Kafka Eagle console]
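
If you are not running Kafka Eagle, the stock console producer that ships with Kafka works just as well for sending test data:

# Each line typed on stdin becomes one message in the topic
kafka-console-producer.sh --broker-list dn1:9092,dn2:9092,dn3:9092 --topic flume_collector_data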

Then, check whether the data has been written to the topic, as shown in the following figure:

[Figure: Kafka Eagle preview of the messages written to the flume_collector_data topic]
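
The same check can also be done from the command line with the console consumer instead of the Kafka Eagle view:

# Read the topic from the beginning to confirm the messages arrived
kafka-console-consumer.sh --bootstrap-server dn1:9092,dn2:9092,dn3:9092 --topic flume_collector_data --from-beginning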

 

Finally, browse the corresponding HDFS path to view the data written by Flume. The result is shown in the following figure:

[Figure: data files written by Flume under the /data/flume/kafka/%Y%m%d directory in HDFS]
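
The same result can be checked from the command line; the date directory below simply assumes the test was run today:

# List the files Flume created today and print their contents
hdfs dfs -ls /data/flume/kafka/$(date +%Y%m%d)
hdfs dfs -cat /data/flume/kafka/$(date +%Y%m%d)/*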

3. How Kafka transfers data to HBase through Flume

3.1 Create a new topic

Create a new topic by executing the following command:

# Create a topic named flume_kafka_to_hbase
kafka-topics.sh --create --zookeeper dn1:2181,dn2:2181,dn3:2181 --replication-factor 3 --partitions 6 --topic flume_kafka_to_hbase


3.2 Configure the Flume Agent

Then, configure the Flume Agent to consume the flume_kafka_to_hbase topic and write the events to HBase. The configuration is as follows:

# ------------------- define data source ----------------------
# source alias
agent.sources = kafkaSource
# channels alias
agent.channels = kafkaChannel
# sink alias
agent.sinks = hbaseSink

# set kafka channel
agent.sources.kafkaSource.channels = kafkaChannel

# set hbase channel
agent.sinks.hbaseSink.channel = kafkaChannel

# set kafka source
agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource

# set kafka broker address
agent.sources.kafkaSource.kafka.bootstrap.servers = dn1:9092,dn2:9092,dn3:9092

# set kafka topic
agent.sources.kafkaSource.kafka.topics = flume_kafka_to_hbase

# set kafka groupid
agent.sources.kafkaSource.kafka.consumer.group.id = flume_test_id

# set channel
agent.channels.kafkaChannel.type = org.apache.flume.channel.kafka.KafkaChannel
# channel queue
agent.channels.kafkaChannel.capacity = 10000
# transaction size
agent.channels.kafkaChannel.transactionCapacity = 1000

# set hbase sink
agent.sinks.hbaseSink.type = asynchbase
# hbase table
agent.sinks.hbaseSink.table = flume_data
# set table column family
agent.sinks.hbaseSink.columnFamily = info
# serializer sink
agent.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer

# set hbase zk
agent.sinks.hbaseSink.zookeeperQuorum = dn1:2181,dn2:2181,dn3:2181
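
Two Kafka channel details are worth checking against your Flume version: the channel needs its own broker list, and it buffers events in a Kafka topic of its own. If the agent refuses to start, appending properties along these lines (property names from the Flume Kafka channel; the topic name below is just an example, the default is flume-channel) is the likely fix:

# Brokers backing the Kafka channel
agent.channels.kafkaChannel.kafka.bootstrap.servers = dn1:9092,dn2:9092,dn3:9092
# Optional: topic the channel uses to buffer events (defaults to flume-channel)
agent.channels.kafkaChannel.kafka.topic = flume_channel_topic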


3.3 Create the HBase table

Log in to the HBase cluster and run the following command to create the table:

hbase(main):002:0> create 'flume_data','info'
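
You can confirm the table and its column family before wiring up Flume; describe is a standard HBase shell command:

describe 'flume_data'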


3.4 Start the Flume Agent

Next, start the Flume Agent instance with the following command:

# Run the command in the background on Linux
flume-ng agent -n agent -f $FLUME_HOME/conf/kafka2hbase.properties &


3.5 Write data to the topic in Kafka Eagle

Then, write test data to the flume_kafka_to_hbase topic through Kafka Eagle, as shown in the following figures:

[Figures: writing test data to the flume_kafka_to_hbase topic from the Kafka Eagle console]


3.6 Query the transferred data in HBase

Finally, query the flume_data table in HBase to verify that the data was transferred successfully:

hbase(main):003:0> scan 'flume_data'


The result looks like this:

[Figure: result of scanning the flume_data table in the HBase shell]
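
If the table already holds a lot of data, limiting the scan or simply counting the rows is quicker; both are standard HBase shell commands:

# Show only the first 10 rows, then count all rows
scan 'flume_data', {LIMIT => 10}
count 'flume_data'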


4. Summary

At this point, the data in the Kafka business topic has been consumed by the Flume Source component and written by the Flume Sink component to HDFS (and, in the second example, HBase), without writing any business code. If your use case does not involve complex business logic and you only need to forward Kafka data, this approach is worth a try.


5. Closing remarks

That's all for this blog post. If you run into any problems while studying this material, feel free to join the group for discussion or send me an email, and I will do my best to answer. Good luck, and keep learning!


HBase + Spark technical discussion

The community brings together top expert resources and technical materials and regularly holds offline technical salons, live expert broadcasts, and expert Q&A sessions. Announcements:

1. Questions:

Please go to the Yunqi community HBase+Spark team page

https://yq.aliyun.com/teams/382

or

http://hbase.group

and ask in the Q&A section.


2. Technical sharing:

Weekly group live-stream videos and PDFs:

https://yq.aliyun.com/teams/382/type_blog-cid_414-page_1


November live-stream schedule:

Session 1: HBase multi-model + Spark introduction (November)

Session 2: HBase internals and capabilities (November)

Session 3: Spark introduction and multi-data-source analysis with Spark (November)

Session 4: Phoenix introduction and secondary indexes (November)


3. Offline technical salon announcements

    November 17, Nanjing [The 8th MeetUp of China HBase Technology Community] Registration: http://www.huodongxing.com/event/4464965483800

    December 1, Beijing [Ali Yunqi Developer Salon]-Fun with database technology! Registration: https://yq.aliyun.com/activity/779




