Integration and Development of Flume+HBase+Kafka

 


   Today's content is to complete the integrated development of Flume+HBase+Kafka. As shown in the red box in the figure below, the Flume agent on node 1 receives data from two sources: the sink output of node 2 and that of node 3. After receiving the data, node 1 preprocesses it and then pushes it to HBase via an AsyncHBaseSink (HBaseSink) and to Kafka via a Kafka sink, for offline data processing and real-time data processing respectively.

 

1. Download the Flume source code and import the Idea development tool

  1) Download the apache-flume-1.7.0-src.tar.gz source code and extract it locally

  2) Import the flume source code into IDEA

  Open the IDEA development tool and select File—>Open

 

 

  Then navigate to the extracted flume source directory, select the flume-ng-hbase-sink module, and click OK to load the source code of that module.

 

 


 

2. Introduction to the parameters of the official flume and hbase integration

 See Flume Sinks —> AsyncHBaseSink in the Flume User Guide: http://flume.apache.org/FlumeUserGuide.html

 

 

  The properties shown in bold must be configured; the others are tuning parameters. The payloadColumn property tells the sink which columns to write under the column family specified by columnFamily (in this project it will carry a comma-separated list of column names).

 

3. Download and analyze the log data

  Go to Sogou Labs and download the user query logs (this was already done during the earlier HBase environment deployment; if you have any questions, go back and review:  HBase distributed cluster deployment and design )

 1) Introduction

  The search engine query log dataset contains about one month (June 2008) of web query log data from the Sogou search engine, covering users' query requests and the pages they clicked. It provides a benchmark research corpus for researchers analyzing the behavior of Chinese search engine users.

 2) Format Description

  The data format is: access time\tuser ID\t[query word]\trank of the URL in the returned results\tordinal of the user's click\tURL clicked by the user

  Among them, the user ID is assigned automatically from the cookie information when the user accesses the search engine with a browser; that is, different queries issued from the same browser correspond to the same user ID.
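
  As a quick illustration of this format, the hypothetical Java sketch below splits one tab-separated line into the six fields described above. The sample line and class name are made up for illustration only; the field names mirror the column names used later in the Flume configuration.

public class WeblogLineDemo {
    public static void main(String[] args) {
        // Hypothetical record following the documented layout (tab-separated)
        String line = "20111230000005\tu0001\t[example query]\t1\t1\thttp://example.com/";
        String[] fields = line.split("\t");
        System.out.println("datatime   = " + fields[0]); // access time
        System.out.println("userid     = " + fields[1]); // user ID derived from the cookie
        System.out.println("searchname = " + fields[2]); // query word
        System.out.println("retorder   = " + fields[3]); // rank of the URL in the returned results
        System.out.println("cliorder   = " + fields[4]); // ordinal of the user's click
        System.out.println("cliurl     = " + fields[5]); // URL clicked by the user
    }
}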

 

  This data will be used as the source data of this project and stored on nodes 2 and 3.

 

4. Configuration of the Flume aggregation node (node 1) integrated with HBase

  Connect to node 1 with Notepad++ and rename the configuration file templates.

 

 

  Configure the flume-env.sh file

 

 

  Configure the flume-conf.properties file

 

 

  The formatting of the original template is messy; simply delete it all and enter the following content:

agent1.sources = r1
agent1.channels = kafkaC hbaseC
agent1.sinks = kafkaSink hbaseSink

agent1.sources.r1.type = avro
agent1.sources.r1.channels = hbaseC
agent1.sources.r1.bind = bigdata-pro01.kfk.com
agent1.sources.r1.port = 5555
agent1.sources.r1.threads = 5

agent1.channels.hbaseC.type = memory
agent1.channels.hbaseC.capacity = 100000
agent1.channels.hbaseC.transactionCapacity = 100000
agent1.channels.hbaseC.keep-alive = 20

agent1.sinks.hbaseSink.type = asynchbase
agent1.sinks.hbaseSink.table = weblogs
agent1.sinks.hbaseSink.columnFamily = info
agent1.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.KfkAsyncHbaseEventSerializer
agent1.sinks.hbaseSink.channel = hbaseC
agent1.sinks.hbaseSink.serializer.payloadColumn = datatime,userid,searchname,retorder,cliorder,cliurl

 

 

5. Format the log data

 1) Replace the tabs in the file with commas

cat weblog.log|tr "\t" "," > weblog2.log

 

 2) Replace the spaces in the file with commas

cat weblog2.log|tr " " "," > weblog3.log

 

[kfk@bigdata-pro01 datas]$ rm -f weblog2.log
[kfk@bigdata-pro01 datas]$ rm -f weblog.log
[kfk@bigdata-pro01 datas]$ mv weblog3.log weblog.log
[kfk@bigdata-pro01 datas]$ ls
  wc.input  weblog.log

 3) Then distribute to nodes 2 and 3

[kfk@bigdata-pro01 datas]$ scp weblog.log bigdata-pro02.kfk.com:/opt/datas/
weblog.log                                                                                                                 100%  145MB  72.5MB/s   00:02   
[kfk@bigdata-pro01 datas]$ scp weblog.log bigdata-pro03.kfk.com:/opt/datas/
weblog.log
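
 If you prefer to do the combined substitution in code rather than with the two tr commands above, the Java sketch below replaces tabs and spaces with commas in a single pass. The paths are hypothetical, and it assumes the log can be read line by line as UTF-8.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WeblogReformat {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("/opt/datas/weblog.log"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("/opt/datas/weblog3.log"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Replace every tab and every space with a comma, then copy the line out
                out.write(line.replace('\t', ',').replace(' ', ','));
                out.newLine();
            }
        }
    }
}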

 

6. Design and development of the custom HBase sink program

 1) Model the custom KfkAsyncHbaseEventSerializer implementation class on SimpleAsyncHbaseEventSerializer; only the code below needs to be modified.

 

@Override
    public List<PutRequest> getActions() {
        List<PutRequest> actions = new ArrayList<PutRequest>();
        if (payloadColumn != null) {
            byte[] rowKey;
            try {
                /*---------------------------start of modified code---------------------------------*/
                //parse the column names configured in payloadColumn
                String[] columns = new String(this.payloadColumn).split(",");
                //parse the values of the line collected by flume
                String[] values = new String(this.payload).split(",");
                //data validation: the number of values must match the number of columns
                if (columns.length != values.length) {
                    return actions;
                }
                //access time
                String datetime = values[0];
                //user id
                String userid = values[1];
                //build a custom rowkey according to the business requirements
                rowKey = SimpleRowKeyGenerator.getKfkRowKey(userid, datetime);
                for (int i = 0; i < columns.length; i++) {
                    byte[] colColumn = columns[i].getBytes(Charsets.UTF_8);
                    byte[] colValue = values[i].getBytes(Charsets.UTF_8);
                    //one put per column under the configured column family
                    PutRequest putRequest = new PutRequest(table, rowKey, cf,
                            colColumn, colValue);
                    actions.add(putRequest);
                }
                /*---------------------------end of modified code---------------------------------*/
            } catch (Exception e) {
                throw new FlumeException("Could not get row key!", e);
            }
        }
        return actions;
    }
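
 To sanity-check the serializer outside the cluster, a small local harness like the one below can be used. This is an assumed helper, not part of the original tutorial: it presumes the flume-ng-hbase-sink 1.7.0 and asynchbase dependencies are on the classpath, and that KfkAsyncHbaseEventSerializer keeps the configure/initialize/setEvent contract it inherits from SimpleAsyncHbaseEventSerializer.

import com.google.common.base.Charsets;
import org.apache.flume.Context;
import org.apache.flume.event.EventBuilder;
import org.hbase.async.PutRequest;

import java.util.List;

//hypothetical harness: feed one comma-separated line through the serializer
//and print the puts it would send to HBase (place it in the
//org.apache.flume.sink.hbase package or import the serializer class)
public class KfkSerializerSmokeTest {
    public static void main(String[] args) throws Exception {
        KfkAsyncHbaseEventSerializer serializer = new KfkAsyncHbaseEventSerializer();

        Context context = new Context();
        context.put("payloadColumn", "datatime,userid,searchname,retorder,cliorder,cliurl");
        serializer.configure(context);

        serializer.initialize("weblogs".getBytes(Charsets.UTF_8), "info".getBytes(Charsets.UTF_8));
        //hypothetical record in the comma-separated format produced in step 5
        String line = "20111230000005,u0001,[example query],1,1,http://example.com/";
        serializer.setEvent(EventBuilder.withBody(line.getBytes(Charsets.UTF_8)));

        List<PutRequest> actions = serializer.getActions();
        for (PutRequest put : actions) {
            System.out.println(new String(put.qualifier(), Charsets.UTF_8)
                    + " -> " + new String(put.value(), Charsets.UTF_8));
        }
    }
}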

 

 

 2) In the SimpleRowKeyGenerator class, add a custom rowkey generation method according to the specific business

/**
   * Custom rowkey generation: userid + datetime + current time in milliseconds
   * @param userid
   * @param datetime
   * @return the rowkey as a byte array
   * @throws UnsupportedEncodingException
   */

  public static byte[] getKfkRowKey(String userid, String datetime) throws UnsupportedEncodingException {
    return (userid + datetime + String.valueOf(System.currentTimeMillis())).getBytes("UTF8");
  }
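
 The millisecond timestamp is presumably appended so that two records with the same userid and datetime still get distinct rowkeys. A tiny hypothetical check (class name and inputs made up, assumed to sit in the same package as SimpleRowKeyGenerator):

import java.io.UnsupportedEncodingException;

//hypothetical check: identical inputs normally yield distinct rowkeys
//because of the System.currentTimeMillis() suffix
public class RowKeyDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] rk1 = SimpleRowKeyGenerator.getKfkRowKey("u0001", "20111230000005");
        byte[] rk2 = SimpleRowKeyGenerator.getKfkRowKey("u0001", "20111230000005");
        System.out.println(new String(rk1, "UTF-8"));
        System.out.println(new String(rk2, "UTF-8"));
    }
}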

 

7. Compile the custom code and package it into a jar

 1) In the IDEA tool, select File—>Project Structure

 

 

 2) Select Artifacts on the left, then click the + sign on the right, and finally select JAR—>From modules with dependencies

 

 

 3) Then click ok directly

 

 

 4) Then click apply, ok in turn

 

 

 6) Click build to compile, and it will be automatically packaged into a jar package

 

 

 

 7) Go to the project directory and find the jar package you just built

 

 

 8) Rename the jar to flume-ng-hbase-sink-1.7.0.jar, the name of the jar that comes with Flume, and then upload it to the flume/lib directory to overwrite the original jar package.

 

 

8. Configuration of the Flume aggregation node integrated with Kafka

Continue by appending the following content to the flume-conf.properties file:

#*****************flume+Kafka***********************
agent1.channels.kafkaC.type = memory
agent1.channels.kafkaC.capacity = 100000
agent1.channels.kafkaC.transactionCapacity = 100000
agent1.channels.kafkaC.keep-alive = 20

agent1.sinks.kafkaSink.channel = kafkaC
agent1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafkaSink.brokerList = bigdata-pro01.kfk.com:9092,bigdata-pro02.kfk.com:9092,bigdata-pro03.kfk.com:9092
agent1.sinks.kafkaSink.topic = test
agent1.sinks.kafkaSink.zookeeperConnect = bigdata-pro01.kfk.com:2181,bigdata-pro02.kfk.com:2181,bigdata-pro03.kfk.com:2181
agent1.sinks.kafkaSink.requiredAcks = 1
agent1.sinks.kafkaSink.batchSize = 1
agent1.sinks.kafkaSink.serializer.class = kafka.serializer.StringEncoder
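
 To verify that events are actually reaching the test topic, you can run a small consumer against the cluster. The sketch below is an assumed verification helper, not part of the original tutorial; it uses the Kafka consumer API from kafka-clients 0.9 through 2.x, so adapt it to whichever client version your Kafka installation ships with, and the consumer group name is made up.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Collections;
import java.util.Properties;

//hypothetical helper: print whatever the Flume Kafka sink writes to the "test" topic
public class TestTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers",
                "bigdata-pro01.kfk.com:9092,bigdata-pro02.kfk.com:9092,bigdata-pro03.kfk.com:9092");
        props.put("group.id", "weblog-check");   //hypothetical consumer group
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    }
}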

 


 

 The above is the main content of this section. It reflects the blogger's own learning process, and I hope it gives you some guidance. If it is useful, I hope you will support it; if it is not useful to you, I ask for your understanding, and please point out any mistakes. If you are looking forward to more, you can follow the blogger to get updates as soon as possible. Thank you! Reprinting is also welcome, but the original address must be shown prominently in the post, and the right of interpretation belongs to the blogger!
