Flink Demo (1): Consuming data from Kafka and writing it to HBase

1. Introduction

  This article is an extension of "How to compute real-time hot items" [1]. It only validates consuming data from Kafka with Flink and writing the processed data to HBase, without doing any specific performance tuning. It also does not analyze the Flink processing logic in depth, because the cited article (unless otherwise noted, all citations refer to "How to compute real-time hot items") explains it in great detail; this post only records the mistakes I made while debugging. If you spot errors in the text, corrections are welcome, thank you!

  The source code is on GitHub: https://github.com/L-Wg/flinkExample

  Environment: Flink 1.6 + Kafka 1.1 + HBase 1.2, OpenJDK 1.8 + Maven 3.5.2

2. Obtaining the data

  This article uses Kafka as the data source (a popular choice in industry); the data format is consistent with the cited article, and the data type is a POJO. To add a source you would normally implement the SourceFunction<T> interface, but the Flink community has already provided a Flink-Kafka connector, so we only need to add the corresponding dependency to the pom file. One point worth noting: flink-connector-kafka-*.jar has version requirements; for the specifics, see the connector page on the Flink website [2]. The code is shown below:

DataStream<UserBehaviorSchema> dataStream = env.addSource(new FlinkKafkaConsumer010<UserBehaviorSchema>(
                topic,
                new UserBehaviorSerial(),
                properties
        ).setStartFromEarliest());

  In this code you need to supply: the topic to consume, the deserialization schema for the data objects, and the configuration; bootstrap.servers can be specified in the configuration, and other settings can be added as needed. Calling setStartFromEarliest() makes Flink consume the specified topic from the beginning, which has the advantage that, as long as there is data in your Kafka topic, you do not have to write data to Kafka again for every test. Of course, do not call this method blindly; choose it according to your business needs. Flink offers many other ways to specify where consumption of the Kafka topic starts; see the official website [2].
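  For completeness, here is a minimal sketch of the configuration the snippet above expects; the broker address and group id are assumptions, and the commented lines show alternative start positions the connector offers (see [2]).

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092"); // your Kafka brokers
properties.setProperty("group.id", "hot-items-demo");          // hypothetical consumer group

FlinkKafkaConsumer010<UserBehaviorSchema> consumer =
        new FlinkKafkaConsumer010<>(topic, new UserBehaviorSerial(), properties);
consumer.setStartFromEarliest();        // read the topic from its beginning
// consumer.setStartFromLatest();       // only records arriving after the job starts
// consumer.setStartFromGroupOffsets(); // default: resume from committed group offsets

DataStream<UserBehaviorSchema> stream = env.addSource(consumer);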

3. Data conversion

  For the analysis of the Flink processing code, see the cited article; only the following points are worth noting here:

  1. If your Kafka data is randomly generated in a format of your own, do not copy the customWaterExtractor() class written in the cited code to define the watermark and timestamp: the value of currentTimeStamp may then also be random, which leads to a situation where no error is reported but the job gets stuck waiting.

  2. Keep the timestamp value at the same order of magnitude as the data in the source. The corrected assigner is shown below; a usage sketch follows this list.

// imports needed in the enclosing file:
// import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
// import org.apache.flink.streaming.api.watermark.Watermark;
// import javax.annotation.Nullable;
public static class customWaterExtractor implements AssignerWithPeriodicWatermarks<UserBehaviorSchema> {

        private static final long serialVersionUID = 298015256202705122L;

        // maximum out-of-orderness allowed for events, in milliseconds
        private final long maxOutOrderness = 3500;
        private long currentTimeStamp = Long.MIN_VALUE;

        @Nullable
        @Override
        public Watermark getCurrentWatermark() {
            return new Watermark(currentTimeStamp - maxOutOrderness);
        }

        @Override
        public long extractTimestamp(UserBehaviorSchema element, long previousElementTimestamp) {
            // Note: the timestamp is multiplied by 1000 to obtain milliseconds;
            // the field designated as the timestamp in the message must match this unit.
            long timeStamp = element.timestamp * 1000;
            currentTimeStamp = Math.max(timeStamp, currentTimeStamp);
            return timeStamp;
        }
    }

  3. The ResultEvent class that holds the returned results has a field that stores the HotTopN rank written to the sink; it defaults to zero.
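  To show where the assigner above plugs in, here is a minimal usage sketch; the event-time setup follows the standard Flink 1.6 API and is an assumption, not code quoted from the repository.

// Event time must be enabled, otherwise the periodic watermark assigner has no effect.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// Attach the assigner defined above to the Kafka stream.
DataStream<UserBehaviorSchema> timedStream =
        dataStream.assignTimestampsAndWatermarks(new customWaterExtractor());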

4. Data storage

  This article writes data to HBase by extending RichSinkFunction. The overridden invoke() method is called once for every record, while the open() and close() methods are not; this can be seen from the logs. The open() method is used to open the connection; ideally you would implement a connection pool there to avoid creating too many connections, but this demo does not implement one for the HBase connection.
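  For concreteness, here is a minimal sketch of such a sink. The table name "hotItems", the column family "info", the ZooKeeper address, and the ResultEvent field names are assumptions for illustration, not taken from the repository.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSink extends RichSinkFunction<ResultEvent> {
    private transient Connection connection;
    private transient Table table;

    @Override
    public void open(Configuration parameters) throws Exception {
        // One connection per parallel sink instance; no pooling, matching the demo.
        org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost:2181"); // adjust to your cluster
        connection = ConnectionFactory.createConnection(conf);
        table = connection.getTable(TableName.valueOf("hotItems")); // hypothetical table
    }

    @Override
    public void invoke(ResultEvent value, Context context) throws Exception {
        // Called once per record; the rowkey combines timestamp and rank (see the suggestions below).
        Put put = new Put(Bytes.toBytes(value.windowEnd + "_" + value.rank));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("itemId"),
                Bytes.toBytes(String.valueOf(value.itemId)));
        table.put(put);
    }

    @Override
    public void close() throws Exception {
        if (table != null) table.close();
        if (connection != null) connection.close();
    }
}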

  When writing data to HBase, I have two suggestions:

  1. Before writing data into HBase, it is best to pre-split the table, to avoid the performance degradation and maintenance difficulties caused by region splits later on (a sketch of pre-splitting follows this list);

  2. To speed up HBase queries, you can designate suitable fields as the rowkey of the HBase table; this article designates the timestamp and ranking as the rowkey of the results table. Using the two as secondary indexes is not discussed here.
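  As mentioned in the first suggestion, here is a minimal sketch of pre-splitting a table with the HBase 1.x Admin API; the table name, column family, and split points are assumptions chosen purely for illustration.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("hotItems"));
            desc.addFamily(new HColumnDescriptor("info"));
            // Split points keyed on the leading character of the rowkey; adjust to your key design.
            byte[][] splitKeys = {
                    Bytes.toBytes("2"), Bytes.toBytes("4"),
                    Bytes.toBytes("6"), Bytes.toBytes("8")
            };
            admin.createTable(desc, splitKeys); // the table starts with five regions instead of one
        }
    }
}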

5. Reference links

  [1] http://wuchong.me/blog/2018/11/07/use-flink-calculate-hot-items/

  [2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/connectors/kafka.html
