Migrating HBase Data to Kafka in Practice

1. Overview

In real-world scenarios, data is stored in an HBase cluster, but for one reason or another it needs to be migrated from HBase to Kafka. Normally the flow runs the other way: data sources feed Kafka, and a consumer processes the data and writes it into HBase. So how do we reverse the process and migrate data from HBase to Kafka? In this post I will share a concrete implementation.

2. Background

In a typical business scenario, the data source produces data into Kafka, and a consumer (for example Flink, Spark, or the Kafka API) then processes the data and writes it into HBase. This is a very common real-time pipeline. The flowchart is as follows:

(Flowchart: data source → Kafka → consumer (Flink / Spark / Kafka API) → HBase)

In a real-time flow like the one above, data processing is straightforward, since the data is handled stage by stage as it flows through. However, if you reverse this process, you will run into some problems.

2.1 Massive data volume

HBase is distributed and its clusters scale horizontally, so the data in an HBase table is often at the level of tens of billions or hundreds of billions of rows, or even more. For data at this scale, a reverse flow scenario immediately runs into a very troublesome problem: extraction. How do we get this massive amount of data out of HBase?

2.2 No data partitioning

We know that reading data from HBase with Get or List<Get> is fast and fairly simple. However, HBase has no partition concept like a Hive data warehouse, so it cannot directly serve "the data within a given time range". If you want to extract the most recent week of data, you may have to run a full table scan and filter rows by timestamp. With a small amount of data that may be fine, but with a huge table a full table scan is very hard on HBase.
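
To make the partitioning point concrete, here is a minimal sketch, assuming a hypothetical class name, table name, and an already-open Connection, of what "give me the last week" costs without partitions: the time range is applied server-side, but every region still has to be scanned.

import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class WeekScanSketch {

    // Without partitions, "the last week" means a full table scan plus a
    // server-side timestamp filter: every region is read, most rows are discarded.
    public static void scanLastWeek(Connection connection, String tableName) throws IOException {
        long now = System.currentTimeMillis();
        Scan scan = new Scan();
        scan.setTimeRange(now - 7 * 24 * 3600 * 1000L, now);
        try (Table table = connection.getTable(TableName.valueOf(tableName));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                // process one week's worth of rows here
            }
        }
    }
}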

3. Solutions

So how do we handle such a reverse data flow? In fact, we can build on HBase's Get and List<Get> operations. Because HBase indexes by RowKey, fetching data at the RowKey level is very fast. The implementation process is as follows:

(Flowchart: overall implementation process: Rowkey extraction, Rowkey file generation, data processing, and failure rerun)

Based on the data flow shown above, the following subsections analyze the implementation details of each step and the points to watch out for. (A minimal sketch of the batch Get lookup that the whole approach relies on follows this paragraph.)
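
As a rough illustration of why this works, here is a sketch of the List<Get> lookup. The class name, method name, and an already-open Connection are my assumptions, not part of the original post.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchGetSketch {

    // Point lookups by RowKey hit the RowKey index directly, so a batch of Gets
    // avoids the full table scan described in section 2.2.
    public static Result[] fetchByRowkeys(Connection connection, String tableName, List<String> rowkeys)
            throws IOException {
        List<Get> gets = new ArrayList<>();
        for (String rowkey : rowkeys) {
            gets.add(new Get(Bytes.toBytes(rowkey)));
        }
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            return table.get(gets); // batched lookup, grouped by region server internally
        }
    }
}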

3.1 Rowkey extraction

We know that HBase indexes by Rowkey and serves Rowkey lookups quickly, so we can start from this property: extract the Rowkeys of the massive table from HBase, and then store the extracted Rowkeys on HDFS according to the extraction and storage rules we define.

One thing to note: for extracting Rowkeys from a table of this size, MapReduce is the recommended approach. The benefit is that HBase provides the TableMapReduceUtil class to support this. In the map phase of the MapReduce job the Rowkeys are filtered by the specified time range, in the reduce phase the Rowkeys are split into multiple files, and the files are finally stored on HDFS.

Some readers may wonder: since MapReduce is used to extract the Rowkeys anyway, why not process the column family data directly in the same scan? The point is that when the MapReduce job scans HBase, it filters for the Rowkey only (implemented with FirstKeyOnlyFilter) and does not touch the column family data at all. This is much faster, and the pressure on the HBase RegionServers is much smaller. Consider a table like the following:

Row      Column
row001   info:name
row001   info:age
row001   info:sex
row001   info:sn

For a table like this, all we actually need is the Rowkey (row001). In real business data, an HBase row may describe many attributes (name, gender, age, ID number, and so on), possibly a dozen or more columns under a column family, yet it still has only one Rowkey, and that single Rowkey is all we need. This makes FirstKeyOnlyFilter a very good fit.

/**
 * A filter that will only return the first KV from each row.
 * <p>
 * This filter can be used to more efficiently perform row count operations.
 */

This is the functional description of FirstKeyOnlyFilter: it returns only the first key-value pair of each row. Officially it is meant to make row count operations more efficient; here we repurpose it slightly and use FirstKeyOnlyFilter to extract the Rowkeys.

3.2 Rowkey generation

How should the extracted Rowkeys be written out? The number of Reduce tasks should be decided by the actual data volume, since it determines how many Rowkey files are generated. Do not dump everything into a single HDFS file for convenience; that makes later maintenance painful. For example, if the HBase table is about 100 GB, we can split the Rowkeys into 100 files.
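
A sketch of what that looks like in the MRROW2HDFS driver from Section 4 (the reducer count of 100 is just the example figure above, not a fixed rule):

// In the MRROW2HDFS driver below, before submitting the job:
job.setNumReduceTasks(100); // 100 reduce tasks -> 100 rowkey files (part-r-00000 ... part-r-00099)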

3.3 Data Processing

In step 1, following the extraction and storage rules, the Rowkeys were extracted from HBase by MapReduce and stored on HDFS. Next, another MapReduce job reads the Rowkey files on HDFS and fetches the corresponding data from HBase via List<Get>. The breakdown is as follows:

(Flowchart: the Map stage reads Rowkey files from HDFS and batch Gets from HBase; the Reduce stage writes the data to Kafka and records success and failure Rowkeys on HDFS)

In the Map stage, we read the Rowkey files from HDFS, fetch the data from HBase in batches via Get, and then hand the assembled data to the Reduce stage. In the Reduce stage, we take the data from the Map stage and write it to Kafka. Through the Kafka producer callback we obtain the status of each write and use it to decide whether the write succeeded. On success, the Rowkey is recorded to a "success" file on HDFS, which makes it easy to track progress; on failure, the Rowkey is recorded to a "failed" file on HDFS, which makes it easy to account for failures.
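
The reduce side could look roughly like the sketch below. The class name, topic name, and broker address are my assumptions; the callback bodies only mark where the success and failure Rowkeys would be recorded (for example via MultipleOutputs), since that part depends on how the HDFS output is organized.

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class Row2KafkaReducer extends Reducer<Text, Text, Text, NullWritable> {

    private Producer<String, String> producer;

    @Override
    protected void setup(Context context) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka01:9092"); // assumption: your broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    @Override
    protected void reduce(Text rowkey, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            String rk = rowkey.toString();
            // The producer callback reports per-record status: metadata on success,
            // an exception on failure.
            producer.send(new ProducerRecord<>("hbase_topic", rk, value.toString()),
                    (metadata, exception) -> {
                        if (exception == null) {
                            // record rk to the "success" rowkey file on HDFS
                        } else {
                            // record rk to the "failed" rowkey file on HDFS for step 3.4
                        }
                    });
        }
    }

    @Override
    protected void cleanup(Context context) {
        producer.close(); // flushes any records still in flight
    }
}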

3.4 Rerunning failures

When the MapReduce job writes to Kafka, some writes may fail. In that case we only need the failed Rowkeys that were recorded on HDFS. After the job finishes, check whether a failed-Rowkey file exists on HDFS. If it does, start step 3 again: read the failed Rowkeys from HDFS, fetch the data from HBase via List<Get>, process it, and write it to Kafka, and so on, until all failed Rowkeys on HDFS have been processed.
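
A sketch of that check, assuming the failed Rowkeys were written to a known HDFS path (the path handling and class name are mine):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FailedRowkeyCheck {

    // If a failed-rowkey file exists on HDFS, step 3 is started again with that
    // file as input; otherwise the migration run is complete.
    public static boolean hasFailedRowkeys(String failedRowkeyPath) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        return fs.exists(new Path(failedRowkeyPath));
    }
}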


4. Implementation code

The code here is not complicated. Pseudo-code is provided below and can be adapted as needed (for example, the Rowkey extraction, the MapReduce job that reads the Rowkey files and batch Gets from the HBase table, and the write to Kafka). The sample code for the Rowkey extraction job is as follows:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRROW2HDFS {

    public static void main(String[] args) throws Exception {

        Configuration config = HBaseConfiguration.create(); // HBase config (hbase-site.xml on the classpath)
        Job job = Job.getInstance(config, "MRROW2HDFS");
        job.setJarByClass(MRROW2HDFS.class);
        job.setReducerClass(ROWReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        String hbaseTableName = "hbase_tbl_name";

        Scan scan = new Scan();
        scan.setCaching(1000);
        scan.setCacheBlocks(false); // do not pollute the block cache with a one-off scan
        scan.setFilter(new FirstKeyOnlyFilter()); // return only the first KV of each row

        TableMapReduceUtil.initTableMapperJob(hbaseTableName, scan, ROWMapper.class, Text.class, Text.class, job);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/rowkey.list")); // HDFS output path for the rowkey files
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }

    public static class ROWMapper extends TableMapper<Text, Text> {

        @Override
        protected void map(ImmutableBytesWritable key, Result value,
                Mapper<ImmutableBytesWritable, Result, Text, Text>.Context context)
                throws IOException, InterruptedException {

            for (Cell cell : value.rawCells()) {
                // Optionally filter by date range here, e.g. on cell.getTimestamp(),
                // then emit only the rowkey.
                context.write(new Text(Bytes.toString(CellUtil.cloneRow(cell))), new Text(""));
            }
        }
    }

    public static class ROWReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Each key is a rowkey; writing it out produces the rowkey files on HDFS.
            for (Text val : values) {
                result.set(val);
                context.write(key, result);
            }
        }
    }
}

5. Summary

The whole reverse data flow is not complicated, and the MapReduce logic involved is fairly basic. A few details deserve attention during processing. The Rowkey files generated on HDFS may contain stray whitespace; when reading them back to build the List<Get>, it is best to trim each line. Also, record both the successfully processed Rowkeys and the failed Rowkeys, so that failed tasks can be rerun and the results reconciled, and so that the progress and completion of the migration are always known. At the same time, we can use the Kafka Eagle monitoring tool to watch the write progress on the Kafka side.
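
For the whitespace point, a tiny helper sketch (name and shape are mine) that trims each Rowkey line and drops blanks before any Get is built:

import java.util.Optional;

public final class RowkeyLines {

    // Trims stray whitespace from a rowkey line read off HDFS and drops blank
    // lines, so no empty Get is ever sent to HBase.
    public static Optional<String> clean(String line) {
        String rowkey = line == null ? "" : line.trim();
        return rowkey.isEmpty() ? Optional.empty() : Optional.of(rowkey);
    }
}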

6. Conclusion

That is all for this post. If you run into any questions while studying or doing your own research, you can join the discussion group or send me an email, and I will do my best to answer your questions. Let us keep encouraging each other!

In addition, I have written the books "Kafka Is Not Difficult to Learn" and "Hadoop Big Data Mining: From Beginner to Advanced Practice". Friends and classmates who like them can buy them through the purchase links on my bulletin board; thank you for your support. Follow the public account below and follow the prompts to receive free instructional videos for the books.

Origin www.cnblogs.com/smartloli/p/11521659.html