HBase data migration
Usually our data comes from logs and an RDBMS. Sometimes we need to migrate that data, for example to back up RDBMS data into HBase for storage, or to move flat files into HBase. Common HBase data migration approaches include the following:
- Insert data using the Put API (a minimal sketch follows this list)
- Use the Bulk Load method to migrate massive data (detailed and demonstrated below)
- Write MapReduce programs for data migration
- Use tools such as Sqoop to extract data from RDBMS into HBase
- Use JDBC to extract data from RDBMS and store it in HBase
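To make the first option concrete, here is a minimal sketch of a Put API insert, assuming the HBase 0.98 client used throughout this article and a cluster whose hbase-site.xml is on the classpath. The class name PutApiDemo is ours for illustration; the user table and info family match the tables used later in this article. A JDBC ResultSet loop could feed the same puts, which is essentially the last option in the list.

package cn.just.hbase.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutApiDemo {
    public static void main(String[] args) throws Exception {
        // reads hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        // hypothetical target: the 'user' table with an 'info' column family
        HTable table = new HTable(conf, "user");
        try {
            Put put = new Put(Bytes.toBytes("1001")); // rowkey
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("23"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}

For batch inserts, collect the Puts into a List and call table.put(list) to cut down on round trips to the cluster.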
This article mainly explains how to integrate HBase with MapReduce to migrate data, how to use MapReduce's importtsv import tool, and how to migrate data with the Bulk Load method.
Using MapReduce
How does HBase integrate with MapReduce? The first idea that comes to mind is to add the jar packages HBase requires to the lib directory so the MapReduce program can run, and that does work, but we will not do that here. In fact, the lib directory of HBase already contains hbase-server-0.98.6-hadoop2.jar, which bundles the MapReduce programs we want to run, so we can use this jar directly. Below, we first explain how to use the MapReduce programs that come with HBase, and then write our own MapReduce program to migrate data between tables.
1. Run the built-in MapReduce programs
Run bin/hbase with no arguments in the HBase installation directory and the usage output lists two important commands, classpath and mapredcp, which are exactly what we need to run MapReduce programs.
To run a MapReduce program, you need to add the Hadoop and HBase installation directories as environment variables in the configuration file (refer to the official documentation: http://hbase.apache.org/book.html#hbase.mapreduce.classpath). Below we simply set the environment variables with the export command.
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar
Typing the above command without a program name prints the list of bundled MapReduce examples, such as rowcounter, importtsv, and completebulkload.
Next, run the rowcounter example to count the number of records in a table:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar rowcounter 'user'
From the job output you can see that there are three records in the user table.
Readers can try the other examples, such as import and export, on their own.
2. Write a MapReduce program for data migration
Application scenario: the following program migrates the name and age columns of the info column family from the user table into the basic table.
First create a basic table:
create 'basic','info'
Writing the MapReduce program is not difficult as long as you are familiar with MapReduce and consult the official documentation, which gives a detailed example. Since the code below is commented, we will not go into the details here:
package cn.just.hbase.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class UserToBasicMapreduce extends Configured implements Tool {

    // Mapper: reads the user table; key is the rowkey, value holds the row's cells
    public static class UserMapper extends TableMapper<Text, Put> {

        private Text mapOutputKey = new Text();

        @Override
        public void map(ImmutableBytesWritable key, Result value,
                Mapper<ImmutableBytesWritable, Result, Text, Put>.Context context)
                throws IOException, InterruptedException {
            // reuse the source rowkey for the target table
            String rowkey = Bytes.toString(key.get());
            mapOutputKey.set(rowkey);
            Put put = new Put(key.get());
            for (Cell cell : value.rawCells()) {
                // only keep cells from the info column family
                if ("info".equals(Bytes.toString(CellUtil.cloneFamily(cell)))) {
                    // keep the name column
                    if ("name".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
                        put.add(cell);
                    }
                    // keep the age column
                    if ("age".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
                        put.add(cell);
                    }
                }
            }
            // skip rows without name/age cells: an empty Put cannot be written to HBase
            if (!put.isEmpty()) {
                context.write(mapOutputKey, put);
            }
        }
    }

    // Reducer: writes the Puts into the basic table
    public static class BasicReducer extends TableReducer<Text, Put, ImmutableBytesWritable> {

        @Override
        public void reduce(Text key, Iterable<Put> values,
                Reducer<Text, Put, ImmutableBytesWritable, Mutation>.Context context)
                throws IOException, InterruptedException {
            for (Put put : values) {
                // TableOutputFormat ignores the output key, so null is fine here
                context.write(null, put);
            }
        }
    }

    // Driver
    public int run(String[] args) throws Exception {
        // create job
        Job job = Job.getInstance(this.getConf(), this.getClass().getSimpleName());
        // set job jar
        job.setJarByClass(this.getClass());
        // configure the scan over the source table
        Scan scan = new Scan();
        scan.setCaching(500);        // 1 is the default in Scan, which is bad for MapReduce jobs
        scan.setCacheBlocks(false);  // don't set to true for MR jobs
        // set other scan attrs as needed
        TableMapReduceUtil.initTableMapperJob(
                "user",            // input table
                scan,              // Scan instance to control CF and attribute selection
                UserMapper.class,  // mapper class
                Text.class,        // mapper output key
                Put.class,         // mapper output value
                job);
        TableMapReduceUtil.initTableReducerJob(
                "basic",             // output table
                BasicReducer.class,  // reducer class
                job);
        job.setNumReduceTasks(1);    // at least one, adjust as required
        boolean b = job.waitForCompletion(true);
        if (!b) {
            throw new IOException("error with job!");
        }
        return b ? 0 : 1;  // 0 on success, per the Tool convention
    }

    public static void main(String[] args) throws Exception {
        // load the HBase configuration (hbase-site.xml on the classpath)
        Configuration configuration = HBaseConfiguration.create();
        // submit the job
        int status = ToolRunner.run(configuration, new UserToBasicMapreduce(), args);
        // exit with the job status
        System.exit(status);
    }
}
After writing the program, export it as a jar and run the following commands to perform the migration (if your exported jar's manifest does not specify a main class, append the fully qualified class name cn.just.hbase.mapreduce.UserToBasicMapreduce after the jar path):
# userToBasicMR.jar is the jar exported from the program above
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
${HADOOP_HOME}/bin/hadoop jar \
/opt/modules/hadoop-2.5.0/jars/userToBasicMR.jar
Scan the basic table (scan 'basic' in the hbase shell) and you will find that the migration was successful.
Import TSV files using the importtsv tool
First of all, let's talk about what a TSV file is. Readers familiar with CSV files know that their fields are separated by commas; in TSV files the fields are separated by tabs ('\t').
First create a student.tsv file under /opt/datas (the directory referenced by the upload command below) and enter the following lines of data:
1001 zhangsan 23 male beijing 12457996
1002 wangwu 25 male hangzhou 12458796
1003 wangjun 20 male lanzhou 021547996
1004 zhaoliu 19 female jinan 1200246
1005 shinelon 19 male nanjing 03457996
After creating the student.tsv file, upload it to the HDFS file system:
bin/hdfs dfs -mkdir -p /user/shinelon/hbase/importtsv
bin/hdfs dfs -put /opt/datas/student.tsv /user/shinelon/hbase/importtsv/
Then create a student table:
create 'student','info'
Next, use the importtsv tool to import the file from HDFS into HBase (the columns specified in -Dimporttsv.columns below must match, in number and order, the columns of the student.tsv file created above):
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
sudo HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf \
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar importtsv \
-Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age,info:sex,info:address,info:phone \
student \
hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/hbase/importtsv
After the MapReduce job finishes, you can check the inserted data with scan 'student' in the hbase shell.
Use Bulk Load to migrate massive data
As the title suggests, Bulk Load can migrate massive amounts of data quickly and efficiently, and it puts very little load on the cluster.
Usually, a MapReduce program inserts data into HBase through TableOutputFormat: the reduce task generates Put objects and writes them to HBase directly. This approach is inefficient for large volumes of data, because HBase is accessed constantly to write the data, which has a great impact on cluster performance (long GC pauses, slow responses, node timeouts and exits, and similar chain reactions).
The Bulk Load method can store massive data quickly with very little cluster load. It exploits the fact that HBase stores its data on HDFS in the HFile format: the data is written directly to HFiles on HDFS and then loaded into HBase (the load is a move, so the HFiles disappear from their original HDFS location afterwards). The whole process is done with MapReduce, which is efficient and convenient and does not occupy region resources or add load. Bulk Load has two main advantages (a minimal Java sketch of the load step follows this list):
- Eliminate insert pressure on hbase cluster
- Improve the running speed of the job and reduce the execution time of the job
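As an aside, the load step can also be triggered from Java with the client's LoadIncrementalHFiles class instead of the completebulkload command used below. Here is a minimal sketch, assuming HBase 0.98, an existing student2 table, and HFiles already generated under /user/shinelon/hbase/hfileInput as in the commands that follow; the class name BulkLoadDemo is ours for illustration.

package cn.just.hbase.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // target table; it must already exist with the right column family
        HTable table = new HTable(conf, "student2");
        try {
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            // moves the HFiles under this directory into the table's regions
            loader.doBulkLoad(new Path("/user/shinelon/hbase/hfileInput"), table);
        } finally {
            table.close();
        }
    }
}

This is essentially what the completebulkload command shown below does under the hood.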
Let's load the student.tsv file created above into the student2 table using bulk load.
First, create the student2 table:
create 'student2','info'
Use the following command to convert the data in the TSV file (already on HDFS) into HFiles:
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
sudo HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf \
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar importtsv \
-Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age,info:sex,info:address,info:phone \
-Dimporttsv.bulk.output=/user/shinelon/hbase/hfileInput \
student2 \
hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/hbase/importtsv
Then load the HFile files into HBase:
sudo HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf \
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar \
completebulkload \
hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/hbase/hfileInput \
student2
Scan the student2 table (scan 'student2') and you can see that the data was inserted successfully.
So far, this article has introduced how to integrate HBase with MapReduce and how to use the bundled tools for data migration. If there are any shortcomings, please leave a message for discussion. If you repost this article, please respect the author's work and include a link to the original. Thank you!