HBase data migration
Usually our data comes from logs and an RDBMS. Sometimes we need to migrate that data, for example to back up RDBMS data into HBase for storage, or to move flat files into HBase. Common HBase data migration approaches include the following:
- Insert data using the Put API (a minimal sketch follows this list)
- Use the Bulk Load method to migrate massive data (detailed and demonstrated below)
- Write MapReduce programs for data migration
- Use tools such as Sqoop to extract data from RDBMS into HBase
- Use JDBC to extract data from RDBMS and store it in HBase
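To make the first option concrete, here is a minimal sketch of a Put API insert, assuming the HBase 0.98 client used throughout this article and a cluster whose hbase-site.xml is on the classpath. The class name PutApiDemo is ours for illustration; the user table and info family match the tables used later in this article. A JDBC ResultSet loop could feed the same puts, which is essentially the last option in the list.

package cn.just.hbase.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutApiDemo {
    public static void main(String[] args) throws Exception {
        // reads hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        // hypothetical target: the 'user' table with an 'info' column family
        HTable table = new HTable(conf, "user");
        try {
            Put put = new Put(Bytes.toBytes("1001")); // rowkey
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("23"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}

For batch inserts, collect the Puts into a List and call table.put(list) to cut down on round trips to the cluster.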
This article mainly explains how to integrate HBase with MapReduce to migrate data, how to use MapReduce's importtsv import tool, and how to migrate data with the Bulk Load method.
Using MapReduce
How does HBase integrate with MapReduce? The first idea that comes to mind is to add the jar packages HBase requires to the lib directory so the MapReduce program can run, and that does work, but we will not do that here. In fact, the lib directory of HBase already contains hbase-server-0.98.6-hadoop2.jar, which bundles the MapReduce programs we want to run, so we can use this jar directly. Below, we first explain how to use the MapReduce programs that come with HBase, and then write our own MapReduce program to migrate data between tables.
1. Run the built-in MapReduce programs
Run bin/hbase with no arguments in the HBase installation directory and the usage output lists two important commands, classpath and mapredcp, which are exactly what we need to run MapReduce programs.
To run a MapReduce program, you need to add the Hadoop and HBase installation directories as environment variables in the configuration file (refer to the official documentation: http://hbase.apache.org/book.html#hbase.mapreduce.classpath). Below we simply set the environment variables with the export command.
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar
Typing the above command without a program name prints the list of bundled MapReduce examples, such as rowcounter, importtsv, and completebulkload.
Next, run the rowcounter example to count the number of records in a table:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar rowcounter 'user'
From the job output you can see that there are three records in the user table.
Readers can try the other examples, such as import and export, on their own.
2. Write a MapReduce program for data migration
Application scenario: the following program migrates the name and age columns of the info column family from the user table into the basic table.
First create a basic table:
create 'basic','info'
Writing the MapReduce program is not difficult as long as you are familiar with MapReduce and consult the official documentation, which gives a detailed example. Since the code below is commented, we will not go into the details here:
package cn.just.hbase.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class UserToBasicMapreduce extends Configured implements Tool {

    // Mapper: reads the user table; key is the rowkey, value holds the row's cells
    public static class UserMapper extends TableMapper<Text, Put> {

        private Text mapOutputKey = new Text();

        @Override
        public void map(ImmutableBytesWritable key, Result value,
                Mapper<ImmutableBytesWritable, Result, Text, Put>.Context context)
                throws IOException, InterruptedException {
            // reuse the source rowkey for the target table
            String rowkey = Bytes.toString(key.get());
            mapOutputKey.set(rowkey);
            Put put = new Put(key.get());
            for (Cell cell : value.rawCells()) {
                // only keep cells from the info column family
                if ("info".equals(Bytes.toString(CellUtil.cloneFamily(cell)))) {
                    // keep the name column
                    if ("name".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
                        put.add(cell);
                    }
                    // keep the age column
                    if ("age".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
                        put.add(cell);
                    }
                }
            }
            // skip rows without name/age cells: an empty Put cannot be written to HBase
            if (!put.isEmpty()) {
                context.write(mapOutputKey, put);
            }
        }
    }

    // Reducer: writes the Puts into the basic table
    public static class BasicReducer extends TableReducer<Text, Put, ImmutableBytesWritable> {

        @Override
        public void reduce(Text key, Iterable<Put> values,
                Reducer<Text, Put, ImmutableBytesWritable, Mutation>.Context context)
                throws IOException, InterruptedException {
            for (Put put : values) {
                // TableOutputFormat ignores the output key, so null is fine here
                context.write(null, put);
            }
        }
    }

    // Driver
    public int run(String[] args) throws Exception {
        // create job
        Job job = Job.getInstance(this.getConf(), this.getClass().getSimpleName());
        // set job jar
        job.setJarByClass(this.getClass());
        // configure the scan over the source table
        Scan scan = new Scan();
        scan.setCaching(500);        // 1 is the default in Scan, which is bad for MapReduce jobs
        scan.setCacheBlocks(false);  // don't set to true for MR jobs
        // set other scan attrs as needed
        TableMapReduceUtil.initTableMapperJob(
                "user",            // input table
                scan,              // Scan instance to control CF and attribute selection
                UserMapper.class,  // mapper class
                Text.class,        // mapper output key
                Put.class,         // mapper output value
                job);
        TableMapReduceUtil.initTableReducerJob(
                "basic",             // output table
                BasicReducer.class,  // reducer class
                job);
        job.setNumReduceTasks(1);    // at least one, adjust as required
        boolean b = job.waitForCompletion(true);
        if (!b) {
            throw new IOException("error with job!");
        }
        return b ? 0 : 1;  // 0 on success, per the Tool convention
    }

    public static void main(String[] args) throws Exception {
        // load the HBase configuration (hbase-site.xml on the classpath)
        Configuration configuration = HBaseConfiguration.create();
        // submit the job
        int status = ToolRunner.run(configuration, new UserToBasicMapreduce(), args);
        // exit with the job status
        System.exit(status);
    }
}
After writing the program, export it as a jar and run the following commands to perform the migration (if your exported jar's manifest does not specify a main class, append the fully qualified class name cn.just.hbase.mapreduce.UserToBasicMapreduce after the jar path):
# userToBasicMR.jar is the jar exported from the program above
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
${HADOOP_HOME}/bin/hadoop jar \
/opt/modules/hadoop-2.5.0/jars/userToBasicMR.jar
Scan the basic table (scan 'basic' in the hbase shell) and you will find that the migration was successful.
Import TSV files using the importtsv tool
First of all, let's talk about what a TSV file is. Readers familiar with CSV files know that their fields are separated by commas; in TSV files the fields are separated by tabs ('\t').
First create a student.tsv file under /opt/datas (the directory referenced by the upload command below) and enter the following lines of data:
1001 zhangsan 23 male beijing 12457996
1002 wangwu 25 male hangzhou 12458796
1003 wangjun 20 male lanzhou 021547996
1004 zhaoliu 19 female jinan 1200246
1005 shinelon 19 male nanjing 03457996
After creating the student.tsv file, upload it to the HDFS file system:
bin/hdfs dfs -mkdir -p /user/shinelon/hbase/importtsv
bin/hdfs dfs -put /opt/datas/student.tsv /user/shinelon/hbase/importtsv/
Then create a student table:
create 'student','info'
Next, use the importtsv tool to import the file from HDFS into HBase (the columns specified in -Dimporttsv.columns below must match, in number and order, the columns of the student.tsv file created above):
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
sudo HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf \
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar importtsv \
-Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age,info:sex,info:address,info:phone \
student \
hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/hbase/importtsv
After the MapReduce job finishes, you can check the inserted data with scan 'student' in the hbase shell.
Use Bulk Load to migrate massive data
As the title suggests, Bulk Load can migrate massive amounts of data quickly and efficiently, and it puts very little load on the cluster.
Usually, a MapReduce program inserts data into HBase through TableOutputFormat: the reduce task generates Put objects and writes them to HBase directly. This approach is inefficient for large volumes of data, because HBase is accessed constantly to write the data, which has a great impact on cluster performance (long GC pauses, slow responses, node timeouts and exits, and similar chain reactions).
The Bulk Load method can store massive data quickly with very little cluster load. It exploits the fact that HBase stores its data on HDFS in the HFile format: the data is written directly to HFiles on HDFS and then loaded into HBase (the load is a move, so the HFiles disappear from their original HDFS location afterwards). The whole process is done with MapReduce, which is efficient and convenient and does not occupy region resources or add load. Bulk Load has two main advantages (a minimal Java sketch of the load step follows this list):
- Eliminate insert pressure on hbase cluster
- Improve the running speed of the job and reduce the execution time of the job
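As an aside, the load step can also be triggered from Java with the client's LoadIncrementalHFiles class instead of the completebulkload command used below. Here is a minimal sketch, assuming HBase 0.98, an existing student2 table, and HFiles already generated under /user/shinelon/hbase/hfileInput as in the commands that follow; the class name BulkLoadDemo is ours for illustration.

package cn.just.hbase.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // target table; it must already exist with the right column family
        HTable table = new HTable(conf, "student2");
        try {
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            // moves the HFiles under this directory into the table's regions
            loader.doBulkLoad(new Path("/user/shinelon/hbase/hfileInput"), table);
        } finally {
            table.close();
        }
    }
}

This is essentially what the completebulkload command shown below does under the hood.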
Let's load the student.tsv file created above into the student2 table using bulk load.
First, create the student2 table:
create 'student2','info'
Use the following command to convert the data in the TSV file (already on HDFS) into HFiles:
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
sudo HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf \
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar importtsv \
-Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age,info:sex,info:address,info:phone \
-Dimporttsv.bulk.output=/user/shinelon/hbase/hfileInput \
student2 \
hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/hbase/importtsv
Then load the HFile files into HBase:
sudo HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf \
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar \
completebulkload \
hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/hbase/hfileInput \
student2
Scan the student2 table (scan 'student2') and you can see that the data was inserted successfully.
So far, this article has introduced how to integrate HBase with MapReduce and how to use the bundled tools for data migration. If there are any shortcomings, please leave a message for discussion. If you repost this article, please respect the author's work and include a link to the original. Thank you!