1. Use the TableOutputFormat provided by HBase: a MapReduce job writes the data into HBase through this output format.
2. Use the native HBase client API directly (a minimal sketch of this approach follows).
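For comparison, here is a minimal sketch of the second approach, written against the same 0.9x-era client API (HTable, Put.add) used by the code later in this article; the class name NativePutExample is only illustrative, and the table, column family and values match the example data used below:

package com.lisong.hdfs2hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NativePutExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath (ZooKeeper quorum, etc.)
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        try {
            Put put = new Put(Bytes.toBytes("r1"));          // row key
            put.add(Bytes.toBytes("cf"),                     // column family
                    Bytes.toBytes("c1"),                     // qualifier
                    Bytes.toBytes("value1"));                // value
            table.put(put);
        } finally {
            table.close();
        }
    }
}

This is fine for small or incremental writes; for loading a whole file, the MapReduce route below parallelizes the work.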
This article demonstrates the first approach: a MapReduce job that reads data from a file and writes it into HBase.
First start Hadoop and HBase, then create an empty table to receive the imported data:
hbase(main):006:0> create 'mytable','cf'
0 row(s) in 10.8310 seconds

=> Hbase::Table - mytable

hbase(main):007:0> list
TABLE
mytable
1 row(s) in 0.1220 seconds

=> ["mytable"]

hbase(main):008:0> scan 'mytable'
ROW                      COLUMN+CELL
0 row(s) in 0.2130 seconds
Part 1: Example Program
The following example program uses TableOutputFormat to import text data in a fixed format from HDFS into HBase.
First create the MapReduce project; the directory layout is as follows:
Hdfs2HBase/
├── classes
└── src
    ├── Hdfs2HBase.java
    ├── Hdfs2HBaseMapper.java
    └── Hdfs2HBaseReducer.java
Hdfs2HBaseMapper.java
package com.lisong.hdfs2hbase;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Hdfs2HBaseMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text line, Context context) throws IOException, InterruptedException {
        String lineStr = line.toString();
        int index = lineStr.indexOf(":");
        // The row key is everything before the first colon
        String rowkey = lineStr.substring(0, index);
        // The rest ("family:qualifier:value") is passed to the reducer unchanged
        String left = lineStr.substring(index + 1);
        context.write(new Text(rowkey), new Text(left));
    }
}
Hdfs2HBaseReducer.java
package com.lisong.hdfs2hbase;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class Hdfs2HBaseReducer extends Reducer<Text, Text, ImmutableBytesWritable, Put> {
    public void reduce(Text rowkey, Iterable<Text> value, Context context) throws IOException, InterruptedException {
        String k = rowkey.toString();
        for (Text val : value) {
            // Set the row key
            Put put = new Put(k.getBytes());
            String[] strs = val.toString().split(":");
            String family = strs[0];
            String qualifier = strs[1];
            String v = strs[2];
            // Set the column family, qualifier and value
            put.add(family.getBytes(), qualifier.getBytes(), v.getBytes());
            context.write(new ImmutableBytesWritable(k.getBytes()), put);
        }
    }
}
Hdfs2HBase.java
package com.lisong.hdfs2hbase;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Hdfs2HBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: hdfs2hbase <infile> <table>");
            System.exit(2);
        }

        Job job = new Job(conf, "hdfs2hbase");
        job.setJarByClass(Hdfs2HBase.class);
        job.setMapperClass(Hdfs2HBaseMapper.class);
        job.setReducerClass(Hdfs2HBaseReducer.class);
        // The map output types (Text/Text) differ from the job's final output types, so set them explicitly
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);
        // Write the output to an HBase table
        job.setOutputFormatClass(TableOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        // Tell TableOutputFormat which table to write to
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, otherArgs[1]);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Compile (the Hadoop and HBase jars must be on the javac classpath):
$ javac -d classes/ src/*.java
Package:
$ jar -cvf hdfs2hbase.jar -C classes .
Run
Create a data.txt file with the following content (each line has the form rowkey:family:qualifier:value, and the column family cf is the one created with the table):
r1:cf:c1:value1
r2:cf:c2:value2
r3:cf:c3:value3
Copy the file to HDFS:
$ hadoop/bin/hadoop fs -put data.txt /hbase
Add the HBase jars to the Hadoop classpath in hadoop-env.sh:
TEMP=`ls /home/hadoop/hbase/lib/*.jar`
HBASE_JARS=`echo $TEMP | sed 's/ /:/g'`
HADOOP_CLASSPATH=$HBASE_JARS
Run the MapReduce job:
$ hadoop/bin/hadoop jar Hdfs2HBase/hdfs2hbase.jar com.lisong.hdfs2hbase.Hdfs2HBase /hbase/data.txt mytable
Scan the HBase table to verify that the data was imported:
hbase(main):001:0> scan 'mytable'
ROW                      COLUMN+CELL
 r1                      column=cf:c1, timestamp=1439223857492, value=value1
 r2                      column=cf:c2, timestamp=1439223857492, value=value2
 r3                      column=cf:c3, timestamp=1439223857492, value=value3
3 row(s) in 1.3820 seconds
As you can see, the data was imported successfully.
However, because it communicates frequently with the RegionServers that store the data, TableOutputFormat consumes considerable resources; it is not an efficient way to load a large volume of data in one go.
Part 2: Extension - TableReducer
We can rewrite Hdfs2HBaseReducer.java as follows; the effect is the same:
package com.lisong.hdfs2hbase;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class Hdfs2HBaseReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    public void reduce(Text rowkey, Iterable<Text> value, Context context) throws IOException, InterruptedException {
        String k = rowkey.toString();
        for (Text val : value) {
            Put put = new Put(k.getBytes());
            String[] strs = val.toString().split(":");
            String family = strs[0];
            String qualifier = strs[1];
            String v = strs[2];
            put.add(family.getBytes(), qualifier.getBytes(), v.getBytes());
            context.write(new ImmutableBytesWritable(k.getBytes()), put);
        }
    }
}
Here the reducer extends TableReducer directly. TableReducer is a partially specialized Reducer with only three type parameters: the input key/value types correspond to the mapper's output, the output key can be any type, but the output value must be a Put or Delete instance.
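With the TableReducer variant, the driver can also be wired up through TableMapReduceUtil instead of setting the output format and table name by hand. Below is a minimal sketch that assumes the Hdfs2HBaseMapper and TableReducer-based Hdfs2HBaseReducer shown above; the driver class name Hdfs2HBaseDriver and the hard-coded table name 'mytable' are only illustrative:

package com.lisong.hdfs2hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class Hdfs2HBaseDriver {
    public static void main(String[] args) throws Exception {
        // Loads hbase-site.xml from the classpath (ZooKeeper quorum, etc.)
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hdfs2hbase-tablereducer");
        job.setJarByClass(Hdfs2HBaseDriver.class);
        job.setMapperClass(Hdfs2HBaseMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Sets the reducer class, TableOutputFormat, the target table and the
        // output key/value classes in one call, and ships the HBase dependency
        // jars with the job.
        TableMapReduceUtil.initTableReducerJob("mytable", Hdfs2HBaseReducer.class, job);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}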
Reposted from: http://www.tuicool.com/articles/jInQ3y2