Two methods for Spark to connect to HBase and write data

In earlier posts I covered integrating Spark with Hive and connecting Spark to MySQL. Today the topic is connecting Spark to HBase, which is a bit more involved. HBase is a column-oriented store, so its data model differs from the databases mentioned above, which means the records have to be converted, and the dependency jars are different as well: they come from the HBase installation's lib directory rather than from the Maven repository. The details are below.

Import the dependency packages

For Spark to integrate with HBase, copy the required jar packages from the lib directory of the HBase installation into the jars directory of Spark:
[Screenshot: the HBase jars to copy into Spark's jars directory]
Just restart Spark after copying them in.
If ZooKeeper errors are reported, you can also try copying zookeeper-3.4.6.jar into Spark.
The Spark application needs to connect to the ZooKeeper cluster, and it accesses HBase through ZooKeeper.
The ZooKeeper connection is configured on the HBaseConfiguration instance.
If it is not set, the client defaults to localhost:2181 and fails with "connection refused".
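
As a minimal sketch, configuring the connection on the HBaseConfiguration instance looks like this (the host names spark02, spark03 and spark04 are the ZooKeeper nodes used throughout the examples below):

import org.apache.hadoop.hbase.HBaseConfiguration

// Point the HBase client at the ZooKeeper quorum instead of the default localhost:2181
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "spark02,spark03,spark04")
// 2181 is the default client port
conf.set("hbase.zookeeper.property.clientPort", "2181")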

First confirm that the HBase table has been created; the table name is account. Once the above preparation is done, you can start connecting to HBase.
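
If the account table has not been created yet, it can be set up in advance. Here is a minimal sketch using the same (older) HBase client API as the examples below, assuming a single column family cf, which is the family those examples read and write:

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.HBaseAdmin

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "spark02,spark03,spark04")
conf.set("hbase.zookeeper.property.clientPort", "2181")

val admin = new HBaseAdmin(conf)
if (!admin.tableExists("account")) {
  // Create the 'account' table with the column family 'cf' used by the examples
  val desc = new HTableDescriptor(TableName.valueOf("account"))
  desc.addFamily(new HColumnDescriptor("cf"))
  admin.createTable(desc)
}
admin.close()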

Spark connects to HBase and reads data into an RDD

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark._

object SparkReadHBase {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkHBase").setMaster("local")
    val sc = new SparkContext(sparkConf)

    val tablename = "account"
    val conf = HBaseConfiguration.create()
    // Set the ZooKeeper quorum address
    conf.set("hbase.zookeeper.quorum", "spark02,spark03,spark04")
    // Set the ZooKeeper client port; 2181 is the default
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    // Tell TableInputFormat which table to read
    conf.set(TableInputFormat.INPUT_TABLE, tablename)

    // Create the table if it does not exist yet (a table needs at least one column family)
    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tablename)) {
      val tableDesc = new HTableDescriptor(TableName.valueOf(tablename))
      tableDesc.addFamily(new HColumnDescriptor("cf"))
      admin.createTable(tableDesc)
    }

    // Read the table into an RDD of (ImmutableBytesWritable, Result) pairs
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    val count = hBaseRDD.count()
    println(count)

    hBaseRDD.foreach { case (_, result) =>
      // Get the row key
      val key = Bytes.toString(result.getRow)
      // Get a column value by column family and column qualifier
      val name = Bytes.toString(result.getValue("cf".getBytes, "name".getBytes))
      val age = Bytes.toInt(result.getValue("cf".getBytes, "age".getBytes))
      println("Row key:" + key + " Name:" + name + " Age:" + age)
    }

    admin.close()
    sc.stop()
  }
}
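
newAPIHadoopRDD returns a pair RDD of (ImmutableBytesWritable, Result). As a small follow-up sketch (run before sc.stop(), and assuming the same cf:name / cf:age layout as above), the rows can also be mapped into plain Scala tuples so they can be reused like any other RDD:

val accountRDD = hBaseRDD.map { case (_, result) =>
  // Decode the row key and the two columns of each Result
  val key  = Bytes.toString(result.getRow)
  val name = Bytes.toString(result.getValue("cf".getBytes, "name".getBytes))
  val age  = Bytes.toInt(result.getValue("cf".getBytes, "age".getBytes))
  (key, name, age)
}
accountRDD.collect().foreach(println)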

Use saveAsHadoopDataset to write data to HBase

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

object SparkWriteHBaseOne {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("HBaseSpark").setMaster("local")
    val sc = new SparkContext(sparkConf)

    val conf = HBaseConfiguration.create()
    // Set the ZooKeeper quorum address
    conf.set("hbase.zookeeper.quorum", "spark02,spark03,spark04")
    // Set the ZooKeeper client port; 2181 is the default
    conf.set("hbase.zookeeper.property.clientPort", "2181")

    val tablename = "account"

    // Initialize the JobConf. TableOutputFormat must be the one from
    // the org.apache.hadoop.hbase.mapred package (the old MapReduce API)!
    val jobConf = new JobConf(conf)
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, tablename)

    val indataRDD = sc.makeRDD(Array("1,zhangsan,23", "2,Lisi,25", "3,wangwu,32"))
    val rdd = indataRDD.map(_.split(',')).map { arr =>
      /* Each Put object is one row; the row key is passed to the constructor.
       * All inserted values must be converted with org.apache.hadoop.hbase.util.Bytes.toBytes.
       * Put.add takes three arguments: column family, column qualifier, value.
       */
      val put = new Put(Bytes.toBytes(arr(0).toInt))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("age"), Bytes.toBytes(arr(2).toInt))

      // The result must be an RDD[(ImmutableBytesWritable, Put)] to call saveAsHadoopDataset
      (new ImmutableBytesWritable, put)
    }

    rdd.saveAsHadoopDataset(jobConf)
    sc.stop()
  }
}
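
Note one detail of this version: the row key is stored as a 4-byte integer (arr(0).toInt), while the second method below stores it as a plain string. The age column is written as an integer in both, which is why the read example above decodes it with Bytes.toInt.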

Use saveAsNewAPIHadoopDataset to write data to HBase

import org.apache.hadoop.hbase.client.{Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark._

object SparkWriteHBaseTwo {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("HBaseTest").setMaster("local")
    val sc = new SparkContext(sparkConf)

    val tablename = "account"

    // ZooKeeper quorum, client port and output table go into the Hadoop configuration
    sc.hadoopConfiguration.set("hbase.zookeeper.quorum", "spark02,spark03,spark04")
    sc.hadoopConfiguration.set("hbase.zookeeper.property.clientPort", "2181")
    sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tablename)

    // This TableOutputFormat is the one from org.apache.hadoop.hbase.mapreduce (the new MapReduce API)
    val job = Job.getInstance(sc.hadoopConfiguration)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Result])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    val indataRDD = sc.makeRDD(Array("1,zhangsan,23", "2,Lisi,25", "3,wangwu,32"))
    val rdd = indataRDD.map(_.split(',')).map { arr =>
      // Here the row key is written as a string, unlike the first method
      val put = new Put(Bytes.toBytes(arr(0)))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("age"), Bytes.toBytes(arr(2).toInt))
      (new ImmutableBytesWritable, put)
    }

    rdd.saveAsNewAPIHadoopDataset(job.getConfiguration())
    sc.stop()
  }
}
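
To summarize the difference between the two write methods: saveAsHadoopDataset uses the old MapReduce API, so it needs a JobConf and the TableOutputFormat from org.apache.hadoop.hbase.mapred, while saveAsNewAPIHadoopDataset uses the new MapReduce API with a Job and the TableOutputFormat from org.apache.hadoop.hbase.mapreduce. In both cases the data has to be mapped into an RDD[(ImmutableBytesWritable, Put)] before it can be saved.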

Origin blog.csdn.net/zp17834994071/article/details/108606592