Connecting Spark Streaming to Hive and HBase

Background

This is a record of connecting Spark to Hive and HBase a while ago. The main point is making sure the hostname mapping on the host machine and on the virtual machine stays consistent.

Steps

1. First, make sure the hostname mapped to the IP you want to connect to is consistent across the Windows hosts file, the CentOS hosts file, and the CentOS hostname file.

For example, the IP I want to connect to is 192.168.57.141, so the corresponding entry in C:\Windows\System32\drivers\etc\hosts on my Windows machine is

192.168.57.141 scentos

The corresponding content of /etc/hosts in the virtual machine is as follows (note that the localhost lines below must not be left out, otherwise Windows still cannot connect):

192.168.57.141 scentos
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

The corresponding content in /etc/hostname in the virtual machine is

scentos

The changes to these two files in the virtual machine take effect after a reboot. You can also use the following command to change the hostname temporarily:

[root@scentos spark-2.1.0]# hostname scentos
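
A quick way to confirm the mapping works is to resolve the name from a Windows command prompt; if it resolves to 192.168.57.141 and gets replies, the hosts entries are in place:

C:\> ping scentos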

2. Copy hive-site.xml into the conf directory under the Spark installation directory and turn off the Tez engine.
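
Turning off Tez here just means making sure Hive's execution engine is not set to tez; a minimal sketch of the relevant hive-site.xml property, assuming the MapReduce engine is used instead (leave the rest of your hive-site.xml unchanged):

<property>
    <name>hive.execution.engine</name>
    <value>mr</value>
</property>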

3. Check Hive's SDS and DBS tables in the MySQL metastore database. If HDFS locations were previously stored using localhost, change localhost to the real IP:

mysql> update SDS set LOCATION=REPLACE(LOCATION,'hdfs://localhost:8020/user/hive/warehouse','hdfs://192.168.57.141:8020/user/hive/warehouse');
mysql> update DBS set DB_LOCATION_URI=REPLACE(DB_LOCATION_URI,'hdfs://localhost:8020/user/hive/warehouse','hdfs://192.168.57.141:8020/user/hive/warehouse');
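
Before running the updates, it can help to look at what the metastore currently stores; a quick check against the standard metastore schema:

mysql> select SD_ID, LOCATION from SDS limit 10;
mysql> select DB_ID, DB_LOCATION_URI from DBS;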

4. Copy hive-site.xml, core-site.xml, and hdfs-site.xml into the project's resources directory in IDEA; as in step 2, the Tez engine should be turned off in the copied hive-site.xml.
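
For a standard Maven project opened in IDEA, this usually means the files end up under src/main/resources (a layout sketch; adjust the path if your project keeps resources elsewhere):

src/main/resources/core-site.xml
src/main/resources/hdfs-site.xml
src/main/resources/hive-site.xml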

5. Add the following dependencies to the pom.xml of the IDEA project:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-hbase-handler</artifactId>
    <version>2.3.5</version>
</dependency>
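
Since the code in the next step registers Kafka's ConsumerRecord with Kryo, the project is assumed to also pull in the Spark Streaming Kafka integration; if yours does not already, a dependency along these lines would be needed (artifact and version are assumptions matching Spark 2.1.0 on Scala 2.11):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.1.0</version>
</dependency>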

6. After starting the Spark cluster, the Hive metastore, and the HBase cluster, write the following code in the project, replacing zookeeperIp, zookeeperPort, and hbaseMasterURL with your own ZooKeeper address, ZooKeeper port, and HBase master address:

        // Spark configuration: Kryo serialization plus the ZooKeeper/HBase connection settings
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("ActionConsumer")
                .set("spark.serializer", KryoSerializer.class.getCanonicalName())
                .registerKryoClasses(new Class[]{ConsumerRecord.class})
                .set("spark.kryoserializer.buffer.max", "512m")
                .set("hbase.zookeeper.quorum", zookeeperIp)
                .set("hbase.zookeeper.property.clientPort", zookeeperPort)
                .set("hbase.master", hbaseMasterURL);

        // SparkSession with Hive support enabled, so Hive tables are visible to Spark SQL
        SparkSession session = SparkSession
                .builder()
                .config(conf)
                .enableHiveSupport()
                .getOrCreate();

        // Query a Hive table and print the result
        Dataset<Row> rawData = session.sql("select * from profile");
        rawData.show();

Replace the argument of session.sql() with your own SQL statement, then compile and run.
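
If the Hive table being queried (profile here) is backed by HBase, which is what the hive-hbase-handler dependency and the ZooKeeper/HBase settings above are for, it would typically have been created in Hive with the HBase storage handler. A sketch of such a definition, with hypothetical column names and column-family mapping:

CREATE EXTERNAL TABLE profile(key string, name string, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:age")
TBLPROPERTIES ("hbase.table.name" = "profile");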

Origin blog.csdn.net/qq_37475168/article/details/107895729