[Spark SQL] Introduction, integration with Hive, Spark's thriftserver2 / beeline / JDBC, using SQL in shell mode

 

Table of Contents

I. Spark SQL Introduction

II. Spark and Hive Integration

III. Spark's thriftserver2 / beeline / JDBC

IV. Using SQL in Shell Mode

I. Spark SQL Introduction

Official website: http://spark.apache.org/sql/

Documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html

 

SQL on Hadoop frameworks:

1)Spark SQL

2)Hive

3)Impala

4)Phoenix

Spark SQL is used for offline (batch) data processing, and its programming model is DF / DS (DataFrame / Dataset).

 

Spark SQL features:

1) Integrated: complex SQL queries can be mixed seamlessly with Spark programs: spark.sql("...")

2) Uniform data access: connect to Hive, Avro, Parquet, ORC, JSON, and JDBC external data sources in a uniform way: spark.read.format("...").load("...") (see the sketch after this list)

3) Hive integration: by sharing the metastore, tables and UDFs can be shared. Impala, Hive, and Spark SQL all share metadata through the metastore.

4) Standard connectivity via JDBC/ODBC: a server needs to be started, just like Hive's hiveserver2 or MySQL.
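
To make features 1) and 2) concrete, here is a minimal Scala sketch, assuming an existing SparkSession named spark and hypothetical file paths:

// Uniform data access: the same read API for different external sources (paths are hypothetical)
val jsonDF = spark.read.format("json").load("/data/people.json")
val parquetDF = spark.read.format("parquet").load("/data/people.parquet")

// Integration: SQL mixed freely with Spark code
jsonDF.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 20").show()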

 

SQL frameworks related to Spark:

1) Spark's own branch: Spark SQL

2) Hive's branch: Hive on Spark


Spark SQL interaction methods: SQL, DF / DS

DataFrame was introduced in version 1.3, and Dataset appeared in 1.6.
DataFrame = RDD + Schema
DataFrame = Dataset[Row]

DF supports development in Scala, Java, Python, and R, while DS only supports Scala and Java; since Python is not supported, there are limitations when using Python.
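
The two relations above can be seen directly in Scala; a small sketch, assuming an existing SparkSession named spark:

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// DataFrame = RDD + Schema: attach a schema to an RDD[Row]
val rdd = spark.sparkContext.parallelize(Seq(Row("zhangsan", 20), Row("lisi", 30)))
val schema = StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))
val df = spark.createDataFrame(rdd, schema)

// DataFrame = Dataset[Row]: DataFrame is just an alias for Dataset[Row]
val ds: Dataset[Row] = df
df.printSchema()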

 

II. Spark and Hive Integration

Just two steps:

1) Copy the Hive metadata configuration file into $SPARK_HOME/conf:

cd $SPARK_HOME
cd conf
cp /home/hadoop/app/hive-1.1.0-cdh5.7.0/conf/hive-site.xml .   # hive-site.xml configures the connection to the metadata

2) Add the MySQL driver jar to Spark (since the Hive metadata is stored in MySQL). There are several ways to add the jar:

1) Method 1: append it when starting spark-shell or spark-submit.
Adding --jars at startup sometimes does not propagate the jar to the driver side, so --driver-class-path is still needed:
spark-shell \
--jars /home/hadoop/lib/mysql-connector-java-5.1.43-bin.jar \
--driver-class-path /home/hadoop/lib/mysql-connector-java-5.1.43-bin.jar

When there are many packages, use --packages.

2) Method 2: cp the jar into $SPARK_HOME/jars.
3) Method 3: configure the jar in spark-defaults.conf:
When using multiple packages:
spark.executor.extraClassPath=/home/hadoop/wzq_workspace/lib/*
spark.driver.extraClassPath=/home/hadoop/wzq_workspace/lib/*
When using a single package:
spark.executor.extraClassPath=/home/hadoop/lib/mysql-connector-java-5.1.43-bin.jar
spark.driver.extraClassPath=/home/hadoop/lib/mysql-connector-java-5.1.43-bin.jar
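
Whichever of the three methods is used, a quick way to verify that the driver really is on the classpath is to try loading its class from spark-shell; a minimal check, assuming the 5.x connector (whose driver class is com.mysql.jdbc.Driver):

scala> Class.forName("com.mysql.jdbc.Driver")   // throws ClassNotFoundException if the jar was not picked up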

Note:

1) Hive's underlying storage is HDFS, so make sure the HDFS processes are running.
2) The MySQL service needs to be running.
3) To integrate Spark with Hive, the -Phive -Phive-thriftserver options must be added when compiling Spark (see the example command after these notes).
4) Because hive-site.xml was added to Spark's configuration, the default file system is HDFS.
     Otherwise it would be the local file system, and creating a table would create /user/hive/warehouse in a local directory.
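
For note 3), the build command looks roughly like the one below (run from the Spark source directory); this is only a sketch, and the profiles other than -Phive -Phive-thriftserver (yarn / hadoop version) are assumptions that must match your own environment:

./dev/make-distribution.sh --name custom --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0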

SparkSession is the entry point of a Spark SQL program. In spark-shell, the SparkSession alias is spark.
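
In a standalone application (where spark is not pre-created as in the shell), the entry point is built roughly like this; a minimal sketch with a hypothetical application name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLHiveApp")   // hypothetical name
  .enableHiveSupport()          // picks up hive-site.xml and shares the Hive metastore
  .getOrCreate()

spark.sql("show databases").show(false)
spark.stop()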

 

III. Spark's thriftserver2 / beeline / JDBC

Hive has hiveserver2, and Spark has a similar thriftserver2.
Location: $SPARK_HOME/sbin

Steps:
1) Copy ${HIVE_HOME}/conf/hive-site.xml to ${SPARK_HOME}/conf.
2) Add the MySQL driver.

Start the thriftserver service:
sbin$ ./start-thriftserver.sh --jars ~/software/XXX/mysql-connector-java-5.1.27-bin.jar
The startup output prints the location of the log file; `tail -f XX.log` shows whether the server started successfully.
The port occupied is 10000.

Start the beeline client:
bin$ ./beeline -u jdbc:hive2://localhost:10000 -n hadoop   # connect to Hive through JDBC
jdbc:hive2://localhost:10000>                              # connected successfully
jdbc:hive2://localhost:10000> show tables;
jdbc:hive2://localhost:10000> select * from emp;

Note:
Start the client that corresponds to start-thriftserver.sh with `./beeline` under `$SPARK_HOME/bin`; otherwise you may launch Hive's beeline client instead, because the environment variables configure both `HIVE_HOME/bin` and `SPARK_HOME/bin`. So be explicit about which path you use.

After the server is started, you can also connect to the service over JDBC from IDEA on Windows. There is no need to provide a username or password.
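
A sketch of such a JDBC connection in Scala, assuming the hive-jdbc dependency is on the classpath; the host, port, and table name follow the beeline example above:

import java.sql.DriverManager

// Hive's JDBC driver class, provided by the hive-jdbc dependency
Class.forName("org.apache.hive.jdbc.HiveDriver")

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("select * from emp")
while (rs.next()) {
  println(rs.getString(1))   // print the first column of each row
}
rs.close(); stmt.close(); conn.close()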

Observing the Spark UI:
1) The application name is Thrift JDBC/ODBC Server.
2) There is an extra JDBC/ODBC Server tab on the page, which displays the SQL statements and their execution plans.

 

IV. Using SQL in Shell Mode

What is actually accessed and operated on are Hive tables.

There are two ways to use SQL in shell mode:

1) Enter via spark-shell:

scala> spark.sql("show databases").show(false)
scala> spark.sql("use hive")
scala> spark.sql("show tables").show(false)
scala> spark.sql("select * from test").show(false)

2) Via spark-sql:

bin$ ./spark-sql --jars ...
spark-sql (default)> show tables;
spark-sql (default)> select * from emp;
spark-sql (default)> show databases;   # write SQL directly, ending with the ; symbol
Use exit; to quit spark-sql.
spark-sql also calls spark-submit under the hood.


Origin www.cnblogs.com/huomei/p/12093999.html