1. Overview of Spark SQL
1) Spark SQL is part of the core functionality of Spark and was released with Spark 1.0 in April 2014.
2) Spark SQL can directly run SQL or HiveQL statements.
3) BI tools can connect to Spark SQL through JDBC to query data.
4) Spark SQL supports the Python, Scala, Java, and R languages.
5) Spark SQL is not just SQL.
6) Spark SQL is far more powerful than SQL.
7) The data structures that Spark SQL processes
8) Introduction to Spark SQL
Spark SQL is a Spark module for structured data processing
9) Vision of Spark SQL
a) Write less code
Use a unified interface to read and write different data formats.
b) Read less data
The most effective way to speed up big-data processing is to skip irrelevant data.
(1) Use columnar formats, such as Parquet, ORC, or RCFile.
(2) Use partition pruning, e.g., partitioning by day or by hour.
(3) Use the statistics attached to each data file for pruning. For example, each data segment carries statistics such as its maximum value, minimum value, and NULL count; when a segment cannot possibly contain data matching the query condition, it can be skipped entirely. (For instance, if the maximum value of the age field in a segment is 20 and the query condition is age > 50, that segment can clearly be skipped.)
(4) Push information from the query down to the data source, so that the data source's own optimization capabilities can be fully used for pruning and filter pushdown.
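Item (3) above can be sketched in plain Scala (no Spark required). This is an illustrative toy, not Spark's internal code: `SegmentStats` and the function names here are assumptions made up for the example. Each segment carries min/max statistics, and a segment is skipped when its value range cannot satisfy the predicate.

```scala
// Illustrative sketch of statistics-based segment pruning (not Spark's actual implementation).
case class SegmentStats(min: Int, max: Int)

// A segment may contain rows with value > threshold only if its max exceeds the threshold.
def mayContainGreaterThan(stats: SegmentStats, threshold: Int): Boolean =
  stats.max > threshold

object PruneDemo {
  def main(args: Array[String]): Unit = {
    val segments = Seq(
      SegmentStats(min = 1,  max = 20),   // all ages <= 20: prunable for age > 50
      SegmentStats(min = 30, max = 60),   // may contain ages > 50
      SegmentStats(min = 55, max = 80)    // may contain ages > 50
    )
    // Query: age > 50 — segments whose max <= 50 are skipped without being read.
    val toScan = segments.filter(s => mayContainGreaterThan(s, 50))
    println(toScan.size) // 2
  }
}
```

The same idea underlies min/max indexes in Parquet and ORC: the predicate is compared against per-segment statistics before any row data is touched.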
c) Let the optimizer do the hard work
The Catalyst optimizer rewrites SQL statements into more efficient execution plans. Even if we do not consider these optimizations when writing SQL, Catalyst can still produce well-optimized plans for us.
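As a rough analogy for what a rule-based optimizer like Catalyst does, a rule can rewrite an expression tree into an equivalent but cheaper one, e.g., folding constant sub-expressions. This is a toy model in plain Scala; Catalyst's real classes live in `org.apache.spark.sql.catalyst` and the names below are invented for illustration.

```scala
// Toy expression tree and a constant-folding rewrite rule (illustrative only).
sealed trait Expr
case class Lit(v: Int) extends Expr
case class Col(name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// One optimizer "rule": applied bottom-up, it replaces Add(Lit, Lit) with a single Lit.
def foldConstants(e: Expr): Expr = e match {
  case Add(l, r) =>
    (foldConstants(l), foldConstants(r)) match {
      case (Lit(a), Lit(b)) => Lit(a + b)      // fold the constant sub-expression
      case (fl, fr)         => Add(fl, fr)     // otherwise keep the structure
    }
  case other => other
}

object CatalystToy {
  def main(args: Array[String]): Unit = {
    // col("age") + (1 + 2)  is rewritten to  col("age") + 3
    val plan = Add(Col("age"), Add(Lit(1), Lit(2)))
    println(foldConstants(plan)) // Add(Col(age),Lit(3))
  }
}
```

Catalyst applies many such rules (constant folding, predicate pushdown, column pruning, and more) repeatedly until the plan stops changing.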
2. Spark SQL service architecture
3. Spark SQL and Hive integration (spark-shell)
1) Items that need to be configured
a) Copy the Hive configuration file hive-site.xml to Spark's conf directory, and add the metastore URI configuration (pointing at the node where Hive is installed; my cluster has 3 nodes).
vi hive-site.xml
<property>
<name>hive.metastore.uris</name>
<value>thrift://bigdata-pro03.kfk.com:9083</value>
</property>
After modifying it, distribute it to the other nodes:
scp hive-site.xml bigdata-pro01.kfk.com:/opt/modules/spark-2.2.0-bin/conf/
scp hive-site.xml bigdata-pro02.kfk.com:/opt/modules/spark-2.2.0-bin/conf/
b) Copy the MySQL JDBC jar from Hive to Spark's jars directory, then send it to the other nodes
cp hive-0.13.1-bin/lib/mysql-connector-java-5.1.27-bin.jar spark-2.2-bin/jars/
scp mysql-connector-java-5.1.27-bin.jar bigdata-pro01.kfk.com:/opt/modules/spark-2.2.0-bin/jars/
scp mysql-connector-java-5.1.27-bin.jar bigdata-pro02.kfk.com:/opt/modules/spark-2.2.0-bin/jars/
c) Check the configuration in the spark-env.sh file; add it if it is missing, otherwise skip this step
vi spark-env.sh
HADOOP_CONF_DIR=/opt/modules/hadoop-2.6.0/etc/hadoop
2) Start the service
a) Check whether MySQL is started
# check status
service mysqld status
# start it if needed
service mysqld start
b) Start the Hive metastore service
bin/hive --service metastore
c) Start Hive
bin/hive
show databases;
create database kfk;
use kfk;
create table if not exists test(userid string, username string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS textfile;
load data local inpath "/opt/datas/kfk.txt" into table test;
Contents of the local kfk.txt file:
more /opt/datas/kfk.txt
0001 spark
0002 hive
0003 hbase
0004 hadoop
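The DDL above declares fields terminated by a single space, so each line of kfk.txt splits into (userid, username). A plain-Scala sketch of that parsing (illustrative only; Hive's delimited SerDe does this internally, and `KfkParse` is a made-up name):

```scala
// Parse space-delimited lines like those in kfk.txt into (userid, username) pairs.
object KfkParse {
  def parseLine(line: String): (String, String) = {
    // Split on the first space only, mirroring FIELDS TERMINATED BY ' ' with two columns.
    val Array(userid, username) = line.split(" ", 2)
    (userid, username)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("0001 spark", "0002 hive", "0003 hbase", "0004 hadoop")
    lines.map(parseLine).foreach(println) // (0001,spark) ... (0004,hadoop)
  }
}
```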
d) Start spark-shell
bin/spark-shell
spark.sql("select * from kfk.test").show
+------+--------+
|userid|username|
+------+--------+
| 0001| spark|
| 0002| hive|
| 0003| hbase|
| 0004| hadoop|
+------+--------+
4. Spark SQL and Hive integration (spark-sql)
Start spark-sql:
bin/spark-sql
# list databases
show databases;
default
kfk
# switch to the kfk database
use kfk;
# list tables
show tables;
test
# query the table data
select * from test;
5. Use of ThriftServer and beeline in Spark SQL
With beeline, a single ThriftServer application can serve multiple users at the same time, instead of starting a separate application per user, which saves resources.
1) Start ThriftServer
sbin/start-thriftserver.sh
2) Start beeline
[kfk@bigdata-pro02 spark-2.2.0-bin]$ bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://bigdata-pro02.kfk.com:10000
Connecting to jdbc:hive2://bigdata-pro02.kfk.com:10000
Enter username for jdbc:hive2://bigdata-pro02.kfk.com:10000: kfk
Enter password for jdbc:hive2://bigdata-pro02.kfk.com:10000: ***
19/04/18 17:56:52 INFO Utils: Supplied authorities: bigdata-pro02.kfk.com:10000
19/04/18 17:56:52 INFO Utils: Resolved authority: bigdata-pro02.kfk.com:10000
19/04/18 17:56:52 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://bigdata-pro02.kfk.com:10000
Connected to: Spark SQL (version 2.2.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://bigdata-pro02.kfk.com:10000>
# list databases
show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
| kfk |
+----------------+--+
2 rows selected (0.977 seconds)
# query the table data
select * from kfk.test;
+--------------+----------------+--+
| test.userid | test.username |
+--------------+----------------+--+
| 0001 | spark |
| 0002 | hive |
| 0003 | hbase |
| 0004 | hadoop |
+--------------+----------------+--+
4 rows selected (1.063 seconds)
6. Spark SQL and MySQL integration
Start spark-shell:
bin/spark-shell
:paste
val jdbcDF = spark
.read
.format("jdbc")
.option("url", "jdbc:mysql://bigdata-pro01.kfk.com:3306/test")
.option("dbtable", "spark1")
.option("user", "root")
.option("password", "root")
.load()
Press Ctrl+D to exit paste mode and evaluate the block.
# print the data that was read
jdbcDF.show
+------+--------+
|userid|username|
+------+--------+
| 0001| spark|
| 0002| hive|
| 0003| hbase|
| 0004| hadoop|
+------+--------+
7. Spark SQL and HBase integration
The core of the Spark SQL and HBase integration is that Spark SQL reads HBase table data through Hive external tables.
1) Copy the HBase jars and the Hive jars to Spark's jars directory
2) Start Hbase
bin/start-hbase.sh
3) Start Hive
bin/hive
4) start spark-shell
bin/spark-shell
val df = spark.sql("select * from weblogs limit 10")
df.show
If a NoClassDefFoundError is thrown at this step, refer to the blog post on the Spark-HBase integration error: java.lang.NoClassDefFoundError: org/htrace/Trace
At this point, the integration of Spark and HBase is complete!