Fast offline data analysis with Spark SQL

 


1. Overview of Spark SQL

1) Spark SQL is one of Spark's core modules; it was released as part of Spark 1.0 in April 2014.

http://ke.dajiangtai.com/content/6918/1.png

2) Spark SQL can directly run SQL or HiveQL statements

http://ke.dajiangtai.com/content/6918/2.png

3) BI tools can connect to Spark SQL over JDBC to query data

http://ke.dajiangtai.com/content/6918/3.png

4) Spark SQL supports Python, Scala, Java and R languages

http://ke.dajiangtai.com/content/6918/4.png

5) Spark SQL is not just SQL

http://ke.dajiangtai.com/content/6918/5.png

6) Spark SQL is far more powerful than SQL

http://ke.dajiangtai.com/content/6918/6.png

7) The data structures Spark SQL processes

http://ke.dajiangtai.com/content/6918/7.png
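
As a rough illustration: in Spark 2.x the structures Spark SQL processes are untyped DataFrames and typed Datasets. A minimal Scala sketch for spark-shell follows; the Person case class and the sample rows are just placeholders for the example.

import spark.implicits._                       // already available in spark-shell

case class Person(userid: String, username: String)   // hypothetical schema

// A typed Dataset and an untyped DataFrame built from the same data.
val ds = Seq(Person("0001", "spark"), Person("0002", "hive")).toDS()
val df = ds.toDF()

df.printSchema()                               // DataFrame: rows with a schema
ds.filter(_.username == "spark").show()        // Dataset: typed operations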

8) Introduction to Spark SQL

Spark SQL is a Spark module for structured data processing

http://ke.dajiangtai.com/content/6918/8.png

9) Vision of Spark SQL

a) Write less code

Use a unified interface to read and write data in different formats.

http://ke.dajiangtai.com/content/6918/9.png
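
For example, reading and writing different formats goes through the same read/write interface; a minimal Scala sketch for spark-shell (the file paths below are placeholders):

// The same DataFrameReader / DataFrameWriter interface; only the format changes.
val jsonDF    = spark.read.format("json").load("/opt/datas/people.json")
val parquetDF = spark.read.format("parquet").load("/opt/datas/people.parquet")

// Writing uses the same unified interface.
jsonDF.write.format("parquet").save("/opt/datas/people_out.parquet")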

b) Read less data

The most effective way to increase the processing speed of big data is to ignore irrelevant data.

(1) Use columnar formats, such as Parquet, ORC, RCFile

(2) Use partition pruning, for example partitioning by day or by hour.

(3) Use the statistics stored with the data files for pruning: each data segment carries statistics such as its maximum value, minimum value and NULL count, so a segment that clearly cannot contain data matching the query condition can be skipped entirely. (For example, if the maximum value of the age field in a segment is 20 and the query condition is age > 50, that segment can obviously be skipped.)

(4) Push the query's filters and other information down to the data source, so that the data source's own optimization capabilities can be used for pruning and filter-condition pushdown (a sketch follows this list).
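
A rough sketch of points (1), (2) and (4) in spark-shell (the data set, its day partitioning and its columns are assumed for the example): reading a Parquet data set partitioned by day and filtering on the partition column lets Spark prune partitions and push the remaining predicate down to the Parquet reader, which explain() makes visible.

// Assumed: a Parquet data set partitioned by "day", with "age" and "userid" columns.
val logs = spark.read.parquet("/user/hive/warehouse/logs_parquet")

// Partition pruning on "day" plus pushdown of "age > 50" to the Parquet scan.
val result = logs.filter($"day" === "2019-04-18" && $"age" > 50).select("userid")

// The physical plan lists PartitionFilters and PushedFilters for the scan.
result.explain()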

c) Let the optimizer do the hard work

The Catalyst optimizer rewrites SQL statements into more efficient execution plans. Even if we do not think about these optimizations when writing SQL, Catalyst still produces a well-optimized plan for us.

http://ke.dajiangtai.com/content/6918/10.png
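
A quick way to see Catalyst at work in spark-shell is explain(true), which prints the parsed, analyzed and optimized logical plans plus the physical plan. In the sketch below (reusing the assumed logs data set from the previous sketch), Catalyst merges the two filter() calls into a single predicate and prunes the columns that are not selected.

// explain(true) shows all plan phases: parsed, analyzed, optimized, physical.
logs.filter($"age" > 50)
    .filter($"day" === "2019-04-18")
    .select("userid")
    .explain(true)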

2. Spark SQL service architecture

http://ke.dajiangtai.com/content/6918/11.png

3. Spark SQL and Hive integration (spark-shell)

1) Items that need to be configured

a) Copy the Hive configuration file hive-site.xml into Spark's conf directory and add the metastore URI configuration (it points to the node where the Hive metastore runs; my cluster has 3 nodes).

vi hive-site.xml

<property>
        <name>hive.metastore.uris</name>
        <value>thrift://bigdata-pro03.kfk.com:9083</value>
</property>

After modifying it, copy it to the other nodes:

scp hive-site.xml bigdata-pro01.kfk.com:/opt/modules/spark-2.2.0-bin/conf/

scp hive-site.xml bigdata-pro02.kfk.com:/opt/modules/spark-2.2.0-bin/conf/

b) Copy the MySQL JDBC driver jar from Hive's lib directory into Spark's jars directory, then copy it to the other nodes

cp hive-0.13.1-bin/lib/mysql-connector-java-5.1.27-bin.jar spark-2.2.0-bin/jars/

scp mysql-connector-java-5.1.27-bin.jar bigdata-pro01.kfk.com:/opt/modules/spark-2.2.0-bin/jars/

scp mysql-connector-java-5.1.27-bin.jar bigdata-pro02.kfk.com:/opt/modules/spark-2.2.0-bin/jars/

c) Check the following configuration item in spark-env.sh; add it if it is missing, and skip this step if it is already there.

vi spark-env.sh

HADOOP_CONF_DIR=/opt/modules/hadoop-2.6.0/etc/hadoop

2) Start the service

a) Check whether MySQL is running

# check status
service mysqld status

# start
service mysqld start

b) Start the Hive metastore service

bin/hive --service metastore

c) Start Hive

bin/hive

show databases;

create database kfk;

use kfk;

create table if not exists test(userid string, username string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS textfile;

load data local inpath "/opt/datas/kfk.txt" into table test;

Contents of the local kfk.txt file:

more /opt/datas/kfk.txt

0001 spark
0002 hive
0003 hbase
0004 hadoop

d) Start spark-shell

bin/spark-shell

spark.sql("select * from kfk.test").show

+------+--------+
|userid|username|
+------+--------+
|  0001|   spark|
|  0002|    hive|
|  0003|   hbase|
|  0004|  hadoop|
+------+--------+
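
The same Hive table can also be queried through the DataFrame API instead of a SQL string; a small sketch, still in spark-shell:

// Equivalent query on the Hive table kfk.test through the DataFrame API.
val test = spark.table("kfk.test")
test.filter($"username" === "spark").select("userid", "username").show()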

4. Spark SQL and Hive integration (spark-sql)

Start spark-sql:

bin/spark-sql

# list databases
show databases;

default
kfk

# use the database
use kfk;

# list tables
show tables;

test

# view table data
select * from test;

5. Use of ThriftServer and beeline in Spark SQL

With the ThriftServer and beeline, a single Spark application can serve multiple users at the same time, instead of starting a separate application for each user, which saves resources.

1) Start ThriftServer

sbin/start-thriftserver.sh

2) Start beeline

[kfk@bigdata-pro02 spark-2.2.0-bin]$ bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://bigdata-pro02.kfk.com:10000
Connecting to jdbc:hive2://bigdata-pro02.kfk.com:10000
Enter username for jdbc:hive2://bigdata-pro02.kfk.com:10000: kfk
Enter password for jdbc:hive2://bigdata-pro02.kfk.com:10000: ***
19/04/18 17:56:52 INFO Utils: Supplied authorities: bigdata-pro02.kfk.com:10000
19/04/18 17:56:52 INFO Utils: Resolved authority: bigdata-pro02.kfk.com:10000
19/04/18 17:56:52 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://bigdata-pro02.kfk.com:10000
Connected to: Spark SQL (version 2.2.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://bigdata-pro02.kfk.com:10000>

# list databases
show databases;

+----------------+--+
| database_name  |
+----------------+--+
| default        |
| kfk            |
+----------------+--+
2 rows selected (0.977 seconds)

# view table data
select * from kfk.test;

+--------------+----------------+--+
| test.userid  | test.username  |
+--------------+----------------+--+
| 0001         | spark          |
| 0002         | hive           |
| 0003         | hbase          |
| 0004         | hadoop         |
+--------------+----------------+--+
4 rows selected (1.063 seconds)
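
Besides beeline, any JDBC client can talk to the same ThriftServer, which is how BI tools connect. A hedged Scala sketch follows, assuming the Hive JDBC driver (hive-jdbc 1.2.1.spark2 and its dependencies) is on the classpath; the password is a placeholder.

import java.sql.DriverManager

// Register the Hive JDBC driver and connect like beeline does above.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://bigdata-pro02.kfk.com:10000", "kfk", "yourPassword")  // password is a placeholder
val stmt = conn.createStatement()
val rs = stmt.executeQuery("select * from kfk.test")
while (rs.next()) {
  println(rs.getString("userid") + "\t" + rs.getString("username"))
}
rs.close(); stmt.close(); conn.close()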

6. Spark SQL and MySQL integration

Start spark-shell:

bin/spark-shell

:paste

val jdbcDF = spark
  .read
  .format("jdbc")
  .option("url", "jdbc:mysql://bigdata-pro01.kfk.com:3306/test")
  .option("dbtable", "spark1")
  .option("user", "root")
  .option("password", "root")
  .load()

Ctrl+D (exit paste mode and run the pasted code)

# show the loaded data
jdbcDF.show

+------+--------+
|userid|username|
+------+--------+
|  0001|   spark|
|  0002|    hive|
|  0003|   hbase|
|  0004|  hadoop|
+------+--------+
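
Writing a DataFrame back to MySQL goes through the same JDBC data source; a hedged sketch in which the target table name spark2 is a placeholder:

// Write jdbcDF back to MySQL; "spark2" is a placeholder target table.
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://bigdata-pro01.kfk.com:3306/test")
  .option("dbtable", "spark2")
  .option("user", "root")
  .option("password", "root")
  .mode("overwrite")
  .save()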

7. Spark SQL and HBase integration

The core of the Spark SQL and HBase integration is that Spark SQL reads HBase table data through Hive external tables.

1) Copy the HBase jars and the Hive jars into Spark's jars directory

http://ke.dajiangtai.com/content/6918/12.png

2) Start HBase

bin/start-hbase.sh

3) Start Hive

bin/hive

4) Start spark-shell

bin/spark-shell


val df = spark.sql("select * from weblogs limit 10")
df.show

If this step reports a NoClassDefFoundError, refer to the blog post "java.lang.NoClassDefFoundError: org/htrace/Trace" about this Spark-HBase integration error.

At this point, the integration of Spark and HBase is a success!
