Big Data Course K20 - Overview of Spark's SparkSQL

Author's email: [email protected]; address: Huizhou, Guangdong

▲ This chapter's objectives

⚪ Understand the origin of SparkSQL;

⚪ Understand the features of SparkSQL;

⚪ Understand the advantages of SparkSQL;

⚪ Master the introductory use of SparkSQL.

1. Overview of SparkSQL

1. Overview

Spark provides a programming module called SparkSQL for structured data processing. It offers a programming abstraction called the DataFrame (data frame); under the hood, a DataFrame is still an RDD, and SparkSQL can act as a distributed SQL query engine.
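The core idea, a schema laid over a plain row collection that can then be queried declaratively, can be sketched with nothing but the standard library. The class and method names below (`MiniDataFrame`, `select`, `filter`) are hypothetical illustrations of the concept, not Spark's actual API:

```python
# Minimal sketch of the DataFrame idea: a schema (column names) over a
# plain collection of tuples -- the stand-in for the underlying RDD.
class MiniDataFrame:
    def __init__(self, schema, rows):
        self.schema = schema  # column names, like a table definition
        self.rows = rows      # the "RDD": just a collection of tuples

    def select(self, *cols):
        # Project onto the requested columns by index.
        idx = [self.schema.index(c) for c in cols]
        return MiniDataFrame(list(cols),
                             [tuple(r[i] for i in idx) for r in self.rows])

    def filter(self, pred):
        # Keep rows where pred(row_as_dict) is true.
        return MiniDataFrame(self.schema,
                             [r for r in self.rows
                              if pred(dict(zip(self.schema, r)))])

df = MiniDataFrame(["name", "age"], [("Ann", 31), ("Bo", 17)])
adults = df.filter(lambda row: row["age"] >= 18).select("name")
print(adults.rows)  # [('Ann',)]
```

In real SparkSQL the equivalent calls return lazy, distributed DataFrames and the query passes through an optimizer before executing as RDD operations.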

2. The origin of SparkSQL

The predecessor of SparkSQL was Shark. As Hadoop developed, Hive emerged to give engineers who knew relational databases (RDBMS) but not MapReduce a tool they could pick up quickly, and it was the only SQL-on-Hadoop tool running on Hadoop at the time. However, MapReduce writes a large number of intermediate results to disk, which consumes a great deal of I/O, so its efficiency was low.

Later, to improve the efficiency of SQL-on-Hadoop, a large number of SQL-on-Hadoop tools appeared, among which the most notable were:

1. MapR’s Drill

2. Cloudera's Impala

3. Shark

Shark was a component of the Berkeley lab's Spark ecosystem. It made several improvements on top of Hive, such as introducing cache management and improving and optimizing the executor, and ran them on the Spark engine, speeding up SQL queries by a factor of 10-100.

However, as Spark developed, Shark's heavy dependence on Hive (for example, on Hive's grammar parser and query optimizer) conflicted with the ambitious Spark team's established "One Stack to Rule Them All" policy and restricted the integration of Spark's components, so the SparkSQL project was proposed.

SparkSQL abandoned the original Shark code but absorbed some of Shark's strengths, such as in-memory columnar storage (In-Memory Columnar Storage) and Hive compatibility, and its code was redeveloped from scratch.

Freed from its dependence on Hive, SparkSQL gained great flexibility in data compatibility, performance optimization, and component extension.

On June 1, 2014, Reynold Xin, the lead of both the Shark and SparkSQL projects, announced that development of Shark would stop and that the team would put all of its resources into SparkSQL. With that, the development of Shark came to an end.

3. SparkSQL features

1. It introduced a new RDD type, SchemaRDD, which can be defined like a traditional database table.

2. Data from different sources can be mixed in one application; for example, data loaded via HiveQL can be joined with data loaded via SQL.

3. A query optimization framework is built in: SQL is parsed into a logical execution plan, which is finally executed as RDD computations.
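Feature 2 above, joining data from different sources with one SQL statement, can be illustrated with the standard library's `sqlite3` standing in for SparkSQL's engine. The table names and data below are made up; in Spark the equivalent is registering DataFrames from different sources as temporary views and joining them:

```python
import sqlite3

# Concept sketch only: sqlite3 stands in for SparkSQL's SQL engine.
# Pretend one table came from Hive and the other from a JSON file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hive_users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE json_orders (user_id INTEGER, total REAL)")
conn.executemany("INSERT INTO hive_users VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bo")])
conn.executemany("INSERT INTO json_orders VALUES (?, ?)",
                 [(1, 9.5), (1, 3.0)])

# One SQL statement joins the two "sources".
rows = conn.execute(
    "SELECT u.name, SUM(o.total) FROM hive_users u "
    "JOIN json_orders o ON u.id = o.user_id GROUP BY u.name"
).fetchall()
print(rows)  # [('Ann', 12.5)]
```

The difference in Spark is that the join plan is optimized and then executed in parallel across the cluster rather than on one local database file.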

4. SparkSQL optimization

SparkSQL is mainly optimized in the following areas:

1. In-Memory Columnar Storage

Advantages of columnar storage:

① When querying massive data, there is no redundant-column problem. With row-based storage, a query reads redundant columns, which are generally filtered out only after they reach memory. Alternatively, a row store can use materialized indexes (B-tree/B+ tree) to avoid reading whole rows, but maintaining those indexes also consumes CPU.

② With columnar storage, each column's data type is homogeneous. The first benefit is that frequent data-type conversions in memory are avoided. The second is that more efficient compression algorithms can be used, such as incremental (run-length) compression and binary encoding; for example, a Gender column "Male Female Male Female" can be stored as the bits 0101.
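Both advantages above can be sketched with plain Python; the data and counts here are made up purely for illustration:

```python
# (1) Column pruning: a query on one column of a row store still touches
# every field of every row; a column store touches only that column.
rows = [("Ann", 31, "F"), ("Bo", 17, "M"), ("Cy", 45, "M")]
columns = {
    "name": ["Ann", "Bo", "Cy"],
    "age": [31, 17, 45],
    "gender": ["F", "M", "M"],
}
row_values_touched = sum(len(r) for r in rows)  # 3 rows x 3 fields = 9
col_values_touched = len(columns["age"])        # just the 3 ages

# (2) Homogeneous columns compress well: the Gender example from the
# text ("Male Female Male Female" -> 0101) as a one-bit dictionary code.
gender = ["Male", "Female", "Male", "Female"]
code = {"Male": "0", "Female": "1"}
encoded = "".join(code[g] for g in gender)

print(row_values_touched, col_values_touched, encoded)  # 9 3 0101
```

Real columnar formats combine dictionary, run-length, and bit-packing encodings like these per column, which is only possible because every value in a column shares one type.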

SparkSQL's table data is stored in memory not as raw JVM objects but in in-memory columnar form, as shown in the figure below.

This storage format offers great advantages in both space usage and read throughput.

With raw JVM object storage, each object typically adds 12-16 bytes of extra overhead (the object header, used for things like hashCode and locking); consider, for example, a 270MB e-commerce product table.
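A back-of-envelope calculation shows how that per-object overhead adds up. The 8-byte average field size below is an assumption for illustration only; the 270MB figure and 12-16 byte overhead come from the text:

```python
# Rough sketch: if every field of the 270MB table becomes its own JVM
# object, per-object header overhead can exceed the data itself.
raw_bytes = 270 * 1024 * 1024   # the 270MB table from the text
avg_field_bytes = 8             # assumed payload per object (illustrative)
header_bytes = 16               # upper end of the 12-16 byte overhead

n_objects = raw_bytes // avg_field_bytes
overhead_mb = n_objects * header_bytes / (1024 * 1024)
print(overhead_mb)  # 540.0 -> twice the raw data in pure overhead
```

Columnar storage sidesteps this by packing each column's values into contiguous primitive arrays instead of one object per value.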


Origin: blog.csdn.net/u013955758/article/details/132567582