Chapter 1 SparkSQL Overview

1.1 What is SparkSQL

Spark SQL is Spark's module for structured data processing.

SparkSQL is a module in Apache Spark for processing structured data. It provides a high-level interface for working with relational data, allowing users to execute SQL queries and manipulate data structures such as DataFrame and DataSet in Spark.

The main functions of SparkSQL include:

  1. DataFrame and DataSet: SparkSQL introduces two data structures, DataFrame and DataSet, which are built on the basis of RDD and provide a more advanced API to operate structured data. DataFrame is a distributed data set, similar to a table in a relational database, and DataSet is a typed DataFrame that supports richer type operations.

  2. SQL query: SparkSQL allows DataFrame and DataSet to be manipulated through the standard SQL query language. Users can use SQL statements to perform operations such as data filtering, projection, and aggregation.

  3. Distributed computing: SparkSQL is built on top of Spark's distributed computing engine, which can handle large-scale data sets and utilize the parallel processing capabilities of clusters for efficient data processing.

  4. In-memory computing: SparkSQL uses Spark's in-memory computing feature to load part of the data into the memory for calculation, reducing disk IO and improving computing efficiency.

  5. Catalyst optimizer: SparkSQL introduces the Catalyst optimizer, which is an extensible query optimization framework for optimizing query plans. The Catalyst optimizer can optimize query performance through a series of optimization rules and transformations.

  6. Hive compatibility: SparkSQL is compatible with Hive and can directly run Hive queries. Through Hive compatibility, users can migrate existing Hive queries to run in Spark without migrating data.

  7. Data source integration: SparkSQL supports connecting to multiple data sources, including Hive, JSON, Parquet, Avro, etc., as well as other external data sources. This enables SparkSQL to interact with different types of data.

  8. User-defined function (UDF): SparkSQL allows users to define their own UDF for custom data processing and calculation operations.

  9. Lazy execution: SparkSQL adopts a lazy execution strategy, that is, it does not execute query operations immediately, but waits until the final result is needed before performing actual calculations, which can optimize query plans and improve computing performance.

In general, SparkSQL provides a powerful data processing and query engine, combining data structures such as SQL query, DataFrame and DataSet, and making full use of Spark's distributed computing capabilities, suitable for large-scale data processing and complex query tasks. It is an important part of Apache Spark, providing users with more efficient and flexible data analysis and query functions.
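
To make these points concrete, here is a minimal Scala sketch (the JSON file path, the column names, and the to_upper UDF are made-up examples, not taken from the original text) that creates a DataFrame, registers a UDF, and runs a standard SQL query; because of lazy execution, nothing is computed until show() is called:

```scala
import org.apache.spark.sql.SparkSession

object SparkSQLQuickStart {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point for Spark SQL.
    val spark = SparkSession.builder()
      .appName("SparkSQLQuickStart")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()

    // Read a (hypothetical) JSON file into a DataFrame.
    val users = spark.read.json("data/users.json")

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    users.createOrReplaceTempView("users")

    // Register a simple user-defined function (UDF).
    spark.udf.register("to_upper", (s: String) => if (s == null) null else s.toUpperCase)

    // Filtering, projection, and aggregation expressed in standard SQL.
    val result = spark.sql(
      """SELECT to_upper(city) AS city, COUNT(*) AS cnt
        |FROM users
        |WHERE age >= 18
        |GROUP BY city""".stripMargin)

    // Lazy execution: the query only runs when an action such as show() is called.
    result.show()

    spark.stop()
  }
}
```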

1.2 Hive and SparkSQL

The predecessor of SparkSQL is Shark, which provided a quick-start tool for engineers who were familiar with RDBMSs but did not understand MapReduce.

In the early days, Hive was the only SQL-on-Hadoop tool running on Hadoop. However, the MapReduce computation model writes a large amount of intermediate data to disk, which consumes heavy I/O and lowers execution efficiency. To improve the efficiency of SQL-on-Hadoop, a number of SQL-on-Hadoop tools began to appear, the most prominent being:

⚫ Drill

Apache Drill is an open source distributed SQL query engine that allows users to execute SQL queries on large-scale datasets, whether the data is structured or semi-structured. Drill aims to provide a high-performance, low-latency query engine that can directly query multiple data sources, including the Hadoop Distributed File System (HDFS), NoSQL databases, relational databases, and cloud storage.

The following are the key features and capabilities of Apache Drill:

  1. Distributed query: Drill runs on its own distributed execution engine, which can span multiple machines and use the parallel processing capability of the cluster to execute large-scale data query tasks.

  2. SQL compatibility: Drill supports the standard SQL query language; users can query data with standard SQL syntax, which makes Drill convenient and flexible to use.

  3. Query multiple data sources: Drill can directly query multiple data sources without a pre-defined schema or metadata, and supports querying Hadoop's HDFS, NoSQL databases (such as MongoDB and HBase), relational databases (such as MySQL and PostgreSQL), cloud storage (such as Amazon S3 and Azure Blob Storage), and more.

  4. Semi-structured data support: Drill can query semi-structured data, such as JSON, Parquet, Avro and other formats, which makes it easier to process complex data.

  5. High performance: Drill uses distributed execution and columnar storage technologies, as well as a query optimizer to improve query performance and execution efficiency.

  6. Intelligent optimization: Drill uses an intelligent optimizer to select an appropriate execution plan to improve query performance and throughput.

  7. Dynamic Schema Discovery: Drill can dynamically discover the schema of data at query time without pre-defining schema or metadata, which makes querying unstructured or frequently changing data more convenient.

  8. Fault tolerance: Drill is fault tolerant and supports failure recovery and failover to ensure query reliability.

In general, Apache Drill is a powerful distributed SQL query engine that offers high performance, low latency, and the ability to query multiple data sources directly, making SQL queries on large-scale data sets more convenient and efficient. It is an invaluable tool, especially for scenarios that require complex queries and analysis across multiple data sources.

⚫ Impala

Impala is an open source distributed SQL query engine developed by Cloudera. It is designed for high-performance, low-latency SQL queries and can directly query data in storage systems such as the Hadoop Distributed File System (HDFS) and HBase. Impala is based on the ideas proposed in Google's Dremel paper, allowing users to query and analyze large-scale data sets with standard SQL statements without converting the data to other formats or performing complex data migration.

The following are the key features and capabilities of Impala:

  1. Distributed query: Impala is built on the Hadoop distributed computing platform. It can execute query tasks in parallel on large-scale data sets and use cluster computing resources to improve query performance.

  2. SQL compatibility: Impala supports the standard SQL query language; users can query and analyze data with familiar SQL syntax, which reduces the learning cost.

  3. High performance: Impala adopts the MPP (Massively Parallel Processing) architecture, using parallel computing and memory computing technologies to improve query performance and response speed. For complex queries on large-scale datasets, Impala can achieve low-latency query results.

  4. Compatible with Hive: Impala is compatible with Hive's metadata and table definitions, and can be seamlessly integrated with tools such as Hive and Hue, thus simplifying data processing and query processes.

  5. Support for complex data types: Impala supports complex data types, such as arrays, structures, and nested data, making queries on semi-structured data more flexible and convenient.

  6. Support for multiple file formats: Impala supports multiple data file formats, including Parquet, Avro, ORC, etc., so that data in these file formats can be directly queried without additional data conversion.

  7. Fault tolerance: Impala is fault tolerant, supports failover and automatic recovery, and ensures the reliability of query tasks.

In general, Impala is a high-performance, low-latency distributed SQL query engine, suitable for complex query and analysis of large-scale data sets. It is tightly integrated with the Hadoop ecosystem and can directly query data in storage systems such as HDFS and HBase, providing users with a powerful and convenient data processing and query tool.

⚫ Shark

Shark is an open source project developed by UC Berkeley AMP Lab, which is a distributed data warehouse system built on top of Apache Spark. Shark's goal is to provide a high-performance, low-latency data warehouse system capable of performing complex SQL queries and data analysis tasks, and compatible with Hive.

Key features and functions of Shark include:

  1. Distributed SQL query: Shark supports the standard SQL query language for data query and analysis tasks. It allows users to use SQL statements to perform querying, filtering, aggregation, and other operations on large-scale data sets.

  2. Based on Spark: Shark is built on top of Apache Spark, which utilizes Spark's in-memory computing and distributed computing capabilities to perform high-performance queries on large-scale data sets.

  3. Support for Hive: Shark is compatible with Hive's metadata and table definitions, and can be seamlessly integrated with the Hive ecosystem, allowing users to run existing Hive queries in Shark.

  4. High performance: Shark adopts columnar storage and query optimizer, and realizes high-performance and low-latency data query by optimizing query plan and parallel execution.

  5. Multiple data format support: Shark supports multiple data file formats, including Parquet, Avro, ORC, etc., and can directly query data in these formats.

  6. User-defined function (UDF): Shark allows users to define their own UDF for custom data processing and calculation operations.

  7. Fault tolerance: Shark is fault tolerant, supports failover and automatic recovery, and ensures the reliability of query tasks.

It should be noted that Shark started out as an improvement built on top of Hive. Later, with the emergence of Spark SQL, Spark SQL gradually replaced Shark. Today, Spark SQL is an official component of Apache Spark and provides richer and more powerful functionality, so it is recommended to use Spark SQL instead of Shark for data warehouse and SQL query tasks.

Shark was one of the components of Berkeley Lab's Spark ecosystem. It was a tool built on top of Hive: it modified Hive's memory management, physical planning, and execution modules (shown in the lower-right corner of the figure below) so that they run on the Spark engine.
[Figure: Hive architecture, highlighting the memory management, physical planning, and execution modules that Shark replaced to run on the Spark engine]
The emergence of Shark made SQL-on-Hadoop performance 10-100 times higher than that of Hive.


  • However, as Spark developed, Shark's heavy dependence on Hive (for example, its use of Hive's syntax parser and query optimizer) conflicted with the ambitious Spark team's established goal of "One Stack to Rule Them All" and restricted the integration of Spark's components, so the SparkSQL project was proposed.
  • SparkSQL discarded the original Shark code but absorbed some of Shark's strengths, such as in-memory columnar storage and Hive compatibility, and the SparkSQL code was redeveloped from scratch. Freed from the dependence on Hive, SparkSQL gained great flexibility in data compatibility, performance optimization, and component extension.

➢ In terms of data compatibility, SparkSQL is not only compatible with Hive, but can also obtain data from RDDs, Parquet files, and JSON files; later versions even support reading data from RDBMSs and from NoSQL stores such as Cassandra;

➢ In terms of performance optimization, in addition to in-memory columnar storage, bytecode generation, and other optimization techniques, a cost model was introduced to dynamically evaluate queries and obtain the best physical plan;

➢ In terms of component extension, the SQL syntax parser, analyzer, and optimizer can all be redefined and extended.
On June 1, 2014, Reynold Xin, the lead of both the Shark project and the SparkSQL project, announced that development of Shark would stop and that the team would put all of its resources into the SparkSQL project. That brought Shark's development to a close, but two branches grew out of it: SparkSQL and Hive on Spark.


Spark SQL

Spark SQL is a module in Apache Spark for processing structured data. It provides a high-level interface for working with relational data, allowing users to execute SQL queries and manipulate data structures such as DataFrame and DataSet in Spark.

Key features of Spark SQL include:

  1. DataFrame and DataSet: Spark SQL introduces two data structures, DataFrame and DataSet, which are built on the basis of RDD and provide a more advanced API to operate structured data. DataFrame is a distributed data set, similar to a table in a relational database, and DataSet is a typed DataFrame that supports richer type operations.

  2. SQL query: Spark SQL allows DataFrame and DataSet to be manipulated through the standard SQL query language. Users can use SQL statements to perform operations such as data filtering, projection, and aggregation.

  3. Data source integration: Spark SQL supports connecting to multiple data sources, including Hive, JSON, Parquet, Avro, etc., as well as other external data sources. This enables Spark SQL to interact with different types of data.

  4. Catalyst optimizer: Spark SQL introduces the Catalyst optimizer, which is an extensible query optimization framework for optimizing query plans. The Catalyst optimizer can optimize query performance through a series of optimization rules and transformations.

  5. Hive compatibility: Spark SQL is compatible with Hive and can directly run Hive queries. Through Hive compatibility, users can migrate existing Hive queries to run in Spark without migrating data.

  6. User-defined function (UDF): Spark SQL allows users to define their own UDF for custom data processing and calculation operations.

Spark SQL provides a more advanced API and query language, making it more convenient and flexible to process structured data on Spark. It is tightly integrated with other modules of Spark (such as Spark Core and Spark Streaming), and can work seamlessly with them, providing powerful functions for distributed data processing and analysis.

Hive on Spark

Hive on Spark is a way to combine Hive with Apache Spark. It is an integration of Hive and Spark, which aims to combine the data warehouse function of Hive and the distributed computing capability of Spark, so as to provide better performance and flexibility in large-scale data processing and query.

In Hive on Spark, Hive is used as a data warehouse system to manage and query structured data, while Spark is used to perform actual computing tasks. The main goal of Hive on Spark is to accelerate Hive queries, improve query performance and scalability, and support more complex data analysis operations.

Features and benefits of Hive on Spark include:

  1. Accelerated query performance: By using Spark's distributed computing engine, Hive on Spark can accelerate query execution on large-scale datasets. Spark's in-memory computing and data parallel processing capabilities can significantly improve query performance.

  2. In-memory computing: Hive on Spark can use Spark's in-memory computing feature to load part of the data into the memory for calculation, reducing disk IO and improving computing efficiency.

  3. Dynamic partitioning and dynamic bucketing: Hive on Spark supports dynamic partitioning and dynamic bucketing, and can automatically optimize data storage and query methods according to data characteristics and query requirements.

  4. Hive UDF and UDAF support: Hive on Spark supports Hive's user-defined function (UDF) and user-defined aggregate function (UDAF), and users can run their own complex calculation logic on Spark.

  5. Resource management: Hive on Spark can use Spark's resource manager to manage resources for executing tasks to ensure job fairness and efficiency.

  6. Seamless integration of Hive and Spark: Hive on Spark seamlessly integrates with Hive's native syntax and functions, and users can continue to use familiar Hive syntax and APIs without modifying existing queries and scripts.

It should be noted that Hive on Spark is not a substitute for Hive, but an enhanced version of Hive. Users can choose to use Hive on Spark or native Hive according to specific scenarios and needs. Hive on Spark is a very valuable choice for large-scale data processing and complex query tasks, especially when the distributed computing capabilities of Spark need to be utilized.

  • Of the two, SparkSQL continues to evolve as a member of the Spark ecosystem; it is no longer restricted to Hive, but merely remains compatible with it. Hive on Spark, by contrast, is a development plan for Hive itself that adds Spark as one of Hive's underlying execution engines; in other words, Hive is no longer tied to a single engine and can run on MapReduce, Tez, Spark, and others.
  • For developers, SparkSQL simplifies RDD development, improves development efficiency, and executes very fast, so in practice SparkSQL is what is used most of the time. To simplify RDD development and improve productivity, Spark SQL provides two programming abstractions, analogous to the RDD in Spark Core.

➢ DataFrame

DataFrame is a core concept in Spark SQL. It is a distributed data collection, similar to a table in a relational database or a DataFrame in Pandas. It is a two-dimensional data structure composed of rows and columns, which supports the processing and query of structured data.

Features and benefits of DataFrame include:

  1. Distributed computing: DataFrame is built on Spark's distributed computing engine, which can handle large-scale data sets and utilize the parallel processing capabilities of clusters for efficient data processing.

  2. Structured data: DataFrame is a structured data set, each column has a specific data type, similar to a table in a relational database. This makes DataFrame better suited to the processing needs of structured data.

  3. Lazy execution: Spark's DataFrame adopts a lazy execution strategy: it does not execute a query operation immediately, but waits until the final result is needed before performing the actual computation, which allows the query plan to be optimized and improves computing performance.

  4. Rich API: Spark provides a rich DataFrame API that supports a variety of data operations and transformations, including filtering, mapping, aggregation, and joins, as well as SQL query operations, making data processing more convenient and flexible.

  5. Data source support: DataFrame supports a variety of data sources, including Hive, JSON, Parquet, Avro, etc., as well as other external data sources, which makes it easy to interact with different types of data.

  6. User-defined function (UDF): DataFrame allows users to define their own UDF for custom data processing and calculation operations.

  7. Optimizability: By using the Catalyst optimizer, DataFrame can be optimized before executing queries to optimize query plans and improve performance.

With DataFrame, structured data can be processed and analyzed more conveniently; it is the core tool for high-level data manipulation in Spark SQL. Because DataFrame is built on Spark's distributed computing engine, it can also handle large-scale data, making full use of the cluster's computing resources to provide high-performance data processing.
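
As a short illustration of the DataFrame API described above (a sketch only; the employee data and column names are invented), a chain of filter, select, groupBy, and agg transformations is built lazily and only executed when an action such as show() is called:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("DataFrameOps").master("local[*]").getOrCreate()
import spark.implicits._

// Build a small DataFrame from an in-memory sequence (columns: name, dept, salary).
val employees = Seq(
  ("Alice", "Engineering", 8500.0),
  ("Bob",   "Engineering", 7200.0),
  ("Carol", "Marketing",   6100.0)
).toDF("name", "dept", "salary")

// A chain of lazy transformations: filtering, projection, aggregation.
val avgByDept = employees
  .filter($"salary" > 6000)
  .select($"name", $"dept", $"salary")
  .groupBy($"dept")
  .agg(avg($"salary").as("avg_salary"))

// Nothing has run yet; show() is the action that triggers the computation.
avgByDept.show()
```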

➢ DataSet

DataSet is a data collection in Spark SQL, which is a typed version of DataFrame. DataSet was introduced in Spark version 1.6. It combines the structured data processing capabilities of DataFrame and the strong type features of RDD to provide more powerful and type-safe data operations.

Features and benefits of DataSet include:

  1. Strong type: DataSet is typed, which allows users to specify the data type at compile time, so that type errors can be found at compile time, which improves the robustness and maintainability of the code. In contrast, DataFrames are untyped, and type errors are only discovered at runtime.

  2. Type safety: Since DataSet is typed, it can catch type errors at compile time, avoid type conversion errors at runtime, and reduce the possibility of bugs.

  3. API Consistency: The API of DataSet is consistent with that of DataFrame, and most of the operations of DataFrame can be used in DataSet. This makes it easier to migrate from DataFrame to DataSet.

  4. Query optimization: Like DataFrame, DataSet also supports Catalyst optimizer, which can optimize query plan and improve query performance.

  5. Distributed computing: DataSet is built on Spark's distributed computing engine, which can handle large-scale data sets and utilize the parallel processing capabilities of clusters for efficient data processing.

  6. Data source support: DataSet supports multiple data sources, including Hive, JSON, Parquet, Avro, etc., as well as other external data sources, so that it can easily interact with different types of data.

  7. User-defined function (UDF): DataSet allows users to define their own UDF for custom data processing and calculation operations.

  8. Data serialization: The data of DataSet is serialized in binary format in memory, thereby reducing memory usage and improving memory usage efficiency.

In general, DataSet is a more powerful and type-safe data structure in Spark SQL, suitable for application scenarios that require stricter type checking. For scenarios that require more flexible and dynamic data processing and interaction with other unstructured data sources, DataFrame may be more suitable.
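
The following sketch (the Person case class and its fields are illustrative) shows how a case class gives a DataSet its compile-time type, so that typed lambda operations are checked by the compiler:

```scala
import org.apache.spark.sql.SparkSession

// The case class defines the schema and the compile-time type of the DataSet.
case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("DataSetDemo").master("local[*]").getOrCreate()
import spark.implicits._

// Create a strongly typed DataSet[Person] from an in-memory collection.
val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

// Typed, lambda-based transformations; referring to a field that does not
// exist on Person (for example p.salary) would fail at compile time.
val adultNames = people.filter(p => p.age >= 18).map(p => p.name)

adultNames.show()
```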

1.3 Features of SparkSQL

Spark SQL is the Apache Spark module for processing structured data. Its core capabilities (SQL queries over DataFrame and DataSet, distributed in-memory computing, the Catalyst optimizer, Hive compatibility, broad data source integration, user-defined functions, and lazy execution) were already summarized in section 1.1. Its four characteristic features are described below.

1.3.1 Easy to integrate

Seamlessly integrate SQL query and Spark programming
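For example (a minimal sketch; the logs view and its columns are hypothetical, and spark is an existing SparkSession), a SQL query and DataFrame operations can be mixed freely on the same data:

```scala
import spark.implicits._

// Start with a SQL query over a registered view ...
val errors = spark.sql("SELECT level, message FROM logs WHERE level = 'ERROR'")

// ... and continue with DataFrame operations on the result.
val counts = errors.groupBy($"message").count()
counts.show()
```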

1.3.2 Unified data access

Connect to different data sources in the same way
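For example (a sketch; all file paths are placeholders and spark is an existing SparkSession), the same read/write interface is used regardless of the underlying format:

```scala
// Reading different formats through the same unified API.
val jsonDF    = spark.read.json("data/events.json")
val parquetDF = spark.read.parquet("data/events.parquet")
val csvDF     = spark.read.option("header", "true").csv("data/events.csv")

// Writing also goes through the same interface.
jsonDF.write.mode("overwrite").parquet("out/events.parquet")
```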

1.3.3 Compatible with Hive

Run SQL or HiveQL directly on the existing warehouse
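A sketch of enabling Hive support (the warehouse directory and the employees table are assumptions): with enableHiveSupport(), Spark SQL uses the Hive metastore and can run SQL/HiveQL directly against existing Hive tables:

```scala
import org.apache.spark.sql.SparkSession

// Hive support lets Spark SQL read Hive's metastore and query existing Hive tables.
val spark = SparkSession.builder()
  .appName("HiveCompat")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse") // placeholder path
  .enableHiveSupport()
  .getOrCreate()

// Query an existing (hypothetical) Hive table with SQL/HiveQL.
spark.sql("SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept").show()
```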

1.3.4 Standard data connection

Connect via JDBC or ODBC
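For example (a sketch; the connection URL, table name, and credentials are placeholders, and spark is an existing SparkSession), a relational table can be read over JDBC with the same DataFrame API; external BI tools can likewise connect to Spark through its Thrift JDBC/ODBC server.

```scala
// Read a table from a relational database over JDBC.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/sales") // placeholder URL
  .option("dbtable", "orders")                        // placeholder table
  .option("user", "spark")
  .option("password", "secret")
  .load()

jdbcDF.show()
```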

1.4 What is a DataFrame

  • In Spark, a DataFrame is an RDD-based distributed dataset, similar to a two-dimensional table in a traditional database. The main difference between a DataFrame and an RDD is that the DataFrame carries schema metadata: every column of the two-dimensional table it represents has a name and a type. This lets Spark SQL see more of the data's structure, so it can optimize both the data sources behind a DataFrame and the transformations applied to it, greatly improving runtime efficiency. With an RDD, by contrast, there is no way to know the internal structure of the stored elements, so Spark Core can only perform simple, general pipeline optimization at the stage level.

  • Also, like Hive, DataFrame supports nested data types (struct, array, and map). In terms of API ease of use, the DataFrame API provides a set of high-level relational operations that is friendlier and has a lower barrier to entry than the functional RDD API.

[Figure: side-by-side comparison of an RDD[Person] and a DataFrame with named, typed columns]
The figure above intuitively reflects the difference between DataFrame and RDD.

  • Although the RDD[Person] on the left takes Person as a type parameter, the Spark framework itself does not understand the internal structure of the Person class. The DataFrame on the right provides detailed structural information, so Spark SQL knows exactly which columns the data set contains and what the name and type of each column are. A DataFrame is a view that attaches a schema to the data; it can be treated like a table in a database. DataFrame operations are also lazily executed, but their performance is higher than that of RDD operations.
  • The main reason is the optimized execution plan: the query plan is optimized by Spark's Catalyst optimizer. Take the following example:
    [Figure: logical plan of a join-then-filter query on the demographic data, and the optimized plan with the filter pushed below the join]
  • To illustrate query optimization, consider the demographic data analysis example shown in the figure above. Two DataFrames are constructed and joined, and a filter operation is performed after the join. If this plan were executed exactly as written, efficiency would be poor, because join is a costly operation that may also produce a larger data set. If the filter is instead pushed down below the join, so that each DataFrame is filtered first and the smaller filtered result sets are then joined, execution time can be shortened significantly, and Spark SQL's query optimizer does exactly that. In short, logical query plan optimization is the process of replacing high-cost operations with low-cost ones through equivalent transformations based on relational algebra; a code sketch of this pushdown follows below.
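The same behavior can be observed directly in code (a sketch with invented data, assuming an existing SparkSession named spark): writing the query as "join first, filter afterwards" and then printing the plan shows that Catalyst pushes the filter below the join in the optimized plan:

```scala
import spark.implicits._

// Two small DataFrames standing in for the demographic example (schemas are invented).
val people = Seq((1, "Alice", 30), (2, "Bob", 17)).toDF("id", "name", "age")
val scores = Seq((1, 95), (2, 80)).toDF("id", "score")

// Written as "join first, then filter" ...
val joined = people.join(scores, "id").filter($"age" >= 18)

// ... but the optimized logical/physical plan printed by explain(true)
// shows that the age filter has been pushed below the join.
joined.explain(true)
```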

1.5 What is a DataSet

DataSet is a distributed collection of data. DataSet is a new abstraction added in Spark 1.6 and is an extension of DataFrame. It offers the advantages of RDDs (strong typing, the ability to use powerful lambda functions) along with the benefits of Spark SQL's optimized execution engine. A DataSet can also be manipulated with functional transformations (operations such as map, flatMap, and filter).

➢ DataSet is an extension of DataFrame API and the latest data abstraction of SparkSQL

➢ User-friendly API style, with both type safety checks and query optimization features of DataFrame;

➢ Case classes are used to define the structure of the data in a DataSet, and the name of each attribute in the case class maps directly to a field name in the DataSet;

➢ DataSet is strongly typed. For example, there can be DataSet[Car], DataSet[Person].

➢ DataFrame is a special case of DataSet: DataFrame = DataSet[Row], so a DataFrame can be converted to a DataSet with the as method. Row is a type, just like Car and Person; all table structure information is represented by Row, and fields are accessed by position when reading data from a Row.
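
A short sketch of this relationship (the Car case class and its data are illustrative, and spark is an existing SparkSession): as converts an untyped DataFrame into a typed DataSet, toDF goes back, and Row fields are fetched by position:

```scala
import spark.implicits._

case class Car(brand: String, price: Double)

// A DataFrame is a DataSet[Row]: an untyped view of the data.
val carsDF = Seq(("Tesla", 39990.0), ("BYD", 25990.0)).toDF("brand", "price")

// as[...] converts the DataFrame into a typed DataSet using the case class.
val carsDS = carsDF.as[Car]

// Reading a Row requires specifying the position (or name) of each field ...
val firstRow  = carsDF.head()
val brandName = firstRow.getString(0)

// ... whereas the DataSet exposes typed fields directly, and toDF() converts back.
val brands   = carsDS.map(_.brand)
val backToDF = carsDS.toDF()
```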
