Chapter 1: Introduction to SparkSQL

Spark SQL
Spark SQL official introduction
● Official website
http://spark.apache.org/sql/
Spark SQL is the Spark module for processing structured data.
Spark SQL can be used in several ways, including the DataFrame API and the Dataset API.
1. What is SparkSQL?
The Spark module for processing structured data.
Data can be processed through the DataFrame and Dataset APIs.
2. Features of SparkSQL
1. Easy integration
APIs are available in Java, Scala, Python, and R.
2. Unified data access
Any data source is connected to in the same way.
3. Hive compatibility
4. Standard data connectivity (JDBC/ODBC)
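3. Advantages and disadvantages of SQL
Advantages: queries are expressed very clearly; SQL has a low barrier to entry and is easy to learn.
Disadvantages: complex business logic requires complex SQL; complex analysis leads to heavily nested SQL; machine learning is hard to implement.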
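To make the "clear expression" point concrete, here is a minimal sketch that computes the same aggregation twice: once as a SQL query and once as a hand-written loop. It uses Python's built-in sqlite3 module as a stand-in SQL engine, not Spark, and the table name, column names, and sample rows are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (name, department, salary)
rows = [
    ("Ann", "eng", 120),
    ("Bob", "eng", 100),
    ("Cid", "ops", 90),
]

def avg_salary_sql(data):
    """Average salary per department, stated declaratively in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", data)
    result = conn.execute(
        "SELECT dept, AVG(salary) FROM emp GROUP BY dept ORDER BY dept"
    ).fetchall()
    conn.close()
    return result

def avg_salary_loop(data):
    """The same aggregation written by hand: more code, more bookkeeping."""
    totals = defaultdict(lambda: [0.0, 0])  # dept -> [sum, count]
    for _name, dept, salary in data:
        totals[dept][0] += salary
        totals[dept][1] += 1
    return sorted((dept, s / n) for dept, (s, n) in totals.items())

sql_result = avg_salary_sql(rows)
loop_result = avg_salary_loop(rows)
print(sql_result)
print(loop_result)
```

Both functions return the same per-department averages. The SQL version states what is wanted in one line, while the loop has to spell out how to compute it. The disadvantage listed above appears once the logic no longer fits a single query and has to be nested.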
4. Hive and SparkSQL
Hive translates SQL into MapReduce jobs.
SparkSQL can be understood as parsing SQL into RDD operations, then optimizing and executing them.
5. Two abstractions in SparkSQL
What is an RDD?
A Resilient Distributed Dataset.
What is a DataFrame?
A DataFrame is a distributed dataset built on top of RDDs that carries schema metadata, similar to a two-dimensional table in a traditional database.
What is a Dataset?
A DataFrame that also carries type information is a Dataset.
(Dataset = DataFrame + type = Schema + RDD * n + type)
A Dataset includes the functionality of a DataFrame.
Differences between RDD, DataFrame, and Dataset
[Figure: structure diagram comparing RDD, DataFrame, and Dataset]
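RDD[Person]
Takes Person as its type parameter, but Spark does not know the internal structure of Person.
DataFrame
Provides detailed structural information: the schema, that is, the column names and types. This makes it look like a table.
Dataset[Person]
Carries not only schema information but also type information.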
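The three shapes can be sketched with a plain-Python analogy. This is not the Spark API (PySpark has no typed Dataset API at all; Dataset[Person] exists in Scala and Java). The `Person` fields, the sample values, and the `column` helper are assumptions made up for illustration.

```python
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    age: int

people = [Person("Ann", 30), Person("Bob", 25)]

# RDD[Person] analogy: a collection of Person objects and nothing more.
# The container has no description of what is inside each element.
rdd_like = list(people)

# DataFrame analogy: generic rows plus a schema of column names and types.
schema = [(f.name, f.type.__name__) for f in fields(Person)]
df_like = {"schema": schema, "rows": [(p.name, p.age) for p in people]}

def column(df, name):
    """Look a column up by name through the schema, as a table would."""
    index = [col for col, _type in df["schema"]].index(name)
    return [row[index] for row in df["rows"]]

# Dataset[Person] analogy: the same schema, but rows stay typed Person objects.
ds_like = {"schema": schema, "rows": people}

ages_from_df = column(df_like, "age")             # access by column name
ages_from_ds = [p.age for p in ds_like["rows"]]   # access by typed field
print(schema, ages_from_df, ages_from_ds)
```

In the DataFrame-like structure a column is found by name through the schema, so a misspelled column only fails at run time. In the Dataset-like structure each row is still a `Person`, so field access goes through the type itself.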

Origin blog.csdn.net/qq_45765882/article/details/105560112