Spark SQL common interview questions for big data

1. Several ways to create a Dataset

  • 1. Convert from a DataFrame via the as[T] method
  • 2. Create directly through SparkSession.createDataset()
  • 3. Implicit conversion through the toDS method (all three are sketched below)
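
A minimal Scala sketch of all three approaches, assuming a local SparkSession; `people.json` is a hypothetical input file whose fields match the `Person` case class:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object DatasetCreation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // 1. Convert a DataFrame to a Dataset with as[T]
    val ds1 = spark.read.json("people.json").as[Person] // hypothetical input file

    // 2. Create directly with SparkSession.createDataset()
    val ds2 = spark.createDataset(Seq(Person("Ann", 30L), Person("Bob", 25L)))

    // 3. Implicit conversion with toDS (enabled by spark.implicits._)
    val ds3 = Seq(Person("Cat", 40L)).toDS()

    ds2.show()
    spark.stop()
  }
}
```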

2. What is the difference between DataFrame and RDD

RDD features:

  • An RDD is a lazily evaluated, immutable, parallel data collection that supports lambda expressions
  • The biggest advantage of the RDD is its simplicity; the API is highly user-friendly
  • The disadvantage of the RDD is performance: its records live as JVM objects in memory, which brings GC pressure and growing Java serialization costs as data volume increases

DataFrame features:

  • Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: in addition to the data, it also records the data's structure information, that is, the schema. Like Hive, the DataFrame also supports nested data types (struct, array, and map). From the perspective of API ease of use, the DataFrame API provides a set of high-level relational operations that are friendlier than the functional RDD API and has a lower barrier to entry. Because it resembles the DataFrames of R and Pandas, the Spark DataFrame inherits much of the development experience of traditional single-machine data analysis (a comparative sketch follows)
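
To make the contrast concrete, here is a small sketch, assuming a local SparkSession, of the same filter written first against an RDD (functional, schema-less) and then against a DataFrame (relational, schema-aware):

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-df").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD: Spark sees only opaque tuples; filtering is a lambda over objects
    val rdd = spark.sparkContext.parallelize(Seq(("Ann", 30), ("Bob", 17)))
    val adultsRdd = rdd.filter { case (_, age) => age >= 18 }

    // DataFrame: the schema (name: string, age: int) travels with the data,
    // enabling relational operations and Catalyst optimization
    val df = rdd.toDF("name", "age")
    val adultsDf = df.filter($"age" >= 18).select("name")

    df.printSchema()
    adultsDf.show()
    spark.stop()
  }
}
```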

3. How does Spark SQL handle structured and unstructured data

  • Structured data: JSON is loaded into a DataFrame, registered as a temporary table, and then queried with SQL
  • Unstructured data: a schema is constructed through reflection-based inference (both paths are sketched below)
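
A sketch of both paths, assuming hypothetical input files `events.json` (structured JSON with a `status` field) and `access.log` (raw text lines of `ip status`):

```scala
import org.apache.spark.sql.SparkSession

case class LogLine(ip: String, status: Int)

object SchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("schema-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Structured data: JSON already carries field names, so Spark infers the schema,
    // and the DataFrame can be registered as a temporary view and queried with SQL
    spark.read.json("events.json").createOrReplaceTempView("events") // hypothetical file
    spark.sql("SELECT * FROM events WHERE status = 200").show()

    // Unstructured data: parse raw text into a case class; reflection infers the schema
    val logs = spark.sparkContext
      .textFile("access.log")                   // hypothetical file
      .map(_.split(" "))
      .map(p => LogLine(p(0), p(1).toInt))
      .toDF()
    logs.printSchema()
    spark.stop()
  }
}
```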

4. The difference between RDD, DataFrame, and Dataset

  • In fact, the Dataset is an upgraded version of the DataFrame; since Spark 2.0 the DataFrame is simply an alias for Dataset[Row], so the DataFrame can be regarded as a special case of the Dataset. The main difference is the Encoder added to the Dataset, which restores the object-oriented, strongly typed programming style that the DataFrame lacks. The Dataset thus combines the DataFrame and the RDD, making operations more flexible

1. RDD

Advantages:

  • Compile-time type safety
  • Type errors can be checked at compile time
  • Object-oriented programming style
  • Data can be manipulated directly through dot notation on the object's fields (illustrated below)
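
A small sketch of the difference, assuming a local SparkSession: with an RDD of case-class objects the compiler checks every field access, while a DataFrame defers column-name errors to runtime:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object TypeSafetyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("type-safety").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30)))
    val ages = rdd.map(p => p.age + 1)   // field access checked at compile time
    // rdd.map(p => p.salary)            // would not compile: no such field

    val df = rdd.toDF()
    // df.select("salary")               // compiles, but fails at runtime with AnalysisException

    println(ages.collect().toSeq)
    spark.stop()
  }
}
```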

Disadvantages:

  • Performance overhead of serialization and deserialization
  • Whether for communication between cluster nodes or for I/O operations, both the structure and the data of each object must be serialized and deserialized
  • GC performance overhead: frequent creation and destruction of objects inevitably increases GC pressure

2. DataFrame

  • The DataFrame introduces the schema and off-heap storage
  • Schema: every row of the RDD data has the same structure, and this structure is stored once in the schema. Spark can read the data through the schema, so for communication and I/O only the data itself needs to be serialized and deserialized; the structure part can be omitted
  • Off-heap: memory outside the JVM heap, managed directly by Spark, which keeps row data out of the garbage collector's reach

3. Dataset

  • The Dataset combines the advantages of the RDD and the DataFrame, and brings a new concept, the Encoder
  • When serializing data, the Encoder generates bytecode that interacts with off-heap memory, allowing fields to be accessed on demand without deserializing the entire object (sketch below)
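
A minimal sketch of an explicit Encoder, assuming a local SparkSession; Encoders.product derives one for any case class:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

case class Person(name: String, age: Long)

object EncoderDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("encoder-demo").master("local[*]").getOrCreate()

    // The Encoder describes how Person maps to Spark's internal binary row format
    val personEncoder = Encoders.product[Person]
    println(personEncoder.schema) // StructType(StructField(name,StringType,true), ...)

    // Typed operations go through the Encoder rather than Java serialization
    val ds = spark.createDataset(Seq(Person("Ann", 30L), Person("Bob", 25L)))(personEncoder)
    ds.filter(_.age > 26).show()
    spark.stop()
  }
}
```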

5. Spark SQL running process

1) Parse the SQL statement that was read

  • Identify which parts of the SQL statement are keywords (such as select, from, where), which are expressions, which are Projections, which are Data Sources, and so on
  • Determine whether the SQL statement is well-formed

2) Bind the SQL statement to the database's data dictionary

  • Data dictionary: columns, tables, views, etc.
  • If the related Projections, Data Sources, etc. all exist, the SQL statement can be executed

3) The database selects the optimal execution plan

  • The database produces several candidate execution plans, together with their runtime statistics
  • The database then selects an optimal plan from these candidates

4) Execute the plan

  • Execution proceeds in the order Operation -> DataSource -> Result
  • During execution, the result can sometimes be returned without even reading the physical table: for example, when re-running the SQL statement that was just executed, the result can be returned directly from the database's buffer pool (a Spark-side sketch of this pipeline follows)
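
In Spark SQL the analogous pipeline can be observed directly; a minimal sketch using Dataset.explain(true), which prints the parsed and analyzed logical plans, the optimized logical plan, and the physical plan for a query:

```scala
import org.apache.spark.sql.SparkSession

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explain-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age").createOrReplaceTempView("people")

    // extended = true prints every planning stage, not just the physical plan
    spark.sql("SELECT name FROM people WHERE age > 26").explain(true)
    spark.stop()
  }
}
```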

6. Spark SQL principle

  • Catalyst is the code name of the Spark SQL execution optimizer. Every Spark SQL statement is ultimately parsed and optimized through it, finally generating executable Java bytecode
  • The main data structure of Catalyst is the tree. Every SQL statement is stored as a tree structure; each node in the tree has a node class and zero or more child nodes. New node types defined in Scala are subclasses of the TreeNode class
  • Another important concept in Catalyst is the rule. Essentially all optimizations are rule-based. Rules operate on the tree; the nodes in the tree are read-only, so the tree itself is read-only, and the functions defined in a rule transform one tree into a new tree

The entire Catalyst execution process can be divided into the following four stages (the sketch after this list shows each stage's output):

  • 1) Analysis stage: analyze the logical plan and resolve references
  • 2) Logical optimization stage
  • 3) Physical planning stage: Catalyst generates multiple physical plans and compares them based on cost
  • 4) Code generation stage
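
These four stages can also be inspected programmatically through the queryExecution handle that every Dataset exposes; a minimal sketch, assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

object CatalystStages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("catalyst-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age").createOrReplaceTempView("people")
    val qe = spark.sql("SELECT name FROM people WHERE age > 26").queryExecution

    println(qe.analyzed)      // 1) analysis: references resolved against the catalog
    println(qe.optimizedPlan) // 2) logical optimization: rule-based rewrites applied
    println(qe.sparkPlan)     // 3) physical planning: one plan selected
    println(qe.executedPlan)  // 4) final plan, ready for code generation
    spark.stop()
  }
}
```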

7. Cache modes in Spark SQL

  • 1): Cache a temporary table through the SQLContext instance: sqlContext.cacheTable("tableName")
  • 2): Cache a DataFrame through the instance method .cache()
    Note: registerTempTable is not an action-type operator, so registering a table by itself does not trigger caching (sketch below)
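
A sketch of both cache modes, assuming a local SparkSession; spark.catalog.cacheTable is the SparkSession-era equivalent of sqlContext.cacheTable:

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")
    df.createOrReplaceTempView("people") // lazy: registering alone caches nothing

    // 1) Cache by table name
    spark.catalog.cacheTable("people")
    spark.sql("SELECT * FROM people").count() // action: materializes the cache
    spark.catalog.uncacheTable("people")

    // 2) Cache the DataFrame instance directly; also lazy until an action runs
    df.cache()
    df.count()
    df.unpersist()
    spark.stop()
  }
}
```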

8. The difference between the join and left join operations in Spark SQL

  • join is very similar to the inner join in SQL: the result contains only the rows where the left and right sides match successfully, and unmatched rows are filtered out
  • left join is similar to the left outer join in SQL: the result is driven by the first (left) RDD, and the fields of unmatched records are null
  • In some scenarios, left semi join can be used instead of left join:
  • A left semi join behaves like an IN (keySet) test: once a left row finds a match, duplicate records in the right table are skipped, so performance is higher, whereas a left join always traverses all matches. Note that only columns from the left table can appear in the final select of a left semi join, because the right table contributes only its join key to the computation (all three variants are sketched below)
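
A sketch of the three join types, assuming a local SparkSession; note how the duplicate id 1 on the right side multiplies rows in the left join but not in the left semi join:

```scala
import org.apache.spark.sql.SparkSession

object JoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "Ann"), (2, "Bob"), (3, "Cat")).toDF("id", "name")
    val right = Seq((1, "apple"), (1, "avocado"), (2, "banana")).toDF("id", "item")

    // inner join: only matching ids survive; id 3 is filtered out
    left.join(right, Seq("id"), "inner").show()

    // left join: every left row survives; item is null for id 3
    left.join(right, Seq("id"), "left").show()

    // left semi join: behaves like id IN (1, 2); duplicates on the right do not
    // multiply left rows, and only left-table columns appear in the result
    left.join(right, Seq("id"), "left_semi").show()
    spark.stop()
  }
}
```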

Origin: blog.csdn.net/sun_0128/article/details/107858345