Spark Big Data Technology: Spark SQL

One: Spark SQL Overview

  1. Definition: Spark SQL is a Spark module for processing structured data. It provides two programming abstractions, DataFrame and DataSet, and acts as a distributed SQL query engine.

  2. Features: easy integration, a unified way of accessing data, Hive compatibility, and standard data connections (JDBC/ODBC).

  3. DataFrame definition: like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a database: besides the data itself, it also records the structure of the data, i.e. the schema.

  4. DataSet definition: DataSet is an extension of the DataFrame API and is Spark's newest data abstraction.

    1) It extends the DataFrame API and is Spark's newest data abstraction.
    2) It has a user-friendly API style, offering both compile-time type-safety checks and the query-optimization characteristics of DataFrame.
    3) DataSet supports encoders, so data stored off-heap can be accessed without deserializing the whole object, which improves efficiency.
    4) A case class is used to define the structure of the data in a DataSet; each attribute of the case class maps directly to a field name in the DataSet.
    5) DataFrame is a special case of DataSet: DataFrame = Dataset[Row], so a DataFrame can be converted to a DataSet with the as method. Row is a type, just like Car or Person; any table-structured information can be represented with Row.
    6) DataSet is strongly typed. For example, there can be a Dataset[Car] or a Dataset[Person].
    7) A DataFrame only knows the field names, not their types, so operations cannot be type-checked at compile time; for example, an invalid operation on a String column is only reported as an error at execution time. A DataSet knows both the field names and their types, so error checking is stricter. The relationship is analogous to that between JSON objects and instances of a class (see the sketch after this list).
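
    A minimal sketch of point 7), assuming the people.json file used in the examples below and a Person case class (requires import spark.implicits._, introduced in the programming section):

    case class Person(name: String, age: Long)
    val df = spark.read.json("examples/src/main/resources/people.json")
    // A misspelled column name only fails at runtime with an AnalysisException:
    // df.select("naame").show()
    // The same mistake on a strongly typed Dataset fails at compile time:
    val ds = df.as[Person]
    // ds.map(p => p.naame)        // does not compile
    ds.map(p => p.name).show()     // type-checked at compile time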

Two: Spark SQL Programming

In older versions, Spark SQL provided two starting points for SQL queries: one named SQLContext, for Spark's own SQL queries, and one named HiveContext, for connecting to and querying Hive.

SparkSession is the newest starting point for Spark SQL queries. It is essentially a combination of SQLContext and HiveContext, so the APIs available on SQLContext and HiveContext can also be used on SparkSession. SparkSession internally encapsulates a SparkContext, so the actual computation is still performed by the SparkContext.
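
A minimal sketch of this relationship (in spark-shell the spark object already exists; here it is built explicitly, and the appName value is illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("session-demo")
      .master("local[*]")
      .getOrCreate()

    // The SparkContext is encapsulated inside the SparkSession
    val sc = spark.sparkContext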

  1. DataFrame: In Spark SQL, SparkSession is the entry point for creating DataFrames and executing SQL. There are three ways to create a DataFrame: from a Spark data source, by converting an existing RDD, or by querying a Hive table.

    1) Create from a data source

    1. View the file formats available when creating from a Spark data source
    spark.read
    2. Read a JSON file to create a DataFrame
    val df=spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
    

    2) Create by converting from an existing RDD

    3) Create from a Hive table query

    The third way is covered in detail in a later section.

  2. SQL-style syntax

    1. Create a DataFrame
    val df=spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
    2. Create a temporary view on the DataFrame
    scala> df.createOrReplaceTempView("people")
    Note: a temporary view is scoped to the Session; once the Session ends, the view is gone. To make it valid for the whole application, use a global view. A global view must be accessed with its full path, e.g. global_temp.people
    3. Create a global view on the DataFrame
    scala> df.createGlobalTempView("people")
    4. Query the whole table with a SQL statement
    scala> spark.sql("SELECT * FROM global_temp.people").show()
    scala> spark.newSession().sql("SELECT * FROM global_temp.people").show()
    
  3. Converting an RDD to a DataFrame

    1. Note: to convert between an RDD and a DF or DS, you must import spark.implicits._

      [Here spark is not a package name but the name of the SparkSession object.]
      Precondition: import the implicit conversions and create an RDD.

    scala> import spark.implicits._
    import spark.implicits._
    scala> val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
    peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[3] at
    textFile at <console>:27
    1) Convert by specifying the columns manually
    scala> peopleRDD.map{x=>val para = x.split(",");(para(0),para(1).trim.toInt)}.toDF("name","age")
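    2) Alternatively, a sketch of specifying the schema programmatically with Spark's StructType API (assuming the same peopleRDD as above):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Build the schema explicitly instead of inferring it from a tuple or case class
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val rowRDD = peopleRDD.map { x =>
      val para = x.split(",")
      Row(para(0), para(1).trim.toInt)
    }
    val peopleDF = spark.createDataFrame(rowRDD, schema)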
    
  4. Converting a DataFrame to an RDD

    1. Simply call the rdd method directly

      1) Create a DataFrame
      scala> val df = spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
      2) Convert the DataFrame to an RDD
      scala> val dfToRDD = df.rdd
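      The result is an RDD[Row]; a small sketch of reading a field back out of each Row (the column name matches the people.json file above):

      // getAs extracts a column from a Row by name
      dfToRDD.map(row => row.getAs[String]("name")).collect().foreach(println)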
      

Three: DataSet: A DataSet is a strongly typed collection of data, so the corresponding type information must be provided.

  1. Create a DataSet

    1. Create a case class
    case class Person(name: String, age: Long)
    2. Create the DataSet
    val caseClassDS = Seq(Person("Andy", 32)).toDS()
    
    2. Converting an RDD to a DataSet

    Spark SQL can automatically convert an RDD that contains case class instances into a DataSet; the case class defines the structure of the table, and the attributes of the case class become the table's column names through reflection.

    1. Create an RDD
    val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
    2. Create a case class
    case class Person(name: String, age: Long)
    3. Convert the RDD to a DataSet
    scala> peopleRDD.map(line => {val para = line.split(",");Person(para(0),para(1).trim.toInt)}).toDS()
    
    3. Converting a DataSet to an RDD: just call the rdd method
    1) Create a DataSet
    scala> val DS = Seq(Person("Andy", 32)).toDS()
    2) Convert the DataSet to an RDD
    scala> DS.rdd
    
    4. Interconverting DataFrame and DataSet
    1. DataFrame to DataSet
    1) Create a DataFrame
    scala> val df = spark.read.json("examples/src/main/resources/people.json")
    df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
    2) Create a case class
    scala> case class Person(name: String, age: Long)
    defined class Person
    3) Convert the DataFrame to a DataSet
    scala> df.as[Person]
    res14: org.apache.spark.sql.Dataset[Person] = [age: bigint, name: string]
    2. DataSet to DataFrame
    1) Create a case class
    scala> case class Person(name: String, age: Long)
    defined class Person
    2) Create a DataSet
    scala> val ds = Seq(Person("Andy", 32)).toDS()
    ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
    3) Convert the DataSet to a DataFrame
    scala> val df = ds.toDF
    

    Summary: converting a DataSet to a DataFrame is very simple: the case class is just wrapped into Row, and after importing the implicit conversions you call toDF. Converting a DataFrame to a DataSet is also simple: after importing the implicit conversions, call as[SomeCaseClass]; the case class specifies the concrete type of each row.

  2. What RDD, DataFrame, and DataSet have in common

    1. RDD, DataFrame, and DataSet are all distributed, resilient data sets on the Spark platform, which makes
    processing very large data sets convenient.
    2. All three are lazy: transformations such as map are not executed immediately when they are created; traversal
    only begins when an action such as foreach is encountered.
    3. All three automatically cache computations in memory according to Spark's memory situation, so even with a
    large amount of data there is no need to worry about memory overflow.
    4. All three have the concept of partitions.
    5. All three share many common functions, such as filter and sorting.
    6. Many operations on DataFrame and DataSet require this import:
    import spark.implicits._
    7. Both DataFrame and DataSet can use pattern matching to obtain the value and type of each field (see the sketch after this list).
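
    A minimal sketch of point 7, assuming the Person case class and the people data from the examples above:

    import org.apache.spark.sql.Row

    // DataFrame: every row is a Row, so pattern-match on its columns
    val peopleDF = peopleRDD
      .map { x => val para = x.split(","); (para(0), para(1).trim.toInt) }
      .toDF("name", "age")
    peopleDF.map { case Row(name: String, age: Int) => s"$name is $age" }.show()

    // DataSet: every row is a Person, so pattern-match on the case class
    val peopleDS = Seq(Person("Andy", 32)).toDS()
    peopleDS.map { case Person(name, age) => s"$name is $age" }.show()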

  3. Differences among the three

    1. RDD:

    1) RDDs are generally used together with Spark MLlib.
    2) RDDs do not support Spark SQL operations.

    2. DataFrame:

    1) Unlike RDD and DataSet, the type of every row of a DataFrame is fixed to Row, so the values of each column
    cannot be accessed directly; the value of a field can only be obtained by parsing it out.

    2) DataFrame and DataSet are generally not used together with Spark MLlib.
    3) Both DataFrame and DataSet support Spark SQL operations, such as select and groupBy, and both can be
    registered as temporary tables/views so that SQL statements can be executed against them.

    4) DataFrame and DataSet support some particularly convenient ways of saving data, such as saving to CSV with a
    header, so that the column name of each field is clear at a glance.

    3. DataSet:

    1) DataSet and DataFrame have exactly the same member functions; the only difference is the type of each row.
    2) A DataFrame can also be written as Dataset[Row]: the type of each row is Row, which is not parsed, so which
    fields each row contains and what type each field has cannot be known; fields can only be extracted with the
    getAs method or with the pattern matching mentioned in point 7 of the commonalities above. In a DataSet, the
    type of each row is not fixed in advance; after defining a case class, the information in each row can be
    accessed freely.

    As can be seen, a DataSet is very convenient when you need to access a field of a particular row. However, when
    writing highly generic functions, the row type is unknown and could be any case class, so a DataSet cannot adapt
    to it; in that case DataFrame, i.e. Dataset[Row], solves the problem better (see the sketch below).
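
    A small sketch of that last point: a helper that works on any DataFrame, whatever its schema, because every row is a Row (the function name is illustrative); a Dataset of one specific case class could not be this generic:

    import org.apache.spark.sql.DataFrame

    // Works for an arbitrary DataFrame: pull the first column out as strings
    def firstColumnAsString(df: DataFrame): Array[String] =
      df.rdd.map(row => if (row.isNullAt(0)) "null" else row.get(0).toString).collect()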

  4. Creating a Spark SQL program in IDEA and user-defined functions (UDF)

    package com.ityouxin.sparkSql
    
    import org.apache.spark.sql.SparkSession
    
    object HelloWorld {
      def main(args: Array[String]): Unit = {
        //Create the SparkSession and set the application name
        val spark = SparkSession
          .builder()
          .master("local[*]")
          .appName("Spark SQL basic example")
          .config("spark.some.config.option", "some-value")
          .getOrCreate()
        import  spark.implicits._
        val df = spark.read.json("datas/people.json")
        df.show()
        df.filter($"age" > 21).show()
        df.createOrReplaceTempView("persons")
        spark.sql("SELECT * FROM persons where age > 21").show()
        spark.stop()
      }
    }
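
    The heading above also mentions user-defined functions; a minimal sketch of registering and calling one (the UDF name addName is illustrative), which could be added before spark.stop() in the program above:

    // Register a simple Scala function as a SQL UDF and call it from SQL
    spark.udf.register("addName", (x: String) => "Name: " + x)
    spark.sql("SELECT addName(name), age FROM persons").show()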
    

Four: Spark SQL data sources

  1. Generic load/save methods

    1. Manually specify options

      Spark SQL's DataFrame interface supports operating on a variety of data sources. A DataFrame can be operated on in RDD style, and it can also be registered as a temporary view. After a DataFrame has been registered as a temporary view, SQL queries can be executed against it.

      Spark SQL's default data source is the Parquet format. When the data source is a Parquet file, Spark SQL can easily perform all of its operations. The default data source format can be changed by modifying the configuration item spark.sql.sources.default.

      When the source file is not in Parquet format, the data source format must be specified manually. Use the fully qualified name of the data source format (e.g. org.apache.spark.sql.parquet); for built-in formats, the short name is enough: json, parquet, jdbc, orc, libsvm, csv, text. Data can be loaded generically with the read.load method provided by SparkSession, and saved with write and save.

      val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
      val sqlDF = spark.sql("SELECT * FROM parquet.`hdfs://hadoop102:9000/namesAndAges.parquet`")
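      Saving works symmetrically through write and save; a small sketch reusing the path shown above:

      // Save in a manually specified format; mirrors read.format(...).load(...)
      peopleDF.write.format("parquet").save("hdfs://hadoop102:9000/namesAndAges.parquet")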
      
    2. File Saving Options

      A SaveMode can be specified when performing storage operations; SaveMode defines how existing data is handled. Note that these save modes do not use any locking and are not atomic. In addition, when Overwrite mode is used, the original data is deleted before the new data is written out. The SaveMode options are detailed in the following table: omitted.
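
      A minimal sketch of setting a save mode (the paths are illustrative):

      import org.apache.spark.sql.SaveMode

      // Available modes: ErrorIfExists (default), Append, Overwrite, Ignore
      peopleDF.write.mode(SaveMode.Overwrite).parquet("hdfs://hadoop102:9000/people.parquet")
      peopleDF.write.mode("append").json("hdfs://hadoop102:9000/people.json")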

    3. JSON

    Spark SQL can automatically infer the schema of a JSON data set and load it as a Dataset[Row]. A JSON file can be loaded with SparkSession.read.json().
    Note: this is not a traditional JSON file; each line must be a complete JSON string.

    val path = "examples/src/main/resources/people.json"
    val peopleDF = spark.read.json(path)
    peopleDF.printSchema()
    
    4. Parquet files

    Parquet is a popular columnar storage format that can efficiently store records with nested fields. The Parquet format is widely used in the Hadoop ecosystem, and it supports all Spark SQL data types. Spark SQL provides methods to read and write the Parquet format directly.

    import spark.implicits._
    val peopleDF = spark.read.json("examples/src/main/resources/people.json")
    peopleDF.write.parquet("hdfs://hadoop102:9000/people.parquet")
    val parquetFileDF = spark.read.parquet("hdfs://hadoop102:9000/people.parquet")
    parquetFileDF.createOrReplaceTempView("parquetFile")
    val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
    namesDF.map(attributes => "Name: " + attributes(0)).show()
    
    5. JDBC

    Spark SQL can create a DataFrame by reading data from a relational database over JDBC; after a series of computations on the DataFrame, the data can also be written back to the relational database.
    Note: the relevant database driver JAR must be placed on Spark's classpath.

    // load: the default file format is determined by the parameter spark.sql.sources.default
        val df = spark.read.load("datas/users.parquet")
        df.show()
       // spark.sql("select * from parquet.`datas/users.parquet`").show()
        //spark.sql("select * from json.`datas/people.json`").show()
      //jdbc
        spark.read.format("jdbc")
          .option("url","jdbc:mysql://localhost:3306/day10")
          .option("dbtable","user")
          .option("user","root")
          .option("password","123")
          .load().show()
        println("-------------------")
        val option = new Properties   // requires: import java.util.Properties
        option.setProperty("user","root")
        option.setProperty("password","123")
        spark.read.jdbc("jdbc:mysql://localhost:3306/day10","user",option).show()
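
    The text above also notes that a DataFrame can be written back to the database; a hedged sketch reusing the same illustrative connection settings (the target table name user_copy is illustrative):

    // Write a DataFrame back to MySQL over JDBC
    df.write.mode("append")
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/day10")
      .option("dbtable", "user_copy")
      .option("user", "root")
      .option("password", "123")
      .save()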
    
    6. Hive

    Apache Hive is the SQL engine on Hadoop, and Spark SQL can be compiled either with or without Hive support. Spark SQL with Hive support allows access to Hive tables, UDFs (user-defined functions), the Hive Query Language (HiveQL/HQL), and so on. It should be emphasized that including the Hive libraries in Spark SQL does not require Hive to be installed. In general, it is best to compile Spark SQL with Hive support so that these features can be used. If you download a binary distribution of Spark, it should already have been built with Hive support.
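
    A minimal sketch of enabling Hive support on a SparkSession and querying a Hive table (the table name is illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-demo")
      .enableHiveSupport()   // requires a Spark build that includes Hive support
      .getOrCreate()

    spark.sql("SHOW TABLES").show()
    spark.sql("SELECT * FROM some_hive_table").show()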

Origin blog.csdn.net/weixin_38255444/article/details/104235627