Spark Distributed Big Data Processing in Practice, Notes (3): Spark SQL

Foreword

    Spark is a fast, general-purpose cluster computing platform for large-scale data. In these notes the author works through the Spark official documentation as hands-on practice and tries to show what makes Spark appealing. For an introduction to the framework and for environment configuration, refer to the following:

    1. Introduction to the big data processing frameworks Hadoop and Spark

    2. Installing and configuring Hadoop on Linux

    3. Installing and configuring Spark on Linux

    The environment used in this article: Deepin 15.11, Java 1.8.0_241, Hadoop 2.10.0, Spark 2.4.4, Scala 2.11.12.

    The contents of this article:

        I. Getting Started with Spark SQL

            1. SparkSession

            2. Creating DataFrames

            3. Running SQL Queries

            4. Creating Datasets

            5. RDD Interoperability

            6. User-Defined Functions (UDFs)

        II. Data Sources

            1. Generic Load/Save Functions

            2. Hive Tables

            3. JDBC to Other Databases

        III. Performance Tuning

 

I. Getting Started with Spark SQL

    Spark SQL is a Spark module for processing structured data. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of the data and of the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. The same execution engine is used no matter which API or language is used to express a computation, so this unification means developers can easily switch back and forth between the different APIs and use whichever provides the most natural way to express a given transformation.

    1. SparkSession

    The entry point to all functionality in Spark SQL is the SparkSession class. To create a SparkSession, simply use SparkSession.builder(). If a warning says that a SparkSession has already been created, the previously created session is being reused and some of the new settings will not take effect; you can stop the current SparkSession with its .stop() method:

scala> val spark = SparkSession.builder().appName("Spark SQL").config("spark.some.config.option", "some-value").getOrCreate()
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@28a821f9
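    Since the paragraph above mentions the .stop() method, here is a minimal sketch of stopping the reused session and building a fresh one so that new configuration values take effect (assuming spark is the currently active session):

// Stop the existing session, then build a new one with the desired config
spark.stop()

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("Spark SQL")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()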

    2. Creating DataFrames

    With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

scala> import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Dataset

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
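    Besides reading files, a DataFrame can also be built from a local Scala collection, which is convenient for quick experiments. A minimal sketch (the column names and rows below are made up purely for illustration):

import spark.implicits._

// toDF on a Seq of tuples builds a DataFrame; its arguments name the columns
val peopleFromSeq = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")
peopleFromSeq.show()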

    A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that combines the advantages of RDDs (strong typing, the ability to use powerful lambda functions) with the advantages of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, etc.). A DataFrame is a Dataset organized into named columns. Here are some examples of structured data processing using Datasets:

scala> import spark.implicits._
import spark.implicits._

scala> df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> df.select("name").show()
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

scala> df.select($"name", $"age" + 1).show()
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+

scala> df.filter($"age" > 21).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+

    3. Running SQL Queries

    The sql function on a SparkSession lets applications run SQL queries programmatically and returns the result as a DataFrame.

// Create a temporary view
scala> df.createOrReplaceTempView("people")

scala> val sqlDF = spark.sql("SELECT * FROM people")
sqlDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> sqlDF.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

    Temporary views in Spark SQL are session-scoped: they disappear when the session that created them terminates. If you want a temporary view that is shared among all sessions and stays alive until the Spark application exits, you can create a global temporary view. Global temporary views are tied to the system database global_temp, and you must use the qualified name to refer to them, e.g. SELECT * FROM global_temp.view1.

// Create a global temporary view
scala> df.createGlobalTempView("people")

scala> spark.sql("SELECT * FROM global_temp.people").show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
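    To confirm that a global temporary view really is shared across sessions, it can be queried from a brand-new session; a short sketch:

// A new session created in the same application can still see the view,
// as long as it is referenced through the global_temp database
spark.newSession().sql("SELECT * FROM global_temp.people").show()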

    4. Creating Datasets

    Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized encoder to serialize the objects for processing or for transmission over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are dynamically generated code and use a format that lets Spark perform many operations such as filtering, sorting, and hashing without deserializing the bytes back into an object.

// Note: in Scala 2.10, case classes can have at most 22 fields. Custom classes can be used to work around the limit.
scala> case class Person(name: String, age: Long)
defined class Person

// Encoders are created for case classes
scala> val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

// Encoders for most common types are provided automatically by importing spark.implicits._
scala> caseClassDS.show()
+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+

scala> val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> val path = "file:///usr/local/spark/examples/src/main/resources/people.json"
path: String = file:///usr/local/spark/examples/src/main/resources/people.json

// A DataFrame can be converted to a Dataset by providing a class; mapping is done by name
scala> val peopleDS = spark.read.json(path).as[Person]
peopleDS: org.apache.spark.sql.Dataset[Person] = [age: bigint, name: string]

scala> peopleDS.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

    5. RDD Interoperability

    Spark SQL supports two different methods for converting existing RDDs into Datasets: inferring the schema using reflection, and specifying the schema programmatically.

    The Scala interface of Spark SQL supports automatically converting an RDD that contains case classes into a DataFrame. The case class defines the schema of the table: the case class parameter names are read via reflection and become the column names. Case classes can also be nested or contain complex types such as Seq or Array. The RDD can be implicitly converted into a DataFrame and then registered as a table, which can be used in subsequent SQL statements.

scala> import spark.implicits._
import spark.implicits._

scala> val peopleDF = spark.sparkContext.textFile("file:///usr/local/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(attributes => Person(attributes(0), attributes(1).trim.toInt)).toDF()
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]

scala> peopleDF.createOrReplaceTempView("people")

scala> val teenagersDF = spark.sql("SELECT name,age FROM people WHERE age BETWEEN 13 AND 19")
teenagersDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]

// The columns of a row in the result can be accessed by field index
scala> teenagersDF.map(teenager => "Name:" + teenager(0)).show()
+-----------+
|      value|
+-----------+
|Name:Justin|
+-----------+

// or by field name
scala> teenagersDF.map(teenager => "Name:" + teenager.getAs[String]("name")).show()
+-----------+
|      value|
+-----------+
|Name:Justin|
+-----------+

// There is no pre-defined encoder for Dataset[Map[K,V]], so define one explicitly as an implicit
scala> implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
mapEncoder: org.apache.spark.sql.Encoder[Map[String,Any]] = class[value[0]: binary]

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
scala> teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
res5: Array[Map[String,Any]] = Array(Map(name -> Justin, age -> 19))

    When case classes cannot be defined ahead of time (for example, when the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically in three steps:

  • Create an RDD of Rows from the original RDD.

  • Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.

  • Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val peopleRDD = spark.sparkContext.textFile("file:///usr/local/spark/examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[18] at textFile at <console>:34

scala> val schemaString = "name age"
schemaString: String = name age

scala> val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,StringType,true))

scala> val schema = StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,StringType,true))

// Convert the records of the RDD to Rows
scala> val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[22] at map at <console>:44

scala> val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: string]

scala> peopleDF.createOrReplaceTempView("people")

scala> val results = spark.sql("SELECT name FROM people")
results: org.apache.spark.sql.DataFrame = [name: string]

// The result returned by Spark SQL is a DataFrame and supports all the normal operations
scala> results.map(attributes => "Name:" + attributes(0)).show()
+------------+
|       value|
+------------+
|Name:Michael|
|   Name:Andy|
| Name:Justin|
+------------+

    6. User-Defined Functions (UDFs)

    DataFrames provide built-in functions for common aggregations, such as count(), countDistinct(), avg(), max(), min(), and so on. While these functions are designed for DataFrames, users are not limited to the predefined aggregate functions and can also create their own.

import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object MyAverage extends UserDefinedAggregateFunction {
  // Data type of the input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of the values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // Data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on identical input
  def deterministic: Boolean = true
  // Initialize the aggregation buffer
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Update the buffer with new input data
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merge two aggregation buffers
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculate the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

scala> spark.udf.register("myAverage", MyAverage)
res22: org.apache.spark.sql.expressions.UserDefinedAggregateFunction = MyAverage$@65d6f337

scala> val df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/employees.json")
df: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]

scala> df.createOrReplaceTempView("employees")

scala> df.show()
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
+-------+------+

scala> val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result: org.apache.spark.sql.DataFrame = [average_salary: double]

scala> result.show()
+--------------+
|average_salary|
+--------------+
|        3750.0|
+--------------+
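    Besides aggregate functions like MyAverage above, ordinary scalar UDFs can be registered in the same way. A minimal sketch (the function name addOne is just an example; spark.implicits._ is assumed to be imported for the $ column syntax):

// Register a scalar UDF; the returned UserDefinedFunction can also be used on columns
val addOne = spark.udf.register("addOne", (x: Long) => x + 1)

// Use it in SQL and through the DataFrame API
spark.sql("SELECT name, addOne(salary) AS salary_plus_one FROM employees").show()
df.select($"name", addOne($"salary")).show()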

II. Data Sources

    Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on with relational transformations and can also be used to create a temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section describes the general methods for loading and saving data with the Spark data sources and then covers the specific options available for the built-in data sources.

    1. Generic Load/Save Functions

    In the simplest form, the default data source (Parquet, unless otherwise configured by spark.sql.sources.default) is used for all operations.

scala> val usersDF = spark.read.load("file:///usr/local/spark/examples/src/main/resources/users.parquet")
usersDF: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string ... 1 more field]

scala> usersDF.show()
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

// After saving, the Parquet output can be listed with: hdfs dfs -ls data
scala> usersDF.select("name", "favorite_color").write.save("data/user.parquet")

    You can also manually specify the data source format. For built-in sources you can use their short names (json, parquet, jdbc, orc, libsvm, csv, text). DataFrames loaded from any data source type can be converted into other types using this syntax.

scala> val peopleDF = spark.read.format("json").load("file:///usr/local/spark/examples/src/main/resources/people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> peopleDF.select("name", "age").write.format("parquet").save("data/namesAndAges.parquet")
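    As an example of the short names and per-source options, a CSV file could be loaded like this. A sketch, assuming the sample people.csv shipped with the Spark examples (which uses ";" as the separator):

// Read a CSV file, specifying the separator, treating the first line as a header,
// and inferring the column types
val csvDF = spark.read.format("csv")
  .option("sep", ";")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("file:///usr/local/spark/examples/src/main/resources/people.csv")
csvDF.show()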

    Save operations can optionally take a SaveMode that specifies how to handle existing data. It is important to realize that these save modes do not use any locking. In addition, when performing an Overwrite, the existing data is deleted before the new data is written. DataFrames can also be saved as persistent tables in the Hive metastore using the saveAsTable command. For file-based data sources, the output can also be bucketed, sorted, or partitioned; bucketing and sorting apply only to persistent tables.

scala> peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
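    A sketch of the save mode and partitioning options mentioned above (the output paths here are illustrative):

import org.apache.spark.sql.SaveMode

// Overwrite any existing output instead of failing
peopleDF.select("name", "age").write.mode(SaveMode.Overwrite).parquet("data/people_overwrite.parquet")

// Partition the output files by a column; this works for file-based sources
usersDF.write.partitionBy("favorite_color").format("parquet").save("data/users_by_color.parquet")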

    2. Hive Tables

    Spark SQL also supports reading and writing data stored in Apache Hive. However, Hive has a large number of dependencies, so they are not included in the default Spark distribution. If the Hive dependencies are found on the classpath, Spark loads them automatically. Note that these Hive dependencies must also be present on all worker nodes, because they need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.

    When creating a Hive table, you need to define how the table reads data from and writes data to the file system, i.e. the "input format" and "output format". You also need to define how the table deserializes data into rows and serializes rows into data, i.e. the "serde". The following options can be used to specify the storage format ("serde", "input format", "output format"), e.g. CREATE TABLE src (id int) USING hive OPTIONS (fileFormat 'parquet'). By default, the table files are read as plain text. Note that Hive storage handlers are not supported when creating a table; you can create such a table on the Hive side using a storage handler and then use Spark SQL to read it.
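    A minimal sketch of working with Hive tables from Spark SQL, assuming the Hive dependencies are on the classpath; the warehouse location and table name below are placeholders, and the CREATE TABLE statement is the one quoted in the paragraph above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse") // assumed warehouse path
  .enableHiveSupport()
  .getOrCreate()

// Create a Hive table stored as Parquet and query it
spark.sql("CREATE TABLE IF NOT EXISTS src (id int) USING hive OPTIONS (fileFormat 'parquet')")
spark.sql("SELECT COUNT(*) FROM src").show()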

    3. JDBC to Other Databases

    Spark SQL also includes a data source that can read data from other databases over JDBC. This feature should be preferred over JdbcRDD, because the results are returned as a DataFrame, which can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python, since it does not require the user to provide a ClassTag. (Note that this is different from the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL.)
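    A sketch of reading a table over JDBC; the URL, table name, and credentials below are placeholders, and the corresponding JDBC driver must be on the classpath:

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")  // placeholder connection URL
  .option("dbtable", "people")                          // placeholder table name
  .option("user", "username")
  .option("password", "password")
  .load()

jdbcDF.show()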

    This concludes the notes on Spark SQL. The next article will describe Spark Streaming, Spark's stream-processing component, in more detail.

    The earlier notes in this series can be found at the following links:

    Spark Distributed Big Data Processing in Practice, Notes (1): Quick Start

    Spark Distributed Big Data Processing in Practice, Notes (2): RDDs and Shared Variables

 

 

 
