Foreword
Spark is a fast, large-scale cluster computing platform. These notes are the author's attempt to improve hands-on skills by working through the Spark official documentation with practical exercises, and to share the highlights of Spark. For an introduction to the framework and for environment configuration, see:
1. Introduction to the big data processing frameworks Hadoop and Spark
2. Installing and configuring Hadoop on Linux
3. Installing and configuring Spark on Linux
Environment used in these notes: Deepin 15.11, Java 1.8.0_241, Hadoop 2.10.0, Spark 2.4.4, Scala 2.11.12.
This article covers:
I. Getting Started with Spark SQL
1. SparkSession
2. Creating DataFrames
3. Running SQL Statements
4. Creating Datasets
5. RDD Interoperability
6. User-Defined Functions (UDFs)
II. Data Sources
1. General Features
2. Hive Tables
3. JDBC Databases
III. Performance Tuning
I. Getting Started with Spark SQL
Spark SQL is a Spark module for processing structured data. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL carry more information about the structure of the data and the computation being performed, and Spark SQL uses this extra information internally to apply additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. The same execution engine is used regardless of which API or language you use to express a computation. This unification means developers can easily switch back and forth between the different APIs and pick whichever expresses a given transformation most naturally.
1. SparkSession
The entry point to all Spark SQL functionality is the SparkSession class. To create a SparkSession, simply use SparkSession.builder(). If a warning says that a SparkSession already exists, it means the previously created SparkSession is being reused and some settings will not take effect; you can stop the current SparkSession with the .stop() method:
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession.builder().appName("Spark SQL").config("spark.some.config.option","some-value").getOrCreate()
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@28a821f9
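As noted above, if an earlier SparkSession is being reused and new configuration options are silently ignored, the active session can be stopped first so that a subsequent builder call creates a fresh one. A minimal sketch (not from the original transcript; the variable name freshSpark is only illustrative):
// Stop the active session, then build a new one that picks up the new config options
spark.stop()
val freshSpark = SparkSession.builder().appName("Spark SQL").config("spark.some.config.option","some-value").getOrCreate()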
2. Creating DataFrames
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
scala> import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Dataset
scala> import org.apache.spark.sql.Row;
import org.apache.spark.sql.Row
scala> val df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that combines the advantages of RDDs (strong typing, the ability to use powerful lambda functions) with the advantages of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). A DataFrame is a Dataset organized into named columns. Here are some examples of structured data processing using Datasets:
scala> import spark.implicits._
import spark.implicits._
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
scala> df.select("name").show()
+-------+
| name|
+-------+
|Michael|
| Andy|
| Justin|
+-------+
scala> df.select($"name", $"age" + 1).show()
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| null|
| Andy| 31|
| Justin| 20|
+-------+---------+
scala> df.filter($"age" > 21).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
| 19| 1|
|null| 1|
| 30| 1|
+----+-----+
3. Running SQL Statements
The sql function on a SparkSession allows applications to run SQL queries programmatically and returns the result as a DataFrame.
// Create a temporary view
scala> df.createOrReplaceTempView("people")
scala> val sqlDF = spark.sql("SELECT * FROM people")
sqlDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> sqlDF.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
Temporary views in Spark SQL are session-scoped; they disappear when the session that created them ends. If you want a temporary view that is shared among all sessions and stays alive until the Spark application terminates, you can create a global temporary view. Global temporary views are tied to the system-preserved database global_temp, and you must use the qualified name to refer to them, e.g. SELECT * FROM global_temp.view1.
// Create a global temporary view
scala> df.createGlobalTempView("people")
scala> spark.sql("SELECT * FROM global_temp.people").show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
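Because the view is registered in global_temp, it remains visible to other sessions of the same application; a quick check (the output matches the table above) is to query it from a fresh session:
// A global temporary view is cross-session: a new session can still query it
spark.newSession().sql("SELECT * FROM global_temp.people").show()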
4. Creating Datasets
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are dynamically generated code and use a format that allows Spark to perform many operations such as filtering, sorting, and hashing without deserializing the bytes back into an object.
// Note: in Scala 2.10, a case class supports at most 22 fields. You can work around this limit with a custom class.
scala> case class Person(name: String, age: Long)
defined class Person
// Create an encoder for the case class
scala> val caseClassDS = Seq(Person("Andy",32)).toDS()
caseClassDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
// Encoders for most common types are provided automatically by importing spark.implicits._
scala> caseClassDS.show()
+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+
scala> val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS: org.apache.spark.sql.Dataset[Int] = [value: int]
scala> val path = "file:///usr/local/spark/examples/src/main/resources/people.json"
path: String = file:///usr/local/spark/examples/src/main/resources/people.json
// A DataFrame can be converted to a Dataset by providing a class; mapping is done by column name
scala> val peopleDS = spark.read.json(path).as[Person]
peopleDS: org.apache.spark.sql.Dataset[Person] = [age: bigint, name: string]
scala> peopleDS.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
5. RDD Interoperability
Spark SQL supports two different methods for converting existing RDDs into Datasets: inferring the schema using reflection, and specifying the schema programmatically.
The Scala interface for Spark SQL supports automatically converting an RDD that contains case classes into a DataFrame. The case class defines the schema of the table: the names of the case class parameters are read using reflection and become the column names. Case classes can also be nested or contain complex types such as Seq or Array. The RDD can be implicitly converted to a DataFrame and then registered as a table, which can be used in subsequent SQL statements.
scala> import spark.implicits._
import spark.implicits._
scala> val peopleDF = spark.sparkContext.textFile("file:///usr/local/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(attributes => Person(attributes(0),attributes(1).trim.toInt)).toDF()
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
scala> peopleDF.createOrReplaceTempView("people")
scala> val teenagersDF = spark.sql("SELECT name,age FROM people WHERE age BETWEEN 13 AND 19")
teenagersDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
// Columns of a row in the result can be accessed by field index
scala> teenagersDF.map(teenager => "Name:" + teenager(0)).show()
+-----------+
| value|
+-----------+
|Name:Justin|
+-----------+
// or by field name
scala> teenagersDF.map(teenager => "Name:"+teenager.getAs[String]("name")).show()
+-----------+
| value|
+-----------+
|Name:Justin|
+-----------+
// There is no pre-defined encoder for Dataset[Map[K,V]]; define one explicitly as an implicit
scala> implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
mapEncoder: org.apache.spark.sql.Encoder[Map[String,Any]] = class[value[0]: binary]
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String,T]
scala> teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
res5: Array[Map[String,Any]] = Array(Map(name -> Justin, age -> 19))
When case classes cannot be defined ahead of time (for example, when the structure of records is encoded in a string, or when a text dataset is parsed and fields are projected differently for different users), a DataFrame can be created programmatically in three steps:
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, that matches the structure of the Rows in the RDD from step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val peopleRDD = spark.sparkContext.textFile("file:///usr/local/spark/examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[18] at textFile at <console>:34
scala> val schemaString = "name age"
schemaString: String = name age
scala> val fields = schemaString.split(" ").map(fieldName => StructField(fieldName,StringType,nullable = true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,StringType,true))
scala> val schema = StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,StringType,true))
// Convert the records of the RDD to Rows
scala> val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[22] at map at <console>:44
scala> val peopleDF = spark.createDataFrame(rowRDD,schema)
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: string]
scala> peopleDF.createOrReplaceTempView("people")
scala> val results = spark.sql("SELECT name FROM people")
results: org.apache.spark.sql.DataFrame = [name: string]
// The result returned by Spark SQL is a DataFrame and supports all the normal operations
scala> results.map(attributes => "Name:"+attributes(0)).show()
+------------+
| value|
+------------+
|Name:Michael|
| Name:Andy|
| Name:Justin|
+------------+
6. User-Defined Functions (UDFs)
Built-in DataFrame functions provide common aggregations such as count(), countDistinct(), avg(), max(), min(), and so on. While these functions are designed for DataFrames, users are not limited to the predefined aggregate functions and can also create their own.
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
object MyAverage extends UserDefinedAggregateFunction {
  // Data types of the input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of the values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output for the same input
  def deterministic: Boolean = true
  // Initializes the aggregation buffer: sum = 0, count = 0
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the buffer with a new input row
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
scala> spark.udf.register("myAverage", MyAverage)
res23: org.apache.spark.sql.expressions.UserDefinedAggregateFunction = MyAverage$@65d6f337
scala> val df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/employees.json")
df: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]
scala> df.createOrReplaceTempView("employees")
scala> df.show()
+-------+------+
| name|salary|
+-------+------+
|Michael| 3000|
| Andy| 4500|
| Justin| 3500|
| Berta| 4000|
+-------+------+
scala> val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result: org.apache.spark.sql.DataFrame = [average_salary: double]
scala> result.show()
+--------------+
|average_salary|
+--------------+
|        3750.0|
+--------------+
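Besides aggregate functions, plain scalar UDFs can also be registered and called from SQL. A small sketch (the function name toUpper is only illustrative and not part of the original notes):
// Register a scalar UDF that upper-cases a string column, then call it in SQL
spark.udf.register("toUpper", (s: String) => s.toUpperCase)
spark.sql("SELECT toUpper(name) AS name, salary FROM employees").show()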
II. Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on with relational transformations and can also be used to create a temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section describes the general methods for loading and saving data with Spark Data Sources, and then covers the specific options available for the built-in data sources.
1. General Features
In the simplest form, the default data source (Parquet, unless otherwise configured via spark.sql.sources.default) is used for all operations.
scala> val usersDF = spark.read.load("file:///usr/local/spark/examples/src/main/resources/users.parquet")
usersDF: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string ... 1 more field]
scala> usersDF.show()
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa| null| [3, 9, 15, 20]|
| Ben| red| []|
+------+--------------+----------------+
// After saving, the parquet file can be inspected with the hdfs dfs -ls data command
scala> usersDF.select("name","favorite_color").write.save("data/user.parquet")
You can also manually specify the data source format. For built-in sources, you can use their short names (json, parquet, jdbc, orc, libsvm, csv, text). DataFrames loaded from any data source type can be converted into other types using this syntax.
scala> val peopleDF = spark.read.format("json").load("file:///usr/local/spark/examples/src/main/resources/people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> peopleDF.select("name", "age").write.format("parquet").save("data/namesAndAges.parquet")
Save operations can optionally take a SaveMode that specifies how to handle existing data if it is already present. It is important to realize that these save modes do not use any locking. In addition, when performing an Overwrite, the existing data is deleted before the new data is written. DataFrames can also be saved as persistent tables in the Hive metastore with the saveAsTable command. For file-based data sources, the output can also be bucketed and sorted or partitioned. Bucketing and sorting apply only to persistent tables.
scala> peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
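The save modes and partitioning mentioned above can be combined on the writer; a sketch using the usersDF loaded earlier (the output path is arbitrary):
import org.apache.spark.sql.SaveMode

// Overwrite any existing output and partition the Parquet files by favorite_color
usersDF.write
  .mode(SaveMode.Overwrite)
  .partitionBy("favorite_color")
  .parquet("data/users_by_color.parquet")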
2. Hive Tables
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark loads them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they need the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
When creating a Hive table, you need to define how the table should read data from and write data to the file system, i.e. the "input format" and "output format". You also need to define how the table deserializes data into rows and serializes rows into data, i.e. the "serde". The following options can be used to specify these storage formats ("serde", "input format", "output format"), e.g. CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet'). By default, the table files are read as plain text. Note that Hive storage handlers are not supported when creating a table; you can create a table with a storage handler on the Hive side and then use Spark SQL to read it.
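A minimal sketch of working with Hive from Spark SQL, assuming Hive support is available on the classpath and a metastore/warehouse location is configured (the table name and input path below are only illustrative):
import org.apache.spark.sql.SparkSession

// enableHiveSupport() turns on Hive metastore connectivity, Hive SerDes and Hive UDFs
val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
spark.sql("SELECT key, value FROM src").show()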
3. JDBC Databases
Spark SQL also includes a data source that can read data from other databases via JDBC. This functionality should be preferred over using JdbcRDD, because the results are returned as a DataFrame and can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python, as it does not require the user to provide a ClassTag. (Note that this is different from the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL.)
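A sketch of reading from a relational database over JDBC (the URL, table name, and credentials below are placeholders, and the matching JDBC driver jar must be on the Spark classpath):
// Load a database table as a DataFrame via JDBC
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "employees")
  .option("user", "username")
  .option("password", "password")
  .load()

// The result is an ordinary DataFrame: it can be joined with other sources or written back out
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "employees_copy")
  .option("user", "username")
  .option("password", "password")
  .save()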
This concludes the Spark SQL notes. The next article will cover Spark Streaming, Spark's stream-processing component, in more detail.
Previous notes in this series:
Spark Distributed Big Data Processing in Practice (1): Quick Start
Spark Distributed Big Data Processing in Practice (2): RDDs and Shared Variables