Spark Integration

One, Spark Architecture and Optimizer

1. Spark Architecture (focus)

  • Direct access to existing Hive data (the metadata links to Hive's addressing)
  • Provides JDBC/ODBC interfaces so that third-party tools can process data through Spark
  • Provides higher-level interfaces for processing data conveniently (SQL operators, the DataFrame class)
  • Supports multiple operating modes: SQL and API programming (spark-sql, spark-shell, the Spark API)
  • Supports a variety of external data sources: Parquet, JSON, RDBMS, CSV, text, Hive, HBase, etc.

2. Spark Optimizer

A SQL statement is first parsed by the Parser module into a syntax tree called the Unresolved Logical Plan. The Analyzer module then uses metadata to resolve it into a Logical Plan. Next, the optimizer applies a variety of rule-based optimization strategies to produce an Optimized Logical Plan. This plan is still logical and cannot be executed by Spark directly, so it must finally be converted into a Physical Plan.

  • Optimization rules (examples)

    1. Move query filters above the projection; 2. Check whether the filter can be pushed down (predicate pushdown)
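
Each of these stages can be inspected in the spark-shell; a minimal sketch (the DataFrame below is purely illustrative) that prints every plan produced by the optimizer with explain(true):

// any DataFrame / SQL query exposes its plans through explain(true)
val demo = spark.range(1, 100).toDF("id")
// prints the Parsed (Unresolved) Logical Plan, the Analyzed Logical Plan,
// the Optimized Logical Plan and the selected Physical Plan
demo.filter("id > 50").select("id").explain(true)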


Two, Spark SQL API (focus)

Introduction: Spark SQL, much like Hive, maintains data with a Schema through a metadata database plus a data warehouse. A DataFrame wraps an RDD as data with a schema and provides interfaces for operating on that data directly in a SQL style, including a parser for native SQL statements. Spark can also connect to Hive's metastore, or link to an external metadata database over JDBC, and use that metadata to operate on the Hive data warehouse and on RDBMS tables.

  • SparkContext

  • SQLContext

    • The Spark SQL programming entry point
  • HiveContext

    • A superset of SQLContext that contains more features
  • SparkSession (recommended in Spark 2.x)

    • SparkSession merges SQLContext and HiveContext
    • It is the single entry point Spark provides for interacting with Spark, and allows programming with the DataFrame and Dataset APIs (see the sketch below)
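
A minimal sketch of building that single entry point in an application (the application name and master URL are illustrative; in the spark-shell the session is already provided as spark):

import org.apache.spark.sql.SparkSession

// single entry point for SQL, DataFrame and Dataset programming (Spark 2.x)
val spark = SparkSession.builder()
  .appName("spark-sql-demo")   // illustrative application name
  .master("local[*]")          // local mode, for testing
  .getOrCreate()

// the older entry points remain reachable through the session
val sc = spark.sparkContext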

———————————————————————————————————————————

1. Dataset Overview

A strongly typed collection of domain objects (built from a Seq, Array, or RDD), based on RDD. The biggest difference from an RDD is that a Dataset carries schema information describing its data structure. Dataset = RDD + Schema

  • That is, a Dataset is a data set (matrix) that carries Schema (data structure) information
scala> spark.createDataset(1 to 3).show
scala> spark.createDataset(List(("a",1),("b",2),("c",3))).show
scala> spark.createDataset(sc.parallelize(List(("a",1,1),("b",2,2)))).show
  • createDataset() accepts a Seq, an Array, or an RDD as its parameter
  • The three lines above produce Dataset[Int], Dataset[(String, Int)], and Dataset[(String, Int, Int)] respectively
  • Because Dataset = RDD + Schema, a Dataset supports most of the common RDD functions, such as map, filter, etc. (see the example below)
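
For example, a small sketch in the spark-shell reusing the tuple Dataset built above:

import spark.implicits._
val ds = spark.createDataset(List(("a",1),("b",2),("c",3)))
// RDD-style operators work directly on the Dataset
ds.filter(_._2 > 1).map(t => (t._1, t._2 * 10)).show()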

———————————————————————————————————————————

2. DataFrame Overview

Based on Dataset, but with the element type restricted to the Row class. This restriction lets DataFrame operations be optimized specifically, so DataFrame operations often run more than 3 times faster than the equivalent Dataset operations. A Row is the equivalent of a record in SQL.

DataFrame = RDD[Row] + Schema

  • DataFrame=Dataset[Row]

  • Similar to a traditional two-dimensional data table

  • Adds the Schema (data structure information) on top of the RDD

  • The DataFrame Schema supports nested data types (see the sketch after this list)

    • These correspond to larger data structures (as in HBase) and consume more resources
    • struct
    • map
    • array
  • Provides more SQL-like operation APIs, and native SQL statements can be used directly as queries, e.g. spark.sql("select * from table_name")
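
A small sketch of what such a nested Schema looks like when defined with StructType (the field names are illustrative):

import org.apache.spark.sql.types._

// a nested Schema containing a struct, an array and a map field
val nested = new StructType()
  .add("name", StringType)
  .add("address", new StructType()               // struct
    .add("city", StringType)
    .add("zip", StringType))
  .add("scores", ArrayType(IntegerType))         // array
  .add("tags", MapType(StringType, StringType))  // map

nested.printTreeString()   // prints the nested tree, like printSchema on a DataFrame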

———————————————————————————————————————————

3. Converting between RDD and DF/DS

The key point in each case is what the Schema is and where it comes from:

  • Converting a DS/DF to an RDD
case class Point(label:String,x:Double,y:Double)
val points=Seq(Point("bar",3.0,5.6),Point("foo",-1.0,3.0)).toDF("label","x","y")
// convert the DataFrame to an RDD
val s=points.rdd
  • Organizing an RDD into a DS or DF

    • The toDF / toDS operators
    • This approach infers the Schema automatically by reflecting on the data types inside the RDD (e.g., a case class)
case class Person(name:String,age:Int)
// use reflection on the case class inside the RDD to obtain the Schema and build the DF
import spark.implicits._
val people=sc.textFile("file:///home/hadoop/data/people.txt")
		.map(_.split(","))
		.map(p => Person(p(0), p(1).trim.toInt)).toDF()  // map builds case class instances, then toDS or toDF
people.show
people.registerTempTable("people") // register the DF as a temp table so SQL statements can be used
val teenagers = spark.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
teenagers.show()
// you can also skip the case class and call toDF on an Array, Seq, etc., passing the column names as arguments; the types are inferred automatically
  • Building a DF from a predefined Schema (focus)

    Text read with sc.textFile requires you to handle implicit conversions, data formatting, and the header row yourself

    spark.read.format cannot convert field types automatically; every field comes back as String

    So we solve the awkward type-conversion problem by predefining a schema and then reading the data with it

    The precondition is that all fields are declared; even fields you do not care about need an extra definition

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}
    val myschema=new StructType().add("order_id",StringType).add("order_date",StringType).add("customer_id",StringType).add("status",StringType)
    val orders=spark.read.schema(myschema).csv("file:///root/orders.csv")
    

     

  • Building a DF from Row + Schema (suitable for raw data from external sources)

    // combine Row objects with an explicitly specified Schema
    val people=sc.textFile("file:///home/hadoop/data/people.txt")
    val schemaString = "name age" // define the DataFrame's Schema information as a string
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}
    
    // use the StructType class to define a custom Schema
    val schema = StructType(schemaString.split(" ").map(fieldName =>StructField(fieldName,StringType, true))) // the type is Array[StructField]
    
    // wrap the data content in Row objects
    val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim)) 
    
    // create the DF, passing the Row RDD and the StructType
    val peopleDataFrame = spark.createDataFrame(rowRDD, schema) 
    
    // register the DataFrame as a temporary table
    peopleDataFrame.registerTempTable("people")
    val results = spark.sql("SELECT name FROM people")
    results.show
    

     

———————————————————————————————————————————

4. Common Operations

// different ways to reference a column in select
ds.select("name")
ds.select(ds("name"))
ds.select(col("name"))
ds.select(column("name"))
ds.select('name)
ds.select($"name")
// summary statistics for a column
df.describe("colname").show()

// date handling (Java time classes)
import java.time.LocalDate
import java.time.format.DateTimeFormatter
LocalDate.parse("2018-08-01 12:22:21", DateTimeFormatter.ofPattern("yyyy-MM-dd hh:mm:ss")).getDayOfWeek
// wrap this Java call in a Scala function for convenience
def TimeParse(x:String):String={
    LocalDate.parse(x,DateTimeFormatter.ofPattern("yyyy-MM-dd hh:mm:ss")).getDayOfWeek.toString
     }
// join operation; use the === operator when comparing columns
val joined=orderDS.join(order_itemsDS,orderDS("id")===order_itemsDS("order_id"))

// common operations
val df = spark.read.json("file:///home/hadoop/data/people.json")
// print the DataFrame's Schema with printSchema
df.printSchema()
// use select to choose the fields we need
df.select("name").show()
// select the fields we need and add 1 to the age field
df.select(df("name"), df("age") + 1).show()
// filter rows on a condition with filter
df.filter(df("age") > 21).show()
// group with groupBy and count the rows in each group
df.groupBy("age").count().show()
// the sql() method executes SQL queries
df.registerTempTable("people") // the df must first be registered as a temp table
spark.sql("SELECT * FROM people").show // then the registered table can be queried directly in SQL

5. Type Conversion

Fields read in this way come back as String, so type conversion is needed.

// 1. converting a single column
import org.apache.spark.sql.types._
val data = Array(("1", "2", "3", "4", "5"), ("6", "7", "8", "9", "10"))
val df = spark.createDataFrame(data).toDF("col1", "col2", "col3", "col4", "col5")

import org.apache.spark.sql.functions._
df.select(col("col1").cast(DoubleType)).show()

+----+
|col1|
+----+
| 1.0|
| 6.0|
+----+

// 2. converting columns in a loop
// to convert every column to double with this approach, loop over the columns with withColumn
val colNames = df.columns

var df1 = df
for (colName <- colNames) {
  df1 = df1.withColumn(colName, col(colName).cast(DoubleType))
}
df1.show()

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| 1.0| 2.0| 3.0| 4.0| 5.0|
| 6.0| 7.0| 8.0| 9.0|10.0|
+----+----+----+----+----+

// 3. using the :_* varargs syntax
// the loop above is relatively inefficient; Scala's array:_* varargs syntax can pass all the columns at once, and df's select supports it, so the final version can be written as below
val cols = df.columns.map(x => col(x).cast("Double"))
df.select(cols: _*).show()

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| 1.0| 2.0| 3.0| 4.0| 5.0|
| 6.0| 7.0| 8.0| 9.0|10.0|
+----+----+----+----+----+

// this makes it easy to select several specified columns and to cast the types of specified columns:
val name = "col1,col3,col5"
df.select(name.split(",").map(name => col(name)): _*).show()
df.select(name.split(",").map(name => col(name).cast(DoubleType)): _*).show()


Three, Spark External Data Source Operations (focus)

1. Parquet Files (the default format)

A popular columnar storage format that stores data in binary form; the file itself carries metadata (the row data together with its Schema).

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema=StructType(Array(StructField("name",StringType),
					        StructField("favorite_color",StringType),
					        StructField("favorite_numbers",ArrayType(IntegerType))))
val rdd=sc.parallelize(List(("Alyssa",null,Array(3,9,15,20)),("Ben","red",null)))
val rowRDD=rdd.map(p=>Row(p._1,p._2,p._3))
val df=spark.createDataFrame(rowRDD,schema)


val df=spark.read.parquet("/data/users/")	// the directory must already contain parquet files
df.write.parquet("/data/users_new/")				// writes parquet files into this directory

——————————————————————————————————————————

2. Hive Tables

Integrating with Hive:

Spark SQL and Hive integration: Hive opens an external interface to its metadata repository (the metastore); Spark connects to that database and uses the metadata to obtain Hive's data. In effect, metadata is managed exactly as in Hive.

Test-environment integration (shell development)

  • Start Hive's metastore service
# hive opens the external interface to its metastore on port 9083
nohup hive --service metastore &  # nohup detaches it from the terminal, so the service keeps running after the terminal exits

  • Connect spark-shell to Hive's metastore
// link Spark to Hive's metastore database
spark-shell --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083

// or, build the session yourself
val spark = SparkSession.builder()
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()

// save a table to hive
df.write.saveAsTable("tbl_name")

 

Production-environment integration (IDE development)

Spark SQL and Hive integration: 1. Copy hive-site.xml into ${SPARK_HOME}/conf; 2. Append the following content to it:

<property>
<name>hive.metastore.uris</name>
<value>thrift://<IP of master>:9083</value>
</property>

(Optional: if the xml does not contain this configuration, the metastore is reached automatically over jdbc to mysql.) 3. Start the metastore service: nohup hive --service metastore & 4. In the application, create a SparkSession, configure the warehouse location, and enable Hive support:

val spark = SparkSession.builder()
.config("spark.sql.warehouse.dir", warehouseLocation)  // the warehouse location is optional; it defaults to a directory under where the application is started
.enableHiveSupport()
.getOrCreate()

 

Saving from Spark into Hive

hive --service metastore
spark-shell
spark.sql("show tables").show
// or
val df=spark.table("toronto") // returns a DataFrame
df.printSchema
df.show
df.write.saveAsTable("dbName.tblName")

// query it from hive
select * from dbName.tblName;

———————————————————————————————————————————

3. MySQL Tables (RDBMS)

Integration with an RDBMS (relational database management system) works much the same as with Hive, except that instead of going through Hive's metastore service, Spark links to MySQL directly over JDBC.

spark-shell --driver-class-path /opt/hive/lib/mysql-connector-java-5.1.38.jar
val df=spark.read.format("jdbc").option("url","jdbc:mysql://192.168.137.137:3306/test").option("dbtable","orders").option("user","root").option("password","rw").load()

 

$spark-shell --jars /opt/spark/ext_jars/mysql-connector-java-5.1.38.jar // use the JDBC driver jar to connect to the RDBMS

val url = "jdbc:mysql://localhost:3306/test" // test is a database name
val tableName = "TBLS"  // TBLS is a table in that database
// set the connection user, password and driver class
val prop = new java.util.Properties
prop.setProperty("user","hive")
prop.setProperty("password","mypassword")
prop.setProperty("driver","com.mysql.jdbc.Driver")
// read the table into a DataFrame
val jdbcDF = spark.read.jdbc(url,tableName,prop)
jdbcDF.show

// save the DF as a new table
jdbcDF.write.mode("append").jdbc(url,"t1",prop)


Four, Spark SQL Functions

1. Built-in Functions (org.apache.spark.sql.functions)

  • Unlike ordinary Scala functions, these Spark SQL functions operate on the columns of tables.
  • The built-in functions work on DataFrames; they are not Hive UDFs
  • Aggregate functions: countDistinct, sumDistinct
  • Collection functions: sort_array, explode
  • Date and time functions: hour, quarter, next_day
  • Mathematical functions: asin, atan, sqrt, tan, round
  • Window functions: row_number
  • String functions: concat, format_number, regexp_extract
  • Other functions: isNaN, sha, randn, callUDF
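
A small sketch applying one function from several of these categories to an illustrative DataFrame:

import spark.implicits._
import org.apache.spark.sql.functions._

// a small illustrative DataFrame
val sales = Seq(("2018-08-01","mike",3.14159), ("2018-11-02","anna",2.71828)).toDF("order_date","customer","amount")

sales.select(
  concat(col("customer"), lit("_"), col("order_date")).as("key"),  // string function
  round(col("amount"), 2).as("amount_2dp"),                        // math function
  quarter(col("order_date")).as("quarter")                         // date function
).show()

sales.agg(countDistinct("customer").as("n_customers")).show()      // aggregate function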

 

2. Custom Functions

  • Define the function

  • Register the function

    • SparkSession.udf.register(): effective only inside sql() (the Spark SQL integration path)
    • functions.udf(): effective for the DataFrame API (the RDD-based side of Spark)
  • Call the function

// register a custom function; note that it is an anonymous function
spark.udf.register("hobby_num", (s: String) => s.split(',').size)
spark.sql("select name, hobbies, hobby_num(hobbies) as hobby_num from hobbies").show
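
The code above takes the sql() route; a minimal sketch of the functions.udf() route for the DataFrame API (hobbiesDF is a hypothetical DataFrame with "name" and "hobbies" columns):

import org.apache.spark.sql.functions.{col, udf}

// the same logic wrapped with functions.udf, usable directly in the DataFrame API
val hobbyNum = udf((s: String) => s.split(',').length)
// hobbiesDF is assumed to exist with "name" and "hobbies" columns
hobbiesDF.select(col("name"), col("hobbies"), hobbyNum(col("hobbies")).as("hobby_num")).show()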

 

Five, Spark-SQL

Because many developers have moved over from Hive, Spark-SQL is operated in much the same way as Hive.

But Spark-SQL is not based on the MapReduce framework, so by comparison it is very fast.

  • The Spark SQL CLI is a convenient command-line query tool that uses the Hive metastore service and executes queries entered on the command line in local mode. Note that the Spark SQL CLI cannot communicate with the Thrift JDBC server.
  • The Spark SQL CLI is the counterpart of the Hive CLI (the old CLI) and the Beeline CLI (the new CLI)
  • To start the Spark SQL CLI, run the following in the Spark directory
./bin/spark-sql

 

Six, Performance Optimization

1. Serialization

  • Java serialization, Spark's default

  • Kryo serialization, about 10 times faster than Java serialization, but it cannot serialize every type

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // register custom types (e.g. case classes) with Kryo
    conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
    
    

    If the classes that need serialization are not registered, Kryo can still work as usual, but it will store the full class name of each object, which often wastes more space than the default Java serialization

2. Considerations

  1. Use arrays of objects and primitive types instead of Java/Scala collection classes (e.g., HashMap)
  2. Avoid nested structures
  3. Use numbers as keys where possible, not strings
  4. Use the MEMORY_ONLY_SER storage level for larger RDDs
  5. When loading CSV or JSON, load only the required fields
  6. Persist intermediate results (RDD/DS/DF) only when they are needed again
  7. Avoid generating unnecessary intermediate results (RDD/DS/DF)
  8. DF execution is roughly three times faster than DS
  9. Customize RDD partitioning and spark.default.parallelism, which sets the default number of tasks per stage
  10. Broadcast large variables instead of using them directly (see the sketch after this list)
  11. Process data locally as far as possible and minimize data transfer across worker nodes
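
For point 10, a minimal sketch of broadcasting a large read-only variable (the lookup map here is illustrative):

// a large read-only lookup table, shipped to each executor once instead of with every task
val lookup = Map(1 -> "electronics", 2 -> "books", 3 -> "clothing")
val bcLookup = sc.broadcast(lookup)

val categories = sc.parallelize(Seq(1, 2, 3, 2, 1))
  .map(id => bcLookup.value.getOrElse(id, "unknown"))
categories.collect()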

3. Table Joins (join operations)

  • Include predicates for all of the joined tables
select * from t1 join t2 on t1.name = t2.full_name 
where t1.name = 'mike' and t2.full_name = 'mike'

  • Put the largest table first
  • Broadcast the smallest table (see the sketch below)
  • Minimize the number of tables joined
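
For broadcasting the smallest table, a sketch using the broadcast() hint (ordersDF and customersDF are hypothetical DataFrames):

import org.apache.spark.sql.functions.broadcast

// hint the optimizer to broadcast the small dimension table to every executor
val joined = ordersDF.join(broadcast(customersDF), ordersDF("customer_id") === customersDF("id"))
joined.show()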

Origin www.cnblogs.com/whoyoung/p/11424303.html