Spark Code Readability and Performance Optimization: Example 8 (One Business Requirement, Several Solutions)
1. Preface
At the end of Example 7, I raised a requirement: "for every field of a table, count both the total number of values and the number of distinct values, counting only non-empty values." If you have read Example 7, you probably already know how to solve it.
The purpose of this article is to:
Describe the business requirement in more detail, to avoid misunderstanding
Analyze the problems the requirement runs into
Show several example solutions
2. The requirement
There is an existing table, tb_express. Sample rows:
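name     | address              | trade       | fix   | express
---------+----------------------+-------------+-------+-----------
Wang Wu  | Chengdu, Sichuan     | d72_network | ty002 | zto
null     | Yubei                | z03_locker  | bk213 | sf-express
Li Lei   | Changsha, Hunan      | null        | null  | sf-express
……       | ……                   | ……          | ……    | ……
John Doe | Guangzhou, Guangdong | t92_locker  | tu87  | table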
Notes on the table:
The full table has 50 fields; only 5 are shown here (for convenience, the rest of this article uses only these 5 as the example).
The full table contains 1 billion rows.
Because of issues in the data source, many fields contain null values. (According to the most recent statistics, the fix field has 950 million non-null values, and each of the other fields has between 200 and 300 million non-null values.)
Business requirement:
For every field, count the total number of values and the number of distinct values, considering only non-empty values.
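To make the expected output concrete, here is a minimal plain-Scala sketch that applies the requirement to the three complete sample rows above. It is not Spark code: RequirementSketch and stats are hypothetical names, and None stands in for a null cell.

```scala
object RequirementSketch {
  // Field names from the sample table above.
  val fields: Seq[String] = Seq("name", "address", "trade", "fix", "express")

  // The three complete sample rows; None stands in for a null cell (assumption).
  val rows: Seq[Seq[Option[String]]] = Seq(
    Seq(Some("Wang Wu"), Some("Chengdu, Sichuan"), Some("d72_network"), Some("ty002"), Some("zto")),
    Seq(None, Some("Yubei"), Some("z03_locker"), Some("bk213"), Some("sf-express")),
    Seq(Some("Li Lei"), Some("Changsha, Hunan"), None, None, Some("sf-express"))
  )

  /** For each field: (number of non-empty values, number of distinct non-empty values). */
  def stats(rows: Seq[Seq[Option[String]]]): Map[String, (Int, Int)] =
    fields.zipWithIndex.map { case (field, i) =>
      // Keep only the present, non-empty values of column i.
      val values = rows.flatMap(row => row(i)).filter(_.nonEmpty)
      field -> ((values.size, values.distinct.size))
    }.toMap

  def main(args: Array[String]): Unit =
    fields.foreach { f =>
      val (total, distinct) = stats(rows)(f)
      println(s"$f: total = $total, distinct = $distinct")
    }
}
```

For the express column, for example, the three rows hold zto, sf-express and sf-express, so the total is 3 and the distinct count is 2.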
3. Analysis
3.1 Problem one (slow SQL)
The business requirement itself is very simple, and the first idea is probably to handle it with SQL, for example:
SELECT count(name), count(DISTINCT name) FROM tb_express
WHERE name IS NOT NULL AND name != '';
However, the nulls of each field fall on different rows, so a statement filtered like this cannot cover all fields at once; it has to be run once per field, 50 times in total. The drawbacks are obvious: it consumes a lot of cluster resources and takes a long time.
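As an illustration of what "once per field" means, the following sketch generates the statement above for each field. PerFieldSqlSketch and statsQuery are hypothetical names, and only the 5 sample fields are listed; each generated statement would be submitted as a separate job.

```scala
object PerFieldSqlSketch {
  // Only the 5 sample fields; the real table has 50.
  val fields: Seq[String] = Seq("name", "address", "trade", "fix", "express")

  /** Builds the per-field statement shown in the text for one field. */
  def statsQuery(field: String): String =
    s"""SELECT count($field), count(DISTINCT $field)
       |FROM tb_express
       |WHERE $field IS NOT NULL AND $field != ''""".stripMargin

  def main(args: Array[String]): Unit = {
    // One full scan of tb_express per statement: 5 here, 50 on the real table.
    fields.map(statsQuery).foreach(println)
  }
}
```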
3.2 Problem two (data skew)
So you might try to solve the problem in code instead. A typical first attempt looks like this:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

/**
  * Description: counts the total number of field values and the number of
  * distinct field values (flawed example)
  * <br/>
  * Date: 2019/12/2 16:57
  *
  * @author ALion
  */
object CountDemo {

  val expressSchema: StructType = StructType(Array(
    StructField("name", StringType),
    StructField("address", StringType),
    StructField("trade", StringType),
    StructField("fix", StringType),
    StructField("express", StringType)
  ))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CountDemo")
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    // DataFrameReader.table does not accept a user-specified schema,
    // so read the table as-is and select the fields of interest.
    val expressDF = spark.table("tb_express")
      .select(expressSchema.fieldNames.map(col): _*)

    val resultRDD = expressDF.rdd
      // Longer variant (needs import scala.collection.mutable.ArrayBuffer):
      // .flatMap { row =>
      //   val name = row.get(row.fieldIndex("name"))
      //   val address = row.get(row.fieldIndex("address"))
      //   val trade = row.get(row.fieldIndex("trade"))
      //   val fix = row.get(row.fieldIndex("fix"))
      //   val express = row.get(row.fieldIndex("express"))
      //
      //   val buffer = ArrayBuffer[(String, String)]()
      //   // Drop null values
      //   // The field name becomes the key, the field value becomes the value
      //   if (name != null) buffer.append(("name", name.toString))
      //   if (address != null) buffer.append(("address", address.toString))
      //   if (trade != null) buffer.append(("trade", trade.toString))
      //   if (fix != null) buffer.append(("fix", fix.toString))
      //   if (express != null) buffer.append(("express", express.toString))
      //
      //   buffer
      // }
      .flatMap { row =>
        // A more functional and shorter style
        Array("name", "address", "trade", "fix", "express")
          .flatMap { name =>
            Option(row.get(row.fieldIndex(name))) match {
              case Some(v) => Some((name, v.toString))
              case None => None
            }
          }
      }.groupByKey()
      // Computes (total number of field values, number of distinct field values)
      .mapValues { iter => (iter.size, iter.toSet.size) }

    // Collect the data and print the results
    resultRDD.collect()
      .foreach { case (fieldName, (count, distinctCount)) =>
        println(s"field = $fieldName, total values = $count, distinct values = $distinctCount")
      }

    spark.stop()
  }

}
When writing code, this is usually the first idea: group by field name, then compute the statistics over each field's values.
The problem is that groupByKey in this code causes data skew. As mentioned earlier, "the fix field has 950 million non-null values, and each of the other fields has between 200 and 300 million non-null values." So if the field name is the key, groupByKey sends far more data to one shuffle node than to the others.
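To put numbers on the skew, the figures quoted above can be plugged into a small calculation. This is only a back-of-the-envelope sketch; SkewEstimateSketch and ratio are names made up for illustration.

```scala
object SkewEstimateSketch {
  // Figures quoted in the table notes above.
  val fixNonNull: Long = 950000000L // non-null values in the fix field
  val otherMin: Long = 200000000L   // lower bound for every other field
  val otherMax: Long = 300000000L   // upper bound for every other field
  val fieldCount: Int = 50          // fields in the full table

  /** How many times larger the fix group is than another group of the given size. */
  def ratio(other: Long): Double = fixNonNull.toDouble / other

  def main(args: Array[String]): Unit = {
    // With field names as keys, the shuffle has only as many keys as fields.
    println(s"distinct keys after groupByKey: $fieldCount")
    println(f"fix group vs. largest other group:  ${ratio(otherMax)}%.2fx")
    println(f"fix group vs. smallest other group: ${ratio(otherMin)}%.2fx")
  }
}
```

With field names as keys there are only 50 keys in the whole shuffle, and the fix group is roughly three to five times larger than any other group, so the task that receives it finishes long after the rest.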
3.3 Problem three (skew within the skew)
At this point you will probably think: "What if the key holds the field value in addition to the field name, and the value only carries a placeholder? Then the data won't be skewed, right?" In other words, each record is keyed by (field name, field value) and then grouped with groupByKey.
However, one fact was held back earlier: "the fix field has 950 million non-null values, 80 percent of them are the same value, and the business requires that this value must not be excluded." So once the job runs, the data is still skewed and the job stalls at one point. (By now you have probably thought of solving this with reduceByKey.)
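The effect of the dominant value, and why a map-side combine helps, can be modelled on local collections. This is a sketch under stated assumptions rather than Spark code: two Seqs play the role of partitions, ty002 is arbitrarily chosen as the dominant fix value (8 of 10 records), and HotKeySketch and its methods are hypothetical names.

```scala
object HotKeySketch {
  type Key = (String, String) // (field name, field value)

  // Two toy "partitions" of fix values; 8 of the 10 records share one value.
  val partitions: Seq[Seq[Key]] = Seq(
    Seq(("fix", "ty002"), ("fix", "ty002"), ("fix", "ty002"), ("fix", "ty002"), ("fix", "bk213")),
    Seq(("fix", "ty002"), ("fix", "ty002"), ("fix", "ty002"), ("fix", "ty002"), ("fix", "tu87"))
  )

  /** groupByKey-style: every record is shuffled, so a key receives one record per occurrence. */
  def shuffledWithoutCombine(parts: Seq[Seq[Key]]): Map[Key, Int] =
    parts.flatten.groupBy(identity).map { case (k, vs) => k -> vs.size }

  /** reduceByKey-style: each partition pre-aggregates, so a key receives at most one record per partition. */
  def shuffledWithCombine(parts: Seq[Seq[Key]]): Map[Key, Int] =
    parts
      .flatMap(p => p.distinct) // one pre-aggregated record per key and partition
      .groupBy(identity)
      .map { case (k, vs) => k -> vs.size }

  def main(args: Array[String]): Unit = {
    val hot: Key = ("fix", "ty002")
    println(s"records shuffled for $hot without combine: ${shuffledWithoutCombine(partitions)(hot)}")
    println(s"records shuffled for $hot with combine:    ${shuffledWithCombine(partitions)(hot)}")
  }
}
```

Without the combine step the hot key receives every one of its records; with it, the number of records it receives is bounded by the number of partitions, regardless of how often the value occurs.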
In addition, you can represent the field name (String) by a number (Int) to reduce memory consumption. To keep the following examples easy to read, however, they continue to use field names.
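A minimal sketch of that Int encoding, assuming the position of a field in a fixed array serves as its number; FieldIndexSketch, keyed and fieldName are hypothetical names, and None stands in for a null cell.

```scala
object FieldIndexSketch {
  // The position in this array is the Int that replaces the field name in the key.
  val fields: Array[String] = Array("name", "address", "trade", "fix", "express")

  /** Emits (field index, value) pairs for the non-null cells of one row. */
  def keyed(row: Seq[Option[String]]): Seq[(Int, String)] =
    row.zipWithIndex.collect { case (Some(v), i) => (i, v) }

  /** Translates the Int key back to the field name when reporting results. */
  def fieldName(index: Int): String = fields(index)
}
```

The lookup back to the name is only needed once per field, when the final results are printed.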
4. Example solutions
From the analysis above, the root problem is clearly data skew, and we need a way around it. The usual solution is reduceByKey, but other approaches are possible as well; a few examples follow for reference.
4.1 Add a random number to the key to relieve the data skew
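// Needs: import scala.util.Random
val resultRDD = expressDF.rdd
  .flatMap { row =>
    val random = new Random()
    // A more functional and shorter style
    Array("name", "address", "trade", "fix", "express")
      .flatMap { name =>
        Option(row.get(row.fieldIndex(name))) match {
          // A random range of 100 splits each field's data into 100 parts.
          // Choose the range according to the number of nodes currently running in the cluster for better results.
          case Some(v) => Some((name + "_" + random.nextInt(100), v.toString))
          case None => None
        }
      }
  }.groupByKey()
  .map { case (k, v) =>
    // Finish this round of aggregation and strip the random suffix.
    // lastIndexOf keeps field names that themselves contain "_" intact.
    (k.substring(0, k.lastIndexOf('_')), (v.size, v.toSet))
  }.groupByKey() // You could also use reduceByKey here, but it makes almost no difference in efficiency at this point
  .mapValues { iter =>
    // (total number of field values, number of distinct field values)
    (iter.map(_._1).sum, iter.map(_._2).reduce(_ ++ _).size)
  }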
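The two-stage idea can be checked on local collections. The sketch below is a variation rather than the listing above: it keeps the salt as a separate tuple element instead of concatenating it into a String, so nothing has to be parsed back out of the key. SaltingSketch, salted and stats are hypothetical names, and the seed parameter exists only to make the run repeatable.

```scala
import scala.util.Random

object SaltingSketch {
  /** Stage 1 key: the field name plus a random salt in [0, buckets). */
  def salted(name: String, buckets: Int, rnd: Random): (String, Int) =
    (name, rnd.nextInt(buckets))

  /** Two-stage (total, distinct) per field over (field name, value) records. */
  def stats(records: Seq[(String, String)], buckets: Int, seed: Long): Map[String, (Int, Int)] = {
    val rnd = new Random(seed)
    // Stage 1: aggregate per (name, salt); each salted group is one slice of a field.
    val partial: Seq[(String, (Int, Set[String]))] =
      records
        .map { case (name, v) => (salted(name, buckets, rnd), v) }
        .groupBy(_._1)
        .toSeq
        .map { case ((name, _), group) =>
          val values = group.map(_._2)
          (name, (values.size, values.toSet)) // the salt is dropped here
        }
    // Stage 2: merge the partial results per field name.
    partial.groupBy(_._1).map { case (name, parts) =>
      val total = parts.map(_._2._1).sum
      val distinct = parts.map(_._2._2).reduce(_ ++ _).size
      name -> ((total, distinct))
    }
  }
}
```

The result does not depend on the seed or on the number of buckets: the salt only decides which slice a record lands in during stage 1, and stage 2 adds the slice totals and unions the slice sets.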
4.2 Use reduceByKey, change the key's data structure, and adapt the subsequent processing
val resultRDD = expressDF.rdd
  .flatMap { row =>
    // A more functional and shorter style
    Array("name", "address", "trade", "fix", "express")
      .flatMap { name =>
        Option(row.get(row.fieldIndex(name))) match {
          case Some(v) => Some(((name, v.toString), 1))
          case None => None
        }
      }
  }.reduceByKey(_ + _) // Replacing the groupByKey of the last example in the problem analysis with reduceByKey is enough to solve it
  .map { case ((name, value), count) => (name, (value, count)) }
  // The second groupByKey still uses the field name as the key, but the data volume is now small, so it finishes quickly.
  // Of course, you can also use reduceByKey here.
  .groupByKey()
  .map { case (name, iter) =>
    // (field name, total number of field values, number of distinct field values)
    (name, iter.map(_._2).sum, iter.size)
  }
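The comment above notes that the second stage could use reduceByKey as well. A sketch of that variant on local collections, assuming the input is the output of the first reduceByKey (one record per distinct value); SecondStageSketch and stats are hypothetical names.

```scala
object SecondStageSketch {
  /** Input: one ((field name, value), count) record per distinct value. */
  def stats(counted: Seq[((String, String), Int)]): Map[String, (Int, Int)] =
    counted
      // Each distinct value contributes its count to the total and 1 to the distinct count.
      .map { case ((name, _), count) => (name, (count, 1)) }
      .groupBy(_._1)
      .map { case (name, pairs) =>
        // Pairwise merge; on an RDD the same function would be passed to reduceByKey.
        name -> pairs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      }
}
```

Because the merge function only adds two pairs of numbers, the value itself no longer needs to travel through the second shuffle.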
4.3 Keep the key structure unchanged and write an aggregator
The aggregator class CountAggregator:
import scala.collection.mutable

class CountAggregator(var count: Int, var countSet: mutable.HashSet[String]) {

  // Adds one element: (increment for the total, value for the distinct set)
  def +=(element: (Int, String)): CountAggregator = {
    this.count += element._1
    this.countSet += element._2
    this
  }

  // Merges another aggregator into this one
  def ++=(that: CountAggregator): CountAggregator = {
    this.count += that.count
    this.countSet ++= that.countSet
    this
  }

}

object CountAggregator {

  def apply(): CountAggregator =
    new CountAggregator(0, mutable.HashSet[String]())

}
This approach uses treeAggregate to compute in parallel: a reduce runs on each node, and the partial results are finally merged in a reduce on a single node.
Unlike the previous examples, it does not generate one value per field name for every row to serve as the aggregation key (which produces almost rows × fields records), so it uses less memory.
Drawbacks
The final reduced result must be small, for example a few scalar values (a maximum, a count) or a small collection (the top 10 after sorting).
In this example each field ends up aggregated into a deduplicating Set; if the Set holds a large amount of data, its memory footprint grows and may bring down the driver.
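How the aggregator could be driven is sketched below on local collections. The assumptions: one CountAggregator per field kept in an array, None standing in for a null cell, and a local fold and reduce standing in for Spark's rdd.treeAggregate(zeroValue)(seqOp, combOp), which takes functions of the same shape. TreeAggregateSketch and its methods are hypothetical names, and the aggregator class is repeated only so the sketch compiles on its own.

```scala
import scala.collection.mutable

// Same shape as the CountAggregator above, repeated so the sketch is self-contained.
class CountAggregator(var count: Int, val countSet: mutable.HashSet[String]) {
  def +=(element: (Int, String)): CountAggregator = {
    count += element._1
    countSet += element._2
    this
  }
  def ++=(that: CountAggregator): CountAggregator = {
    count += that.count
    countSet ++= that.countSet
    this
  }
}

object TreeAggregateSketch {
  type Row = Seq[Option[String]]

  /** Zero value: one empty aggregator per field. */
  def zero(fieldCount: Int): Array[CountAggregator] =
    Array.fill(fieldCount)(new CountAggregator(0, mutable.HashSet[String]()))

  /** seqOp: folds one row into the per-field aggregators of its partition. */
  def seqOp(aggs: Array[CountAggregator], row: Row): Array[CountAggregator] = {
    row.zipWithIndex.foreach {
      case (Some(v), i) => aggs(i) += ((1, v))
      case _            => // null cell: nothing to count
    }
    aggs
  }

  /** combOp: merges the aggregators of two partitions field by field. */
  def combOp(a: Array[CountAggregator], b: Array[CountAggregator]): Array[CountAggregator] = {
    a.indices.foreach(i => a(i) ++= b(i))
    a
  }

  /** Local stand-in for rdd.treeAggregate(zero)(seqOp, combOp); expects at least one partition. */
  def aggregate(partitions: Seq[Seq[Row]], fieldCount: Int): Array[(Int, Int)] =
    partitions
      .map(_.foldLeft(zero(fieldCount))(seqOp)) // per-partition reduce
      .reduce(combOp)                           // merge of the partial results
      .map(agg => (agg.count, agg.countSet.size))
}
```

No (field name, value) record is created per cell here; each row only updates the aggregators in place. The merged Sets are exactly what ends up in the final result, which is where the memory drawback described above comes from.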