Handling data skew issues with Spark DataFrames

Because of crawlers and similar traffic, a single ID can accumulate far too many log entries. When Spark shuffles, all logs with the same ID end up on one node, so that task becomes a hot spot and the whole job runs slowly.
Since the traffic from these users is invalid anyway, they can simply be filtered out up front.
Without further ado, here are the output and the Scala DataFrame code (see the code comments for reference links):
spark version: 1.6.1
Original DataFrame (with fake users):
+---------+------+
|       id| movie|
+---------+------+
|       u1|WhoAmI|
|       u2|Zoppia|
|       u2|  Lost|
|FakeUserA|Zoppia|
|FakeUserA|  Lost|
|FakeUserA|Zoppia|
|FakeUserA|  Lost|
|FakeUserA|Zoppia|
|FakeUserA|  Lost|
|FakeUserB|  Lost|
|FakeUserB|  Lost|
|FakeUserB|  Lost|
|FakeUserB|  Lost|
+---------+------+

Fake Users with count (threshold=2):
+---------+-----+
|       id|count|
+---------+-----+
|FakeUserA|    6|
|FakeUserB|    4|
+---------+-----+

Fake Users:
Set(FakeUserA, FakeUserB)

Valid users after filter:
+---+------+
| id| movie|
+---+------+
| u1|WhoAmI|
| u2|Zoppia|
| u2|  Lost|
+---+------+




import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

/**
  * Created by colinliang on 2017/8/14.
  */
case class IDMovie(id: String, movie: String)
object BroadcastTest {
  def main(args: Array[String]): Unit = {
    Logger.getRootLogger().setLevel(Level.FATAL) //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
    val conf = new SparkConf().setAppName("word count").setMaster("local[1]")
    val sc = new SparkContext(conf)
    println("spark version: " + sc.version)
    sc.setLogLevel("WARN") //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
    val spark = new SQLContext(sc)



    val idvids = List(
      IDMovie("u1", "WhoAmI")
      , IDMovie("u2", "Zoppia")
      , IDMovie("u2", "Lost")
      , IDMovie("FakeUserA", "Zoppia")
      , IDMovie("FakeUserA", "Lost")
      , IDMovie("FakeUserA", "Zoppia")
      , IDMovie("FakeUserA", "Lost")
      , IDMovie("FakeUserA", "Zoppia")
      , IDMovie("FakeUserA", "Lost")
      , IDMovie("FakeUserB", "Lost")
      , IDMovie("FakeUserB", "Lost")
      , IDMovie("FakeUserB", "Lost")
      , IDMovie("FakeUserB", "Lost")
      );


    val df = spark
      .createDataFrame(idvids)
      .repartition(col("id"))

    println("Original DataFrame (with fake users): ")
    df.show()

// val df_fakeUsers_with_count = df.sample(false, 0.1).groupBy(col("id")).count().filter(col("count") > 2).limit(10000) // on large data it is enough to detect skewed IDs from a sample (scale the threshold to the sample fraction)
    val df_fakeUsers_with_count = df.groupBy(col("id")).count().filter(col("count") > 2)
    /** groupBy on a DataFrame performs partial (map-side) aggregation before the shuffle, so only one row per id per partition is exchanged and the count is cheap. See: https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html
      For more aggregation functions, see: https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.sql.functions$
      Multiple columns can also be aggregated after groupBy via the agg() function.
      */
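    /** Added illustration (not in the original post): agg() computes several aggregates per id
      * in a single pass, e.g. the row count and the number of distinct movies:
      * df.groupBy(col("id")).agg(count("*").as("cnt"), countDistinct("movie").as("distinct_movies"))
      */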
    println("Fake Users with count (threshold=2):")
    df_fakeUsers_with_count.show()


    val set_fakeUsers = df_fakeUsers_with_count.select("id").collect().map(_.getString(0)).toSet
    println("Fake Users:")
    println(set_fakeUsers)
    val set_fakeUsers_broadcast=sc.broadcast(set_fakeUsers)
    /** broadcast tutorial: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-broadcast.html
      * Official documentation: http://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
      */

    val udf_isValidUser = udf((id: String) => !set_fakeUsers_broadcast.value.contains(id)) // capturing set_fakeUsers directly in the UDF also works, but is less efficient because the set may be deserialized more often; see http://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
    val df_filtered = df.filter(udf_isValidUser(col("id"))) // drop the fake users
    /** If instead you want to keep a small set of users rather than filter them out, no UDF is needed:
      * https://stackoverflow.com/questions/39234360/filter-spark-scala-dataframe-if-column-is-present-in-set
      * val validValues = Set("A", "B", "C")
      * data.filter($"myColumn".isin(validValues.toSeq: _*))
      */
    /** If the set of users to keep is large, join against a broadcast DataFrame instead (a sketch follows the full listing):
      * https://stackoverflow.com/questions/33824933/spark-dataframe-filtering-retain-element-belonging-to-a-list
      * import org.apache.spark.sql.functions.broadcast
      * initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")
      */
    println("\nValid users after filter:")
    df_filtered.show()
  }
}
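
For the large-whitelist case mentioned in the last comment, a broadcast join keeps the wanted users without a UDF or a driver-side collect. Below is a minimal sketch, not part of the original program: it assumes the spark (SQLContext) and df values from the listing above, and the whitelist column name id_keep is made up for illustration.

    import org.apache.spark.sql.functions.broadcast

    // Hypothetical whitelist DataFrame with a single column "id_keep";
    // using a different column name avoids an ambiguous "id" after the join.
    val usersToKeep = spark.createDataFrame(Seq(Tuple1("u1"), Tuple1("u2"))).toDF("id_keep")

    // broadcast() hints Spark to ship the small DataFrame to every executor,
    // so the large df is filtered by the join without being shuffled.
    val df_kept = df
      .join(broadcast(usersToKeep), df("id") === usersToKeep("id_keep"))
      .drop("id_keep")
    df_kept.show()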
