Excluding Rows by Multiple Conditions

Sample data

1,张三,23
2,李四,24
3,王五,25
4,赵六,26
5,田七,25

Requirement:
Exclude every row whose age is 25 or whose name is 张三.
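
The requirement can be read as a per-row predicate: a row is dropped if it matches either condition. The following is a minimal sketch on plain Scala collections rather than Spark; the `Person` case class and the `RequirementSketch` object are hypothetical names introduced only for illustration.

```scala
// Hypothetical row type for illustration; the article itself works on a Spark DataFrame.
final case class Person(id: Int, name: String, age: Int)

object RequirementSketch {
  // The five sample rows from the article.
  val people: Seq[Person] = Seq(
    Person(1, "张三", 23),
    Person(2, "李四", 24),
    Person(3, "王五", 25),
    Person(4, "赵六", 26),
    Person(5, "田七", 25)
  )

  // A row is excluded when it matches either condition.
  def excluded(p: Person): Boolean = p.age == 25 || p.name == "张三"

  // Keep every row that is not excluded.
  val kept: Seq[Person] = people.filterNot(excluded)

  def main(args: Array[String]): Unit =
    kept.foreach(println)
}
```

With the sample data, only the rows with id 2 and id 4 survive, because row 1 matches on the name and rows 3 and 5 match on the age.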

Code

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object CullTest {
  def main(args: Array[String]): Unit = {
    // Silence Spark's own logging so that only the results are printed.
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession
      .builder
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    import spark.sql

    // Parse each "id,name,age" line into a DataFrame with three string columns.
    val df = spark.read.textFile("./data/t2")
      .map(_.split(","))
      .map(x => (x(0), x(1), x(2)))
      .toDF("id", "name", "age")
    // cache() is lazy: the first action below (Method 1's show()) also pays
    // for reading and caching the file.
    df.cache()
    df.createTempView("view")

    // Method 1: collect the rows to exclude into view1, then filter with NOT IN.
    val start_time1 = System.currentTimeMillis()
    sql("select * from view where age = 25 or name = '张三'")
      .createTempView("view1")
    sql("select * from view where id not in (select id from view1)").show()
    val end_time1 = System.currentTimeMillis()
    println("Method 1 elapsed (ms): " + (end_time1 - start_time1))

    // Method 2: chain two where clauses, dropping age 25 first and then 张三.
    val start_time2 = System.currentTimeMillis()
    sql("select * from view")
      .where("age != 25")
      .where("name != '张三'")
      .show()
    val end_time2 = System.currentTimeMillis()
    println("Method 2 elapsed (ms): " + (end_time2 - start_time2))

    spark.stop()
  }
}

Result

+---+----+---+
| id|name|age|
+---+----+---+
|  2|  李四| 24|
|  4|  赵六| 26|
+---+----+---+

Method 1 elapsed (ms): 856
+---+----+---+
| id|name|age|
+---+----+---+
|  2|  李四| 24|
|  4|  赵六| 26|
+---+----+---+

Method 2 elapsed (ms): 216

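Explanation

Method 1 is not recommended, for the reasons discussed earlier.
Method 2 removes the unwanted rows step by step with chained filters. In this run it took 216 ms against 856 ms for Method 1. Because df.cache() is lazy and Method 1 runs first, Method 1's time also includes reading and caching the data, so the measured gap overstates the difference between the two queries.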
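
The step-by-step filtering of Method 2 can also be written as one filter with the DataFrame Column API. The following is a minimal sketch, not the article's code: it assumes the same three string columns, builds the sample rows in memory instead of reading ./data/t2, and uses the hypothetical names `CullSketch` and `cull`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CullSketch {
  // Keep rows that match neither exclusion condition.
  // Equivalent to chaining .where("age != 25").where("name != '张三'").
  def cull(df: DataFrame): DataFrame =
    df.filter(col("age") =!= "25" && col("name") =!= "张三")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("CullSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumption: an in-memory copy of the sample data stands in for ./data/t2.
    val df = Seq(
      ("1", "张三", "23"),
      ("2", "李四", "24"),
      ("3", "王五", "25"),
      ("4", "赵六", "26"),
      ("5", "田七", "25")
    ).toDF("id", "name", "age")

    val result = cull(df)
    result.show()
    // Print the physical plan to inspect how Spark executes the filter.
    result.explain()

    spark.stop()
  }
}
```

Calling `explain()` on this query and on the NOT IN query of Method 1 is one way to compare the two plans directly, instead of relying on wall-clock timings from a single run.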

Reposted from blog.csdn.net/Android_xue/article/details/103375977