Sample data
1,张三,23
2,李四,24
3,王五,25
4,赵六,26
5,田七,25
Requirement:
Remove every row whose age is 25 or whose name is 张三 (a row is dropped if either condition matches).
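The requirement can be stated in two logically equivalent ways, which is worth seeing before reading the Spark code. The sketch below uses plain Scala collections (no Spark) on the same five rows. The names Person, CullSketch and shouldCull are made up for illustration and do not appear in the original code.

```scala
// Illustrative sketch only: Person, CullSketch and shouldCull are hypothetical names.
final case class Person(id: Int, name: String, age: Int)

object CullSketch {
  // The same five rows as the sample data.
  val people: Seq[Person] = Seq(
    Person(1, "张三", 23),
    Person(2, "李四", 24),
    Person(3, "王五", 25),
    Person(4, "赵六", 26),
    Person(5, "田七", 25)
  )

  // Rows to remove: age is 25 OR name is 张三.
  def shouldCull(p: Person): Boolean = p.age == 25 || p.name == "张三"

  // Form 1: select the rows to remove, then keep everything else.
  val keptByNegation: Seq[Person] = people.filterNot(shouldCull)

  // Form 2 (De Morgan's law): keep rows with age != 25 AND name != 张三.
  val keptByConjunction: Seq[Person] =
    people.filter(p => p.age != 25 && p.name != "张三")

  def main(args: Array[String]): Unit = {
    println(keptByNegation.map(_.id))    // prints List(2, 4)
    println(keptByConjunction.map(_.id)) // prints List(2, 4)
  }
}
```

Both forms keep ids 2 and 4. Form 1 corresponds to building a set of rows to exclude, and Form 2 corresponds to applying the two negated conditions directly.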
Code
object CullTest {
  def main(args: Array[String]): Unit = {
    import org.apache.spark.sql.SparkSession
    import org.apache.log4j.{Level, Logger}
    // Only show ERROR-level logs from Spark and related libraries
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession
      .builder
      .appName(this.getClass.getSimpleName)
      .master(master = "local[*]")
      .getOrCreate()
    import spark.implicits._
    import spark.sql
    // Parse each "id,name,age" line into a three-column DataFrame (all columns are strings)
    val df = spark.read.textFile(path = "./data/t2")
      .map(_.split(","))
      .map(x => (x(0), x(1), x(2)))
      .toDF("id", "name", "age")
    // cache() is lazy: the file is read and cached by the first action below
    df.cache()
    df.createTempView(viewName = "view")
    // Remove users whose age is 25 or whose name is 张三
    // Method 1: put the rows to remove into view1, then exclude them with NOT IN
    val start_time1 = System.currentTimeMillis()
    sql(sqlText = "select * from view where age = 25 or name = '张三'")
      .createTempView(viewName = "view1")
    sql(sqlText = "select * from view where id not in (select id from view1)").show()
    val end_time1 = System.currentTimeMillis()
    println("方法一耗时:" + (end_time1 - start_time1)) // "Method 1 elapsed time" in ms
    // Method 2: chain where() filters, one condition at a time
    val start_time2 = System.currentTimeMillis()
    sql(sqlText = "select * from view")
      .where("age != 25")
      .where("name != '张三'")
      .show()
    val end_time2 = System.currentTimeMillis()
    println("方法二耗时:" + (end_time2 - start_time2)) // "Method 2 elapsed time" in ms
    spark.stop()
  }
}
Result
+---+----+---+
| id|name|age|
+---+----+---+
| 2| 李四| 24|
| 4| 赵六| 26|
+---+----+---+
方法一耗时:856
+---+----+---+
| id|name|age|
+---+----+---+
| 2| 李四| 24|
| 4| 赵六| 26|
+---+----+---+
方法二耗时:216
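Explanation
Method 1 (方法一, the NOT IN subquery; 856 ms in this run) is not advisable, for the reasons given earlier;
Method 2 (方法二; 216 ms in this run) removes rows step by step with chained where conditions and was clearly faster here. Keep in mind that df.cache() is lazy, so Method 1, which runs first, also pays for reading the file and filling the cache; part of the gap comes from that one-off cost.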
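As a variation on Method 2, the two conditions can be combined into a single filter with the DataFrame Column API, and the cache can be materialized before timing starts so that the measurement excludes the one-off read. This is only a sketch under assumptions: the object name CullSingleFilter and the helper keep are hypothetical, the rows are built in memory rather than read from ./data/t2, and age is typed as Int instead of String. It does not show whether this form is faster on real data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Illustrative sketch only: CullSingleFilter and keep are hypothetical names.
object CullSingleFilter {
  // Keep rows whose age is not 25 AND whose name is not 张三.
  def keep(df: DataFrame): DataFrame =
    df.filter(col("age") =!= 25 && col("name") =!= "张三")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("CullSingleFilter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumption: same five rows as the sample data, built in memory with Int ids and ages.
    val df = Seq(
      (1, "张三", 23),
      (2, "李四", 24),
      (3, "王五", 25),
      (4, "赵六", 26),
      (5, "田七", 25)
    ).toDF("id", "name", "age")

    df.cache()
    df.count() // an action, so the cache is filled before the timer starts

    val start = System.currentTimeMillis()
    keep(df).show()
    println("elapsed ms: " + (System.currentTimeMillis() - start))

    spark.stop()
  }
}
```

Running an action such as count() right after cache() is a common way to separate load cost from query cost when comparing two approaches in the same program.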