Introduction: When the element type of an RDD is a case class, Spark can obtain the attribute names and types through reflection, construct a schema from them, apply it to the RDD, and convert the RDD into a DataFrame.
/**
 * @author liu a fu
 * @date 2021/1/17
 * @version 1.0
 * @DESC Reflection-based schema inference
 */
/**
 * Encapsulates a movie rating record
 *
 * @param userId    user ID
 * @param itemId    movie ID
 * @param rating    the user's rating of the movie
 * @param timestamp rating timestamp
 */
case class MoviesRating(userId: Int,
                        itemId: Int,
                        rating: Double,
                        timestamp: Long)
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object _06RatDataToDF {
  def main(args: Array[String]): Unit = {
    // 1. Prepare the environment
    val conf: SparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
      .setMaster("local[*]")
    val spark: SparkSession = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    // Import implicit conversions for RDD -> DataFrame
    import spark.implicits._
    // Read the source data
    val fileRDD: RDD[String] = spark.sparkContext.textFile("data/input/sql/ml-100k/u.data")
    // Keep only well-formed lines with exactly 4 tab-separated fields
    val ratingDF: DataFrame = fileRDD
      .filter(line => line != null && line.trim.split("\t").length == 4)
      .mapPartitions(iter => {
        iter.map(line => {
          val arr: Array[String] = line.split("\t")
          MoviesRating(arr(0).toInt, arr(1).toInt, arr(2).toDouble, arr(3).toLong)
        })
      }).toDF()
    ratingDF.show()
    ratingDF.printSchema()

    spark.stop()
  }
}
- This method requires that the RDD's element type be a case class; the column names of the resulting DataFrame are the attribute names of the case class.
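The point above can be verified with a minimal, self-contained sketch (the `Rating` case class and the in-memory sample data here are illustrative, not part of the original example): `toDF()` with no arguments names each column after the corresponding case-class field.

```scala
import org.apache.spark.sql.SparkSession

object ReflectionSchemaSketch {
  // Illustrative case class; column names will be taken from these fields
  case class Rating(userId: Int, rating: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ReflectionSchemaSketch")
      .getOrCreate()
    import spark.implicits._

    // Build a DataFrame from an RDD of case-class instances via reflection
    val df = spark.sparkContext
      .parallelize(Seq(Rating(1, 4.0), Rating(2, 3.5)))
      .toDF()

    // Columns are named exactly after the case-class attributes
    assert(df.columns.sameElements(Array("userId", "rating")))
    df.printSchema()

    spark.stop()
  }
}
```

If different column names are needed, `toDF("user", "score")` can override the reflected names without changing the case class.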