Various ways to create RDDs and DataFrames in Scala

Creating RDDs

1. Creating an RDD from an in-memory collection

sc.parallelize(xxx)

For example:

val testrdd = sc.parallelize(Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6), (3, Array("3.0"), 7), (1, Array("3.0"), 7)))
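As a quick aside, parallelize also accepts an optional second argument that sets the number of partitions, and simple actions can verify the result. A minimal sketch:

val rdd2 = sc.parallelize(Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6)), 2) // 2 partitions
println(rdd2.getNumPartitions) // 2
println(rdd2.count())          // 2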

2. Creating an RDD from a file

Reading a text file:

import org.apache.spark.sql.Row

val citylevel = sc.textFile(HDFS_PATH)
  .map(_.split(","))
  .map(attributes => Row(attributes(0).trim, attributes(1).trim))
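In practice, raw text files often contain short or malformed rows that would make attributes(1) throw. A hedged sketch (assuming a valid row has at least two comma-separated fields) that filters them out first:

val cleaned = sc.textFile(HDFS_PATH)
  .map(_.split(","))
  .filter(_.length >= 2)                                          // drop rows missing fields
  .map(attributes => Row(attributes(0).trim, attributes(1).trim))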

Creating DataFrames

1. Creating a DataFrame from a collection

import spark.implicits._

val test_df = Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6), (3, Array("3.0"), 7), (1, Array("3.0"), 7)).toDF("imei", "feature", "id")
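To confirm what toDF produced, the standard printSchema and show calls work here; the commented output below is what Spark should infer for this tuple of (Int, Array[String], Int):

test_df.printSchema()
// root
//  |-- imei: integer (nullable = false)
//  |-- feature: array (nullable = true)
//  |    |-- element: string (containsNull = true)
//  |-- id: integer (nullable = false)
test_df.show()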

2. Creating a DataFrame from an RDD

rdd.toDF(xxx)

For example:

import spark.implicits._

val testrdd = sc.parallelize(Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6), (3, Array("3.0"), 7), (1, Array("3.0"), 7)))

val testDF = testrdd.toDF("id", "score", "imei")
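An alternative worth knowing: if the RDD holds case class instances instead of tuples, toDF() can be called with no arguments and the column names come from the case class field names. A minimal sketch (the Record class is made up for illustration; it assumes the import spark.implicits._ from the example above):

case class Record(imei: Int, feature: Array[String], id: Int)

val recordRdd = sc.parallelize(Seq(Record(1, Array("1.0"), 3), Record(2, Array("2.0"), 6)))
val recordDF = recordRdd.toDF() // columns: imei, feature, id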

3. Creating a DataFrame from a file

(1) Reading a Parquet file:

val parquetFileDF = spark.read.parquet(HDFS_PATH)

(2) Text file: first create an RDD from the file, then convert the RDD to a DataFrame. Note that toDF works on RDDs of tuples or case classes, not RDDs of Row, so the split fields are kept as a tuple here:

import spark.implicits._

val citylevel = sc.textFile(HDFS_PATH)
  .map(_.split(","))
  .map(attributes => (attributes(0).trim, attributes(1).trim))

val cityDF = citylevel.toDF("cityid", "citylevel")
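If you prefer to keep the RDD as Row objects (as in the RDD section above), toDF cannot be used directly; instead, pass an explicit schema to spark.createDataFrame. A minimal sketch, assuming both columns are strings:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rowRDD = sc.textFile(HDFS_PATH)
  .map(_.split(","))
  .map(attributes => Row(attributes(0).trim, attributes(1).trim))

val schema = StructType(Seq(
  StructField("cityid", StringType, nullable = true),
  StructField("citylevel", StringType, nullable = true)
))

val cityDF2 = spark.createDataFrame(rowRDD, schema)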


Reposted from blog.csdn.net/lipku/article/details/103537083