Spark SQL: DataFrame

From the previous section we know that Spark SQL offers three APIs: the SQL API, the DataFrame API, and the Dataset API.
For testing you can use spark-shell or spark-sql; in production, development is done in IDEA, and once the code is finished it is packaged, uploaded, and submitted with spark-submit.

How do you construct DataFrames?

A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
In other words, a Dataset can be built from JVM objects and then manipulated with the various transformation operators.

DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
DataFrames can be built from structured data files, Hive tables, external databases, or existing RDDs. Structured files are very common in practice, for example data sitting on HDFS that you turn into DataFrames. Hive tables are also used heavily at work. External databases might hold, say, user behavior data coming in from outside, or configuration stored in MySQL. Building from RDDs is needed as well.
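As a quick illustration, here is a minimal sketch of all four construction paths, assuming a running SparkSession named spark; the file path, Hive table name, JDBC URL and MySQL table/credentials below are just placeholders:

//structured data file (JSON carries its own schema)
val fromFile = spark.read.json("file:///path/to/people.json")

//Hive table (placeholder table name; requires Hive support to be enabled)
val fromHive = spark.table("default.emp")

//external database over JDBC (URL, table and credentials are hypothetical)
val fromJdbc = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "user_config")
  .option("user", "root")
  .option("password", "root")
  .load()

//existing RDD (outside spark-shell you also need: import spark.implicits._)
import spark.implicits._
val fromRdd = spark.sparkContext.parallelize(Seq(("zhangsan", 15), ("lisi", 20))).toDF("name", "age")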

DataFrames with JSON files

A JSON file is a structured file; it carries a schema.
JSON can be turned into a DataFrame simply by loading it, but not every format can; plain text, for example, cannot.

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either a Dataset[String], or a JSON file.

Suppose we have a JSON file, xxx.json:

{"name":"zhangsan","age":15}
{"name":"lisi"}
{"name":"wangwu","age":26}
{"name":"zhaoliu","age":38}
{"name":"Jim","sex":"man"}
{"name":"Marry","age":23,"sex":"man"}

If you implemented this with MapReduce, you would read the lines in and parse each one inside the map method with a JSON library. But the number of key/value pairs is not the same on every line, for example the second line has only one, so handling this with MapReduce is cumbersome and the code has to keep changing.

Handling it with DataFrames is much more convenient.
Here is an example:

[hadoop@hadoop001 resources]$ ls
employees.json  full_user.avsc  kv1.txt  people.csv  people.json  people.txt  user.avsc  users.avro  users.orc  users.parquet
[hadoop@hadoop001 resources]$ pwd
/home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/examples/src/main/resources
[hadoop@hadoop001 resources]$ cat people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
[hadoop@hadoop001 resources]$ 
//As you can see, the file can be read and displayed directly; missing fields show up as null
scala> spark.read.json("file:///home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> val df = spark.read.json("file:///home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
//As you can see, the result of loading is already a DataFrame

scala> df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

//With a DataFrame you can call printSchema to print its schema, similar to desc
//Notice that the field types are inferred automatically
scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

As the above shows, with a DataFrame you do not have to care how many key/value pairs each line has, a missing field simply becomes null, and you do not have to declare the field data types either.
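If you would rather not rely on inference, spark.read also accepts an explicit schema (StructType is explained in more detail in the second half of this post). A minimal sketch, with the file path as a placeholder:

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

//define the schema ourselves instead of letting Spark infer it
val peopleSchema = StructType(Array(
  StructField("age", LongType, true),
  StructField("name", StringType, true)))

//read the same JSON file, but with the explicit schema
val dfWithSchema = spark.read.schema(peopleSchema).json("file:///path/to/people.json")
dfWithSchema.printSchema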
Continuing:

//Create a people view on top of the df above
scala> df.createOrReplaceTempView("people")

scala> val teenagerNamesDF = spark.sql("SELECT name FROM people")
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
teenagerNamesDF: org.apache.spark.sql.DataFrame = [name: string]

scala> val teenagerNamesDF = spark.sql("SELECT * FROM people")
teenagerNamesDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> teenagerNamesDF.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

//Re-create the people view
scala> df.createOrReplaceTempView("people")

//Now you can write SQL directly
scala> val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
19/07/01 23:14:40 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
teenagerNamesDF: org.apache.spark.sql.DataFrame = [name: string]

scala> teenagerNamesDF.show()
+------+
|  name|
+------+
|Justin|
+------+


//Here you can see an extra table named people
//isTemporary = false means a physical table that really exists, not a temporary one; true means a temporary table, which disappears once the session is closed
//An empty database column means the table does not belong to a database
scala> spark.sql("show tables").show
+--------+------------+-----------+
|database|   tableName|isTemporary|
+--------+------------+-----------+
| default|        dept|      false|
| default|         emp|      false|
| default|sparksqltest|      false|
|        |      people|       true|
+--------+------------+-----------+
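By the way, a view created with createOrReplaceTempView is only visible in the current session. If you want one that survives across sessions within the same application, Spark provides createGlobalTempView, which registers the view under the global_temp database; a quick sketch based on the same df (the view name here is made up):

//a global temporary view lives in the global_temp database
df.createGlobalTempView("people_global")
spark.sql("SELECT name FROM global_temp.people_global").show()

//still visible from a brand-new session of the same application
spark.newSession().sql("SELECT name FROM global_temp.people_global").show()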

//You can see that the df DataFrame has two columns
scala> df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

//To select just one column, do it like this (a must-know)
//(for two columns: df.select("name","age").show )
scala> df.select("name").show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+


//It is enough just to recognize this form
//Or like this:
scala> df.select($"name").show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
//In IDEA you need to add the implicit conversions: import spark.implicits._
//In spark-shell you can use this directly without adding it, because it is already imported
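For reference, here is a minimal sketch of what the same select looks like as a standalone program in IDEA; the object name, master setting and file path are just placeholders:

import org.apache.spark.sql.SparkSession

object DataFrameApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameApp")
      .master("local[2]")
      .getOrCreate()

    //needed for the $"column" syntax and other implicit conversions
    import spark.implicits._

    val df = spark.read.json("file:///path/to/people.json")
    df.select($"name").show()

    spark.stop()
  }
}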

//Filtering with a condition
scala> df.filter(df("age")>20).show
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

Food for thought: how does reading a JSON file turn it straight into a DF? What is the underlying mechanism?
See the next post.

Plain text, however, cannot be loaded directly with the same effect.
Let's see:

//This is a plain text file
[hadoop@hadoop001 resources]$ cat people.txt 
Michael, 29
Andy, 30
Justin, 19

//Use spark.read.text to read a plain text (TXT) file
//As we know, plain text has no schema
scala> val people =  spark.read.text("file:///home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.txt")
people: org.apache.spark.sql.DataFrame = [value: string]

//You can see it has just one column, value, with everything lumped inside it
scala> people.show
+-----------+
|      value|
+-----------+
|Michael, 29|
|   Andy, 30|
| Justin, 19|
+-----------+

//There is only one field, and it is naturally of type string
scala> people.printSchema
root
 |-- value: string (nullable = true)

So how can a text file, once read, be turned into something that carries a schema?
That requires a custom external data source, which is very important.
See the next section.

DataFrames with text files

See the official docs: http://spark.apache.org/docs/latest/sql-getting-started.html#interoperating-with-rdds

Now we have a text file:

//id, name, phone number, email, separated by |
//Note that rows 21 and 22 have no name, and row 23's name is the literal string NULL
[hadoop@hadoop001 data]$ cat student.data 
1|Burke|1-300-746-8446|[email protected]
2|Kamal|1-668-571-5046|[email protected]
3|Olga|1-956-311-1686|[email protected]
4|Belle|1-246-894-6340|[email protected]
5|Trevor|1-300-527-4967|[email protected]
6|Laurel|1-691-379-9921|[email protected]
7|Sara|1-608-140-1995|[email protected]
8|Kaseem|1-881-586-2689|[email protected]
9|Lev|1-916-367-5608|[email protected]
10|Maya|1-271-683-2698|[email protected]
11|Emi|1-467-270-1337|[email protected]
12|Caleb|1-683-212-0896|[email protected]
13|Florence|1-603-575-2444|[email protected]
14|Anika|1-856-828-7883|[email protected]
15|Tarik|1-398-171-2268|[email protected]
16|Amena|1-878-250-3129|[email protected]
17|Blossom|1-154-406-9596|[email protected]
18|Guy|1-869-521-3230|[email protected]
19|Malachi|1-608-637-2772|[email protected]
20|Edward|1-711-710-6552|[email protected]
21||1-711-710-6552|[email protected]
22||1-711-710-6552|[email protected]
23|NULL|1-711-710-6552|[email protected]

If you read it with spark.read.text(file) the way we read the JSON file in the previous section, you get only one column, value, with everything lumped inside, which is not useful.
So again, how do we turn the text file, once read, into something that carries a schema?
This involves converting between RDDs and DataFrames. (Make sure you master this.)

For an existing RDD, Spark SQL provides two methods for converting it into a Dataset.

Method 1: via reflection

The first method uses reflection to infer the schema inside the RDD.
When you already know the schema, that is, the structure of your data, this approach gives very concise code.

The Scala interface of Spark SQL supports converting an RDD containing case classes into a DataFrame. The case class defines the schema of the table, and its parameter names are read via reflection and become the column names.

Now let's walk through the first method:

scala> case class Student(id:String,name:String,phone:String,email:String)
defined class Student

//Note: in IDEA you need to add the implicit conversions: import spark.implicits._

//Note that map(_.split("\\|")) needs the two backslashes to escape the |, otherwise it will not split correctly
scala> val studentDF = spark.sparkContext.textFile("file:///home/hadoop/data/student.data").map(_.split("\\|")).map(attribute => Student(attribute(0),attribute(1),attribute(2),attribute(3))).toDF
studentDF: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]
//This reads the text file into an RDD, and the RDD is then turned into a DataFrame through the case class, via reflection

//Now you can operate on this DataFrame
//Note that there are 23 rows in total but only 20 are shown; also, values that are too long are truncated
scala> studentDF.show
+---+--------+--------------+--------------------+
| id|    name|         phone|               email|
+---+--------+--------------+--------------------+
|  1|   Burke|1-300-746-8446|ullamcorper.velit...|
|  2|   Kamal|1-668-571-5046|pede.Suspendisse@...|
|  3|    Olga|1-956-311-1686|Aenean.eget.metus...|
|  4|   Belle|1-246-894-6340|vitae.aliquet.nec...|
|  5|  Trevor|1-300-527-4967|dapibus.id@acturp...|
|  6|  Laurel|1-691-379-9921|adipiscing@consec...|
|  7|    Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  8|  Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|     Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|    Maya|1-271-683-2698|accumsan.convalli...|
| 11|     Emi|1-467-270-1337|        est@nunc.com|
| 12|   Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|   Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|   Tarik|1-398-171-2268|turpis@felisorci.com|
| 16|   Amena|1-878-250-3129|lorem.luctus.ut@s...|
| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...|
| 18|     Guy|1-869-521-3230|senectus.et.netus...|
| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 20|  Edward|1-711-710-6552|lectus@aliquetlib...|
+---+--------+--------------+--------------------+
only showing top 20 rows

//Two extra arguments fix the problems above: 30 means show up to 30 rows (the default is at most 20), and false means do not truncate the values
scala> studentDF.show(30,false)
+---+--------+--------------+-----------------------------------------+
|id |name    |phone         |email                                    |
+---+--------+--------------+-----------------------------------------+
|1  |Burke   |1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk|
|2  |Kamal   |1-668-571-5046|pede.Suspendisse@interdumenim.edu        |
|3  |Olga    |1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu   |
|4  |Belle   |1-246-894-6340|vitae.aliquet.nec@neque.co.uk            |
|5  |Trevor  |1-300-527-4967|dapibus.id@acturpisegestas.net           |
|6  |Laurel  |1-691-379-9921|adipiscing@consectetueripsum.edu         |
|7  |Sara    |1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu        |
|8  |Kaseem  |1-881-586-2689|cursus.et.magna@euismod.org              |
|9  |Lev     |1-916-367-5608|Vivamus.nisi@ipsumdolor.com              |
|10 |Maya    |1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu |
|11 |Emi     |1-467-270-1337|est@nunc.com                             |
|12 |Caleb   |1-683-212-0896|Suspendisse@Quisque.edu                  |
|13 |Florence|1-603-575-2444|sit.amet.dapibus@lacusAliquamrutrum.ca   |
|14 |Anika   |1-856-828-7883|euismod@ligulaelit.co.uk                 |
|15 |Tarik   |1-398-171-2268|turpis@felisorci.com                     |
|16 |Amena   |1-878-250-3129|lorem.luctus.ut@scelerisque.com          |
|17 |Blossom |1-154-406-9596|Nunc.commodo.auctor@eratSed.co.uk        |
|18 |Guy     |1-869-521-3230|senectus.et.netus@lectusrutrum.com       |
|19 |Malachi |1-608-637-2772|Proin.mi.Aliquam@estarcu.net             |
|20 |Edward  |1-711-710-6552|lectus@aliquetlibero.co.uk               |
|21 |        |1-711-710-6552|lectus@aliquetlibero.co.uk               |
|22 |        |1-711-710-6552|lectus@aliquetlibero.co.uk               |
|23 |NULL    |1-711-710-6552|lectus@aliquetlibero.co.uk               |
+---+--------+--------------+-----------------------------------------+


scala> studentDF.printSchema
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- email: string (nullable = true)

//Create a view from the studentDF DataFrame
//Then you can query that view
scala> studentDF.createOrReplaceTempView("student")


scala> spark.sql("select * from student where id = 20").show(false)
+---+------+--------------+--------------------------+
|id |name  |phone         |email                     |
+---+------+--------------+--------------------------+
|20 |Edward|1-711-710-6552|lectus@aliquetlibero.co.uk|
+---+------+--------------+--------------------------+


//Print the first three rows
//Why not just studentDF.take(3)? Because take returns an array, which we then loop over to print
//Source: def take(n: Int): Array[T] = head(n)
scala> studentDF.take(3).foreach(println)
[1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
[2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
[3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]



scala> studentDF.first()
res12: org.apache.spark.sql.Row = [1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]

scala> studentDF.head()
res13: org.apache.spark.sql.Row = [1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]

//Source: def first(): T = head()
//def head(): T = head(1).head
//def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)



//Now filter the rows whose id is greater than 10
//Option 1:
scala> spark.sql("select * from student where id >10").show
+---+--------+--------------+--------------------+
| id|    name|         phone|               email|
+---+--------+--------------+--------------------+
| 11|     Emi|1-467-270-1337|        est@nunc.com|
| 12|   Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|   Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|   Tarik|1-398-171-2268|turpis@felisorci.com|
| 16|   Amena|1-878-250-3129|lorem.luctus.ut@s...|
| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...|
| 18|     Guy|1-869-521-3230|senectus.et.netus...|
| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 20|  Edward|1-711-710-6552|lectus@aliquetlib...|
| 21|        |1-711-710-6552|lectus@aliquetlib...|
| 22|        |1-711-710-6552|lectus@aliquetlib...|
| 23|    NULL|1-711-710-6552|lectus@aliquetlib...|
+---+--------+--------------+--------------------+

//Option 2:
scala> studentDF.filter("id>10").show
+---+--------+--------------+--------------------+
| id|    name|         phone|               email|
+---+--------+--------------+--------------------+
| 11|     Emi|1-467-270-1337|        est@nunc.com|
| 12|   Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|   Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|   Tarik|1-398-171-2268|turpis@felisorci.com|
| 16|   Amena|1-878-250-3129|lorem.luctus.ut@s...|
| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...|
| 18|     Guy|1-869-521-3230|senectus.et.netus...|
| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 20|  Edward|1-711-710-6552|lectus@aliquetlib...|
| 21|        |1-711-710-6552|lectus@aliquetlib...|
| 22|        |1-711-710-6552|lectus@aliquetlib...|
| 23|    NULL|1-711-710-6552|lectus@aliquetlib...|
+---+--------+--------------+--------------------+

//Rows whose name is empty or the literal NULL
scala> studentDF.filter("name ='' or name='NULL' or name='null'").show
+---+----+--------------+--------------------+
| id|name|         phone|               email|
+---+----+--------------+--------------------+
| 21|    |1-711-710-6552|lectus@aliquetlib...|
| 22|    |1-711-710-6552|lectus@aliquetlib...|
| 23|NULL|1-711-710-6552|lectus@aliquetlib...|
+---+----+--------------+--------------------+
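The same filter can also be written with the Column API instead of a SQL-style string; note that this matches the literal string "NULL" in the data, not a real SQL NULL:

//Column-expression version of the same filter
studentDF.filter($"name" === "" || $"name" === "NULL" || $"name" === "null").show()

//or, more compactly, with isin
studentDF.filter($"name".isin("", "NULL", "null")).show()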



//Sort: descending by name, and for identical names, descending by id
//If you are not sure how, check the source code in IDEA
scala> studentDF.sort($"name".desc,$"id".desc).show(30)
+---+--------+--------------+--------------------+
| id|    name|         phone|               email|
+---+--------+--------------+--------------------+
|  5|  Trevor|1-300-527-4967|dapibus.id@acturp...|
| 15|   Tarik|1-398-171-2268|turpis@felisorci.com|
|  7|    Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  3|    Olga|1-956-311-1686|Aenean.eget.metus...|
| 23|    NULL|1-711-710-6552|lectus@aliquetlib...|
| 10|    Maya|1-271-683-2698|accumsan.convalli...|
| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
|  9|     Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
|  6|  Laurel|1-691-379-9921|adipiscing@consec...|
|  8|  Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  2|   Kamal|1-668-571-5046|pede.Suspendisse@...|
| 18|     Guy|1-869-521-3230|senectus.et.netus...|
| 13|Florence|1-603-575-2444|sit.amet.dapibus@...|
| 11|     Emi|1-467-270-1337|        est@nunc.com|
| 20|  Edward|1-711-710-6552|lectus@aliquetlib...|
| 12|   Caleb|1-683-212-0896|Suspendisse@Quisq...|
|  1|   Burke|1-300-746-8446|ullamcorper.velit...|
| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...|
|  4|   Belle|1-246-894-6340|vitae.aliquet.nec...|
| 14|   Anika|1-856-828-7883|euismod@ligulaeli...|
| 16|   Amena|1-878-250-3129|lorem.luctus.ut@s...|
| 22|        |1-711-710-6552|lectus@aliquetlib...|
| 21|        |1-711-710-6552|lectus@aliquetlib...|
+---+--------+--------------+--------------------+


//Rename a column with an alias
scala> studentDF.select($"phone".as("mobile")).show()
+--------------+
|        mobile|
+--------------+
|1-300-746-8446|
|1-668-571-5046|
|1-956-311-1686|
|1-246-894-6340|
|1-300-527-4967|
|1-691-379-9921|
|1-608-140-1995|
|1-881-586-2689|
|1-916-367-5608|
|1-271-683-2698|
|1-467-270-1337|
|1-683-212-0896|
|1-603-575-2444|
|1-856-828-7883|
|1-398-171-2268|
|1-878-250-3129|
|1-154-406-9596|
|1-869-521-3230|
|1-608-637-2772|
|1-711-710-6552|
+--------------+
only showing top 20 rows
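If you want to rename the column on the whole DataFrame rather than only in a projection, withColumnRenamed does the same job:

//rename phone to mobile while keeping all the other columns
studentDF.withColumnRenamed("phone", "mobile").show()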



//Now let's join two DataFrames
//join source: def join(right: Dataset[_], joinExprs: Column): DataFrame = join(right, joinExprs, "inner")
//there are several join variants; the default is inner
// import org.apache.spark.sql.functions._
//df1.join(df2, $"df1Key" === $"df2Key", "outer")

//Define two DataFrames: student1 and student2
scala> val student1 = spark.sparkContext.textFile("file:///home/hadoop/data/student.data").map(_.split("\\|")).map(attribute => Student(attribute(0),attribute(1),attribute(2),attribute(3))).toDF
student1: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]

scala> val student2 = spark.sparkContext.textFile("file:///home/hadoop/data/student.data").map(_.split("\\|")).map(attribute => Student(attribute(0),attribute(1),attribute(2),attribute(3))).toDF
student2: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]

//join (for multiple join conditions, keep adding them inside the parentheses; see the sketch after the output below)
scala> student1.join(student2,student1("id")===student2("id")).show()
+---+--------+--------------+--------------------+---+--------+--------------+--------------------+
| id|    name|         phone|               email| id|    name|         phone|               email|
+---+--------+--------------+--------------------+---+--------+--------------+--------------------+
|  7|    Sara|1-608-140-1995|Donec.nibh@enimEt...|  7|    Sara|1-608-140-1995|Donec.nibh@enimEt...|
| 15|   Tarik|1-398-171-2268|turpis@felisorci.com| 15|   Tarik|1-398-171-2268|turpis@felisorci.com|
| 11|     Emi|1-467-270-1337|        est@nunc.com| 11|     Emi|1-467-270-1337|        est@nunc.com|
|  3|    Olga|1-956-311-1686|Aenean.eget.metus...|  3|    Olga|1-956-311-1686|Aenean.eget.metus...|
|  8|  Kaseem|1-881-586-2689|cursus.et.magna@e...|  8|  Kaseem|1-881-586-2689|cursus.et.magna@e...|
| 22|        |1-711-710-6552|lectus@aliquetlib...| 22|        |1-711-710-6552|lectus@aliquetlib...|
| 16|   Amena|1-878-250-3129|lorem.luctus.ut@s...| 16|   Amena|1-878-250-3129|lorem.luctus.ut@s...|
|  5|  Trevor|1-300-527-4967|dapibus.id@acturp...|  5|  Trevor|1-300-527-4967|dapibus.id@acturp...|
| 18|     Guy|1-869-521-3230|senectus.et.netus...| 18|     Guy|1-869-521-3230|senectus.et.netus...|
| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...|
|  6|  Laurel|1-691-379-9921|adipiscing@consec...|  6|  Laurel|1-691-379-9921|adipiscing@consec...|
| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 23|    NULL|1-711-710-6552|lectus@aliquetlib...| 23|    NULL|1-711-710-6552|lectus@aliquetlib...|
|  9|     Lev|1-916-367-5608|Vivamus.nisi@ipsu...|  9|     Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
|  1|   Burke|1-300-746-8446|ullamcorper.velit...|  1|   Burke|1-300-746-8446|ullamcorper.velit...|
| 20|  Edward|1-711-710-6552|lectus@aliquetlib...| 20|  Edward|1-711-710-6552|lectus@aliquetlib...|
| 10|    Maya|1-271-683-2698|accumsan.convalli...| 10|    Maya|1-271-683-2698|accumsan.convalli...|
|  4|   Belle|1-246-894-6340|vitae.aliquet.nec...|  4|   Belle|1-246-894-6340|vitae.aliquet.nec...|
| 12|   Caleb|1-683-212-0896|Suspendisse@Quisq...| 12|   Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|Florence|1-603-575-2444|sit.amet.dapibus@...| 13|Florence|1-603-575-2444|sit.amet.dapibus@...|
+---+--------+--------------+--------------------+---+--------+--------------+--------------------+
only showing top 20 rows
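As the comments above hint, join also takes an explicit join type, and when both sides share the key name you can join on a Seq of column names so that id shows up only once in the result. A small sketch using the same student1/student2:

//join on a Seq of column names: id appears only once in the result
student1.join(student2, Seq("id"), "inner").show()

//column-expression condition plus an explicit join type (e.g. left outer)
student1.join(student2, student1("id") === student2("id"), "left_outer").show()

//multiple conditions can be combined with && inside the parentheses
student1.join(student2, student1("id") === student2("id") && student1("name") === student2("name")).show()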

Method 2: creating a DataFrame through a programmatic interface

The first method above assumes you already know the structure of the data in the file. What if you do not?

The second method creates a Dataset (DataFrame) through a programmatic interface that lets you build a schema yourself and then apply it to an existing RDD; in other words, you specify the schema manually and apply it to the RDD.
This second method is more verbose, but it lets you construct Datasets (DataFrames) even when the columns and their types are not known until runtime.

When you cannot define a case class ahead of time, you can build the DataFrame with the following three steps:

  • 1. Create an RDD of Rows from the original RDD (one Row per line).
  • 2. Define a schema, represented by a StructType, that matches the structure of the Rows in the RDD created in step 1.
  • 3. Apply the schema to the RDD of Rows via the createDataFrame method, tying the two together.

Key source code:

//StructType source:
case class StructType(fields: Array[StructField]) extends DataType with Seq[StructField] {
  ...

//StructField source:
case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    metadata: Metadata = Metadata.empty) {
  ...

//From the source above you can see:
StructType = Array[StructField]
StructField = name + dataType + nullable (+ metadata)

Let's walk through the steps:

scala> import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}

 //1. Create an RDD of Rows from the original RDD; turn the raw RDD into an RDD of Row
scala> val student = spark.sparkContext.textFile("file:///home/hadoop/data/student.data").map(_.split("\\|")).map(attribute => Row(attribute(0),attribute(1),attribute(2),attribute(3)))
student: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[113] at map at <console>:25


//2.Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
//Define a schema, represented by a StructType
scala> val structType = StructType(Array(StructField("id",StringType,true),StructField("name",StringType,true),StructField("phone",StringType,true),StructField("email",StringType,true)))
structType: org.apache.spark.sql.types.StructType = StructType(StructField(id,StringType,true), StructField(name,StringType,true), StructField(phone,StringType,true), StructField(email,StringType,true))


//3.Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.
//Tie your schema and the RDD of Rows together via the createDataFrame method
scala> val stuDF = spark.createDataFrame(student,structType)
stuDF: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]

//This schema is the one we defined ourselves
scala> stuDF.printSchema
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- email: string (nullable = true)


scala> stuDF.show()
+---+--------+--------------+--------------------+
| id|    name|         phone|               email|
+---+--------+--------------+--------------------+
|  1|   Burke|1-300-746-8446|ullamcorper.velit...|
|  2|   Kamal|1-668-571-5046|pede.Suspendisse@...|
|  3|    Olga|1-956-311-1686|Aenean.eget.metus...|
|  4|   Belle|1-246-894-6340|vitae.aliquet.nec...|
|  5|  Trevor|1-300-527-4967|dapibus.id@acturp...|
|  6|  Laurel|1-691-379-9921|adipiscing@consec...|
|  7|    Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  8|  Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|     Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|    Maya|1-271-683-2698|accumsan.convalli...|
| 11|     Emi|1-467-270-1337|        est@nunc.com|
| 12|   Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|   Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|   Tarik|1-398-171-2268|turpis@felisorci.com|
| 16|   Amena|1-878-250-3129|lorem.luctus.ut@s...|
| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...|
| 18|     Guy|1-869-521-3230|senectus.et.netus...|
| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 20|  Edward|1-711-710-6552|lectus@aliquetlib...|
+---+--------+--------------+--------------------+
only showing top 20 rows


scala> stuDF.map(x => "name: " + x(1))
res34: org.apache.spark.sql.Dataset[String] = [value: string]

scala> stuDF.map(x => "name: " + x(1)).show()
+--------------+
|         value|
+--------------+
|   name: Burke|
|   name: Kamal|
|    name: Olga|
|   name: Belle|
|  name: Trevor|
|  name: Laurel|
|    name: Sara|
|  name: Kaseem|
|     name: Lev|
|    name: Maya|
|     name: Emi|
|   name: Caleb|
|name: Florence|
|   name: Anika|
|   name: Tarik|
|   name: Amena|
| name: Blossom|
|     name: Guy|
| name: Malachi|
|  name: Edward|
+--------------+
only showing top 20 rows

A small tip:
mapping over an RDD still gives you an RDD;
mapping over a DataFrame gives you a Dataset.
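Building on that tip: if you want a typed Dataset back, you can call as[Student] on the DataFrame using the Student case class from the reflection example (outside spark-shell this again needs import spark.implicits._); a minimal sketch:

//DataFrame is just an alias for Dataset[Row]; as[Student] gives a typed Dataset[Student]
val studentDS = studentDF.as[Student]

//mapping over the typed Dataset keeps the element type
studentDS.map(s => "name: " + s.name).show()

//and df.rdd drops back down to an RDD[Row]
val rowRDD = studentDF.rdd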


Source: blog.csdn.net/liweihope/article/details/94409676