快学Big Data -- Spark SQL Summary (Part 24)

Copyright notice: this is an original post by the blogger and may not be reproduced without permission. https://blog.csdn.net/xfg0218/article/details/82381676

Spark SQL Summary

Overview

Spark SQL is a module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine.

Features

  1. Spark SQL generally runs faster than Hive because it does not translate queries into MapReduce jobs, which removes much of the execution overhead.
  2. Spark SQL can turn data into RDDs held in memory, which greatly improves execution efficiency.
  3. Easy to integrate, unified data access, Hive compatible, standard (JDBC/ODBC) data connectivity.

DataFrames

Overview

   Like an RDD, a DataFrame is a distributed data container. A DataFrame, however, is closer to a two-dimensional table in a traditional database: besides the data itself it also records the structure of that data, i.e. its schema. Like Hive, DataFrames support nested data types (struct, array, and map). In terms of API usability, the DataFrame API offers a set of high-level relational operations that are friendlier and have a lower entry barrier than the functional RDD API. Because it resembles the DataFrames of R and pandas, the Spark DataFrame carries over much of the development experience of traditional single-machine data analysis.
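To make the schema idea concrete, here is a minimal sketch (run in spark-shell where sc is available; the case class and sample values are invented for illustration):

import org.apache.spark.sql.SQLContext

// A record type with nested array and map fields (illustrative only)
case class Contact(name: String, phones: Array[String], attrs: Map[String, String])

val sqlCtx = new SQLContext(sc)
import sqlCtx.implicits._

val contacts = sc.parallelize(Seq(
  Contact("xiaozhang", Array("1380000"), Map("city" -> "beijing"))
)).toDF()

// Unlike a plain RDD, the DataFrame records the structure of every column,
// including the nested array and map types; printSchema prints roughly:
//   root
//    |-- name: string
//    |-- phones: array<string>
//    |-- attrs: map<string,string>
contacts.printSchema()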

Query Examples

1-1) Prepare the data

[root@hadoop1 testDate]# cat persion.text

1,xiaozhang,23

2,xiaowang,24

3,xiaoli,25

4,xiaoxiao,26

5,xiaoxiao,27

6,xiaolizi,39

7,xiaodaye,10

8,dageda,12

9,daye,24

10,dada,25

 

1-2) Upload to HDFS

[root@hadoop1 testDate]# hadoop fs -mkdir /sparkSql

[root@hadoop1 testDate]# hadoop fs -put /usr/local/testDate/persion.text  /sparkSql

1-3) Start Spark

[root@hadoop1 bin]# ./spark-shell --master spark://hadoop1:7077,hadoop2:7077 --total-executor-cores 10  --executor-memory 1g  --executor-cores  2

[root@hadoop1 bin]# ./spark-shell --master spark://hadoop1:7077,hadoop2:7077

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties

To adjust logging level use sc.setLogLevel("INFO")

Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2

 

*************************

scala> sc

res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@4294ef3e

1-4) Examples of common Spark SQL operations

A) Common function operations

sp_address.txt download link: http://download.csdn.net/download/xfg0218/9757901

 

Upload it to HDFS

[root@hadoop1 testDate]# hadoop fs -put /opt/testDate/sp_address.txt  /sparkSql

1-1) Read the data

scala> val rdd1 = sc.textFile("hdfs://hadoop1:9000/sparkSql/sp_address.txt")

rdd1: org.apache.spark.rdd.RDD[String] = hdfs://hadoop1:9000/sparkSql/sp_address.txt MapPartitionsRDD[1] at textFile at <console>:24

1-2) Split the data

scala> val rdd2 = rdd1.map(_.split(","))

rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26

 

1-3) Define a case class for the schema

scala> case class sp_address(ID:Int,PLACE_TYPE:String,PLACE_CODE:String,PLACE_NAME:String)

defined class sp_address

 

1-4) Map the rows into the case class

scala> val rdd3 = rdd2.map(rdd => sp_address(rdd(0).toInt,rdd(1),rdd(2),rdd(3)))

rdd3: org.apache.spark.rdd.RDD[sp_address] = MapPartitionsRDD[3] at map at <console>:30

1-5) Import the implicit conversions; without them the RDD cannot be converted to a DataFrame

scala> import org.apache.spark.sql.SQLContext

import org.apache.spark.sql.SQLContext

 

scala>  val sql = new SQLContext(sc)

warning: there was one deprecation warning; re-run with -deprecation for details

sql: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@451d6933

1-6) Convert the data to a DataFrame

scala> import sql.implicits._

import sql.implicits._

scala> val rdd4 = rdd3.toDF

17/02/19 13:52:16 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0

17/02/19 13:52:17 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException

rdd4: org.apache.spark.sql.DataFrame = [ID: int, PLACE_TYPE: string ... 3 more fields]

scala> rdd3.toDF.registerTempTable("rdd4")

scala> sql.sql("select * from rdd4").show()

1-7) Display the DataFrame's data

scala> rdd4.show()

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|  1|        01|    110000|       北京市|

|  2|        01|    120000|       天津市|

|  3|        01|    130000|       河北省|

|  4|        01|    140000|       山西省|

|  5|        01|    150000|    内蒙古自治区|

|  6|        01|    210000|       辽宁省|

|  7|        01|    220000|       吉林省|

|  8|        01|    230000|      黑龙江省|

|  9|        01|    310000|       上海市|

| 10|        01|    320000|       江苏省|

| 11|        01|    330000|       浙江省|

| 12|        01|    340000|       安徽省|

| 13|        01|    350000|       福建省|

| 14|        01|    360000|       江西省|

| 15|        01|    370000|       山东省|

| 16|        01|    410000|       河南省|

| 17|        01|    420000|       湖北省|

| 18|        01|    430000|       湖南省|

| 19|        01|    440000|       广东省|

| 20|        01|    450000|   广西壮族自治区|

+---+----------+----------+----------+

only showing top 20 rows

 

As you can see, show() displays the first 20 rows by default.

1-8) Show the first 5 rows

scala> rdd4.show(5)

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|  1|        01|    110000|       北京市|

|  2|        01|    120000|       天津市|

|  3|        01|    130000|       河北省|

|  4|        01|    140000|       山西省|

|  5|        01|    150000|    内蒙古自治区|

+---+----------+----------+----------+

only showing top 5 rows

 

1-9) Control column truncation: show(true) (the default) truncates values longer than 20 characters, while show(false) prints the full values

scala> rdd4.show(false)

scala> rdd4.show(true)

 

1-10) Show the first three rows without truncating long values

scala> rdd4.show(3,false)

+---+----------+----------+----------+

|ID |PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|1  |01        |110000    |北京市       |

|2  |01        |120000    |天津市       |

|3  |01        |130000    |河北省       |

+---+----------+----------+----------+

only showing top 3 rows

1-11) Return the data as an array of Rows (each Row behaves like a tuple)

scala> rdd4.collect()

res9: Array[org.apache.spark.sql.Row] = Array([1,01,110000,北京市], [2,01,120000,天津市], [3,01,130000,河北省], [4,01,140000,山西省], [5,01,150000,内蒙古自治区], [6,01,210000,辽宁省], [7,01,220000,吉林省], [8,01,230000,黑龙江省], [9,01,310000,上海市], [10,01,320000,江苏省], [11,01,330000,浙江省], [12,01,340000,安徽省], [13,01,350000,福建省], [14,01,360000,江西省], [15,01,370000,东省], [16,01,410000,河南省], [17,01,420000,湖北省], [18,01,430000,湖南省], [19,01,440000,广东省], [20,01,450000,广西壮族自治区], [21,01,460000,海南省], [22,01,500000,重庆市], [23,01,510000,川省], [24,01,520000,贵州省], [25,01,530000,云南省], [26,01,540000,西藏自治区], [27,01,610000,陕西省], [28,01,620000,甘肃省], [29,01,630000,青海省], [30,01,640000,宁夏回族自治区], [31,01,650000,新疆维吾尔自治区], [32,01,710000,台湾省], [33,01,810000,香港特别行政区], [34,01,820000,澳门特别行政区], [35,02,110100,市辖区], [36,02,110200,县], [37,02,120100,市...

1-12) Same as collect, but returns the data as a java.util.List

scala> rdd4.collectAsList()

res10: java.util.List[org.apache.spark.sql.Row] = [[1,01,110000,北京市], [2,01,120000,天津市], [3,01,130000,河北省], [4,01,140000,山西省], [5,01,150000,内蒙古自治区], [6,01,210000,辽宁省], [7,01,220000,吉林省], [8,01,230000,黑龙江省], [9,01,310000,上海市], [10,01,320000,江苏省], [11,01,330000,浙江省], [12,01,340000,安徽省], [13,01,350000,福建省], [14,01,360000,江西省], [15,01,370000,山东省], [16,01,410000,河南省], [17,01,420000,湖北省], [18,01,430000,湖南省], [19,01,440000,广东省], [20,01,450000,广西壮族自治区], [21,01,460000,海南省], [22,01,500000,重庆市], [23,01,510000,四川省], [24,01,520000,贵州省], [25,01,530000,云南省], [26,01,540000,西藏自治区], [27,01,610000,陕西省], [28,01,620000,甘肃省], [29,01,630000,青海省], [30,01,640000,宁夏回族自治区], [31,01,650000,新疆维吾尔自治区], [32,01,710000,台湾省], [33,01,810000,香港特别行政区], [34,01,820000,澳门特别行政区], [35,02,110100,市辖区], [36,02,110200,县], [37,02,120...

 

1-13) Get summary statistics for the given columns: describe(cols: String*)
scala> rdd4.describe("ID","PLACE_TYPE","PLACE_CODE").show()

+-------+------------------+------------------+------------------+

|summary|                ID|        PLACE_TYPE|        PLACE_CODE|

+-------+------------------+------------------+------------------+

|  count|              3674|              3674|              3674|

|   mean|            1837.5| 2.878606423516603|400493.43195427326|

| stddev|1060.7367722484216|0.3538356835565222|162524.50646097676|

|    min|                 1|                01|            110000|

|    max|              3674|                03|            820301|

+-------+------------------+------------------+------------------+

 

count: number of rows

mean: average value

stddev: standard deviation

min/max: minimum / maximum value
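describe() is just a convenience wrapper over ordinary aggregations; a minimal sketch computing the same statistics explicitly with agg (assuming the rdd4 DataFrame above and Spark 1.6+, where stddev is available in org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.{count, avg, stddev, min, max}

// The same summary computed by hand; the column choice (ID) is just for illustration.
rdd4.agg(
  count("ID").as("count"),
  avg("ID").as("mean"),
  stddev("ID").as("stddev"),
  min("ID").as("min"),
  max("ID").as("max")
).show()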

1-14) Get the first row
scala> rdd4.first

res12: org.apache.spark.sql.Row = [1,01,110000,北京市]

1-15) Get the first row or the first N rows (head, take)

scala> rdd4.head

res13: org.apache.spark.sql.Row = [1,01,110000,北京市]

 

scala> rdd4.head(3)

res14: Array[org.apache.spark.sql.Row] = Array([1,01,110000,北京市], [2,01,120000,天津市], [3,01,130000,河北省])

 

scala> rdd4.take(3)

res16: Array[org.apache.spark.sql.Row] = Array([1,01,110000,北京市], [2,01,120000,天津市], [3,01,130000,河北省])

1-16) Get the first n rows as a List: takeAsList(n: Int)

scala> rdd4.takeAsList(4)

res17: java.util.List[org.apache.spark.sql.Row] = [[1,01,110000,北京市], [2,01,120000,天津市], [3,01,130000,河北省], [4,01,140000,山西省]]

1-17) Using the where clause

scala> rdd4.where("ID > 5 AND ID < 10").show()

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|  6|        01|    210000|       辽宁省|

|  7|        01|    220000|       吉林省|

|  8|        01|    230000|      黑龙江省|

|  9|        01|    310000|       上海市|

+---+----------+----------+----------+

1-18) Filtering data with filter

scala> rdd4.filter("ID=1 OR PLACE_TYPE=1").show()

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|  1|        01|    110000|       北京市|

|  2|        01|    120000|       天津市|

|  3|        01|    130000|       河北省|

|  4|        01|    140000|       山西省|

|  5|        01|    150000|    内蒙古自治区|

|  6|        01|    210000|       辽宁省|

|  7|        01|    220000|       吉林省|

|  8|        01|    230000|      黑龙江省|

|  9|        01|    310000|       上海市|

| 10|        01|    320000|       江苏省|

| 11|        01|    330000|       浙江省|

| 12|        01|    340000|       安徽省|

| 13|        01|    350000|       福建省|

| 14|        01|    360000|       江西省|

| 15|        01|    370000|       山东省|

| 16|        01|    410000|       河南省|

| 17|        01|    420000|       湖北省|

| 18|        01|    430000|       湖南省|

| 19|        01|    440000|       广东省|

| 20|        01|    450000|   广西壮族自治区|

+---+----------+----------+----------+

only showing top 20 rows

1-19) Show the first five rows of selected columns

scala> rdd4.select("ID","PLACE_TYPE").show(5)

+---+----------+

| ID|PLACE_TYPE|

+---+----------+

|  1|        01|

|  2|        01|

|  3|        01|

|  4|        01|

|  5|        01|

+---+----------+

only showing top 5 rows

 

scala> rdd4.select(rdd4("ID"),rdd4("PLACE_TYPE")).show(5)

+---+----------+

| ID|PLACE_TYPE|

+---+----------+

|  1|        01|

|  2|        01|

|  3|        01|

|  4|        01|

|  5|        01|

+---+----------+

only showing top 5 rows

1-20) Apply expressions to selected columns (selectExpr)

scala> rdd4.selectExpr("PLACE_NAME","PLACE_TYPE AS TYPE","ROUND(ID)").show(5)

+----------+----+------------+

|PLACE_NAME|TYPE|round(ID, 0)|

+----------+----+------------+

|       北京市|  01|           1|

|       天津市|  01|           2|

|       河北省|  01|           3|

|       山西省|  01|           4|

|    内蒙古自治区|  01|           5|

+----------+----+------------+

only showing top 5 rows

1-21) Get a specific column (apply)

scala> val idCol1 = rdd4.apply("ID")

idCol1: org.apache.spark.sql.Column = ID

 

scala> val idCol2 = rdd4("ID")

idCol2: org.apache.spark.sql.Column = ID

1-22) Drop a column and keep the rest (drop)

scala> rdd4.drop("PLACE_CODE")

res31: org.apache.spark.sql.DataFrame = [ID: int, PLACE_TYPE: string ... 1 more field]

 

scala> rdd4.drop(rdd4("PLACE_CODE"))

res34: org.apache.spark.sql.DataFrame = [ID: int, PLACE_TYPE: string ... 1 more field]

1-23) Take the first n rows of a DataFrame as a new DataFrame (limit is a transformation, not an action)

scala> rdd4.limit(5).show(false)

+---+----------+----------+----------+

|ID |PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|1  |01        |110000    |北京市       |

|2  |01        |120000    |天津市       |

|3  |01        |130000    |河北省       |

|4  |01        |140000    |山西省       |

|5  |01        |150000    |内蒙古自治区    |

+---+----------+----------+----------+
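As the heading above notes, limit is a transformation rather than an action; a minimal sketch of the difference (assuming the rdd4 DataFrame from this session):

// limit() only builds a new, lazily evaluated DataFrame -- no job runs here.
val top5 = rdd4.limit(5)

// An action such as count/show/collect is what actually triggers execution.
top5.count()

// By contrast, take(5) and head(5) are actions and immediately return Array[Row].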

 

1-24) Sort by a column, ascending by default (prefix the Column with - for descending, or call desc on it)

A) Descending order
scala> rdd4.orderBy(-rdd4("ID")).show(5)

+----+----------+----------+----------+

|  ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+----+----------+----------+----------+

|3674|        03|    820301|       路环区|

|3673|        03|    820201|       凼仔区|

|3672|        03|    820105|      风顺堂区|

|3671|        03|    820104|       大堂区|

|3670|        03|    820103|      望德堂区|

+----+----------+----------+----------+

only showing top 5 rows

 

 

scala> rdd4.orderBy(rdd4("ID").desc).show(5)

+----+----------+----------+----------+

|  ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+----+----------+----------+----------+

|3674|        03|    820301|       路环区|

|3673|        03|    820201|       凼仔区|

|3672|        03|    820105|      风顺堂区|

|3671|        03|    820104|       大堂区|

|3670|        03|    820103|      望德堂区|

+----+----------+----------+----------+

only showing top 5 rows

B) Ascending order

scala> rdd4.orderBy(rdd4("ID")).show(5)

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|  1|        01|    110000|       北京市|

|  2|        01|    120000|       天津市|

|  3|        01|    130000|       河北省|

|  4|        01|    140000|       山西省|

|  5|        01|    150000|    内蒙古自治区|

+---+----------+----------+----------+

only showing top 5 rows

 

1-25) group by operations on a column

A) Group by a column name

scala> rdd4.groupBy("ID")

res40: org.apache.spark.sql.RelationalGroupedDataset = org.apache.spark.sql.RelationalGroupedDataset@31ebddc4

 

B) Group by a Column object

scala> rdd4.groupBy(rdd4("ID"))

res42: org.apache.spark.sql.RelationalGroupedDataset = org.apache.spark.sql.RelationalGroupedDataset@4c44045c

 

scala> rdd4.groupBy(rdd4("ID")).count()

res43: org.apache.spark.sql.DataFrame = [ID: int, count: bigint]

 

scala> rdd4.groupBy(rdd4("ID")).max()

res44: org.apache.spark.sql.DataFrame = [ID: int, max(ID): int]

 

scala> rdd4.groupBy(rdd4("ID")).min()

res45: org.apache.spark.sql.DataFrame = [ID: int, min(ID): int]

 

scala> rdd4.groupBy(rdd4("ID")).mean()

res46: org.apache.spark.sql.DataFrame = [ID: int, avg(ID): double]

 

scala> rdd4.groupBy(rdd4("ID")).sum()

res47: org.apache.spark.sql.DataFrame = [ID: int, sum(ID): bigint]

 

scala> rdd4.groupBy(rdd4("ID")).sum().show(5)

17/02/19 15:52:53 WARN TaskMemoryManager: leak 16.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap@5dda3e67

17/02/19 15:52:53 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@23370c54 in task 32

17/02/19 15:52:53 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@730a70db in task 32

17/02/19 15:52:53 WARN Executor: Managed memory leak detected; size = 17039360 bytes, TID = 32

+----+-------+

|  ID|sum(ID)|

+----+-------+

| 148|    148|

| 463|    463|

| 471|    471|

| 496|    496|

| 833|    833|

+----+-------+

only showing top 5 rows

 

 

scala> rdd4.groupBy(rdd4("ID")).mean().show(5)

17/02/19 15:53:28 WARN TaskMemoryManager: leak 16.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap@13c918f3

17/02/19 15:53:28 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@778503aa in task 34

17/02/19 15:53:28 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@3db3cb33 in task 34

17/02/19 15:53:28 WARN Executor: Managed memory leak detected; size = 17039360 bytes, TID = 34

+---+-------+

| ID|avg(ID)|

+---+-------+

|148|  148.0|

|463|  463.0|

|471|  471.0|

|496|  496.0|

|833|  833.0|

+---+-------+

only showing top 5 rows

1-26) Return a DataFrame without duplicate rows (distinct)

scala> rdd4.distinct.show(5)

17/02/19 15:55:13 WARN TaskMemoryManager: leak 16.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap@7d9e4118 (data already read into memory)

17/02/19 15:55:13 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@1750be28 in task 36

17/02/19 15:55:13 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@35f4400d in task 36

17/02/19 15:55:13 WARN Executor: Managed memory leak detected; size = 17039360 bytes, TID = 36

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|175|        02|    370100|       济南市|

|258|        02|    450100|       南宁市|

|298|        02|    513400|   凉山彝族自治州|

|432|        03|    120102|       河东区|

|555|        03|    130636|       顺平县|

+---+----------+----------+----------+

only showing top 5 rows

 

1-27) Deduplicate by the given columns (dropDuplicates)

scala> rdd4.dropDuplicates(Seq("ID")).show(5)

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|148|        02|    341200|       阜阳市|

|463|        03|    130129|       赞皇县|

|471|        03|    130184|       新乐市|

|496|        03|    130401|       市辖区|

|833|        03|    150822|       磴口县|

+---+----------+----------+----------+

only showing top 5 rows

1-28) Aggregation (agg)

scala> rdd4.agg("ID"->"max","PLACE_TYPE"->"sum").show(5)

+-------+---------------+

|max(ID)|sum(PLACE_TYPE)|

+-------+---------------+

|   3674|        10576.0|

+-------+---------------+

1-29) Append one DataFrame to another (unionAll)

scala> rdd4.limit(2).unionAll(rdd4.limit(1))

warning: there was one deprecation warning; re-run with -deprecation for details

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|  1|        01|    110000|       北京市|

|  2|        01|    120000|       天津市|

|  1|        01|    110000|       北京市|

+---+----------+----------+----------+

1-30) Using join

A) Prepare the data

[root@hadoop1 testDate]# cat person.text

1,xiaozhang,20

2,xiaoli,30

3,xiaoxu,19

4,lili,20

5,yuanyuan,18

[root@hadoop1 testDate]# hadoop fs -put /opt/testDate/person.text  /sparkSql

scala> val person = sc.textFile("hdfs://skycloud1:9000/sparkSql/person.text")

person: org.apache.spark.rdd.RDD[String] = hdfs://skycloud1:9000/sparkSql/person.text.txt MapPartitionsRDD[143] at textFile at <console>:36

 

scala> case class Person(ID:Int,NAME:String,AGE:String)

defined class Person

 

scala> val person1 = person.map(_.split(","))

person1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[144] at map at <console>:38       

 

scala> val person2 = person1.map(x=>Person(x(0).toInt,x(1),x(2)))

person2: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[145] at map at <console>:42

 

scala> import sql.implicits._

import sql.implicits._

 

scala> val pr = person2.toDF

pr: org.apache.spark.sql.DataFrame = [ID: int, NAME: string ... 1 more field]

 

scala> pr.show(2)

+---+---------+---+

| ID|     NAME|AGE|

+---+---------+---+

|  1|xiaozhang| 20|

|  2|   xiaoli| 30|

+---+---------+---+

only showing top 2 rows

B) Run the join

scala> rdd4.join(rdd4)

res66: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: int, PLACE_TYPE: string ... 6 more fields]

 

Specify the join column:

scala> rdd4.join(pr,"ID").show()

+---+----------+----------+----------+---------+---+                            

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|     NAME|AGE|

+---+----------+----------+----------+---------+---+

|  1|        01|    110000|       北京市|xiaozhang| 20|

|  3|        01|    130000|       河北省|   xiaoxu| 19|

|  5|        01|    150000|    内蒙古自治区| yuanyuan| 18|

|  4|        01|    140000|       山西省|     lili| 20|

|  2|        01|    120000|       天津市|   xiaoli| 30|

+---+----------+----------+----------+---------+---+

[stage  99  ============================================>          (37 + 1) / 199 ]

The progress bar shows roughly 200 tasks: the join triggers a shuffle, and the number of post-shuffle partitions is controlled by spark.sql.shuffle.partitions, which defaults to 200. It is not a sign that Spark "opened 199 stages" to speed up the join.
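The number of post-shuffle tasks can be tuned; a minimal sketch (assuming the sql SQLContext created earlier in this session):

// spark.sql.shuffle.partitions controls how many partitions joins and
// aggregations use after the shuffle (default 200). For small test data a
// smaller value avoids spawning ~200 tiny tasks:
sql.setConf("spark.sql.shuffle.partitions", "8")

rdd4.join(pr, "ID").show()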

 

 

Now test with messier data:

[root@hadoop1 testDate]# vi  person.text

1,xiaozhang,20

2,xiaoli,30

3,xiaoxu,19

4,lili,20

5,yuanyuan,18

20,dada,30

60,fgh,40

100,ros,30

8,irtn,100

1000,dfdfef,60

88,dfrif,70

9,ryty,80

99,fnth,aoli,30

3,xiaoxu,19

4,lili,20

5,yuanyuan,18

20,dada,30

60,fgh,40

100,ros,30

8,irtn,100

1000,dfdfef,60

88,dfrif,70

9,ryty,80

99,fnth,101

 

 

scala> rdd4.join(pr,"ID").show()

+---+----------+----------+----------+---------+---+                            

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|     NAME|AGE|

+---+----------+----------+----------+---------+---+

|  1|        01|    110000|       北京市|xiaozhang| 20|

|  3|        01|    130000|       河北省|   xiaoxu| 19|

|  5|        01|    150000|    内蒙古自治区| yuanyuan| 18|

|  4|        01|    140000|       山西省|     lili| 20|

|  2|        01|    120000|       天津市|   xiaoli| 30|

+---+----------+----------+----------+---------+---+

[stage  130  ============================================>          (37 + 1) / 199 ]

As above, the ~200 tasks again come from the default spark.sql.shuffle.partitions setting.

 

C) Joining on multiple columns

scala> rdd4.join(pr,Seq("ID","NAME")).show()

org.apache.spark.sql.AnalysisException: using columns ['ID,'NAME] can not be resolved given input columns: [PLACE_TYPE, ID, AGE, PLACE_NAME, ID, NAME, PLACE_CODE] ;

***********

This fails immediately: the join asks for a NAME column on both sides, but rdd4 has no NAME column (the test data was not prepared with this case in mind).
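For reference, a multi-column join does work once both sides actually contain the columns; a minimal, hypothetical sketch (pr2 and the renamed column exist only for this illustration, and it reuses the implicits imported earlier):

// A second small DataFrame that really has ID and NAME columns.
case class Person2(ID: Int, NAME: String, AGE: String)
val pr2 = sc.parallelize(Seq(Person2(1, "北京市", "20"))).toDF()

// Give rdd4 a NAME column too (by renaming PLACE_NAME), then join on both keys.
rdd4.withColumnRenamed("PLACE_NAME", "NAME").join(pr2, Seq("ID", "NAME")).show()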

 

D) Specifying the join type (joins between two DataFrames support inner, outer, left_outer, right_outer, and leftsemi)

 

inner: inner join

outer: full outer join

left_outer: left outer join

right_outer: right outer join

leftsemi: left semi join

 

 

scala> rdd4.join(pr,Seq("ID"),"outer").show(5)

+----+----------+----------+----------+----+----+

|  ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|NAME| AGE|

+----+----------+----------+----------+----+----+

| 148|        02|    341200|       阜阳市|null|null|

| 463|        03|    130129|       赞皇县|null|null|

| 471|        03|    130184|       新乐市|null|null|

| 496|        03|    130401|       市辖区|null|null|

| 833|        03|    150822|       磴口县|null|null|

 

 

scala> rdd4.join(pr,Seq("ID"),"inner").show()

+----+----------+----------+----------+---------+---+                           

|  ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|     NAME|AGE|

+----+----------+----------+----------+---------+---+

|   1|        01|    110000|       北京市|xiaozhang| 20|

|   3|        01|    130000|       河北省|   xiaoxu| 19|

|   3|        01|    130000|       河北省|   xiaoxu| 19|

|  20|        01|    450000|   广西壮族自治区|     dada| 30|

|  20|        01|    450000|   广西壮族自治区|     dada| 30|

 

 

scala> rdd4.join(pr,Seq("ID"),"right_outer").show(5)

+---+----------+----------+----------+---------+---+                            

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|     NAME|AGE|

+---+----------+----------+----------+---------+---+

|  1|        01|    110000|       北京市|xiaozhang| 20|

|  3|        01|    130000|       河北省|   xiaoxu| 19|

|  3|        01|    130000|       河北省|   xiaoxu| 19|

| 20|        01|    450000|   广西壮族自治区|     dada| 30|

| 20|        01|    450000|   广西壮族自治区|     dada| 30|

+---+----------+----------+----------+---------+---+

only showing top 5 rows

 

 

scala> rdd4.join(pr,Seq("ID"),"left_outer").show(5)

+---+----------+----------+----------+----+----+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|NAME| AGE|

+---+----------+----------+----------+----+----+

|148|        02|    341200|       阜阳市|null|null|

|463|        03|    130129|       赞皇县|null|null|

|471|        03|    130184|       新乐市|null|null|

|496|        03|    130401|       市辖区|null|null|

|833|        03|    150822|       磴口县|null|null|

+---+----------+----------+----------+----+----+

only showing top 5 rows

 

 

scala> rdd4.join(pr,Seq("ID"),"leftsemi").show(5)

+---+----------+----------+----------+                                          

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|  1|        01|    110000|       北京市|

|  3|        01|    130000|       河北省|

| 20|        01|    450000|   广西壮族自治区|

|  5|        01|    150000|    内蒙古自治区|

| 88|        02|    211300|       朝阳市|

+---+----------+----------+----------+

only showing top 5 rows

 

E) Joining with a Column expression

scala> rdd4.join(pr,rdd4("ID") === pr("ID")).show(5)

+---+----------+----------+----------+---+---------+---+                        

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME| ID|     NAME|AGE|

+---+----------+----------+----------+---+---------+---+

|  1|        01|    110000|       北京市|  1|xiaozhang| 20|

|  3|        01|    130000|       河北省|  3|   xiaoxu| 19|

|  3|        01|    130000|       河北省|  3|   xiaoxu| 19|

| 20|        01|    450000|   广西壮族自治区| 20|     dada| 30|

| 20|        01|    450000|   广西壮族自治区| 20|     dada| 30|

+---+----------+----------+----------+---+---------+---+

only showing top 5 rows

 

Note that the operator is ===, with three equals signs, not two.
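A short sketch of why the three-character operator matters (assuming rdd4 and pr from above):

// === is defined on Column and builds an expression that Spark evaluates per row.
val cond = rdd4("ID") === pr("ID")   // cond is an org.apache.spark.sql.Column
rdd4.join(pr, cond).show(5)

// Plain Scala == would merely compare the two Column objects themselves and
// yield a Boolean, which is not what a join condition needs.
// (For "not equal" on Columns, Spark 1.x provides !==.)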

 

F) Specifying both the join condition and the join type

scala> rdd4.join(pr,rdd4("ID") === pr("ID"),"inner").show(5)

+---+----------+----------+----------+---+---------+---+                        

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME| ID|     NAME|AGE|

+---+----------+----------+----------+---+---------+---+

|  1|        01|    110000|       北京市|  1|xiaozhang| 20|

|  3|        01|    130000|       河北省|  3|   xiaoxu| 19|

|  3|        01|    130000|       河北省|  3|   xiaoxu| 19|

| 20|        01|    450000|   广西壮族自治区| 20|     dada| 30|

| 20|        01|    450000|   广西壮族自治区| 20|     dada| 30|

+---+----------+----------+----------+---+---------+---+

only showing top 5 rows

 

1-31) Column statistics via stat

The stat method computes statistics for a given column or between columns, such as variance and covariance. It returns a DataFrameStatFunctions object.

The code below uses freqItems to find values of the PLACE_NAME column whose frequency is estimated to be above 30%. freqItems is an approximate algorithm, so the items it returns (here 凼仔区 and 路环区) can include false positives.

scala> rdd4.stat.freqItems(Seq("PLACE_NAME"), 0.3).show()

+--------------------+

|PLACE_NAME_freqItems|

+--------------------+

|          [凼仔区, 路环区]|

+--------------------+
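Besides freqItems, DataFrameStatFunctions also exposes covariance and correlation; a minimal sketch (cov/corr need numeric columns, so PLACE_CODE is cast to double purely for illustration):

// Cast the code column to a numeric type so it can be used in cov/corr.
val withCode = rdd4.withColumn("CODE_NUM", rdd4("PLACE_CODE").cast("double"))

withCode.stat.cov("ID", "CODE_NUM")    // sample covariance between the two columns
withCode.stat.corr("ID", "CODE_NUM")   // Pearson correlation coefficient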

 

1-32) Get the rows common to two DataFrames (intersect)

scala> rdd4.intersect(rdd4.limit(5)).show(false)

+---+----------+----------+----------+                                          

|ID |PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|2  |01        |120000    |天津市       |

|5  |01        |150000    |内蒙古自治区    |

|3  |01        |130000    |河北省       |

|1  |01        |110000    |北京市       |

|4  |01        |140000    |山西省       |

+---+----------+----------+----------+

1-33) Get the rows in one DataFrame that are not in another (except; here, the rows not in the limit(5) subset)

scala> rdd4.except(rdd4.limit(5)).show(5)

17/02/19 17:19:08 WARN TaskMemoryManager: leak 16.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap@69ea2d55

17/02/19 17:19:08 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@d465dc6 in task 8737

17/02/19 17:19:08 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@6b678f9 in task 8737

17/02/19 17:19:08 WARN Executor: Managed memory leak detected; size = 17039360 bytes, TID = 8737

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|175|        02|    370100|       济南市|

|258|        02|    450100|       南宁市|

|298|        02|    513400|   凉山彝族自治州|

|432|        03|    120102|       河东区|

|555|        03|    130636|       顺平县|

+---+----------+----------+----------+

only showing top 5 rows

 

1-34) Rename a column (ID becomes IDS)

scala> rdd4.withColumnRenamed("ID","IDS")

res141: org.apache.spark.sql.DataFrame = [IDS: int, PLACE_TYPE: string ... 2 more fields]

 

1-35) withColumn: add a new column to the DataFrame

scala> rdd4.withColumn("ADD_PLACE_NAME",rdd4("ID")).show(5)

+---+----------+----------+----------+--------------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|ADD_PLACE_NAME|

+---+----------+----------+----------+--------------+

|  1|        01|    110000|       北京市|             1|

|  2|        01|    120000|       天津市|             2|

|  3|        01|    130000|       河北省|             3|

|  4|        01|    140000|       山西省|             4|

|  5|        01|    150000|    内蒙古自治区|             5|

+---+----------+----------+----------+--------------+

only showing top 5 rows

 

A new column has been added, populated from rdd4("ID").
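The new column can be any Column expression, not just a copy of an existing column; a small sketch (the derived column name is made up for this example):

import org.apache.spark.sql.functions.substring

// Add the two-digit prefix of PLACE_CODE as a new column.
rdd4.withColumn("CODE_PREFIX", substring(rdd4("PLACE_CODE"), 1, 2)).show(5)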

1-36) explode: expand one column into multiple rows

scala> rdd4.explode("PLACE_NAME","SPLIT_PLACE_NAME"){name:String=>name.split("省")}.show(5)

warning: there was one deprecation warning; re-run with -deprecation for details

+---+----------+----------+----------+----------------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|SPLIT_PLACE_NAME|

+---+----------+----------+----------+----------------+

|  1|        01|    110000|       北京市|             北京市|

|  2|        01|    120000|       天津市|             天津市|

|  3|        01|    130000|       河北省|              河北|

|  4|        01|    140000|       山西省|              山西|

|  5|        01|    150000|    内蒙古自治区|          内蒙古自治区|

+---+----------+----------+----------+----------------+

only showing top 5 rows

 

The conversion is done: the character 省 has been stripped from the province names.
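DataFrame.explode is deprecated in newer Spark releases; roughly the same result can be obtained with the built-in split and explode functions (a sketch; the handling of trailing empty strings differs slightly from String.split above):

import org.apache.spark.sql.functions.{col, explode, split}

rdd4.select(col("ID"), col("PLACE_NAME"),
            explode(split(col("PLACE_NAME"), "省")).as("SPLIT_PLACE_NAME"))
    .show(5)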

 

1-37) Querying with SQL

scala> val rdd5 = rdd4.registerTempTable("sp_address")

warning: there was one deprecation warning; re-run with -deprecation for details

rdd5: Unit = ()

 

scala> val sqlc= new org.apache.spark.sql.SQLContext(sc)

warning: there was one deprecation warning; re-run with -deprecation for details

sqlc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4cb346e9

 

 

scala> sqlc.sql("select * from sp_address limit 5").show(5)

+---+----------+----------+----------+

| ID|PLACE_TYPE|PLACE_CODE|PLACE_NAME|

+---+----------+----------+----------+

|  1|        01|    110000|       北京市|

|  2|        01|    120000|       天津市|

|  3|        01|    130000|       河北省|

|  4|        01|    140000|       山西省|

|  5|        01|    150000|    内蒙古自治区|

+---+----------+----------+----------+

1-38) View the table structure with desc

scala> sqlc.sql("desc sp_address").show()

+----------+---------+-------+

|  col_name|data_type|comment|

+----------+---------+-------+

|        ID|      int|       |

|PLACE_TYPE|   string|       |

|PLACE_CODE|   string|       |

|PLACE_NAME|   string|       |

+----------+---------+-------+

1-39) Save to HDFS

scala> rdd4.limit(5).write.json("hdfs://skycloud1:9000/outsparkSql")

 

View the data

[root@hadoop1 testDate]# hadoop fs -cat /outsparkSql/part-r-00000-08af1f56-81de-4252-a83a-22ae8ab3bde7.json

{"ID":1,"PLACE_TYPE":"01","PLACE_CODE":"110000","PLACE_NAME":"北京市"}

{"ID":2,"PLACE_TYPE":"01","PLACE_CODE":"120000","PLACE_NAME":"天津市"}

{"ID":3,"PLACE_TYPE":"01","PLACE_CODE":"130000","PLACE_NAME":"河北省"}

{"ID":4,"PLACE_TYPE":"01","PLACE_CODE":"140000","PLACE_NAME":"山西省"}

{"ID":5,"PLACE_TYPE":"01","PLACE_CODE":"150000","PLACE_NAME":"内蒙古自治区"}

1-40) View the schema with printSchema

scala> rdd4.printSchema

root

 |-- ID: integer (nullable = false)

 |-- PLACE_TYPE: string (nullable = true)

 |-- PLACE_CODE: string (nullable = true)

 |-- PLACE_NAME: string (nullable = true)

1-41) Operate on a column

scala> rdd4.select($"ID",$"ID"+1).show(5)

+---+--------+

| ID|(ID + 1)|

+---+--------+

|  1|       2|

|  2|       3|

|  3|       4|

|  4|       5|

|  5|       6|

+---+--------+

only showing top 5 rows

B) DSL-style syntax

1-1) Load the data

scala> val pDF = persionDF.toDF

pDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int]

 

scala> pDF.show

+---+---------+---+

| id|     name|age|

+---+---------+---+

|  1|xiaozhang| 23|

|  2| xiaowang| 24|

|  3|   xiaoli| 25|

|  4| xiaoxiao| 26|

|  5| xiaoxiao| 27|

|  6| xiaolizi| 39|

|  7| xiaodaye| 10|

|  8|   dageda| 12|

|  9|     daye| 24|

| 10|     dada| 25|

+---+---------+---+

1-2) Query examples

Select data by column:

scala> pDF.select(pDF.col("name")).show

+---------+

|     name|

+---------+

|xiaozhang|

| xiaowang|

|   xiaoli|

| xiaoxiao|

| xiaoxiao|

| xiaolizi|

| xiaodaye|

|   dageda|

|     daye|

|     dada|

+---------+

 

scala> pDF.select(pDF.col("name"),pDF.col("age")).show

+---------+---+

|     name|age|

+---------+---+

| xiaolizi| 39|

| xiaoxiao| 27|

| xiaoxiao| 26|

|   xiaoli| 25|

|     dada| 25|

| xiaowang| 24|

|     daye| 24|

|xiaozhang| 23|

|   dageda| 12|

| xiaodaye| 10|

+---------+---+

 

 

scala> pDF.select(col("name"),col("age")).show

+---------+---+

|     name|age|

+---------+---+

| xiaolizi| 39|

| xiaoxiao| 27|

| xiaoxiao| 26|

|   xiaoli| 25|

|     dada| 25|

| xiaowang| 24|

|     daye| 24|

|xiaozhang| 23|

|   dageda| 12|

| xiaodaye| 10|

+---------+---+

 

 

scala> pDF.select("name","age").show

+---------+---+

|     name|age|

+---------+---+

| xiaolizi| 39|

| xiaoxiao| 27|

| xiaoxiao| 26|

|   xiaoli| 25|

|     dada| 25|

| xiaowang| 24|

|     daye| 24|

|xiaozhang| 23|

|   dageda| 12|

| xiaodaye| 10|

+---------+---+

 

Arithmetic on a column:

scala> pDF.select(col("name"),col("age")+10).show

+---------+----------+

|     name|(age + 10)|

+---------+----------+

| xiaolizi|        49|

| xiaoxiao|        37|

| xiaoxiao|        36|

|   xiaoli|        35|

|     dada|        35|

| xiaowang|        34|

|     daye|        34|

|xiaozhang|        33|

|   dageda|        22|

| xiaodaye|        20|

+---------+----------+

 

 

scala> pDF.select(pDF.col("name"),pDF.col("age")+10).show

+---------+----------+

|     name|(age + 10)|

+---------+----------+

| xiaolizi|        49|

| xiaoxiao|        37|

| xiaoxiao|        36|

|   xiaoli|        35|

|     dada|        35|

| xiaowang|        34|

|     daye|        34|

|xiaozhang|        33|

|   dageda|        22|

| xiaodaye|        20|

+---------+----------+

 

1-3) Filter data by condition

Find people aged 20 or older:

scala> pDF.filter(col("age")>=20).show

+---+---------+---+

| id|     name|age|

+---+---------+---+

|  6| xiaolizi| 39|

|  5| xiaoxiao| 27|

|  4| xiaoxiao| 26|

|  3|   xiaoli| 25|

| 10|     dada| 25|

|  2| xiaowang| 24|

|  9|     daye| 24|

|  1|xiaozhang| 23|

+---+---------+---+

 

Count records grouped by name:

scala> pDF.groupBy("name").count().show

+---------+-----+                                                               

|     name|count|

+---------+-----+

|     dada|    1|

|xiaozhang|    1|

| xiaoxiao|    2|

|   xiaoli|    1|

| xiaodaye|    1|

|     daye|    1|

| xiaolizi|    1|

|   dageda|    1|

| xiaowang|    1|

+---------+-----+

C) SQL-style syntax

Register the data as a temporary table:

scala> pDF.registerTempTable("t_persion")

 

Query the data:

scala> sqlContext.sql("select * from t_persion").show

+---+---------+---+

| id|     name|sex|

+---+---------+---+

|  6| xiaolizi| 39|

|  5| xiaoxiao| 27|

|  4| xiaoxiao| 26|

|  3|   xiaoli| 25|

| 10|     dada| 25|

|  2| xiaowang| 24|

|  9|     daye| 24|

|  1|xiaozhang| 23|

|  8|   dageda| 12|

|  7| xiaodaye| 10|

+---+---------+---+

 

View the column information:

scala> sqlContext.sql("desc t_persion").show

+--------+---------+-------+

|col_name|data_type|comment|

+--------+---------+-------+

|      id|   bigint|       |

|    name|   string|       |

|     sex|   bigint|       |

+--------+---------+-------+

 

1-5) Multi-table join example

A) Prepare the data

[root@hadoop1 ~]# hadoop fs -cat  /sparkSql/course.txt

C001,football

C002,music

C003,art

 

[root@hadoop1 ~]# hadoop fs -cat  /sparkSql/student.txt

S001,zhangsan,12,female

S002,lisi,13,male

S003,wangwu,14,male

S004,zhaoliu,15,female

 

[root@hadoop1 ~]# hadoop fs -cat  /sparkSql/student_course.txt

1,S001,C001

2,S002,C001

3,S002,C002

4,S003,C003

5,S003,C001

6,S004,C003

7,S004,C002

 

B) Scala code


import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by xiaoxu
  */
// Case classes describing the three tables
case class course(cid: String, cname: String)

case class student(sid: String, sanme: String, age: Int, gender: String)

case class student_course(id: Int, sid: String, cid: String)

object SqlText {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4")
    val conf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Set the log level for this run
    LoggerLevels.setStreamingLogLevels()
    // Read the data
    val courseDate = sc.textFile("hdfs://skycloud1:9000/sparkSql/course.txt")
    val studentDate = sc.textFile("hdfs://skycloud1:9000/sparkSql/student.txt")
    val student_courseDate = sc.textFile("hdfs://skycloud1:9000/sparkSql/student_course.txt")
    // Map each line to the corresponding case class
    val courseTable = courseDate.map(_.split(",")).map(courseInfo => {
      course(courseInfo(0), courseInfo(1))
    })
    val studentTable = studentDate.map(_.split(",")).map(studentInfo => {
      student(studentInfo(0), studentInfo(1), studentInfo(2).toInt, studentInfo(3))
    })
    val student_courseTable = student_courseDate.map(_.split(",")).map(student_courseInfo => {
      student_course(student_courseInfo(0).toInt, student_courseInfo(1), student_courseInfo(2))
    })

    // Convert the RDDs to DataFrames and cache them
    import sqlContext.implicits._
    val courseD: DataFrame = courseTable.toDF().cache()
    val studentD: DataFrame = studentTable.toDF().cache()
    val scD: DataFrame = student_courseTable.toDF().cache()
    // First approach: chained DataFrame joins; the duplicated join columns are removed automatically, roughly equivalent to the SQL shown below
    courseD.join(scD, "cid").join(studentD, "sid").show()
    // Second approach: register temporary tables and run SQL; select * keeps the duplicated join columns
    courseD.registerTempTable("course")
    studentD.registerTempTable("student")
    scD.registerTempTable("student_course")
    println("=========================")
    sqlContext.sql("select * from course c,student s,student_course sc where c.cid=sc.cid and s.sid=sc.sid").show()
  }
}

 

C) Setting the log level


import org.apache.log4j.{Logger, Level}
import org.apache.spark.Logging

/**
  * Created by Administrator on 2017/3/4.
  */
object LoggerLevels extends Logging {
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}

D) Output (result of the SQL query)

+----+--------+----+--------+---+------+---+----+----+

| cid|   cname| sid|   sanme|age|gender| id| sid| cid|

+----+--------+----+--------+---+------+---+----+----+

|C001|football|S001|zhangsan| 12|female|  1|S001|C001|

|C001|football|S002|    lisi| 13|  male|  2|S002|C001|

|C002|   music|S002|    lisi| 13|  male|  3|S002|C002|

|C001|football|S003|  wangwu| 14|  male|  5|S003|C001|

|C003|     art|S003|  wangwu| 14|  male|  4|S003|C003|

|C002|   music|S004| zhaoliu| 15|female|  7|S004|C002|

|C003|     art|S004| zhaoliu| 15|female|  6|S004|C003|

+----+--------+----+--------+---+------+---+----+----+

 

 

For the detailed execution process, see: http://blog.csdn.net/xfg0218/article/details/60332983

 

 

Running Spark SQL Queries Programmatically

Writing a Spark SQL query program

Add the following to pom.xml:

<dependency>

    <groupId>org.apache.spark</groupId>

    <artifactId>spark-sql_2.10</artifactId>

    <version>1.5.2</version>

</dependency>

1-1) A Spark SQL example

A) Code

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
// The case class must be defined at the top level (outside the object)
case class Person(id: Int, name: String, age: Int)
object InferringSchema {
  def main(args: Array[String]) {
    // Create SparkConf and set the application name
    val conf = new SparkConf().setAppName("SQL-1")
    // SQLContext depends on SparkContext
    val sc = new SparkContext(conf)
    // Create the SQLContext
    val sqlContext = new SQLContext(sc)
    // Create an RDD from the given path
    val lineRDD = sc.textFile(args(0)).map(_.split(","))
    // Associate each line of the RDD with the case class
    val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
    // Import the implicit conversions; without them the RDD cannot be converted to a DataFrame
    // Convert the RDD to a DataFrame
    import sqlContext.implicits._
    val personDF = personRDD.toDF
    // Register a temporary table
    personDF.registerTempTable("t_person")
    // Run the SQL
    val df = sqlContext.sql("select * from t_person order by age desc limit 2")
    // Save the result as JSON to the given path
    df.write.json(args(1))
    // Stop the SparkContext
    sc.stop()
  }
}

B) Submit the JAR

[root@hadoop1 bin]# ./spark-submit  --class InferringSchema --master spark://hadoop1:7077,hadoop2:7077 ../sparkJar/sparkSql.jar  hdfs://hadoop1:9000/sparkSql  hdfs://hadoop1:9000/sparktest

C) Monitor the job in the web UI

 

 

 

 

D) Check the output on HDFS

[root@hadoop1 bin]# hadoop fs -cat /sparkTest/part-r-00000-0182a87b-63af-4a5d-a223-0838447d27d2

{"id":6,"name":"xiaolizi","age":39}

{"id":5,"name":"xiaoxiao","age":27}

 

E) More information

http://blog.csdn.net/xfg0218/article/details/53045395

 

 

1-2) Specifying the schema explicitly

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

object SpecifyingSchema {

  def main(args: Array[String]) {
    // Create SparkConf and set the application name
    val conf = new SparkConf().setAppName("SQL-2")
    // SQLContext depends on SparkContext
    val sc = new SparkContext(conf)
    // Create the SQLContext
    val sqlContext = new SQLContext(sc)
    // Create an RDD from the given path
    val personRDD = sc.textFile(args(0)).map(_.split(","))
    // Specify each field's schema directly with StructType
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Map the RDD to an RDD of Row
    val rowRDD = personRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).toInt))
    // Apply the schema to the row RDD
    val personDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    // Register a temporary table
    personDataFrame.registerTempTable("t_person")
    // Run the SQL
    val df = sqlContext.sql("select * from t_person order by age desc limit 4")
    // Save the result as JSON to the given path
    df.write.json(args(1))
    // Stop the SparkContext
    sc.stop()
  }
}

 

 

Reading Data Through External Data Sources

1-1) Loading data via JDBC

Spark SQL can create a DataFrame by reading data from a relational database over JDBC, and after a series of computations on the DataFrame the results can be written back to the relational database.
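A minimal sketch of the DataFrame-level JDBC read (the connection details mirror the JdbcRDD example below but are placeholders):

import java.util.Properties
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

val props = new Properties()
props.put("user", "root")
props.put("password", "123456")

// Read the whole table into a DataFrame through the JDBC data source.
val bigdataDF = sqlContext.read.jdbc("jdbc:mysql://hadoop2:3306/ta", "bigdata", props)
bigdataDF.show()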

 

 

1-2) Reading data from MySQL

A) Code

import java.sql.DriverManager

import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object JdbcRDDDemo {

  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir",
      "E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4");
    val conf = new SparkConf().setAppName("JdbcRDDDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val connection = () => {
      Class.forName("com.mysql.jdbc.Driver").newInstance()
      DriverManager.getConnection("jdbc:mysql://hadoop2:3306/ta", "root", "123456")
    }

    val jdbcRDD = new JdbcRDD(
      sc,
      connection,
      "SELECT * FROM bigdata where id >= ? AND id <= ?",
      1, 4, 2,
      // 1 and 4 fill the two ? placeholders (lower/upper bound); 2 is the number of partitions
      r => {
        val id = r.getInt(1)
        val code = r.getString(2)
        (id, code)
      }
    )
    val jrdd = jdbcRDD.collect()
    println(jrdd.toBuffer)
    sc.stop()
  }
}

 

B) Output

***********

16/11/05 17:07:59 INFO DAGScheduler: Job 0 finished: collect at JdbcRDDDemo.scala:33, took 3.105903 s

ArrayBuffer((1,efef), (2,dfefe), (3,efefe), (4,dfgfr))

16/11/05 17:07:59 INFO SparkUI: Stopped Spark web UI at http://192.168.164.1:4040

**********

 

For details, see: http://blog.csdn.net/xfg0218/article/details/53046540

 

1-3) Saving data to MySQL

A) Code

import java.util.Properties

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

object SQLDemo2 {

  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir",
      "E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4");
    val conf = new SparkConf().setAppName("MySQL-Demo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Create an RDD by parallelizing a local collection
    val personRDD = sc.parallelize(Array("1 tom 5", "2 jerry 3", "3 kitty 6")).map(_.split(" "))
    // Specify each field's schema directly with StructType
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Map the RDD to an RDD of Row
    val rowRDD = personRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).toInt))
    // Apply the schema to the row RDD
    val personDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    // Store the database connection properties
    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "123456")
    // Append the data to the database table
    personDataFrame.write.mode("append").jdbc("jdbc:mysql://hadoop2:3306/ta", "bigdata", prop)
    // Stop the SparkContext
    sc.stop()
  }
}

 

B) Output

 

 

For details, see: http://blog.csdn.net/xfg0218/article/details/53046658

Spark SQL with Hive

1-1) Create a Hive table

hive> create table person(id bigint,name string,age int) row format delimited fields terminated  by ",";

OK

Time taken: 1.034 seconds

 

 

hive> show tables;

OK

person

Time taken: 0.309 seconds, Fetched: 1 row(s)

 

1-2) Copy the configuration files

A) Copy Hive's hive-site.xml into Spark's conf directory

[root@hadoop1 conf]# cp hive-site.xml  /usr/local/spark/conf/

 

B) Copy the HDFS configuration files

[root@hadoop1 conf]# cd /usr/local/hadoop-2.6.4/etc/hadoop/

[root@hadoop1 hadoop]# cp core-site.xml  hdfs-site.xml  /usr/local/spark/conf/

1-3) Start Spark

[root@hadoop1 bin]# ./spark-shell --master spark://hadoop1:7077,hadoop2:7077 --executor-memory 1g --total-executor-cores 2 --driver-class-path /usr/local/hive/lib/mysql-connector-java-5.1.35-bin.jar

 

*************

Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2

      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76)

Type in expressions to have them evaluated.

Type :help for more information.

Spark context available as sc.

 

*************

 

1-4) Query the data

scala> sqlContext.sql("select * from person")

res1: org.apache.spark.sql.DataFrame = [id: bigint, name: string, age: int]

 

scala> res1.toDF

res2: org.apache.spark.sql.DataFrame = [id: bigint, name: string, age: int]

 

scala> res2.show

+---+----+---+

| id|name|age|

+---+----+---+

+---+----+---+

 

Load data in Hive:

hive> load data local inpath "/usr/local/testDate/persion.text" into table person;

Loading data to table default.person

Table default.person stats: [numFiles=1, totalSize=130]

OK

Time taken: 4.167 seconds

 

Query the data again:

scala> res2.show

+---+---------+---+

| id|     name|age|

+---+---------+---+

|  1|xiaozhang| 23|

|  2| xiaowang| 24|

|  3|   xiaoli| 25|

|  4| xiaoxiao| 26|

|  5| xiaoxiao| 27|

|  6| xiaolizi| 39|

|  7| xiaodaye| 10|

|  8|   dageda| 12|

|  9|     daye| 24|

| 10|     dada| 25|

+---+---------+---+

1-5) Operating on Hive from Scala code

package sqlText
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
/**
  * Created by xiaoxu
  */
object SparkSQL2Hive {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf();
    conf.setAppName("SparkSQL2Hive for scala")
    conf.setMaster("spark://master1:7077")

    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    // People table (name, age)
    hiveContext.sql("use testdb")
    hiveContext.sql("DROP TABLE IF EXISTS people")
    hiveContext.sql("CREATE TABLE IF NOT EXISTS people(name STRING, age INT)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")
    // Load local data into Hive (the data is actually copied); data already on HDFS can be used directly as well
    hiveContext.sql("LOAD DATA LOCAL INPATH '/usr/local/sparkApps/SparkSQL2Hive/resources/people.txt' INTO TABLE people")
    // People scores table (name, score)
    hiveContext.sql("use testdb")
    hiveContext.sql("DROP TABLE IF EXISTS peopleScores")
    hiveContext.sql("CREATE TABLE IF NOT EXISTS peopleScores(name STRING, score INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")
    hiveContext.sql("LOAD DATA LOCAL INPATH '/usr/local/sparkApps/SparkSQL2Hive/resources/peopleScore.txt' INTO TABLE peopleScores")

    /**
      * Join the two Hive tables directly through HiveContext
      */
    val resultDF = hiveContext.sql("select pi.name,pi.age,ps.score "
      +" from people pi join peopleScores ps on pi.name=ps.name"
      +" where ps.score>90");
    /**
      * saveAsTable creates a Hive managed table: both the metadata and the storage location
      * are managed by the Hive warehouse, so dropping the table also deletes the data on disk
      */
    hiveContext.sql("drop table if exists peopleResult")
    resultDF.write.saveAsTable("peopleResult")

    /**
      * HiveContext.table reads a table from the Hive warehouse directly as a DataFrame,
      * which can then be used for machine learning, graph processing, or other complex ETL
      */
    val dataframeHive = hiveContext.table("peopleResult")
    dataframeHive.show()
  }
}

 

Running SQL with the spark-sql Shell

1-1) Launch command

[root@hadoop1 bin]# ./spark-sql --master spark://hadoop1:7077,hadoop2:7077 --executor-memory 1g --total-executor-cores 2 --driver-class-path /usr/local/hive/lib/mysql-connector-java-5.1.35-bin.jar

 

***************

16/11/05 19:01:07 INFO SessionState: Created HDFS directory: /tmp/hive/root/862a773f-674f-4949-9934-f6257fc5e434/_tmp_space.db

SET spark.sql.hive.version=1.2.1

SET spark.sql.hive.version=1.2.1

spark-sql> show databases;

********

default

Time taken: 12.808 seconds, Fetched 1 row(s)

*******

> use default;

*******

OK

16/11/05 19:10:13 INFO Driver: OK

*******

> show tables;

********

person false

Time taken: 0.591 seconds, Fetched 1 row(s)

********

> select * from person;

*********

1 xiaozhang 23

2 xiaowang 24

3 xiaoli 25

4 xiaoxiao 26

5 xiaoxiao 27

6 xiaolizi 39

7 xiaodaye 10

8 dageda 12

9 daye 24

10 dada 25

Time taken: 14.784 seconds, Fetched 10 row(s)

 

spark-sql> create table sparkSql(id int,name string) ;

********

OK

16/11/05 19:21:03 INFO Driver: OK

 

 

For details, see: http://blog.csdn.net/xfg0218/article/details/53053162

 

1-2) View the information saved in MySQL (the Hive metastore)

 

 

 

 

Running Scripts Directly with spark-sql

1-1) Prepare the variable definitions

[hadoop@N2-06-1 ~/kettle2/embrace/xiaoxu]$ cat common.property

set hivevar:db.allinfo=allinfo;

set hivevar:db.beer=beer;

set hivevar:db.default=default;

set hivevar:db.middle=middle;

set hivevar:db.orcdata=orcdata;

set hivevar:db.rawdata=rawdata;

set hivevar:db.result=result;

set hivevar:db.result2=result2;

set hivevar:db.temp=temp;

set hivevar:db.test=test;

set hivevar:db.test_robert=test_robert;

set hivevar:xiaoxu=xiaoxu;

 

set hivevar:hdfs.url=hdfs://198.218.33.81:8020;

 

1-2) View the execution process

[hadoop@N2-06-1 ~/kettle2/embrace/xiaoxu]$ spark-sql -i common.property -f test.sql

Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set

SET hive.support.sql11.reserved.keywords=false

SET spark.sql.hive.version=1.2.1

SET spark.sql.hive.version=1.2.1

SET hivevar:db.allinfo=allinfo

SET hivevar:db.allinfo=allinfo

hivevar:db.allinfo allinfo

SET hivevar:db.beer=beer

SET hivevar:db.beer=beer

hivevar:db.beer beer

SET hivevar:db.default=default

SET hivevar:db.default=default

hivevar:db.default default

SET hivevar:db.middle=middle

SET hivevar:db.middle=middle

hivevar:db.middle middle

SET hivevar:db.orcdata=orcdata

SET hivevar:db.orcdata=orcdata

hivevar:db.orcdata orcdata

SET hivevar:db.rawdata=rawdata

SET hivevar:db.rawdata=rawdata

hivevar:db.rawdata rawdata

SET hivevar:db.result=result

SET hivevar:db.result=result

hivevar:db.result result

SET hivevar:db.result2=result2

SET hivevar:db.result2=result2

hivevar:db.result2 result2

SET hivevar:db.temp=temp

SET hivevar:db.temp=temp

hivevar:db.temp temp

SET hivevar:db.test=test

SET hivevar:db.test=test

hivevar:db.test test

SET hivevar:db.test_robert=test_robert

SET hivevar:db.test_robert=test_robert

hivevar:db.test_robert test_robert

SET hivevar:xiaoxu=xiaoxu

SET hivevar:xiaoxu=xiaoxu

hivevar:xiaoxu xiaoxu

SET hivevar:hdfs.url=hdfs://198.218.33.81:8020

SET hivevar:hdfs.url=hdfs://198.218.33.81:8020

hivevar:hdfs.url hdfs://198.218.33.81:8020

OK

allinfo

beer

default

middle

orcdata

rawdata

result

result2

temp

test

test_robert

xiaoxu

Time taken: 2.331 seconds, Fetched 12 row(s)

 

As you can see, spark-sql first loads the configuration file and puts the values into global (hivevar) variables before running the script.
