Spark Learning (8): An Introduction to Spark SQL

1. Spark SQL Overview

  1.1 What is Spark SQL

  1.2 Why Spark SQL

2. DataFrames

  2.1 What is a DataFrame

  2.2 Creating DataFrames

3. Common DataFrame Operations

  3.1 DSL-Style Syntax

  3.2 SQL-Style Syntax

4. Spark SQL Programming Examples

  4.1 Preparation

  4.2 Inferring the Schema via Reflection

  4.3 Specifying the Schema Directly with StructType

 


1. Spark SQL Overview

  1.1 What is Spark SQL

  

  

  Spark SQL is a Spark module for processing structured data. It provides a programming abstraction called DataFrame and also acts as a distributed SQL query engine.

  1.2 Why Spark SQL

  We have already learned Hive, which converts Hive SQL into MapReduce jobs and submits them to the cluster for execution. This greatly simplifies the work of writing MapReduce programs, but the MapReduce computation model makes execution relatively slow. Spark SQL was created to address this: it converts Spark SQL queries into RDD operations and submits them to the cluster, so execution is very fast. Its main features are:

  1. Easy integration

  2. Unified data access (a short sketch follows this list)

  3. Compatibility with Hive

  4. Standard data connectivity
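  The sketch below illustrates point 2: one read API returns DataFrames regardless of the source format. It assumes Spark 1.4+ and hypothetical input files people.json and people.parquet that are not part of this article.

// Unified data access: the same DataFrame API regardless of the file format
val jsonDF    = sqlContext.read.json("hdfs://node1.xiaoniu.com:9000/people.json")
val parquetDF = sqlContext.read.parquet("hdfs://node1.xiaoniu.com:9000/people.parquet")

// Both results are DataFrames, so the same operations (and SQL) apply to either
jsonDF.printSchema
parquetDF.select("name").show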

2. DataFrames

  2.1 What is a DataFrame

  Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: besides the data itself, it also records the structure of the data, i.e. its schema. Like Hive, DataFrames support nested data types (struct, array and map). From the point of view of API usability, the DataFrame API offers a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Because DataFrames are similar to data frames in R and Pandas, Spark DataFrames inherit much of the development experience of traditional single-machine data analysis.
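  A minimal sketch of what "recording the schema" means, assuming a hypothetical Employee case class with a nested struct, an array and a map (none of this is part of the person.txt example used below):

// Hypothetical nested types: Contact becomes a struct, Seq an array, Map a map
case class Contact(email: String, phone: String)
case class Employee(id: Int, name: String, contact: Contact,
                    skills: Seq[String], scores: Map[String, Int])

import sqlContext.implicits._
val empDF = sc.parallelize(Seq(
  Employee(1, "zhangsan", Contact("zs@example.com", "123"), Seq("spark", "hive"), Map("math" -> 90))
)).toDF

// printSchema shows the recorded structure, including the nested types
empDF.printSchema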


  2.2 Creating DataFrames

// 1. Create a local file with three columns (id, name, age) separated by spaces, then upload it to HDFS
hdfs dfs -put person.txt /

// 2. In the Spark shell, run the following commands to read the data and split each line on the separator
val lineRDD = sc.textFile("hdfs://node1.xiaoniu.com:9000/person.txt").map(_.split(" "))

// 3. Define a case class (it plays the role of the table's schema)
case class Person(id: Int, name: String, age: Int)

// 4. Associate the RDD with the case class
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))

// 5. Convert the RDD into a DataFrame
val personDF = personRDD.toDF

// 6. Operate on the DataFrame
personDF.show
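  As a side note (this variant is an assumption for quick experiments, not one of the original steps), toDF can also take explicit column names on an RDD of tuples, which avoids defining a case class:

// Build a DataFrame from an RDD of tuples by naming the columns explicitly
val tupleRDD = lineRDD.map(x => (x(0).toInt, x(1), x(2).toInt))
val personDF2 = tupleRDD.toDF("id", "name", "age")
personDF2.show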

3. Common DataFrame Operations

  3.1 DSL-Style Syntax

// Show the contents of the DataFrame
personDF.show

// Show only some columns of the DataFrame
personDF.select(personDF.col("name")).show
personDF.select(col("name"), col("age")).show
personDF.select("name").show

// Print the schema of the DataFrame
personDF.printSchema

// Select id, name and age, with age increased by 1
personDF.select(col("id"), col("name"), col("age") + 1).show
personDF.select(personDF("id"), personDF("name"), personDF("age") + 1).show

// Keep only the rows where age is greater than or equal to 18
personDF.filter(col("age") >= 18).show

// Group by age and count how many people share each age
personDF.groupBy("age").count().show()
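  The DSL operators can be chained; the sketch below (an illustrative query, not from the original post) filters, aggregates and sorts in one expression:

// Count adults per age and sort by that count, all in DSL style
import org.apache.spark.sql.functions._
personDF.filter(col("age") >= 18)
  .groupBy("age")
  .agg(count("id").as("cnt"))
  .orderBy(col("cnt").desc)
  .show()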

  3.2 SQL-Style Syntax

// To use SQL-style syntax, the DataFrame must first be registered as a temporary table
personDF.registerTempTable("t_person")

// Query the two oldest people
sqlContext.sql("select * from t_person order by age desc limit 2").show

// Show the schema of the table
sqlContext.sql("desc t_person").show
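  Any SQL the engine supports can be run against the registered table, for example (an illustrative query, not from the original post):

// Group by age in SQL instead of the DSL
sqlContext.sql("select age, count(*) as cnt from t_person group by age order by cnt desc").show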

4. Spark SQL Programming Examples

  4.1 Preparation

  In the previous sections we used SQL to run queries in the Spark shell; now we will write a Spark SQL query program in our own application. First, add the Spark SQL dependency to the Maven project's pom.xml:

<dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
</dependency>

  4.2 Inferring the Schema via Reflection

  Create an object named cn.xiaoniu.spark.sql.InferringSchema:

package cn.xiaoniu.spark.sql

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object InferringSchema {

  def main(args: Array[String]): Unit = {

    // Create a SparkConf and set the application name
    val conf = new SparkConf().setAppName("SQL-1")
    // SQLContext depends on SparkContext
    val sc = new SparkContext(conf)
    // Create the SQLContext
    val sqlContext = new SQLContext(sc)

    // Create an RDD from the given path
    val lineRDD = sc.textFile(args(0)).map(_.split(" "))

    // Associate the RDD with the case class (defined below)
    val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
    // Import implicit conversions; without them the RDD cannot be converted to a DataFrame
    import sqlContext.implicits._
    // Convert the RDD into a DataFrame
    val personDF = personRDD.toDF
    // Register the DataFrame as a temporary table
    personDF.registerTempTable("t_person")
    // Run the SQL query
    val df = sqlContext.sql("select * from t_person order by age desc limit 2")
    // Save the result as JSON to the given output path
    df.write.json(args(1))
    // Stop the SparkContext
    sc.stop()
  }
}
// The case class must be defined outside the object
case class Person(id: Int, name: String, age: Int)
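  A note on running it: the program reads the input path from args(0) and writes the JSON result to args(1), so both must be passed when the job is submitted. For a quick local test (an assumption about your environment, not part of the original code) a local master can be set on the SparkConf:

// Local debugging only: run with an embedded master instead of submitting to a cluster
val conf = new SparkConf().setAppName("SQL-1").setMaster("local[2]")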

  4.3 Specifying the Schema Directly with StructType

  Create an object named cn.xiaoniu.spark.sql.SpecifyingSchema:

package cn.xiaoniu.spark.sql

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.apache.spark.{SparkContext, SparkConf}

object SpecifyingSchema {
  def main(args: Array[String]): Unit = {
    // Create a SparkConf and set the application name
    val conf = new SparkConf().setAppName("SQL-2")
    // SQLContext depends on SparkContext
    val sc = new SparkContext(conf)
    // Create the SQLContext
    val sqlContext = new SQLContext(sc)
    // Create an RDD from the given path
    val personRDD = sc.textFile(args(0)).map(_.split(" "))
    // Specify the schema of each field directly with StructType
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Map the RDD to an RDD of Row
    val rowRDD = personRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).toInt))
    // Apply the schema to the row RDD
    val personDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    // Register the DataFrame as a temporary table
    personDataFrame.registerTempTable("t_person")
    // Run the SQL query
    val df = sqlContext.sql("select * from t_person order by age desc limit 4")
    // Save the result as JSON to the given output path
    df.write.json(args(1))
    // Stop the SparkContext
    sc.stop()
  }
}
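  A few words on choosing between the two approaches: the reflection-based method in 4.2 is the most concise when the columns are fixed and known at compile time, while the StructType approach in 4.3 lets the schema be built at runtime (for example from a configuration file or a header line), at the cost of mapping each line to a Row yourself.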
