Methods for Creating DataFrames in Spark SQL

In Spark SQL, the SparkSession is the entry point for creating DataFrames and executing SQL.
A DataFrame can be created in three ways:
(1) by converting an existing RDD
(2) from structured data sources such as JSON / Parquet / CSV / ORC / JDBC
(3) from a query against a Hive table

Core idea:
Creating a DataFrame means assembling "an RDD + meta information (a schema definition)":
the data comes from the RDD;
the schema may be defined by the developer or inferred from the data by the framework.
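
All of the examples below assume a SparkSession named spark (and the RDD / DataFrame imports) are already available; a minimal sketch of how they might be set up (the appName value is a placeholder):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// minimal local SparkSession used by the examples below
val spark: SparkSession = SparkSession
  .builder()
  .appName("CreateDataFrameDemo")
  .master("local[*]")
  .getOrCreate()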

1. Creating a DataFrame from an RDD

1.1 Creating a DataFrame from an RDD[case class]

Define a case class to encapsulate the data. In the example below, the case class is Stu:

case class Stu(id: Int, name: String, age: Int, city: String, score: Double) 

The code is as follows:

// RDD[String]
val rdd: RDD[String] = spark.sparkContext.textFile("doc/stu.csv")

val data: RDD[Stu] = rdd.map(line => {
  // split the fields
  val arr = line.split(",")
  // map each line to a case class instance
  Stu(arr(0).toInt, arr(1), arr(2).toInt, arr(3), arr(4).toDouble)
})

// convert to a DataFrame with .toDF
import spark.implicits._
val df: DataFrame = data.toDF()
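
Since the schema is taken from the case class, the resulting columns are named after the Stu fields (id, name, age, city, score) with the matching types, which can be checked with:

df.printSchema()
df.show()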

1.2 Creating a DataFrame from an RDD[Tuple]

val rddTuple: RDD[(Int, String, Int, String, Double)] = rdd
  // split the fields
  .map(_.split(","))
  // map each line to a tuple
  .map(arr => (arr(0).toInt, arr(1), arr(2).toInt, arr(3), arr(4).toDouble))
import spark.implicits._
// specify the column names
val df2 = rddTuple.toDF("id", "name", "age", "city", "score")
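
If toDF() were called here without arguments, the columns would default to the tuple positions _1 ... _5, so passing explicit names is usually preferable:

// without explicit names the columns are _1, _2, _3, _4, _5
val dfDefault = rddTuple.toDF()
dfDefault.printSchema()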

1.3 Creating a DataFrame from an RDD[JavaBean]

Note: the bean referred to here is a bean class defined in Java:

public class Stu2 {
    private int id;
    private String name;
    private int age;
    private String city;
    private double score;

    public Stu2(int id, String name, int age, String city, double score) {
        this.id = id;
        this.name = name;
        this.age = age;
        this.city = city;
        this.score = score;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    // getters and setters for name, age, city and score omitted for brevity
}

Sample code:

val rddBean: RDD[Stu2] = rdd
  // split the fields
  .map(_.split(","))
  // map each line to a JavaBean instance
  .map(arr => new Stu2(arr(0).toInt, arr(1), arr(2).toInt, arr(3), arr(4).toDouble))
val df = spark.createDataFrame(rddBean, classOf[Stu2])
df.show()

Note: RDD[JavaBean] is not supported by the toDF method provided by spark.implicits._; use spark.createDataFrame instead.

1.4 Creating a DataFrame from an RDD of ordinary Scala classes

Definition of an ordinary Scala bean class (the @BeanProperty annotation generates JavaBean-style getters for the framework):

import scala.beans.BeanProperty

class Stu3(
            @BeanProperty
            val id: Int,
            @BeanProperty
            val name: String,
            @BeanProperty
            val age: Int,
            @BeanProperty
            val city: String,
            @BeanProperty
            val score: Double)

Sample code:

val rddStu3: RDD[Stu3] = rdd
  // split the fields
  .map(_.split(","))
  // map each line to an ordinary Scala object
  .map(arr => new Stu3(arr(0).toInt, arr(1), arr(2).toInt, arr(3), arr(4).toDouble))
val df = spark.createDataFrame(rddStu3, classOf[Stu3])
df.show()

1.5 Creating a DataFrame from an RDD[Row]

Note: the data in a DataFrame is, in essence, encapsulated in an RDD, but the element type T of that RDD is always Row, a type defined by the framework; in other words, a DataFrame is an RDD[Row] plus a schema.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, StructType}

val rddRow = rdd
  // split the fields
  .map(_.split(","))
  // map each line to a Row object
  .map(arr => Row(arr(0).toInt, arr(1), arr(2).toInt, arr(3), arr(4).toDouble))

val schema = new StructType()
  .add("id", DataTypes.IntegerType)
  .add("name", DataTypes.StringType)
  .add("age", DataTypes.IntegerType)
  .add("city", DataTypes.StringType)
  .add("score", DataTypes.DoubleType)

val df = spark.createDataFrame(rddRow,schema)
df.show()
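
Conversely, the underlying RDD[Row] can be recovered from any DataFrame, which illustrates the note at the start of this section:

// every DataFrame exposes its underlying RDD[Row]
val backToRows: RDD[Row] = df.rdd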

2. Creating a DataFrame from a structured file

2.1 Created from a JSON file

val df = spark.read/*.schema(schema)*/.json("data/people.json")
df.printSchema()
df.show()

2.2 Created from a CSV file

2.2.1 Created from a CSV file (without a header)

val df = spark.read.csv("data/stu.csv")
df.printSchema()
df.show()
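
Without a header or schema, the framework names the columns _c0, _c1, ... and types every column as string; they can still be renamed afterwards, for example:

// rename the default _c0.._c4 columns (types remain string)
val named = df.toDF("id", "name", "age", "city", "score")
named.printSchema()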

2.2.2 Created from a CSV file with a custom schema (without a header)

// pass a custom schema when creating the DataFrame
// in the API a schema is described by StructType, and each field by StructField
val schema = new StructType()
  .add("id", DataTypes.IntegerType)
  .add("name", DataTypes.StringType)
  .add("age", DataTypes.IntegerType)
  .add("city", DataTypes.StringType)
  .add("score", DataTypes.DoubleType)

val df = spark.read.schema(schema).csv("data/stu.csv")
df.printSchema()
df.show()

2.2.3 Created from a CSV file (with a header)

The key is to set the option header = true:

val df = spark.read.option("header",true).csv("data/stu.csv")
df.printSchema()
df.show()

Although the column names are now correct, the field types still cannot be determined, and by default everything is treated as String. Setting the option inferSchema = true lets the framework infer reasonable types for the CSV fields:

val df = spark.read
  .option("header",true)
  .option("inferSchema",true)
  .csv("data/stu.csv")
df.printSchema()
df.show()

Letting the framework infer the schema automatically is inefficient and not recommended.
Note:
1. with only option("header", true), all fields have names but all are typed as String;
2. with both option("header", true) and option("inferSchema", true), all fields have names and types, but this is inefficient;
3. with option("header", true) plus a self-defined schema, both names and types are available and this is the most efficient approach (a sketch follows below).
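
A minimal sketch of option 3, reusing the custom schema defined in 2.2.2:

// header = true only skips the header line; the types come from the custom schema
val df = spark.read
  .option("header", true)
  .schema(schema)
  .csv("data/stu.csv")
df.printSchema()
df.show()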

2.3 Created from Parquet files

val df = spark.read.parquet("data/parquet")
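
Parquet files embed their own schema, so neither a custom schema nor inferSchema is needed. A sketch of how such a directory might be produced from any existing DataFrame and read back (the df variable and the output path data/parquet_out are assumptions):

// write any DataFrame out as Parquet; the schema is stored in the files
df.write.parquet("data/parquet_out")

// read it back; column names and types are recovered from the Parquet metadata
val dfParquet = spark.read.parquet("data/parquet_out")
dfParquet.printSchema()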

3. Creating a DataFrame by reading data from an external server

3.1 Created from a database server via a JDBC connection

Note: reading data from a database over JDBC depends on the database's JDBC driver jar, so the driver dependency must be added, e.g. for MySQL:

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.39</version>
</dependency>

Code Example:

val props = new Properties()
props.setProperty("user","root")
props.setProperty("password","root")
val df = spark.read.jdbc("jdbc:mysql://localhost:3306/bigdata","student",props)
df.show()
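
Equivalently, the same read can be expressed through the generic format/option API instead of a Properties object (connection parameters as assumed above):

val df2 = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/bigdata")
  .option("dbtable", "student")
  .option("user", "root")
  .option("password", "root")
  .load()
df2.show()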

3.2 Created by loading from the Hive warehouse

Steps:

  1. Add the spark-hive dependency jar to the project
  2. Add the mysql connection driver dependency jar to the project

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.4.4</version>
</dependency>

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.48</version>
</dependency>

  3. Add the hive-site.xml / core-site.xml / hdfs-site.xml configuration files to the project
  4. Call the .enableHiveSupport() method when creating the SparkSession
val spark = SparkSession
  .builder()
  .appName(this.getClass.getSimpleName)
  .master("local[*]")
  // enabling hive support requires calling enableHiveSupport and adding the spark-hive dependency
  // by default Spark SQL ships with its own built-in hive
  // if a hive-site config file can be loaded from the classpath, the hive metastore accessed is no longer the local built-in one but the metastore specified in the config
  // if a core-site config file can be loaded from the classpath, the file system accessed is no longer the local file system but the hdfs file system specified in the config
  .enableHiveSupport()
  .getOrCreate()
  5. Load the hive table
spark.sql(
  """
    |select * from stu
    |
    |""".stripMargin).show()