Spark: reading and writing data, and conversions between the three abstractions (supplement)

Reading and writing data

package com.test.spark

import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}

/**
  * @author Administrator
  *         2019/7/22-17:09
  *
  */
object TestReadData {
  val spark = SparkSession
    .builder()
    .appName("TestCreateDataset")
    .config("spark.some.config.option", "some-value")
    .master("local")
    .enableHiveSupport()
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    testRead()
  }

  def testRead(): Unit = {
    // A corrupted parquet file can easily cause obscure, hard-to-diagnose errors
    val parquet: Dataset[Row] = spark.read.parquet("D:\\DATA-LG\\PUBLIC\\TYGQ\\INF\\PersonInfoCommon.parquet")
    parquet.show()

    // Spark SQL generic read API: format().load()
    val commonRead: Dataset[Row] = spark.read.format("json").load("D:\\DATA-LG\\PUBLIC\\TYGQ\\INF\\testJson")
    commonRead.show()
    // Spark SQL generic write API: format().mode().save()
    commonRead.write.format("parquet").mode(SaveMode.Append).save("D:\\DATA-LG\\PUBLIC\\TYGQ\\INF\\PersonInfoCommon.parquet")

    // Spark SQL format-specific read API, e.g. json()
    val professionalRead: Dataset[Row] = spark.read.json("D:\\DATA-LG\\PUBLIC\\TYGQ\\INF\\testJson")
    professionalRead.show()
    // Spark SQL format-specific write API, e.g. parquet()
    professionalRead.write.mode(SaveMode.Append).parquet("D:\\DATA-LG\\PUBLIC\\TYGQ\\INF\\PersonInfoProfessional.parquet")

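    // Run SQL directly against a parquet file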
    val readParquet: Dataset[Row] = spark.sql("select * from parquet.`D:\\DATA-LG\\PUBLIC\\TYGQ\\INF\\PersonInfoCommon.parquet`")
    readParquet.show()
  }

}

// Output:
+---+---------------+---------+
|age|             ip|     name|
+---+---------------+---------+
| 24|    192.168.0.8|  lillcol|
|100|  192.168.255.1|    adson|
| 39|  192.143.255.1|     wuli|
| 20|  192.168.255.1|       gu|
| 15|  243.168.255.9|     ason|
|  1|  108.168.255.1|   tianba|
| 25|222.168.255.110|clearlove|
| 30|222.168.255.110|clearlove|
+---+---------------+---------+

+---+---------------+---------+
|age|             ip|     name|
+---+---------------+---------+
| 24|    192.168.0.8|  lillcol|
|100|  192.168.255.1|    adson|
| 39|  192.143.255.1|     wuli|
| 20|  192.168.255.1|       gu|
| 15|  243.168.255.9|     ason|
|  1|  108.168.255.1|   tianba|
| 25|222.168.255.110|clearlove|
| 30|222.168.255.110|clearlove|
+---+---------------+---------+

+---+---------------+---------+
|age|             ip|     name|
+---+---------------+---------+
| 24|    192.168.0.8|  lillcol|
|100|  192.168.255.1|    adson|
| 39|  192.143.255.1|     wuli|
| 20|  192.168.255.1|       gu|
| 15|  243.168.255.9|     ason|
|  1|  108.168.255.1|   tianba|
| 25|222.168.255.110|clearlove|
| 30|222.168.255.110|clearlove|
+---+---------------+---------+

+---+---------------+---------+
|age|             ip|     name|
+---+---------------+---------+
| 24|    192.168.0.8|  lillcol|
|100|  192.168.255.1|    adson|
| 39|  192.143.255.1|     wuli|
| 20|  192.168.255.1|       gu|
| 15|  243.168.255.9|     ason|
|  1|  108.168.255.1|   tianba|
| 25|222.168.255.110|clearlove|
| 30|222.168.255.110|clearlove|
| 24|    192.168.0.8|  lillcol|
|100|  192.168.255.1|    adson|
| 39|  192.143.255.1|     wuli|
| 20|  192.168.255.1|       gu|
| 15|  243.168.255.9|     ason|
|  1|  108.168.255.1|   tianba|
| 25|222.168.255.110|clearlove|
| 30|222.168.255.110|clearlove|
+---+---------------+---------+

Storage

File Saving Options

Mode           Note
-------------  ------------------------------------------------------------
Append         The DataFrame contents are appended to the existing data.
Overwrite      The existing data is overwritten by the DataFrame contents.
ErrorIfExists  If data already exists at the target, an error is thrown.
Ignore         If data already exists at the target, nothing is done.

NOTE: These save modes do not use any locking and are not atomic.
If you use Overwrite and the output path is also the path of the data source, persist (materialize) the data first;
otherwise the path is deleted before the data has actually been read, and the subsequent lazy read fails with a "file does not exist" error.
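A minimal sketch of this workaround, assuming the spark session and parquet path from the read example above:

import org.apache.spark.sql.SaveMode

def overwriteInPlace(): Unit = {
  val path = "D:\\DATA-LG\\PUBLIC\\TYGQ\\INF\\PersonInfoCommon.parquet"

  // Reading is lazy: nothing is loaded until an action runs.
  val df = spark.read.parquet(path)

  // Materialize the data first. Without this, Overwrite deletes the path
  // before the lazy read has happened and the job fails with a "file does not exist" error.
  df.persist()
  df.count()

  // Now it is safe to overwrite the source path with the cached data.
  df.write.mode(SaveMode.Overwrite).parquet(path)
  df.unpersist()
}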


Conversion between types

Conversions between Spark's three abstractions used to be a bit confusing,
so here is a summary of the three and of how to convert between them.

In Spark SQL, Spark provides two new abstractions: DataFrame and DataSet.
How do they relate to RDD?
First, in terms of when they appeared: RDD (Spark 1.0) -> DataFrame (Spark 1.3) -> DataSet (Spark 1.6).

If the same data is handed to each of these three data structures and the same computation is run, they all produce the same result.
What differs is their efficiency and the way they are executed.
In later versions of Spark, DataSet will gradually replace RDD and DataFrame as the only API interface,
so development should increasingly be oriented toward DataSet.

RDD
  1. RDD (Resilient Distributed Dataset) is the cornerstone of Spark computation. It shields users from the complexity of the underlying data abstraction and processing, and provides convenient methods for transforming and evaluating data.
  2. An RDD is an immutable, lazily evaluated, parallel collection that supports lambda-style transformations (a minimal sketch follows this list).
  3. The greatest benefit of RDD is simplicity: the API is highly user-friendly.
  4. The drawback of RDD is performance: its elements live as objects in the JVM heap, which increases Java serialization cost, adds GC pressure, and limits the volume of data it can handle.
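
A minimal sketch of points 1-2, assuming the spark session built in the read example above: transformations are lambda expressions and remain lazy until an action is called.

val nums = spark.sparkContext.parallelize(1 to 10)  // a distributed collection
val doubled = nums.map(_ * 2).filter(_ > 10)        // lazy: nothing has run yet
println(doubled.collect().mkString(","))            // the action triggers the computation: 12,14,16,18,20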
DataFrame
  1. Like RDD, DataFrame is a distributed data container.
  2. However, a DataFrame is more like a two-dimensional table in a traditional database: besides the data, it also records the structure of the data, i.e. the schema.
  3. Like Hive, DataFrame supports nested data types (struct, array, and map); see the sketch after this list.
  4. In terms of API ease of use, the DataFrame API offers a high-level set of relational operations that is friendlier, and has a lower barrier to entry, than the functional RDD API.
  5. Because DataFrame is similar to the data frames of R and Pandas, Spark DataFrame inherits the development experience of traditional single-machine data analysis.
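
A small sketch of points 2-3 (the column names are made up for illustration, and spark is the session from the read example): a DataFrame carries a schema, including nested struct, array, and map columns.

import org.apache.spark.sql.functions.{array, col, lit, map, struct}
import spark.implicits._

val people = Seq(("lillcol", 24, "192.168.0.8")).toDF("name", "age", "ip")
val nested = people.select(
  struct(col("name"), col("age")).as("person"),  // struct column
  array(col("ip"), lit("127.0.0.1")).as("ips"),  // array column
  map(lit("age"), col("age")).as("attrs")        // map column
)
nested.printSchema()  // shows the nested types in the schema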

Q: Why does a DataFrame perform much better than an RDD?

A: There are two main reasons:

  1. Custom memory management
    Data is stored off-heap in a binary format, which saves a lot of space and escapes the limitations of the GC.
    [Figure: DataFrame custom memory management]
  2. Optimized execution plans
    Query plans are optimized by the Spark Catalyst optimiser; the sketch below shows one way to inspect the optimized plan.
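
A quick way to look at what Catalyst produces, assuming df is any DataFrame (for example the json data read earlier):

import spark.implicits._

val filtered = df.filter($"age" > 20).select($"name")
filtered.explain(true)  // prints the parsed, analyzed, and optimized logical plans plus the physical plan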
DataSet
  1. DataSet is an extension of the DataFrame API and is Spark's newest data abstraction.
  2. It has a user-friendly API style, offering both compile-time type safety and the query-optimization characteristics of DataFrame.
  3. DataSet uses encoders: when accessing off-heap data it can avoid deserializing the whole object, improving efficiency.
  4. Case classes are used to define the structure of the data in a DataSet; each case class attribute name maps directly to a field name in the DataSet.
  5. DataFrame is a special case of DataSet: type DataFrame = Dataset[Row]. A DataFrame can therefore be converted to a DataSet with the as method. Row is a type, just like Car or Person; all table structure information is represented by Row.
  6. DataSet is strongly typed: for example, there can be a Dataset[Car] or a Dataset[Person].
    A DataFrame only knows the field names, not their types, so mistakes such as subtracting from a String cannot be caught at compile time and only fail at execution time.
    A DataSet knows both the field names and their types, so it offers much stricter error checking (see the sketch after this list).
    The relationship is analogous to that between JSON objects and class instances.
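
A minimal sketch of point 6, reusing the Person case class defined in the conversion code below: a type error on a DataSet is caught at compile time, while the same mistake on a DataFrame only fails when the query is analyzed at run time.

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

val ds: Dataset[Person] = Seq(Person("lillcol", 24)).toDS()
val df: DataFrame = ds.toDF()

ds.map(p => p.age + 1)  // typed: something like p.name - 1 would not even compile
df.select($"name" - 1)  // untyped: compiles fine, but throws an AnalysisException at run time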

Conversion between the three

// The case class must be defined outside the method that references it.
case class Person(name: String, age: Long) extends Serializable

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._ // spark is the SparkSession built in the read example

// Conversions between the types: note the result type of each conversion.
def rddSetFrame() = {
  // When using toDF / toDS, import spark.implicits._ first, otherwise they are not available.
  val rdd: RDD[String] = spark.sparkContext.textFile("D:\\DATA-LG\\PUBLIC\\TYGQ\\INF\\testFile")
  val ds: Dataset[Row] = spark.read.json("D:\\DATA-LG\\PUBLIC\\TYGQ\\INF\\testJson")
  val df: DataFrame = rdd.map(_.split(",")).map(strArr => (strArr(0).trim(), strArr(1).trim().toInt)).toDF("name", "age")

  // rdd -> df
  // Usually a tuple holds the values of one row, and the column names are given in toDF.
  val rddTDf: DataFrame = rdd.map(_.split(",")).map(strArr => (strArr(0).trim(), strArr(1).trim().toInt)).toDF("name", "age")
  // df -> rdd
  val dfTRdd: RDD[Row] = df.rdd

  // rdd -> ds
  // The case class already defines the field names and types; we only need to fill in the values.
  val rddTDs: Dataset[Person] = rdd.map(_.split(",")).map(strArr => Person(strArr(0).trim(), strArr(1).trim().toInt)).toDS()
  // ds -> rdd
  val dsTRdd: RDD[Person] = rddTDs.rdd

  // df -> ds
  // Once the column types are given, the as method turns the DataFrame into a DataSet.
  // This is very convenient when you have a DataFrame but need to work with the individual fields.
  val dfTDs: Dataset[Person] = df.as[Person]
  // ds -> df
  // This simply wraps the case class into Row.
  val dsTDf: DataFrame = rddTDs.toDF
}
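
A quick sanity check that could be appended inside rddSetFrame, where these values are in scope, to see what each conversion actually holds:

rddTDf.printSchema()     // name: string, age: integer -- rows are untyped Row objects
rddTDs.printSchema()     // name: string, age: bigint (from the Long in Person) -- elements are typed Person values
println(dfTRdd.first())  // prints a Row
println(dsTRdd.first())  // prints a Person case class instance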

Origin www.cnblogs.com/lillcol/p/11229033.html