Spark notes: DataFrame

1.1. What is a DataFrame

DataFrame was formerly known as SchemaRDD; starting with Spark 1.3.0, SchemaRDD was renamed to DataFrame. The main difference from SchemaRDD is that DataFrame no longer inherits directly from RDD, but it still implements most of RDD's functionality. You can still call the rdd method on a DataFrame to convert it to an RDD.

In Spark, a DataFrame is an RDD-based distributed data set, similar to a two-dimensional table in a traditional database. A DataFrame carries schema meta-information: every column of the two-dimensional data set it represents has a name and a type, and the underlying execution is more heavily optimized. A DataFrame can be constructed from many sources, such as an existing RDD, a structured file, an external database, or a Hive table.

1.2. Differences between RDD and DataFrame

An RDD can be seen as a distributed collection of objects, but Spark does not know the detailed internal structure of those objects. A DataFrame can be viewed as a distributed collection of Row objects that carries detailed schema information about its columns, which lets Spark SQL apply certain optimizations during execution. The logical difference between a DataFrame and an ordinary RDD is shown below:

The figure (not reproduced here) directly illustrates the difference between an RDD and a DataFrame.

The RDD[Person] on the left is parameterized with the Person type, but the Spark framework itself does not understand the internal structure of the Person class.

The DataFrame on the right, by contrast, provides detailed structural information, so Spark SQL knows exactly which columns the data set contains and what the name and type of each column are. In other words, a DataFrame carries more structural information about the data, i.e., a schema. It looks like a table, and a DataFrame also supports new ways of operating on data: the DataFrame API (such as df.select()) and SQL (select id, name from xx_table where ...).
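
For illustration, here is a minimal sketch of the two styles side by side, assuming a DataFrame df with id, name, and age columns and a temporary view named xx_table (both names are assumptions, not part of the original example):

// DataFrame API style (sketch): project and filter with method calls
val result1 = df.select("id", "name").where("age > 20")

// SQL style (sketch): register the DataFrame as a temporary view, then query it
df.createOrReplaceTempView("xx_table")
val result2 = spark.sql("select id, name from xx_table where age > 20")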

In addition, DataFrame introduces off-heap memory, meaning memory outside the JVM heap that is managed directly by the operating system rather than by the JVM. Spark can serialize data in binary form (without the structure) into off-heap memory and, when the data needs to be operated on, work on that off-heap memory directly. Because Spark understands the schema, it knows how to interpret and operate on those bytes.
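
As a hedged aside, Tungsten's off-heap storage is controlled by configuration; a minimal sketch follows (the application name and the 2g size are only examples and must be adjusted to your cluster):

// Sketch: enable off-heap memory via configuration when building the session
// spark.memory.offHeap.size must be set whenever off-heap is enabled
import org.apache.spark.sql.SparkSession

val sparkOffHeap = SparkSession.builder()
  .appName("offheap-demo")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .getOrCreate()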

An RDD is a distributed collection of Java objects, whereas a DataFrame is a distributed collection of Row objects. Besides offering richer operators than RDD, the more important features of DataFrame are improved execution efficiency, reduced data reads, and optimized execution plans.

With DataFrame as a high-level abstraction, processing data becomes much easier; we can even use SQL to process data, which greatly improves usability for developers.

Not only that: data processing expressed through the DataFrame API or SQL is automatically optimized by Spark's optimizer (Catalyst), so even if the program or SQL you write is not efficient, it can still run very fast.
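
A quick way to see Catalyst at work is to ask a DataFrame for its execution plans; a small sketch, assuming a DataFrame df already exists:

// Sketch: print the parsed, analyzed, optimized, and physical plans Catalyst produces
df.select("name").where("age > 20").explain(true)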

1.3. Advantages and disadvantages of RDD and DataFrame

Advantages and disadvantages of RDD:

Advantages:

(1) Compile-time type safety
                type errors can be caught at compile time

(2) Object-oriented programming style
                data can be manipulated directly by calling methods on objects

Disadvantages:

(1) Serialization and deserialization performance overhead
                whether for communication between cluster nodes or for IO operations, both the data and the structure of the object must be serialized and deserialized

(2) GC performance overhead
                frequent creation and destruction of objects inevitably increases GC pressure

By introducing a schema and off-heap memory (memory not in the JVM heap, managed by the operating system), DataFrame addresses the shortcomings of RDD. Because Spark can read the data through the schema, only the data itself needs to be serialized and deserialized for communication and IO, and the structural part can be omitted. By introducing off-heap memory, data can be operated on quickly while avoiding a large amount of GC. However, DataFrame gives up some advantages of RDD: it is not type-safe, and its API does not follow the object-oriented style.
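
A brief sketch of this trade-off, reusing the personRDD and personDF built in section 1.4 below (the misspelled column name is deliberately illustrative):

// RDD[Person]: field access is checked by the compiler; p.agee would not compile
val adultsRdd = personRDD.filter(p => p.age > 18)

// DataFrame: column names are plain strings, so a typo such as "agee"
// compiles fine and only fails at runtime with an AnalysisException
val adultsDf = personDF.filter("age > 18")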

1.4. Reading a data source to create a DataFrame

1.4.1. Reading a text file to create a DataFrame

Before Spark 2.0, SQLContext was the entry point in Spark SQL for creating DataFrames and executing SQL; you could also use HiveContext to operate on Hive tables with Hive SQL statements for Hive-compatible behavior, and HiveContext inherits from SQLContext. Since Spark 2.0, both are unified in SparkSession: SparkSession encapsulates SparkContext and SQLContext, and you can obtain the SparkContext and SQLContext objects through the SparkSession.
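
In spark-shell (Spark 2.x and later) a SparkSession named spark is created for you; a minimal sketch of reaching the older entry points from it:

// Sketch: the legacy entry points are still accessible from the SparkSession
val sc2 = spark.sparkContext    // the underlying SparkContext
val sqlCtx = spark.sqlContext   // the legacy SQLContext, kept for compatibility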

(1) Create a file locally with three columns, namely id, name, and age, separated by a space, then upload it to HDFS. person.txt contains:

1 zhangsan 20

2 lisi 29

3 wangwu 25

4 zhaoliu 30

5 tianqi 35

6 kobe 40

Upload data files to HDFS:

hdfs dfs -put person.txt  /

(2) Run the following commands in spark-shell to read the data and split each line into columns using the space delimiter.

First start the shell: spark-shell --master local[2]

val lineRDD= sc.textFile("/person.txt").map(_.split(" "))

(3) Define a case class (corresponding to the schema of the table)

case class Person(id:Int, name:String, age:Int)

(4) Associate the case class with the RDD

val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))

(5) Convert the RDD into a DataFrame

val personDF = personRDD.toDF

(6) Operate on the DataFrame

personDF.show

personDF.printSchema
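
Beyond show and printSchema, common operations can be chained; a small sketch (column names follow the Person case class above):

personDF.select("name", "age").show()     // project two columns
personDF.filter("age > 25").show()        // filter rows with a SQL-style condition
personDF.groupBy("age").count().show()    // aggregate: count rows per age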

(7) Construct a DataFrame through SparkSession

Use the SparkSession object spark that spark-shell has already initialized to generate the DataFrame.

val dataFrame=spark.read.text("/person.txt")
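
Note that spark.read.text produces a DataFrame with a single string column named value; to get typed id, name, and age columns you still need to split and cast, for example (a sketch):

// Sketch: split the single "value" column produced by spark.read.text into typed columns
import org.apache.spark.sql.functions._

val personDF2 = spark.read.text("/person.txt")
  .select(split(col("value"), " ").as("parts"))
  .select(
    col("parts").getItem(0).cast("int").as("id"),
    col("parts").getItem(1).as("name"),
    col("parts").getItem(2).cast("int").as("age")
  )
personDF2.printSchema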