Parquet: Introduction and Basic Usage (repost)

==> What is Parquet

        Parquet is a columnar storage file format

 

==> official website description:

            Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language


 

==> Origin

    Parquet was inspired by Dremel, described in a paper Google published in 2010. The paper introduces a storage format that supports nested structures and uses columnar storage to improve query performance; it also describes how Google uses this storage format for parallel queries. If you are interested, refer to the paper and to the open-source implementation Apache Drill.

==> Features:

    ---> Can skip data that does not match the filter conditions and read only the required data, reducing the amount of IO

    ---> Compression encodings reduce disk storage space (because values in the same column share the same data type, more efficient encodings such as Run Length Encoding and Delta Encoding can be used to save additional space)

    ---> Reading only the needed columns supports vectorized operations and gives better scan performance

    ---> Parquet is the default data source format of Spark SQL; the default can be configured via spark.sql.sources.default (see the sketch after this list)
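
A small sketch of the last point: because Parquet is the default data source, spark.read.load and DataFrame.write.save use it when no format is specified, and the default can be checked or changed at runtime through the session configuration (spark.conf is the standard handle; switching to "json" is just an example).

// check and change the default data source for the current session
spark.conf.get("spark.sql.sources.default")          // "parquet"
spark.conf.set("spark.sql.sources.default", "json")  // subsequent load/save calls default to JSON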

 

==> Common Parquet operations

    ---> load and save functions

// Read Parquet file
val usersDF = spark.read.load("/test/users.parquet")

// Query Schema and Data
usersDF.printSchema
usersDF.show

// select the user's name and favorite color and save the result
usersDF.select($"name", $"favorite_color").write.save("/test/result/parquet")
// verify the result: printSchema shows the data structure, show displays the data

// explicitly specify the file format: load a JSON file
val peopleDF = spark.read.format("json").load("/test/people.json")

// Save Modes
// save operations can take a SaveMode, which defines how existing data at the target is handled; note that these save modes do not use any locking and are not atomic
// when Overwrite mode is used, the existing data is deleted before the new data is written
usersDF.select($"name").write.save("/test/parquet1")                      // fails if /test/parquet1 already exists
usersDF.select($"name").write.mode("overwrite").save("/test/parquet1")    // use overwrite mode instead
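// the other standard save modes: "append" adds the new data to what is already there,
// "ignore" silently skips the write if the target exists, and "errorifexists" (the default) raises an error
usersDF.select($"name").write.mode("append").save("/test/parquet1")
usersDF.select($"name").write.mode("ignore").save("/test/parquet1")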

// save the result as a table; the output can also be partitioned (partitionBy) and bucketed (bucketBy), see the sketch below
usersDF.select($"name").write.saveAsTable("table1")
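
A minimal sketch of the partitionBy and bucketBy operations mentioned above (the column names and table name are illustrative; bucketBy only works together with saveAsTable):

// partition the output by favorite_color and bucket it by name into 42 buckets
usersDF.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")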

  

 

 

    ---> Parquet file 

            Parquet is a columnar format that is supported by many data processing systems

       Spark SQL supports both reading and writing Parquet files and automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to nullable for compatibility reasons.

 

        ---- read JSON data, convert it to Parquet format, create a temporary table, and query it with a SQL statement

// read data from a JSON file
val empJson = spark.read.json("/test/emp.json")
// save the data in Parquet format
empJson.write.mode("overwrite").parquet("/test/parquet")
// read the Parquet data back
val empParquet = spark.read.parquet("/test/parquet")
// create a temporary view emptable
empParquet.createOrReplaceTempView("emptable")
// execute the query using SQL statements
spark.sql("select * from emptable where deptno=10 and sal>1500").show

  

 

        ---- Schema merging: first define a simple schema, then gradually add more columns; in this way users can end up with several Parquet files whose schemas are different but mutually compatible

// Create the first file
val df1 = sc.makeRDD(1 to 5).map(x=> (x, x*2)).toDF("single", "double")
scala> df1.printSchema
root
 |-- single: integer (nullable = false)
 |-- double: integer (nullable = false)

scala> df1.write.parquet("/data/testtable/key=1")
 
// create a second file 
 scala> val df2 = sc.makeRDD(6 to 10).map(x=> (x, x*2)).toDF("single", "triple")
df2: org.apache.spark.sql.DataFrame = [single: int, triple: int]

scala> df2.printSchema
root
 |-- single: integer (nullable = false)
 |-- triple: integer (nullable = false)
  
 scala> df2.write.parquet("/data/testtable/key=2")

 // merge the two files above by reading their common parent directory with mergeSchema enabled
scala> val df3 = spark.read.option("mergeSchema", "true").parquet("/data/testtable")
df3: org.apache.spark.sql.DataFrame = [single: int, double: int ... 2 more fields]

scala> df3.printSchema
root
 |-- single: integer (nullable = true)
 |-- double: integer (nullable = true)
 |-- triple: integer (nullable = true)
 |-- key: integer (nullable = true)
 
 scala> df3.show
+------+------+------+---+
|single|double|triple|key|
+------+------+------+---+
|     8|  null|    16|  2|
|     9|  null|    18|  2|
|    10|  null|    20|  2|
|     3|     6|  null|  1|
|     4|     8|  null|  1|
|     5|    10|  null|  1|
|     6|  null|    12|  2|
|     7|  null|    14|  2|
|     1|     2|  null|  1|
|     2|     4|  null|  1|
+------+------+------+---+
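
Schema merging can also be switched on for every Parquet read in the session (it is off by default because merging is a relatively expensive operation) via the spark.sql.parquet.mergeSchema configuration; a brief sketch:

// enable schema merging globally for Parquet reads in this session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
val df3b = spark.read.parquet("/data/testtable")   // same merged schema as df3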

  

 

 

 

    ---> JSON Datasets (two ways)

// The first way: spark.read.json
scala> val df4 = spark.read.json("/app/spark-2.2.1-bin-hadoop2.7/examples/src/main/resources/people.json")
df4: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df4.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

// The second way: spark.read.format("json").load
scala> val df5 = spark.read.format("json").load("/app/spark-2.2.1-bin-hadoop2.7/examples/src/main/resources/people.json")
df5: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df5.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
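
Either way, the resulting DataFrame can be registered as a temporary view and queried with SQL, just as in the Parquet example above (a small sketch reusing df4; the view name is illustrative):

// register the JSON DataFrame as a temporary view and query it with SQL
df4.createOrReplaceTempView("people")
spark.sql("select name from people where age > 20").show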

  

 

 

    ---> Read data from a relational database via JDBC (the JDBC driver needs to be added)

// add the JDBC driver when starting spark-shell
bin/spark-shell --master spark://bigdata11:7077 --jars /root/temp/ojdbc6.jar --driver-class-path /root/temp/ojdbc6.jar

// Read Oracle
val oracleEmp = spark.read.format("jdbc")
                    .option("url","jdbc:oracle:thin:@192.168.10.100:1521/orcl.example.com")
                    .option("dbtable","scott.emp")
                    .option("user","scott")
                    .option("password","tiger").load

  

 

 

    ---> Operating on Hive tables

        ---- copy the Hadoop and Hive configuration files into Spark's conf directory: hive-site.xml, core-site.xml, hdfs-site.xml

        ---- specify the MySQL database driver when starting spark-shell

./bin/spark-shell --master spark://bigdata0:7077 --jars /data/tools/mysql-connector-java-5.1.43-bin.jar  --driver-class-path /data/tools/mysql-connector-java-5.1.43-bin.jar
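
In a standalone application (rather than spark-shell, which typically provides a Hive-enabled session once the site files are in place), Hive support has to be enabled explicitly when building the SparkSession; a minimal sketch:

import org.apache.spark.sql.SparkSession

// build a SparkSession with Hive support in a standalone application
val spark = SparkSession.builder()
  .appName("HiveExample")
  .enableHiveSupport()
  .getOrCreate()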
 

 

        ---- operate on Hive from the Spark shell

// Create a table
spark.sql("create table ccc(key INT, value STRING) row format delimited fields terminated by ','")

// load data into the table
spark.sql("load data local inpath '/test/data.txt' into table ccc")

// Query data
spark.sql("select * from ccc").show

  

 

        ---- operate on Hive using Spark SQL
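
The statements below are plain SQL terminated with semicolons, which suggests the spark-sql command-line interface; assuming that, it can be started with the same MySQL driver jars as spark-shell above:

./bin/spark-sql --master spark://bigdata0:7077 --jars /data/tools/mysql-connector-java-5.1.43-bin.jar --driver-class-path /data/tools/mysql-connector-java-5.1.43-bin.jar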

show tables;
select * from ccc;

  

 

 


Source: www.cnblogs.com/sandea/p/11919376.html