==> What is Parquet
Parquet is a columnar storage file format.
==> Official website description:
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language
==> Origin
Parquet's inspiration comes from Dremel: in 2010 Google published the Dremel paper, which introduces a storage format that supports nested structures and uses columnar storage to improve query performance. The paper also describes how Google uses this storage format to achieve parallel queries. If you are interested, refer to the paper and the open-source implementation Apache Drill.
==> Features:
---> Can skip data that does not match the filter conditions, reading only the required data and reducing the amount of IO
---> Compression encoding can reduce disk storage space (since data in the same column has the same type, more efficient compression encodings such as Run Length Encoding and Delta Encoding can be used to further save storage space)
---> Reads only the columns that are needed and supports vectorized operations, yielding better scan performance
---> Parquet is Spark SQL's default data source format, configurable via spark.sql.sources.default (see the sketch below)
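As a minimal sketch of what the default means in practice (the paths are placeholders; in spark-shell the `spark` session already exists, so the builder step is only needed in a standalone app):

    import org.apache.spark.sql.SparkSession

    // Build a local session (spark-shell provides `spark` already).
    val spark = SparkSession.builder()
      .appName("ParquetDefault")
      .master("local[*]")
      .getOrCreate()

    // With no format given, load()/save() fall back to spark.sql.sources.default,
    // which is "parquet" out of the box.
    val usersDF = spark.read.load("examples/users.parquet")
    usersDF.select("name").write.save("names.parquet")

    // The default source can be switched at runtime, e.g. to JSON:
    spark.conf.set("spark.sql.sources.default", "json")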
==> Common Parquet operations
---> load and save functions
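A sketch of the generic load and save functions with an explicit format (the paths are placeholders): format() overrides the default data source, and mode() controls what happens when the output path already exists.

    import org.apache.spark.sql.SaveMode

    // format() selects the data source explicitly instead of the default.
    val peopleDF = spark.read.format("json").load("examples/people.json")

    // mode() controls behavior when the target already exists.
    peopleDF.write.format("parquet").mode(SaveMode.Overwrite).save("examples/people.parquet")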
---> Parquet file
Parquet is a columnar format that is supported by many different data processing systems.
Spark SQL supports both reading and writing Parquet files, automatically preserving the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
---- Read JSON data, convert it to Parquet format, create the corresponding table, and query it with SQL statements (first sketch below)
---- Schema merging: start with a simple schema and gradually add more column descriptions; users may end up with multiple Parquet files whose schemas differ but are mutually compatible (second sketch below)
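A minimal sketch of the JSON-to-Parquet workflow, assuming a hypothetical examples/people.json file with name and age fields:

    // Read JSON, persist as Parquet (the schema is preserved automatically).
    val peopleDF = spark.read.json("examples/people.json")
    peopleDF.write.parquet("people.parquet")

    // Read the Parquet file back, register a temporary view, query it with SQL.
    val parquetDF = spark.read.parquet("people.parquet")
    parquetDF.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19").show()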
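And a sketch of schema merging, along the lines of the example in the Spark documentation: two writes with different but compatible schemas land under one base path, and the mergeSchema option reconciles them on read.

    import spark.implicits._

    // First write: schema (value, square).
    val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
    squaresDF.write.parquet("data/test_table/key=1")

    // Second write: a different but compatible schema (value, cube).
    val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
    cubesDF.write.parquet("data/test_table/key=2")

    // mergeSchema is off by default; enabling it merges the two schemas,
    // and the partition column `key` is picked up automatically.
    val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()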
---> JSON datasets (two ways, sketched below)
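The "two ways" presumably refer to reading JSON from files versus from in-memory JSON strings; a sketch of both, assuming Spark 2.2+ (where read.json accepts a Dataset[String]; older versions take an RDD[String]):

    import spark.implicits._

    // Way 1: point read.json at a path of JSON-lines files.
    val fromFile = spark.read.json("examples/people.json")

    // Way 2: build the dataset from in-memory JSON strings.
    val jsonStrings = spark.createDataset(Seq(
      """{"name": "Yin", "address": {"city": "Columbus", "state": "Ohio"}}"""
    ))
    val fromStrings = spark.read.json(jsonStrings)
    fromStrings.show()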
---> Reading data from a relational database via JDBC (the JDBC driver needs to be added; see the sketch below)
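A sketch for MySQL, where the URL, table, and credentials are all placeholders; the connector jar must be on the classpath first, e.g. by launching spark-shell --jars mysql-connector-java-<version>.jar.

    // All connection values below are placeholders for illustration.
    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/test")
      .option("dbtable", "person")
      .option("user", "root")
      .option("password", "secret")
      .load()
    jdbcDF.show()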
---> Operating on Hive tables
---- Copy the Hadoop and Hive configuration files into Spark's conf directory: hive-site.xml, core-site.xml, hdfs-site.xml
---- Specify the MySQL driver when starting spark-shell (the Hive metastore is typically backed by MySQL)
---- Operate on Hive using the Spark shell
---- Operate on Hive using Spark SQL (see the sketch below)
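A rough sketch of driving Hive from Spark SQL in a standalone app, assuming the configuration files above are in place (in spark-shell the builder step is unnecessary because `spark` already exists; the table name and data path are placeholders):

    import org.apache.spark.sql.SparkSession

    // enableHiveSupport() wires the session to the Hive metastore
    // described by hive-site.xml on Spark's conf path.
    val spark = SparkSession.builder()
      .appName("SparkHiveExample")
      .enableHiveSupport()
      .getOrCreate()

    // USING hive requires Spark 2.2+; on older versions drop the clause.
    spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
    spark.sql("LOAD DATA LOCAL INPATH 'examples/kv1.txt' INTO TABLE src")
    spark.sql("SELECT key, value FROM src").show()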