39. Parquet data source: automatic partition inference & metadata merging

I. Automatic partition inference

1. Overview

Table partitioning is a common optimization; Hive, for example, offers table partitioning as a built-in feature. In a partitioned table, the data for different partitions is usually stored in different directories, and the value of each partition column is encoded in the partition directory's name. The Parquet data source in Spark SQL can automatically infer partition information from these directory names.
For example, if demographic data is stored in a partitioned table with gender and country as partition columns, the directory structure might look like this:

tableName
  |- gender=male
    |- country=US
       ...
    |- country=CN
       ...
  |- gender=female
    |- country=US
       ...
    |- country=CN
       ...






If /tableName is passed to SQLContext.read.parquet() or SQLContext.read.load(), Spark SQL automatically infers the partition information (gender and country) from the directory structure.
Even though the data files themselves contain only two columns, name and age, the DataFrame returned by Spark SQL will print four columns when printSchema() is called: name, age, gender and country. This is the automatic partition inference feature.
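
As a quick illustration, here is a minimal Scala sketch of what this looks like from the caller's side. It is an assumption-based example, not code from the case below, and the hdfs://spark1:9000/tableName path is hypothetical (it simply mirrors the directory layout above).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PartitionDiscoverySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionDiscoverySketch"))
    val sqlContext = new SQLContext(sc)

    // Point the reader at the table's base directory; Spark SQL walks the gender=... and
    // country=... subdirectories and exposes gender and country as extra columns
    val df = sqlContext.read.parquet("hdfs://spark1:9000/tableName")

    // Prints the data columns (name, age) plus the inferred partition columns (gender, country)
    df.printSchema()
  }
}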

In addition, the data types of the partition columns are also inferred automatically. Currently, Spark SQL only infers numeric types and string types. Sometimes a user may not want the partition column types to be inferred automatically.
In that case the configuration property spark.sql.sources.partitionColumnTypeInference.enabled can be used: it defaults to true, meaning partition column types are inferred automatically, and setting it to false disables the inference.
When type inference is disabled, all partition columns are uniformly treated as the string type.
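
For example, here is a minimal sketch of disabling type inference, written as spark-shell lines (sqlContext is predefined there); the table path is an assumption that reuses the /spark-study/users directory created in the case below.

// Disable automatic type inference for partition columns (it is on by default)
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

// Subsequent reads treat every partition column as a string
val usersDF = sqlContext.read.parquet("hdfs://spark1:9000/spark-study/users")
usersDF.printSchema()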


Case: automatically infer the gender and country partitions of user data


2. Java implementation

## create directories on hdfs and upload the file
## create a users directory, and under it the gender=male and country=US directories
[root@spark1 sql]# hdfs dfs -mkdir /spark-study/users
[root@spark1 sql]# hdfs dfs -mkdir /spark-study/users/gender=male
[root@spark1 sql]# hdfs dfs -mkdir /spark-study/users/gender=male/country=US
[root@spark1 sql]# hdfs dfs -put users.parquet /spark-study/users/gender=male/country=US



 --------------
package cn.spark.study.sql;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ParquetPartitionDiscovery {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParquetPartitionDiscovery");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        
        // read the parquet file under the partitioned directory; the gender and country columns
        // are inferred automatically from the gender=male/country=US path segments
        DataFrame usersDF = sqlContext.read().parquet("hdfs://spark1:9000/spark-study/users/gender=male/country=US/users.parquet");
        
        usersDF.printSchema();
        usersDF.show();
        
    }
        
}




## package and upload the jar


## run script
[root@spark1 sql]# cat ParquetPartitionDiscovery.sh
/usr/local/spark-1.5.1-bin-hadoop2.4/bin/spark-submit \
--class cn.spark.study.sql.ParquetPartitionDiscovery \
--num-executors 3 \
--driver-memory 100m \
--executor-memory 100m \
--executor-cores 3 \
--files /usr/local/hive/conf/hive-site.xml \
--driver-class-path /usr/local/hive/lib/mysql-connector-java-5.1.17.jar \
/usr/local/spark-study/java/sql/spark-study-java-0.0.1-SNAPSHOT-jar-with-dependencies.jar \




## results
## as shown, the gender=male and country=US partitions were automatically inferred and added as columns
+------+--------------+----------------+------+-------+
|  name|favorite_color|favorite_numbers|gender|country|
+------+--------------+----------------+------+-------+
|Alyssa|          null|  [3, 9, 15, 20]|  male|     US|
|   Ben|           red|              []|  male|     US|
+------+--------------+----------------+------+-------+


II. Metadata merging

1. Overview

Like ProtocolBuffer, Avro and Thrift, Parquet also supports schema (metadata) merging. A user can start with a simple schema and, as the business requires, gradually add more columns to it.
In this scenario, multiple Parquet files end up with different but mutually compatible schemas, and the Parquet data source can detect this case and automatically merge the metadata of all the files.

Because metadata merging is a relatively expensive operation and is unnecessary in most cases, it is turned off by default starting from Spark 1.5.0.
Automatic metadata merging can be enabled for the Parquet data source in two ways (see the sketch after this list):
1. When reading a Parquet file, set the data source option mergeSchema to true
2. Use SQLContext.setConf() to set the spark.sql.parquet.mergeSchema parameter to true
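
A minimal Scala sketch of both ways, written as spark-shell lines (sqlContext is predefined there); the path reuses the /spark-study/students directory from the case below.

// Way 1: enable merging only for this read, via the data source option
val merged1 = sqlContext.read.option("mergeSchema", "true")
  .parquet("hdfs://spark1:9000/spark-study/students")

// Way 2: enable merging globally with SQLContext.setConf(), then read normally
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
val merged2 = sqlContext.read.parquet("hdfs://spark1:9000/spark-study/students")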


Case: merge the metadata of students' basic information and grade information


2. Scala implementation

package cn.spark.study.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode

object ParquetMergeSchema {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ParquetMergeSchema")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    
    import sqlContext.implicits._
    
    // Create the first DataFrame holding basic student info (name and age) and write it out as a parquet file
    // toSeq converts the Array to a Seq (an ordered collection that allows duplicate elements)
    // sc.parallelize(..., 2) creates a parallel collection split into 2 partitions
    val studentWithNameAge = Array(("leo", 30), ("jack", 26)).toSeq
    val studentWithNameAgeDF = sc.parallelize(studentWithNameAge, 2).toDF("name", "age")
    studentWithNameAgeDF.save("hdfs://spark1:9000/spark-study/students", "parquet", SaveMode.Append)
    
    // Create the second DataFrame holding student grade info (name and grade) and write it out as a parquet file
    val studentWithNameGrade = Array(("tom", "A"), ("marry", "B")).toSeq
    val studentWithNameGradeDF = sc.parallelize(studentWithNameGrade, 2).toDF("name", "grade")
    studentWithNameGradeDF.save("hdfs://spark1:9000/spark-study/students", "parquet", SaveMode.Append)
    
    // The metadata of the two DataFrames is clearly different:
    // one contains the name and age columns, the other contains name and grade.
    // The expectation is that, when the students table is read back, the metadata of the two
    // files is merged automatically, giving three columns: name, age and grade.
    
    // Read the students table with mergeSchema enabled so the metadata is merged
    val students = sqlContext.read.option("mergeSchema", "true")
      .parquet("hdfs://spark1:9000/spark-study/students")
    
    students.printSchema()
    students.show()
  }
}




## package, upload, run


## run script
[root@spark1 sql]# cat ParquetMergeSchema.sh
/usr/local/spark-1.5.1-bin-hadoop2.4/bin/spark-submit \
--class cn.spark.study.sql.ParquetMergeSchema \
--num-executors 3 \
--driver-memory 100m \
--executor-memory 100m \
--executor-cores 3 \
--files /usr/local/hive/conf/hive-site.xml \
--driver-class-path /usr/local/hive/lib/mysql-connector-java-5.1.17.jar \
/usr/local/spark-study/scala/sql/spark-study-scala.jar \



## results: the metadata of the two DataFrames has been merged
+-----+----+-----+
| name| age|grade|
+-----+----+-----+
|  leo|  30| null|
| jack|  26| null|
|marry|null|    B|
|  tom|null|    A|
+-----+----+-----+


3. Java implementation

package cn.spark.study.sql;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ParquetMergeSchema {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParquetMergeSchemaJava").setMaster("local");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sparkContext);

        // Create the first DataFrame holding basic student info (name and age) and write it out as a parquet file
        List<String> studentWithNameAndAge = new ArrayList<String>();
        studentWithNameAndAge.add("tom,18");
        studentWithNameAndAge.add("jarry,17");
        JavaRDD<String> studentWithNameAndAgeRDD = sparkContext.parallelize(studentWithNameAndAge, 2);
        JavaRDD<Row> studentWithNameAndAgeRowRDD = studentWithNameAndAgeRDD
            .map(new Function<String, Row>() {
            @Override
            public Row call(String v1) throws Exception {
                return RowFactory.create(v1.split(",")[0], Integer.parseInt(v1.split(",")[1]));
            }
        });
        
        List<StructField> fieldList = new ArrayList<StructField>();
        fieldList.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        fieldList.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
        StructType structType = DataTypes.createStructType(fieldList);
 
        DataFrame studentWithNameAndAgeDF = sqlContext.createDataFrame(studentWithNameAndAgeRowRDD, structType);
        studentWithNameAndAgeDF.write().format("parquet").mode(SaveMode.Append)
            .save("hdfs://spark1:9000/spark-study/students");
 
        // Create the second DataFrame holding student grade info (name and grade) and write it out as a parquet file
        List<String> studentWithNameAndGrade = new ArrayList<String>();
        studentWithNameAndGrade.add("leo,B");
        studentWithNameAndGrade.add("jack,A");
        JavaRDD<String> studentWithNameAndGradeRDD = sparkContext.parallelize(studentWithNameAndGrade, 2);
        JavaRDD<Row> studentWithNameAndGradeRowRDD = studentWithNameAndGradeRDD
            .map(new Function<String, Row>() {
            @Override 
            public Row call(String v1) throws Exception {
                return RowFactory.create(v1.split(",")[0], v1.split(",")[1]);
            }
        });
        fieldList = new ArrayList<StructField>();
        fieldList.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        fieldList.add(DataTypes.createStructField("grade", DataTypes.StringType, true));
        structType = DataTypes.createStructType(fieldList);
 
        DataFrame studentWithNameAndGradeDF = sqlContext.createDataFrame(studentWithNameAndGradeRowRDD, structType);
        studentWithNameAndGradeDF.write().format("parquet").mode(SaveMode.Append)
            .save("hdfs://spark1:9000/spark-study/students");
 
 
        // First of all, the metadata of the two DataFrames is certainly different:
        // one contains the name and age columns, the other contains name and grade.
        // The expectation is that, when the students table is read back, the metadata of the
        // two files is merged automatically, giving three columns: name, age and grade.
        // Read the students table with mergeSchema enabled so the metadata is merged
        DataFrame df = sqlContext.read().option("mergeSchema", "true")
            .parquet("hdfs://spark1:9000/spark-study/students");
        df.printSchema();
        df.show();
    }
}
