I. Automatic partition inference
1. Overview
Table partitioning is a common optimization technique; Hive, for example, provides partitioned tables as a core feature. In a partitioned table, the data for different partitions is stored in different directories, and the value of each partitioning column is encoded in the name of its partition directory. The Parquet data source in Spark SQL can automatically infer partition information from these directory names. For example, if demographic data is stored in a partitioned table with gender and country as the partitioning columns, the directory structure might look like this:

tableName
 |- gender=male
 |    |- country=US ...
 |    |- country=CN ...
 |- gender=female
 |    |- country=US ...
 |    |- country=CH ...

If /tableName is passed to SQLContext.read.parquet() or SQLContext.read.load(), Spark SQL automatically infers the partition information, gender and country, from the directory structure. So even though the data files themselves contain only two columns, name and age, the DataFrame returned by Spark SQL will print four columns when printSchema() is called: name, age, country and gender. This is the automatic partition inference feature.

Furthermore, the data types of the partitioning columns are also inferred automatically; currently, Spark SQL supports automatic inference of numeric and string types only. Sometimes a user may not want Spark SQL to infer the types of the partitioning columns. In that case, use the configuration property spark.sql.sources.partitionColumnTypeInference.enabled, which defaults to true (infer the types automatically); setting it to false disables type inference. With type inference disabled, all partitioning columns are uniformly typed as String.

Case study: automatically infer the gender and country partitions of user data.
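Conceptually, partition discovery just walks the path segments of each data file, parses the key=value directory names, and applies the numeric-or-string type inference described above. The following is a minimal plain-Java sketch of that idea, not Spark's actual implementation; the class and method names are our own:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathParser {

    // Walk the path segments and collect "key=value" partition directories,
    // mimicking (in spirit) what Spark SQL's partition discovery does.
    public static Map<String, Object> parsePartitions(String path) {
        Map<String, Object> partitions = new LinkedHashMap<String, Object>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq <= 0) {
                continue; // not a partition directory (table root, data file, ...)
            }
            String column = segment.substring(0, eq);
            String raw = segment.substring(eq + 1);
            partitions.put(column, inferType(raw));
        }
        return partitions;
    }

    // Spark infers only numeric and string types for partition columns;
    // this sketch does the same: try integer first, otherwise keep the string.
    private static Object inferType(String raw) {
        try {
            return Integer.valueOf(raw);
        } catch (NumberFormatException e) {
            return raw;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> p =
                parsePartitions("users/gender=male/country=US/users.parquet");
        System.out.println(p); // {gender=male, country=US}
    }
}
```

With a path such as tableName/year=2015/part-00000, the same logic would yield an integer-typed year column, which is exactly the behavior that spark.sql.sources.partitionColumnTypeInference.enabled=false suppresses.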
2. Java implementation
## Create the directories on HDFS and upload the file ##
# Create a users directory, then gender=male and country=US subdirectories under it
[root@spark1 sql]# hdfs dfs -mkdir /spark-study/users
[root@spark1 sql]# hdfs dfs -mkdir /spark-study/users/gender=male
[root@spark1 sql]# hdfs dfs -mkdir /spark-study/users/gender=male/country=US
[root@spark1 sql]# hdfs dfs -put users.parquet /spark-study/users/gender=male/country=US

--------------

package cn.spark.study.sql;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ParquetPartitionDiscovery {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParquetPartitionDiscovery");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        DataFrame usersDF = sqlContext.read().parquet(
                "hdfs://spark1:9000/spark-study/users/gender=male/country=US/users.parquet");
        usersDF.printSchema();
        usersDF.show();
    }
}

## Package, upload, and run the script ##
[root@spark1 sql]# cat ParquetPartitionDiscovery.sh
/usr/local/spark-1.5.1-bin-hadoop2.4/bin/spark-submit \
--class cn.spark.study.sql.ParquetPartitionDiscovery \
--num-executors 3 \
--driver-memory 100m \
--executor-memory 100m \
--executor-cores 3 \
--files /usr/local/hive/conf/hive-site.xml \
--driver-class-path /usr/local/hive/lib/mysql-connector-java-5.1.17.jar \
/usr/local/spark-study/java/sql/saprk-study-java-0.0.1-SNAPSHOT-jar-with-dependencies.jar \

## Results ##
As expected, the two partitions gender=male and country=US were inferred automatically and added as columns:
+------+--------------+----------------+------+-------+
|  name|favorite_color|favorite_numbers|gender|country|
+------+--------------+----------------+------+-------+
|Alyssa|          null|  [3, 9, 15, 20]|  male|     US|
|   Ben|           red|              []|  male|     US|
+------+--------------+----------------+------+-------+
II. Merging metadata (schema merging)
1. Overview
Like ProtocolBuffer, Avro, and Thrift, Parquet supports metadata (schema) merging. Users can start with a simple schema and, as business needs evolve, gradually add more columns to it. In this scenario, users may end up with multiple Parquet files that have different, but mutually compatible, schemas. The Parquet data source can detect this case automatically and merge the schemas of the files. Because schema merging is a relatively expensive operation and is unnecessary in most cases, it is turned off by default starting with Spark 1.5.0. There are two ways to enable automatic schema merging for the Parquet data source:
1. When reading a Parquet file, set the data source option mergeSchema to true.
2. Use SQLContext.setConf() to set the parameter spark.sql.parquet.mergeSchema to true.

Case study: merge the schemas of students' basic information and grade information.
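At its core, merging two compatible schemas means taking the union of their columns while requiring that columns present in both files agree on their type. The following is a simplified plain-Java sketch of that rule (modeling a schema as an ordered column-name-to-type map; this is an illustration, not Spark's or Parquet's actual merge code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaMerger {

    // Merge two schemas (column name -> type name): keep every column from
    // both, and reject the merge if a shared column has conflicting types.
    public static Map<String, String> merge(Map<String, String> left,
                                            Map<String, String> right) {
        Map<String, String> merged = new LinkedHashMap<String, String>(left);
        for (Map.Entry<String, String> e : right.entrySet()) {
            String existing = merged.get(e.getKey());
            if (existing == null) {
                merged.put(e.getKey(), e.getValue());      // new column: add it
            } else if (!existing.equals(e.getValue())) {
                // shared column with a different type: the schemas are incompatible
                throw new IllegalArgumentException(
                        "Incompatible types for column " + e.getKey());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> basicInfo = new LinkedHashMap<String, String>();
        basicInfo.put("name", "string");
        basicInfo.put("age", "int");
        Map<String, String> gradeInfo = new LinkedHashMap<String, String>();
        gradeInfo.put("name", "string");
        gradeInfo.put("grade", "string");
        System.out.println(merge(basicInfo, gradeInfo)); // {name=string, age=int, grade=string}
    }
}
```

This mirrors the case study below: a (name, age) schema merged with a (name, grade) schema yields the three-column (name, age, grade) schema.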
2. Scala implementation
package cn.spark.study.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode

object ParquetMergeSchema {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ParquetMergeSchema")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Create a DataFrame holding the students' basic information and write it
    // out as a Parquet file.
    // toSeq converts the array to a Seq, an ordered list that allows duplicate
    // elements and suits fast insertion/deletion scenarios.
    // sc.parallelize(..., 2) creates a parallel collection split into 2 partitions.
    val studentWithNameAge = Array(("leo", 30), ("jack", 26)).toSeq
    val studentWithNameAgeDF = sc.parallelize(studentWithNameAge, 2).toDF("name", "age")
    studentWithNameAgeDF.save("hdfs://spark1:9000/spark-study/students", "parquet", SaveMode.Append)

    // Create a second DataFrame holding the students' grade information and
    // write it to the same directory as another Parquet file.
    val studentWithNameGrade = Array(("tom", "A"), ("marry", "B")).toSeq
    val studentWithNameGradeDF = sc.parallelize(studentWithNameGrade, 2).toDF("name", "grade")
    studentWithNameGradeDF.save("hdfs://spark1:9000/spark-study/students", "parquet", SaveMode.Append)

    // The schemas of the first and second DataFrames are certainly different:
    // one contains the columns name and age, the other name and grade.
    // What we want is that, when the table is read back, the schemas of the two
    // files are merged automatically into three columns: name, age, grade.
    // Reading the students table with the mergeSchema option merges the schemas.
    val students = sqlContext.read.option("mergeSchema", "true")
      .parquet("hdfs://spark1:9000/spark-study/students")
    students.printSchema()
    students.show()
  }
}

## Package, upload, and run the script ##
[root@spark1 sql]# cat ParquetMergeSchema.sh
/usr/local/spark-1.5.1-bin-hadoop2.4/bin/spark-submit \
--class cn.spark.study.sql.ParquetMergeSchema \
--num-executors 3 \
--driver-memory 100m \
--executor-memory 100m \
--executor-cores 3 \
--files /usr/local/hive/conf/hive-site.xml \
--driver-class-path /usr/local/hive/lib/mysql-connector-java-5.1.17.jar \
/usr/local/spark-study/scala/sql/spark-study-scala.jar \

## Results: the schemas of the two DataFrames are merged ##
+-----+----+-----+
| name| age|grade|
+-----+----+-----+
|  leo|  30| null|
| jack|  26| null|
|marry|null|    B|
|  tom|null|    A|
+-----+----+-----+
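Note in the results above that rows coming from a file written with only one of the two schemas are padded with null for the columns that file lacks. A plain-Java sketch of that padding step (the helper names are our own, not Spark code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowPadder {

    // Given the merged column order and one row from a single file
    // (column -> value), produce a full row with null for missing columns.
    public static List<Object> pad(List<String> mergedColumns, Map<String, Object> row) {
        List<Object> padded = new ArrayList<Object>();
        for (String column : mergedColumns) {
            padded.add(row.get(column)); // null when the file lacks this column
        }
        return padded;
    }

    public static void main(String[] args) {
        List<String> merged = Arrays.asList("name", "age", "grade");
        // A row from the grades file, which has no age column.
        Map<String, Object> gradeRow = new LinkedHashMap<String, Object>();
        gradeRow.put("name", "tom");
        gradeRow.put("grade", "A");
        System.out.println(pad(merged, gradeRow)); // [tom, null, A]
    }
}
```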
3. Java implementation
package cn.spark.study.sql;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ParquetMergeSchema {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("ParquetMergeSchemaJava").setMaster("local");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sparkContext);

        // Create a DataFrame holding the students' basic information and write
        // it out as a Parquet file.
        List<String> studentWithNameAndAge = new ArrayList<String>();
        studentWithNameAndAge.add("tom,18");
        studentWithNameAndAge.add("jarry,17");
        JavaRDD<String> studentWithNameAndAgeRDD =
                sparkContext.parallelize(studentWithNameAndAge, 2);
        JavaRDD<Row> studentWithNameAndAgeRowRDD = studentWithNameAndAgeRDD
                .map(new Function<String, Row>() {
                    @Override
                    public Row call(String v1) throws Exception {
                        return RowFactory.create(v1.split(",")[0],
                                Integer.parseInt(v1.split(",")[1]));
                    }
                });
        List<StructField> fieldList = new ArrayList<StructField>();
        fieldList.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        fieldList.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
        StructType structType = DataTypes.createStructType(fieldList);
        DataFrame studentWithNameAndAgeDF =
                sqlContext.createDataFrame(studentWithNameAndAgeRowRDD, structType);
        studentWithNameAndAgeDF.write().format("parquet").mode(SaveMode.Append)
                .save("hdfs://spark1:9000/spark-study/students");

        // Create a second DataFrame holding the students' grade information and
        // write it to the same directory as another Parquet file.
        List<String> studentWithNameAndGrade = new ArrayList<String>();
        studentWithNameAndGrade.add("leo,B");
        studentWithNameAndGrade.add("jack,A");
        JavaRDD<String> studentWithNameAndGradeRDD =
                sparkContext.parallelize(studentWithNameAndGrade, 2);
        JavaRDD<Row> studentWithNameAndGradeRowRDD = studentWithNameAndGradeRDD
                .map(new Function<String, Row>() {
                    @Override
                    public Row call(String v1) throws Exception {
                        return RowFactory.create(v1.split(",")[0], v1.split(",")[1]);
                    }
                });
        fieldList = new ArrayList<StructField>();
        fieldList.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        fieldList.add(DataTypes.createStructField("grade", DataTypes.StringType, true));
        structType = DataTypes.createStructType(fieldList);
        DataFrame studentWithNameAndGradeDF =
                sqlContext.createDataFrame(studentWithNameAndGradeRowRDD, structType);
        studentWithNameAndGradeDF.write().format("parquet").mode(SaveMode.Append)
                .save("hdfs://spark1:9000/spark-study/students");

        // The schemas of the first and second DataFrames are certainly different:
        // one contains the columns name and age, the other name and grade.
        // What we want is that, when the table is read back, the schemas of the
        // two files are merged automatically into three columns: name, age, grade.
        // Reading the students table with the mergeSchema option merges the schemas.
        DataFrame df = sqlContext.read().option("mergeSchema", "true")
                .parquet("hdfs://spark1:9000/spark-study/students");
        df.printSchema();
        df.show();
    }
}