Parquet encoding method

Although the Parquet format documentation describes many encoding methods ( https://github.com/apache/parquet-format/blob/master/Encodings.md ), in practice the parquet-mr writer API exposes only two: PLAIN and dictionary encoding. The only choice is whether to enable dictionary encoding or not, and this setting applies at file granularity only; column granularity is not supported, so the encoding cannot be set for a specific column.
( https://issues.apache.org/jira/browse/PARQUET-1058 )

https://issues.apache.org/jira/browse/PARQUET-796
"the boolean enableDictionary determines whether dictionary encoding is used for all columns or none of them." A commenter on that issue notes that declaring a column as OriginalType.TIMESTAMP_MILLIS makes the writer use delta encoding for it, but this behavior is not documented in the Parquet specification, so it cannot be relied on as an official feature.

File-level API setting: when constructing a ParquetWriter, choose whether to enable dictionary encoding via the enableDictionary parameter.

  • Parquet interface
/**
   * Create a new ParquetWriter.
   *
   * Directly instantiates a Hadoop {@link org.apache.hadoop.conf.Configuration} which reads
   * configuration from the classpath.
   *
   * @param file the file to create
   * @param writeSupport the implementation to write a record to a RecordConsumer
   * @param compressionCodecName the compression codec to use
   * @param blockSize the block size threshold
   * @param pageSize the page size threshold
   * @param dictionaryPageSize the page size threshold for the dictionary pages
   * @param enableDictionary to turn dictionary encoding on
   * @param validating to turn on validation using the schema
   * @param writerVersion version of parquetWriter from {@link ParquetProperties.WriterVersion}
   * @throws IOException
   * @see #ParquetWriter(Path, WriteSupport, CompressionCodecName, int, int, int, boolean, boolean, WriterVersion, Configuration)
   */
  @Deprecated
  public ParquetWriter(
      Path file,
      WriteSupport<T> writeSupport,
      CompressionCodecName compressionCodecName,
      int blockSize,
      int pageSize,
      int dictionaryPageSize,
      boolean enableDictionary,
      boolean validating,
      WriterVersion writerVersion) throws IOException {
    this(file, writeSupport, compressionCodecName, blockSize, pageSize, dictionaryPageSize, enableDictionary, validating, writerVersion, new Configuration());
  }
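
This constructor is deprecated in current parquet-mr versions; the same file-level switch is exposed through the builder API as withDictionaryEncoding. A minimal sketch, assuming parquet-hadoop and its example module are on the classpath (ExampleParquetWriter, the schema, and the output path here are illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class DictionaryToggleExample {
  public static void main(String[] args) throws Exception {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message example { required binary name (UTF8); }");
    // File-level switch: dictionary encoding is either on or off for ALL columns.
    try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("/tmp/example.parquet"))
        .withType(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withDictionaryEncoding(false)  // disable dictionary for the whole file
        .build()) {
      writer.write(new SimpleGroupFactory(schema).newGroup().append("name", "a"));
    }
  }
}
```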

If you use Spark, set the parameter parquet.enable.dictionary:
https://stackoverflow.com/questions/45488227/how-to-set-parquet-file-encoding-in-spark
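
Concretely, the property can be passed either at session level through the Hadoop configuration or per write as a writer option, as described in the Stack Overflow answer above. A sketch, assuming Spark on the classpath (the output path and DataFrame contents are illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DisableDictionaryInSpark {
  public static void main(String[] args) {
    // Session-level: forward the Parquet property through the Hadoop configuration.
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.hadoop.parquet.enable.dictionary", "false")
        .getOrCreate();

    Dataset<Row> df = spark.range(10).toDF();
    // Per-write alternative: set the same property as a writer option.
    df.write().option("parquet.enable.dictionary", "false")
        .parquet("/tmp/spark-no-dict.parquet");
    spark.stop();
  }
}
```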

To view the encoding of a specific column in a Parquet file, you can debug into the build() function of ParquetReader: it reads the footer of the Parquet file, which holds the metadata of the entire file, including the encodings of each column in each row group.
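
Instead of stepping through the debugger, the footer can also be read directly with ParquetFileReader. A sketch assuming parquet-hadoop on the classpath (the file path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ShowEncodings {
  public static void main(String[] args) throws Exception {
    Path path = new Path("/tmp/example.parquet");
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(path, new Configuration()))) {
      int rowGroup = 0;
      // The footer lists every row group (block) and its column chunks.
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // Each column chunk records the set of encodings actually used.
          System.out.println("row group " + rowGroup + " "
              + column.getPath() + " -> " + column.getEncodings());
        }
        rowGroup++;
      }
    }
  }
}
```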

Origin blog.csdn.net/qiaojialin/article/details/90269796