Parquet Encodings

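Although the Parquet documentation lists many encodings (https://github.com/apache/parquet-format/blob/master/Encodings.md ), in practice the Parquet writer only lets you choose between two: PLAIN and dictionary encoding. The only switch is whether dictionary encoding is on or off. The setting also applies to the whole file, not to individual columns, so you cannot pick an encoding for one specific column.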
https://issues.apache.org/jira/browse/PARQUET-1058
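
To make the difference between the two options concrete, here is a minimal sketch of the idea behind dictionary encoding: PLAIN stores every value as-is, while dictionary encoding stores each distinct value once and replaces the rows with small integer indices. This is a toy illustration only. The class name DictionarySketch, the Encoded holder, and the sample values are made up for this example, and the real format additionally packs the indices with an RLE/bit-packing hybrid, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of dictionary encoding; this is NOT Parquet's actual byte layout. */
public class DictionarySketch {

    /** Hypothetical holder: the distinct values plus one dictionary index per row. */
    public static final class Encoded {
        public final List<String> dictionary;
        public final int[] indices;

        Encoded(List<String> dictionary, int[] indices) {
            this.dictionary = dictionary;
            this.indices = indices;
        }
    }

    /** Replaces each value by its position in a dictionary built in first-seen order. */
    public static Encoded encode(List<String> column) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int[] indices = new int[column.size()];
        for (int row = 0; row < column.size(); row++) {
            String value = column.get(row);
            Integer id = ids.get(value);
            if (id == null) {
                id = ids.size();      // next free dictionary slot
                ids.put(value, id);
            }
            indices[row] = id;
        }
        return new Encoded(new ArrayList<>(ids.keySet()), indices);
    }

    /** Inverse of encode: looks every index up in the dictionary. */
    public static List<String> decode(Encoded encoded) {
        List<String> column = new ArrayList<>(encoded.indices.length);
        for (int index : encoded.indices) {
            column.add(encoded.dictionary.get(index));
        }
        return column;
    }

    public static void main(String[] args) {
        // PLAIN would store all six strings; the dictionary form stores three.
        List<String> column = Arrays.asList("cn", "us", "cn", "cn", "de", "us");
        Encoded encoded = encode(column);
        System.out.println(encoded.dictionary);               // [cn, us, de]
        System.out.println(Arrays.toString(encoded.indices)); // [0, 1, 0, 0, 2, 1]
    }
}
```

The fewer distinct values a column has, the more this helps, which is why a single on/off switch for the whole file is a coarse tool: a high-cardinality column gains little from the dictionary that a low-cardinality column in the same file benefits from.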

https://issues.apache.org/jira/browse/PARQUET-796
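the boolean enableDictionary determines whether dictionary encoding is used for all columns or none of them. In that issue, one commenter says that OriginalType.TIMESTAMP_MILLIS can be used to turn on Delta Encoding. However, the Parquet documentation does not even mention this data type, so it cannot be treated as an official feature.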

Setting the encoding through the file-level API: when constructing a ParquetWriter, choose whether to turn on Dictionary Encoding (the enableDictionary parameter).

  • Parquet API
  /**
   * Create a new ParquetWriter.
   *
   * Directly instantiates a Hadoop {@link org.apache.hadoop.conf.Configuration} which reads
   * configuration from the classpath.
   *
   * @param file the file to create
   * @param writeSupport the implementation to write a record to a RecordConsumer
   * @param compressionCodecName the compression codec to use
   * @param blockSize the block size threshold
   * @param pageSize the page size threshold
   * @param dictionaryPageSize the page size threshold for the dictionary pages
   * @param enableDictionary to turn dictionary encoding on
   * @param validating to turn on validation using the schema
   * @param writerVersion version of parquetWriter from {@link ParquetProperties.WriterVersion}
   * @throws IOException
   * @see #ParquetWriter(Path, WriteSupport, CompressionCodecName, int, int, int, boolean, boolean, WriterVersion, Configuration)
   */
  @Deprecated
  public ParquetWriter(
      Path file,
      WriteSupport<T> writeSupport,
      CompressionCodecName compressionCodecName,
      int blockSize,
      int pageSize,
      int dictionaryPageSize,
      boolean enableDictionary,
      boolean validating,
      WriterVersion writerVersion) throws IOException {
    this(file, writeSupport, compressionCodecName, blockSize, pageSize, dictionaryPageSize, enableDictionary, validating, writerVersion, new Configuration());
  }

With Spark, the same switch is set through the parameter parquet.enable.dictionary:
https://stackoverflow.com/questions/45488227/how-to-set-parquet-file-encoding-in-spark

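To check which encoding a particular column in a Parquet file uses, you can step through ParquetReader's build() function in a debugger. It reads the file's footer, which holds the metadata for the whole file, including the encoding of every column in every row group.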
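
As a rough illustration of where that footer lives, the sketch below locates it using only the JDK. It relies on the file layout in which a Parquet file ends with the serialized metadata, followed by a 4-byte little-endian metadata length and the 4-byte magic PAR1. The class name FooterLocator and its methods are hypothetical helpers for this example, the sketch assumes the plain PAR1 trailer, and it does not decode the Thrift metadata itself (that is the part where the per-column encodings are actually stored).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that finds the footer of a Parquet file without any Parquet library. */
public class FooterLocator {

    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Parses the last 8 bytes of a file: 4-byte little-endian footer length, then "PAR1". */
    public static int footerLength(byte[] tail) {
        if (tail.length != 8) {
            throw new IllegalArgumentException("expected exactly 8 bytes");
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (tail[4 + i] != MAGIC[i]) {
                throw new IllegalArgumentException("missing PAR1 magic");
            }
        }
        return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /** Returns the byte offset at which the serialized footer metadata starts. */
    public static long footerStart(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long size = file.length();
            // Smallest possible layout: leading magic + length field + trailing magic.
            if (size < 12) {
                throw new IOException("too small to be a Parquet file");
            }
            byte[] tail = new byte[8];
            file.seek(size - 8);
            file.readFully(tail);
            return size - 8 - footerLength(tail);
        }
    }
}
```

Calling FooterLocator.footerStart on a file path gives the offset a reader has to seek to before it can deserialize the metadata, which is the step the debugger session described above walks through.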


Reposted from blog.csdn.net/qiaojialin/article/details/90269796