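Although the Parquet documentation lists many encodings (https://github.com/apache/parquet-format/blob/master/Encodings.md), in practice the Parquet Java writer API only lets you choose between two: PLAIN and dictionary encoding. The only switch is whether dictionary encoding is on or off. It is also a file-level setting, not a column-level one, so you cannot pick the encoding for one specific column.
(https://issues.apache.org/jira/browse/PARQUET-1058)
https://issues.apache.org/jira/browse/PARQUET-796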
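Quoting the issue: "the boolean enableDictionary determines whether dictionary encoding is used for all columns or none of them." One commenter in that thread says you can use OriginalType.TIMESTAMP_MILLIS to turn on delta encoding. But the Parquet documentation does not describe that data type as a way to select an encoding, so it cannot be treated as an officially supported feature.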
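To make the PLAIN versus dictionary distinction concrete, here is a minimal sketch of the idea behind dictionary encoding. It is illustrative only and does not reproduce Parquet's on-disk layout: the class and method names are hypothetical, and real Parquet additionally run-length encodes and bit-packs the indices. PLAIN would store every value as-is, while dictionary encoding stores each distinct value once plus a small integer per row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: mimics the idea of dictionary encoding, not Parquet's file format. */
final class DictionaryEncodingSketch {

    /** Distinct values in first-seen order, plus one dictionary index per input value. */
    static final class Encoded {
        final List<String> dictionary;
        final int[] indices;

        Encoded(List<String> dictionary, int[] indices) {
            this.dictionary = dictionary;
            this.indices = indices;
        }
    }

    /** Replaces each value with the index of its entry in a dictionary built on the fly. */
    static Encoded encode(List<String> column) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int[] indices = new int[column.size()];
        for (int i = 0; i < column.size(); i++) {
            String value = column.get(i);
            Integer id = ids.get(value);
            if (id == null) {
                id = ids.size();      // next free index
                ids.put(value, id);
            }
            indices[i] = id;
        }
        return new Encoded(new ArrayList<>(ids.keySet()), indices);
    }

    /** Looks every index up in the dictionary to recover the original column. */
    static List<String> decode(Encoded encoded) {
        List<String> out = new ArrayList<>(encoded.indices.length);
        for (int index : encoded.indices) {
            out.add(encoded.dictionary.get(index));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("cn", "us", "cn", "cn", "us", "de");
        Encoded encoded = encode(column);
        // Prints: [cn, us, de] [0, 1, 0, 0, 1, 2]
        System.out.println(encoded.dictionary + " " + Arrays.toString(encoded.indices));
    }
}
```

The payoff grows with repetition: a column with few distinct values shrinks to a short dictionary plus small integers, which is why the on/off switch matters for low-cardinality columns.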
Setting the encoding through the file-level API: when constructing a ParquetWriter, you choose whether to enable dictionary encoding (the enableDictionary parameter).
- Parquet API
/**
 * Create a new ParquetWriter.
 *
 * Directly instantiates a Hadoop {@link org.apache.hadoop.conf.Configuration} which reads
 * configuration from the classpath.
 *
 * @param file the file to create
 * @param writeSupport the implementation to write a record to a RecordConsumer
 * @param compressionCodecName the compression codec to use
 * @param blockSize the block size threshold
 * @param pageSize the page size threshold
 * @param dictionaryPageSize the page size threshold for the dictionary pages
 * @param enableDictionary to turn dictionary encoding on
 * @param validating to turn on validation using the schema
 * @param writerVersion version of parquetWriter from {@link ParquetProperties.WriterVersion}
 * @throws IOException
 * @see #ParquetWriter(Path, WriteSupport, CompressionCodecName, int, int, int, boolean, boolean, WriterVersion, Configuration)
 */
@Deprecated
public ParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    int dictionaryPageSize,
    boolean enableDictionary,
    boolean validating,
    WriterVersion writerVersion) throws IOException {
  this(file, writeSupport, compressionCodecName, blockSize, pageSize, dictionaryPageSize, enableDictionary, validating, writerVersion, new Configuration());
}
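Since the constructor above is marked @Deprecated, the same switch is normally reached through the builder. The following is a rough sketch, assuming parquet-hadoop and the Hadoop client libraries are on the classpath. The schema, the output path, and the class name are made up for illustration; ExampleParquetWriter is the example writer shipped with parquet-hadoop, used here only because it needs no custom WriteSupport.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteWithoutDictionary {
    public static void main(String[] args) throws IOException {
        // Hypothetical schema and output path, chosen only for this sketch.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required binary country (UTF8); required int64 ts; }");
        Path file = new Path("/tmp/no-dictionary.parquet"); // must not exist yet

        try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(file)
                .withType(schema)
                .withDictionaryEncoding(false) // same switch as enableDictionary
                .build()) {
            SimpleGroupFactory groups = new SimpleGroupFactory(schema);
            writer.write(groups.newGroup().append("country", "cn").append("ts", 1L));
            writer.write(groups.newGroup().append("country", "us").append("ts", 2L));
        }
    }
}
```

Flipping the argument to true and writing a second file gives a pair that can be compared by inspecting their footers.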
With Spark, the equivalent is to set the parquet.enable.dictionary parameter:
https://stackoverflow.com/questions/45488227/how-to-set-parquet-file-encoding-in-spark
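One way this is commonly done is to put the key into the Hadoop configuration that Spark hands to the Parquet output format. The sketch below assumes a local Spark session; the application name, the sample data, and the output path are placeholders, and the behavior has not been checked against every Spark version.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkDisableParquetDictionary {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("parquet-dictionary-off") // placeholder name
            .master("local[*]")
            .getOrCreate();

        // The Parquet output format reads this key from the Hadoop configuration.
        spark.sparkContext().hadoopConfiguration().set("parquet.enable.dictionary", "false");

        // Placeholder data and output path, for illustration only.
        Dataset<Row> df = spark.range(1000).toDF("id");
        df.write().mode("overwrite").parquet("/tmp/spark-no-dictionary");

        spark.stop();
    }
}
```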
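To check which encoding a particular column actually uses, you can step through ParquetReader's build() function in a debugger. It reads the Parquet file's footer, which holds the metadata for the whole file, including the encodings of every column in every row group.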
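As an alternative to stepping through the debugger, the footer can also be read directly. This is a minimal sketch, assuming parquet-hadoop 1.10 or later and the Hadoop client libraries on the classpath; the class name is hypothetical and the file path is taken from the first command-line argument.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintColumnEncodings {
    public static void main(String[] args) throws IOException {
        Path file = new Path(args[0]); // path to an existing Parquet file

        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
            ParquetMetadata footer = reader.getFooter();

            int rowGroup = 0;
            // One BlockMetaData per row group, one ColumnChunkMetaData per column in it.
            for (BlockMetaData block : footer.getBlocks()) {
                for (ColumnChunkMetaData column : block.getColumns()) {
                    System.out.println("row group " + rowGroup
                        + ", column " + column.getPath().toDotString()
                        + ": " + column.getEncodings());
                }
                rowGroup++;
            }
        }
    }
}
```

The set printed for each column usually also contains the encodings used for repetition and definition levels, so it is the presence or absence of a dictionary encoding that shows whether the switch took effect.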