Parquet Columnar Storage Format

References:
https://blog.csdn.net/kangkangwanwan/article/details/78656940
http://parquet.apache.org/documentation/latest/

Advantages of Columnar Storage

  • IO is spent only on the data the query needs: only the columns involved in the computation are loaded
  • Columnar data compresses better, saving storage space

Parquet is only a storage format; it is independent of the language or framework on top of it.

  • Broad compatibility
  • Optimized storage space
  • Optimized computation time

  • In Hive, the metastore and the data are separate: ALTER TABLE only modifies the metastore and does not immediately convert the data, so queries may later fail with conversion errors; the Parquet file's schema does not change either

Glossary

block: an HDFS block
file: an HDFS file; it must contain the file metadata but does not necessarily contain the actual data
row group: composed of multiple column chunks, one chunk per column of the rows in the group
column chunk: a large block of data for one particular column
page: a chunk is divided into pages; each page is a unit of compression and encoding, and a chunk may contain multiple pages

File Format

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

Metadata is written after the data to allow for single pass writing.

Readers are expected to first read the file metadata to find all the column chunks they are interested in. The column chunks should then be read sequentially.
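
As a hedged side note (a sketch of my own, not from the referenced article), this is roughly how that read pattern looks with parquet-hadoop's ParquetFileReader: the footer is read when the file is opened, and the column chunks are then read row group by row group. The file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

Configuration conf = new Configuration();
Path file = new Path("/tmp/addressbook.parquet");  // placeholder path

try (ParquetFileReader reader =
         ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
    ParquetMetadata footer = reader.getFooter();             // file metadata, read first
    System.out.println(footer.getFileMetaData().getSchema());

    PageReadStore rowGroup;
    while ((rowGroup = reader.readNextRowGroup()) != null) { // then the column chunks, one row group at a time
        System.out.println("rows in row group: " + rowGroup.getRowCount());
    }
}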

Data Model

message AddressBook {
  required string owner;             // required (appears exactly once)
  repeated string ownerPhoneNumbers; // repeated (appears 0 or more times)
  repeated group contacts {          // Parquet has no complex Map/List/Set types; repeated fields and groups are used instead
    required string name;
    optional string phoneNumber;     // optional (appears 0 or 1 times)
  }
}

A field's type is either a group or a primitive type.

A schema has as many columns as it has leaf nodes.

                     AddressBook
         /                |                 \
    required           repeated           repeated
     owner         ownerPhoneNumbers      contacts
                                          /       \
                                     required    optional
                                       name     phoneNumber

There are 4 columns here:
owner:string
ownerPhoneNumbers:string
contacts.name:string
contacts.phoneNumber:string
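
As an aside (a sketch of my own, not from the article), the same schema can be written with parquet-column's MessageTypeParser. In actual Parquet schema syntax the string fields are declared as binary (UTF8), and getPaths() returns exactly these four leaf columns:

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

MessageType schema = MessageTypeParser.parseMessageType(
    "message AddressBook {\n"
  + "  required binary owner (UTF8);\n"
  + "  repeated binary ownerPhoneNumbers (UTF8);\n"
  + "  repeated group contacts {\n"
  + "    required binary name (UTF8);\n"
  + "    optional binary phoneNumber (UTF8);\n"
  + "  }\n"
  + "}");

// one column per leaf node: owner, ownerPhoneNumbers, contacts.name, contacts.phoneNumber
for (String[] path : schema.getPaths()) {
    System.out.println(String.join(".", path));
}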

Striping/Assembly Algorithm

Definition Level

Traversing from the root, when a node on a field's path is null, we record the current depth as that field's definition level.
Max definition level: derived from the schema; it is the definition level when no node on the field's path is null.
If the current definition level == the max definition level, the field actually has a value.

Repetition level

It records the depth at which the field's value repeats. Only repeated fields need a repetition level; optional and required fields do not. Repetition level = 0 marks the start of a new record.
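
A small worked example may help (my own, not from the article). For the column contacts.phoneNumber, the max definition level is 2 (repeated contacts and optional phoneNumber each contribute one level; required fields contribute none) and the max repetition level is 1 (only contacts is repeated on the path). Given two records:

record 1: contacts = [ { name: "A", phoneNumber: "555" }, { name: "B" } ]
record 2: contacts = []

the values stored for contacts.phoneNumber are:

"555"  R=0 D=2   (R=0: new record; D equals the max, so this is a real value)
NULL   R=1 D=1   (repetition at the contacts level; phoneNumber is missing)
NULL   R=0 D=0   (new record; contacts itself is empty)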

How It Is Stored

Parquet file
All data is split horizontally into row groups; a row group contains the column chunks of every column for that row range.
column chunk: stores the data of one column and consists of one or more pages.
Page: the unit of compression and encoding, transparent to the data model.

At the tail of the Parquet file sits the footer: the file's metadata and statistics.

Recommendations

row group size: a row group size of 1 GB is generally recommended, together with an HDFS block size of 1 GB and one block per HDFS file, to maximize the benefit of sequential IO.
data page size: 8 KB is recommended. Smaller pages give better read performance for fine-grained access such as single-row lookups; larger pages mean less space overhead and potentially lower parsing cost because there are fewer page headers. Note: for sequential scans it is not expected that an entire page is read in one go; a page is not the unit of IO.
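
As a hedged sketch (mine, not from the article) of where these knobs live: both sizes are plain Hadoop configuration properties, the same ones read back by getLongBlockSize(conf) and getPageSize(conf) in the code traced below.

import org.apache.hadoop.mapred.JobConf;

JobConf jobConf = new JobConf();
// row group target size ("block size" in Parquet's configuration terms): 1 GB
jobConf.setLong("parquet.block.size", 1024L * 1024 * 1024);
// data page size: 8 KB
jobConf.setInt("parquet.page.size", 8 * 1024);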

Open Questions

How are multiple columns matched back to the same row?
HBase suffers from skewed column distribution: one column can hold far more data than another, leaving many wasted blocks. Presumably Parquet has the same problem; if so, there would be many empty column values standing in for missing data.

org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

MapredParquetOutputFormat writes objects into Parquet files. We use it to trace the questions above: its getHiveRecordWriter method returns the record-writing object that ultimately writes an object's fields to the file.

  /**
   *
   * Create the parquet schema from the hive schema, and return the RecordWriterWrapper which
   * contains the real output format
   */
  @Override
  public org.apache.hadoop.hive.ql.exec.FileSinkOperator.RecordWriter getHiveRecordWriter(
      final JobConf jobConf,
      final Path finalOutPath,
      final Class<? extends Writable> valueClass,
      final boolean isCompressed,
      final Properties tableProperties,
      final Progressable progress) throws IOException {

    LOG.info("creating new record writer..." + this);

    final String columnNameProperty = tableProperties.getProperty(IOConstants.COLUMNS);
    final String columnTypeProperty = tableProperties.getProperty(IOConstants.COLUMNS_TYPES);
    List<String> columnNames;
    List<TypeInfo> columnTypes;

    if (columnNameProperty.length() == 0) {
      columnNames = new ArrayList<String>();
    } else {
      columnNames = Arrays.asList(columnNameProperty.split(","));
    }

    if (columnTypeProperty.length() == 0) {
      columnTypes = new ArrayList<TypeInfo>();
    } else {
      columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
    }
    // build the Parquet schema from the column names and types
    DataWritableWriteSupport.setSchema(HiveSchemaConverter.convert(columnNames, columnTypes), jobConf);
    // the class constructor created realOutputFormat = new ParquetOutputFormat<ParquetHiveRecord>(new DataWritableWriteSupport());
    return getParquerRecordWriterWrapper(realOutputFormat, jobConf, finalOutPath.toString(),
            progress,tableProperties);
  }
  
  protected ParquetRecordWriterWrapper getParquerRecordWriterWrapper(
      ParquetOutputFormat<ParquetHiveRecord> realOutputFormat,
      JobConf jobConf,
      String finalOutPath,
      Progressable progress,
      Properties tableProperties
      ) throws IOException {
    return new ParquetRecordWriterWrapper(realOutputFormat, jobConf, finalOutPath.toString(),
            progress,tableProperties);
  }
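
For context, here is a hedged example (mine, not from the article) of the table properties Hive passes in. IOConstants.COLUMNS and IOConstants.COLUMNS_TYPES resolve to the keys "columns" and "columns.types", and the type strings use Hive's type syntax:

import java.util.Properties;

Properties tableProperties = new Properties();
tableProperties.setProperty("columns", "owner,ownerphonenumbers");
tableProperties.setProperty("columns.types", "string,array<string>");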

ParquetRecordWriterWrapper is a wrapper class around the RecordWriter.

  public ParquetRecordWriterWrapper(
      final OutputFormat<Void, ParquetHiveRecord> realOutputFormat,
      final JobConf jobConf,
      final String name,  //finalOutPath
      final Progressable progress, Properties tableProperties) throws
          IOException {
    try {
      // create a TaskInputOutputContext
      TaskAttemptID taskAttemptID = TaskAttemptID.forName(jobConf.get("mapred.task.id"));
      if (taskAttemptID == null) {
        taskAttemptID = new TaskAttemptID();
      }
      taskContext = ContextUtil.newTaskAttemptContext(jobConf, taskAttemptID);
      // initialize blockSize, ENABLE_DICTIONARY and COMPRESSION from tableProperties; if tableProperties has no matching entry, nothing is done
      LOG.info("initialize serde with table properties.");
      initializeSerProperties(taskContext, tableProperties);

      LOG.info("creating real writer to write at " + name);
      // the real writer
      realWriter =
              ((ParquetOutputFormat) realOutputFormat).getRecordWriter(taskContext, new Path(name));

      LOG.info("real writer: " + realWriter);
    } catch (final InterruptedException e) {
      throw new IOException(e);
    }
  }

Question

What is blockSize?

Step into ParquetOutputFormat.getRecordWriter (invoked from the wrapper above) to see how realWriter is created.

public RecordWriter<Void, T> getRecordWriter(Configuration conf, Path file, CompressionCodecName codec) throws IOException, InterruptedException {
    WriteSupport<T> writeSupport = this.getWriteSupport(conf);  // writeSupport is the key piece: it reads parquet.write.support.class from the configuration and creates an instance via reflection
    CodecFactory codecFactory = new CodecFactory(conf);
    long blockSize = getLongBlockSize(conf);
    int pageSize = getPageSize(conf);
    int dictionaryPageSize = getDictionaryPageSize(conf);       
    boolean enableDictionary = getEnableDictionary(conf);       
    boolean validating = getValidation(conf);         
    WriterVersion writerVersion = getWriterVersion(conf);
    // build the write context from the conf and the schema
    WriteContext init = writeSupport.init(conf);
    // file is the finalOutPath
    ParquetFileWriter w = new ParquetFileWriter(conf, init.getSchema(), file);
    w.start();
    float maxLoad = conf.getFloat("parquet.memory.pool.ratio", 0.95F);
    long minAllocation = conf.getLong("parquet.memory.min.chunk.size", 1048576L);
    if (memoryManager == null) {
        memoryManager = new MemoryManager(maxLoad, minAllocation);
    } else if (memoryManager.getMemoryPoolRatio() != maxLoad) {
        LOG.warn("The configuration parquet.memory.pool.ratio has been set. It should not be reset by the new value: " + maxLoad);
    }
    
    //realWriter
    return new ParquetRecordWriter(w, writeSupport, init.getSchema(), init.getExtraMetaData(), blockSize, pageSize, codecFactory.getCompressor(codec, pageSize), dictionaryPageSize, enableDictionary, validating, writerVersion, memoryManager);
}

ParquetFileWriter, which operates on the actual file in the filesystem:

public ParquetFileWriter(Configuration configuration, MessageType schema, Path file, ParquetFileWriter.Mode mode) throws IOException {
    this.blocks = new ArrayList();
    this.state = ParquetFileWriter.STATE.NOT_STARTED;  // the writer's initial state
    this.schema = schema;
    FileSystem fs = file.getFileSystem(configuration);
    boolean overwriteFlag = mode == ParquetFileWriter.Mode.OVERWRITE;
    this.out = fs.create(file, overwriteFlag);  // create the FSDataOutputStream for the given file
}   

The real, final writer: ParquetRecordWriter

public ParquetRecordWriter(ParquetFileWriter w, WriteSupport<T> writeSupport, MessageType schema, Map<String, String> extraMetaData, long blockSize, int pageSize, BytesCompressor compressor, int dictionaryPageSize, boolean enableDictionary, boolean validating, WriterVersion writerVersion, MemoryManager memoryManager) {
    this.internalWriter = new InternalParquetRecordWriter(w, writeSupport, schema, extraMetaData, blockSize, pageSize, compressor, dictionaryPageSize, enableDictionary, validating, writerVersion);
    this.memoryManager = (MemoryManager)Preconditions.checkNotNull(memoryManager, "memoryManager");
    memoryManager.addWriter(this.internalWriter, blockSize);
}   
  

ParquetRecordWriterWrapper writes a single record, i.e. one row:

  @Override
  public void write(final Void key, final ParquetHiveRecord value) throws IOException {
    try {
      realWriter.write(key, value);
    } catch (final InterruptedException e) {
      throw new IOException(e);
    }
  }
  

This calls ParquetRecordWriter.write, which ultimately delegates to InternalParquetRecordWriter.write:

public void write(Void key, T value) throws IOException, InterruptedException {
    this.internalWriter.write(value);
}     

public void write(T value) throws IOException, InterruptedException {
    this.writeSupport.write(value);
    ++this.recordCount;
    this.checkBlockSizeReached();
}
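
checkBlockSizeReached is where blockSize comes back into play. As far as I understand it (a simplified sketch, not the actual source; getBufferedSize stands in for the real accounting logic), it compares the memory buffered in the column store against the target row group size and flushes a complete row group once the threshold is reached:

// simplified sketch of the idea, not the real InternalParquetRecordWriter code
private void checkBlockSizeReached() throws IOException {
    long buffered = columnStore.getBufferedSize();  // hypothetical: size of buffered column data
    if (buffered >= blockSize) {
        flushRowGroupToStore();  // write the buffered column chunks out as one row group
    }
}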

Since writeSupport is created via reflection, we continue through the code with GroupWriteSupport (Spark ships its own WriteSupport: ParquetWriteSupport).

public void prepareForWrite(RecordConsumer recordConsumer) {
    this.groupWriter = new GroupWriter(recordConsumer, this.schema);
}

public void write(Group record) {
    this.groupWriter.write(record);
}

public class GroupWriter {
    private final RecordConsumer recordConsumer;
    private final GroupType schema;

    public GroupWriter(RecordConsumer recordConsumer, GroupType schema) {
        this.recordConsumer = recordConsumer;  // a MessageColumnIORecordConsumer instance
        this.schema = schema;
    }

    public void write(Group group) {
        // mark the start of the record and set up some bookkeeping
        this.recordConsumer.startMessage();
        // write the row
        this.writeGroup(group, this.schema);
        // mark the end of the record and finish the bookkeeping after the data is written
        this.recordConsumer.endMessage();
    }

    private void writeGroup(Group group, GroupType type) {
        int fieldCount = type.getFieldCount();
        // iterate over all fields; each field corresponds to a column
        for(int field = 0; field < fieldCount; ++field) {
            int valueCount = group.getFieldRepetitionCount(field);
            // number of present values; 0 means null, so nothing needs to be written here
            if (valueCount > 0) {
                Type fieldType = type.getType(field);
                String fieldName = fieldType.getName();
                // mark the start of the field
                this.recordConsumer.startField(fieldName, field);

                for(int index = 0; index < valueCount; ++index) {
                    if (fieldType.isPrimitive()) {
                        // write a primitive value
                        group.writeValue(field, index, this.recordConsumer);
                    } else {
                        // nested structure: recurse
                        this.recordConsumer.startGroup();
                        this.writeGroup(group.getGroup(field, index), fieldType.asGroupType());
                        this.recordConsumer.endGroup();
                    }
                }
                // mark the end of the field
                this.recordConsumer.endField(fieldName, field);
            }
        }

    }
}
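
To close the loop, here is a hedged end-to-end sketch (mine, using parquet-hadoop's example classes rather than the Hive path above) that writes one AddressBook record as a Group; the output path is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

Configuration conf = new Configuration();
MessageType schema = MessageTypeParser.parseMessageType(
    "message AddressBook { "
  + "  required binary owner (UTF8); "
  + "  repeated binary ownerPhoneNumbers (UTF8); "
  + "  repeated group contacts { "
  + "    required binary name (UTF8); "
  + "    optional binary phoneNumber (UTF8); "
  + "  } "
  + "}");
GroupWriteSupport.setSchema(schema, conf);

SimpleGroupFactory factory = new SimpleGroupFactory(schema);
Group record = factory.newGroup()
    .append("owner", "Julien")
    .append("ownerPhoneNumbers", "555-1234");
record.addGroup("contacts").append("name", "Dmitriy");  // optional phoneNumber left unset

try (ParquetWriter<Group> writer =
         ExampleParquetWriter.builder(new Path("/tmp/addressbook.parquet"))
             .withConf(conf)
             .withType(schema)
             .build()) {
    writer.write(record);  // goes through GroupWriteSupport -> GroupWriter.write as traced above
}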

Reposted from www.cnblogs.com/windliu/p/10942252.html