Lucene docValue source code reading (5): reading BinaryDocValues

Reading BinaryDocValues works much like reading NumericDocValues: when a new segment is opened, the meta file (the index to the data file) is read first and kept in memory, and the data file is only touched later, when a doc's docValue is actually needed. Let's look at how the meta file is read, in Lucene410DocValuesProducer.readBinaryEntry(IndexInput):

static BinaryEntry readBinaryEntry(IndexInput meta) throws IOException {
	BinaryEntry entry = new BinaryEntry();
	entry.format = meta.readVInt();//storage format
	entry.missingOffset = meta.readLong();//fp of the bitset recording which docs actually have a value
	entry.minLength = meta.readVInt();//minimum byte[] length
	entry.maxLength = meta.readVInt();//maximum byte[] length
	entry.count = meta.readVLong();//total number of docs
	entry.offset = meta.readLong();//fp of the part of the data file that actually stores the docValues
	switch (entry.format) {
	case BINARY_FIXED_UNCOMPRESSED://uncompressed and all byte[] have the same length, so the start position is found directly as id * minLength
		break;
	case BINARY_PREFIX_COMPRESSED://not used here, not covered in this post
		entry.addressesOffset = meta.readLong();
		entry.packedIntsVersion = meta.readVInt();
		entry.blockSize = meta.readVInt();
		entry.reverseIndexOffset = meta.readLong();
		break;
	case BINARY_VARIABLE_UNCOMPRESSED:
		entry.addressesOffset = meta.readLong();//fp of the block that records the start offset of each doc's byte[]
		entry.packedIntsVersion = meta.readVInt();
		entry.blockSize = meta.readVInt();//size of one block, because the offsets are stored block by block
		break;
	default:
		throw new CorruptIndexException("Unknown format: " + entry.format + ", input=" + meta);
	}
	return entry;
}
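
These entries are cached in memory, one per binary field, keyed by the field number (which is what binaries.get(field.number) looks up later). A simplified sketch of the loop that builds that map when the producer is opened (condensed from the producer's field-reading code; the real loop also consumes the numeric and sorted entry types):

Map<Integer, BinaryEntry> binaries = new HashMap<>();
int fieldNumber = meta.readVInt();//number of the field the next entry belongs to
while (fieldNumber != -1) {//-1 marks the end of the meta file
	byte type = meta.readByte();//docValues type of this field
	if (type == Lucene410DocValuesFormat.BINARY) {
		binaries.put(fieldNumber, readBinaryEntry(meta));
	}
	fieldNumber = meta.readVInt();
}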

 With the meta file in memory, let's see how the docValue of each doc is read. The code is in Lucene410DocValuesProducer.getBinary(FieldInfo):

public BinaryDocValues getBinary(FieldInfo field) throws IOException {
	BinaryEntry bytes = binaries.get(field.number);
	switch (bytes.format) {
	
	case BINARY_FIXED_UNCOMPRESSED:
		return getFixedBinary(field, bytes);//format where every byte[] has the same length
	case BINARY_VARIABLE_UNCOMPRESSED:
		return getVariableBinary(field, bytes);//format where the byte[] lengths differ
	case BINARY_PREFIX_COMPRESSED://not used here
		return getCompressedBinary(field, bytes);
	default:
		throw new AssertionError();
	}
}
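
For context, this is roughly how caller code reaches these values through the Lucene 4.10 reader API; the index path and the field name "title" below are just placeholders for this sketch:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class ReadBinaryDocValuesExample {
	public static void main(String[] args) throws IOException {
		try (Directory dir = FSDirectory.open(new File("/path/to/index"));//path is illustrative
				DirectoryReader reader = DirectoryReader.open(dir)) {
			for (AtomicReaderContext leaf : reader.leaves()) {//per-segment readers
				BinaryDocValues dv = leaf.reader().getBinaryDocValues("title");//"title" is a made-up field
				if (dv == null) continue;//this segment has no binary docValues for the field
				for (int docId = 0; docId < leaf.reader().maxDoc(); docId++) {
					BytesRef value = dv.get(docId);//an empty BytesRef when the doc has no value
					System.out.println(docId + " -> " + value.utf8ToString());
				}
			}
		}
	}
}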

 Let's look at these one by one, starting with the case where all byte[] have the same length:

private BinaryDocValues getFixedBinary(FieldInfo field, final BinaryEntry bytes) throws IOException {
	// Since every byte[] has the same, known length, the start offset is simply id times that length. In this format there are no missing docs, i.e. every doc has a value, so count is the number of values.
	final IndexInput data = this.data.slice("fixed-binary", bytes.offset, bytes.count * bytes.maxLength);//slice of the data file starting at the recorded offset
	final BytesRef term = new BytesRef(bytes.maxLength);
	final byte[] buffer = term.bytes;//the bytes are read into this buffer
	final int length = term.length = bytes.maxLength;
	return new LongBinaryDocValues() {
		@Override
		public BytesRef get(long id) {
			try {
				data.seek(id * length);//seek to the start position
				data.readBytes(buffer, 0, buffer.length);//read one byte[]
				return term;
			} catch (IOException e) {
				throw new RuntimeException(e);
			}
		}
	};
}
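
Before moving on, here is a tiny self-contained illustration of this fixed-length addressing (the data is made up, not read from a real .dvd file):

import java.nio.charset.StandardCharsets;

public class FixedLengthDemo {
	public static void main(String[] args) {
		//all values concatenated back to back, each exactly `length` bytes,
		//so value id starts at offset id * length
		byte[] data = "aaabbbccc".getBytes(StandardCharsets.UTF_8);//3 values of length 3
		int length = 3;
		int id = 1;//we want the second value
		byte[] value = new byte[length];
		System.arraycopy(data, id * length, value, 0, length);
		System.out.println(new String(value, StandardCharsets.UTF_8));//prints "bbb"
	}
}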

So when all byte[] have the same length the scheme is easy to understand: the start offset is just id times the per-value length, and that length is known because the minimum and maximum lengths are saved in the meta. Now look at the case where the lengths differ:

private BinaryDocValues getVariableBinary(FieldInfo field, final BinaryEntry bytes) throws IOException {

	//Read the addresses. In the variable-length case the start offset of each doc's byte[] is stored as well (in blocks, as a monotonic packed structure, rather than one plain value per doc; we won't dig into that here - what matters is that the returned reader gives us the start offset of each doc's byte[]).
	final MonotonicBlockPackedReader addresses = getAddressInstance(field, bytes);
	//Slice covering all the values, i.e. the concatenated byte[] of all docs.
	final IndexInput data = this.data.slice("var-binary", bytes.offset, bytes.addressesOffset - bytes.offset);
	final BytesRef term = new BytesRef(Math.max(0, bytes.maxLength));
	final byte buffer[] = term.bytes;

	return new LongBinaryDocValues() {
		@Override
		public BytesRef get(long id) {
			long startAddress = addresses.get(id);//start offset of the current doc's byte[]
			long endAddress = addresses.get(id + 1);//end offset of the current doc's byte[]
			int length = (int) (endAddress - startAddress);//length; if a doc has no value, the next doc starts at the same offset, so this is 0
			try {
				data.seek(startAddress);
				data.readBytes(buffer, 0, length);
				term.length = length;
				return term;//return the byte[] just read
			} catch (IOException e) {
				throw new RuntimeException(e);
			}
		}
	};
}
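
The same addressing idea with variable lengths, as a self-contained sketch: a plain long[] stands in here for the MonotonicBlockPackedReader, holding the start offset of each doc's byte[], so the length of doc id is addresses[id + 1] - addresses[id], and a doc without a value simply gets a zero-length slice:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class VariableLengthDemo {
	public static void main(String[] args) {
		//concatenated values of doc 0..3: "a", "bbb", "" (doc 2 has no value), "cc"
		byte[] data = "abbbcc".getBytes(StandardCharsets.UTF_8);
		//addresses[id] = start offset of doc id; the extra last entry closes the final doc
		long[] addresses = {0, 1, 4, 4, 6};
		for (int id = 0; id < 4; id++) {
			int start = (int) addresses[id];
			int length = (int) (addresses[id + 1] - addresses[id]);//0 when the doc has no value
			String value = new String(Arrays.copyOfRange(data, start, start + length), StandardCharsets.UTF_8);
			System.out.println("doc " + id + " -> \"" + value + "\"");
		}
	}
}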

That is how variable-length values are read: the start offset is looked up by id from the addresses, and the bytes are then read at that offset from the big concatenated byte[]. It is not as efficient as the fixed-length case, because it needs two reads (the addresses, then the data) where the fixed-length case needs only one. Note also that get returns a BytesRef for every doc, even one without a value, so to tell whether a doc really has a binary docValue you have to consult the bitset that records which docs have values.
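
A minimal sketch of that check against the Lucene 4.10 reader API (the field name is again just a placeholder): getDocsWithField returns a Bits that tells a missing value apart from a genuinely empty byte[]:

import java.io.IOException;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

public class MissingValueCheck {
	//returns null when the doc has no value at all, as opposed to an empty byte[]
	static BytesRef valueOrNull(AtomicReader reader, String field, int docId) throws IOException {
		BinaryDocValues dv = reader.getBinaryDocValues(field);
		Bits docsWithField = reader.getDocsWithField(field);//which docs actually have a value
		if (dv == null || docsWithField == null || !docsWithField.get(docId)) {
			return null;//missing, not just zero-length
		}
		return dv.get(docId);
	}
}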

 
