Source code interpretation of docValues in Lucene (4): the writing of BinaryDocValues

BinaryDocValues stores one byte[] per document, so it can hold anything representable as a byte[]: strings, serialized objects, even small images. We will not dwell on its usage scenarios here; the interesting part is how Lucene stores it. As with the numeric case, the value is added in the DefaultIndexingChain.indexDocValue method and is buffered in memory first. The class responsible for the in-memory stage is BinaryDocValuesWriter, whose constructor looks like this:

public BinaryDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
	this.fieldInfo = fieldInfo;// the field being added
	this.bytes = new PagedBytes(BLOCK_BITS);// container holding all the byte[] values; it is packed, which saves memory, and can be thought of as one big byte[]
	this.bytesOut = bytes.getDataOutput();// the write entry point for the container above
	this.lengths = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);// records the byte[] length of each doc; since every doc's byte[] ends up in one big byte[], the per-doc lengths are needed to find the boundaries
	this.iwBytesUsed = iwBytesUsed;// tracks memory used; can be ignored here
	this.docsWithField = new FixedBitSet(64);// bitset recording which docIDs have a value for this field
	this.bytesUsed = docsWithFieldBytesUsed();
	iwBytesUsed.addAndGet(bytesUsed);
}

 Let's see how a value is added:

public void addValue(int docID, BytesRef value) {
	. . . // validation checks omitted
	// Fill in any holes:
	while (addedValues < docID) {// fill holes: some docs have no byte[] for this field, so a length of 0 is recorded for them. (Arguably this is unnecessary: a small change to the iterator below could avoid it; see the note in the iterator)
		addedValues++;
		lengths.add(0);
	}
	addedValues++;
	lengths.add(value.length);// record the size of the byte[] being added
	try {
		bytesOut.writeBytes(value.bytes, value.offset, value.length);// write the byte[] into memory
	} catch (IOException ioe) {
		// Should never happen!
		throw new RuntimeException(ioe);
	}
	docsWithField = FixedBitSet.ensureCapacity(docsWithField, docID);
	docsWithField.set(docID);// record this docID as having a value for this field
	updateBytesUsed();
}
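The bookkeeping above can be mimicked with plain JDK types. The sketch below is a hypothetical, simplified stand-in (class and field names are mine, a ByteArrayOutputStream replaces PagedBytes and a List&lt;Integer&gt; replaces PackedLongValues), showing the same three pieces of state: one length entry per docID with 0 for holes, the concatenated bytes, and a bitset of docs that actually have a value.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Simplified stand-in for BinaryDocValuesWriter: same bookkeeping as
// addValue above, but with plain JDK collections.
class SimpleBinaryDVWriter {
    final ByteArrayOutputStream bytesOut = new ByteArrayOutputStream(); // all byte[] concatenated
    final List<Integer> lengths = new ArrayList<>();                    // per-doc length, 0 = hole
    final BitSet docsWithField = new BitSet();                          // which docIDs have a value
    int addedValues = 0;

    void addValue(int docID, byte[] value) {
        while (addedValues < docID) { // fill holes for docs with no value
            addedValues++;
            lengths.add(0);
        }
        addedValues++;
        lengths.add(value.length);
        bytesOut.write(value, 0, value.length); // append to the big byte[]
        docsWithField.set(docID);
    }
}
```

Adding values for doc 0 and doc 2 leaves a zero-length hole for doc 1 in the lengths list, which is exactly the state the real writer hands to its iterator at flush time.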

 Through the method above, each doc's byte[] has been buffered in memory. Now let's look at what happens on flush:

public void flush(SegmentWriteState state, DocValuesConsumer dvConsumer) throws IOException {
	final int maxDoc = state.segmentInfo.getDocCount();// total number of docs in the current segment
	bytes.freeze(false);
	final PackedLongValues lengths = this.lengths.build();
	dvConsumer.addBinaryField(fieldInfo, new Iterable<BytesRef>() {// the second argument is an Iterable whose iterator() produces a BytesIterator from the doc count and the per-doc lengths
		public Iterator<BytesRef> iterator() {
			return new BytesIterator(maxDoc, lengths);
		}
	});
}
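One detail worth noticing: flush hands the consumer an Iterable, not a bare Iterator. The reason is visible in addBinaryField further below, which walks the values more than once (once to write the bytes, again inside writeMissingBitset, and again to build the addresses), and each walk needs a fresh iterator. A minimal, self-contained sketch of the pattern (hypothetical names, plain byte[] instead of BytesRef):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Why flush passes an Iterable instead of an Iterator: the consumer
// makes several independent passes over the values, and each call to
// iterator() must restart from the first doc.
class MultiPassDemo {
    static long totalLengthOverTwoPasses(Iterable<byte[]> values) {
        long total = 0;
        for (int pass = 0; pass < 2; pass++) {    // two independent walks
            for (byte[] v : values) {             // fresh iterator per pass
                if (v != null) total += v.length; // null = doc with no value
            }
        }
        return total;
    }
}
```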

 Take a look at the iterator: BytesIterator's next method returns, doc by doc, the byte[] values to be stored in the index.

public BytesRef next() {
	if (!hasNext()) {
		throw new NoSuchElementException();
	}
	final BytesRef v;
	if (upto < size) {
		int length = (int) lengthsIterator.next();// the length of this doc's byte[]
		value.grow(length);
		value.setLength(length);
		try {
			bytesIterator.readBytes(value.bytes(), 0, value.length());// read that many bytes from the big byte[] into value
		} catch (IOException ioe) {
			// Should never happen!
			throw new RuntimeException(ioe);
		}
		if (docsWithField.get(upto)) {// if this docID has a value, return it. (If this check were moved before lengthsIterator.next(), docs without a value could return null immediately, and the hole-filling in addValue would be unnecessary)
			v = value.get();
		} else {// no value for this doc: return null
			v = null;
		}
	} else {
		v = null;
	}
	upto++;
	return v;
}
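The same walk can be sketched with plain JDK types. This is a hypothetical, simplified mimic of BytesIterator (names are mine; an int[] of lengths and a BitSet replace the packed structures): it produces exactly one entry per docID up to maxDoc, slicing the concatenated bytes by the recorded lengths and returning null for docs without a value.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Simplified stand-in for BytesIterator: one entry per docID,
// null for docs that have no value for this field.
class SimpleBytesIterator implements Iterator<byte[]> {
    final byte[] data;           // all values concatenated
    final int[] lengths;         // one entry per added doc (0 for holes)
    final BitSet docsWithField;  // which docIDs have a value
    final int maxDoc;
    int upto = 0;                // current docID
    int pos = 0;                 // read position inside data

    SimpleBytesIterator(byte[] data, int[] lengths, BitSet docsWithField, int maxDoc) {
        this.data = data; this.lengths = lengths;
        this.docsWithField = docsWithField; this.maxDoc = maxDoc;
    }

    public boolean hasNext() { return upto < maxDoc; }

    public byte[] next() {
        if (!hasNext()) throw new NoSuchElementException();
        byte[] v = null;
        if (upto < lengths.length) {
            int length = lengths[upto];
            byte[] slice = Arrays.copyOfRange(data, pos, pos + length);
            pos += length;
            if (docsWithField.get(upto)) v = slice; // doc has a value
        }
        upto++;
        return v; // null for holes and for docs beyond the last added value
    }
}
```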

 From this method we can conclude that the iterator returns one entry for every doc, null when the doc has no value; reading each doc's byte[] is implemented by combining the object that records the per-doc lengths with the object that holds all the bytes. Now let's look at the final flush step, i.e. how the DocValuesConsumer uses this iterator. The Lucene410DocValuesConsumer used in 4.10.4 is the same class as for numeric docValues; the only difference is that the method called is addBinaryField:

@Override
public void addBinaryField(FieldInfo field, Iterable<BytesRef> values) throws IOException {
	meta.writeVInt(field.number);// write the field number; meta plays the same role as for numeric docValues: it is the index file over the data file
	meta.writeByte(Lucene410DocValuesFormat.BINARY);
	int minLength = Integer.MAX_VALUE;
	int maxLength = Integer.MIN_VALUE;
	final long startFP = data.getFilePointer();
	long count = 0;
	boolean missing = false;
	for (BytesRef v : values) {// loop over all docs' byte[]; non-null values are written to data
		final int length;
		if (v == null) {
			length = 0;
			missing = true;
		} else {
			length = v.length;
		}
		minLength = Math.min(minLength, length);
		maxLength = Math.max(maxLength, length);
		if (v != null) {
			data.writeBytes(v.bytes, v.offset, v.length);// append the bytes to data, so data becomes one big byte[] containing every doc's value back to back
		}
		count++;
	}
	
	// Two layouts: fixed-length when every byte[] has the same length, variable-length otherwise.
	meta.writeVInt(minLength == maxLength ? BINARY_FIXED_UNCOMPRESSED : BINARY_VARIABLE_UNCOMPRESSED);
	if (missing) {// some docs have no value: record the set of docIDs that do have one in data, just like for numeric docValues
		meta.writeLong(data.getFilePointer());// record in meta the fp at which that docID set starts, i.e. its index
		writeMissingBitset(values);
	} else {// otherwise write -1
		meta.writeLong(-1L);
	}
	meta.writeVInt(minLength);
	meta.writeVInt(maxLength);
	meta.writeVLong(count);// the number of docs; the comment in the source calls this the number of values written, but the loop above counts every doc, including those without a value
	meta.writeLong(startFP);// record the fp of data before any byte[] was written

	// if minLength == maxLength, it's a fixed-length byte[] and we are done (the addresses are implicit); otherwise we need to record the length fields. In other words, if all byte[] are the same length nothing more is needed, though in practice that is rarely the case
	if (minLength != maxLength) {// lengths differ: write each doc's start address, making it easy to locate each doc's slice inside the big byte[]
		meta.writeLong(data.getFilePointer());// record the current fp of data; the start addresses are written below
		meta.writeVInt(PackedInts.VERSION_CURRENT);
		meta.writeVInt(BLOCK_SIZE);

		final MonotonicBlockPackedWriter writer = new MonotonicBlockPackedWriter(data, BLOCK_SIZE);
		long addr = 0;
		writer.add(addr);
		for (BytesRef v : values) {
			if (v != null) {// accumulate each doc's start address: if a doc has a value, the next doc's start address differs from this one's, and the bytes in between are this doc's value
				addr += v.length;
			}
			writer.add(addr);
		}
		writer.finish();
	}
}
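The address table written in the variable-length branch is just a monotonically increasing sequence of start offsets, one per doc plus a trailing end offset. A hedged sketch (hypothetical class, plain long[] instead of MonotonicBlockPackedWriter) of how those addresses are accumulated from the iterated values, and how doc i's bytes are then sliced back out as data[addr[i], addr[i+1]):

```java
import java.util.Arrays;

// Sketch of the variable-length address table: the same addr
// accumulation as in addBinaryField above, kept in a plain long[].
class AddressTable {
    // values[i] == null means doc i has no value and contributes 0 bytes.
    static long[] build(byte[][] values) {
        long[] addrs = new long[values.length + 1];
        long addr = 0;
        addrs[0] = addr;
        for (int i = 0; i < values.length; i++) {
            if (values[i] != null) addr += values[i].length;
            addrs[i + 1] = addr;
        }
        return addrs;
    }

    // Slice doc i out of the concatenated data: an empty slice means
    // the doc had no value (or an empty one).
    static byte[] slice(byte[] data, long[] addrs, int doc) {
        return Arrays.copyOfRange(data, (int) addrs[doc], (int) addrs[doc + 1]);
    }
}
```

Note how a doc without a value produces two equal consecutive addresses, which is exactly the "start position of the next doc is the same as this doc's" situation described in the comment above.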

 In this way, each byte[] is written into the index.
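For the fixed-length case, no address table is written at all, because (as the comment in addBinaryField says) the addresses are implicit: doc i's slice is simply [i * length, (i + 1) * length). A minimal sketch of that implicit addressing (hypothetical class name, operating on an in-memory byte[] standing in for the data file):

```java
import java.util.Arrays;

// Fixed-length case (minLength == maxLength): doc i's bytes are found
// by arithmetic alone, so no per-doc addresses need to be stored.
class FixedLengthReader {
    static byte[] get(byte[] data, int length, int doc) {
        return Arrays.copyOfRange(data, doc * length, (doc + 1) * length);
    }
}
```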

To sum up, BinaryDocValues writes all the byte[] values to disk back to back, then writes the numbers recording each doc's byte[] length to disk as well, and records the fp of each part (i.e. its starting position, which can be understood as an index into the data file) in the meta file.

In fact, BinaryDocValues is simpler than NumericDocValues, because it has far fewer storage formats.
