Reading the docValues source code in Lucene (2): writing NumericDocValues

DocValues of every type are written while a document is being indexed, in the org.apache.lucene.index.DefaultIndexingChain.indexDocValue(PerField, DocValuesType, IndexableField) method. There are five types of docValues, and each one is handled by its own DocValuesWriter. This article covers the numeric type, which is a very simple and efficient way to store docValues.

case NUMERIC: // numeric docValues
if (fp.docValuesWriter == null) {
	fp.docValuesWriter = new NumericDocValuesWriter(fp.fieldInfo, bytesUsed, true);
}
// Add the value for the given docID; the value is always passed as a long
((NumericDocValuesWriter) fp.docValuesWriter).addValue(docID, field.numericValue().longValue());
break;

The writer used here is NumericDocValuesWriter. Let's take a look at the source code of this class.

Constructor:
public NumericDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed, boolean trackDocsWithField) {
	// Holds the actual docValues, that is, the numbers. As the name suggests, the buffer stores deltas and packs the numbers as they are appended, which keeps memory usage low.
	pending = new AppendingDeltaPackedLongBuffer(PackedInts.COMPACT);
	// Records which docIDs have a docValue for this field, as a bit set.
	docsWithField = trackDocsWithField ? new FixedBitSet(64) : null;
	// The remaining lines track memory usage: Lucene decides when to flush based on the memory used, so it has to be recorded.
	bytesUsed = pending.ramBytesUsed() + docsWithFieldBytesUsed();
	this.fieldInfo = fieldInfo;
	this.iwBytesUsed = iwBytesUsed;
	iwBytesUsed.addAndGet(bytesUsed);
}
/** Adds a doc value; this is where a value is actually buffered. */
public void addValue(int docID, long value) {
	if (docID < pending.size()) {
		throw new IllegalArgumentException("DocValuesField \"" + fieldInfo.name
				+ "\" appears more than once in this document (only one value is allowed per field)");
	}
	// Fill in any holes: flush iterates over every doc in order, so skipped docs are padded with MISSING to keep the iterator aligned with the docIDs
	for (int i = (int) pending.size(); i < docID; ++i) {
		pending.add(MISSING);
	}
	
	pending.add(value); // Buffer the value in pending
	if (docsWithField != null) { // docsWithField records which docIDs have a value; it is not null here
		docsWithField = FixedBitSet.ensureCapacity(docsWithField, docID); // Grow the bit set so it can hold docID
		docsWithField.set(docID); // Mark this doc as having a value
	}
	updateBytesUsed(); // Update the memory accounting
}
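
To make the hole filling concrete, here is a minimal sketch of the same bookkeeping. It is not Lucene code: the class name PendingValuesSketch is made up, a plain ArrayList stands in for AppendingDeltaPackedLongBuffer, and java.util.BitSet stands in for FixedBitSet.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for the in-memory state of NumericDocValuesWriter. */
final class PendingValuesSketch {
    static final long MISSING = 0L; // placeholder written into holes

    final List<Long> pending = new ArrayList<>(); // stands in for the packed buffer
    final BitSet docsWithField = new BitSet();    // stands in for FixedBitSet

    void addValue(int docID, long value) {
        if (docID < pending.size()) {
            throw new IllegalArgumentException("more than one value for doc " + docID);
        }
        // Fill holes so that position i in pending always belongs to doc i.
        for (int i = pending.size(); i < docID; i++) {
            pending.add(MISSING);
        }
        pending.add(value);
        docsWithField.set(docID);
    }

    public static void main(String[] args) {
        PendingValuesSketch w = new PendingValuesSketch();
        w.addValue(0, 7L);
        w.addValue(3, 0L); // a real 0, told apart from the holes by the bit set
        System.out.println(w.pending);       // [7, 0, 0, 0]
        System.out.println(w.docsWithField); // {0, 3}
    }
}
```

Docs 1 and 2 never received a value, so they hold the placeholder 0 and are absent from the bit set, while doc 3 holds a genuine 0 and is present in it.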

The addValue method above only buffers values in memory. Two things matter in it. The first is saving the value itself: it is always stored as a long, so even if we index an int or a float in Lucene, it is converted to a long. The second is recording which docIDs have a value, which is done with a bit set. Now that we have seen the in-memory side, let's look at what happens when the segment is flushed to disk, in flush:

public void flush(SegmentWriteState state, DocValuesConsumer dvConsumer) throws IOException {
	final int maxDoc = state.segmentInfo.getDocCount(); // Number of docs in this segment
	dvConsumer.addNumericField(fieldInfo, new Iterable<Number>() { // Hand the values to the given DocValuesConsumer as an Iterable
		@Override
		public Iterator<Number> iterator() {
			return new NumericIterator(maxDoc);
		}
	});
}

The iterator is an inner class. Here is the code:

private class NumericIterator implements Iterator<Number> {
	final AppendingDeltaPackedLongBuffer.Iterator iter = pending.iterator(); // Iterator over the buffer that holds all docValues; the packed buffer is used simply to keep memory usage low
	final int size = (int) pending.size(); // Number of entries buffered in pending (real values plus filled holes)
	final int maxDoc; // Number of docs in the segment
	int upto; // docID currently being processed
	NumericIterator(int maxDoc) {
		this.maxDoc = maxDoc;
	}
	@Override
	public boolean hasNext() { // Are there more docs to emit?
		return upto < maxDoc;
	}
	@Override
	public Number next() { // Called by the DocValuesConsumer to get the next value to store
		if (!hasNext()) {
			throw new NoSuchElementException();
		}
		Long value;
		if (upto < size) {
			long v = iter.next(); // Next entry from pending; there is always one, because holes were filled with a placeholder in addValue
			if (docsWithField == null || docsWithField.get(upto)) { // The doc has a value: return the real value
				value = v;
			} else {
				value = null; // No value: 0 was buffered, but the doc is not in docsWithField (the hole was filled only to keep this iteration aligned), so return null
			}
		} else {
			value = docsWithField != null ? null : MISSING;
		}
		upto++;
		return value;
	}

	@Override
	public void remove() {
		throw new UnsupportedOperationException();
	}
}

Neither of these two pieces of code is difficult: the values are buffered in memory and then handed over to be written to disk. How they are written depends on the DocValuesConsumer.addNumericField method.
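
The effect of NumericIterator can be summed up as "one entry per doc, null where the doc has no value". Below is a small sketch of that view, assuming docsWithField is always tracked; the class FlushViewSketch and the method valuesForFlush are made-up names, and a plain long[] replaces the packed buffer.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class FlushViewSketch {
    /** Returns one entry per doc in [0, maxDoc): the value, or null when the doc has none. */
    static List<Long> valuesForFlush(long[] pending, BitSet docsWithField, int maxDoc) {
        List<Long> out = new ArrayList<>(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (doc < pending.length && docsWithField.get(doc)) {
                out.add(pending[doc]);
            } else {
                out.add(null); // a filled hole, or a doc added after the last buffered value
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet docs = new BitSet();
        docs.set(0);
        docs.set(2);
        // Three buffered entries (doc 1 is a hole), five docs in the segment.
        System.out.println(valuesForFlush(new long[] {5L, 0L, 9L}, docs, 5));
        // [5, null, 9, null, null]
    }
}
```

The consumer therefore never sees the placeholder 0 of a hole; it sees null and decides for itself what to write.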

The implementation class I am reading is Lucene410DocValuesConsumer. First, its constructor:

public Lucene410DocValuesConsumer(SegmentWriteState state, String dataCodec, String dataExtension, String metaCodec,
		String metaExtension) throws IOException {
	boolean success = false;
	try {
		String dataName = IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix,dataExtension); // Name of the dvd file, such as _3_lucene410_0.dvd
		data = state.directory.createOutput(dataName, state.context); // This is the file that actually stores the docValues
		CodecUtil.writeHeader(data, dataCodec, Lucene410DocValuesFormat.VERSION_CURRENT); // Write the codec header (codec name and format version) to the data file
		String metaName = IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix,metaExtension); // Name of the index file for the data file. The data file is indexed too, so that a field can be found quickly, because docValues are stored by column.
		meta = state.directory.createOutput(metaName, state.context);
		CodecUtil.writeHeader(meta, metaCodec, Lucene410DocValuesFormat.VERSION_CURRENT);
		maxDoc = state.segmentInfo.getDocCount();
		success = true;
	} finally {
		if (!success) {
			IOUtils.closeWhileHandlingException(this);
		}
	}
}
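
The constructor opens two outputs, data and meta. How the two relate can be sketched as follows. This is only an illustration under simplifying assumptions: in-memory streams replace Lucene's IndexOutput, the class MetaDataFilesSketch and its record layout (field number, start offset, count) are invented for the example, and the real meta entries hold more than this.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class MetaDataFilesSketch {
    final ByteArrayOutputStream dataBytes = new ByteArrayOutputStream(); // plays the .dvd file
    final ByteArrayOutputStream metaBytes = new ByteArrayOutputStream(); // plays the .dvm file
    private final DataOutputStream data = new DataOutputStream(dataBytes);
    private final DataOutputStream meta = new DataOutputStream(metaBytes);

    /** Appends one field's values to data and records where they start in meta. */
    long addField(int fieldNumber, long[] values) {
        try {
            long startFp = data.size();   // offset of this field inside the data stream
            meta.writeInt(fieldNumber);
            meta.writeLong(startFp);
            meta.writeInt(values.length);
            for (long v : values) {
                data.writeLong(v);        // all fields share the same data stream
            }
            return startFp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MetaDataFilesSketch s = new MetaDataFilesSketch();
        System.out.println(s.addField(0, new long[] {1L, 2L})); // 0
        System.out.println(s.addField(1, new long[] {3L}));     // 16
    }
}
```

A reader that wants field 1 looks up its start offset in the small meta stream and jumps straight to byte 16 of the data stream, without scanning field 0.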

In Lucene 4.10, docValues use two files. One is the dvd file, which stores the actual docValues; within a segment, the docValues of all fields are stored in this one file. The other is the dvm file, which is the index file of the dvd file. What it indexes is very simple: the offset at which a given field starts in the dvd file (called fp below, short for file pointer), plus whatever else the specific storage format needs. Let's take a look at the code:

/** Adds all docValues of one field. field identifies the field, and values iterates over the docValue of every doc. */
@Override
public void addNumericField(FieldInfo field, Iterable<Number> values) throws IOException {
	addNumericField(field, values, true);
}

void addNumericField(FieldInfo field, Iterable<Number> values, boolean optimizeStorage) throws IOException {
	
	long count = 0;
	long minValue = Long.MAX_VALUE;
	long maxValue = Long.MIN_VALUE;
	long gcd = 0; // Greatest common divisor of the deltas; once it reaches 1, GCD compression is ruled out
	boolean missing = false; // Whether some doc has no docValue
	// TODO: more efficient?
	HashSet<Long> uniqueValues = null; // Distinct values; dropped once there are more than 256 of them
	if (optimizeStorage) {
		uniqueValues = new HashSet<>();
		for (Number nv : values) {
			final long v; // Value of the current doc
			if (nv == null) {
				v = 0;
				missing = true; // This doc has no value
			} else {
				v = nv.longValue();
			}
			if (gcd != 1) {
				if (v < Long.MIN_VALUE / 2 || v > Long.MAX_VALUE / 2) { // Values this large make the GCD unusable:
					// in that case v - minValue might overflow and make the GCD computation return
					// wrong results. Since these extreme values are unlikely, we just discard GCD computation for them
					gcd = 1;
				} else if (count != 0) { // minValue needs to be set first
					gcd = MathUtil.gcd(gcd, v - minValue);
				}
			}
			minValue = Math.min(minValue, v);
			maxValue = Math.max(maxValue, v);
			if (uniqueValues != null) {
				if (uniqueValues.add(v)) {
					if (uniqueValues.size() > 256) { // More than 256 distinct values: the table format is ruled out
						uniqueValues = null;
					}
				}
			}
			++count;
		}
	} else { // Not taken for plain numeric fields, which always pass optimizeStorage = true
		for (Number nv : values) {
			long v = nv.longValue();
			minValue = Math.min(minValue, v);
			maxValue = Math.max(maxValue, v);
			++count;
		}
	}

	final long delta = maxValue - minValue; // Range of the values (largest minus smallest)
	// Bits needed to store the largest delta. With delta encoding, every value is stored in this many bits.
	final int deltaBitsRequired = DirectWriter.unsignedBitsRequired(delta);
	// Bits needed per docValue when stored in TABLE_COMPRESSED format.
	final int tableBitsRequired = uniqueValues == null ? Integer.MAX_VALUE : DirectWriter.bitsRequired(uniqueValues.size() - 1);

	final int format;
	if (uniqueValues != null && tableBitsRequired < deltaBitsRequired) { // Few distinct values, and a table position is narrower than a delta
		format = TABLE_COMPRESSED; // Use table compression
	} else if (gcd != 0 && gcd != 1) { // A common divisor greater than 1 exists, so dividing the deltas by it should need fewer bits; the code still checks before choosing GCD compression
		final long gcdDelta = (maxValue - minValue) / gcd;
		final long gcdBitsRequired = DirectWriter.unsignedBitsRequired(gcdDelta);
		format = gcdBitsRequired < deltaBitsRequired ? GCD_COMPRESSED : DELTA_COMPRESSED; // Normally smaller, because the delta was divided by gcd
	} else {
		format = DELTA_COMPRESSED; // Otherwise store deltas from the minimum
	}
	}
	// Below, meta is the index file (dvm) and data is the data file (dvd)
	meta.writeVInt(field.number); // Write the field number to the index file
	meta.writeByte(Lucene410DocValuesFormat.NUMERIC); // Entry type: numeric
	meta.writeVInt(format); // Storage format chosen above
	if (missing) { // Some docs have no value, so this is recorded in data
		meta.writeLong(data.getFilePointer()); // Record the current fp of data in meta, that is, the offset where the bit set starts, so it can be found quickly
		writeMissingBitset(values); // Write to the data file a bit set of the docIDs that do have a value
	} else {
		meta.writeLong(-1L); // -1 means every doc has a value
	}
	meta.writeLong(data.getFilePointer()); // Record the fp of data again: the missing bit set may have been written first, and this is where the numbers themselves start
	meta.writeVLong(count);
	
	switch (format) { // The format was already written to meta, so the reader knows how to decode
	case GCD_COMPRESSED: // Based on the greatest common divisor
		meta.writeLong(minValue); // Record the minimum value
		meta.writeLong(gcd); // Record the greatest common divisor
		final long maxDelta = (maxValue - minValue) / gcd;
		final int bits = DirectWriter.unsignedBitsRequired(maxDelta); // Bits needed per stored quotient
		meta.writeVInt(bits); // Needed for decoding: DirectWriter packs the values into this many bits each. The packing details can be ignored here.
		final DirectWriter quotientWriter = DirectWriter.getInstance(data, count, bits);
		for (Number nv : values) {
			long value = nv == null ? 0 : nv.longValue(); // Docs without a value are written as 0. This does no harm, because the docIDs that do have a value are recorded in the bit set in data. The placeholder keeps every doc at a fixed position, so values can be read quickly by docID.
			quotientWriter.add((value - minValue) / gcd);
		}
		quotientWriter.finish();
		break;
	case DELTA_COMPRESSED: // Based on deltas; this can be seen as GCD_COMPRESSED with a greatest common divisor of 1
		final long minDelta = delta < 0 ? 0 : minValue;
		meta.writeLong(minDelta);
		meta.writeVInt(deltaBitsRequired);
		final DirectWriter writer = DirectWriter.getInstance(data, count, deltaBitsRequired);
		for (Number nv : values) {
			long v = nv == null ? 0 : nv.longValue(); // Docs without a value are written as 0 here too; the bit set in data tells them apart from real zeros
			writer.add(v - minDelta);
		}
		writer.finish();
		break;
	case TABLE_COMPRESSED: // The table makes the meta file larger, so this is only used when there are few distinct values
		final Long[] decode = uniqueValues.toArray(new Long[uniqueValues.size()]);
		Arrays.sort(decode); // Sort the distinct values in ascending order
		final HashMap<Long, Integer> encode = new HashMap<>();
		meta.writeVInt(decode.length);
		for (int i = 0; i < decode.length; i++) {
			meta.writeLong(decode[i]); // Write each distinct value to the meta file, always as a long
			encode.put(decode[i], i); // Map each value to its position in the sorted table, for example 100 -> 10 and 101 -> 11
		}
		meta.writeVInt(tableBitsRequired); // Needed for decoding
		final DirectWriter ordsWriter = DirectWriter.getInstance(data, count, tableBitsRequired);
		for (Number nv : values) {
			ordsWriter.add(encode.get(nv == null ? 0 : nv.longValue())); // The data file stores each doc's position in the sorted table from meta; docs without a value again use the entry for 0
		}
		ordsWriter.finish();
		break;
	default:
		throw new AssertionError();
	}
	meta.writeLong(data.getFilePointer()); // Write the end position: a numeric field is read back as a slice of the data file, so the reader needs both the start and the end
}

From the code above we can see that numeric docValues have three storage formats: one based on the greatest common divisor, one based on deltas (which can be regarded as a special case of the first, with a divisor of 1), and one based on a lookup table.
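
The choice between the three formats can be replayed outside Lucene. The sketch below follows the same decision order as addNumericField, under stated simplifications: FormatChoiceSketch is a made-up class, bitsRequired is a plain bit count (the real DirectWriter may round up to a width it supports), the overflow guard for extreme values is left out, and the input is assumed to be non-empty.

```java
import java.util.HashSet;
import java.util.Set;

final class FormatChoiceSketch {
    enum Format { DELTA_COMPRESSED, GCD_COMPRESSED, TABLE_COMPRESSED }

    // Plain bit count of a non-negative number (simplification, see lead-in).
    static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    static long gcdOf(long a, long b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** Picks a format for a non-empty array of moderate values. */
    static Format choose(long[] values) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long gcd = 0;
        Set<Long> unique = new HashSet<>();
        for (int i = 0; i < values.length; i++) {
            long v = values[i];
            if (i > 0) {
                gcd = gcdOf(gcd, v - min); // min must already hold a real value
            }
            min = Math.min(min, v);
            max = Math.max(max, v);
            if (unique != null && unique.add(v) && unique.size() > 256) {
                unique = null; // too many distinct values for a table
            }
        }
        long delta = max - min;
        int deltaBits = bitsRequired(delta);
        int tableBits = unique == null ? Integer.MAX_VALUE : bitsRequired(unique.size() - 1);
        if (unique != null && tableBits < deltaBits) {
            return Format.TABLE_COMPRESSED;
        }
        if (gcd != 0 && gcd != 1) {
            return bitsRequired(delta / gcd) < deltaBits
                    ? Format.GCD_COMPRESSED : Format.DELTA_COMPRESSED;
        }
        return Format.DELTA_COMPRESSED;
    }

    public static void main(String[] args) {
        long[] thousands = new long[300];
        long[] counter = new long[300];
        for (int i = 0; i < 300; i++) {
            thousands[i] = i * 1000L;
            counter[i] = i;
        }
        System.out.println(choose(new long[] {10, 20, 10, 20})); // TABLE_COMPRESSED
        System.out.println(choose(thousands));                   // GCD_COMPRESSED
        System.out.println(choose(counter));                     // DELTA_COMPRESSED
    }
}
```

In this sketch, as long as a field has at most 256 distinct values and they are not all identical, the table wins, since a table position needs fewer bits than the delta; GCD compression only shows up for fields with many distinct values that share a divisor, such as the multiples of 1000 above.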

With GCD compression, the minimum value and the greatest common divisor are recorded in the meta file, and the data file stores, for each doc, its value minus the minimum, divided by the greatest common divisor, so the stored numbers are much smaller. Delta compression works the same way, except that the divisor is 1. Table compression is special and its use is limited: it only applies when the number of distinct docValues is small. The distinct values are sorted and written to the meta file (that is, the index file), and the data file stores, for each doc, the position of its value in that sorted list. This makes the data file much smaller and lookups fast. One more thing to note: if a doc has no value, 0 is written in its place.
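
A small worked example may help here. The sketch below encodes the same four values both ways and decodes them again. NumericEncodingSketch and its methods are made-up names, no bit packing is done, and the table lookup uses a binary search where Lucene's code uses a HashMap.

```java
import java.util.Arrays;

final class NumericEncodingSketch {
    /** GCD_COMPRESSED: per doc, store (value - min) / gcd; min and gcd go to the meta file. */
    static long[] gcdEncode(long[] values, long min, long gcd) {
        long[] out = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] - min) / gcd;
        }
        return out;
    }

    static long[] gcdDecode(long[] quotients, long min, long gcd) {
        long[] out = new long[quotients.length];
        for (int i = 0; i < quotients.length; i++) {
            out[i] = min + gcd * quotients[i];
        }
        return out;
    }

    /** TABLE_COMPRESSED: the sorted distinct values form the table kept in the meta file. */
    static long[] buildTable(long[] values) {
        return Arrays.stream(values).distinct().sorted().toArray();
    }

    /** Per doc, store the position of its value in the sorted table. */
    static int[] tableEncode(long[] values, long[] table) {
        int[] ords = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            ords[i] = Arrays.binarySearch(table, values[i]);
        }
        return ords;
    }

    public static void main(String[] args) {
        long[] values = {1000, 1600, 1300, 1000};

        long[] quotients = gcdEncode(values, 1000, 300);
        System.out.println(Arrays.toString(quotients));                       // [0, 2, 1, 0]
        System.out.println(Arrays.toString(gcdDecode(quotients, 1000, 300))); // [1000, 1600, 1300, 1000]

        long[] table = buildTable(values);
        System.out.println(Arrays.toString(table));                      // [1000, 1300, 1600]
        System.out.println(Arrays.toString(tableEncode(values, table))); // [0, 2, 1, 0]
    }
}
```

Either way, each doc ends up as a number between 0 and 2, which fits in two bits instead of the 64 bits of a raw long.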

This completes the writing of numeric docValues. In the next article, we will see how numeric docValues are read.
