Reading the docValue implementation source code in Lucene (8): writing SortedNumericDocValue

At first glance, SortedNumericDocValue raised several questions for me. Is this type of storage also sorted across docs? Can the rank of a doc be looked up by docid? Is it single-valued like SortedBinaryDocValue? After reading the code, I found that none of these is the case. SortedNumericDocValue does not sort across docs, so it is impossible to obtain the rank of a doc, and it is multi-valued: a doc can contain multiple numbers. "Sorted" here means that the multiple numbers of each doc are sorted when they are stored; there is no ordering between docs.

As before, let's first look at how values are added in memory:

/** Buffers up pending long[] per doc, sorts, then flushes when segment flushes. */
class SortedNumericDocValuesWriter extends DocValuesWriter {
	/** All long values of all docs, appended doc by doc */
	private PackedLongValues.Builder pending; // stream of all values
	/** The number of long values of each doc; 0 if the doc has none */
	private PackedLongValues.Builder pendingCounts; // count of values per doc
	// (the iwBytesUsed and bytesUsed fields are omitted in this excerpt)
	private final FieldInfo fieldInfo;
	/** The id of the doc currently being processed */
	private int currentDoc;
	/** Buffer for the long values of the current doc */
	private long currentValues[] = new long[8];
	/** Next free slot in currentValues, which is also the number of values buffered for the current doc */
	private int currentUpto = 0;
	// Constructor
	public SortedNumericDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
		this.fieldInfo = fieldInfo;
		this.iwBytesUsed = iwBytesUsed;
		pending = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
		pendingCounts = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
		bytesUsed = pending.ramBytesUsed() + pendingCounts.ramBytesUsed();
		iwBytesUsed.addAndGet(bytesUsed);
	}
	// Add one doc value
	public void addValue(int docID, long value) {
		// A doc can have multiple values, so a change of docID means the next doc has started
		// and the current doc must be finished first
		if (docID != currentDoc) {
			finishCurrentDoc(); // finish the current doc
		}
		// Fill in any holes: a doc without values gets a 0 in pendingCounts, meaning it has no values
		while (currentDoc < docID) {
			pendingCounts.add(0); // no values
			currentDoc++;
		}
		addOneValue(value); // add the value to the current doc
		updateBytesUsed();
	}
	// finalize currentDoc: this sorts the values in the current doc
	private void finishCurrentDoc() {
		Arrays.sort(currentValues, 0, currentUpto); // sort the values of the current doc in ascending order
		for (int i = 0; i < currentUpto; i++) { // append them to pending
			pending.add(currentValues[i]);
		}
		// record the number of values for this doc
		pendingCounts.add(currentUpto);
		currentUpto = 0;
		currentDoc++;
	}

	// Finish all docs: flush the last doc, then pad the remaining docs with a count of 0
	@Override
	public void finish(int maxDoc) {
		finishCurrentDoc();
		// fill in any holes
		for (int i = currentDoc; i < maxDoc; i++) {
			pendingCounts.add(0); // no values
		}
	}

	/** Append one long to currentValues, growing the array first if it is full */
	private void addOneValue(long value) {
		if (currentUpto == currentValues.length) {
			currentValues = ArrayUtil.grow(currentValues, currentValues.length + 1);
		}

		currentValues[currentUpto] = value;
		currentUpto++;
	}
}

The code above shows clearly that SortedNumericDocValue supports multiple numbers per doc. Two things are recorded for each doc: its values, which are sorted and then stored in pending, and the number of its values, which is stored in pendingCounts.
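To make the buffering concrete, here is a minimal sketch of the same logic. It is not Lucene code: the class name BufferSketch is made up for illustration, plain java.util lists stand in for PackedLongValues.Builder, Arrays.copyOf stands in for ArrayUtil.grow, and the memory accounting is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified stand-in for SortedNumericDocValuesWriter. */
class BufferSketch {
    final List<Long> pending = new ArrayList<>();          // all values, doc by doc
    final List<Integer> pendingCounts = new ArrayList<>(); // number of values per doc
    private long[] currentValues = new long[8];
    private int currentUpto = 0;
    private int currentDoc = 0;

    void addValue(int docID, long value) {
        if (docID != currentDoc) {
            finishCurrentDoc();
        }
        while (currentDoc < docID) { // docs without values get a count of 0
            pendingCounts.add(0);
            currentDoc++;
        }
        if (currentUpto == currentValues.length) { // assumption: simple doubling instead of ArrayUtil.grow
            currentValues = Arrays.copyOf(currentValues, currentValues.length * 2);
        }
        currentValues[currentUpto++] = value;
    }

    private void finishCurrentDoc() {
        Arrays.sort(currentValues, 0, currentUpto); // sort only within the current doc
        for (int i = 0; i < currentUpto; i++) {
            pending.add(currentValues[i]);
        }
        pendingCounts.add(currentUpto);
        currentUpto = 0;
        currentDoc++;
    }

    void finish(int maxDoc) {
        finishCurrentDoc();
        for (int i = currentDoc; i < maxDoc; i++) {
            pendingCounts.add(0); // trailing docs without values
        }
    }

    public static void main(String[] args) {
        BufferSketch w = new BufferSketch();
        w.addValue(0, 5L);
        w.addValue(0, 2L);
        w.addValue(0, 9L);
        w.addValue(2, 7L); // doc 1 has no value
        w.addValue(2, 1L);
        w.finish(4);       // doc 3 has no value either
        System.out.println(w.pending);       // [2, 5, 9, 1, 7]
        System.out.println(w.pendingCounts); // [3, 0, 2, 0]
    }
}
```

Note that pending ends up as 2, 5, 9, 1, 7: each doc's values are in ascending order, but the stream as a whole is not.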

 

Let's take a look at the operation when flushing:

@Override
public void flush(SegmentWriteState state, DocValuesConsumer dvConsumer) throws IOException {
	final int maxDoc = state.segmentInfo.getDocCount();
	assert pendingCounts.size() == maxDoc;
	final PackedLongValues values = pending.build(); // all added values
	final PackedLongValues valueCounts = pendingCounts.build(); // the number of longs contained in each doc

	dvConsumer.addSortedNumericField(fieldInfo,
			// doc -> valueCount,
			new Iterable<Number>() {
				@Override
				public Iterator<Number> iterator() {
					return new CountIterator(valueCounts); // number of values per doc
				}
			},
			// values
			new Iterable<Number>() {
				@Override
				public Iterator<Number> iterator() {
					return new ValuesIterator(values); // all values of all docs
				}
			});
}
private static class ValuesIterator implements Iterator<Number> {
	final PackedLongValues.Iterator iter;
	ValuesIterator(PackedLongValues values) {
		iter = values.iterator();
	}
	@Override
	public boolean hasNext() {
		return iter.hasNext();
	}
	@Override
	public Number next() {
		if (!hasNext()) {
			throw new NoSuchElementException();
		}
		return iter.next();
	}
	@Override
	public void remove() {
		throw new UnsupportedOperationException();
	}
}
private static class CountIterator implements Iterator<Number> {
	final PackedLongValues.Iterator iter;
	CountIterator(PackedLongValues valueCounts) {
		this.iter = valueCounts.iterator();
	}
	@Override
	public boolean hasNext() {
		return iter.hasNext();
	}
	@Override
	public Number next() {
		if (!hasNext()) {
			throw new NoSuchElementException();
		}
		return iter.next();
	}
	@Override
	public void remove() {
		throw new UnsupportedOperationException();
	}
}

Flushing is also very simple: two iterators are passed to the consumer, one yielding the number of long values of each doc and the other yielding all the values. Let's look at the method that actually writes to the directory, Lucene410DocValuesConsumer.addSortedNumericField(FieldInfo, Iterable<Number>, Iterable<Number>):

public void addSortedNumericField(FieldInfo field, final Iterable<Number> docToValueCount, final Iterable<Number> values) throws IOException {
	meta.writeVInt(field.number); // field number
	meta.writeByte(Lucene410DocValuesFormat.SORTED_NUMERIC);
	if (isSingleValued(docToValueCount)) {
		// Every doc has at most one value, so the plain numeric format is reused directly.
		// Nothing needs sorting here: "sorted" only refers to the order of multiple values within one doc.
		meta.writeVInt(SORTED_SINGLE_VALUED);
		// The field is single-valued, we can encode it as NUMERIC
		addNumericField(field, singletonView(docToValueCount, values, null)); // same format as NumericDocValue, which comes in three encodings
	} else {
		// The general case: a doc contains multiple numbers
		meta.writeVInt(SORTED_WITH_ADDRESSES);
		// write the stream of values as a numeric field, i.e. write all the values to the directory first
		addNumericField(field, values, true);
		// write the doc -> ord count as an absolute index into the stream
		addAddresses(field, docToValueCount); // this index is used at read time to locate each doc's own long values
	}
}
private void addAddresses(FieldInfo field, Iterable<Number> values) throws IOException {
	meta.writeVInt(field.number); // field number
	meta.writeByte(Lucene410DocValuesFormat.NUMERIC);
	meta.writeVInt(MONOTONIC_COMPRESSED);
	meta.writeLong(-1L);
	meta.writeLong(data.getFilePointer());
	meta.writeVLong(maxDoc);
	meta.writeVInt(PackedInts.VERSION_CURRENT);
	meta.writeVInt(BLOCK_SIZE);
	final MonotonicBlockPackedWriter writer = new MonotonicBlockPackedWriter(data, BLOCK_SIZE);
	long addr = 0;
	writer.add(addr); // the first doc always starts at offset 0
	for (Number v : values) {
		addr += v.longValue();
		// Record how many longs precede the next doc, so the start of each doc in the value stream
		// can be found quickly; the difference to the following entry tells how many longs to read.
		writer.add(addr);
	}
	writer.finish();
	meta.writeLong(data.getFilePointer());
}

After reading the flush code, the storage layout of SortedNumericDocValue is clear. It consists of two parts. The first part holds all the numbers: the values of each doc are stored together, sorted within the doc. The second part stores, for each doc, the position of its first number in that stream, that is, how many numbers belong to the docs before it. For example, if the first doc has three numbers, the entry stored for the second doc is 3, because the second doc's own longs begin after the first 3 longs of the first part. The entry of the next doc minus the entry of this doc is the number of longs this doc has, so reading that many longs from the start position yields all the values of this doc.
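The address arithmetic can be sketched in a few lines. This is only an illustration under stated assumptions: the class AddressSketch and its methods are hypothetical, and plain arrays replace the MonotonicBlockPackedWriter and the numeric block that Lucene actually writes.

```java
import java.util.Arrays;

/** Hypothetical illustration of the addresses written by addAddresses. */
class AddressSketch {
    /** Turns per-doc value counts into cumulative start offsets, with one extra trailing entry. */
    static long[] buildAddresses(int[] counts) {
        long[] addresses = new long[counts.length + 1];
        for (int doc = 0; doc < counts.length; doc++) {
            addresses[doc + 1] = addresses[doc] + counts[doc];
        }
        return addresses;
    }

    /** Returns the values of one doc by slicing the shared value stream. */
    static long[] valuesOf(int doc, long[] values, long[] addresses) {
        int start = (int) addresses[doc];   // first value of this doc
        int end = (int) addresses[doc + 1]; // start of the next doc
        return Arrays.copyOfRange(values, start, end);
    }

    public static void main(String[] args) {
        long[] values = {2, 5, 9, 1, 7}; // sorted within each doc only
        int[] counts = {3, 0, 2, 0};     // doc 1 and doc 3 have no values
        long[] addresses = buildAddresses(counts);
        System.out.println(Arrays.toString(addresses));                      // [0, 3, 3, 5, 5]
        System.out.println(Arrays.toString(valuesOf(0, values, addresses))); // [2, 5, 9]
        System.out.println(Arrays.toString(valuesOf(2, values, addresses))); // [1, 7]
    }
}
```

A doc without values simply has the same address as the next doc, so the slice for it is empty.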

It also confirms that nothing is sorted across docs; only the multiple numbers within a single doc are sorted.
