At first glance, SortedNumericDocValue raised several questions for me. Is this storage type sorted across docs, so that the rank of a doc can be looked up by docid? Is it single-valued like SortedBinaryDocValue? After reading the code, I found that neither is the case. SortedNumericDocValue does not sort docs against each other, so the rank of a doc cannot be obtained from it. It is also multi-valued: a doc can contain several numbers. "Sorted" here only means that the numbers belonging to each doc are sorted when they are stored; there is no ordering between different docs.
As before, let's first look at how values are added in memory:
/** Buffers up pending long[] per doc, sorts, then flushes when segment flushes. */
class SortedNumericDocValuesWriter extends DocValuesWriter {
  /** All longs of all docs, appended doc by doc */
  private PackedLongValues.Builder pending; // stream of all values
  /** The number of longs contained in each doc; 0 if the doc has none */
  private PackedLongValues.Builder pendingCounts; // count of values per doc
  private final FieldInfo fieldInfo;
  /** The id of the doc currently being buffered */
  private int currentDoc;
  /** Holds the longs of the current doc */
  private long currentValues[] = new long[8];
  /** Position after the last long of the current doc in currentValues, i.e. the number of values in the current doc */
  private int currentUpto = 0;

  // Constructor
  public SortedNumericDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
    this.fieldInfo = fieldInfo;
    this.iwBytesUsed = iwBytesUsed;
    pending = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
    pendingCounts = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
    bytesUsed = pending.ramBytesUsed() + pendingCounts.ramBytesUsed();
    iwBytesUsed.addAndGet(bytesUsed);
  }

  // Add one doc value
  public void addValue(int docID, long value) {
    if (docID != currentDoc) {
      // The docid changed, so a new doc starts. Because a doc can have
      // multiple values, the current doc has to be closed first.
      finishCurrentDoc();
    }

    // Fill in any holes: a doc without values gets a 0 in pendingCounts,
    // meaning that it has no values.
    while (currentDoc < docID) {
      pendingCounts.add(0); // no values
      currentDoc++;
    }

    addOneValue(value); // add a value to the current doc
    updateBytesUsed();
  }

  // finalize currentDoc: this sorts the values in the current doc
  private void finishCurrentDoc() {
    Arrays.sort(currentValues, 0, currentUpto); // sort the values of the current doc in ascending order
    for (int i = 0; i < currentUpto; i++) { // write them to pending
      pending.add(currentValues[i]);
    }
    // record the number of values for this doc
    pendingCounts.add(currentUpto);
    currentUpto = 0;
    currentDoc++;
  }

  // Close the last doc and pad the remaining docs
  @Override
  public void finish(int maxDoc) {
    finishCurrentDoc();

    // fill in any holes
    for (int i = currentDoc; i < maxDoc; i++) {
      pendingCounts.add(0); // no values
    }
  }

  /** Append one long to the array, growing it if necessary */
  private void addOneValue(long value) {
    if (currentUpto == currentValues.length) {
      currentValues = ArrayUtil.grow(currentValues, currentValues.length + 1);
    }
    currentValues[currentUpto] = value;
    currentUpto++;
  }
}
The code above shows clearly that SortedNumericDocValue supports multiple numbers per doc. Two things are recorded for each doc: its values, which are sorted and then appended to pending, and the number of its values, which is appended to pendingCounts.
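To make the buffering concrete, here is a minimal sketch of the same logic. It is not the Lucene class: PerDocSortedBuffer is a hypothetical name, and plain Java lists stand in for PackedLongValues.Builder. It only shows that the values are sorted within each doc while the stream as a whole is not sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the writer; lists replace PackedLongValues.Builder. */
class PerDocSortedBuffer {
    final List<Long> pending = new ArrayList<>();          // all values, doc by doc
    final List<Integer> pendingCounts = new ArrayList<>(); // number of values per doc
    private long[] currentValues = new long[8];
    private int currentUpto = 0;
    private int currentDoc = 0;

    void addValue(int docID, long value) {
        if (docID != currentDoc) {
            finishCurrentDoc(); // a new doc starts, close the previous one
        }
        while (currentDoc < docID) { // docs without values get a count of 0
            pendingCounts.add(0);
            currentDoc++;
        }
        if (currentUpto == currentValues.length) {
            currentValues = Arrays.copyOf(currentValues, currentValues.length * 2);
        }
        currentValues[currentUpto++] = value;
    }

    void finish(int maxDoc) {
        finishCurrentDoc();
        for (int i = currentDoc; i < maxDoc; i++) {
            pendingCounts.add(0); // trailing docs without values
        }
    }

    private void finishCurrentDoc() {
        Arrays.sort(currentValues, 0, currentUpto); // sort only this doc's values
        for (int i = 0; i < currentUpto; i++) {
            pending.add(currentValues[i]);
        }
        pendingCounts.add(currentUpto);
        currentUpto = 0;
        currentDoc++;
    }

    public static void main(String[] args) {
        PerDocSortedBuffer buf = new PerDocSortedBuffer();
        buf.addValue(0, 9);
        buf.addValue(0, 3);
        buf.addValue(0, 7);
        buf.addValue(2, 5); // doc 1 has no values
        buf.addValue(2, 1);
        buf.finish(4);      // doc 3 has no values either
        System.out.println(buf.pending);       // [3, 7, 9, 1, 5]
        System.out.println(buf.pendingCounts); // [3, 0, 2, 0]
    }
}
```

In the output, 3, 7, 9 and 1, 5 are each sorted, but 9 is followed by 1, so there is no ordering across docs.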
Let's take a look at the operation when flushing:
@Override
public void flush(SegmentWriteState state, DocValuesConsumer dvConsumer) throws IOException {
  final int maxDoc = state.segmentInfo.getDocCount();
  assert pendingCounts.size() == maxDoc;
  final PackedLongValues values = pending.build(); // all added values
  final PackedLongValues valueCounts = pendingCounts.build(); // the number of longs contained in each doc

  dvConsumer.addSortedNumericField(fieldInfo,
      // doc -> valueCount
      new Iterable<Number>() {
        @Override
        public Iterator<Number> iterator() {
          return new CountIterator(valueCounts); // the number of values in each doc
        }
      },
      // values
      new Iterable<Number>() {
        @Override
        public Iterator<Number> iterator() {
          return new ValuesIterator(values); // all values
        }
      });
}

private static class ValuesIterator implements Iterator<Number> {
  final PackedLongValues.Iterator iter;

  ValuesIterator(PackedLongValues values) {
    iter = values.iterator();
  }

  @Override
  public boolean hasNext() {
    return iter.hasNext();
  }

  @Override
  public Number next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return iter.next();
  }

  @Override
  public void remove() {
    throw new UnsupportedOperationException();
  }
}

private static class CountIterator implements Iterator<Number> {
  final PackedLongValues.Iterator iter;

  CountIterator(PackedLongValues valueCounts) {
    this.iter = valueCounts.iterator();
  }

  @Override
  public boolean hasNext() {
    return iter.hasNext();
  }

  @Override
  public Number next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return iter.next();
  }

  @Override
  public void remove() {
    throw new UnsupportedOperationException();
  }
}
Flushing is also simple: two iterators are passed to the consumer, one over the number of long values in each doc and one over all the values. Next, let's look at the method that actually writes to the directory, Lucene410DocValuesConsumer.addSortedNumericField(FieldInfo, Iterable<Number>, Iterable<Number>):
public void addSortedNumericField(FieldInfo field, final Iterable<Number> docToValueCount, final Iterable<Number> values) throws IOException {
  meta.writeVInt(field.number); // field number
  meta.writeByte(Lucene410DocValuesFormat.SORTED_NUMERIC);
  if (isSingleValued(docToValueCount)) {
    // No doc has multiple values, so the plain numeric type is reused directly.
    // Nothing needs sorting here: "sorted" only refers to the order of the
    // multiple values of one doc in this field.
    meta.writeVInt(SORTED_SINGLE_VALUED);
    // The field is single-valued, we can encode it as NUMERIC
    addNumericField(field, singletonView(docToValueCount, values, null)); // same format as NumericDocValues, which comes in three encodings
  } else {
    // The general case: a doc may contain multiple numbers.
    meta.writeVInt(SORTED_WITH_ADDRESSES);
    // write the stream of values as a numeric field, i.e. all values go to the directory first
    addNumericField(field, values, true);
    // write the doc -> ord count as a absolute index to the stream
    addAddresses(field, docToValueCount); // the index used later to read each doc's own longs
  }
}

private void addAddresses(FieldInfo field, Iterable<Number> values) throws IOException {
  meta.writeVInt(field.number);
  meta.writeByte(Lucene410DocValuesFormat.NUMERIC);
  meta.writeVInt(MONOTONIC_COMPRESSED);
  meta.writeLong(-1L);
  meta.writeLong(data.getFilePointer());
  meta.writeVLong(maxDoc);
  meta.writeVInt(PackedInts.VERSION_CURRENT);
  meta.writeVInt(BLOCK_SIZE);

  final MonotonicBlockPackedWriter writer = new MonotonicBlockPackedWriter(data, BLOCK_SIZE);
  long addr = 0;
  writer.add(addr);
  for (Number v : values) {
    addr += v.longValue();
    // Record the running total of longs after each doc, so that the start of
    // every doc in the numeric block can be found quickly and its longs read from there.
    writer.add(addr);
  }
  writer.finish();
  meta.writeLong(data.getFilePointer());
}
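The branch in addSortedNumericField depends on a check over the per-doc counts. The body of isSingleValued is not shown in this post, so the following is only a sketch of what such a check could look like, under the assumption that "single-valued" means no doc has more than one value. SingleValuedCheck and allAtMostOneValue are hypothetical names, not Lucene APIs.

```java
import java.util.Arrays;

class SingleValuedCheck {
    /** Hypothetical helper: true when no doc has more than one value (assumed meaning of "single-valued"). */
    static boolean allAtMostOneValue(Iterable<? extends Number> docToValueCount) {
        for (Number count : docToValueCount) {
            if (count.longValue() > 1) {
                return false; // one multi-valued doc is enough to need addresses
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allAtMostOneValue(Arrays.asList(1, 0, 1)));    // true
        System.out.println(allAtMostOneValue(Arrays.asList(3, 0, 2, 0))); // false
    }
}
```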
After reading the flush code, the storage layout of SortedNumericDocValue is clear. It has two parts. The first part holds all the numbers: the numbers of each doc are stored contiguously, sorted within the doc. The second part holds, for each doc, the position of its first number in that stream, that is, how many numbers belong to all earlier docs. For example, if the first doc has three numbers, the entry for the first doc is 0 and the entry for the second doc is 3, because the second doc's own longs begin after the first 3 longs of the first part. The entry of the next doc minus the entry of this doc is the number of longs of this doc, so reading that many longs from the start position yields all the longs of this doc. Because the writer adds an initial 0 and then one running total per doc, there is one more entry than there are docs, and the subtraction also works for the last doc.
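The address scheme can be sketched with plain arrays. This is an illustration only, not Lucene's reader code: AddressLookup, buildAddresses and valuesForDoc are hypothetical names, and long[] arrays stand in for the packed on-disk structures.

```java
import java.util.Arrays;

class AddressLookup {
    /** Running totals of the counts: addresses[doc] is the index of that doc's first value. */
    static long[] buildAddresses(int[] counts) {
        long[] addresses = new long[counts.length + 1]; // one extra entry for the total
        long addr = 0;
        addresses[0] = addr;
        for (int doc = 0; doc < counts.length; doc++) {
            addr += counts[doc];
            addresses[doc + 1] = addr;
        }
        return addresses;
    }

    /** Returns the (already sorted) values of one doc; empty if the doc has none. */
    static long[] valuesForDoc(long[] values, long[] addresses, int doc) {
        int start = (int) addresses[doc];
        int end = (int) addresses[doc + 1]; // end - start is the doc's value count
        return Arrays.copyOfRange(values, start, end);
    }

    public static void main(String[] args) {
        long[] values = {3, 7, 9, 1, 5}; // doc 0: 3,7,9; doc 2: 1,5
        int[] counts = {3, 0, 2, 0};
        long[] addresses = buildAddresses(counts);
        System.out.println(Arrays.toString(addresses));                          // [0, 3, 3, 5, 5]
        System.out.println(Arrays.toString(valuesForDoc(values, addresses, 2))); // [1, 5]
        System.out.println(valuesForDoc(values, addresses, 1).length);           // 0
    }
}
```

Storing running totals instead of raw counts means a doc's values can be located with two lookups, without summing the counts of all earlier docs.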
This also confirms that there is no ordering across docs; only the numbers within a single doc are sorted.