Source code interpretation of DocValues in Lucene (10) - Writing SortedSetDocValues

 

Now for the last DocValues format, SortedSet. It is a bit harder than the others: if you have not read the previous four formats it will not be easy to follow, but if you have, it is quite easy. SortedSetDocValues stores byte[] values, and a single doc can have multiple values. When stored, all byte[] are sorted, so the position (ord) of each doc's byte[] can be looked up in the index. Its storage is a combination of SortedDocValues (which stores sorted, single-valued byte[]) and SortedNumericDocValues: to store the values, the sorted order of every byte[] is determined first; if we then separate the ordering information from the concrete byte[] values, all the byte[] can be stored on their own in the same format as SortedDocValues, and since each doc can recover its own byte[] from the orders it holds, each doc only needs to store a few numbers, which can be done in the same format as SortedNumericDocValues.
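
For reference, here is a rough sketch of how such a multi-valued byte[] field is added on the indexing side; the field name "color" and its values are made up only for illustration:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.util.BytesRef;

// doc 0 has two byte[] values for the same field, doc 1 has one
Document doc0 = new Document();
doc0.add(new SortedSetDocValuesField("color", new BytesRef("red")));
doc0.add(new SortedSetDocValuesField("color", new BytesRef("blue")));

Document doc1 = new Document();
doc1.add(new SortedSetDocValuesField("color", new BytesRef("green")));
// indexWriter.addDocument(doc0); indexWriter.addDocument(doc1);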

Take a look first at how values are kept in memory, in SortedSetDocValuesWriter:

class SortedSetDocValuesWriter extends DocValuesWriter {
	/** Stores all byte[] values and assigns each one a term id */
	final BytesRefHash hash;
	/** The term ids of all docs; within a doc the ids are sorted and duplicates are recorded only once, but there is no ordering across docs */
	private PackedLongValues.Builder pending; // stream of all termIDs
	/** The number of distinct byte[] contained in each doc */
	private PackedLongValues.Builder pendingCounts; // termIDs per doc
	private final Counter iwBytesUsed;
	private long bytesUsed; // this only tracks differences in 'pending' and 'pendingCounts'
	private final FieldInfo fieldInfo;
	/** Id of the doc currently being processed, used to detect when a new doc starts */
	private int currentDoc;
	/** Term ids of all byte[] of the current doc */
	private int currentValues[] = new int[8];
	/** How many entries of the current doc have been written into currentValues */
	private int currentUpto = 0;
	/** The maximum number of byte[] contained in any single doc */
	private int maxCount = 0;
	
	public SortedSetDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
		this.fieldInfo = fieldInfo;
		this.iwBytesUsed = iwBytesUsed;
		hash = new BytesRefHash(new ByteBlockPool(new ByteBlockPool.DirectTrackingAllocator(iwBytesUsed)),BytesRefHash.DEFAULT_CAPACITY, new DirectBytesStartArray(BytesRefHash.DEFAULT_CAPACITY, iwBytesUsed));
		pending = PackedLongValues.packedBuilder(PackedInts.COMPACT);
		pendingCounts = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
		bytesUsed = pending.ramBytesUsed() + pendingCounts.ramBytesUsed();
		iwBytesUsed.addAndGet(bytesUsed);
	}
	public void addValue(int docID, BytesRef value) {
		// argument checks omitted
		if (docID != currentDoc) {// a new doc starts, finish the previous one
			finishCurrentDoc();
		}
		// Fill in any holes:
		while (currentDoc < docID) {
			pendingCounts.add(0); // a doc without values contains 0 byte[]
			currentDoc++;
		}

		addOneValue(value);// add one value
		updateBytesUsed();// update the memory accounting
	}
	// finish the current doc: sort its term ids, drop duplicates and record them
	private void finishCurrentDoc() {
		Arrays.sort(currentValues, 0, currentUpto);// sort all the values of the current doc
		int lastValue = -1;
		int count = 0;
		for (int i = 0; i < currentUpto; i++) {
			int termID = currentValues[i];
			// if its not a duplicate
			if (termID != lastValue) {// duplicates are recorded only once
				pending.add(termID); // record the term id
				count++;
			}
			lastValue = termID;
		}
		// record the number of unique term ids for this doc
		pendingCounts.add(count);//Add the number of byte[] of this doc
		maxCount = Math.max(maxCount, count);
		currentUpto = 0;
		currentDoc++;
	}
	// add one value (term) for the current doc
	private void addOneValue(BytesRef value) {
		int termID = hash.add(value);// add to the hash table; a negative return value means the term already exists, otherwise it is seen for the first time
		if (termID < 0) {
			termID = -termID - 1;
		} else {
			iwBytesUsed.addAndGet(2 * RamUsageEstimator.NUM_BYTES_INT);
		}
		if (currentUpto == currentValues.length) {
			currentValues = ArrayUtil.grow(currentValues, currentValues.length + 1);
			// reserve additional space for max # values per-doc
			// when flushing, we need an int[] to sort the mapped-ords within
			// the doc
			iwBytesUsed.addAndGet((currentValues.length - currentUpto) * 2 * RamUsageEstimator.NUM_BYTES_INT);
		}

		currentValues[currentUpto] = termID;//Add to the temporary array.
		currentUpto++;
	}

As you can see, its processing logic is similar to SortedNumericDocValuesWriter: a doc is allowed to add multiple byte[]. In memory it keeps, for every doc, the term ids of the doc's byte[] and the number of byte[] the doc contains; when a doc is finished, its term ids are sorted and deduplicated first.
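
To make this concrete, here is a toy re-implementation of the writer's bookkeeping using plain collections (this is not Lucene code; the docs and values are invented) that shows what hash, pending and pendingCounts end up containing:

import java.util.*;

public class SortedSetTrace {
	public static void main(String[] args) {
		List<List<String>> docs = Arrays.asList(
				Arrays.asList("red", "blue"),           // doc 0
				Collections.<String>emptyList(),        // doc 1: no value (a "hole")
				Arrays.asList("blue", "blue", "green")  // doc 2: contains a duplicate
		);
		Map<String, Integer> termIds = new LinkedHashMap<>(); // plays the role of BytesRefHash
		List<Integer> pending = new ArrayList<>();             // term ids of all docs, concatenated
		List<Integer> pendingCounts = new ArrayList<>();       // distinct values per doc

		for (List<String> doc : docs) {
			SortedSet<Integer> docTermIds = new TreeSet<>();   // sort + deduplicate, like finishCurrentDoc
			for (String v : doc) {
				Integer id = termIds.get(v);
				if (id == null) { id = termIds.size(); termIds.put(v, id); }
				docTermIds.add(id);
			}
			pending.addAll(docTermIds);
			pendingCounts.add(docTermIds.size());
		}
		System.out.println("termIds       = " + termIds);       // {red=0, blue=1, green=2}
		System.out.println("pending       = " + pending);       // [0, 1, 1, 2]
		System.out.println("pendingCounts = " + pendingCounts); // [2, 0, 2]
	}
}

In the real writer the 0 for doc 1 is only added lazily, when the next doc that does have a value arrives, but the result is the same.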

Let's see how it is written to the Directory; the flush method is called first:

 

public void flush(SegmentWriteState state, DocValuesConsumer dvConsumer) throws IOException {
	final int maxDoc = state.segmentInfo.getDocCount();
	final int maxCountPerDoc = maxCount;
	assert pendingCounts.size() == maxDoc;
	final int valueCount = hash.size();// the number of all distinct byte[]
	final PackedLongValues ords = pending.build();// the term ids of the byte[] of all docs
	final PackedLongValues ordCounts = pendingCounts.build();// the number of byte[] contained in each doc

	// Sort all byte[]: sort the hash table and put the result into an array
	final int[] sortedValues = hash.sort(BytesRef.getUTF8SortedAsUnicodeComparator());
	final int[] ordMap = new int[valueCount];// index is the term id, value is its sorted order, so a term id's order can be looked up quickly
	for (int ord = 0; ord < valueCount; ord++) {
		ordMap[sortedValues[ord]] = ord;
	}

	dvConsumer.addSortedSetField(fieldInfo,
			// ord -> value
			new Iterable<BytesRef>() {//Return all byte[], already arranged in order.
				public Iterator<BytesRef> iterator() {
					return new ValuesIterator(sortedValues, valueCount, hash);
				}
			},
			// doc -> ordCount. The number of byte[] contained in each doc
			new Iterable<Number>() {
				public Iterator<Number> iterator() {
					return new OrdCountIterator(maxDoc, ordCounts);
				}
			},
			// doc -> ords: returns, for each doc, the sorted orders (ords) of its byte[]
			new Iterable<Number>() {
				public Iterator<Number> iterator() {
					return new OrdsIterator(ordMap, maxCountPerDoc, ords, ordCounts);
				}
			});
	
}

As can be seen above, during flush three iterables are passed to the final consumer: the first produces the whole dictionary, that is all byte[] in sorted order; the second produces the number of byte[] each doc contains; the third produces, for each doc, the sorted orders (ords) of its byte[]. With these three iterables in hand, let's look at the concrete flush, in Lucene410DocValuesConsumer.addSortedSetField(FieldInfo, Iterable<BytesRef>, Iterable<Number>, Iterable<Number>):
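
Continuing the toy example from above, this sketch (again not Lucene code) shows what the three iterables would yield; it mimics the ordMap remapping and the re-sorting of a doc's ords that the real OrdsIterator performs:

import java.util.*;

public class SortedSetFlushTrace {
	public static void main(String[] args) {
		List<String> byTermId = Arrays.asList("red", "blue", "green"); // term id -> byte[]
		List<Integer> pending = Arrays.asList(0, 1, 1, 2);             // term ids of all docs
		List<Integer> pendingCounts = Arrays.asList(2, 0, 2);          // distinct values per doc

		// sort the dictionary, then build ordMap: term id -> sorted order
		List<String> sorted = new ArrayList<>(byTermId);
		Collections.sort(sorted);                                      // [blue, green, red]
		int[] ordMap = new int[byTermId.size()];
		for (int ord = 0; ord < sorted.size(); ord++) {
			ordMap[byTermId.indexOf(sorted.get(ord))] = ord;           // red->2, blue->0, green->1
		}

		// 1) values: the dictionary, in sorted order
		System.out.println("values        = " + sorted);               // [blue, green, red]
		// 2) docToOrdCount: the number of byte[] per doc
		System.out.println("docToOrdCount = " + pendingCounts);        // [2, 0, 2]
		// 3) ords: per doc, term ids remapped to sorted orders and re-sorted within the doc
		List<Integer> ords = new ArrayList<>();
		int upto = 0;
		for (int count : pendingCounts) {
			List<Integer> docOrds = new ArrayList<>();
			for (int i = 0; i < count; i++) docOrds.add(ordMap[pending.get(upto++)]);
			Collections.sort(docOrds);
			ords.addAll(docOrds);
		}
		System.out.println("ords          = " + ords);                 // [0, 2, 0, 1]
	}
}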

 

 

/**
 * @param values the byte[] values, returned in sorted order
 * @param docToOrdCount the number of ords (i.e. byte[]) contained in each doc
 * @param ords the ords of the byte[] of all docs
 */
public void addSortedSetField(FieldInfo field, Iterable<BytesRef> values, final Iterable<Number> docToOrdCount, final Iterable<Number> ords) throws IOException {
	meta.writeVInt(field.number);// the field number
	meta.writeByte(Lucene410DocValuesFormat.SORTED_SET);
	if (isSingleValued(docToOrdCount)) {// if every doc has at most one byte[] it can be encoded exactly like SortedDocValues; this case is skipped here
		meta.writeVInt(SORTED_SINGLE_VALUED);
		// The field is single-valued, we can encode it as SORTED
		addSortedField(field, values, singletonView(docToOrdCount, ords, -1L));
	} else {
		meta.writeVInt(SORTED_WITH_ADDRESSES);
		// write the ord -> byte[] mapping as a binary field
		addTermsDict(field, values);// write the terms dictionary, i.e. all byte[]; this method is also used by SortedDocValues, so exactly the same storage format is used (see the SortedDocValues post: http://suichangkele.iteye.com/blog/2410752). The advantage of this format is that the byte[] for a given order can be located quickly.
		// write, for all docs, the ords (sorted orders) of their byte[] as one stream of numbers, just like the ords in SortedDocValues
		addNumericField(field, ords, false);
		// write the doc -> ord count as an absolute index into the stream
		addAddresses(field, docToOrdCount);// records where each doc's ords start in the stream above, the same as in SortedNumericDocValues
	}
}

As I said at the beginning of the article, if you understand the storage formats of the other four DocValues, this one is very simple. In the addSortedSetField method above, if you remove the addAddresses call, it is the same as SortedDocValues; the difference is that SortedSetDocValues is multi-valued, so the number of byte[] per doc also has to be recorded, hence the third call. If you remove the addTermsDict call, it is the same as SortedNumericDocValues: that format is also multi-valued, a doc containing multiple numbers, and the same layout is reused here, except that the numbers are now ords of byte[], so the concrete byte[] also have to be stored, hence the first part, addTermsDict. With these three parts the read path is easy to guess: to find the byte[] of a doc, first look up in the addresses where its ords start and how many there are, then read those ords from the numeric field, and finally look up the corresponding byte[] in the terms dictionary by ord. The corresponding read methods are covered in the next post.
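
Before that, here is a rough sketch of what the read path looks like from the public API side. It uses the Lucene 4.x SortedSetDocValues methods (setDocument, nextOrd, lookupOrd) as I recall them, so double-check the exact signatures against your Lucene version:

import java.io.IOException;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

// print all byte[] values of one doc for one field
void printDocValues(AtomicReader reader, String field, int docID) throws IOException {
	SortedSetDocValues dv = reader.getSortedSetDocValues(field);
	dv.setDocument(docID);                        // uses the addresses to find where this doc's ords start and how many there are
	BytesRef scratch = new BytesRef();
	long ord;
	while ((ord = dv.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
		dv.lookupOrd(ord, scratch);               // ord -> byte[] via the terms dictionary
		System.out.println("doc " + docID + " -> " + scratch.utf8ToString());
	}
}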
