Reading the docValue implementation source in Lucene (10): writing SortedSet

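Let's look at the last kind of docValue, SortedSet. It is a little hard: without having read the other four formats first it is not easy to follow, but once those are understood this one is straightforward. First, what SortedSet is: it stores byte[] values, a doc may have several values, and the values are stored sorted, so the index can return the sort position (ord) of each byte[] belonging to a doc. Its storage combines SortedDocValues (sorted, single-valued byte[]) with SortedNumericDocValues. At write time the sort order of all byte[] values is computed first. If we separate the ords from the actual byte[] values, the full set of byte[] values can be stored in the same format as SortedDocValues. Once all the byte[] values are stored, each doc can recover its own byte[] values from their ords, so per doc only a few numbers need to be saved, and those can use the SortedNumericDocValues format.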

First, how the values are held in memory. This is done by SortedSetDocValuesWriter:

class SortedSetDocValuesWriter extends DocValuesWriter {
	/** Stores every distinct byte[] and assigns each one a termID. */
	final BytesRefHash hash;
	/** termIDs of all docs, appended doc by doc. Within one doc the termIDs are sorted; across docs there is no ordering. */
	private PackedLongValues.Builder pending; // stream of all termIDs
	/** Number of byte[] values per doc; duplicates within a doc count once. */
	private PackedLongValues.Builder pendingCounts; // termIDs per doc
	private final Counter iwBytesUsed;
	private long bytesUsed; // this only tracks differences in 'pending' and 'pendingCounts'
	private final FieldInfo fieldInfo;
	/** ID of the doc currently being processed; used to detect when a new doc starts. */
	private int currentDoc;
	/** termIDs of all byte[] values of the current doc. */
	private int currentValues[] = new int[8];
	/** Next free slot in currentValues for the current doc. */
	private int currentUpto = 0;
	/** Largest number of byte[] values held by any single doc. */
	private int maxCount = 0;
	
	public SortedSetDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
		this.fieldInfo = fieldInfo;
		this.iwBytesUsed = iwBytesUsed;
		hash = new BytesRefHash(new ByteBlockPool(new ByteBlockPool.DirectTrackingAllocator(iwBytesUsed)),BytesRefHash.DEFAULT_CAPACITY, new DirectBytesStartArray(BytesRefHash.DEFAULT_CAPACITY, iwBytesUsed));
		pending = PackedLongValues.packedBuilder(PackedInts.COMPACT);
		pendingCounts = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
		bytesUsed = pending.ramBytesUsed() + pendingCounts.ramBytesUsed();
		iwBytesUsed.addAndGet(bytesUsed);
	}
	public void addValue(int docID, BytesRef value) {
		// argument checks omitted
		if (docID != currentDoc) { // a new doc starts: close the previous one
			finishCurrentDoc();
		}
		// Fill in any holes:
		while (currentDoc < docID) {
			pendingCounts.add(0); // a doc without values has a byte[] count of 0
			currentDoc++;
		}

		addOneValue(value); // buffer one value
		updateBytesUsed(); // update the memory accounting
	}
	// close one doc
	private void finishCurrentDoc() {
		Arrays.sort(currentValues, 0, currentUpto); // sort the termIDs of the current doc
		int lastValue = -1;
		int count = 0;
		for (int i = 0; i < currentUpto; i++) {
			int termID = currentValues[i];
			// if its not a duplicate
			if (termID != lastValue) { // a duplicate is recorded only once
				pending.add(termID); // record the term id
				count++;
			}
			lastValue = termID;
		}
		// record the number of unique term ids for this doc
		pendingCounts.add(count); // record how many byte[] values this doc has
		maxCount = Math.max(maxCount, count);
		currentUpto = 0;
		currentDoc++;
	}
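	// buffer one value for the current doc
	private void addOneValue(BytesRef value) {
		int termID = hash.add(value); // add to the hash; a negative return value means the value already exists, otherwise this is its first occurrence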
		if (termID < 0) {
			termID = -termID - 1;
		} else {
			iwBytesUsed.addAndGet(2 * RamUsageEstimator.NUM_BYTES_INT);
		}
		if (currentUpto == currentValues.length) {
			currentValues = ArrayUtil.grow(currentValues, currentValues.length + 1);
			// reserve additional space for max # values per-doc
			// when flushing, we need an int[] to sort the mapped-ords within
			// the doc
			iwBytesUsed.addAndGet((currentValues.length - currentUpto) * 2 * RamUsageEstimator.NUM_BYTES_INT);
		}

		currentValues[currentUpto] = termID; // append to the temporary per-doc array
		currentUpto++;
	}

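As you can see, the processing logic is much like that of SortedNumericDocValues: one doc may add several byte[] values. In memory the writer keeps the termIDs of each doc's byte[] values together with the number of byte[] values per doc. When a doc is finished, its termIDs are sorted first and duplicates are dropped.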
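The buffering described above can be sketched without any Lucene dependency. This is only an illustration under stated assumptions: `SortedSetBufferSketch` is a hypothetical class, a `HashMap<String, Integer>` stands in for `BytesRefHash`, plain lists stand in for `PackedLongValues.Builder`, and `finish(int maxDoc)` is assumed to close the last doc and pad trailing docs that have no values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the writer's buffering; not Lucene code.
class SortedSetBufferSketch {
    // Plays the role of BytesRefHash: value -> termID, in first-seen order.
    private final Map<String, Integer> termIds = new HashMap<>();
    final List<Integer> pending = new ArrayList<>();       // stream of all termIDs
    final List<Integer> pendingCounts = new ArrayList<>(); // unique termIDs per doc
    private int currentDoc = 0;
    private int[] currentValues = new int[8];
    private int currentUpto = 0;

    void addValue(int docID, String value) {
        if (docID != currentDoc) {
            finishCurrentDoc();          // a new doc starts
        }
        while (currentDoc < docID) {     // docs without values get count 0
            pendingCounts.add(0);
            currentDoc++;
        }
        Integer termID = termIds.get(value);
        if (termID == null) {            // first occurrence: assign the next id
            termID = termIds.size();
            termIds.put(value, termID);
        }
        if (currentUpto == currentValues.length) {
            currentValues = Arrays.copyOf(currentValues, currentValues.length * 2);
        }
        currentValues[currentUpto++] = termID;
    }

    // Assumed helper: closes the last doc and pads docs up to maxDoc.
    void finish(int maxDoc) {
        finishCurrentDoc();
        while (currentDoc < maxDoc) {
            pendingCounts.add(0);
            currentDoc++;
        }
    }

    private void finishCurrentDoc() {
        Arrays.sort(currentValues, 0, currentUpto); // sort termIDs within the doc
        int lastValue = -1;
        int count = 0;
        for (int i = 0; i < currentUpto; i++) {
            int termID = currentValues[i];
            if (termID != lastValue) {   // duplicates are recorded once
                pending.add(termID);
                count++;
            }
            lastValue = termID;
        }
        pendingCounts.add(count);
        currentUpto = 0;
        currentDoc++;
    }

    public static void main(String[] args) {
        SortedSetBufferSketch w = new SortedSetBufferSketch();
        w.addValue(0, "b");
        w.addValue(0, "a");
        w.addValue(0, "b");              // duplicate within doc 0
        w.addValue(2, "a");              // doc 1 has no value
        w.finish(3);
        System.out.println(w.pending);       // [0, 1, 1]
        System.out.println(w.pendingCounts); // [2, 0, 1]
    }
}
```

Note that at this stage the numbers in `pending` are termIDs in insertion order ("b" is 0, "a" is 1), not sort positions; the translation to ords only happens at flush time.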

Next, how the data is written to the directory. The flush method is called first:

public void flush(SegmentWriteState state, DocValuesConsumer dvConsumer) throws IOException {
	final int maxDoc = state.segmentInfo.getDocCount();
	final int maxCountPerDoc = maxCount;
	assert pendingCounts.size() == maxDoc;
	final int valueCount = hash.size(); // number of distinct byte[] values
	final PackedLongValues ords = pending.build(); // termIDs of all byte[] values of all docs
	final PackedLongValues ordCounts = pendingCounts.build(); // number of byte[] values per doc
	
	// Sort all byte[] values; the hash is compacted and returns the termIDs in sorted order.
	final int[] sortedValues = hash.sort(BytesRef.getUTF8SortedAsUnicodeComparator());
	final int[] ordMap = new int[valueCount]; // index is the termID, value is the ord, so the ord of any term can be looked up quickly
	for (int ord = 0; ord < valueCount; ord++) {
		ordMap[sortedValues[ord]] = ord;
	}

	dvConsumer.addSortedSetField(fieldInfo,
			// ord -> value
			new Iterable<BytesRef>() { // returns every byte[], already in sorted order
				public Iterator<BytesRef> iterator() {
					return new ValuesIterator(sortedValues, valueCount, hash);
				}
			},
			// doc -> ordCount: number of byte[] values per doc
			new Iterable<Number>() {
				public Iterator<Number> iterator() {
					return new OrdCountIterator(maxDoc, ordCounts);
				}
			},
			// returns the ords of the byte[] values of every doc
			new Iterable<Number>() {
				public Iterator<Number> iterator() {
					return new OrdsIterator(ordMap, maxCountPerDoc, ords, ordCounts);
				}
			});
	
}
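The termID-to-ord translation in flush can be illustrated in isolation. The sketch below is a simplified assumption-laden version: `OrdMapSketch` and its method names are hypothetical, terms are `String`s compared with `String.compareTo` rather than bytes compared in UTF-8 order, and the per-doc re-sort mirrors what the code comment in `addOneValue` describes ("we need an int[] to sort the mapped-ords within the doc").

```java
import java.util.Arrays;

// Hypothetical illustration of the termID -> ord remapping done at flush time.
class OrdMapSketch {
    // Role of hash.sort(...): returns termIDs arranged in sorted term order.
    static int[] sortTermIds(String[] termsById) {
        Integer[] ids = new Integer[termsById.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, (a, b) -> termsById[a].compareTo(termsById[b]));
        int[] sorted = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            sorted[i] = ids[i];
        }
        return sorted;
    }

    // Inverts sortedValues: index is the termID, value is the ord.
    static int[] buildOrdMap(int[] sortedValues) {
        int[] ordMap = new int[sortedValues.length];
        for (int ord = 0; ord < sortedValues.length; ord++) {
            ordMap[sortedValues[ord]] = ord;
        }
        return ordMap;
    }

    // Replaces each termID by its ord, then sorts the ords within each doc.
    static int[] remap(int[] pending, int[] counts, int[] ordMap) {
        int[] out = new int[pending.length];
        int pos = 0;
        for (int count : counts) {
            for (int i = 0; i < count; i++) {
                out[pos + i] = ordMap[pending[pos + i]];
            }
            Arrays.sort(out, pos, pos + count);
            pos += count;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] termsById = {"b", "a"};           // termID 0 = "b", termID 1 = "a"
        int[] sortedValues = sortTermIds(termsById);
        int[] ordMap = buildOrdMap(sortedValues);
        int[] ords = remap(new int[] {0, 1, 1}, new int[] {2, 0, 1}, ordMap);
        System.out.println(Arrays.toString(sortedValues)); // [1, 0]
        System.out.println(Arrays.toString(ordMap));       // [1, 0]
        System.out.println(Arrays.toString(ords));         // [0, 1, 0]
    }
}
```

The re-sort inside each doc is needed because termIDs sorted in insertion order are generally not sorted once they are mapped to ords.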

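As shown above, flush hands three arguments to the final consumer. The first produces the complete terms dictionary, that is, all byte[] values; the second gives the number of byte[] values per doc; the third gives the ords of all byte[] values of every doc. With these three iterables in hand, let's look at the actual write, which happens in Lucene410DocValuesConsumer.addSortedSetField(FieldInfo, Iterable<BytesRef>, Iterable<Number>, Iterable<Number>):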

/**
 * @param values the byte[] values, returned in sorted order
 * @param docToOrdCount the number of ords (that is, of byte[] values) per doc
 * @param ords the ords of the byte[] values of all docs
 */
public void addSortedSetField(FieldInfo field, Iterable<BytesRef> values, final Iterable<Number> docToOrdCount,	final Iterable<Number> ords) throws IOException {
	meta.writeVInt(field.number);
	meta.writeByte(Lucene410DocValuesFormat.SORTED_SET);
	if (isSingleValued(docToOrdCount)) { // no doc has more than one byte[]: same as SortedDocValues, so this case is skipped here
		meta.writeVInt(SORTED_SINGLE_VALUED);
		// The field is single-valued, we can encode it as SORTED
		addSortedField(field, values, singletonView(docToOrdCount, ords, -1L));
	} else {
		meta.writeVInt(SORTED_WITH_ADDRESSES);
		// write the ord -> byte[] as a binary field
		addTermsDict(field, values); // write the terms dictionary, i.e. all byte[] values. SortedDocValues uses this same method, so the storage format is the same (for a recap of that format see http://suichangkele.iteye.com/blog/2410752). The benefit of this format is that a byte[] can be found quickly from its ord.
		// write the ords of all byte[] values of every doc, i.e. several numbers per doc; SortedDocValues writes its ords the same way
		addNumericField(field, ords, false);
		// write the doc -> ord count as a absolute index to the stream
		addAddresses(field, docToOrdCount); // write, for each doc, the position of its first ord within the stream of all ords; this is the same as in SortedNumericDocValues
	}
}
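Turning per-doc counts into "absolute" start positions is just a running sum. The sketch below assumes a layout with maxDoc + 1 entries that begins with 0, so that a doc's ords occupy the half-open range between two neighbouring entries; `AddressesSketch` and `toAddresses` are hypothetical names, and the real method writes the values through a packed writer rather than returning an array.

```java
import java.util.Arrays;

// Hypothetical illustration: per-doc ord counts -> start addresses.
class AddressesSketch {
    // Assumed layout: addresses[doc] is the start, addresses[doc + 1] the end.
    static long[] toAddresses(int[] counts) {
        long[] addresses = new long[counts.length + 1];
        for (int doc = 0; doc < counts.length; doc++) {
            addresses[doc + 1] = addresses[doc] + counts[doc];
        }
        return addresses;
    }

    public static void main(String[] args) {
        long[] addresses = toAddresses(new int[] {2, 0, 1});
        System.out.println(Arrays.toString(addresses)); // [0, 2, 2, 3]
    }
}
```

Because the sequence never decreases, it compresses well with a monotonic encoding, and a doc without values simply has equal start and end.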

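As I said at the start of this post, once the storage formats of the other four docValue types are understood, this one is simple. Take the addSortedSetField method above: remove the addAddresses call and it is the same as SortedDocValues. The difference is that SortedSetDocValues is multi-valued, so the number of byte[] values per doc has to be recorded, which is why the third method is needed. Remove the addTermsDict call instead and it is the same as SortedNumericDocValues. The way to think about it: SortedNumericDocValues is also multi-valued, one doc holding several numbers, and SortedSetDocValues reuses that layout, except that the numbers are the ords of the doc's byte[] values. The actual byte[] values still have to be stored, hence the first part, addTermsDict. With these three methods the lookup strategy is roughly clear. To find the byte[] values of a doc, first read from the addresses where its first ord sits and how many ords it has, then read the ords themselves from the numeric field, and finally use each ord to fetch the byte[] from the terms dictionary. The next post looks at the corresponding read methods.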
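The three-step lookup can be sketched with plain arrays. This is not the Lucene reader: `SortedSetLookupSketch` and `valuesForDoc` are hypothetical names, the addresses are assumed to have maxDoc + 1 entries starting at 0, and the terms dictionary is modelled as a sorted `String[]`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the read path: addresses -> ords -> terms dictionary.
class SortedSetLookupSketch {
    static List<String> valuesForDoc(int doc, long[] addresses, int[] ords, String[] termsDict) {
        int start = (int) addresses[doc];       // position of the doc's first ord
        int end = (int) addresses[doc + 1];     // one past the doc's last ord
        List<String> result = new ArrayList<>();
        for (int i = start; i < end; i++) {
            result.add(termsDict[ords[i]]);     // ord -> byte[] via the dictionary
        }
        return result;
    }

    public static void main(String[] args) {
        long[] addresses = {0, 2, 2, 3};        // doc 0: two ords, doc 1: none, doc 2: one
        int[] ords = {0, 1, 0};
        String[] termsDict = {"a", "b"};        // sorted; the index is the ord
        System.out.println(valuesForDoc(0, addresses, ords, termsDict)); // [a, b]
        System.out.println(valuesForDoc(1, addresses, ords, termsDict)); // []
        System.out.println(valuesForDoc(2, addresses, ords, termsDict)); // [a]
    }
}
```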

