Source code interpretation of docValues in Lucene (7) - reading SortedDocValues

In a previous blog I covered how SortedDocValues are written. This time, let's see how they are read. The reading logic again lives in Lucene410DocValuesProducer, in the method public SortedDocValues getSorted(FieldInfo field). Let's first look at the SortedDocValues object this method returns. From its source we can see that the class extends BinaryDocValues, so calling it a "sorted binary docValue", as the previous blog did, is accurate; Lucene simply drops the "Binary" from the name. On top of BinaryDocValues' BytesRef get(int docID), it adds several methods:

public abstract class SortedDocValues extends BinaryDocValues {
  /**
   * Returns the ordinal for the specified docID, i.e. the sorted position of that doc's byte[].
   * From the previous blog it is easy to see that this is served by the Numeric part of the storage structure.
   * @param  docID document ID to lookup
   * @return ordinal for the document: this is dense, starts at 0, then increments by 1 for the
   *         next value in sorted order. Missing values are indicated by -1, so a doc without a value returns -1.
   */
  public abstract int getOrd(int docID);

  /**
   * Looks up the byte[] for a given ordinal. As the previous blog showed, this reads the binary part:
   * the ordinal determines which small block to read, the block's start position comes from the second
   * sub-part of part one (the per-block index), and the byte[] is then read from that position.
   */
  public abstract BytesRef lookupOrd(int ord);

  /**
   * Returns the number of distinct byte[] values.
   */
  public abstract int getValueCount();

  private final BytesRef empty = new BytesRef();

  @Override
  public BytesRef get(int docID) {// overrides BinaryDocValues.get: returns the byte[] for a given docid
    int ord = getOrd(docID);// fetch the doc's ordinal, read from the second (Numeric) part
    if (ord == -1) {
      return empty;
    } else {
      return lookupOrd(ord);// then resolve the ordinal to its byte[]
    }
  }

  /**
   * Returns the ordinal of this key; if the key does not exist, returns a negative number.
   */
  public int lookupTerm(BytesRef key) {
    int low = 0;
    int high = getValueCount()-1;
    // binary search over ordinals; note that each probe costs a lookupOrd call
    while (low <= high) {
      int mid = (low + high) >>> 1;
      final BytesRef term = lookupOrd(mid);
      int cmp = term.compareTo(key);
      if (cmp < 0) {// the key is larger than this term, so search to the right
        low = mid + 1;
      } else if (cmp > 0) {// the key is smaller than this term, so search to the left
        high = mid - 1;
      } else {
        return mid; // key found
      }
    }
    return -(low + 1);  // key not found.
  }
  
  /**
   * Returns an iterator over all the stored byte[] values.
   */
  public TermsEnum termsEnum() {
    return new SortedDocValuesTermsEnum(this);
  }
}
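
Before diving into the producer, here is a minimal usage sketch of this API (the reader setup and the field name "city" are my assumptions, not from the Lucene source; AtomicReader is the Lucene 4.x entry point):

import java.io.IOException;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.util.BytesRef;

class SortedDocValuesDemo {
  static void demo(AtomicReader reader) throws IOException {
    SortedDocValues dv = reader.getSortedDocValues("city");// may be null if the field has no sorted docValues
    int ord = dv.getOrd(5);                // ordinal of doc 5's value, -1 if the doc has none
    if (ord != -1) {
      BytesRef value = dv.lookupOrd(ord);  // the actual byte[] behind that ordinal
      System.out.println(value.utf8ToString());
    }
    int pos = dv.lookupTerm(new BytesRef("beijing"));
    // pos >= 0 : the value exists and pos is its ordinal
    // pos <  0 : not present; -(pos + 1) is where it would sit in sorted order
  }
}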

After seeing SortedDocValues' methods, we can almost guess the reading process. Now for the source. As with the earlier readers, the entire meta file (the docValue index, the .dvm file) is read in Lucene410DocValuesProducer's constructor; the part corresponding to SortedDocValues is:

private void readSortedField(int fieldNumber, IndexInput meta, FieldInfos infos) throws IOException {
	// sorted = binary + numeric
	if (meta.readVInt() != fieldNumber) {
		throw new CorruptIndexException(
				"sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
	}
	if (meta.readByte() != Lucene410DocValuesFormat.BINARY) {
		throw new CorruptIndexException(
				"sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
	}
	
	BinaryEntry b = readBinaryEntry(meta);// read binary, the first part; but this time the format differs from before (prefix compression is used), so we will revisit readBinaryEntry below
	binaries.put(fieldNumber, b);// cache it

	if (meta.readVInt() != fieldNumber) {
		throw new CorruptIndexException(
				"sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
	}
	if (meta.readByte() != Lucene410DocValuesFormat.NUMERIC) {
		throw new CorruptIndexException(
				"sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
	}
	NumericEntry n = readNumericEntry(meta);// read numeric, the second part; identical to the earlier NumericDocValues, so it is not re-read here
	ords.put(fieldNumber, n);// cache the result
}
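
In other words, for one sorted field the meta file simply carries a binary entry followed by a numeric entry. A rough sketch of that layout (my reconstruction from readSortedField, not a verbatim format specification):

// .dvm content for one sorted field, as readSortedField consumes it (sketch):
//   vInt fieldNumber, byte BINARY  -> BinaryEntry  (where/how the distinct byte[] values are stored)
//   vInt fieldNumber, byte NUMERIC -> NumericEntry (where/how each doc's ordinal is stored)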

We already met readBinaryEntry in the earlier BinaryDocValues post, but we skipped the prefix-compressed branch there, so let's look at it again:

static BinaryEntry readBinaryEntry(IndexInput meta) throws IOException {
	BinaryEntry entry = new BinaryEntry();
	entry.format = meta.readVInt();// storage format
	entry.missingOffset = meta.readLong();// fp of the bitset recording which docs have a value
	entry.minLength = meta.readVInt();// minimum byte[] length
	entry.maxLength = meta.readVInt();// maximum byte[] length
	entry.count = meta.readVLong();// number of values (for sorted binary this is the number of byte[] written, no longer the number of docs)
	entry.offset = meta.readLong();// fp of the actual docValue data
	switch (entry.format) {
	case BINARY_FIXED_UNCOMPRESSED:// ignored here; we focus on the prefix-compressed case
		break;
	case BINARY_PREFIX_COMPRESSED:// the case used by sorted docValues
		entry.addressesOffset = meta.readLong();// fp of the per-block address index in the data file, i.e. the second sub-part of part one from the previous blog
		entry.packedIntsVersion = meta.readVInt();
		entry.blockSize = meta.readVInt();
		entry.reverseIndexOffset = meta.readLong();// fp of the reverse index, the third sub-part of part one
		break;
	case BINARY_VARIABLE_UNCOMPRESSED:
		entry.addressesOffset = meta.readLong();// ignored here; we focus on the prefix-compressed case
		entry.packedIntsVersion = meta.readVInt();
		entry.blockSize = meta.readVInt();
		break;
	default:
		throw new CorruptIndexException("Unknown format: " + entry.format + ", input=" + meta);
	}
	return entry;
}

Nothing surprising here: it just reads three file pointers. The first is the start of the docValue data; the second is the index recording the start of each small block, i.e. the second sub-part of part one; the third is the start of the reverse index, the third sub-part of part one. The position of part two (the per-doc ordinals) is not read here, because that is handled by readNumericEntry.
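
To keep those three pointers straight, here is a rough sketch of the data (.dvd) region for a prefix-compressed field, reconstructed from the entry fields above and the writing post (the ordering shown is my assumption):

// .dvd region for BINARY_PREFIX_COMPRESSED (sketch):
//   entry.offset             -> the term blocks themselves (16 byte[] per block, prefix compressed)
//   entry.addressesOffset    -> monotonic packed index: start address of every block
//   entry.reverseIndexOffset -> reverse index: one sampled term per 1024 values, for a coarse binary search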

With the meta file read, let's look at how a docValue is actually located. The lookup reads both part one and part two of the data file, so let's look at those two reads first. The tricky one is part one, the prefix-compressed docValue data; part two is a plain numeric read, identical to the ordinary NumericDocValues covered earlier (with its three formats), so we skip it.

The prefix-compressed docValue is read in getBinary, as follows:

public BinaryDocValues getBinary(FieldInfo field) throws IOException {
	BinaryEntry bytes = binaries.get(field.number);
	switch (bytes.format) {
	case BINARY_FIXED_UNCOMPRESSED:
		return getFixedBinary(field, bytes);// all byte[] have the same length
	case BINARY_VARIABLE_UNCOMPRESSED:
		return getVariableBinary(field, bytes);// byte[] lengths vary
	case BINARY_PREFIX_COMPRESSED:// the sorted docValue case: prefix compression, stored block by block
		return getCompressedBinary(field, bytes);
	default:
		throw new AssertionError();
	}
}

Focus on the third one:

private BinaryDocValues getCompressedBinary(FieldInfo field, final BinaryEntry bytes) throws IOException {
	final MonotonicBlockPackedReader addresses = getIntervalInstance(field, bytes);// the per-block address index, i.e. the second sub-part of part one
	final ReverseTermsIndex index = getReverseIndexInstance(field, bytes);// the reverse index, the last sub-part
	assert addresses.size() > 0; // we don't have to handle empty case
	IndexInput slice = data.slice("terms", bytes.offset, bytes.addressesOffset - bytes.offset);// a slice covering all the term blocks, from the data start up to the address index
	return new CompressedBinaryDocValues(bytes, addresses, index, slice);// wrap everything in one object
}

For space reasons we skip the getIntervalInstance and getReverseIndexInstance methods, but the resulting CompressedBinaryDocValues object deserves a look. It is itself a BinaryDocValues, so it carries the corresponding methods:

static final class CompressedBinaryDocValues extends LongBinaryDocValues {
	public CompressedBinaryDocValues(BinaryEntry bytes, MonotonicBlockPackedReader addresses, ReverseTermsIndex index, IndexInput data) throws IOException {
		this.maxTermLength = bytes.maxLength;// maximum length over all the byte[]
		this.numValues = bytes.count;// number of byte[] values
		this.addresses = addresses;// start address of each small block
		this.numIndexValues = addresses.size();// number of addresses, i.e. the number of small blocks
		this.data = data;
		this.reverseTerms = index.terms;// the third sub-part of part one: one byte[] sampled every 1024 values
		this.reverseAddresses = index.termAddresses;// start offsets of those sampled terms inside reverseTerms
		this.numReverseIndexValues = reverseAddresses.size();// number of sampled terms
		this.termsEnum = getTermsEnum(data);// build the terms enumerator used to walk and search all the byte[]; most methods below delegate to it
	}
	@Override
	public BytesRef get(long id) {
		try {
			termsEnum.seekExact(id);// seek the enumerator to the byte[] at the given ordinal
			return termsEnum.term();
		} catch (IOException e) {
			throw new RuntimeException(e);
		}
	}
	long lookupTerm(BytesRef key) {// look up by bytes: returns the ordinal if the key exists, otherwise a negative number
		try {
			switch (termsEnum.seekCeil(key)) {
			case FOUND:
				return termsEnum.ord();
			case NOT_FOUND:
				return -termsEnum.ord() - 1;
			default:
				return -numValues - 1;
			}
		} catch (IOException bogus) {
			throw new RuntimeException(bogus);
		}
	}

	TermsEnum getTermsEnum() {
		try {
			return getTermsEnum(data.clone());
		} catch (IOException e) {
			throw new RuntimeException(e);
		}
	}
	// ... field declarations and the private getTermsEnum(IndexInput) factory omitted ...
}

From the above you can see that everything ultimately comes down to obtaining a TermsEnum, and what is returned is a CompressedBinaryTermsEnum. So let's look at this class, especially its TermsEnum methods:

class CompressedBinaryTermsEnum extends TermsEnum {
	/** ordinal of the currently read term */
	private long currentOrd = -1;
	// offset to the start of the current block
	private long currentBlockStart;
	/** the incoming data file slice */
	private final IndexInput input;
	// delta from currentBlockStart to the start of each term's stored bytes
	private final int offsets[] = new int[INTERVAL_COUNT];
	// raw suffix-length entries read from each block header (15 one-byte entries, or 31 bytes in the two-byte case)
	private final byte buffer[] = new byte[2 * INTERVAL_COUNT - 1];
	/** the currently read term */
	private final BytesRef term = new BytesRef(maxTermLength);
	/** the first term of the current small block */
	private final BytesRef firstTerm = new BytesRef(maxTermLength);
	private final BytesRef scratch = new BytesRef();

	CompressedBinaryTermsEnum(IndexInput input) throws IOException {
		this.input = input;
		input.seek(0);// seek to the start of the slice; the term blocks are stored from its beginning
	}
        // read the header of a small block
	private void readHeader() throws IOException {
		firstTerm.length = input.readVInt();
		input.readBytes(firstTerm.bytes, 0, firstTerm.length);// read the block's first term, stored whole
		input.readBytes(buffer, 0, INTERVAL_COUNT - 1);// read the 15 remaining length entries; each encodes a term's length beyond the shared prefix
		if (buffer[0] == -1) {// 0xFF sentinel: some suffix length exceeds 254, so all the lengths are stored as shorts
			readShortAddresses();
		} else {
			readByteAddresses();
		}
		currentBlockStart = input.getFilePointer();
	}
	// this method and readShortAddresses below follow the same pattern; reading one is enough
	private void readByteAddresses() throws IOException {
		int addr = 0;
		for (int i = 1; i < offsets.length; i++) {// start at 1: only 15 deltas are recorded, the block's first term needs none
			addr += 2 + (buffer[i - 1] & 0xFF);// accumulate each term's start offset within the block. The 2 is 1 byte for the shared-prefix length stored before every suffix, plus 1 because the writer stores (suffixLength - 1): terms are unique, so every suffix is at least 1 byte long
			offsets[i] = addr;
		}
	}
	private void readShortAddresses() throws IOException {
		input.readBytes(buffer, INTERVAL_COUNT - 1, INTERVAL_COUNT);
		int addr = 0;
		for (int i = 1; i < offsets.length; i++) {
			int x = i << 1;
			addr += 2 + ((buffer[x - 1] << 8) | (buffer[x] & 0xFF));// the same 2 as above, with two-byte length entries
			offsets[i] = addr;
		}
	}
	// trivial: readHeader has already read the first term, so just copy it into term
	private void readFirstTerm() throws IOException {
		term.length = firstTerm.length;
		System.arraycopy(firstTerm.bytes, firstTerm.offset, term.bytes, 0, term.length);
	}
        // read a non-first term from the suffix area of the block
	private void readTerm(int offset) throws IOException {
		int start = input.readByte() & 0xFF;// length of the prefix shared with the block's first term
		System.arraycopy(firstTerm.bytes, firstTerm.offset, term.bytes, 0, start);// copy the shared prefix from the block's first term into the current term
		int suffix = offsets[offset] - offsets[offset - 1] - 1;// suffix length: the gap between consecutive offsets minus the 1-byte prefix-length marker
		input.readBytes(term.bytes, start, suffix);// read the suffix bytes
		term.length = start + suffix;// prefix + suffix now form the complete term
	}
	// read the next byte[] (called a term here because TermsEnum was originally built for the term dictionary)
	@Override
	public BytesRef next() throws IOException {
		currentOrd++;
		if (currentOrd >= numValues) {
			return null;
		} else {
			int offset = (int) (currentOrd & INTERVAL_MASK);// position within the current small block (values are stored in blocks)
			if (offset == 0) {// exactly at the start of a block
				// switch to next block
				readHeader();// read the block header: the first term stored whole, plus the suffix lengths of the remaining terms
				readFirstTerm();// the first term was already read by readHeader; just copy it into term
			} else {
				readTerm(offset);// inside a block: decode the next term from shared prefix + suffix
			}
			return term;
		}
	}

        // This is the index mentioned in the writing post: the last sub-part of part one stores a sampled byte[] every 1024 terms to narrow the search range. The range it yields is coarse precisely because only every 1024th term is sampled.
	long binarySearchIndex(BytesRef text) throws IOException {
		long low = 0;
		long high = numReverseIndexValues - 1;
		while (low <= high) {
			long mid = (low + high) >>> 1;
			reverseTerms.fill(scratch, reverseAddresses.get(mid));
			int cmp = scratch.compareTo(text);
			if (cmp < 0) {
				low = mid + 1;
			} else if (cmp > 0) {
				high = mid - 1;
			} else {
				return mid;
			}
		}
		return high;// no exact hit: high is the last sampled entry <= text (-1 if text sorts before all samples)
	}
	// binary search against first term in block range to find term's block
	long binarySearchBlock(BytesRef text, long low, long high) throws IOException {
		while (low <= high) {
			long mid = (low + high) >>> 1;
			input.seek(addresses.get(mid));
			term.length = input.readVInt();
			input.readBytes(term.bytes, 0, term.length);
			int cmp = term.compareTo(text);

			if (cmp < 0) {
				low = mid + 1;
			} else if (cmp > 0) {
				high = mid - 1;
			} else {
				return mid;
			}
		}
		return high;
	}

        //Find the specified byte[], first use a large-range search, and then use a small-range block search.
	@Override
	public SeekStatus seekCeil(BytesRef text) throws IOException {
		// locate block: narrow to block range with index, then search blocks
		final long block;
		long indexPos = binarySearchIndex(text);// first narrow the block range using the coarse reverse index
		if (indexPos < 0) {
			block = 0;
		} else {
			long low = indexPos << BLOCK_INTERVAL_SHIFT;
			long high = Math.min(numIndexValues - 1, low + BLOCK_INTERVAL_MASK);
			block = Math.max(low, binarySearchBlock(text, low, high));
		}

		// position before block, then scan to term.
		input.seek(addresses.get(block));// then position at the located block and scan term by term
		currentOrd = (block << INTERVAL_SHIFT) - 1;

		while (next() != null) {
			int cmp = term.compareTo(text);
			if (cmp == 0) {
				return SeekStatus.FOUND;
			} else if (cmp > 0) {
				return SeekStatus.NOT_FOUND;
			}
		}
		return SeekStatus.END;
	}

	@Override
	public void seekExact(long ord) throws IOException {
		long block = ord >>> INTERVAL_SHIFT;// which block the ordinal falls in
		if (block != currentOrd >>> INTERVAL_SHIFT) {// not the block we are currently in
			// switch to different block
			input.seek(addresses.get(block));// jump to that block
			readHeader();
		}

		currentOrd = ord;

		int offset = (int) (ord & INTERVAL_MASK);
		if (offset == 0) {// the block's first term
			readFirstTerm();
		} else {// otherwise seek to the term's offset within the block and decode it
			input.seek(currentBlockStart + offsets[offset - 1]);
			readTerm(offset);
		}
		}
	}

	@Override
	public BytesRef term() throws IOException {
		return term;
	}

	@Override
	public long ord() throws IOException {
		return currentOrd;
	}

	@Override
	public int docFreq() throws IOException {
		throw new UnsupportedOperationException();
	}

	@Override
	public long totalTermFreq() throws IOException {
		return -1;
	}

	@Override
	public DocsEnum docs(Bits liveDocs, DocsEnum reuse, int flags) throws IOException {
		throw new UnsupportedOperationException();
	}

	@Override
	public DocsAndPositionsEnum docsAndPositions(Bits liveDocs, DocsAndPositionsEnum reuse, int flags)
			throws IOException {
		throw new UnsupportedOperationException();
	}

	@Override
	public Comparator<BytesRef> getComparator() {
		return BytesRef.getUTF8SortedAsUnicodeComparator();
	}
}
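
To make the readHeader/readTerm logic concrete, here is a self-contained toy (my own sketch, not Lucene code) that mimics the block format at the object level: the first term of a block is kept whole, every later term is kept as (sharedPrefixLength, suffix), and decoding rebuilds prefix + suffix exactly as readTerm does. The on-disk details (the length table, the suffixLength - 1 trick, the 0xFF sentinel) are deliberately omitted:

import java.util.ArrayList;
import java.util.List;

public class PrefixBlockDemo {
	static final int INTERVAL_COUNT = 16;// terms per block, as in Lucene410

	// encode: first term stored whole, then {prefixLen}{suffix} per later term
	static List<byte[][]> encodeBlock(List<String> sortedTerms) {
		List<byte[][]> out = new ArrayList<>();
		byte[] first = sortedTerms.get(0).getBytes();
		out.add(new byte[][] { first });
		for (int i = 1; i < sortedTerms.size(); i++) {
			byte[] t = sortedTerms.get(i).getBytes();
			int shared = 0;// length of the prefix shared with the block's first term
			while (shared < first.length && shared < t.length && first[shared] == t[shared]) shared++;
			byte[] suffix = new byte[t.length - shared];
			System.arraycopy(t, shared, suffix, 0, suffix.length);
			out.add(new byte[][] { new byte[] { (byte) shared }, suffix });
		}
		return out;
	}

	// decode term i the way readTerm does: copy the shared prefix from the
	// block's first term, then append the stored suffix
	static String decode(List<byte[][]> block, int i) {
		byte[] first = block.get(0)[0];
		if (i == 0) return new String(first);
		int shared = block.get(i)[0][0] & 0xFF;
		byte[] suffix = block.get(i)[1];
		byte[] term = new byte[shared + suffix.length];
		System.arraycopy(first, 0, term, 0, shared);
		System.arraycopy(suffix, 0, term, shared, suffix.length);
		return new String(term);
	}

	public static void main(String[] args) {
		List<String> terms = List.of("beijing", "berlin", "boston");
		List<byte[][]> block = encodeBlock(terms);
		System.out.println(decode(block, 1));// prints "berlin" ("be" + "rlin")
		System.out.println(decode(block, 2));// prints "boston" ("b" + "oston")
	}
}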

That covers the TermsEnum that nearly every method ends up relying on. With it in place, we can look at the method that finally returns the SortedDocValues:

public SortedDocValues getSorted(FieldInfo field) throws IOException {
	final int valueCount = (int) binaries.get(field.number).count;
	final BinaryDocValues binary = getBinary(field);// read part one of the data, the part storing the docValue bytes (the getBinary method above)
	NumericEntry entry = ords.get(field.number);
	final LongValues ordinals = getNumeric(entry);// read part two of the data, the per-doc ordinals; identical to the earlier numeric read, so not repeated here
	return new SortedDocValues() {
		@Override
		public int getOrd(int docID) {// the doc's ordinal, i.e. the sorted position of its value; ordinals are stored by docid, so this is a direct lookup
			return (int) ordinals.get(docID);
		}
		@Override
		public BytesRef lookupOrd(int ord) {//Find the value according to the order.
			return binary.get(ord);
		}
		@Override
		public int getValueCount() {//The number of all byte[]
			return valueCount;
		}
		@Override
		public int lookupTerm(BytesRef key) {//Check if a byte[] exists.
			if (binary instanceof CompressedBinaryDocValues) {// this branch is taken for prefix-compressed data
				return (int) ((CompressedBinaryDocValues) binary).lookupTerm(key);// returns the ordinal if found, otherwise a negative number; resolved via the TermsEnum described above
			} else {
				return super.lookupTerm(key);
			}
		}
		@Override
		public TermsEnum termsEnum() {// iterate over all byte[]
			if (binary instanceof CompressedBinaryDocValues) {
				return ((CompressedBinaryDocValues) binary).getTermsEnum();
			} else {
				return super.termsEnum();
			}
		}
	};
	
}
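
One more hedged sketch (my own, not from the Lucene source) of why the ordinal layer pays off: once getOrd is available, comparing two docs' values for sorting needs no byte[] at all, just two int reads:

// sketch: sort comparison via ordinals; docs without a value (ord -1) sort first
static int compareByValue(SortedDocValues dv, int docA, int docB) {
	return Integer.compare(dv.getOrd(docA), dv.getOrd(docB));
}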

This wraps up the reading of sorted docValues. The most important piece is the storage of the ordinals and of the sorted byte[] values: the values are stored much like Lucene's term dictionary, and reading is likewise wrapped in Lucene's TermsEnum. With SortedDocValues you can easily obtain the sorted position of a doc's value; presumably we will rely on that in later posts.
