In a previous post I covered how SortedDocValues are written. This time, let's look at how they are read. The reading code is again in Lucene410DocValuesProducer, in the method public SortedDocValues getSorted(FieldInfo field). Start with the SortedDocValues object this method returns. Its source shows that the class extends BinaryDocValues, so the SortedDocValues discussed in the previous post could accurately be called "SortedBinaryDocValues"; Lucene simply leaves "Binary" out of the name. Besides the BytesRef get(int docID) method of BinaryDocValues, it adds several methods of its own, as follows:
public abstract class SortedDocValues extends BinaryDocValues {

  /**
   * Returns the ordinal for the specified docID, i.e. the rank of that doc's byte[] in sorted order.
   * As the previous post showed, this comes from the numeric part of the storage structure.
   * @param docID document ID to lookup
   * @return ordinal for the document: this is dense, starts at 0, then
   *         increments by 1 for the next value in sorted order. Note that
   *         missing values are indicated by -1, so a doc without a value returns -1.
   */
  public abstract int getOrd(int docID);

  /**
   * Finds the byte[] for the given ordinal. As the previous post showed, this reads the binary part:
   * ord determines which block to read, the second section of the first part (the index of block
   * start positions) gives the start of that block, and the requested byte[] is then read from it.
   */
  public abstract BytesRef lookupOrd(int ord);

  /** Returns the number of byte[] values. */
  public abstract int getValueCount();

  private final BytesRef empty = new BytesRef();

  // Overrides BinaryDocValues.get: returns the byte[] for the given docID.
  @Override
  public BytesRef get(int docID) {
    int ord = getOrd(docID); // ordinal of the doc, read from the second (numeric) part
    if (ord == -1) {
      return empty;
    } else {
      return lookupOrd(ord); // then look up the value by ordinal
    }
  }

  /** Returns the ordinal of this key; if the key does not exist, returns a negative number. */
  public int lookupTerm(BytesRef key) {
    int low = 0;
    int high = getValueCount() - 1;
    // Binary search; note that every probe is a separate lookupOrd call.
    while (low <= high) {
      int mid = (low + high) >>> 1;
      final BytesRef term = lookupOrd(mid);
      int cmp = term.compareTo(key);
      if (cmp < 0) { // key is larger than term, so continue to the right
        low = mid + 1;
      } else if (cmp > 0) { // key is smaller than term, so continue to the left
        high = mid - 1;
      } else {
        return mid; // key found
      }
    }
    return -(low + 1); // key not found.
  }

  /** Returns an iterator over all written byte[] values. */
  public TermsEnum termsEnum() {
    return new SortedDocValuesTermsEnum(this);
  }
}
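The return convention of lookupTerm is easy to misread, so here is a minimal sketch of the same binary search over a plain sorted String[] instead of a real SortedDocValues. The class name LookupTermSketch and the String-based terms are assumptions made for illustration; only the -(low + 1) convention is taken from the code above.

```java
final class LookupTermSketch {

  /** Returns the ordinal of key in sortedTerms, or -(insertionPoint) - 1 if it is absent. */
  static int lookupTerm(String[] sortedTerms, String key) {
    int low = 0;
    int high = sortedTerms.length - 1;
    while (low <= high) {
      int mid = (low + high) >>> 1;
      int cmp = sortedTerms[mid].compareTo(key);
      if (cmp < 0) {
        low = mid + 1;  // key sorts after the probed term
      } else if (cmp > 0) {
        high = mid - 1; // key sorts before the probed term
      } else {
        return mid;     // key found
      }
    }
    return -(low + 1);  // key not found: low is where it would be inserted
  }

  public static void main(String[] args) {
    String[] terms = {"a", "b", "c"};
    int hit = lookupTerm(terms, "b");
    int miss = lookupTerm(terms, "bb");
    // A negative result r encodes the insertion point as -r - 1.
    System.out.println("hit=" + hit + " miss=" + miss + " insertionPoint=" + (-miss - 1));
  }
}
```

A negative result r can be turned back into the insertion point with -r - 1, which is the same convention java.util.Arrays.binarySearch uses.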
Having seen the methods of SortedDocValues, we can almost guess how reading works. Now for the source code. As with the other doc values types, the constructor of Lucene410DocValuesProducer reads the entire meta file (that is, the index of the doc values data). The part that handles SortedDocValues is:
private void readSortedField(int fieldNumber, IndexInput meta, FieldInfos infos) throws IOException {
  // sorted = binary + numeric
  if (meta.readVInt() != fieldNumber) {
    throw new CorruptIndexException("sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
  }
  if (meta.readByte() != Lucene410DocValuesFormat.BINARY) {
    throw new CorruptIndexException("sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
  }
  // Read the binary entry, which is the first part. Its format differs from plain binary doc values:
  // it is prefix-compressed, so readBinaryEntry is worth revisiting.
  BinaryEntry b = readBinaryEntry(meta);
  binaries.put(fieldNumber, b); // cache it

  if (meta.readVInt() != fieldNumber) {
    throw new CorruptIndexException("sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
  }
  if (meta.readByte() != Lucene410DocValuesFormat.NUMERIC) {
    throw new CorruptIndexException("sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
  }
  // Read the numeric entry, which is the second part. It is the same as for NumericDocValues,
  // so I don't go through it again.
  NumericEntry n = readNumericEntry(meta);
  ords.put(fieldNumber, n); // cache the result
}
When I covered BinaryDocValues I went through readBinaryEntry but skipped the prefix-compressed case, so let's look at it again. The code is as follows:
static BinaryEntry readBinaryEntry(IndexInput meta) throws IOException {
  BinaryEntry entry = new BinaryEntry();
  entry.format = meta.readVInt(); // storage format
  entry.missingOffset = meta.readLong(); // fp of the doc set recording which docs have a value
  entry.minLength = meta.readVInt(); // minimum byte[] length
  entry.maxLength = meta.readVInt(); // maximum byte[] length
  entry.count = meta.readVLong(); // number of docs (for sorted binary, the number of byte[] values written, no longer the number of docs)
  entry.offset = meta.readLong(); // fp where the actual values start
  switch (entry.format) {
    case BINARY_FIXED_UNCOMPRESSED: // ignored here; we only look at prefix compression
      break;
    case BINARY_PREFIX_COMPRESSED: // the format used by sorted doc values
      entry.addressesOffset = meta.readLong(); // start of the index of block start positions, i.e. the second section of the first part described in the previous post
      entry.packedIntsVersion = meta.readVInt();
      entry.blockSize = meta.readVInt();
      entry.reverseIndexOffset = meta.readLong(); // start of the third section of the first part
      break;
    case BINARY_VARIABLE_UNCOMPRESSED: // ignored here as well
      entry.addressesOffset = meta.readLong();
      entry.packedIntsVersion = meta.readVInt();
      entry.blockSize = meta.readVInt();
      break;
    default:
      throw new CorruptIndexException("Unknown format: " + entry.format + ", input=" + meta);
  }
  return entry;
}
Nothing surprising here: for the prefix-compressed format it just reads three positions. The first is where the values start (offset). The second is the start of the index that records each block's start position, i.e. the second section of the first part (addressesOffset). The third is the start of the third section of the first part (reverseIndexOffset). The position of the second part (the ordinals) is not read here, because it is read by readNumericEntry.
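To make the three positions concrete, here is a small sketch of how they could partition the data file for one field. It assumes the sections are laid out in the order term blocks, block-address index, reverse index, with each section ending where the next begins. The class BinaryEntryLayoutSketch and the numbers in main are hypothetical; only the three field names come from BinaryEntry.

```java
final class BinaryEntryLayoutSketch {
  final long offset;             // start of the prefix-compressed term blocks
  final long addressesOffset;    // start of the index of block start positions
  final long reverseIndexOffset; // start of the reverse index

  BinaryEntryLayoutSketch(long offset, long addressesOffset, long reverseIndexOffset) {
    if (offset > addressesOffset || addressesOffset > reverseIndexOffset) {
      throw new IllegalArgumentException("sections must appear in file order");
    }
    this.offset = offset;
    this.addressesOffset = addressesOffset;
    this.reverseIndexOffset = reverseIndexOffset;
  }

  /** Length of the term-block section, assuming it ends where the address index begins. */
  long termsLength() {
    return addressesOffset - offset;
  }

  /** Names the section a file pointer falls into under the assumed layout. */
  String sectionOf(long fp) {
    if (fp < offset) return "before this field";
    if (fp < addressesOffset) return "term blocks";
    if (fp < reverseIndexOffset) return "block addresses";
    return "reverse index or later";
  }

  public static void main(String[] args) {
    // Hypothetical file pointers, chosen only for illustration.
    BinaryEntryLayoutSketch entry = new BinaryEntryLayoutSketch(100, 900, 960);
    System.out.println(entry.termsLength() + " bytes of term blocks; fp 920 is in: " + entry.sectionOf(920));
  }
}
```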
With the meta file covered, let's look at how a doc value is actually looked up. This reads both the first and the second part of the data file, so let's look at those two methods first. The harder one is the first part, the prefix-compressed values. The second part is numeric and is read exactly like ordinary NumericDocValues, which I covered earlier (there are three formats), so I skip it.
The prefix-compressed values are read in getBinary, as follows:
public BinaryDocValues getBinary(FieldInfo field) throws IOException {
  BinaryEntry bytes = binaries.get(field.number);
  switch (bytes.format) {
    case BINARY_FIXED_UNCOMPRESSED:
      return getFixedBinary(field, bytes); // all byte[] values have the same length
    case BINARY_VARIABLE_UNCOMPRESSED:
      return getVariableBinary(field, bytes); // lengths differ
    case BINARY_PREFIX_COMPRESSED: // used by sorted doc values: prefix compression, stored in blocks
      return getCompressedBinary(field, bytes);
    default:
      throw new AssertionError();
  }
}
Focus on the third one:
private BinaryDocValues getCompressedBinary(FieldInfo field, final BinaryEntry bytes) throws IOException {
  // The index of each block's address, i.e. the second section of the first part.
  final MonotonicBlockPackedReader addresses = getIntervalInstance(field, bytes);
  // The last section of the first part.
  final ReverseTermsIndex index = getReverseIndexInstance(field, bytes);
  assert addresses.size() > 0; // we don't have to handle empty case
  // All of the blocks: from the start of the values up to the start of the address index.
  IndexInput slice = data.slice("terms", bytes.offset, bytes.addressesOffset - bytes.offset);
  // Wrap everything in a single object.
  return new CompressedBinaryDocValues(bytes, addresses, index, slice);
}
To save space I skip getIntervalInstance and getReverseIndexInstance, but the CompressedBinaryDocValues object built from their results deserves a look. It is also a BinaryDocValues, so it has the corresponding methods:
static final class CompressedBinaryDocValues extends LongBinaryDocValues {

  public CompressedBinaryDocValues(BinaryEntry bytes, MonotonicBlockPackedReader addresses, ReverseTermsIndex index, IndexInput data) throws IOException {
    this.maxTermLength = bytes.maxLength; // maximum length over all byte[] values
    this.numValues = bytes.count; // number of byte[] values
    this.addresses = addresses; // start position of each block
    this.numIndexValues = addresses.size(); // number of positions, i.e. number of blocks
    this.data = data;
    this.reverseTerms = index.terms; // the third section of the first part: one byte[] stored for every 1024 terms
    this.reverseAddresses = index.termAddresses; // position of each of those sampled terms inside reverseTerms (see binarySearchIndex)
    this.numReverseIndexValues = reverseAddresses.size(); // number of sampled terms
    // The terms enumerator, the object used to find every byte[]. Most methods below delegate to it.
    this.termsEnum = getTermsEnum(data);
  }

  @Override
  public BytesRef get(long id) {
    try {
      termsEnum.seekExact(id); // position on the BytesRef with the given ordinal, via the TermsEnum
      return termsEnum.term();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  // Looks a value up by its bytes: returns its ordinal if it exists, otherwise a negative number.
  long lookupTerm(BytesRef key) {
    try {
      switch (termsEnum.seekCeil(key)) {
        case FOUND:
          return termsEnum.ord();
        case NOT_FOUND:
          return -termsEnum.ord() - 1;
        default:
          return -numValues - 1;
      }
    } catch (IOException bogus) {
      throw new RuntimeException(bogus);
    }
  }

  TermsEnum getTermsEnum() {
    try {
      return getTermsEnum(data.clone());
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
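The three-way switch in lookupTerm can be sketched on its own. The following toy class is an assumption for illustration: it replaces the real TermsEnum with a linear scan over a sorted String[], and the names SeekCeilSketch and sortedTerms are hypothetical. Only the mapping from FOUND, NOT_FOUND and END to the returned number follows the code above.

```java
final class SeekCeilSketch {
  enum SeekStatus { FOUND, NOT_FOUND, END }

  private final String[] sortedTerms;
  private long ord = -1; // ordinal the enumerator is currently positioned on

  SeekCeilSketch(String[] sortedTerms) {
    this.sortedTerms = sortedTerms;
  }

  /** Positions on the smallest term that is >= key (linear scan, for brevity only). */
  SeekStatus seekCeil(String key) {
    for (ord = 0; ord < sortedTerms.length; ord++) {
      int cmp = sortedTerms[(int) ord].compareTo(key);
      if (cmp == 0) {
        return SeekStatus.FOUND;
      } else if (cmp > 0) {
        return SeekStatus.NOT_FOUND; // positioned on the next larger term
      }
    }
    return SeekStatus.END; // every term is smaller than key
  }

  /** Same mapping as CompressedBinaryDocValues.lookupTerm. */
  long lookupTerm(String key) {
    switch (seekCeil(key)) {
      case FOUND:
        return ord;
      case NOT_FOUND:
        return -ord - 1; // ord is the insertion point
      default:
        return -sortedTerms.length - 1; // insertion point is past the last term
    }
  }

  public static void main(String[] args) {
    SeekCeilSketch values = new SeekCeilSketch(new String[] {"a", "b", "c"});
    System.out.println(values.lookupTerm("b") + " " + values.lookupTerm("bb") + " " + values.lookupTerm("z"));
  }
}
```

In all three cases the result agrees with the -(insertion point) - 1 convention of the binary search in SortedDocValues.lookupTerm, so callers do not need to know which implementation answered.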
As the code above shows, everything ends up going through a TermsEnum, and the one returned is a CompressedBinaryTermsEnum. So let's look at that class, especially its TermsEnum methods:
class CompressedBinaryTermsEnum extends TermsEnum {
  /** Ordinal of the term currently read. */
  private long currentOrd = -1;
  // offset to the start of the current block
  private long currentBlockStart;
  /** The data file passed in. */
  private final IndexInput input;
  // delta from currentBlockStart to start of each term
  /** For the non-first terms of a block: where each term's entry lies in the second section of the block. */
  private final int offsets[] = new int[INTERVAL_COUNT];
  // Raw length bytes from the block header: the lengths of the non-first terms, not counting the shared prefix.
  private final byte buffer[] = new byte[2 * INTERVAL_COUNT - 1];
  /** The term currently read. */
  private final BytesRef term = new BytesRef(maxTermLength);
  /** The first term of the current block. */
  private final BytesRef firstTerm = new BytesRef(maxTermLength);
  private final BytesRef scratch = new BytesRef();

  /** Takes the data file. */
  CompressedBinaryTermsEnum(IndexInput input) throws IOException {
    this.input = input;
    input.seek(0); // start of this slice, which is where the stored binary values begin
  }

  // Reads the first section of a block.
  private void readHeader() throws IOException {
    firstTerm.length = input.readVInt();
    input.readBytes(firstTerm.bytes, 0, firstTerm.length); // the first term of the block
    // Read the remaining 15 lengths into buffer. "Length" here means the length of each byte[]
    // without its shared prefix.
    input.readBytes(buffer, 0, INTERVAL_COUNT - 1);
    if (buffer[0] == -1) { // marker: some length is larger than 254, so the lengths are stored as shorts
      readShortAddresses();
    } else {
      readByteAddresses();
    }
    currentBlockStart = input.getFilePointer();
  }

  // This method and readShortAddresses below are analogous; reading one of them is enough.
  private void readByteAddresses() throws IOException {
    int addr = 0;
    for (int i = 1; i < offsets.length; i++) { // starts at 1 because only 15 lengths are recorded
      // Accumulates where each term's entry ends inside the current block. I first suspected the 2
      // was a bug and should be 1 (the byte holding the shared-prefix length). But readTerm below
      // computes the suffix length as the entry size minus 1, so the arithmetic is consistent if the
      // stored length is the suffix length minus one. The same 2 appears in readShortAddresses.
      addr += 2 + (buffer[i - 1] & 0xFF);
      offsets[i] = addr;
    }
  }

  private void readShortAddresses() throws IOException {
    input.readBytes(buffer, INTERVAL_COUNT - 1, INTERVAL_COUNT);
    int addr = 0;
    for (int i = 1; i < offsets.length; i++) {
      int x = i << 1;
      addr += 2 + ((buffer[x - 1] << 8) | (buffer[x] & 0xFF));
      offsets[i] = addr;
    }
  }

  // Simple: readHeader has already read the first term, so this only copies it into term.
  private void readFirstTerm() throws IOException {
    term.length = firstTerm.length;
    System.arraycopy(firstTerm.bytes, firstTerm.offset, term.bytes, 0, term.length);
  }

  // Reads a later term from the second section of the block.
  private void readTerm(int offset) throws IOException {
    int start = input.readByte() & 0xFF; // length of the shared prefix
    // Copy the shared prefix from the first term of this block into the current term.
    System.arraycopy(firstTerm.bytes, firstTerm.offset, term.bytes, 0, start);
    // Length of this term's own suffix: its entry size minus the one byte for the prefix length.
    int suffix = offsets[offset] - offsets[offset - 1] - 1;
    input.readBytes(term.bytes, start, suffix); // read the suffix
    term.length = start + suffix; // prefix plus suffix form the term
  }

  // Reads the next byte[]. It is called a term here because TermsEnum was originally built for
  // reading the term dictionary.
  public BytesRef next() throws IOException {
    currentOrd++;
    if (currentOrd >= numValues) {
      return null;
    } else {
      int offset = (int) (currentOrd & INTERVAL_MASK); // position inside the block
      if (offset == 0) { // exactly at the start of a block
        // switch to next block
        readHeader(); // the header holds the block's first byte[] and the lengths of the others
        readFirstTerm(); // the first term is already read; just copy it into term
      } else {
        readTerm(offset); // inside a block: read the next term
      }
      return term;
    }
  }

  // This is the reverse index mentioned when writing doc values: the last section of the first part
  // stores sampled byte[] values to narrow the search range. It is coarse, because only one term
  // in every 1024 is added at write time.
  long binarySearchIndex(BytesRef text) throws IOException {
    long low = 0;
    long high = numReverseIndexValues - 1;
    while (low <= high) {
      long mid = (low + high) >>> 1;
      reverseTerms.fill(scratch, reverseAddresses.get(mid));
      int cmp = scratch.compareTo(text);
      if (cmp < 0) {
        low = mid + 1;
      } else if (cmp > 0) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return high;
  }

  // binary search against first term in block range to find term's block
  long binarySearchBlock(BytesRef text, long low, long high) throws IOException {
    while (low <= high) {
      long mid = (low + high) >>> 1;
      input.seek(addresses.get(mid));
      term.length = input.readVInt();
      input.readBytes(term.bytes, 0, term.length);
      int cmp = term.compareTo(text);
      if (cmp < 0) {
        low = mid + 1;
      } else if (cmp > 0) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return high;
  }

  // Finds the given byte[]: first a coarse search with the reverse index, then a search over blocks.
  @Override
  public SeekStatus seekCeil(BytesRef text) throws IOException {
    // locate block: narrow to block range with index, then search blocks
    final long block;
    long indexPos = binarySearchIndex(text); // the coarse index narrows the range first
    if (indexPos < 0) {
      block = 0;
    } else {
      long low = indexPos << BLOCK_INTERVAL_SHIFT;
      long high = Math.min(numIndexValues - 1, low + BLOCK_INTERVAL_MASK);
      block = Math.max(low, binarySearchBlock(text, low, high));
    }

    // position before block, then scan to term.
    input.seek(addresses.get(block)); // then scan inside the block for the exact term
    currentOrd = (block << INTERVAL_SHIFT) - 1;

    while (next() != null) {
      int cmp = term.compareTo(text);
      if (cmp == 0) {
        return SeekStatus.FOUND;
      } else if (cmp > 0) {
        return SeekStatus.NOT_FOUND;
      }
    }
    return SeekStatus.END;
  }

  @Override
  public void seekExact(long ord) throws IOException {
    long block = ord >>> INTERVAL_SHIFT; // the block that contains ord
    if (block != currentOrd >>> INTERVAL_SHIFT) { // not the block we are currently in
      // switch to different block
      input.seek(addresses.get(block)); // jump to that block
      readHeader();
    }

    currentOrd = ord;

    int offset = (int) (ord & INTERVAL_MASK);
    if (offset == 0) { // first term of the block
      readFirstTerm();
    } else { // a later term
      input.seek(currentBlockStart + offsets[offset - 1]);
      readTerm(offset);
    }
  }

  @Override
  public BytesRef term() throws IOException {
    return term;
  }

  @Override
  public long ord() throws IOException {
    return currentOrd;
  }

  @Override
  public int docFreq() throws IOException {
    throw new UnsupportedOperationException();
  }

  @Override
  public long totalTermFreq() throws IOException {
    return -1;
  }

  @Override
  public DocsEnum docs(Bits liveDocs, DocsEnum reuse, int flags) throws IOException {
    throw new UnsupportedOperationException();
  }

  @Override
  public DocsAndPositionsEnum docsAndPositions(Bits liveDocs, DocsAndPositionsEnum reuse, int flags) throws IOException {
    throw new UnsupportedOperationException();
  }

  @Override
  public Comparator<BytesRef> getComparator() {
    return BytesRef.getUTF8SortedAsUnicodeComparator();
  }
}
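The offset arithmetic in readByteAddresses and readTerm is easier to follow on a tiny in-memory block. The sketch below is not Lucene code: PrefixBlockSketch and its fields are hypothetical, it handles a single block, and it assumes that the length recorded for each later term is its suffix length minus one, which is what the reader's arithmetic implies. Under that assumption every entry occupies 2 + stored bytes: one byte for the shared-prefix length plus stored + 1 suffix bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PrefixBlockSketch {
  final byte[] firstTerm; // kept in full, like the first term in a block header
  final int[] storedLens; // storedLens[i] = suffix length of term i minus 1 (index 0 unused)
  final byte[] body;      // per later term: 1 byte shared-prefix length, then the suffix bytes
  final int[] offsets;    // offsets[i] = end of term i's entry inside body

  PrefixBlockSketch(String... sortedUniqueTerms) {
    byte[][] terms = new byte[sortedUniqueTerms.length][];
    for (int i = 0; i < terms.length; i++) {
      terms[i] = sortedUniqueTerms[i].getBytes(StandardCharsets.UTF_8);
    }
    firstTerm = terms[0];
    storedLens = new int[terms.length];
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 1; i < terms.length; i++) {
      int shared = sharedPrefix(firstTerm, terms[i]);
      int suffix = terms[i].length - shared; // at least 1 because terms are sorted and unique
      out.write(shared);
      out.write(terms[i], shared, suffix);
      storedLens[i] = suffix - 1;
    }
    body = out.toByteArray();
    // Same accumulation as readByteAddresses.
    offsets = new int[terms.length];
    int addr = 0;
    for (int i = 1; i < offsets.length; i++) {
      addr += 2 + storedLens[i];
      offsets[i] = addr;
    }
  }

  /** Number of leading bytes a and b have in common, capped at 255 so it fits in one byte. */
  static int sharedPrefix(byte[] a, byte[] b) {
    int max = Math.min(255, Math.min(a.length, b.length));
    int n = 0;
    while (n < max && a[n] == b[n]) {
      n++;
    }
    return n;
  }

  /** Decodes term i with the same steps as seekExact followed by readTerm. */
  String term(int i) {
    if (i == 0) {
      return new String(firstTerm, StandardCharsets.UTF_8);
    }
    int pos = offsets[i - 1];                     // entry i starts where entry i - 1 ends
    int shared = body[pos] & 0xFF;                // length of the shared prefix
    int suffix = offsets[i] - offsets[i - 1] - 1; // entry size minus the prefix-length byte
    byte[] term = new byte[shared + suffix];
    System.arraycopy(firstTerm, 0, term, 0, shared);
    System.arraycopy(body, pos + 1, term, shared, suffix);
    return new String(term, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    PrefixBlockSketch block = new PrefixBlockSketch("apple", "applet", "apply", "banana");
    for (int i = 0; i < 4; i++) {
      System.out.println(i + " -> " + block.term(i));
    }
  }
}
```

Because each offset marks the end of an entry, any term in the block can be decoded directly without scanning the terms before it, which is what makes seekExact cheap.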
That covers the TermsEnum, which nearly every method relies on. With it in place, we can look at the method that finally returns the SortedDocValues:
public SortedDocValues getSorted(FieldInfo field) throws IOException {
  final int valueCount = (int) binaries.get(field.number).count;
  // Read the first part of the data, the part that stores the values (see getBinary above).
  final BinaryDocValues binary = getBinary(field);
  NumericEntry entry = ords.get(field.number);
  // Read the second part of the data, the part that stores each doc's ordinal.
  // This is the same as reading numeric doc values, so it is not repeated here.
  final LongValues ordinals = getNumeric(entry);
  return new SortedDocValues() {

    @Override
    public int getOrd(int docID) { // ordinal (rank) of a doc; ordinals are stored in docID order, so this is a direct read
      return (int) ordinals.get(docID);
    }

    @Override
    public BytesRef lookupOrd(int ord) { // find the value for an ordinal
      return binary.get(ord);
    }

    @Override
    public int getValueCount() { // number of byte[] values
      return valueCount;
    }

    @Override
    public int lookupTerm(BytesRef key) { // check whether a byte[] exists
      if (binary instanceof CompressedBinaryDocValues) { // this is the branch taken
        // Look the term up: returns its ordinal if found, otherwise a negative number.
        // This also goes through the TermsEnum described above.
        return (int) ((CompressedBinaryDocValues) binary).lookupTerm(key);
      } else {
        return super.lookupTerm(key);
      }
    }

    @Override
    public TermsEnum termsEnum() { // iterate over all byte[] values
      if (binary instanceof CompressedBinaryDocValues) {
        return ((CompressedBinaryDocValues) binary).getTermsEnum();
      } else {
        return super.termsEnum();
      }
    }
  };
}
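Stripped of the file format, the object returned by getSorted is a two-step lookup: docID to ordinal, then ordinal to value. A minimal sketch with plain arrays standing in for the numeric and binary parts follows. The class SortedLookupSketch and the String values are assumptions for illustration, not Lucene types.

```java
final class SortedLookupSketch {
  private final int[] ordByDoc;      // stands in for the numeric part: one ordinal per docID, -1 if missing
  private final String[] valueByOrd; // stands in for the binary part: distinct values in sorted order

  SortedLookupSketch(int[] ordByDoc, String[] valueByOrd) {
    this.ordByDoc = ordByDoc;
    this.valueByOrd = valueByOrd;
  }

  int getOrd(int docID) {
    return ordByDoc[docID];
  }

  String lookupOrd(int ord) {
    return valueByOrd[ord];
  }

  int getValueCount() {
    return valueByOrd.length;
  }

  /** Mirrors SortedDocValues.get: an empty value for docs that have none. */
  String get(int docID) {
    int ord = getOrd(docID);
    return ord == -1 ? "" : lookupOrd(ord);
  }

  public static void main(String[] args) {
    // Four docs sharing two distinct values; doc 1 has no value.
    SortedLookupSketch values = new SortedLookupSketch(new int[] {1, -1, 0, 1}, new String[] {"bar", "foo"});
    for (int doc = 0; doc < 4; doc++) {
      System.out.println("doc " + doc + " ord=" + values.getOrd(doc) + " value='" + values.get(doc) + "'");
    }
  }
}
```

Docs 0 and 3 share ordinal 1, so the value "foo" is stored once however many docs carry it.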
That completes reading SortedDocValues. The key point is how the sorted values are stored: in much the same way as Lucene's term dictionary, and on the read side they are wrapped in a Lucene TermsEnum. With SortedDocValues you can easily get the rank of a doc's value, which will probably come in handy later.
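As a hedged illustration of why the rank is useful: since ordinals follow the sort order of the values, docs can be ordered by comparing ints, without reading any byte[]. The sketch below assumes the ordinals have already been loaded into an int[]. OrdSortSketch and sortDocsByOrd are hypothetical names, and this is not how Lucene's own sorting is implemented.

```java
import java.util.Arrays;
import java.util.Comparator;

final class OrdSortSketch {

  /** Returns docIDs ordered by the ordinal of their value; docs without a value (-1) come first. */
  static Integer[] sortDocsByOrd(int[] ordByDoc) {
    Integer[] docs = new Integer[ordByDoc.length];
    for (int i = 0; i < docs.length; i++) {
      docs[i] = i;
    }
    // Comparing ordinals gives the same order as comparing the underlying values.
    // The sort is stable, so docs with equal ordinals keep their docID order.
    Arrays.sort(docs, Comparator.comparingInt(doc -> ordByDoc[doc]));
    return docs;
  }

  public static void main(String[] args) {
    int[] ordByDoc = {1, -1, 0, 1};
    System.out.println(Arrays.toString(sortDocsByOrd(ordByDoc)));
  }
}
```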