Numeric indexing and numeric range queries in Lucene 3.0.3

      I spent three afternoons and a morning on this before I finally understood how Lucene indexes numbers and searches numeric ranges. Most of the time went into NumericRangeQuery. I failed again and again, but I never planned to give up; digging into and exploring things like this has always been one of my main interests, and the joy at the end outweighs all the pain before it. I am writing this note for myself and for anyone who may be just as confused. Note: if you are not familiar with Lucene's index format, and especially if you are new to Lucene, please skip this; it is only meant for programmers who study the source code in depth.

      Numbers are not indexed by simply converting them to strings, because plain string terms cannot support range search. Suppose we index 1, 5, 20, 56, 89, 200, 201, 202, 203...299, 500. Sorted the way Lucene sorts terms, lexicographically, they come out as 1, 20, 200, 201, 202, 203...299, 5, 56, 500, 89, and this order is obviously useless for range queries: to search the range 1-300 we would have to enumerate every term under the field, because a small value such as 89 can still appear at the very end of the term list. Plain character order is clearly not the right way to sort numbers, so let's see how Lucene actually stores them. For numeric fields we use Lucene's NumericField; a minimal usage sketch is shown below. Its most important attribute is its TokenStream, a NumericTokenStream, which splits the number into tokens (in effect building a trie). After the sketch we will look at its incrementToken method, which does the splitting; taking long (64-bit) values as the example, it works by calling NumericUtils.longToPrefixCoded:
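
Before diving into the internals, here is a minimal indexing sketch (Lucene 3.0.x API; the field name "price", the value and the IndexWriter variable are only illustrative assumptions). NumericField wraps exactly the NumericTokenStream discussed next, so this single value ends up in the index as several prefix-coded terms rather than one plain string:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;

Document doc = new Document();
// precisionStep of 4 bits per trie level; value is stored and indexed
NumericField price = new NumericField("price", 4, Field.Store.YES, true);
price.setLongValue(8153L);
doc.add(price);
// writer.addDocument(doc); // assumes an existing IndexWriter `writer`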

/** Split a numeric value into multiple tokens, handling precisionStep bits per call. */
@Override
public boolean incrementToken() {
	
	if (valSize == 0)
		throw new IllegalStateException("call set???Value() before usage");
	if (shift >= valSize)
		return false;
	clearAttributes();
	final char[] buffer;
	switch (valSize) {
		case 64:
			buffer = termAtt.resizeTermBuffer(NumericUtils.BUF_SIZE_LONG);
			// Convert the number to characters and put the result into the buffer.
			// Each call produces one prefix-coded string for the current shift, so the key
			// is longToPrefixCoded, which encodes the bits that have not been dropped yet.
			termAtt.setTermLength(NumericUtils.longToPrefixCoded(value, shift, buffer));
			break;
		case 32:
			buffer = termAtt.resizeTermBuffer(NumericUtils.BUF_SIZE_INT);
			termAtt.setTermLength(NumericUtils.intToPrefixCoded((int) value, shift, buffer));
			break;
		default:
			// should not happen
			throw new IllegalArgumentException("valSize must be 32 or 64");
	}
	typeAtt.setType((shift == 0) ? TOKEN_TYPE_FULL_PREC : TOKEN_TYPE_LOWER_PREC);
	posIncrAtt.setPositionIncrement((shift == 0) ? 1 : 0); // the first token of a number gets increment 1; later tokens belong to the same number, so their increment is 0
	shift += precisionStep; // advance the shift for the next token
	return true;
	
}

 There are two important fields in the method above: shift and precisionStep. Numbers are stored in binary, and the idea of the tokenization is to drop precisionStep more low-order bits on each call; the dropped bits are ignored and the remaining bits form a string. precisionStep is the number of binary digits dropped per step, and shift is the total number of digits dropped so far. Let's look at the source of NumericUtils.longToPrefixCoded:

public static int longToPrefixCoded(final long val, final int shift, final char[] buffer) {
		
	if (shift > 63 || shift < 0)
		throw new IllegalArgumentException("Illegal shift value, must be 0..63");
	
	// Number of chars needed for the bits kept in this call: the remaining 64 - shift bits
	// are packed 7 bits per char, hence (63 - shift) / 7 + 1.
	int nChars = (63 - shift) / 7 + 1;
	
	// Total length is one char more, because the first char records the shift.
	int len = nChars + 1;
	// Fill in the first character: SHIFT_START_LONG (0x20, i.e. 32) + shift. Putting the shift
	// first has a purpose: a larger shift (a higher, coarser trie level) produces a larger first
	// char, so those terms sort after the more precise ones, which matches Lucene's string term
	// order (just as bx always sorts after ax).
	buffer[0] = (char) (SHIFT_START_LONG + shift);
		
	// 0x8000000000000000L has only the 64th (sign) bit set. XOR-ing with it flips the sign bit:
	// a positive value gets a leading 1 and a negative value a leading 0. After this, all
	// negative numbers sort before all positive numbers while the relative order inside each
	// group is unchanged, so the encoded terms sort in numeric order.
	long sortableBits = val ^ 0x8000000000000000L;
	sortableBits >>>= shift; // drop the bits already shifted away; only the remaining bits count
	while (nChars >= 1) {
		// Store 7 bits per character for good efficiency when UTF-8 encoding.
		// Terms are stored as UTF-8 on disk, and code points up to 127 take only one byte,
		// so 7 bits per char is the most space-efficient packing.
		// The whole number is right-justified so that lucene can
		// prefix-encode the terms more efficiently.
		buffer[nChars--] = (char) (sortableBits & 0x7f);
		sortableBits >>>= 7; // continue with the next 7 bits
	}
	return len;
}

The binary description above is not very intuitive, so let's use decimal as an example. Suppose we index the number 8153 with a precisionStep of 1, meaning one decimal digit is dropped at a time. On the first call the shift is 0, so the whole of 8153 forms a string (plus the character that records the shift; in this analogy we omit it and just use the digits themselves as the term). The first term is therefore 8153, the second is 815 (shift 1), the third is 81 (shift 2), and the fourth is 8 (shift 3). It is easy to see that this is a trie structure, and the four terms can be read as 8xxx, 81xx, 815x and the exact value 8153, each of which leads to the current document. Now make the step a little larger, two decimal digits at a time: only 8153 and 81 are produced, meaning the document can be reached through the terms 8153 and 81xx. So the smaller the precision step, the more terms are generated and the larger the index becomes (while a step that is too large slows range search down, as discussed later). As other numbers are indexed they keep contributing terms, and all of these terms together form a trie: the smaller the precisionStep, the more nodes this tree has and the bigger the final index is. A small sketch of the real encoding follows.
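
To see the real binary encoding at work, here is a rough sketch using the Lucene 3.0.x NumericUtils API directly (the value and precisionStep are just the ones from the decimal analogy above; treat the exact calls as an assumption if your version differs):

import org.apache.lucene.util.NumericUtils;

long value = 8153L;
int precisionStep = 4;
for (int shift = 0; shift < 64; shift += precisionStep) {
	// one prefix-coded term per trie level; its first char encodes SHIFT_START_LONG + shift
	String term = NumericUtils.longToPrefixCoded(value, shift);
	System.out.println("shift=" + shift + ", encoded length=" + term.length());
}
// the shift-0 term round-trips back to the original value:
System.out.println(NumericUtils.prefixCodedToLong(NumericUtils.longToPrefixCoded(value, 0)) == value); // true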

      

        Search: NumericRangeQuery. This class performs range search over the trie. The key is to split the range being searched into pieces, find the corresponding nodes in the trie (that is, the terms generated when the index was built), and then rewrite all the terms found according to the rewrite rules, for example into a BooleanQuery or into a filter. It really is that simple, but it took me a long time to understand. The most critical part of NumericRangeQuery is finding those terms, which happens in its getEnum method; it returns a NumericRangeTermEnum. Let's look at its code:

// Divide the query interval into several small query intervals
NumericUtils.splitLongRange(new NumericUtils.LongRangeBuilder() {
	@Override
	public final void addRange(String minPrefixCoded, String maxPrefixCoded) {
		rangeBounds.add(minPrefixCoded);
		rangeBounds.add(maxPrefixCoded);
	}
}, precisionStep, minBound, maxBound);

 The method above walks the trie that was built at index time, using the search range (the minimum and maximum values) and the precisionStep to find the matching nodes, and puts those nodes into a linked list. The constructor of NumericRangeTermEnum has already converted all comparisons into >= and <= at this point. Let's look at NumericUtils.splitLongRange:

/** Decompose the query interval into several smaller sub-intervals. */
private static void splitRange(final Object builder, final int valSize, final int precisionStep, long minBound, // lower bound
		long maxBound /* upper bound */) {
	if (precisionStep < 1)
		throw new IllegalArgumentException("precisionStep must be >=1");
	if (minBound > maxBound)
		return;
	for (int shift = 0;; shift += precisionStep) {
		
		// calculate new bounds for inner precision
		final long diff = 1L << (shift + precisionStep);
	final long mask = ((1L << precisionStep) - 1L) << shift; // mask selecting the bits of the current precision level (also the largest value those bits can hold)
			
	// Does minBound constrain anything within the bits handled at this precision level? If the
	// masked value is 0, every value at this level is acceptable, so nothing needs to be added
	// here and we can move on to the next (coarser) level.
	// If it is non-zero, this level does constrain the range, hence the addRange further down
	// (see Explanation 1 below).
	final boolean hasLower = (minBound & mask) != 0L;
	// The mask is also the largest value this level's bits can hold, so if maxBound masked with
	// it equals the mask, every value at this level already satisfies <= maxBound and no
	// constraint is needed here; otherwise this level must be limited, hence the addRange below
	// (see Explanation 2).
	final boolean hasUpper = (maxBound & mask) != mask;
		
		
	// If the lower bound has any remainder at the current level, first add diff (one step of the
	// next level) so the new lower bound starts at the next level's smallest valid value, and
	// only then clear the current level's bits.
	// Decimal example: for 632 with precisionStep 1 we cannot simply drop the 2 and keep 63x,
	// because 630 and 631 would then match too; instead we add 10 so that only 64x and above
	// remain for the next level.
	final long nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask; // & ~mask zeroes the bits of the current level
	
	// Same principle for the upper bound. In decimal: if the last digit is 9, every value at this
	// level satisfies the bound and we can go straight to the next level; but if maxBound is 765
	// we cannot simply drop the 5 and keep 76x, because 768 and 769 would also match at the next
	// level, so we subtract one step and effectively use 75x.
	final long nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
		
	// These two flags catch overflow corner cases; see Explanations 3 and 4 below.
	final boolean lowerWrapped = nextMinBound < minBound;
	final boolean upperWrapped = nextMaxBound > maxBound;
	
	// 1. The first condition checks whether a next precision level even exists.
	// 2. The second checks whether the new bounds have crossed, i.e. the new lower bound exceeds
	//    the new upper bound; continuing would be pointless in that case.
	// 3./4. The two wrap flags are covered by Explanations 3 and 4.
		if (shift + precisionStep >= valSize || nextMinBound > nextMaxBound || lowerWrapped || upperWrapped) {
			// We are in the lowest precision or the next precision is not available.
			addRange(builder, valSize, minBound, maxBound, shift);
			// exit the split recursion loop
			break;
		}
		
	if (hasLower)
		// add the lower boundary range: from minBound up to the top of its block at this
		// level (minBound | mask sets all of this level's bits to 1)
		addRange(builder, valSize, minBound, minBound | mask, shift);
	
	if (hasUpper)
		// add the upper boundary range: from the bottom of maxBound's block at this level
		// (maxBound & ~mask clears this level's bits) up to maxBound
		addRange(builder, valSize, maxBound & ~mask, maxBound, shift);

		minBound = nextMinBound;
		maxBound = nextMaxBound;
	}
}

       Explanation 1: final boolean hasLower = (minBound & mask) != 0L. If ANDing minBound with the current level's mask gives 0, all of minBound's bits at this level are 0, so every value at this level already satisfies >= the bound; no constraining node needs to be added here and the algorithm can move on to the next, coarser precision. Conversely, if the masked value is non-zero the bound is not the minimum of this level, so a constraining node has to be added to limit the results, which is why addRange is called later when this flag is true (see the method above).

 

      Explanation 2: final boolean hasUpper = (maxBound & mask) != mask. If maxBound masked with the current level equals the mask itself, then every value at this level satisfies <= maxBound, so no constraint needs to be added at this level and the algorithm can continue to the next precision.

  Explanation 3: nextMinBound < minBound. At first sight nextMinBound should always be greater than minBound, since diff is added to it before the current level's bits are cleared, so how could it end up smaller? It can, because the value is limited to 64 (or 32) bits. In decimal terms, imagine a 3-digit limit: when processing 997, once the trailing 7 has been handled we add 10 and clear the last digit, which should give 100x, but with only 3 digits it wraps around to 00x. That is exactly the case this check catches: minBound has already reached the top of the next precision level and we cannot climb any further toward the root of the trie, so the loop collecting terms must stop.

  Explanation 4: nextMaxBound > maxBound. The reason is the same as in Explanation 3, again in decimal: suppose maxBound is 003, precisionStep is 1 and we are limited to 3 digits. After handling the trailing 3 we subtract 10, which wraps the bound below zero, so this case likewise means no further levels of the trie can be used.

 

 

The addRange method: it eventually calls longToPrefixCoded to turn the numbers into strings and appends them to the linked list; each call adds two terms, the lower and upper bound of one sub-range. A small sketch of the sub-ranges that splitLongRange produces is shown below.
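
This sketch (assuming the Lucene 3.0.x NumericUtils API; precisionStep and the bounds are only illustrative) overrides the long-based addRange overload of LongRangeBuilder to print, for each level (shift), the numeric interval whose two prefix-coded bounds get added:

import org.apache.lucene.util.NumericUtils;

NumericUtils.splitLongRange(new NumericUtils.LongRangeBuilder() {
	@Override
	public void addRange(long min, long max, int shift) {
		// each call contributes one pair of prefix-coded terms to the range bounds
		System.out.println("shift=" + shift + "  [" + min + " .. " + max + "]");
	}
}, 4 /* precisionStep */, 1L /* minBound */, 300L /* maxBound */);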

 

Let's see how the generated terms are finally used, in the org.apache.lucene.search.NumericRangeQuery.NumericRangeTermEnum.next() method:

/** Increments the enumeration to the next element. True if one exists. */
@Override
public boolean next() throws IOException {
	
        // Why close the current termEnum and open a new one below instead of keeping a single
        // termEnum and just calling next() on it? Because the terms collected from the dictionary
        // are not adjacent: the terms covered by the constraining trie nodes cannot be reached by
        // one front-to-back scan, so each sub-range is read from its own starting point, quite
        // unlike an ordinary term scan.
        // Creating a fresh termEnum (via the term dictionary index, the .tii file in 3.0) inside
        // the while loop below lets Lucene seek to the wanted term much faster.
	// if a current term exists, the actual enum is initialized:
	// try change to next term, if no such term exists, fall-through
	if (currentTerm != null) {
		assert actualEnum != null;
		if (actualEnum.next()) {
			currentTerm = actualEnum.term();
			if (termCompare(currentTerm))
				return true;
		}
	}
	
	// if all above fails, we go forward to the next enum, if one is available
	currentTerm = null;
	while (rangeBounds.size() >= 2) { // bounds come in pairs (lower, upper); keep going while at least one pair remains
		
		assert rangeBounds.size() % 2 == 0;
		// close the current enum and read next bounds
		if (actualEnum != null) {
			actualEnum.close();
			actualEnum = null;
		}
		final String lowerBound = rangeBounds.removeFirst();
		this.currentUpperBound = rangeBounds.removeFirst();
		// create a new enum
		actualEnum = reader.terms(termTemplate.createTerm(lowerBound));
		currentTerm = actualEnum.term();
		if (currentTerm != null && termCompare(currentTerm))
			return true;
		// clear the current term for next iteration
		currentTerm = null;
	}
			
	// no more sub-range enums available
	assert rangeBounds.size() == 0 && currentTerm == null;
	return false;
}

 The idea is to use the generated constraining terms to locate, in the term dictionary, all the terms lying between each pair of them, so the picture is now clear. To summarize: from minBound, maxBound and precisionStep the query generates a set of constraining terms, and then scans the dictionary only between those terms. The advantage is that far fewer leaf terms have to be touched: when the constraining terms are built, the trie's parent nodes (and their parents, and so on) are used until there is no parent left or minBound and maxBound cross, so the dictionary scan skips a large number of terms and only reads the ones between the constraining nodes, which greatly reduces the number of terms examined and therefore speeds up the search.
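
For completeness, here is a minimal sketch of driving all of this through the public API (Lucene 3.0.x; the field name "price", the bounds and the searcher variable are illustrative assumptions):

import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

// The precisionStep passed here should match the one used when the field was indexed,
// otherwise the query will look for trie nodes that were never written.
Query q = NumericRangeQuery.newLongRange("price", 4, 1L, 300L, true, true);
// TopDocs hits = searcher.search(q, 10); // assumes an existing IndexSearcher `searcher`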

The effect of precisionStep on search: the larger the value, the shallower the generated trie and the wider the range each constraining node must cover, so more terms have to be enumerated and the search is slower; the smaller the value, the deeper the trie and the more constraining nodes are generated, but fewer terms have to be scanned overall, so the query and the final merge are faster. A rough sketch of this trade-off follows.
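
The sketch below (again assuming the Lucene 3.0.x NumericUtils API; the range 1..300 is only an example) counts the sub-ranges, each of which becomes a separate seek into the term dictionary at search time, that splitLongRange produces for the same query range under different precisionStep values:

import org.apache.lucene.util.NumericUtils;
import java.util.concurrent.atomic.AtomicInteger;

for (int step : new int[] {1, 2, 4, 8, 16}) {
	final AtomicInteger pairs = new AtomicInteger();
	NumericUtils.splitLongRange(new NumericUtils.LongRangeBuilder() {
		@Override
		public void addRange(long min, long max, int shift) {
			pairs.incrementAndGet(); // one pair of constraining terms per call
		}
	}, step, 1L, 300L);
	// a smaller step yields more, but tighter, sub-ranges; a larger step yields fewer,
	// wider sub-ranges that force more terms to be enumerated inside each one
	System.out.println("precisionStep=" + step + " -> " + pairs.get() + " sub-ranges");
}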

 

 

 
