Lucene numeric field indexing and range query analysis

Query q = NumericRangeQuery.newLongRange("idField", 1L, 10L, true, true);

When a numeric field is indexed, its value is converted into several lexicographically sortable strings, which are then indexed into a trie-like structure.

For example: suppose num1 is split into a, ab, abc and num2 is split into a, ab, abd.

【figure 1】:

Both num1 and num2 carry the prefix ab, so both can be found by searching for ab. In a range query, a single lookup on a shared prefix returns every doc whose value falls under that prefix, which reduces the number of lookups.

The following explains how numeric indexing and range queries work.

1: Binary representation of the value

Take long as an example: it has 1 sign bit followed by 63 value bits. The sign bit is 0 for non-negative numbers and 1 for negative numbers.

Among positive numbers, the larger the low 63 bits, the larger the number; the same holds among negative numbers.

If the sign bit is inverted, Long.MIN_VALUE to Long.MAX_VALUE can be expressed as: 0x0000,0000,0000,0000 to 0xFFFF,FFFF,FFFF,FFFF

After this conversion, aren't the values already sorted from small to large at the character level?
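As a minimal sketch of the sign-bit flip described above (the class and method names here are made up for illustration and are not part of Lucene), the following prints a few long values next to their flipped bit patterns so the ordering can be checked by eye:

```java
class SignFlipDemo {
    // Invert the sign bit so that signed order matches unsigned (bitwise) order.
    static long toSortableBits(long val) {
        return val ^ 0x8000000000000000L;
    }

    public static void main(String[] args) {
        long[] values = {Long.MIN_VALUE, -2L, -1L, 0L, 1L, Long.MAX_VALUE};
        for (long v : values) {
            // %016X prints the raw 64-bit pattern as unsigned hex.
            System.out.printf("%20d -> 0x%016X%n", v, toSortableBits(v));
        }
    }
}
```

Comparing the flipped patterns as unsigned numbers (for instance with Long.compareUnsigned) gives the same order as comparing the original signed values.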

2: How to split a value into prefixes

Take 0x0000,0000,0000,F234 as an example, shifting right by 4 bits each time.

 1: 0x0000,0000,0000,F23 is the common prefix of every value in the range 0x0000,0000,0000,F230 to 0x0000,0000,0000,F23F

 2: 0x0000,0000,0000,F2 is the common prefix of every value in the range 0x0000,0000,0000,F200 to 0x0000,0000,0000,F2FF

 3: 0x0000,0000,0000,F is the common prefix of every value in the range 0x0000,0000,0000,F000 to 0x0000,0000,0000,FFFF

 ....

 0x0

 If a value is shifted right several times, each resulting key represents a range of values. The key can be understood as a prefix of the value.
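The shift-to-range relationship above can be sketched as follows. This is only an illustration with hypothetical helper names (rangeMin, rangeMax); it assumes a shift between 0 and 63 and is not taken from Lucene:

```java
class PrefixRangeDemo {
    // Smallest value that shares the prefix (val >>> shift): low bits cleared.
    static long rangeMin(long val, int shift) {
        return (val >>> shift) << shift;
    }

    // Largest value that shares the prefix (val >>> shift): low bits all set.
    static long rangeMax(long val, int shift) {
        return rangeMin(val, shift) | ((1L << shift) - 1L);
    }

    public static void main(String[] args) {
        long val = 0xF234L;
        for (int shift = 4; shift <= 12; shift += 4) {
            System.out.printf("shift=%2d prefix=0x%X covers 0x%X .. 0x%X%n",
                    shift, val >>> shift, rangeMin(val, shift), rangeMax(val, shift));
        }
    }
}
```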

3: Folding a large range into a few small ranges

At query time, Lucene splits a large range into a few sub-ranges and looks each sub-range up by its prefix, which reduces the number of lookups.

4: How numeric values are indexed

First set precisionStep (4 in this walk-through, which was the default in older Lucene releases). The n-th term is the value shifted right by (n-1) * precisionStep bits.

After each shift, the remaining bits are stored 7 bits per byte, most significant group first, forming a byte[].

A special byte is inserted at position 0 of the array to identify the shift.

Each byte[] can be converted into a lexicographically sortable string.

The lexicographic order of these strings is the same as the numeric order of the shifted values. This is the key to how NumericRangeQuery finds ranges!

A long has 64 bits in total, so with precisionStep = 4 there are 16 lexicographically sortable strings per value.

These 16 strings are the prefixes of one long value. Lucene's inverted index then stores them in an index structure similar to Figure 1.

The key splitting code:

The longToPrefixCodedBytes() method of the org.apache.lucene.util.NumericUtils class:

public static void longToPrefixCodedBytes(final long val, final int shift, final BytesRefBuilder bytes) {
    if ((shift & ~0x3f) != 0)  // ensure shift is 0..63
      throw new IllegalArgumentException("Illegal shift value, must be 0..63");
    // compute the size of the byte[]; each byte stores 7 bits
    int nChars = (((63-shift)*37)>>8) + 1;    // i/7 is the same as (i*37)>>8 for i in 0..63
    // position 0 additionally stores the shift, hence the +1
    bytes.setLength(nChars+1);   // one extra for the byte that contains the shift info
    bytes.grow(BUF_SIZE_LONG);
    // mark the shift in the first byte
    bytes.setByteAt(0, (byte)(SHIFT_START_LONG + shift));
    // invert the sign bit
    long sortableBits = val ^ 0x8000000000000000L;
    // shift right by shift bits; shift is 0 for the first term, then grows by precisionStep
    sortableBits >>>= shift;
    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      // each byte takes 7 bits; its top bit stays 0, so in UTF-8 it is a plain ASCII code
      bytes.setByteAt(nChars--, (byte)(sortableBits & 0x7f));
      sortableBits >>>= 7;
    }
  }
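To try the encoding without a Lucene dependency, here is a simplified re-implementation that writes into a plain byte[] instead of a BytesRefBuilder. The class name, the encode, termCount and compare helpers are hypothetical, and the value 0x20 for SHIFT_START_LONG is assumed to match the constant in NumericUtils:

```java
class PrefixCodedDemo {
    // Assumed to equal NumericUtils.SHIFT_START_LONG.
    static final int SHIFT_START_LONG = 0x20;

    // Same steps as longToPrefixCodedBytes, but returning a fresh byte[].
    static byte[] encode(long val, int shift) {
        if ((shift & ~0x3f) != 0) {
            throw new IllegalArgumentException("Illegal shift value, must be 0..63");
        }
        int nChars = (((63 - shift) * 37) >> 8) + 1; // number of 7-bit groups
        byte[] bytes = new byte[nChars + 1];         // +1 for the shift byte
        bytes[0] = (byte) (SHIFT_START_LONG + shift);
        long sortableBits = (val ^ 0x8000000000000000L) >>> shift;
        while (nChars > 0) {
            bytes[nChars--] = (byte) (sortableBits & 0x7f);
            sortableBits >>>= 7;
        }
        return bytes;
    }

    // Number of terms produced for one 64-bit value.
    static int termCount(int precisionStep) {
        return (64 + precisionStep - 1) / precisionStep;
    }

    // Lexicographic comparison; every byte is in 0..127, so signed compare is safe.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) return Integer.compare(a[i], b[i]);
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        System.out.println("terms per long at precisionStep 4: " + termCount(4));
        System.out.println("-5 sorts before 3: " + (compare(encode(-5L, 0), encode(3L, 0)) < 0));
    }
}
```

Two values that share a prefix at a given shift, such as 0xF230 and 0xF23F at shift 4, encode to the same bytes, which is what lets one term stand for a whole range.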

5: Range queries

 The general idea is to split the range starting from both ends. First, the partial blocks at the lower and upper ends are split off at the current shift; then the shift is increased by precisionStep and the same is done for the remaining middle part.

Finally, the bounds of each sub-range are converted, according to its shift, into lexicographically sortable strings in the same way as at indexing time, and these are used for the lookup.

Code:

 The splitRange() method of the org.apache.lucene.util.NumericUtils class:

private static void splitRange(
    final Object builder, final int valSize,
    final int precisionStep, long minBound, long maxBound
  ) {
    if (precisionStep < 1)
      throw new IllegalArgumentException("precisionStep must be >=1");
    if (minBound > maxBound) return;
    for (int shift=0; ; shift += precisionStep) {
      // calculate new bounds for inner precision
      final long diff = 1L << (shift+precisionStep),
        mask = ((1L<<precisionStep) - 1L) << shift;
      final boolean
        hasLower = (minBound & mask) != 0L,
        hasUpper = (maxBound & mask) != mask;
      final long
        nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask,
        nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
      final boolean
        lowerWrapped = nextMinBound < minBound,
        upperWrapped = nextMaxBound > maxBound;
      
      if (shift+precisionStep>=valSize || nextMinBound>nextMaxBound || lowerWrapped || upperWrapped) {
        // We are in the lowest precision or the next precision is not available.
        addRange(builder, valSize, minBound, maxBound, shift);
        // exit the split recursion loop
        break;
      }
      
      if (hasLower)
        addRange(builder, valSize, minBound, minBound | mask, shift);
      if (hasUpper)
        addRange(builder, valSize, maxBound & ~mask, maxBound, shift);
      
      // recurse to next precision
      minBound = nextMinBound;
      maxBound = nextMaxBound;
    }
  }

For example: the range 1001,0001 to 1111,0010 (0x91 to 0xF2) is split as follows with precisionStep = 4.

1: 1001,0001 to 1001,1111 (0x91 to 0x9F, 15 terms at shift 0)

   and 1111,0000 to 1111,0010 (0xF0 to 0xF2, 3 terms at shift 0)

2: 1010,0000 to 1110,1111, which after the first right shift is 0xA to 0xE (5 terms at shift 4)

Looking up these 23 lexicographically sortable strings covers the entire range.
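The count above can be checked with a small stand-alone sketch. It repeats the splitRange loop, but replaces addRange with a hypothetical collector that records {min, max, shift} triples in a list; the class name and the termCount helper are made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;

class SplitRangeDemo {
    // Same loop as splitRange; each result entry is {min, max, shift}.
    static List<long[]> splitRange(int valSize, int precisionStep, long minBound, long maxBound) {
        List<long[]> out = new ArrayList<>();
        if (precisionStep < 1) throw new IllegalArgumentException("precisionStep must be >=1");
        if (minBound > maxBound) return out;
        for (int shift = 0; ; shift += precisionStep) {
            final long diff = 1L << (shift + precisionStep);
            final long mask = ((1L << precisionStep) - 1L) << shift;
            final boolean hasLower = (minBound & mask) != 0L;
            final boolean hasUpper = (maxBound & mask) != mask;
            final long nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask;
            final long nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
            final boolean lowerWrapped = nextMinBound < minBound;
            final boolean upperWrapped = nextMaxBound > maxBound;
            if (shift + precisionStep >= valSize || nextMinBound > nextMaxBound
                    || lowerWrapped || upperWrapped) {
                out.add(new long[] {minBound, maxBound, shift}); // last, coarsest sub-range
                break;
            }
            if (hasLower) out.add(new long[] {minBound, minBound | mask, shift});
            if (hasUpper) out.add(new long[] {maxBound & ~mask, maxBound, shift});
            minBound = nextMinBound;
            maxBound = nextMaxBound;
        }
        return out;
    }

    // Number of distinct prefix terms covered by all sub-ranges.
    static long termCount(List<long[]> ranges) {
        long n = 0;
        for (long[] r : ranges) {
            int shift = (int) r[2];
            n += (r[1] >>> shift) - (r[0] >>> shift) + 1;
        }
        return n;
    }

    public static void main(String[] args) {
        List<long[]> ranges = splitRange(64, 4, 0x91L, 0xF2L);
        for (long[] r : ranges) {
            System.out.printf("shift=%d  0x%X .. 0x%X%n", r[2], r[0], r[1]);
        }
        System.out.println("terms: " + termCount(ranges));
    }
}
```

Note that the count is an upper bound on the terms actually visited, since only terms that exist in the index are enumerated.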

Origin blog.csdn.net/asdfsadfasdfsa/article/details/90644476