FieldCache in Lucene 3.0.3

      FieldCache is well hidden in Lucene: it rarely shows up in everyday API usage. I first ran into this class in the source code for sorting. Today I read CustomScoreQuery, which also uses FieldCache, so the class is evidently more important than I thought and worth writing down. (Addition: the version I am reading is 3.0.3. From Lucene 4.x onward, doc values are used for sort and facet, and FieldCache is no longer used. I am recording this anyway because I am a bit obsessive about finishing what I started.)

      The purpose of FieldCache is to take the terms of a field from the term dictionary (the tis file), associate each term with the docs that contain it, and load the result into memory. For example, suppose we index documents describing people, each with an age field, and we want to sort search results by age. FieldCache reads all the terms of that field. Assume there are only two ages, 20 and 30: Zhang San (doc id 0) and Li Si (doc id 2) are 20, and Wang Wu (doc id 1) is 30. The result is an array indexed by doc id: [20,30,20]. (Readers who have seen the Lucene source code may object here: when Lucene indexes a number with NumericField it does not simply store 20 and 30, it also indexes extra lower-precision terms so that range queries can be split up. If you had this question, you understand Lucene well. The default parsers discussed here expect a field that is not tokenized and has a single term per doc, so in this example the age is indexed with a plain Field rather than NumericField. Values indexed with NumericField need the NumericUtils-based parsers, such as FieldCache.NUMERIC_UTILS_INT_PARSER, instead.)
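The age example can be reproduced without Lucene. The sketch below is only an illustration: the class name, the uninvert method, and the map standing in for the term dictionary plus postings are all hypothetical and are not part of Lucene's API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for "term dictionary + postings"; not Lucene's API.
class UninvertSketch {
    /** Builds a docId -> value array from a term -> docIds mapping. */
    static int[] uninvert(Map<String, int[]> postings, int maxDoc) {
        int[] values = new int[maxDoc];                 // one slot per document
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            int parsed = Integer.parseInt(e.getKey());  // term text -> number
            for (int docId : e.getValue()) {
                values[docId] = parsed;                 // every doc of this term gets its value
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("20", new int[] {0, 2});  // Zhang San (0), Li Si (2)
        postings.put("30", new int[] {1});     // Wang Wu (1)
        System.out.println(Arrays.toString(uninvert(postings, 3))); // prints [20, 30, 20]
    }
}
```

The important point is the direction of the mapping: the index goes from term to docs, while the cache goes from doc to value.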

      FieldCache works through one global instance of FieldCacheImpl, declared in FieldCache as public static FieldCache DEFAULT = new FieldCacheImpl(); every cache lives in this instance. The class has a map attribute named caches, keyed by the type of value to be returned. In the init method you can see that the keys are the Class objects for byte, int, and so on, and the corresponding values are ByteCache, IntCache, etc. The analysis below uses ByteCache as the example (the others work the same way, because the index stores every term as a string, and byte, int, and long differ only in the converter used).

      ByteCache itself holds another map whose key identifies a reader. In org.apache.lucene.search.FieldCacheImpl.Cache.get(IndexReader, Entry) you can see that this key is reader.getFieldCacheKey(); in SegmentReader it is the freqStream object. As I understand it, it does not matter what getFieldCacheKey returns. It only serves as an identity that tells whether the current reader has been cached, because what ends up cached always comes from that reader's term dictionary. The value is yet another map, whose key is an Entry. An Entry has two attributes, the field name and the parser, which means that all terms of that field are read from the term dictionary and converted with the specified parser (for ByteCache, the default parser ends up calling Byte.parseByte).
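To make the nesting easier to picture, here is a minimal sketch of the two inner levels (reader key, then Entry). It is not the Lucene source: the class and method names are made up for illustration, and the WeakHashMap for the outer level is an assumption on my part about how entries for discarded readers can be released.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.WeakHashMap;

// Illustrative only; loosely modelled on the structure described above.
class NestedCacheSketch {
    /** Inner key: field name plus parser, compared by value. */
    static final class Entry {
        final String field;
        final Object custom; // the parser; may be null

        Entry(String field, Object custom) {
            this.field = field;
            this.custom = custom;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Entry)) return false;
            Entry other = (Entry) o;
            return field.equals(other.field) && Objects.equals(custom, other.custom);
        }

        @Override public int hashCode() {
            return Objects.hash(field, custom);
        }
    }

    // Outer key: an opaque per-reader object. Inner key: Entry.
    // Assumption: weak keys, so a dropped reader key frees its inner map.
    private final Map<Object, Map<Entry, Object>> readerCache = new WeakHashMap<>();

    void put(Object readerKey, Entry entry, Object value) {
        readerCache.computeIfAbsent(readerKey, k -> new HashMap<>()).put(entry, value);
    }

    Object get(Object readerKey, Entry entry) {
        Map<Entry, Object> inner = readerCache.get(readerKey);
        return inner == null ? null : inner.get(entry);
    }
}
```

A lookup only hits when the reader key, the field name, and the parser all match; changing any one of them results in a separate cached array.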

      So far none of this involves docs, so it is time to let the code speak. Below is the code of org.apache.lucene.search.FieldCacheImpl.Cache.get(IndexReader, Entry):

public Object get(IndexReader reader, Entry key) throws IOException {
      Map<Entry,Object> innerCache;
      Object value;
      final Object readerKey = reader.getFieldCacheKey();//Identity object for this reader; it is used only as a map key
      synchronized (readerCache) {//Lock to guard against concurrent modification by multiple threads
        innerCache = readerCache.get(readerKey);//innerCache is also a map, keyed by Entry
        if (innerCache == null) {//Nothing has been cached for the current reader yet
          innerCache = new HashMap<Entry,Object>();
          readerCache.put(readerKey, innerCache);
          value = null;
        } else {
          value = innerCache.get(key);//The reader has a cache: look up by Entry, which specifies the field and the parser (converter)
        }
        if (value == null) {//Nothing found for this Entry (the field differs, the parser differs, or the current reader is not cached at all)
          value = new CreationPlaceholder();//Put a placeholder in the map; the real value is created later, outside the readerCache lock
          innerCache.put(key, value);
        }
      }
      if (value instanceof CreationPlaceholder) {
        synchronized (value) {
          CreationPlaceholder progress = (CreationPlaceholder) value;
          if (progress.value == null) {//No value yet, so nothing has been loaded
            progress.value = createValue(reader, key);//Build the cached value; see the next code listing
            synchronized (readerCache) {
              innerCache.put(key, progress.value);
            }

            // Only check if key.custom (the parser) is
            // non-null; else, we check twice for a single
            // call to FieldCache.getXXX
            if (key.custom != null && wrapper != null) {
              final PrintStream infoStream = wrapper.getInfoStream();
              if (infoStream != null) {
                printNewInsanity(infoStream, progress.value);
              }
            }
          }
          return progress.value;
        }
      }
      return value;
    }
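The locking pattern in get can be isolated from Lucene. The sketch below is my own simplified reconstruction under stated assumptions: the class name, the generic types, and the Function-based loader are hypothetical, and the info-stream check is left out. It shows the idea of holding the map lock only briefly and doing the slow load under a per-key placeholder lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified, hypothetical reconstruction of the placeholder pattern.
class PlaceholderCacheSketch<K, V> {
    /** Marks a value that is being, or is about to be, computed. */
    private static final class Placeholder {
        Object value;
    }

    private final Map<K, Object> cache = new HashMap<>();
    private final Function<K, V> loader; // stands in for createValue

    PlaceholderCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        Object value;
        synchronized (cache) {                    // short critical section: map access only
            value = cache.get(key);
            if (value == null) {
                value = new Placeholder();
                cache.put(key, value);
            }
        }
        if (value instanceof Placeholder) {
            synchronized (value) {                // only callers asking for this key wait here
                Placeholder p = (Placeholder) value;
                if (p.value == null) {
                    p.value = loader.apply(key);  // slow work, done outside the map lock
                    synchronized (cache) {
                        cache.put(key, p.value);  // swap the placeholder for the real value
                    }
                }
                return (V) p.value;
            }
        }
        return (V) value;
    }
}
```

With this arrangement, two threads asking for different keys can load in parallel, while two threads asking for the same key load it only once.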

 

 

   The following is the createValue code of ByteCache, which fills the cache:

@Override
    protected Object createValue(IndexReader reader, Entry entryKey)
        throws IOException {
      Entry entry = entryKey;
      String field = entry.field;//Field name
      ByteParser parser = (ByteParser) entry.custom;
      if (parser == null) {
        return wrapper.getBytes(reader, field, FieldCache.DEFAULT_BYTE_PARSER);
      }
      final byte[] retArray = new byte[reader.maxDoc()];
      TermDocs termDocs = reader.termDocs();//Enumerator over the postings (inverted lists) of the current reader
      TermEnum termEnum = reader.terms (new Term (field));//Term dictionary of the current reader, positioned at the first term of the given field
      try {
        //Loop through all terms of the current field
        do {
          Term term = termEnum.term();
          if (term==null || term.field() != field) break;
          byte termval = parser.parseByte(term.text());//Parse the term text into a byte with the parser; the default parser uses Byte.parseByte
          termDocs.seek (termEnum);//Position on the postings of the current term
          // Loop through all docs that contain the current term
          while (termDocs.next()) {//Store termval, the parsed value, in the slot of every doc that contains the term
            retArray[termDocs.doc()] = termval;
          }
        } while (termEnum.next());
      } catch (StopFillCacheException stop) {
      } finally {
        termDocs.close();
        termEnum.close();
      }
      return retArray;
    }

(The code above may raise a question: what if a doc has no term in the current field? The answer is simple: its value is 0. A new Java array of byte, short, int, long, float, or double is filled with zeros, and the loop never writes to the slot of a doc that has no term, so that slot stays 0. This also means a missing value cannot be told apart from a real 0.)
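A tiny sketch makes the default-of-0 behaviour concrete. As before, the class, the fill method, and the map standing in for the postings are hypothetical and only mimic the loop in createValue.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo of the default value; not Lucene's API.
class MissingValueSketch {
    static byte[] fill(Map<String, int[]> postings, int maxDoc) {
        byte[] values = new byte[maxDoc];        // Java zero-initialises the array
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            byte parsed = Byte.parseByte(e.getKey());
            for (int docId : e.getValue()) {
                values[docId] = parsed;          // docs without a term are never written
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("0", new int[] {0});    // doc 0 really has the value 0
        postings.put("20", new int[] {2});   // doc 1 has no term in this field
        System.out.println(Arrays.toString(fill(postings, 3))); // prints [0, 0, 20]
    }
}
```

Doc 0 and doc 1 end up with the same cached value even though only doc 0 was indexed with it.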

 

We can work out the time complexity of the do-while loop above. Terms and docs are in a one-to-many relationship: one term corresponds to multiple docs, but one doc corresponds to only one term (as mentioned above, the field is not tokenized). Each doc is therefore visited once, so filling the cache costs O(n) in the number of docs, and that is just to build the cache. Faceting costs more on top of that. This is why the introduction of doc values was so necessary: with them there is no FieldCache to build, and the value for faceting or sorting is read directly by docId.
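Once the array has been built, using it is cheap: every lookup during sorting is a plain array access by doc id. The following is a minimal sketch of that idea, not Lucene's comparator code; the class name and the sortDocs method are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: order matching docs by their cached field value.
class SortByCacheSketch {
    static Integer[] sortDocs(int[] hits, final int[] values) {
        Integer[] docs = new Integer[hits.length];
        for (int i = 0; i < hits.length; i++) {
            docs[i] = hits[i];
        }
        // values[d] is an O(1) lookup; the sort is stable, so ties keep hit order
        Arrays.sort(docs, Comparator.comparingInt((Integer d) -> values[d]));
        return docs;
    }

    public static void main(String[] args) {
        int[] cached = {20, 30, 20};  // the array from the age example
        int[] hits = {0, 1, 2};       // doc ids matched by some query
        System.out.println(Arrays.toString(sortDocs(hits, cached))); // prints [0, 2, 1]
    }
}
```

The O(n) cost is paid once per reader and field when the array is filled; after that, each comparison costs two array reads.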

      That is all there is to the FieldCache code. The class is used in sort, facet, and CustomScoreQuery. Although the more efficient DocValues took over from Lucene 4 onward, it is still worth reading, so that at least we know where things came from.

 

 

