Lucene's index is built on the inverted index principle

Lucene's index is built on the inverted index principle. The structure and the corresponding generation algorithm are as follows. Suppose there are two articles:

Article 1: "Tom lives in Guangzhou, I live in Guangzhou too."
Article 2: "He once lived in Shanghai."

1. Because Lucene indexes and queries by keyword, we first need to extract the keywords of these two articles. This usually requires the following steps:

a. What we have is the article content, that is, a string, so we first need to split it into words (tokenization). English is easy to tokenize because words are separated by spaces; Chinese characters are written without separators and require special word segmentation.

b. Words such as "in", "once" and "too" carry no real meaning, just as words like "de" (的) and "shi" (是) in Chinese usually carry no specific meaning. Such words do not represent concepts and can be filtered out; these are the stop tokens mentioned in "Lucene Detailed Analysis".

c. A user searching for "He" usually also wants to find articles containing "he" and "HE", so all words are normalized to the same case (lowercase).

d. A user searching for "live" usually also wants to find articles containing "lives" and "lived", so "lives" and "lived" are reduced to the stem "live".

e. Punctuation marks usually do not represent any concept and can be filtered out as well.

In Lucene, all of the above steps are carried out by the Analyzer class. After this processing:

All keywords of article 1: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All keywords of article 2: [he] [live] [shanghai]

2. Once we have the keywords, we can build the inverted index. The mapping above is "article number" to "all keywords in the article"; the inverted index reverses this relationship into "keyword" to "all article numbers containing that keyword". After inversion, articles 1 and 2 become:
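The steps above (tokenize, drop stop words, lowercase, stem, then invert the article-to-keyword mapping) can be sketched in plain Java. This is a toy illustration, not Lucene's actual Analyzer API; the stop list and the lives/lived rule are hard-coded assumptions for this example:

```java
import java.util.*;

public class ToyInvertedIndex {
    // Toy stop list standing in for a real stop-token set.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("in", "once", "too"));

    // Tokenize on non-letters, lowercase, drop stop words, crude stemming.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (t.isEmpty() || STOP.contains(t)) continue;
            if (t.equals("lives") || t.equals("lived")) t = "live"; // stand-in for real stemming
            out.add(t);
        }
        return out;
    }

    // Invert "article number -> keywords" into "keyword -> article numbers".
    static Map<String, SortedSet<Integer>> invert(Map<Integer, String> articles) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> e : articles.entrySet())
            for (String kw : analyze(e.getValue()))
                index.computeIfAbsent(kw, k -> new TreeSet<>()).add(e.getKey());
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "Tom lives in Guangzhou, I live in Guangzhou too.");
        docs.put(2, "He once lived in Shanghai.");
        System.out.println(invert(docs));
    }
}
```

Running main prints the keyword-to-article-number mapping, which matches the table below.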
keyword     article number
guangzhou   1
he          2
i           1
live        1, 2
shanghai    2
tom         1

Usually it is not enough to know only which articles a keyword appears in; we also need its number of occurrences and its positions within each article. There are usually two kinds of position: a) character position, recording that the word is the nth character of the article (the advantage is that locating the keyword is fast when highlighting); b) keyword position, recording that the word is the nth keyword of the article (the advantages are that index space is saved and phrase queries are fast). Lucene records the latter. After adding the "occurrence frequency" and "occurrence position" information, our index structure becomes:

keyword     article number [frequency]   positions
guangzhou   1[2]                         3, 6
he          2[1]                         1
i           1[1]                         4
live        1[2], 2[1]                   2, 5, 2
shanghai    2[1]                         3
tom         1[1]                         1

Take "live" as an example: it appears twice in article 1 and once in article 2. What does its position list "2, 5, 2" mean? We have to read it together with the article numbers and frequencies: article 1 has two occurrences, so "2, 5" are the positions of live in article 1, and the remaining "2" means live is the 2nd keyword of article 2.

In the implementation, compression of numbers is used heavily: a number stores only its difference from the previous value, which shortens the numbers and thus reduces the bytes needed to store them. For example, if the current article number is 16389 (3 bytes when stored uncompressed) and the previous article number is 16382, only the delta 7 is stored after compression (one byte).

Now we can explain, by querying the index, why an index is worth building. Suppose we query for the word "live". Lucene first binary-searches the dictionary, finds the term, follows its pointer into the frequency file to read all article numbers, and returns the result. The dictionary is usually very small, so the whole process takes milliseconds.
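The delta-compression example above (16389 needing 3 bytes, the delta 7 needing only 1) can be reproduced with a variable-byte encoding of the kind Lucene calls VInt: 7 payload bits per byte, with the high bit marking that more bytes follow. This is a sketch of the encoding idea, not Lucene's actual file-format code:

```java
import java.io.ByteArrayOutputStream;

public class VIntDelta {
    // Variable-byte encoding: 7 payload bits per byte,
    // high bit set on every byte except the last.
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int prev = 16382, current = 16389;
        System.out.println(encodeVInt(current).length);        // full value: 3 bytes
        System.out.println(encodeVInt(current - prev).length); // delta 7: 1 byte
    }
}
```

Values up to 127 fit in one byte and values up to 16383 in two, which is why storing the delta 7 instead of the full article number 16389 saves two bytes.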
With an ordinary sequential matching algorithm, that is, no index and string matching performed over the content of every article, this process is quite slow, and when the number of articles is large the time is often unbearable.

3.2. Lucene's relevance score formula

score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

Notes:
score_d : the score of document d
sum_t : the sum over all query terms t
tf_q : the square root of the number of times term t appears in the query string q
tf_d : the square root of the number of times term t appears in document d
numDocs : the total number of documents in the index that could score greater than 0
docFreq_t : the total number of documents containing term t
idf_t : log(numDocs/(docFreq_t+1)) + 1.0
norm_q : sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t : the square root of the number of terms in document d in the same field as term t
boost_t : the boost factor of term t, typically 1.0
coord_q_d : the number of terms of query q that hit in document d, divided by the total number of terms in q

minMergeDocs (default 10): the number of documents cached in memory; once exceeded, they are written to disk.
maxFieldLength (default 1000): the maximum number of terms in a Field; terms beyond this limit are ignored and not indexed into the field, so naturally the truncated content cannot be searched.

The parameter whose description is more complicated is mergeFactor, which has a dual role:
1. A segment is written for every mergeFactor documents; for example, with mergeFactor = 10, a segment is written every 10 documents.
2. Every mergeFactor small segments are merged into one large segment. For example, 10 segments of 10 documents each are merged into one 100-document segment, 10 such segments are later merged into a 1000-document segment, and so on, so the number of documents per segment grows as powers of mergeFactor.

In short, the larger the mergeFactor, the more memory and the less disk processing the system uses.
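The dual role of mergeFactor can be simulated with a toy model of this merge policy (hypothetical code, not Lucene's actual merge implementation): every mergeFactor documents a new segment is flushed, and whenever mergeFactor equal-sized segments have accumulated they merge into one larger segment:

```java
import java.util.*;

public class MergeSim {
    // Each list entry is a segment, represented by its document count.
    static List<Integer> addDocs(int totalDocs, int mergeFactor) {
        List<Integer> segments = new ArrayList<>();
        for (int written = 0; written < totalDocs; written += mergeFactor) {
            segments.add(mergeFactor); // flush a new small segment
            // Cascade: merge whenever the last mergeFactor segments are the same size.
            boolean merged = true;
            while (merged) {
                merged = false;
                int n = segments.size();
                if (n >= mergeFactor) {
                    int size = segments.get(n - 1);
                    boolean sameSize = true;
                    for (int i = n - mergeFactor; i < n; i++)
                        if (segments.get(i) != size) { sameSize = false; break; }
                    if (sameSize) {
                        for (int i = 0; i < mergeFactor; i++)
                            segments.remove(segments.size() - 1);
                        segments.add(size * mergeFactor);
                        merged = true;
                    }
                }
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // With mergeFactor = 10: 1000 documents cascade into one 1000-doc segment,
        // while 110 documents leave [100, 10].
        System.out.println(addDocs(1000, 10));
        System.out.println(addDocs(110, 10));
    }
}
```

The cascade in the while loop is what makes segment sizes grow as powers of mergeFactor: ten 10-doc segments become one 100-doc segment, ten of those become one 1000-doc segment.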
If you want to build indexes in batches, setting a larger mergeFactor is the right choice; with a smaller mergeFactor, the number of index files increases and search efficiency decreases. However, increasing mergeFactor a little increases memory consumption a lot (the relationship is exponential), so be careful not to run out of memory. Setting maxMergeDocs small forces a segment to be written once that number of documents is reached, which can offset some of the effect of mergeFactor. minMergeDocs amounts to a small cache: the first that-many documents stay in memory and are not written to disk. None of these parameters has a universally optimal value; they must be tuned bit by bit against the actual situation.

maxFieldLength can be set at any time. Once set, the fields of subsequently indexed documents are truncated to the new length; content already indexed does not change. It can be set to Integer.MAX_VALUE.

3.3.6. Combining RAMDirectory and FSDirectory

RAMDirectory (RAMD) is much more efficient than FSDirectory (FSD), so we can manually use a RAMD as a buffer in front of an FSD. That way there is no need to tune so many FSD parameters so painstakingly: build the index in RAM first, then write it back to the FSD periodically (or according to some other policy). A RAMD can thus serve as an FSD's buffer.

3.3.7. Optimizing the index for queries

The IndexWriter.optimize() method optimizes the index for querying. The parameter tuning above optimizes the indexing process itself; this optimizes queries. The optimization mainly reduces the number of index files, so that fewer files have to be opened at query time. During optimization, Lucene copies and merges the old index, and deletes the old index once the merge completes, so during this period disk usage and I/O load both increase.
At the moment optimization completes, disk usage is twice what it was before optimization; searching can continue in parallel while optimization runs.

3.3.8. Concurrent operation of Lucene and the locking mechanism

- All read-only operations can run concurrently.
- Read-only operations can run concurrently even while the index is being modified.
- Index modifications cannot run concurrently; an index can be occupied by only one modifying thread at a time.
- Index optimization, merging and adding are all modification operations.
- Instances of IndexWriter and IndexReader can be shared by multiple threads; they are synchronized internally, so no external synchronization is required.

3.3.9. Locking

Lucene uses files for locking internally. By default the lock files are placed in java.io.tmpdir; a different directory can be specified with -Dorg.apache.lucene.lockDir=xxx. There are two lock files, write.lock and commit.lock. The lock files prevent parallel modification of the index; if the index is operated on in parallel, Lucene throws an exception. Locking can be disabled by setting -DdisableLuceneLocks=true, but this is generally dangerous unless you have an OS-level or physical read-only guarantee, such as burning the index onto a CD-ROM.
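Lucene manages write.lock and commit.lock internally, but the underlying idea, using a file as a mutual-exclusion token between processes, can be sketched with java.nio file locks. The file name and helper method below are made up for this illustration and are not Lucene's LockFactory code:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class IndexLockDemo {
    // Try to take an exclusive lock on a lock file,
    // as a writer would before modifying an index.
    static boolean tryLock(File lockFile) throws IOException {
        FileChannel channel = new RandomAccessFile(lockFile, "rw").getChannel();
        FileLock lock = channel.tryLock(); // null if another process already holds it
        if (lock == null) {
            channel.close();
            return false;
        }
        lock.release();
        channel.close();
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Place the lock file in java.io.tmpdir, mirroring Lucene's default location.
        File lockFile = new File(System.getProperty("java.io.tmpdir"), "demo-write.lock");
        System.out.println(tryLock(lockFile)); // true when no other process holds the lock
        lockFile.delete();
    }
}
```

A second process calling tryLock on the same file while the first holds the lock would get false, which is the behavior the write.lock file relies on to serialize index modifications.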
