Solr performance tuning

  Author: Fighting nation is doing 

  Please credit the original address when reprinting: http://www.cnblogs.com/prayers/p/8982141.html

 

  In this article we look at Solr performance tuning, which is divided into schema optimization, index update and commit tuning, index merge tuning, Solr caches, and query performance optimization.

Schema optimization

  1. Compared with indexed=false, indexed=true uses more memory during indexing, makes index merging and optimization take longer, and increases the index size. If you never need to search on a field, set indexed=false.

  2. If you do not care how the number of occurrences of a Term in a document affects the final score, set omitNorms=true. This drops the norms (length normalization and index-time boosts) and their effect on scoring, reduces disk space usage, and speeds up indexing.

  3. If you do not need phrase queries or highlighting on a field, you can also set omitPositions=true to further reduce the index size.

  4. If you only need the inverted index to find the documents that contain a given Term, and do not need term frequency to contribute to each document's weight, set omitTermFreqAndPositions=true. This skips the TF calculation and drops Term position information from the index, further reducing the index size.

  5. For the stored attribute: returning stored=true fields via the fl parameter is expensive, because the field values must be written to disk at index time and read back from disk at query time. If a field never needs to be returned in results, set stored=false to further reduce the index size.

  6. To reduce the disk IO caused by reading stored field values, you can set compressed=true to enable compression of the stored values. Enabling compression reduces disk IO but increases CPU overhead.

  7. If you do not always need all the stored fields, enable lazy field loading, especially when field-value compression is turned on. With lazy loading enabled, the fields requested in the response are loaded immediately (via SetNonLazyFieldSelector) and all other fields are loaded lazily. To enable lazy loading, configure the following in solrconfig.xml:

<enableLazyFieldLoading>true</enableLazyFieldLoading>

  8. If a field value is very large, you can use the ExternalFileField type (the values are kept in an external file); such a field cannot be queried in Solr and can only be used for display and for function queries. Alternatively, store the value in an external system such as Redis and fetch it from that cache by the document's Solr uniqueKey when needed.
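
  As a rough sketch, an ExternalFileField is declared in the schema along these lines (field and type names are invented here, and the valType may need adjusting for your Solr version); Solr reads the values from a file named external_<fieldname> in the index data directory:

<fieldType name="rankFile" keyField="id" defVal="0" class="solr.ExternalFileField" valType="pfloat" indexed="false" stored="false"/>
<field name="rank" type="rankFile"/>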

  9. For Java date/time data it is recommended to use Solr's date field types rather than a string field, especially when you need to query by date or time range.

  10. Set docValues=true on facet and sort fields; this builds an additional forward (column-oriented) structure that improves the efficiency of faceting and sorting.
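
  As an illustration, a hypothetical schema.xml fragment combining several of the attributes above (field and type names are invented for this example, and exact type names depend on your schema and Solr version):

<!-- searchable body text: not returned in results, no norms, no term frequencies/positions -->
<field name="body_txt" type="text_general" indexed="true" stored="false" omitNorms="true" omitTermFreqAndPositions="true"/>
<!-- facet/sort field: docValues enabled -->
<field name="category_s" type="string" indexed="true" stored="true" docValues="true"/>
<!-- date-range queries: a real date type instead of a string -->
<field name="publish_date" type="date" indexed="true" stored="true"/>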

  

Index update and commit tuning

  1. Explicit hard commits from the client are not recommended; instead, configure automatic soft and hard commits in solrconfig.xml, as shown below.
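
  A minimal sketch of such a configuration inside <updateHandler> in solrconfig.xml (the time values are illustrative, not recommendations):

<autoCommit>
  <maxTime>60000</maxTime>          <!-- hard commit at most once per 60 s -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>           <!-- soft commit every 5 s for near-real-time visibility -->
</autoSoftCommit>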

  2. When the client submits index documents, it is recommended to add them in batches and rely on soft commits, rather than committing each document individually.

  3. In standalone mode it is recommended to use the ConcurrentUpdateSolrClient class for indexing; in SolrCloud mode, use the CloudSolrClient class to update and commit the index.

  4. By default Solr indexes every field value of a document, and while building the index the document is buffered in memory. For large documents with very large field values this drives memory usage up and can trigger frequent GC, and GC pauses can stall index construction. Configure LimitTokenCountFilterFactory on the field type used by large text fields to limit how much of the text is actually indexed, which reduces memory usage during indexing, as shown below.
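
  A sketch of such a field type in schema.xml (the name and the maxTokenCount value are illustrative):

<fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- index at most the first 10000 tokens of each field value -->
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>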

  5. When text needs to be tokenized at index time, it is recommended to configure stop words to remove useless noise words. This reduces the index size and also prevents the noise words from affecting the final search results, as shown below.
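
  Stop-word removal is added to the analyzer chain of the field type, for example (stopwords.txt is the conventional file name shipped with Solr's example configs):

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>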

  6. Disable compound files: although enabling the compound file format reduces the number of files per segment, it can increase index creation time by 7% to 33%. The specific configuration is as follows:

<useCompoundFile>false</useCompoundFile>
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <float name="noCFSRatio">0.0</float>
</mergePolicy>

  7. If indexing is still slow after the optimizations above, it is recommended to build the Solr index in parallel with the MapReduce framework, using the resources of multiple machines to speed up index construction.

 

Index merge performance tuning  

  1. Reduce the frequency of index merging: merging segments speeds up Solr queries, but it is an expensive operation, so merge as infrequently as possible while still meeting your query-performance requirements.

  2. Increase ramBufferSizeMB and maxBufferedDocs, and commit explicitly as rarely as possible: besides explicit commits from the user, Solr also flushes the in-memory buffer to a new segment once the ramBufferSizeMB or maxBufferedDocs threshold is reached. Raising these two parameters and reducing explicit commits therefore lowers the frequency of index merging.

  3. Increase the mergeFactor parameter: a larger mergeFactor does speed up index creation and reduces how often segments are merged, but it also leaves more segments to search and therefore slows down Solr query responses (see the snippet below).
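
  A hypothetical <indexConfig> fragment in solrconfig.xml combining the settings from items 2 and 3 (the values are illustrative and should be tuned against your own hardware and workload):

<indexConfig>
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <maxBufferedDocs>100000</maxBufferedDocs>
  <mergeFactor>20</mergeFactor>
</indexConfig>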

 

Solr cache

  Caches in Solr are managed by SolrIndexSearcher instances; each SolrIndexSearcher owns its own set of caches. When a new SolrIndexSearcher is opened, all caches of the previous searcher become invalid. If you have a large amount of data, frequent incremental updates, and a heavy reliance on caching, you need to pre-populate the caches of each new SolrIndexSearcher; this is called cache warming.

  Solr provides four cache types by default:

 

  1. filterCache

    It caches the unordered set of document IDs matched by a filter query (fq), read from the index, so that the next time the same filter query is executed the cache is hit directly. By default Solr maintains a filterCache entry for each filter query.

  Application scenarios:

  1) Caching the complete result set of every filter query; Solr intersects the result set of the main q query with the cached unordered document-ID set of the filter.

  2) When facet.method=enum, faceting also hits the filter cache.

  3) If <useFilterForSortedQuery>true</useFilterForSortedQuery> is configured in solrconfig.xml, the filter cache is also used for Solr sorting operations.

  4) The filter cache is also used by other Solr features, such as facet.query and group.query.

  Scenarios where it is not suitable:

  Price-range and time-range queries: there are far too many possible price ranges across all categories, and timestamps are precise to the second. Caching a filterCache entry for every such range filter query would need a lot of memory, and because the ranges are so varied the hit rate would also drop sharply. In this case you can disable the filter cache for an individual filter query with the {!cache=false} local parameter, for example fq={!cache=false}price:[100 TO 200].
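
  For reference, the filterCache itself is declared in the <query> section of solrconfig.xml; a sketch with illustrative sizes:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>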

  2. documentCache

  documentCache (document cache): stores the Lucene document objects that have already been read from disk. The maximum number of entries should be greater than (the maximum possible result-set size) * (the maximum number of concurrent queries), so that Solr does not have to re-read index documents from disk during a request. Note, however, that the memory used by the documentCache grows as the number of cached documents grows.

  When the document cache is enabled together with lazy field loading, the document objects fetched by the IndexReader contain only the fields specified by the fl parameter, and the other fields are loaded lazily, which reduces the memory footprint of the document cache. If a lazily loaded field is requested later, the IndexReader loads that field from disk at that moment.

  Also note that the document cache cannot be warmed: when a new SolrIndexSearcher is opened, this cache cannot be pre-populated, because the document cache is keyed by Lucene's internal document IDs, which change whenever the index data changes.
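
  A sketch of the documentCache declaration in solrconfig.xml; since it cannot be warmed, autowarmCount stays at 0 (the sizes are illustrative):

<documentCache class="solr.LRUCache" size="1024" initialSize="1024" autowarmCount="0"/>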

 

  3. queryResultCache

  queryResultCache (query result cache): caches the ordered document IDs of the top N results of a query, sorted by the query's sort criteria. Its memory footprint is much smaller than the filterCache's, and an entry is only hit when the q, fq, and sort parameters all match.
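
  A sketch of the queryResultCache declaration in solrconfig.xml (the sizes are illustrative):

<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>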

 

  4. fieldValueCache

  fieldValueCache (field value cache): similar to Lucene's FieldCache, except that fieldValueCache supports multiple values per document (the multiple values of a multi-valued field, or the multiple Terms produced by tokenizing a single-valued field). This cache is mostly used for facet queries. The cache key is the field name and the value is a mapping from document ID to that document's values. If no <fieldValueCache> is defined in solrconfig.xml, Solr automatically creates one with an initial size of 10, a maximum size of 10,000, and no auto-warming.
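
  If you want to control it explicitly rather than rely on the automatically generated cache, a sketch of the declaration in solrconfig.xml (the sizes are illustrative):

<fieldValueCache class="solr.FastLRUCache" size="512" autowarmCount="128" showItems="32"/>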

  

  HTTP cache: besides the Solr caches at the back-end service layer, you can also enable HTTP caching at the front-end HTTP protocol layer. Responses for resources that have not changed can be served directly from the HTTP cache, so identical requests do not repeatedly hit the Solr server, which reduces its load to some extent. To enable HTTP caching, configure one of the following:

<httpCaching never304="false">
  <cacheControl>max-age=30, public</cacheControl>
</httpCaching>

or

<httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
  <cacheControl>max-age=30, public</cacheControl>
</httpCaching>

  Setting never304=false enables Solr's HTTP cache; the default, never304=true, disables it. Solr's HTTP cache supports only GET and HEAD requests, not POST, and it understands both HTTP/1.0 and HTTP/1.1 cache headers.

  You can also configure firstSearcher and newSearcher event listeners in solrconfig.xml to trigger cache warming automatically.

  The newSearcher listener fires when a new IndexSearcher instance is created. Besides the part of the cache that is warmed automatically from the old IndexSearcher, you can explicitly specify queries to warm the cache with. If a query is known to be slow, registering it with the newSearcher listener warms it in advance, so later executions of that slow query hit the cache directly.

  The firstSearcher listener fires when a new IndexSearcher is being initialized and there is no old IndexSearcher from which to warm the cache, so you must explicitly specify the warming queries. It is mainly used to define which queries Solr executes when it first starts, putting their results into the cache: right after startup the caches are necessarily empty, so configuring firstSearcher warming keeps query performance acceptable during that initial period.
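
  A sketch of both listeners in solrconfig.xml; the warming queries and the price field are hypothetical and should be replaced with your own slow or common queries:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- warm a known slow query so later executions hit the cache -->
    <lst><str name="q">some slow query</str><str name="sort">price asc</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- populate the caches right after startup -->
    <lst><str name="q">*:*</str><str name="sort">price asc</str></lst>
  </arr>
</listener>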

  When using the queryResultCache, you can also add the <queryResultWindowSize> setting to optimize it. When a query is executed, a window of matching document IDs around the requested range is collected and cached. For example, if the query requests documents [10, 19) and queryResultWindowSize=50, the document IDs in [0, 50) are collected and cached, so a subsequent request for any page inside that window hits the cache.
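
  The window size is set in the <query> section of solrconfig.xml, e.g.:

<queryResultWindowSize>50</queryResultWindowSize>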

 

Solr query performance optimization

  1. If a query needs to search across three fields, you can use copyField to merge the three fields into one field and query the merged field instead, because querying a single field is more efficient than querying N fields. The trade-off is that after using copyField you can no longer boost each individual source field (see the snippet below).
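
  A sketch of such a merged search field in schema.xml (the field names are invented for this example):

<field name="all_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="all_text"/>
<copyField source="brand" dest="all_text"/>
<copyField source="description" dest="all_text"/>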

  2. Execute first the filter queries (fq) that filter out the largest share of the indexed documents.

  3. When performing range queries on numeric fields, you can tune precisionStep to optimize the range query. The default precisionStep is 4; the smaller the value (while still greater than 0), the more precision prefixes are indexed for each value, which makes numeric range queries faster but increases the index size, as shown below.
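
  On Solr versions that use Trie numeric fields, precisionStep is set on the field type, for example (the value 8 is illustrative):

<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>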

  There are many more optimization points for queries; they need to be analyzed and applied differently for different scenarios, and most of them you will come across while working with Solr, so they are not repeated here.

  

  
