Some Solr performance adjustments made during a stress test

 
After we changed Solr keyword (fuzzy) search from a copyField-based approach to qf (query fields, used by the DisMax/eDisMax query parser), query performance dropped sharply. The original approach copied every searchable field into a single combined field. Recently we needed to weight the fields that contribute to a fuzzy match, which the copyField approach cannot express at all, so we switched to qf, which lets each field carry its own boost. For example, our qf can be defined as follows:
 
 
product_name^2.0 category_name^1.5 category_name1^1.5
  
 
Search results are then ordered by relevance score, so matches in more heavily boosted fields rank higher.
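With the eDisMax parser, a request using these weights would look roughly like the following (a sketch: the keyword is illustrative, and defType/qf/fl are standard Solr parameters; in practice qf can also be set in the request handler defaults):

/select?defType=edismax&q=some+keyword&qf=product_name^2.0+category_name^1.5+category_name1^1.5&fl=product_id,score&rows=10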
 
However, during the stress test this configuration consumed a great deal of memory. On the Solr query service we enabled the search log, solr.log; its entries look like this:
 
2016-10-19 13:31:26.955 INFO  (qtp1455021010-596) [c:product s:shard1 r:core_node1 x:product] o.a.s.c.S.Request [product] webapp=/solr path=/select params={sort=salesVolume+desc&fl=product_id,salesVolume&start=0&q=brand_id:403+OR+category_id:141&wt=javabin&version=2&rows=100} hits=3098 status=0 QTime=3
2016-10-19 13:31:28.618 INFO  (qtp1455021010-594) [c:product s:shard1 r:core_node1 x:product] o.a.s.c.S.Request [product] webapp=/solr path=/select params={mm=100&facet=true&sort=psfixstock+DESC,salesVolume+DESC&facet.mincount=1&facet.limit=-1&wt=javabin&version=2&rows=10&fl=product_id&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+1000]&facet.query=price:[1000+TO+2000]&facet.query=price:[2000+TO+5000]&facet.query=price:[5000+TO+10000]&facet.query=price:[10000+TO+*]&start=0&q=*:*+AND+(category_id:243+OR+category_path:243)+AND+-category_path:309+AND+brand_id:401+AND+good_stop:0+AND+product_stop:0+AND+is_check:1+AND+status:1&facet.field=category_id&facet.field=brand_id&facet.field=color_id&facet.field=gender&facet.field=ctype&qt=/select&fq=price:[1000+TO+*]&fq=psfixstock:[1+TO+*]} hits=17 status=0 QTime=5
2016-10-19 13:31:30.867 INFO  (qtp1455021010-614) [c:product s:shard1 r:core_node1 x:product] o.a.s.c.S.Request [product] webapp=/solr path=/select params={mm=100&facet=true&sort=price+ASC,goods_id+DESC&facet.mincount=1&facet.limit=-1&wt=javabin&version=2&rows=10&fl=product_id&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+1000]&facet.query=price:[1000+TO+2000]&facet.query=price:[2000+TO+5000]&facet.query=price:[5000+TO+10000]&facet.query=price:[10000+TO+*]&start=0&q=*:*+AND+(category_id:10+OR+category_path:10)+AND+-category_path:309+AND+color_id:10+AND+gender:(0)+AND+good_stop:0+AND+product_stop:0+AND+is_check:1+AND+status:1&facet.field=category_id&facet.field=brand_id&facet.field=color_id&facet.field=gender&facet.field=ctype&qt=/select&fq=price:[5000+TO+*]&fq=psfixstock:[1+TO+*]} hits=9 status=0 QTime=7
2016-10-19 13:31:32.877 INFO  (qtp1455021010-594) [c:product s:shard1 r:core_node1 x:product] o.a.s.c.S.Request [product] webapp=/solr path=/select params={mm=100&facet=true&sort=psfixstock+DESC,salesVolume+DESC&facet.mincount=1&facet.limit=-1&wt=javabin&version=2&rows=10&fl=product_id&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+1000]&facet.query=price:[1000+TO+2000]&facet.query=price:[2000+TO+5000]&facet.query=price:[5000+TO+10000]&facet.query=price:[10000+TO+*]&start=0&q=*:*+AND+(category_id:60+OR+category_path:60)+AND+-category_path:309+AND+brand_id:61+AND+gender:(0)+AND+good_stop:0+AND+product_stop:0+AND+is_check:1+AND+status:1&facet.field=category_id&facet.field=brand_id&facet.field=color_id&facet.field=gender&facet.field=ctype&qt=/select&fq=price:[*+TO+*]&fq=psfixstock:[1+TO+*]} hits=5 status=0 QTime=8
2016-10-19 13:31:42.896 INFO  (qtp1455021010-89) [c:product s:shard1 r:core_node1 x:product] o.a.s.c.S.Request [product] webapp=/solr path=/select params={mm=100&facet=true&sort=psfixstock+DESC,salesVolume+DESC&facet.mincount=1&facet.limit=-1&wt=javabin&version=2&rows=10&fl=product_id&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+1000]&facet.query=price:[1000+TO+2000]&facet.query=price:[2000+TO+5000]&facet.query=price:[5000+TO+10000]&facet.query=price:[10000+TO+*]&start=0&q=*:*+AND+-category_path:309+AND+brand_id:323+AND+color_id:3+AND+good_stop:0+AND+product_stop:0+AND+is_check:1+AND+status:1&facet.field=category_id&facet.field=brand_id&facet.field=color_id&facet.field=gender&facet.field=ctype&qt=/select&fq=price:[*+TO+*]&fq=psfixstock:[1+TO+*]} hits=3 status=0 QTime=4
 
 
To get a reasonable picture of query QTime, we wrote a Python script that extracts each Solr query's QTime and the corresponding query conditions from the live log:
 
 
import sys

if __name__ == '__main__':
    # Usage: python <script> <solr.log> [min_qtime_ms]
    input_file = sys.argv[1]

    # Only report queries slower than this threshold (milliseconds)
    min_time = 0 if len(sys.argv) < 3 else int(sys.argv[2])

    try:
        with open(input_file) as file:
            while True:
                line = file.readline().strip()
                if not line:
                    break
                splits = line.split(" ")
                # Request lines end with "QTime=<ms>"; skip everything else
                if not splits[-1].startswith("QTime"):
                    continue
                q_time = int(splits[-1].replace('QTime=', ''))
                if q_time <= min_time:
                    continue

                date = splits[0]
                time = splits[1]
                # The params={...} token carries the request's query string
                params = splits[13].split("&")

                param_dict = {}
                for param in params:
                    key_value = param.split("=")
                    param_dict[key_value[0]] = key_value[1]

                query = param_dict.get('q', None)
                if query:
                    print "%s - %s , QTime=% 5d, Query = %s" % (date, time, q_time, query)
    except IOError as error:
        print error
 
 
 
The script parses solr.log and keeps only the requests whose QTime exceeds a given threshold (in milliseconds). The analysis showed that queries with a QTime above 1000 ms made up a large proportion of the traffic, which means the qf-based configuration performs poorly, at least compared with the copyField approach.
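For example, to list every request slower than one second (the script and log file names here are illustrative):

python analyze_qtime.py solr.log 1000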
 
Monitoring with jvisualvm showed that Full GC was happening very frequently, CPU usage was high, and request response times jittered; the jitter appeared exactly when the server's CPU usage dropped.
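The GC log files analyzed later (solr_gc_log_*) come from GC logging options along these lines (a sketch for JDK 7/8; the log path is illustrative):

-verbose:gc \
-Xloggc:/path/to/solr_gc.log \
-XX:+PrintGCDetails \
-XX:+PrintGCDateStamps \
-XX:+PrintGCApplicationStoppedTime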
 
We took a heap dump of the live process; analyzing the hprof files in JProfiler shows that memory is dominated by Solr- and Lucene-related classes:
 



 
 
 
We also noticed that the old generation holds roughly 400 MB of resident memory, and a Full GC is usually triggered once it grows to about 500 MB, which happens very frequently.
 
We tried adjusting the relative sizes of the JVM's memory regions, but the effect was minimal: the ratio of minor GCs to Full GCs stayed close to 10:1, and that many Full GCs slows down request handling for the whole application:
 
 

 
 
Our preliminary analysis was that the survivor spaces were too small to hold larger objects, so objects that survived a minor collection were promoted straight into the old generation, which in turn caused the frequent GCs. However, after giving the survivor spaces a larger share of the young generation, the problem was still not solved.
 
Solr was running with the CMS collector at the time. According to material on this collector found online, "promotion failed" and "concurrent mode failure" events can occur, and checking that day's stress-test GC logs confirmed there were indeed many of them:
 
> grep "concurrent mode failure" solr_gc_log_20161018_* | wc -l
4919
> grep "promotion failed" solr_gc_log_20161018_* | wc -l
127
 
 
The recommended practice online is:
 
http://blog.csdn.net/chenleixing/article/details/46706039 wrote
Promotion failed happens when, during a minor GC, the survivor space cannot hold the surviving objects, so they have to be moved into the old generation, but the old generation has no room for them either. Concurrent mode failure happens when objects need to be promoted into the old generation while a CMS collection is still in progress and the old generation does not have enough space at that moment (sometimes the "insufficient space" is only temporary, caused by too much floating garbage accumulating during the CMS cycle, and it triggers a Full GC).
The countermeasures are to enlarge the survivor spaces or the old generation, or to lower the occupancy threshold at which a concurrent collection is triggered. Note that in JDK 5.0+ and 6.0+, a JDK bug may cause CMS to start sweeping long after the remark phase has finished; this can be avoided by setting -XX:CMSMaxAbortablePrecleanTime=5 (in milliseconds).
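The kind of CMS tuning this advice describes would look roughly like the following (a sketch of standard HotSpot options, not the settings we ultimately used):

-XX:+UseConcMarkSweepGC \
-XX:CMSInitiatingOccupancyFraction=70 \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSMaxAbortablePrecleanTime=5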
 
 
 
However, after weighing everything, we could not get CMS tuned to behave properly, so we decided to raise the JVM heap to 3.5 GB and replace the latency-oriented CMS collector with a throughput-first (parallel) collector, in order to reduce both the number of Full GCs and the total GC time:
 
-Xmn2048m \
-XX:-UseAdaptiveSizePolicy \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseParallelGC -XX:+UseParallelOldGC
  
 
After this adjustment the total heap is 3.5 GB: the old generation gets 1.5 GB and the young generation 2 GB, of which Eden is about 1.3 GB and each of the two survivor spaces is roughly 340 MB.
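These sizes follow from the flags above (assuming the maximum heap, which is not shown in the snippet, is set to 3.5 GB):

young generation = -Xmn              = 2048 MB
old generation   = 3584 MB - 2048 MB = 1536 MB  (~1.5 GB)
Eden : S0 : S1   = 4 : 1 : 1         (SurvivorRatio=4)
Eden             = 2048 MB * 4/6     ≈ 1365 MB  (~1.3 GB)
each survivor    = 2048 MB / 6       ≈ 341 MB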
 
However, the larger heap introduces a different problem: with a bigger old generation, a single Full GC takes much longer. The first Full GC actually took more than 5 seconds, simply because of how much space has to be reclaimed in one pass:
 


 
 
I had originally assumed the survivor spaces were simply not large enough, but after enlarging them it turned out that this was not the real issue. Still, after running in this configuration for a while, the system was fairly stable overall, with GC frequency and GC time both under control. The drawback is that every Full GC of the parallel collector is a full stop-the-world (STW) pause, which can be long; with CMS, most of the collection runs concurrently, so this particular problem does not appear.
 
The SurvivorRatio=2 setting was not very reasonable, and it was later adjusted back to 4-5.
 


  
 
After the Solr portion of the test, the basic conclusion matched expectations: under the same conditions, queries against analyzed fields (those that go through tokenization / word segmentation) have a much longer response time than queries against non-analyzed fields, especially under stress-test load (in the figure, the last row is a fuzzy match and the other two are exact matches).
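To make "analyzed vs. non-analyzed" concrete, a minimal schema sketch might look like this (the field names come from the queries above, but the exact types and analyzer chain here are illustrative):

<!-- non-analyzed field: exact matches only, cheap to search -->
<field name="brand_id" type="string" indexed="true" stored="false"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- analyzed field: tokenized at index and query time, used for fuzzy matching -->
<field name="product_name" type="text_general" indexed="true" stored="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>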
 


  
 
Some other problems caused by SolrCloud
 
After comparing the SolrCloud setup in the production and test environments, we found that spreading the load across multiple servers online actually performs worse than a single server; rather than helping, it degrades performance. The following is our current core deployment structure, split into two shards:
 


 
 
In Solr's debugQuery mode you can see that the final QTime is slightly larger than the sum of the per-shard QTimes; the extra time is presumably spent merging the results. With a large data set, multiple shards do reduce the load on any single server.
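A per-shard breakdown can be requested directly (a sketch; the host is illustrative, while debugQuery and shards.info are standard Solr parameters that report per-shard timing):

http://solr-host:8983/solr/product/select?q=*:*&rows=10&debugQuery=true&shards.info=true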
 
If instead the collection is built as a single shard, registered with SolrCloud through ZooKeeper, and replicas are added, the pressure falls on a single server. Although this is faster than sharding (for a small data set), it wastes the resources of the other servers.
 


 
 
We decided on a compromise: use SolrCloud to keep the data on the 4 Solr servers consistent (our data currently changes infrequently), while each application server connects to one chosen Solr server as if it were a standalone instance. This does sacrifice some high availability, but as a temporary measure during e-commerce promotions it is acceptable. We must, however, make sure that every server carries complete data for the other cores as well, otherwise some nodes will return errors saying the corresponding core does not exist.
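Under this scheme the collection itself could be created as a single shard replicated to all four nodes, roughly like this (a sketch using Solr's Collections API; the host and configset name are illustrative):

http://solr-host:8983/solr/admin/collections?action=CREATE&name=product&numShards=1&replicationFactor=4&collection.configName=product_conf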
 
 
 
