[Search Engine] Improving Apache Solr Performance

This is a short story about how we managed to overcome stability and performance issues with our search and relevance stack.

Context


I've had the pleasure of working with the Personalization and Relevance team over the past 10 months. We are responsible for delivering personalized and relevant content to users based on rankings and machine learning, through a set of microservices that expose three common endpoints: the Home Feed, Search, and Related Items APIs. A few months after I joined the team, the next challenge was to provide the same quality of service to our larger key countries, while maintaining the performance and stability we already had in the smaller ones.


We run SolrCloud (v7.7) on AWS, deployed on OpenShift and coordinated by ZooKeeper. As of this writing, we are proud to say that the API serves about 150,000 requests per minute and sends about 210,000 updates per hour to Solr in our largest region.


Baseline


After deploying Solr in our largest market, we had to test it. We used internal stress-testing tools that can roughly reproduce the traffic we expect. Believing Solr was well configured, the team focused on improving client performance and raising the timeouts against Solr. In the end we agreed that we could handle the traffic, with a little slack.


After migration


The service responded with acceptable response times and the Solr client behaved very well, until it started opening circuit breakers due to timeouts. The timeouts were caused by seemingly random issues where Solr replicas took too long to respond, and they increasingly affected front-end clients, which ended up displaying no content. Here are some of the issues we encountered:

  • A high percentage of replicas go into recovery and take a long time to recover

  • Replicas report errors because they cannot reach their leaders, which are too busy

  • Leaders are under too much load (from indexing, queries, and replica synchronization), which prevents them from functioning properly and brings whole shards down

  • Suspicions about the "Index/Update Service", since reducing its traffic to Solr keeps replicas from going down or entering recovery mode

  • Full garbage collections run frequently (old generation and young generation)

  • The searcherExecutor thread pegging the CPU, together with the garbage collector

  • The searcherExecutor thread throwing exceptions while warming caches (LRUCache.warm)

  • Response time increased from ~30 ms to ~1500 ms

  • 100% IOPS utilization on some Solr EBS volumes


Solving the issues


Analysis


As part of our analysis, we came up with the following themes:


Lucene settings


Apache Solr is a widely used search and ranking engine, thoughtfully designed with Lucene behind the scenes (which also powers Elasticsearch). Lucene is the engine behind all the calculations and does the magic for ranking and faceting. Can we reason about the math in Lucene and check its settings? I can share an approximate answer based on extensive documentation and forum reading, but Lucene is not as configuration-heavy as Solr.
Tweaking Lucene is possible if you are willing to sacrifice the structure of your documents. Is it really worth the effort? No, and you will see why as you read further.


Document and disk size


Let's say we have about 10 million documents and assume an average document size of 2 KB. Initially, your disk space will be at least the following:

[Figure: estimated minimum disk size]
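The figure is not reproduced here, but as a rough back-of-the-envelope version of the same estimate (my own sketch, assuming the on-disk index is at least as large as the raw documents; real index size varies with stored fields, analyzers, and merge state):

```latex
\text{disk}_{\min} \;\approx\; N_{\text{docs}} \times \overline{s}_{\text{doc}}
                   \;=\; 10{,}000{,}000 \times 2\,\text{KB}
                   \;\approx\; 20\,\text{GB per full copy of the index}
```

Replication multiplies this figure by the number of copies, and segment merges need temporary headroom on top of it.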

Sharding


Having multiple shards for a collection does not necessarily result in a more resilient Solr. When one shard has a problem, even if the other shards can still respond, the overall response time is bound by the slowest shard.


When we have multiple shards, the total number of documents is divided across them. This reduces per-node cache and disk requirements and speeds up the indexing process.


Indexing/updating process


Could our indexing/updating process be overkill? Based on our experience it is not excessive, and I will leave the analysis of that question for another article; otherwise this one would become too broad. In our main market we reach 210,000 updates per hour (peak traffic).

Zookeeper


Apache ZooKeeper's only job in this environment is to keep the cluster state available to all nodes as accurately as possible. When replicas go into recovery too often, a common problem is that the cluster state stored in ZooKeeper gets out of sync with reality. This creates an inconsistent view between running replicas, and replicas trying to recover can end up in a loop that lasts for hours. ZooKeeper itself is very stable; it usually only fails because of network resources, or rather the lack of them.
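One practical way to see whether the cluster state in ZooKeeper matches reality is to poll the Collections API CLUSTERSTATUS action and look for replicas stuck in "recovering" or "down". A minimal sketch, assuming a local SolrCloud node and a placeholder collection name:

```python
import requests

SOLR = "http://localhost:8983/solr"   # placeholder base URL
COLLECTION = "items"                  # placeholder collection name

# CLUSTERSTATUS returns the cluster state as published through ZooKeeper.
resp = requests.get(
    f"{SOLR}/admin/collections",
    params={"action": "CLUSTERSTATUS", "collection": COLLECTION, "wt": "json"},
    timeout=10,
)
resp.raise_for_status()
collection = resp.json()["cluster"]["collections"][COLLECTION]

# Report every replica that is not active; these are the ones looping in recovery.
for shard_name, shard in collection["shards"].items():
    for replica_name, replica in shard["replicas"].items():
        if replica.get("state") != "active":
            print(f"{shard_name}/{replica_name}: {replica.get('state')} "
                  f"on {replica.get('node_name')}")
```

Comparing this view with what the nodes themselves report is a quick way to spot the kind of ZooKeeper/cluster-state mismatch described above.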


Do we have enough memory?


Theory


One of the most important drivers of Solr performance is RAM. Solr requires sufficient memory for the Java heap and free memory for the OS disk cache.


It is strongly recommended to run Solr on a 64-bit JVM, since a 32-bit JVM is limited to a 2 GB heap, an artificial limit that rules out the larger heaps discussed later in this article.


Let's take a quick look at how Solr uses memory. Solr uses two types of memory: heap memory and direct memory. Direct memory is used to cache blocks read from the file system (similar to the Linux file system cache), mainly index data, in order to improve performance.

[Figure: how Solr uses heap and direct memory]

As shown above, most of the heap memory is used by the various caches.


The JVM heap size needs to match the estimated Solr heap requirement, plus some extra for buffering. The difference between the heap size and the total OS memory gives the environment headroom to absorb sporadic memory spikes, such as background merges or expensive queries, and lets the JVM perform GC efficiently. For example, configure an 18 GB heap on a machine with 28 GB of RAM.


Keeping in mind the sizing equation we have been refining for Solr, the areas most relevant to memory tuning are the following:

[Figure: memory sizing formula]
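The original formula image is not reproduced here, so as a rough sketch of the kind of estimate involved (my own approximation, not the team's exact calculator), the usual way to reason about per-node memory is:

```latex
\text{RAM}_{\text{node}} \;\gtrsim\;
    \underbrace{\text{JVM heap}}_{\text{caches, queries, indexing buffers}}
  + \underbrace{\text{hot index size}}_{\text{OS page cache / direct memory}}
  + \text{OS overhead}
```

With the example above, an 18 GB heap on a 28 GB machine leaves roughly 10 GB for the OS page cache and other processes.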

While the full explanation is long and complex enough to deserve its own post, I still wanted to share the math we have been working with. At the beginning of the problem we used our own calculator, and only later found similar calculations shared by the online community.
Additionally, we made sure the garbage collector was properly configured in the JVM args used to start Solr.

[Figure: memory calculations and JVM GC settings]

Cache evidence


We tuned the caches based on the evidence from the Solr admin panel (the same numbers can also be pulled programmatically, as sketched after this list):

  • queryResultCache hit ratio: 0.01

  • filterCache hit ratio: 0.43

  • documentCache hit ratio: 0.01
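For reference, these numbers can also be read from the Metrics API instead of the admin panel. A minimal sketch, assuming Solr 7's /solr/admin/metrics endpoint; the exact metric layout differs slightly between 7.x versions, so the code below simply scans for cache entries that expose a hitratio field:

```python
import requests

SOLR = "http://localhost:8983/solr"   # placeholder base URL

# Ask only for searcher cache metrics of every core hosted on this node.
resp = requests.get(
    f"{SOLR}/admin/metrics",
    params={"group": "core", "prefix": "CACHE.searcher", "wt": "json"},
    timeout=10,
)
resp.raise_for_status()

for registry, metrics in resp.json().get("metrics", {}).items():
    for name, value in metrics.items():
        # Cache metrics are reported as maps that include a "hitratio" field.
        if isinstance(value, dict) and "hitratio" in value:
            print(f"{registry} {name}: hitratio={value['hitratio']}")
```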

Garbage Collector and Heap


Using New Relic, we checked memory and GC activity on the instances and noticed that the NR agent was frequently opening its circuit breaker (light red vertical lines) because of its thresholds: memory at 20% and garbage-collection CPU at 10%. This behavior is clear evidence of a problem with the memory available on the instance.

[Figure: New Relic memory and GC activity showing agent circuit-breaker events]

We also looked at the processes with high CPU usage on the instances and found the searcherExecutor thread using about 99% of the heap while consuming 100% of a CPU. Using JMX and JConsole, we came across an exception whose stack trace contained:
...org.apache.solr.search.LRUCache.warm(LRUCache.java:299)...
This exception is related to the cache sizes and their warm-up settings.


Disk Activity - AWS IOPS

[Figure: 100% IOPS utilization on Solr EBS volumes]

Starting to solve the problem


Search results fault tolerance


The first idea for serving search results to front-end clients was to always keep some Solr replicas alive to answer queries, even when the cluster becomes unstable because replicas are recovering or gone altogether. Solr 7 introduced new ways to synchronize data between a leader and its replicas:

  • NRT replicas: the old way of handling replication in SolrCloud.

  • TLOG replicas: use the transaction log and binary replication.

  • PULL replicas: replicate only from the leader, using binary replication.

Long story short, NRT replicas can perform the three most important tasks: indexing, searching, and becoming leader. TLOG replicas handle the same tasks in a slightly different way. The differentiating factor is the PULL replica, which only serves search queries.
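Replica types are requested at collection-creation time through the Collections API (Solr 7+). A minimal sketch of a TLOG + PULL layout, with the collection name, config set, and counts as placeholder values rather than our production settings:

```python
import requests

SOLR = "http://localhost:8983/solr"   # placeholder base URL

# Create a collection whose shards get one TLOG replica (indexes, can become
# leader) and two PULL replicas (search only, replicating from the leader).
# nrtReplicas=0 keeps the shards free of NRT replicas; check the Collections
# API docs for your exact 7.x version.
resp = requests.get(
    f"{SOLR}/admin/collections",
    params={
        "action": "CREATE",
        "name": "items",                         # placeholder collection name
        "collection.configName": "items_conf",   # placeholder config set
        "numShards": 2,
        "nrtReplicas": 0,
        "tlogReplicas": 1,
        "pullReplicas": 2,
        "maxShardsPerNode": 2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```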


By applying this configuration, we can guarantee that PULL replicas keep responding as long as the shard has a leader, which greatly improves reliability. Such replicas also do not go into recovery as often as replicas that handle indexing.


We still faced issues when the indexing service was running at full load, which caused the TLOG replicas to go into recovery.


Tuning Solr Memory


Coming back to the question "Do we have enough RAM for the number of documents we store?", we decided to experiment. The original concern was why we had configured these values in document "units", like so:

[Figure: original cache configuration values]

Based on the formula shared earlier, the estimated RAM is about 3800 GB for our 7 million documents. However, with 5 shards each one handles approximately 1.4 million documents, which directly affects the replicas. With this sharding configuration we can estimate the required RAM at about 3420 GB. This does not make a fundamental difference, so we moved on.


Cache results


From the cache evidence we can see that only one cache is being used effectively: filterCache. The configuration we tested is as follows:

[Figure: tested cache configuration]
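Our actual values are the ones in the figure above; purely to illustrate the mechanism, cache sizes and autowarm counts can be overlaid on solrconfig.xml through the Config API's set-property command. The property names below come from the Solr reference guide's list of editable common properties, but the sizes are made-up placeholders, not our settings, and should be verified against the guide for your version:

```python
import requests

SOLR = "http://localhost:8983/solr"   # placeholder base URL
COLLECTION = "items"                  # placeholder collection name

# Overlay cache properties on top of solrconfig.xml; values are examples only.
commands = {
    "set-property": {
        "query.filterCache.size": 512,
        "query.filterCache.autowarmCount": 128,
        "query.queryResultCache.size": 128,
        "query.documentCache.size": 128,
    }
}

resp = requests.post(f"{SOLR}/{COLLECTION}/config", json=commands, timeout=30)
resp.raise_for_status()
print(resp.json())
```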

With the previous cache configuration, we obtained the following results:

  • queryResultCache hit ratio: 0.01

  • filterCache hit ratio: 0.99

  • documentCache hit ratio: 0.02

Garbage collector results


In this section we can see the garbage collector metrics provided by New Relic. There is no longer any old-generation activity of the kind that used to make the New Relic agent open its circuit breaker (memory exhaustion).

[Figure: New Relic garbage collector metrics after tuning]

Disk activity results


We also saw impressive results in terms of disk activity, which dropped significantly.

[Figure: EBS disk activity after tuning]

External service results


One of the services accessing Solr saw a significant drop in response time and error rates in New Relic.

[Figure: client service response times and error rates in New Relic]

Tuning the Solr cluster


A downside of the multi-shard model is that if any replica is in a bad state, its shard leader takes more time than its peers to answer. The overall response then suffers the worst time among the shards, because Solr waits for all shards to answer before assembling the final response.


To alleviate these problems, and taking the previously described results into account, we decided to gradually reduce the number of nodes and shards, which also has the effect of lowering the internal replication factor.


Conclusion


After weeks of investigation, testing, and tuning, we not only got rid of the initial instability, but also improved performance by reducing latency, reduced administrative complexity by running fewer shards and fewer replicas, and regained confidence that the indexing/update service can run at full capacity. The changes also help the company reduce expenses by using almost half as many AWS EC2 instances.

This article: https://architect.pub/improving-solr-performance
Origin blog.csdn.net/jiagoushipro/article/details/131733924