ElasticSearch vs. Solr

Why log service provider Loggly chose ElasticSearch over Solr.
Original link: http://loggly.wpengine.com/bl ...

In the early stages of our Gen2 product we had, in effect, failed, which prompted us to re-examine our existing technology stack. We carefully analyzed and documented each individual component in the system, including, of course, the search engine technology that forms the core of our product.

Our scenario is a general-purpose log management system: it lets you query individual events and analyze and chart all log events, helping customers understand their dynamic data. The basic requirements for addressing these scenarios are as follows:

  • A scalable, highly fault-tolerant log collection pipeline. Our team uses Apache Kafka as the real-time data pipeline. It is worth emphasizing that if you need to feed large amounts of data into a search engine, you need a stable pipeline system.

  • Powerful search capabilities: near real-time indexing of large amounts of data, together with highly available search.

When we built our first-generation product, Gen1 (2010), Solr was gaining cloud capabilities and NRT (near real-time) search, and these two features were exactly what we needed. We started by building our system on the first Solr branch that had cloud support. For various reasons, a stable version of SolrCloud with NRT was not really usable until 2012. During that time we kept extending Solr through plug-ins and direct modifications to the source code.

In 2012, as we were preparing to start Gen2, SolrCloud 4.0 had just been released, and ElasticSearch had reached version 0.19.9. Going into the technical evaluation I was a strong supporter of Solr, but after a few months of comparison I finally realized that ElasticSearch was the better choice.

Any technology selection involves many considerations. Below are some of the important ones that led us to choose ElasticSearch. Note that this summary reflects the situation in 2012; a great deal may have changed since then.

1) Search features

Because both ES and Solr are built on Lucene, either one can provide the search features we need. However, because of each system's development history and design goals, they have different strengths and weaknesses that we had to weigh.

Solr was designed primarily to solve information retrieval (IR) problems, and this shows in its API, which offers more powerful search capabilities than ES. ES, as its name suggests, is positioned mainly around elastic scalability, so it is somewhat weaker on IR features. Our business scenario did not immediately need those sophisticated advanced features. Although Solr's search technology was better, ES met our needs at the time, and so it won.

2) Search Scalability

Because Gen1 used Solr, we already knew how to scale and manage it, and we understood some of its limitations in this respect. ES was different: we needed to verify that it could scale to meet our needs.

We deployed a cluster of each system and stress-tested them by loading large amounts of data and by forcibly shutting down some of the nodes to observe how each system behaved. At the time, we were like a bunch of monkeys causing trouble.

The biggest problem we found with SolrCloud in testing was its cluster management; when memory ran low, SolrCloud also had stability problems. We also ran into cluster lock-ups that could only be resolved by restarting the whole cluster. Under the same scenarios, ES did not hit a single unrecoverable failure. There are ways to make ES lose data, but we knew clearly when that would happen and had solutions for those cases.

3) Configuration Management

In Gen1 we spent a lot of effort on Solr configuration management, including managing data flow and index shards. We handled this with a system of plug-ins and source code modifications. The technical challenge was exciting, but we did not want to spend too much time on it. We wanted to focus on what differentiates our product: a collection and analysis engine that presents results better and gives customers better insight into their data.

There is no doubt that ES won this comparison, because the ES team pursues flexibility almost fanatically. Specifically:

  • Solr's Collections API had only recently become available and was very basic, whereas ES provided native, stable index management.

  • Both Solr and ES provide a reasonable default shard allocation strategy, but the ES routing framework is more robust than Solr's Collections API.

  • We debated whether the ES master/slave model is inferior to Solr's distributed model; in practice, quality of implementation matters more than theoretical perfection.
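
To make the index management point concrete, here is a minimal sketch of the JSON body an index-creation request might carry. The `number_of_shards` and `number_of_replicas` keys are real ES index settings; the index name, the values, and the helper function `build_index_request` are assumptions made for illustration, and nothing is sent over the network.

```python
import json


def build_index_request(index_name, shards, replicas):
    """Return (method, path, body) for a hypothetical index-creation call."""
    if shards < 1 or replicas < 0:
        raise ValueError("shards must be >= 1 and replicas must be >= 0")
    body = {
        "settings": {
            "number_of_shards": shards,      # fixed once the index exists
            "number_of_replicas": replicas,  # can be changed later
        }
    }
    return "PUT", "/" + index_name, json.dumps(body, sort_keys=True)


# Hypothetical index name and sizing, chosen only for the example.
method, path, body = build_index_request("logs-2012-10", shards=5, replicas=1)
print(method, path)
print(body)
```

The point of the sketch is that index layout is expressed as a small declarative document rather than through plug-ins or source changes.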

4) Performance

Although ES and Solr are both based on Lucene, we found that they use it in different ways. In our performance tests of indexing speed, Solr was clearly more efficient, as shown below:

[Figure: indexing throughput of Solr and ES across test runs of different durations]


Each point in the figure above is the result of a separate test run. In each run we indexed in batches of 8,000 for a fixed period of time (2, 5, 10, 20 minutes, and so on). The results clearly fall into two groups: Solr indexed at a fairly steady 18,000 events/sec, while ES started at a comparable (though less stable) rate, but as the test duration grew, its indexing rate dropped to about 12,000 events/sec.
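
To put the two rates in perspective, the following back-of-the-envelope sketch converts them into totals for each test duration. It assumes each rate holds constant for the whole run and treats each indexed item as one event; since ES reportedly started near Solr's rate, the computed gap is an upper bound rather than a measured result. The function names are hypothetical.

```python
def events_indexed(rate_per_sec, minutes):
    """Total events indexed at a constant rate over the given duration."""
    return rate_per_sec * minutes * 60


def batches_needed(total_events, batch_size=8000):
    """Number of batches required, rounding up (ceiling division)."""
    return -(-total_events // batch_size)


for minutes in (2, 5, 10, 20):
    solr = events_indexed(18_000, minutes)
    es = events_indexed(12_000, minutes)
    gap = solr - es
    print(f"{minutes:>2} min: Solr ~{solr:,}, ES ~{es:,}, "
          f"gap up to {gap:,} events ({batches_needed(gap)} batches)")
```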

Although it looks as if Solr won, many variables were in play: for example, Solr was using Lucene 4 while ES was using Lucene 3.6. So it is hard to say objectively which is better. ES was also about to adopt Lucene 4, so this difference should disappear.

In the end, we treated performance as a reference factor in our decision, not as a determining one.

5) Community support

Both ES and Solr are open source projects with very active communities. We discussed the two projects' different models: in theory we lean toward Solr's more open model, but the way the ES team manages its project is also impressive. Solr led in community size and activity, but ES was growing fast.

6) Other factors

We discussed many other aspects of the two products. Here is a brief look at some of them:

  • ES has a very elegant and powerful API, which fits well with our REST-based architecture that separates front end and back end

  • The scan and filter features of ES are very interesting, and we found many possible use cases for them

  • Native support for the JSON data format saves much of the code we had written in Loggly Gen1

  • ES's automatic type detection and Solr's dynamic fields are both very useful, because they avoid having to manage continually evolving input data

  • Solr 4's native hierarchical/pivot model deserves praise

  • Memory management is easier in ES than in Solr

  • Both have very good plug-in support, but Solr supports plug-ins more deeply in its core
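
The idea behind automatic type detection can be sketched as follows. This is a deliberately simplified illustration, not the algorithm either engine actually uses: the function names, the type labels, and the "first type seen wins" rule are all assumptions made for the example.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to an illustrative field type label."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return infer_type(value[0]) if value else "unknown"
    return "null"


def infer_mapping(documents):
    """Build a field -> type mapping; the first type seen for a field wins."""
    mapping = {}
    for doc in documents:
        for field, value in doc.items():
            mapping.setdefault(field, infer_type(value))
    return mapping


# Hypothetical log events; new fields appear without any schema change.
events = [json.loads(line) for line in (
    '{"host": "web-1", "status": 200, "latency_ms": 12.5}',
    '{"host": "web-2", "status": 500, "error": true}',
)]
print(infer_mapping(events))
```

Here the second event introduces an `error` field that the first one lacks, and the mapping simply grows to include it, which is the property that matters for log data whose shape keeps changing.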

There were also some factors that we chose not to be stubborn about, as follows:

  • Once an index is created, its number of shards cannot be changed dynamically

  • ES has only a single-level model, but that does not matter

  • In ES, primary shards and replica shards can end up holding inconsistent data, which may cause problems in a near real-time system
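
One way to see why the shard count is fixed at creation time: if documents are assigned to shards by hashing a routing key modulo the number of shards, changing that number sends most documents to a different shard. The sketch below illustrates this under stated assumptions; CRC32 is only a stand-in for whatever hash the engine really uses, and the document IDs and the function `shard_for` are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the routing key modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


doc_ids = [f"event-{i}" for i in range(1000)]

# Compare placement with 5 shards against placement with 6 shards.
moved = sum(1 for d in doc_ids if shard_for(d, 5) != shard_for(d, 6))
print(f"{moved} of {len(doc_ids)} documents would land on a different shard")
```

Because lookups use the same function as writes, documents indexed under the old shard count could no longer be found where the new count says they should be, unless everything is reindexed.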

In our final discussion we concluded that these issues were not very serious. As the discussion deepened, we became more clearly aware of which factors had a deeper impact on the system, and we gave those factors correspondingly more weight.

Our choice

In the end, we chose ES for the following two reasons:

  • We wanted to spend less effort on configuration management and more on the product itself

  • We believed that as ES grew stronger, its gaps relative to Solr would gradually be closed

Of course, this choice came at a cost. We needed to integrate ES into our pipeline, and the front-end, back-end, and pipeline management teams all had work to do. By comparison, though, this was higher-level work than what we had to do in Gen1. It lets us focus more on our own product, which is what we wanted from the start and what our customers need.

As SolrCloud and ElasticSearch have matured, the gap between them has narrowed, and today it would be very hard to choose between the two. For all of us, though, this is a good thing: the two push each other forward, and Lucene keeps getting stronger.

 


Origin blog.csdn.net/ywl470812087/article/details/104874625