Search Engine Selection: Elasticsearch and Solr Research Document

Introduction to Elasticsearch

Elasticsearch is a real-time distributed search and analytics engine. It helps you process large-scale data at unprecedented speed.

It can be used for full-text search, structured search, and analytics, and of course you can combine all three.

Elasticsearch is a search engine based on the full-text search engine Apache Lucene™. It can be said that Lucene is the most advanced and efficient full-featured open source search engine framework today.

But Lucene is just a framework. To take full advantage of it, you need to work in Java and integrate Lucene into your program. Lucene is genuinely complex, and understanding how it works takes a lot of learning.

Elasticsearch uses Lucene as its internal engine, but for full-text search you only need to use its unified API; you don't need to understand the complicated inner workings of Lucene behind it.

Of course, Elasticsearch is more than just Lucene. Besides full-text search, it also provides the following:

    Distributed real-time document storage, with every field indexed and searchable.

    A distributed search engine for real-time analytics.

    The ability to scale to hundreds of servers and handle petabytes of structured or unstructured data.

All of this functionality is integrated into a single server, which you can easily talk to through its RESTful API, using a client or any programming language you like.
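As a rough illustration of the RESTful API mentioned above, the following Python sketch builds (but does not send) a simple full-text search request. `localhost:9200` is Elasticsearch's default HTTP address; the index name `articles` and the field `title` are hypothetical and chosen only for this example.

```python
import json
from urllib.request import Request

def build_search_request(base_url, index, field, text):
    """Build an HTTP request for a simple full-text `match` query."""
    # Query DSL body: match `text` against a single field.
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical index and field names; nothing is sent over the network here.
req = build_search_request("http://localhost:9200", "articles", "title", "search engine")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it to a running node; any other language with an HTTP client could issue the same request.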

Getting started with Elasticsearch is very simple. It ships with sensible defaults that let beginners avoid the complex theory when they first start out.

It works as soon as it is installed, and you can become productive with very little learning.

As you learn more, you can take advantage of Elasticsearch's more advanced features. The entire engine is flexibly configurable, so you can customize it to your own needs.

Use cases:

    Wikipedia uses Elasticsearch for full-text search and keyword highlighting, as well as search suggestions such as search-as-you-type, did-you-mean, etc.

    The Guardian uses Elasticsearch to process visitor logs so that the public's reaction to different articles can be fed back to its editors in real time.

    Stack Overflow combines full-text search with geolocation and related information to present more-like-this related content.

    GitHub uses Elasticsearch to search over 130 billion lines of code.

    Goldman Sachs uses it to index 5TB of data every day, and many investment banks use it to analyze stock market movements.

But Elasticsearch is not just for large enterprises; it has also helped many startups, such as DataDog and Klout, expand their capabilities.
Advantages and disadvantages of Elasticsearch
Advantages

    Elasticsearch is distributed and requires no other components. Distribution happens in real time, which is called "push replication".
    Elasticsearch fully supports Apache Lucene's near real-time search.
    Multitenancy needs no special configuration, whereas Solr requires more advanced settings for it.
    Elasticsearch uses the concept of a gateway, which makes full backups easier.
    The nodes form a peer-to-peer network; when some nodes fail, other nodes are automatically assigned to take over their work.

Disadvantages

    Only one developer (this is no longer accurate: the Elasticsearch GitHub organization now has more than one, including quite active maintainers)
    Not automatic enough (this does not quite apply given the new Index Warmup API)

Solr Introduction

Solr (pronounced "solar") is the open source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling (e.g. Word, PDF). Solr is highly scalable and provides distributed search and index replication. Solr is one of the most popular enterprise search engines, and Solr 4 also adds NoSQL support.

Solr is a standalone full-text search server written in Java that runs in a servlet container such as Apache Tomcat or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and exposes REST-like HTTP/XML and JSON APIs. Solr's powerful external configuration makes it possible to adapt it to many types of applications without writing Java code. Solr also has a plugin architecture to support more advanced customization.
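To make the REST-like HTTP API a little more concrete, here is a minimal Python sketch that builds a query URL for Solr's standard `/select` handler. `localhost:8983` is Solr's default address; the core name `articles` and the field `title` are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

def build_solr_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's standard /select request handler."""
    params = {
        "q": query,     # main query string, here in field:value form
        "rows": rows,   # maximum number of documents to return
        "wt": "json",   # ask for a JSON response instead of XML
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"

# Hypothetical core and field names; the URL is only built, not requested.
url = build_solr_select_url("http://localhost:8983", "articles", "title:lucene", rows=5)
print(url)
```

Because the whole query is carried in URL parameters, it can be issued from a browser or any HTTP client without writing Java code.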

Since the Apache Lucene and Apache Solr projects merged in 2010, both have been developed by the same Apache Software Foundation team. When referring to the technology or the products, Lucene/Solr and Solr/Lucene mean the same thing.
Advantages and disadvantages of Solr
Advantages


    Solr has a larger and more mature community of users, developers and contributors.
    It supports indexing content in many formats, such as HTML, PDF, Microsoft Office formats, and plain-text formats such as JSON, XML, and CSV.
    Solr is more mature and stable.
    Searching is faster when no indexing is happening at the same time.

Disadvantages

    While the index is being built, search efficiency drops; real-time indexing and search performance is poor.

Elasticsearch vs Solr

Solr is faster when it only has to search existing data.

When the index is being built in real time, Solr suffers from I/O blocking and its query performance is poor. Here Elasticsearch has a clear advantage.

As the amount of data grows, Solr's search efficiency drops, while Elasticsearch's does not change significantly.

To sum up, Solr's architecture is not well suited to real-time search applications.
Actual production environment test

The figure below shows that average query speed became 50 times faster after the search engine was switched from Solr to Elasticsearch.

Summary of the comparison between Elasticsearch and Solr

    Both are easy to install;
    Solr uses ZooKeeper for distributed management, while Elasticsearch has its own built-in distributed coordination;
    Solr supports more data formats, while Elasticsearch only supports JSON;
    Solr officially provides more features, while Elasticsearch itself focuses on core features and leaves many advanced features to third-party plugins;
    Solr performs better than Elasticsearch in traditional search applications, but is significantly less efficient than Elasticsearch in real-time search applications.

Solr is a powerful solution for traditional search applications, but Elasticsearch is more suitable for emerging real-time search applications.
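The data format point in the summary above has a practical consequence: data held in another format, such as CSV, has to be converted to JSON before Elasticsearch can index it. The Python sketch below shows one way this might look, producing the newline-delimited JSON body expected by the `_bulk` endpoint. The index name `articles` and the sample rows are made up for illustration, and the exact metadata allowed in the action line varies between Elasticsearch versions.

```python
import csv
import io
import json

def csv_to_bulk_body(csv_text, index):
    """Convert CSV text into a newline-delimited JSON body for the _bulk API."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Each document takes two lines: an action line, then the document itself.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(row))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"

# Made-up sample data; all CSV values arrive as strings.
sample = "id,title\n1,Intro to Lucene\n2,Solr in practice\n"
print(csv_to_bulk_body(sample, "articles"), end="")
```

With Solr, by contrast, a CSV file like this could be submitted for indexing directly.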
Other Lucene-based open source search engine solutions

    Direct use of Lucene

Description: Lucene is a Java search class library. It is not a complete solution by itself and requires additional development work.

Advantages: A mature solution with many success stories. It is a top-level Apache project that continues to progress rapidly, with a huge and active development community and lots of developers. Because it is just a class library, there is plenty of room for customization and optimization: after simple customization it can meet most common needs, and after optimization it can support search at the 1 billion+ scale.

Disadvantages: Requires additional development work; scaling, distribution, reliability and so on all have to be implemented yourself. It is not real-time: there is a delay between indexing and being able to search, and the scalability of the current Near Real Time (Lucene Near Real Time search) scheme still needs to be improved.

    Katta

Description: A Lucene-based search scheme that is distributed, scalable, fault-tolerant and quasi-real-time.

Advantages: Works out of the box and can be used together with Hadoop for distribution. It has scaling and fault tolerance mechanisms.

Disadvantages: It is only a search scheme; the indexing part still has to be implemented yourself. On the search side, only the most basic needs are covered. There are fewer success stories and the project is somewhat less mature. Because it has to support distribution, customizing it for some complex query requirements is more difficult.

    Hadoop contrib/index

Description: A distributed index building scheme based on Map/Reduce that can be used together with Katta.

Advantages: Distributed index building and scalability.

Disadvantages: It is only an indexing scheme and does not include a search implementation. It works in batch mode and supports real-time search poorly.

    LinkedIn's open source solution

Description: A series of Lucene-based solutions, including the quasi-real-time search zoie, the facet search implementation bobo, the machine learning algorithm decomposer, the summary repository krati, the database schema wrapper sensei, and others.

Advantages: Proven solutions that are distributed and extensible, with a rich set of features.

Disadvantages: Too closely tied to LinkedIn itself, and hard to customize.

    Lucandra

Description: Based on Lucene, with the index stored in a Cassandra database.

Advantages: See the advantages of Cassandra.

Disadvantages: See the disadvantages of Cassandra. In addition, it is only a demo and has not been extensively verified.

    HBasene

Description: Based on Lucene, with the index stored in an HBase database.

Advantages: See the advantages of HBase.

Disadvantages: See the disadvantages of HBase. In addition, the implementation stores Lucene terms as rows and the posting list of each term as columns. As the posting list of a single term grows, query speed suffers badly.
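For readers unfamiliar with posting lists, the toy Python sketch below builds a tiny inverted index in memory. It is only a conceptual model, not HBasene's actual schema, and the sample documents are invented. It shows why a common term becomes a problem in the layout described above: its posting list, and therefore its row, keeps growing with the corpus.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its posting list: the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Invented sample documents, keyed by document ID.
docs = {
    1: "solr search",
    2: "elasticsearch search",
    3: "lucene search library",
}
index = build_inverted_index(docs)

# A common term appears in every document, so its posting list is the longest.
print(index["search"])  # [1, 2, 3]
print(index["lucene"])  # [3]
```

If each term is a row and each posting a column, the row for `search` gains a column for every new document that mentions it, while the row for `lucene` stays short.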



Reprinted from: http://blog.csdn.net/jameshadoop/article/details/44905643
