Search Engine Selection

Introduction to Elasticsearch

Elasticsearch is a real-time distributed search and analytics engine. It can help you process large-scale data with unprecedented speed.

It can be used for full-text search, structured search, and analytics, and you can combine all three.

Elasticsearch is a search engine built on the full-text search library Apache Lucene™. Lucene is arguably the most advanced and efficient full-featured open source search framework available today.

But Lucene is just a framework. To take full advantage of it, you need to use Java and integrate Lucene into your program. It takes a lot of learning to understand how it works, and Lucene is genuinely complex.

Elasticsearch uses Lucene as its internal engine, but for full-text search you only need to use its unified API; you don't need to understand the complicated Lucene internals behind it.

Of course, Elasticsearch is more than just Lucene. Beyond full-text search, it also provides the following:

  • A distributed real-time document store in which every field is indexed and searchable.

  • A distributed search engine for real-time analytics.

  • The ability to scale to hundreds of servers and handle petabytes of structured or unstructured data.

All of this functionality is packaged into a single server that you can talk to through Elasticsearch's RESTful API, using a client library or any programming language you like.
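As a rough illustration of talking to that RESTful API from a general-purpose language, the sketch below builds (but does not send) a full-text `match` query using only Python's standard library. The address `http://localhost:9200`, the index name `articles`, and the field `title` are assumptions made up for this example; they do not come from the article.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build an HTTP request for a full-text `match` query.

    base_url, index and field are placeholders chosen for this example.
    """
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request(
    "http://localhost:9200", "articles", "title", "real-time search"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To run the query against a live node (not done here):
#     with urllib.request.urlopen(req) as resp:
#         hits = json.load(resp)["hits"]["hits"]
```

Because the request is plain HTTP with a JSON body, the same query could be issued from any language that has an HTTP client.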

Getting started with Elasticsearch is very simple. It ships with sensible defaults, so beginners can avoid the complex theory at first.

It works out of the box, and you can be productive with very little learning.

As you learn more, you can take advantage of Elasticsearch's more advanced features. The entire engine is flexibly configurable, so you can tailor it to your own needs.

Use Cases:

  • Wikipedia uses Elasticsearch for full-text search and keyword highlighting, as well as search suggestions such as search-as-you-type, did-you-mean, and more.

  • The Guardian uses Elasticsearch to process visitor logs so that the public's reaction to different articles can be fed back to its editors in real time.

  • StackOverflow combines full-text search with geolocation and uses more-like-this queries to surface related content.

  • GitHub uses Elasticsearch to search over 130 billion lines of code.

  • Goldman Sachs uses it to index 5TB of data every day, and many investment banks use it to analyze stock market movements.

But Elasticsearch is not just for large enterprises; it has also helped many startups, such as DataDog and Klout, expand their capabilities.

Advantages and Disadvantages of Elasticsearch

Advantages

  1. Elasticsearch is distributed out of the box, with no additional components required. Replication happens in real time and is called "push replication".
  2. Elasticsearch fully supports Apache Lucene's near-real-time search.
  3. Handling multitenancy requires no special configuration, whereas Solr requires more advanced settings.
  4. Elasticsearch uses the Gateway concept, which makes full backups easier.
  5. The nodes form a peer-to-peer network; when some nodes fail, other nodes are automatically assigned to take over their work.
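The failover behaviour in the last point can be pictured with a toy model. This is only a sketch under simplifying assumptions, not Elasticsearch's actual allocation logic: the shard table layout, the node names, and the `fail_node` helper are all hypothetical. It shows a surviving replica being promoted when the node holding a primary copy disappears.

```python
def fail_node(shards, failed):
    """Return a new shard table after `failed` leaves the cluster.

    shards maps a shard name to {"primary": node, "replicas": [nodes]}.
    """
    result = {}
    for name, copy in shards.items():
        # Drop any replica that lived on the failed node.
        replicas = [n for n in copy["replicas"] if n != failed]
        primary = copy["primary"]
        if primary == failed:
            # Promote the first surviving replica; None means the shard is lost.
            primary = replicas.pop(0) if replicas else None
        result[name] = {"primary": primary, "replicas": replicas}
    return result


cluster = {
    "shard-0": {"primary": "node-a", "replicas": ["node-b"]},
    "shard-1": {"primary": "node-b", "replicas": ["node-c"]},
}
after = fail_node(cluster, "node-b")
print(after["shard-0"])  # {'primary': 'node-a', 'replicas': []}
print(after["shard-1"])  # {'primary': 'node-c', 'replicas': []}
```

A real cluster would also create fresh replicas on the remaining nodes; that step is left out to keep the sketch short.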

Disadvantages

  1. There is only one developer (this is no longer accurate: the Elasticsearch GitHub organization now has a number of quite active maintainers)
  2. Not automatic enough (this no longer applies given the newer Index Warmup API)

Introduction to Solr

Solr (pronounced "solar") is the open source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and handling of rich documents (e.g. Word, PDF). Solr is highly scalable and provides distributed search and index replication. Solr is one of the most popular enterprise search engines, and Solr 4 also adds NoSQL support.

Solr is a standalone full-text search server written in Java that runs in a servlet container such as Apache Tomcat or Jetty. It uses the Lucene Java search library at its core for full-text indexing and search, and exposes REST-like HTTP/XML and JSON APIs. Solr's powerful external configuration makes it adaptable to many types of applications without any Java coding, and its plugin architecture supports more advanced customization.

The Apache Lucene and Apache Solr projects merged in 2010, and both have since been developed by the same Apache Software Foundation team. When referring to the technology or the product, "Lucene/Solr" and "Solr/Lucene" mean the same thing.

Pros and Cons of Solr

Advantages

  1. Solr has a larger and more mature community of users, developers, and contributors.
  2. It supports indexing content in many formats, such as HTML, PDF, Microsoft Office formats, and plain-text formats such as JSON, XML, and CSV.
  3. Solr is more mature and stable.
  4. Searching is faster when no indexing is happening at the same time.

Disadvantages

  1. While an index is being built, search efficiency drops, so real-time indexing and searching perform poorly.

Comparison of Elasticsearch and Solr

Solr is faster when purely searching for existing data.

[Figure: search performance on a fresh index while idle]

When the index is being built in real time, Solr blocks on I/O and query performance suffers. Elasticsearch has a clear advantage here.

[Figure: search performance on a fresh index while indexing]

As the amount of data increases, Solr's search efficiency drops, while Elasticsearch's does not change significantly.

[Figure: search performance on a fresh index while indexing, as data volume grows]

To sum up, Solr's architecture is not suitable for real-time search applications.

Testing in a Real Production Environment

The graph below shows a 50x increase in average query speed after switching the search engine from Solr to Elasticsearch.

[Figure: average query execution time before and after the switch]

Elasticsearch vs Solr Comparison Summary

  • Both are easy to install;
  • Solr uses Zookeeper for distributed management, while Elasticsearch has built-in distributed coordination;
  • Solr supports more data formats, while Elasticsearch only supports JSON;
  • Solr officially provides more features, while Elasticsearch itself focuses on core functionality and leaves many advanced features to third-party plug-ins;
  • Solr performs better than Elasticsearch in traditional search applications, but is significantly less efficient than Elasticsearch in real-time search applications.

Solr is a powerful solution for traditional search applications, but Elasticsearch is more suitable for emerging real-time search applications.

Other Lucene-based Open Source Search Engine Solutions

  1. Use Lucene directly

Description: Lucene is a Java search library. It is not a complete solution by itself and requires additional development work.

Advantages: A mature solution with many success stories. It is an Apache top-level project that continues to progress rapidly, with a huge and active development community. Because it is just a library, it leaves plenty of room for customization and optimization: after simple customization it can meet most common needs, and after optimization it can support search at the billion-plus scale.

Disadvantages: Requires additional development work. Scaling, distribution, reliability, and so on must all be implemented yourself. It is not real-time: there is a delay between indexing a document and being able to search it, and the scalability of the current "Near Real Time" (Lucene Near Real Time search) scheme still needs further improvement.

Description: A Lucene-based search solution that is distributed, scalable, fault-tolerant, and near-real-time.

Advantages: Works out of the box and can be combined with Hadoop for distribution. It has built-in scaling and fault-tolerance mechanisms.

Disadvantages: It is only a search solution; the indexing part still has to be implemented yourself. Its search features cover only the most basic needs. There are fewer success stories, and the project is somewhat less mature. Because it has to support distribution, customizing it for complex query requirements is more difficult.

Description: A distributed index-building solution based on Map/Reduce that can be used in conjunction with Katta.

Advantages: Distributed indexing and scalability.

Disadvantages: It is only an indexing solution and does not include a search implementation. It works in batch mode, with poor support for real-time search.
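To make the Map/Reduce indexing idea concrete, here is a minimal single-process sketch. It is not the code of any particular project: `map_phase`, `reduce_phase`, and the sample documents are hypothetical names and data, and a real job would spread the map calls across many workers.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_phase(pairs):
    """Group document ids by term into sorted posting lists."""
    postings = defaultdict(list)
    for term, doc_id in pairs:
        postings[term].append(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {1: "distributed search engine", 2: "distributed index building"}

# Each document could be mapped on a different worker; here they run in sequence.
emitted = []
for doc_id, text in docs.items():
    emitted.extend(map_phase(doc_id, text))

index = reduce_phase(emitted)
print(index["distributed"])  # [1, 2]
print(index["search"])       # [1]
```

The index only exists once the reduce step has finished, which is one way to see why a batch scheme like this supports real-time search poorly.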

Description: A family of Lucene-based solutions, including the near-real-time search system zoie, the faceted search implementation bobo, the machine learning library decomposer, the summary store krati, the database schema wrapper sensei, and more.

Advantages: A proven solution that is distributed, scalable, and feature-rich.

Disadvantages: Too tightly coupled to LinkedIn as a company; poor customizability.

Description: Based on Lucene, with the index stored in the Cassandra database.

Advantages: See the advantages of Cassandra.

Disadvantages: See the disadvantages of Cassandra. In addition, this is only a demo and has not been widely validated.

Description: Based on Lucene, with the index stored in the HBase database.

Advantages: See the advantages of HBase.

Disadvantages: See the disadvantages of HBase. In addition, in its implementation each Lucene term is stored as a row and the term's posting list is stored in that row's columns. As the posting list for a single term grows, query speed degrades significantly.
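The row-per-term layout described above can be mimicked with nested dictionaries. This is an illustrative stand-in only: the `table` structure and the `index_document` and `lookup` helpers are hypothetical, not the project's real schema or API. It shows how a single term's row gains one column for every document that contains the term.

```python
from collections import defaultdict

# Hypothetical stand-in for a wide-column table: row key -> {column: value}.
# Row key = term, column = document id, value = term frequency.
table = defaultdict(dict)


def index_document(doc_id, text):
    """Add one document: one row per term, one column per document."""
    for term in text.lower().split():
        row = table[term]
        row[doc_id] = row.get(doc_id, 0) + 1


def lookup(term):
    """Read a term's whole row; the work grows with the number of columns."""
    row = table.get(term, {})
    return sorted(row)


index_document(1, "lucene index stored in hbase")
index_document(2, "hbase stores the index")
index_document(3, "the index keeps growing")

print(lookup("index"))      # [1, 2, 3]
print(len(table["index"]))  # 3
print(lookup("lucene"))     # [1]
```

A common term ends up with a very wide row, and every query for that term has to read all of it, which matches the slowdown the text describes.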

 

Reprinted: http://blog.csdn.net/jameshadoop/article/details/44905643
