Lucene, Solr, and Elasticsearch comparison

 

1 | 0 What is full-text search


What is a full-text search engine?

Baidu Baike definition:
A full-text search engine is the kind of engine widely used by the major search engines. A computer indexing program scans every word of each article and builds an index entry for every word, recording how many times and where the word appears in the article. When a user submits a query, the retrieval program looks it up in the previously built index and returns the results to the user. The process is similar to looking up a word in a dictionary through its word index.

The definition gives a general idea of full-text search. For a more detailed explanation, let's start with the data we meet in everyday life.

The data in our lives generally falls into two types: structured data and unstructured data.

  • Structured data: data with a fixed format or a finite length, such as database records and metadata.
  • Unstructured data: also called full-text data; data with no fixed length or format, such as e-mail and Word documents.

Some sources add a third type, semi-structured data, such as XML and HTML. Depending on the need, it can be processed as structured data, or its plain text can be extracted and processed as unstructured data.

Corresponding to these two kinds of data, search is also divided into two kinds: search over structured data and search over unstructured data.

For structured data, we usually store and search it in relational database tables (MySQL, Oracle, etc.), and we can also build indexes on it.
For unstructured data, that is, full-text data, there are two main search methods: sequential scanning and full-text search.

Sequential scanning: as the name suggests, it looks for a specific keyword by scanning the content in order.
For example, suppose you are handed a newspaper and asked to find every place where the text "RNG" appears. You have to read the newspaper from beginning to end, marking which sections the keyword appears in and where.

This is undoubtedly the most time-consuming and least efficient approach. If the newspaper is printed in small type and has many sections, or there are several newspapers, your eyes will be worn out by the time you finish scanning.
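
As a rough illustration of sequential scanning, here is a minimal Python sketch. The newspaper sections and their text are invented for the example, and `sequential_scan` is just a name chosen here. The point is only that every section is read in full for every query.

```python
# Sequential scan: read every section from start to finish for each query.
# The newspaper contents below are made up for illustration.
newspaper = {
    "front page": "RNG advances to the quarterfinals",
    "esports": "EDG and RNG both win their group stage matches",
    "finance": "Markets close higher on Friday",
}

def sequential_scan(sections, keyword):
    """Return (section, character offset) for every occurrence of keyword."""
    hits = []
    for name, text in sections.items():  # every section is visited
        start = text.find(keyword)
        while start != -1:
            hits.append((name, start))
            start = text.find(keyword, start + 1)
    return hits

print(sequential_scan(newspaper, "RNG"))
```

The cost grows with the total amount of text, and the whole scan is repeated for each new keyword.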

Full-text search: sequential scanning of unstructured data is slow, so can we optimize it? Could we give our unstructured data some structure? We extract part of the information from the unstructured data and reorganize it so that it has a certain structure, then search this structured data, which makes the search relatively fast. This is the basic idea of full-text search. The information that is extracted from the unstructured data and then reorganized is called the index.

Take the newspaper again. Suppose we want to follow the recent S8 World Finals news and you are an RNG fan: how do you quickly find which newspapers and sections carry RNG news? With full-text search, we extract keywords from every section of every newspaper, such as "EDG", "RNG", "FW", "team", and "League of Legends". We then build an index on these keywords, and through the index we can map each keyword to the newspapers and sections in which it appears. Note that this is different from a directory-based search engine.
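
The keyword-to-section mapping described above can be sketched in a few lines of Python. This is only a toy: the section names and text are hypothetical, and splitting on whitespace stands in for real keyword extraction.

```python
from collections import defaultdict

# Hypothetical newspaper sections; the text is invented for this sketch.
sections = {
    "paper A / esports": "RNG beats EDG in the S8 group stage",
    "paper A / finance": "Markets close higher on Friday",
    "paper B / esports": "FW and RNG meet in the next round",
}

def build_index(docs):
    """Map each lowercased word to the set of sections that contain it."""
    index = defaultdict(set)
    for section, text in docs.items():
        for word in text.lower().split():
            index[word].add(section)
    return index

def search(index, keyword):
    """Look the keyword up in the index instead of rescanning the text."""
    return sorted(index.get(keyword.lower(), set()))

index = build_index(sections)
print(search(index, "RNG"))
```

The text is read once when the index is built. After that, each query is a single dictionary lookup, no matter how many sections there are.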

2 | 0 Why use a full-text search engine


A colleague once asked me: why use a search engine at all? All of our data is already in a database, and Oracle, SQL Server, and other databases also provide query, retrieval, and cluster analysis functions, so why not just query the database directly? Indeed, most of our query needs can be met by database queries. If a query is slow, we can improve it by building database indexes, optimizing the SQL, or even adding a cache to speed up responses. If the data volume grows, we can split it across databases and tables to spread the query load.

So why do we need a full-text search engine? We look at two main reasons:

  • Data type
    Full-text indexing supports searching unstructured data, and can quickly find any word or phrase in large volumes of unstructured text.
    Search sites such as Google and Baidu, for example, build an index from the keywords in web pages; when we enter a keyword, they return all the pages that the index matches for it. Searching application logs is another common case. Relational databases do not support searching this kind of unstructured text well.

  • Index maintenance
    Traditional databases are generally weak at full-text search, because few people use a database to store large text fields. A full-text query has to scan the whole table, so when the data volume is large, even optimized SQL helps little. Indexes can be built, but they are troublesome to maintain: insert and update operations both trigger index rebuilding.

When to use a full-text search engine:

  1. The data being searched is a large volume of unstructured text.
  2. The number of documents reaches hundreds of thousands, millions, or more.
  3. A large number of interactive text-based queries must be supported.
  4. Very flexible full-text search queries are required.
  5. There are special requirements for highly relevant search results that the available relational database cannot meet.
  6. There is relatively little need for different record types, non-text data operations, or secure transaction processing.

3 | 0 Lucene, Solr, Elasticsearch


The mainstream search engines today are roughly these: Lucene, Solr, and Elasticsearch.

All of them build their indexes as inverted indexes. So what is an inverted index?

Wikipedia:
An inverted index (also often called a reverse index, postings file, or inverted file) is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.
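
To make "a mapping from a word to its locations" concrete, here is a small Python sketch of a positional inverted index with a phrase lookup on top of it. It is a simplified illustration under assumed names (`build_positional_index`, `phrase_search`) and invented documents, not how Lucene lays out its index.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents in which the words of phrase appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

# Invented sample documents.
docs = {
    1: "full text search uses an inverted index",
    2: "an index maps each word to its locations",
    3: "search the full text with an inverted index",
}
index = build_positional_index(docs)
print(index["index"])
print(phrase_search(index, "full text search"))
```

Storing positions, not just document ids, is what lets the index answer phrase queries as well as single-word lookups.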

3 | 1 Lucene


Lucene is a full-text search engine written entirely in Java. It is not a complete application but a code library and API that makes it easy to add search functionality to your application.

Lucene offers powerful features through a simple API:

Scalable, high-performance indexing

  • Over 150GB/hour on modern hardware
  • Small RAM requirements: only 1MB heap
  • Incremental indexing as fast as batch indexing
  • Index size roughly 20-30% of the size of the text indexed

Powerful, accurate and efficient search algorithms

  • Ranked searching: the best results are returned first
  • Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries, etc.
  • Fielded searching (such as title, author, contents)
  • Sorting by any field
  • Multiple-index searching with merged results
  • Allows simultaneous update and searching
  • Flexible faceting, highlighting, joins, and result grouping
  • Fast, memory-efficient, and typo-tolerant suggesters
  • Pluggable ranking models, including the Vector Space Model and Okapi BM25
  • Configurable storage engine (codecs)
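
Since Okapi BM25 is named among the ranking models, a short sketch may help show what "best results first" means in practice. The code below follows the textbook BM25 formula with the commonly used defaults k1 = 1.2 and b = 0.75. It is an assumption-laden illustration in Python, not Lucene's own implementation, and the documents are invented.

```python
import math

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            df = sum(1 for d in docs if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = doc.count(term)  # term frequency in this document
            norm = k1 * (1 - b + b * len(doc) / avg_len)  # length normalization
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores

# Invented sample documents, already tokenized.
docs = [
    "rng wins the final".split(),
    "rng rng rng fans celebrate".split(),
    "markets close higher on friday".split(),
]
scores = bm25_scores(["rng"], docs)
print(scores)
```

A document that mentions the term more often scores higher, but with diminishing returns, and a document that never mentions it scores zero.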

Cross-platform solution

  • Available as open source software under the Apache License, which lets you use Lucene in both commercial and open source programs
  • 100%-pure Java
  • Index-compatible implementations are available in other programming languages

Apache Software Foundation
Lucene is supported by the Apache community of open source software projects, which is provided by the Apache Software Foundation.

However, Lucene is only a framework. To take full advantage of its features you need to use Java and integrate Lucene into your program. It takes a lot of study to understand how it works, and using Lucene proficiently is genuinely complicated.

3 | 2 Solr


Apache Solr is an open source search platform built on the Java library Lucene. It offers Apache Lucene's search capabilities in a user-friendly way. Having been an industry player for nearly a decade, it is a mature product with a strong and broad user community. It provides distributed indexing, replication, load-balanced querying, and automatic failover and recovery. If it is deployed correctly and managed well, it can be a highly reliable, scalable, and fault-tolerant search engine. Many Internet giants, such as Netflix, eBay, Instagram, and Amazon (CloudSearch), use Solr because it can index and search multiple sites.

The main features list includes:

  • Full-text search
  • Highlighting
  • Faceted search
  • Real-time indexing
  • Dynamic clustering
  • Database integration
  • NoSQL features and rich document handling (such as Word and PDF files)

3 | 3 Elasticsearch


Elasticsearch is an open source (Apache 2 license) RESTful search engine built on the Apache Lucene library.

Elasticsearch was launched a few years after Solr. It provides a distributed, multi-tenant full-text search engine with an HTTP web interface (REST) and schema-free JSON documents. Official Elasticsearch client libraries are available for Java, Groovy, PHP, Ruby, Perl, Python, .NET, and JavaScript.

Its distributed search engine contains indexes that can be divided into shards, and each shard can have multiple replicas. Each Elasticsearch node can hold one or more shards, and it also acts as a coordinator that delegates operations to the correct shards.
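
The idea of a coordinator sending each operation to the correct shard can be sketched as hashing the document id modulo the number of shards. The Python below is a toy model under stated assumptions: CRC32 is a stand-in hash, `Coordinator` and `shard_for` are names made up here, and replicas are not modeled. It is not Elasticsearch's actual code.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # illustrative value

def shard_for(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard by hashing the document id (CRC32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

class Coordinator:
    """Toy coordinator that forwards each operation to the shard owning the id."""

    def __init__(self, num_shards=NUM_PRIMARY_SHARDS):
        # Each shard is modeled as a plain in-memory dict.
        self.shards = [dict() for _ in range(num_shards)]

    def index(self, doc_id, doc):
        self.shards[shard_for(doc_id, len(self.shards))][doc_id] = doc

    def get(self, doc_id):
        return self.shards[shard_for(doc_id, len(self.shards))].get(doc_id)

coord = Coordinator()
for i in range(10):
    coord.index(f"doc-{i}", {"n": i})
print(coord.get("doc-7"))
print([len(s) for s in coord.shards])
```

Because the same id always hashes to the same shard, a read can go straight to the shard that received the write. This is also why, in a scheme like this, the number of shards cannot be changed without moving documents.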

Elasticsearch is scalable and offers near real-time search. One of its main features is multi-tenancy.

The main features list includes:

  • Distributed search
  • Multi-tenancy
  • Analytical search
  • Grouping and aggregation

4 | 0 Choosing between Elasticsearch and Solr


Because Lucene is complex, it is rarely considered as a first choice for search, except by companies that need to develop their own search framework, which will rely on Lucene underneath. So here we focus on analyzing Elasticsearch and Solr.

Elasticsearch vs. Solr. Which one is better? How do they differ? Which one should you use?

4 | 1 Historical comparison


Apache Solr is a mature project with a large and active developer and user community, as well as the Apache brand. Solr was first released as open source in 2006 and has long dominated the search engine field as the go-to engine for anyone who needed search. Its maturity translates into rich functionality beyond plain text indexing and searching, such as faceting, grouping, powerful filtering, pluggable document processing, pluggable search chain components, and language detection.

Solr dominated the search field for years. Then, around 2010, Elasticsearch emerged as another option on the market. At that time it was far less stable than Solr, lacked Solr's depth of functionality, and had no mindshare or brand.

Although Elasticsearch was very young, it had advantages of its own. It was built on more modern principles and targeted more modern use cases, and it was designed to make handling large indexes and high query rates easier. In addition, because it was so young and had no community to work with, it was free to move forward without needing consensus or cooperation with others (users or developers), backward compatibility, or any of the other constraints that more mature software usually has to deal with.

As a result, it exposed some very popular features (for example, Near Real-Time Search, or NRT) before Solr did. Technically, the NRT search capability comes from Lucene, the underlying search library used by both Solr and Elasticsearch. Ironically, because Elasticsearch exposed NRT search first, people associate NRT search with Elasticsearch, even though Solr and Lucene are part of the same Apache project, so one would expect Solr to have gotten such a highly demanded feature first.

4 | 2 Feature differences


Both are popular, advanced open source search engines. Both are built around the same core underlying search library, Lucene, but they are also different. Like everything, each has its strengths and weaknesses, and each may be better or worse depending on your needs and expectations. Solr and Elasticsearch are both evolving rapidly, so without further ado, let's look at the list of differences:

Feature | Solr/SolrCloud | Elasticsearch
Community and developers | Apache Software Foundation and community support | Single commercial entity and its employees
Node discovery | Apache ZooKeeper, mature and battle-tested in a large number of projects | Zen, built into Elasticsearch itself; requires dedicated master nodes for split-brain protection
Shard placement | Static in nature, requires manual work to migrate shards; starting with Solr 7, the Autoscaling API allows some dynamic actions | Dynamic, shards can be moved on demand depending on the cluster state
Caches | Global, invalidated with each segment change | Per segment, better suited to dynamically changing data
Analytics engine performance | Well suited to exact calculations on static data | Exactness of the results depends on data placement
Full-text search | Lucene-based language analysis, multiple suggesters, spell checkers, rich highlighting support | Lucene-based language analysis, a single suggest API, highlighting rescoring
DevOps friendliness | Not fully there yet, but coming | Very good APIs
Non-flat data handling | Nested documents and parent-child support | Natural support through nested and object types, allowing virtually unlimited nesting, plus parent-child support
Query DSL | JSON (limited), XML (limited), or URL parameters | JSON
Index/collection leader control | Leader placement control and leader rebalancing to even out the load on the nodes | Not possible
Machine learning | Built in, on top of streaming aggregations, focused on logistic regression, plus a learning-to-rank contrib module | Commercial feature, focused on anomalies and outliers as well as time-series data
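
The "exactness depends on data placement" entry in the table is easier to see with a toy example. In the Python sketch below, each shard reports only its own top terms before the results are merged, so a term that is second on every shard can be missed entirely. The shard contents are invented and the function names and `shard_size` parameter are chosen for this sketch; it models the general effect rather than either engine's real implementation.

```python
from collections import Counter

def exact_top(shards, k):
    """Merge full counts from every shard, then take the top k."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total.most_common(k)

def approximate_top(shards, k, shard_size):
    """Each shard reports only its own top shard_size terms before merging."""
    total = Counter()
    for shard in shards:
        total.update(dict(Counter(shard).most_common(shard_size)))
    return total.most_common(k)

# Hypothetical per-shard term counts.
shards = [
    {"a": 10, "b": 9, "c": 1},
    {"c": 10, "b": 9, "a": 1},
]
print(exact_top(shards, 1))
print(approximate_top(shards, 1, shard_size=1))
```

Here "b" is the true overall winner with 18 occurrences, yet it never appears in any shard's top-1 list, so the approximate merge cannot report it. If the same documents were placed differently across shards, the approximate answer could change.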


4 | 3 Comprehensive comparison


In addition, we analyze the following aspects:

  • Recent trends
    Let's look at Google search trends for the two products. Google Trends shows that Elasticsearch attracts far more interest than Solr, but that does not mean Apache Solr is dead. Although some may disagree, Solr remains one of the most popular search engines, with strong community and open source support.

  • Installation and configuration
    Compared with Solr, Elasticsearch is easy to install and very lightweight. You can install and run Elasticsearch within minutes.
    However, this ease of deployment and use can become a problem if Elasticsearch is not managed properly. JSON-based configuration is simple, but if you want to add comments to each setting in the file, it is not for you.
    Overall, if your application uses JSON, Elasticsearch is the better choice. Otherwise, use Solr, because its schema.xml and solrconfig.xml are well documented.

  • Community
    Solr has a larger, more mature community of users, developers, and contributors. Elasticsearch's community is smaller, but it has an active user community and a growing community of contributors.
    Solr is a true open source community project. Anyone can contribute to Solr, and new Solr developers (also called committers) are elected based on merit. Elasticsearch is technically open source, but less so in spirit. Anyone can see the source, and anyone can change it and offer a contribution, but only employees of the company behind Elasticsearch can actually make changes to Elasticsearch.
    Solr committers and contributors come from many different organizations, while Elasticsearch committers come from a single company.

  • Maturity
    Solr is more mature, but Elasticsearch is growing rapidly, and I consider it stable.

  • Documentation
    Solr scores high here. It is a very well-documented product, with clear examples and contextual scenarios for API use. Elasticsearch's documentation is well organized, but it lacks good examples and clear configuration instructions.

5 | 0 Summary


So in the end, Solr or Elasticsearch?
Sometimes it is hard to find a clear answer. Whether you choose Solr or Elasticsearch, you first need to understand your use cases and future needs. Here is a summary of the properties of each.

Remember:

  • Because it is easy to use, Elasticsearch is more popular among new developers. However, if you are already used to working with Solr, keep using it, because migrating to Elasticsearch offers no particular advantage.

  • If, in addition to searching text, you also need it to handle analytical queries, Elasticsearch is the better choice.

  • If you need distributed indexing, choose Elasticsearch. For cloud and distributed environments that need good scalability and performance, Elasticsearch is the better choice.

  • Both have good commercial support (consulting, production support, integration, etc.).

  • Both have good operational tooling, although Elasticsearch, with its easy-to-use API, has attracted more of the DevOps crowd, which has produced a livelier tool ecosystem around it.

  • Elasticsearch dominates open source log management: many organizations index their logs in Elasticsearch to make them searchable. Solr can now be used for this purpose too, but it missed the boat on this one.

  • Solr is still more oriented toward text search. Elasticsearch, on the other hand, is often used for filtering and grouping, that is, analytical query workloads, and not necessarily text search. Elasticsearch developers have put a lot of effort into making such queries more efficient at both the Lucene and Elasticsearch levels (lower memory footprint and CPU usage). Therefore, for applications that need not only text search but also complex search-time aggregations, Elasticsearch is the better choice.

  • Elasticsearch is easier to get started with: one download and one command start everything. Solr has traditionally required more work and knowledge, but it has recently made great strides in eliminating this and now just has to work on changing its reputation.

  • In terms of performance, they are roughly the same. I say "roughly" because nobody has done a comprehensive and unbiased benchmark. For 95% of use cases, either option will be fine performance-wise; the remaining 5% need to test both solutions with their specific data and access patterns.

  • Operationally, Elasticsearch is simple to work with: it is just a single process. Solr, in its Elasticsearch-like fully distributed deployment mode, SolrCloud, depends on Apache ZooKeeper. ZooKeeper is very mature and very widely used, but it is still another moving part. That said, if you are using Hadoop, HBase, Spark, Kafka, or other newer distributed software, you are probably already running ZooKeeper somewhere in your organization.

  • Although Elasticsearch has a built-in ZooKeeper-like component called Zen, ZooKeeper is better at preventing the dreaded split-brain problem that sometimes occurs in Elasticsearch clusters. To be fair, the Elasticsearch developers are aware of the problem and are working to improve this aspect of Elasticsearch.

  • If you love monitoring and metrics, Elasticsearch will put you in heaven. It exposes more metrics than there are people squeezed into Times Square on New Year's Eve! Solr exposes the key metrics, but nowhere near as many as Elasticsearch.

In short, both are feature-rich search engines that deliver more or less the same performance as long as they are properly designed and implemented. The overall content of this article is summarized in the following mind map, carefully drawn and provided by fellow cnblogs user ReyCG.


Origin www.cnblogs.com/aibabel/p/11449207.html