7 Open Source Search Engines for Big Data Processing

Big data is an umbrella term for datasets so large and complex that they require specially designed hardware and software tools to process. Such datasets are typically a terabyte or larger. They come from a variety of sources, including sensors, meteorological observations, and publicly available material such as magazines, newspapers, and articles. Other examples include purchase transaction records, web logs, medical records, military surveillance, video and image archives, and large-scale e-commerce.

Analyzing this data requires specialized software and hardware. This article introduces 7 open source search engines suitable for big data processing:

1. Apache Lucene

Lucene is an open source full-text search toolkit from the Apache Software Foundation. Rather than a finished search application, it is a full-text search architecture that provides a complete query engine, an indexing engine, and part of a text analysis engine. Its purpose is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.

Features:

  • Indexing:
    • Indexes over 150GB of data per hour on modern hardware
    • Small memory requirements: only 1MB of heap
    • Incremental indexing is as fast as batch indexing
    • The index is roughly 20-30% of the size of the text indexed
    • Static index pruning
  • Search:
    • Ranked searching: the best results are returned first
    • Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries, and more
    • Fielded searching (queries can target an individual field)
    • Sorting by any field
    • Multiple-index searching with merged results
    • Allows simultaneous updating and searching
    • Flexible faceting, highlighting, joins, and result grouping
    • Fast, memory-efficient, and typo-tolerant suggesters
    • Pluggable ranking models, including the vector space model (VSM) and Okapi BM25
    • Configurable storage engine (codecs)
  • Cross-platform:
    • 100% pure Java
    • Index-compatible implementations are available in other languages
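To make the "pluggable ranking models" item more concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not Lucene's code: the function name `bm25_scores`, the toy documents, and the whitespace tokenization are assumptions for illustration, and `k1=1.2`, `b=0.75` are simply commonly used parameter values.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query` with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents are penalized via `b`.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "lucene is a full text search library".split(),
    "solr is a search server built on lucene".split(),
    "big data needs specialized hardware".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
# Document indices ordered from best to worst match.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first two documents both contain each query term once, so the shorter one ranks higher because of length normalization, and the third document scores zero because it contains neither term.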

 

2. Apache Solr

Apache Solr (pronounced "solar") is an open source search server. It is written in Java and built on Apache Lucene, and clients talk to it over HTTP. Resources in Solr are stored as Document objects. Each Document consists of a series of Fields, and each Field represents one property of the resource. Every Document needs an attribute that uniquely identifies it. By default this attribute is named id, and it is declared in the schema configuration file as <uniqueKey>id</uniqueKey>.
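As a rough illustration of the Document/Field model, the sketch below builds the JSON body that could be sent to a Solr update handler (Solr accepts a JSON array of documents, typically POSTed to an `/update` endpoint of a collection). Nothing is sent over the network here; the helper name `make_solr_update_body` and the field names other than `id` are hypothetical.

```python
import json

def make_solr_update_body(docs, unique_key="id"):
    """Serialize documents as a JSON array, checking that each has the unique key."""
    for doc in docs:
        if unique_key not in doc:
            raise ValueError(f"document is missing the unique key {unique_key!r}")
    return json.dumps(docs)

# Each dict is one Document; each key/value pair is one Field.
body = make_solr_update_body([
    {"id": "doc-1", "title": "First example", "category": "demo"},
    {"id": "doc-2", "title": "Second example", "category": "demo"},
])
```

The check is only a client-side convenience in this sketch. Note that in Solr, adding a document whose unique key already exists replaces the earlier document rather than creating a duplicate.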

 

 

3. Elasticsearch

Elasticsearch is an open source, distributed, RESTful search engine built on Lucene. Designed for cloud environments, it offers near real-time search and aims to be stable, reliable, fast, and easy to install and use. Documents are indexed as JSON over HTTP.
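The "JSON over HTTP" idea can be sketched with nothing but the Python standard library. The example builds, but deliberately does not send, a request that would index one document. The host `http://localhost:9200`, the index name `articles`, and the document fields are assumptions, and the `/<index>/_doc/<id>` path reflects recent Elasticsearch versions and may differ in older ones.

```python
import json
import urllib.request

def build_index_request(base_url, index, doc_id, document):
    """Build (but do not send) an HTTP request that indexes one JSON document."""
    url = f"{base_url}/{index}/_doc/{doc_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Hypothetical node address, index name, and document.
req = build_index_request(
    "http://localhost:9200",
    "articles",
    "1",
    {"title": "Open source search engines", "tags": ["lucene", "search"]},
)
# Against a running node, the request could be sent with:
# urllib.request.urlopen(req)
```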

 

4. Sphinx

Sphinx is a full-text search engine that works closely with SQL databases. It can be combined with MySQL or PostgreSQL to provide full-text search with more specialized search features than the database itself offers, which makes it easier for applications to implement professional-grade full-text search. Sphinx provides search API bindings for scripting languages such as PHP, Python, Perl, and Ruby, and it also ships a storage engine plug-in for MySQL.

 

5. Xapian

Xapian is a full-text retrieval library written in C++, similar in function to Java's Lucene. While Lucene had already become the standard full-text retrieval library in the Java world, the C/C++ world lacked a comparable tool, and Xapian fills this gap.

 

6. Nutch

Nutch is an open source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

Although web search is a basic requirement for navigating the Internet, the number of web search engines is declining. This could well end with a single company monopolizing almost all web search for its own commercial interests, which is clearly not good for the vast majority of Internet users.

Nutch offers an alternative. As an open source search engine, it is more transparent and therefore more trustworthy than commercial search engines. All major search engines today use proprietary ranking algorithms and do not explain why a page ranks in a particular position. In addition, some search engines rank sites according to what they pay rather than on their own merit. Nutch, by contrast, has nothing to hide and no incentive to distort search results; it aims to give users the best results it can.

Nutch is committed to making it easy and inexpensive for everyone to configure a world-class web search engine. To accomplish this ambitious goal, Nutch must be able to:

  • Fetch billions of web pages every month
  • Maintain an index of these pages
  • Handle thousands of searches per second against that index
  • Provide high-quality search results
  • Operate at minimal cost
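To get a feel for the first goal, a quick back-of-the-envelope calculation shows what "billions of pages every month" means as a sustained fetch rate. The figure of exactly one billion pages and the 30-day month are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def required_fetch_rate(pages_per_month, days_per_month=30):
    """Average pages per second needed to fetch `pages_per_month` pages."""
    return pages_per_month / (days_per_month * SECONDS_PER_DAY)

# One billion pages in a 30-day month.
rate = required_fetch_rate(1_000_000_000)
print(f"{rate:.0f} pages/second")  # prints "386 pages/second"
```

Even the low end of "billions" therefore implies fetching several hundred pages every second around the clock, which is why crawling at this scale is usually spread across many machines.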

 

7. LGTE

LGTE is built on Lucene and provides an extended Lucene API that integrates several services, such as snippet generation and query expansion. It also ships with a set of unit tests.

Features include:

  • An abstraction layer that provides a simple and efficient Lucene API
  • A foundation for integrated retrieval and ranking by topic, time, and geography
  • Supports Lucene's standard retrieval model and adds more advanced probabilistic retrieval models
  • Supports Rocchio query expansion
  • Provides a framework for IR evaluation experiments (e.g. handling CLEF/TREC topics)
  • Contains a Java replacement for the trec_eval tool
  • Contains a simple test application for searching the Braun Corpus or the Cranfield Corpus
  • TREC/CLEF evaluation framework: tools for indexing collections, running topic searches, and writing results in trec_eval format
  • Uses separate folders to provide isolated fields
  • Provides hierarchical indexing through foreign key fields
  • Provides classes for parsing documents with Yahoo PlaceMaker
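For readers unfamiliar with the trec_eval format mentioned above, a run file conventionally has one line per retrieved document with six whitespace-separated columns: topic id, the literal `Q0`, document id, rank, score, and a run tag. The sketch below writes such lines in Python; it is not LGTE code, and the function name, topic id, document ids, and run tag are made up for illustration.

```python
def format_trec_run(topic_id, ranked_docs, run_tag="example_run"):
    """Return run-file lines: topic, literal Q0, doc id, rank, score, run tag."""
    lines = []
    for rank, (doc_id, score) in enumerate(ranked_docs, start=1):
        lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines

# Hypothetical results for one topic, already sorted by descending score.
lines = format_trec_run("401", [("DOC-A", 12.5), ("DOC-B", 9.25)])
for line in lines:
    print(line)
```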

via linuxlinks
