Full-text search engines

This article is reproduced from the blog of xum2008. It introduces 13 existing open source search engines that you can use in your own projects to implement retrieval.

1. Lucene

Lucene is written in Java and is the most famous open source search engine in the Java family; it is effectively the standard full-text search library in the Java world. It provides a complete query engine and indexing engine. It has no Chinese word segmentation engine, so you need to implement one yourself, and building a search engine with Lucene means designing the surrounding architecture yourself. In addition, it does not support real-time search, although LinkedIn and Twitter have each built real-time search improvements on top of Lucene. Lucene has a C++ port called CLucene; because CLucene is written in C++, it is theoretically faster than Lucene.

Official homepage: http://lucene.apache.org/

CLucene official homepage: http://sourceforge.net/projects/clucene/

2. Sphinx

Sphinx is an open source search engine written in C++ and is one of the more mainstream search engines. It builds indexes about 50% faster than Lucene, but its index files are twice as large as Lucene's, so Sphinx's strategy is to trade space for indexing time. Its retrieval speed is similar to Lucene's, but Lucene is better than Sphinx in retrieval accuracy. Lucene is also better than Sphinx in terms of how difficult it is to add a Chinese word segmentation engine. Sphinx does support real-time search, and it is relatively simple and convenient to use.

Official homepage: http://sphinxsearch.com/about/sphinx/

3. Xapian

Xapian is a full-text retrieval library written in C++. Its API and retrieval principles are similar to Lucene's in many respects, so it can be regarded as filling the gap that Lucene leaves in C++.

Official homepage: http://xapian.org/

4. Nutch

Nutch is an open source web search engine implemented in Java. It includes a crawler, an indexing engine, and a query engine. Nutch is built on Lucene, which provides its text indexing and search APIs.

As for whether to use Lucene or Nutch: if you do not need to crawl data, use Lucene. The most common scenario is that you already have a data source and need to provide a search page for it. In that case, the best approach is to extract the data directly from the database and index it with the Lucene API.
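The "extract from the database, then index" workflow can be sketched in a few lines of Python. This is only a toy illustration of the idea, not the Lucene API: the `articles` table, its columns, and the `tokenize`, `build_index`, and `search` helpers are all hypothetical names, and an in-memory SQLite database stands in for the real data source.

```python
import re
import sqlite3
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(conn):
    """Read every row of the (hypothetical) articles table and build an
    inverted index that maps each term to the set of row ids containing it."""
    index = defaultdict(set)
    for doc_id, body in conn.execute("SELECT id, body FROM articles"):
        for term in tokenize(body):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of rows that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical data source: an in-memory table standing in for the
# application's real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO articles (id, body) VALUES (?, ?)",
    [
        (1, "Lucene is a full-text search library"),
        (2, "Nutch adds a crawler on top of Lucene"),
        (3, "Sphinx is written in C++"),
    ],
)
index = build_index(conn)
print(search(index, "lucene"))          # [1, 2]
print(search(index, "lucene crawler"))  # [2]
```

A real indexing library adds analysis, ranking, and on-disk storage on top of this, but the overall shape (read rows, tokenize, write postings, intersect at query time) is the same.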

Official homepage: http://nutch.apache.org/

5. DataparkSearch

DataparkSearch is an open source search engine implemented in C. Its web page ranking is based on a neural network model. It can download web pages over HTTP, HTTPS, FTP, NNTP, and other protocols. It includes an indexing engine, a retrieval engine, and a Chinese word segmentation engine (this is also the only open source search engine that includes a Chinese word segmentation engine). It supports customized search results and keeps complete logs.

Official homepage: http://www.dataparksearch.org/

6. Zettair

Zettair is an experimental full-text retrieval system based on the research of Justin Zobel, implemented in C. Justin Zobel is very well known in the field of full-text retrieval; he was the first in the industry to systematically propose a differential compression algorithm for inverted indexes. Compressing the inverted lists greatly improves retrieval and loading performance and keeps the space expansion rate of the index very low. Because Zettair comes from academia (the code was written by the Search Engine Group at RMIT University), the code is concise and the algorithms are efficient, making it a very good example for learning the classic inverted index algorithms. It supports Linux, Windows, Mac OS, and other systems.
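To make "differential compression of an inverted list" concrete, here is a minimal sketch of one common scheme: store the gaps between sorted document ids instead of the ids themselves, then write each gap as a variable-length byte sequence. It is not taken from Zettair's source and may differ from the exact coding Zettair uses; the function names are hypothetical, and the code assumes the ids are positive and sorted in ascending order.

```python
def encode_postings(doc_ids):
    """Delta-encode a sorted list of document ids, then write each gap
    as a variable-length byte sequence (7 data bits per byte)."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Emit the low-order 7-bit groups first; a set high bit means
        # "more bytes follow for this gap".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Reverse encode_postings: rebuild each gap, then sum the gaps
    back into absolute document ids."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


postings = [3, 7, 1000, 1002, 500000]
encoded = encode_postings(postings)
print(len(encoded))                  # 8 bytes for five ids
print(decode_postings(encoded))      # [3, 7, 1000, 1002, 500000]
```

Because ids in a posting list tend to be close together, most gaps fit in a single byte, which is where the space saving comes from.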

Official homepage: http://www.seg.rmit.edu.au/zettair/about.html

7. Indri

Indri is a full-text search engine written in C and C++. It is an open source project jointly launched by the University of Massachusetts and Carnegie Mellon University. It is cross-platform, and its API supports Java, PHP, and C++.

Official homepage: http://www.lemurproject.org/indri/

8. Terrier

Terrier is a full-text retrieval system written in Java by the School of Computing Science at the University of Glasgow.

Official homepage: http://terrier.org/

9. Galago

Galago is a text search toolkit written in Java. It includes an indexing engine and a query engine, as well as a distributed computing framework called TupleFlow (similar to Google's MapReduce). It supports much of the Indri query language.

Official homepage: http://www.galagosearch.org/

10. Zebra

Zebra is a retrieval program implemented in C. It features support for large data sets and for data in formats such as email, XML, and MARC.

Official homepage: https://www.indexdata.com/zebra

11. Solr

Solr is a standalone enterprise-level search server written in Java that exposes a web-service-like API. It is a full-text retrieval server built on Lucene and can be regarded as a Lucene derivative. Many first-tier Internet companies use Solr, and it is a mature solution.

Official homepage: http://lucene.apache.org/solr/

12. Elasticsearch

Elasticsearch is an open source, distributed search engine written in Java and built on Lucene. Designed for cloud computing, it provides real-time search and is stable and reliable. The data model of Elasticsearch is JSON.
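As a rough illustration of what a JSON data model means in practice, the sketch below builds a document and a full-text `match` query as plain Python dictionaries and serializes them to JSON, the form in which they would be sent to the server. The field names (`title`, `tags`) and their values are hypothetical, and no Elasticsearch client or server is involved here.

```python
import json

# Hypothetical document: the field names are illustrative, not a required schema.
document = {
    "title": "Full-text search engines",
    "tags": ["search", "lucene"],
}

# A full-text "match" query in Elasticsearch's JSON query DSL, searching
# the hypothetical "title" field.
query = {"query": {"match": {"title": "search engines"}}}

# Both the document and the query travel as JSON text.
document_body = json.dumps(document)
query_body = json.dumps(query)
print(document_body)
print(query_body)
```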

Official homepage: http://www.elasticsearch.org/

13. Whoosh

Whoosh is an open source search engine written in pure Python.

Official homepage: https://bitbucket.org/mchaput/whoosh/wiki/Home
