【Project Practice】Introduction to the Lucene API

1. Search Engine Framework

What is the difference between Lucene and Solr? When should each be used?
First of all, Solr is built on Lucene. Lucene is an information retrieval toolkit, not a complete search engine system. It provides the index structure, tools for reading and writing indexes, relevance scoring, and sorting. When using Lucene directly, you therefore still have to take care of the rest of the search engine system yourself, such as data acquisition, parsing, and word segmentation.

The goal of Solr is to be an enterprise-level search engine system, so it is closer to what we usually think of as a search engine. It runs as a search service: your application uses search through Solr's APIs, without coupling the search logic into the application itself. Solr can also define how data is analyzed through configuration files, which makes it more like a search framework, and it supports operations such as master-slave replication and hot swapping of the index. It also adds common search engine features such as hit highlighting and faceting.

Therefore, Lucene is more flexible, but you have to handle the search engine architecture and any additional features yourself. Solr does more for you, but it is a higher-level framework, and new Lucene features are not always exposed through Solr right away. You may sometimes find that a feature you need is supported by Lucene but has no corresponding interface in Solr.

2. Explanation of Lucene vs Solr

Many people new to Lucene and Solr will ask the obvious question: Should I use Lucene or Solr? The answer is simple: if you're asking yourself this question, in 99% of situations, what you want to use is Solr. A simple way to conceptualize the relationship between Solr and Lucene is that of a car and its engine. You can't drive an engine, but you can drive a car. Similarly, Lucene is a programmatic library which you can't use as-is, whereas Solr is a complete application which you can use out of the box.

3. More explanation

Lucene is more like an SDK. It offers a complete family of APIs with corresponding implementations, which you can use to build advanced queries (based on inverted index technology) into your own application. Lucene is practical and convenient for stand-alone or desktop applications. However, it requires developers to maintain the index files themselves, and backing up and synchronizing index files across multiple machines is troublesome. This is where Solr comes in. Solr is a Lucene-based query server with an HTTP interface that encapsulates many Lucene details. Your application can send HTTP GET/POST requests such as .../solr?q=abc to query the index as well as to maintain and modify it. As an analogy, Lucene gives you a set of packages for building a database from the ground up, while Solr is a finished database program that can be used right after installation.
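To make the term "inverted index" concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's data structure: the class name ToyInvertedIndex and its methods are made up for illustration, and tokenization is reduced to lowercasing and splitting on whitespace.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not a Lucene class): maps each term to the ids of documents containing it. */
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lowercases the text, splits it on whitespace, and records docId under every term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents containing the term, in ascending order. */
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null
                ? Collections.<Integer>emptySortedSet()
                : Collections.unmodifiableSortedSet(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "Solr is a search server");
        System.out.println(index.lookup("search")); // [1, 2]
        System.out.println(index.lookup("solr"));   // [2]
    }
}
```

The point of the structure is that a query looks up a term directly instead of scanning every document, which is why lookups stay fast as the collection grows.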

4. Comparison

Lucene is a search engine library: a Java library that provides a Java API and can be used in any application (including Solr, of course). Solr is a search server, essentially an HTTP wrapper around Lucene that exposes an HTTP API (although this description is not entirely accurate, since Solr also provides many other features). Solr depends on Lucene, and its purpose is to make Lucene more convenient to use. Its main competitor is Elasticsearch, which is also based on Lucene. Solr is more often used in combination with Nutch.

1. Five basic classes in Lucene's API

To index documents, Lucene provides five basic classes:

  • Document
  • Field
  • IndexWriter
  • Analyzer
  • Directory

1.1 Document

  • Used to describe a document.
  • A document can refer to an HTML page, an email, or a text file.
  • A Document object is composed of multiple Field objects.
  • A Document object = a record in the database, each Field object = a field of the record.

1.2 Field

  • Used to describe one property of a document.
  • For example, the title and content of an email can be described by two Field objects.
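The record-and-field analogy can be sketched in plain Java as follows. This is a toy model for illustration only: ToyDocument and ToyField are hypothetical names, not Lucene's Document and Field classes, and the email text is made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-ins for Document and Field (hypothetical, not Lucene's own classes). */
class ToyDocumentDemo {
    /** One named property of a document, for example "title". */
    static final class ToyField {
        final String name;
        final String value;

        ToyField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** A document is a collection of fields, like a database record made of columns. */
    static final class ToyDocument {
        private final List<ToyField> fields = new ArrayList<>();

        /** Adds a field and returns this document so that calls can be chained. */
        ToyDocument add(String name, String value) {
            fields.add(new ToyField(name, value));
            return this;
        }

        /** Returns the value of the first field with this name, or null if there is none. */
        String get(String name) {
            for (ToyField f : fields) {
                if (f.name.equals(name)) {
                    return f.value;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        // An email described by two fields, as in the example above.
        ToyDocument email = new ToyDocument()
                .add("title", "Meeting moved to 3 pm")
                .add("content", "The weekly sync now starts at 3 pm on Thursday.");
        System.out.println(email.get("title"));
    }
}
```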

1.3 Analyzer

  • Before a document is indexed, its content must first be tokenized (word-segmented). This work is done by the Analyzer.
  • The Analyzer class is an abstract class that has multiple implementations.
  • Choose a suitable Analyzer for the language and application at hand.
  • The Analyzer passes the tokenized content to IndexWriter to build the index.
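What an analyzer does to text can be sketched in plain Java as below. This is a toy under stated assumptions, not Lucene's Analyzer API: ToyAnalyzer, its stop-word list, and the split rule are all made up for illustration, and the approach only suits languages that separate words with spaces or punctuation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer (hypothetical, not a Lucene class): lowercases, splits, and drops stop words. */
class ToyAnalyzer {
    // A tiny, made-up stop-word list for the example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    /** Turns raw text into the list of terms that would be handed to the indexer. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on every run of characters that are neither letters nor digits.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Analyzer splits text into a stream of tokens."));
        // [analyzer, splits, text, into, stream, tokens]
    }
}
```

Text in a language such as Chinese has no spaces between words, so a split rule like this one would not work there. That is one reason a different Analyzer is chosen per language.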

1.4 IndexWriter

  • A core class used by Lucene to create indexes
  • Function: Add each Document object to the index.

1.5 Directory

  • The storage location of Lucene's index
  • An abstract class with several implementations, two of which are covered here:
      • FSDirectory, which represents an index stored in the file system.
      • RAMDirectory, which represents an index stored in memory. RAMDirectory belongs to older Lucene versions: it was deprecated in Lucene 8 and removed in Lucene 9 in favor of ByteBuffersDirectory.
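The idea behind Directory, one storage abstraction with a file-system and an in-memory implementation, can be sketched in plain Java like this. The names ToyDirectory, ToyFsDirectory, and ToyRamDirectory and the file name segment.txt are hypothetical; Lucene's real Directory API is richer and differs in its details.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Toy storage abstraction (hypothetical, not Lucene's Directory API). */
class ToyDirectoryDemo {
    /** Callers read and write named files without knowing where they are stored. */
    interface ToyDirectory {
        void write(String name, byte[] data) throws IOException;

        byte[] read(String name) throws IOException;
    }

    /** Keeps files on disk under a root folder, in the spirit of FSDirectory. */
    static final class ToyFsDirectory implements ToyDirectory {
        private final Path root;

        ToyFsDirectory(Path root) {
            this.root = root;
        }

        public void write(String name, byte[] data) throws IOException {
            Files.write(root.resolve(name), data);
        }

        public byte[] read(String name) throws IOException {
            return Files.readAllBytes(root.resolve(name));
        }
    }

    /** Keeps files in a map on the heap, in the spirit of an in-memory directory. */
    static final class ToyRamDirectory implements ToyDirectory {
        private final Map<String, byte[]> files = new HashMap<>();

        public void write(String name, byte[] data) {
            files.put(name, data.clone());
        }

        public byte[] read(String name) throws IOException {
            byte[] data = files.get(name);
            if (data == null) {
                throw new IOException("no such file: " + name);
            }
            return data.clone();
        }
    }

    /** Writes text and reads it back, using only the abstraction. */
    static String roundTrip(ToyDirectory dir, String text) throws IOException {
        dir.write("segment.txt", text.getBytes(StandardCharsets.UTF_8));
        return new String(dir.read("segment.txt"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(new ToyRamDirectory(), "in memory"));
        Path tmp = Files.createTempDirectory("toy-index");
        System.out.println(roundTrip(new ToyFsDirectory(tmp), "on disk"));
    }
}
```

Because roundTrip depends only on the interface, the indexing code does not change when the storage location does. This is the benefit the abstract Directory class gives Lucene.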

2. Query class in Lucene API

Query is an abstract class with multiple implementations, such as TermQuery, BooleanQuery, and PrefixQuery. Its purpose is to represent the query entered by the user in a form that Lucene can recognize and execute.
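How one abstract query type can have term, prefix, and boolean variants is sketched below in plain Java. These Toy classes are hypothetical stand-ins, not Lucene's Query classes. In particular, the toy boolean query treats every clause as required, whereas Lucene's BooleanQuery also supports optional and prohibited clauses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy query hierarchy (hypothetical, not Lucene's Query classes). */
class ToyQueryDemo {
    /** A query decides whether the set of terms in a document matches. */
    interface ToyQuery {
        boolean matches(Set<String> terms);
    }

    /** Matches documents that contain exactly this term. */
    static final class ToyTermQuery implements ToyQuery {
        private final String term;

        ToyTermQuery(String term) {
            this.term = term;
        }

        public boolean matches(Set<String> terms) {
            return terms.contains(term);
        }
    }

    /** Matches documents that contain at least one term starting with the prefix. */
    static final class ToyPrefixQuery implements ToyQuery {
        private final String prefix;

        ToyPrefixQuery(String prefix) {
            this.prefix = prefix;
        }

        public boolean matches(Set<String> terms) {
            for (String t : terms) {
                if (t.startsWith(prefix)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Simplified boolean query: every clause is required (logical AND). */
    static final class ToyBooleanQuery implements ToyQuery {
        private final List<ToyQuery> clauses;

        ToyBooleanQuery(ToyQuery... clauses) {
            this.clauses = Arrays.asList(clauses);
        }

        public boolean matches(Set<String> terms) {
            for (ToyQuery q : clauses) {
                if (!q.matches(terms)) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Set<String> docTerms = new HashSet<>(Arrays.asList("lucene", "search", "library"));
        ToyQuery both = new ToyBooleanQuery(new ToyTermQuery("lucene"), new ToyPrefixQuery("sea"));
        System.out.println(both.matches(docTerms));                     // true
        System.out.println(new ToyTermQuery("solr").matches(docTerms)); // false
    }
}
```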

3. IndexSearcher in Lucene API

An IndexSearcher runs searches against an index that has already been built.
It opens the index read-only, so multiple IndexSearcher instances can operate on the same index at the same time.

4. Hits in Lucene API

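Holds the search results. Hits belongs to older Lucene versions; it was removed in Lucene 3.0, where a search returns a TopDocs object instead.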
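As a rough picture of what a result container holds, the plain-Java sketch below searches a small list of texts and returns ranked hits. Everything here is hypothetical: ToyHit and the search method are made-up names, and scoring by simple term count is an assumption for the example, not Lucene's relevance formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Toy search returning ranked hits (hypothetical, not Lucene's IndexSearcher or Hits). */
class ToySearchDemo {
    /** One search result: which document matched and how strongly. */
    static final class ToyHit {
        final int docId;
        final int score;

        ToyHit(int docId, int score) {
            this.docId = docId;
            this.score = score;
        }

        @Override
        public String toString() {
            return "doc=" + docId + " score=" + score;
        }
    }

    /** Scores each document by how often the term occurs and returns at most maxHits results. */
    static List<ToyHit> search(List<String> docs, String term, int maxHits) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<ToyHit> hits = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            int count = 0;
            for (String word : docs.get(docId).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.equals(needle)) {
                    count++;
                }
            }
            if (count > 0) {
                hits.add(new ToyHit(docId, count));
            }
        }
        // Highest score first; the sort is stable, so ties stay in ascending docId order.
        hits.sort(Comparator.comparingInt((ToyHit h) -> h.score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(maxHits, hits.size())));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Lucene search and more search",
                "Solr is a server",
                "search with Lucene");
        System.out.println(search(docs, "search", 10)); // [doc=0 score=2, doc=2 score=1]
    }
}
```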

Origin blog.csdn.net/wstever/article/details/129655667