Learning Search Engines (1): Getting to Know Lucene

1. Basic Lucene concepts

Definition: Lucene is a simple toolkit for full-text search over files. It supports Chinese text, keyword queries, and multi-criteria queries; a file is found whenever its name or its contents match the query.

Data classification: structured data (data with a fixed format or limited length) and unstructured data (data of variable length or with no fixed format).

PS: Lucene is the underlying search engine implementation; Solr is a framework that wraps Lucene.

2. Data search

[1] Querying structured data

  Because the data follows a defined format and structure, it is typically queried with SQL statements.

[2] Querying unstructured data

  (1) Sequential scanning: look for the keyword in one document after another. This is inefficient and very slow.

  (2) Full-text search: extract part of the information from the unstructured data and reorganize it into an index. The index is built first and searches then run against it; this build-then-search process is called full-text search. (Example: looking a word up in a dictionary.)

PS: Creating the index is time-consuming, but once created it can be reused many times. Full-text search workloads consist mainly of queries, so the time spent building the index is well worth it.
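
To make the contrast between the two approaches concrete, here is a minimal sketch in plain Java. It does not use Lucene; the class name ScanVsIndex, its methods, and the sample documents are made up for illustration, and splitting on whitespace stands in for real text analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ScanVsIndex {
    // Sequential scan: every query re-reads and re-splits every document.
    static List<Integer> sequentialScan(List<String> docs, String word) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            String[] words = docs.get(id).toLowerCase().split("\\s+");
            if (Arrays.asList(words).contains(word)) {
                hits.add(id);
            }
        }
        return hits;
    }

    // Full-text search: pay the cost once to build word -> document positions.
    static Map<String, List<Integer>> buildIndex(List<String> docs) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String w : docs.get(id).toLowerCase().split("\\s+")) {
                List<Integer> ids = index.computeIfAbsent(w, k -> new ArrayList<>());
                // Skip the id if this document was already recorded for the word.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != id) {
                    ids.add(id);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "lucene is a search library",
                "solr wraps lucene",
                "sql queries tables");
        Map<String, List<Integer>> index = buildIndex(docs); // built once
        System.out.println(sequentialScan(docs, "lucene"));          // scans all documents
        System.out.println(index.getOrDefault("lucene", List.of())); // single map lookup
    }
}
```

Both calls find the same documents. The difference is that the scan touches every document on every query, while the index is built once and each later query is a single lookup.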

3. The search process

 

Indexing:

  1. Read the document contents using a stream
  2. Build a document object (bean) from the contents
  3. Tokenize the document contents
  4. Create the index

Index library: stores both the index and the documents themselves. (It can be seen as a dictionary-like structure: a table of contents plus the actual content.)

Query user interface: the keyword input box that the user types into. It does not refer to an interface in the Java sense (something a class implements).

Creating an index

1. Obtain the original documents

Original documents: the content to be indexed and searched. They can take many forms, including web pages, files on disk, and data in a database.

2. Create a document object

 Lucene document object: a document contains one or more fields, and each document has a unique number, the document id.

  • Each document can have multiple fields
  • Different documents may have different fields
  • A single document can contain the same field more than once (same field name and same field value)
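
The three rules above can be sketched with a small stand-in class. This is not Lucene's own document API; ToyDocument, its nested Field class, and the field names used below are hypothetical, and the id is passed in by the caller only to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

class ToyDocument {
    // A field is a (name, value) pair.
    static final class Field {
        final String name;
        final String value;

        Field(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    final int id; // document id, supplied by the caller in this sketch
    // A list (not a map), so the same field name may appear more than once.
    final List<Field> fields = new ArrayList<>();

    ToyDocument(int id) {
        this.id = id;
    }

    ToyDocument add(String name, String value) {
        fields.add(new Field(name, value));
        return this;
    }

    // All values stored under the given field name, in insertion order.
    List<String> values(String name) {
        List<String> out = new ArrayList<>();
        for (Field f : fields) {
            if (f.name.equals(name)) {
                out.add(f.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Multiple fields, including the same field added twice.
        ToyDocument a = new ToyDocument(0)
                .add("fileName", "a.txt")
                .add("tag", "java")
                .add("tag", "java");
        // A different document with a different set of fields.
        ToyDocument b = new ToyDocument(1).add("title", "notes");
        System.out.println(a.values("tag"));
        System.out.println(b.values("fileName"));
    }
}
```

Storing fields in a list is the design choice that makes repeated fields possible; a map keyed by field name would silently drop the duplicate.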

3. Analyze the documents

Tokenization: split the original document text into words, remove punctuation, remove stop words, and convert uppercase letters to lowercase. The result is a sequence of basic language units (individual words).
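
A rough sketch of these analysis steps in plain Java follows. It is not one of Lucene's analyzers: ToyAnalyzer and its tiny stop-word list are assumptions made for illustration, and because it splits on non-letter characters it would not segment Chinese text into words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {
    // A deliberately tiny, made-up stop-word list.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "in", "of", "and");

    static List<String> analyze(String text) {
        List<String> words = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits;
        // this also discards the punctuation.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String word = raw.toLowerCase(Locale.ROOT); // normalize case
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [lucene, search, library, written, java]
        System.out.println(analyze("Lucene is a Search Library, written in Java!"));
    }
}
```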

Term: each word produced by tokenization is called a term. The same word split out of different fields counts as a different term.

Term structure: similar to a key-value (KV) pair, with the field name as the key (K) and the field value, that is the word itself, as the value (V).
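
The point that the same word in different fields is a different term can be shown with a small value class. ToyTerm is a hypothetical stand-in written for this note, not Lucene's own API; equality here is simply defined over the pair (field name, word).

```java
import java.util.Objects;

final class ToyTerm {
    final String field; // K: the field name
    final String text;  // V: the word taken from that field

    ToyTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Two terms are equal only if both the field name and the word match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ToyTerm)) {
            return false;
        }
        ToyTerm other = (ToyTerm) o;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, text);
    }

    @Override
    public String toString() {
        return field + ":" + text;
    }

    public static void main(String[] args) {
        ToyTerm inTitle = new ToyTerm("title", "lucene");
        ToyTerm inContent = new ToyTerm("content", "lucene");
        System.out.println(inTitle.equals(new ToyTerm("title", "lucene"))); // true
        System.out.println(inTitle.equals(inContent));                      // false
    }
}
```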

4. Create the index

 Index structure: an inverted index, which consists of two parts: the index and the documents.
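
As a rough illustration of the two parts, the sketch below keeps an index part (word to document ids) next to a document part (id to stored text). ToyInvertedIndex is a hypothetical in-memory toy and says nothing about how Lucene actually lays out its index; splitting on non-word characters stands in for real analysis.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class ToyInvertedIndex {
    // Index part: word -> sorted ids of the documents that contain it.
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // Document part: position in the list is the document id.
    private final List<String> documents = new ArrayList<>();

    // Stores the document, indexes its words, and returns the new document id.
    int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Looks the word up in the index part, then fetches the stored documents.
    List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(word.toLowerCase(), new TreeSet<>())) {
            hits.add(documents.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene is a search library");
        index.add("Solr wraps Lucene");
        System.out.println(index.search("lucene")); // both documents
        System.out.println(index.search("solr"));   // only the second one
    }
}
```

A search never scans the stored documents: it goes from the word to the ids in the index part, and only then reads the matching entries from the document part.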

 

 

 

 

 

 

 

Origin www.cnblogs.com/riches/p/11437213.html