First, basic Lucene concepts
Definition: Lucene is a simple toolkit that implements full-text search over files. It supports Chinese, keyword queries, and multi-criteria queries, and it can find any file whose name or contents contain the search terms.
Data classification: structured data (fixed format or finite length) and unstructured data (variable length or no fixed format).
PS: Lucene is the underlying search engine implementation; Solr is a framework that wraps Lucene.
Second, data search
[1] Structured data query
Because the data follows a fixed format and structure, it is usually queried with SQL statements.
[2] Unstructured data query
(1) Sequential scanning: search the documents one at a time, which is inefficient and very slow.
(2) Full-text search: extract part of the information from the unstructured data and reorganize it into an index. Building this index first and then searching the index is called full-text search. (Example: looking up a word in a dictionary.)
PS: Although creating the index is time-consuming, an index can be reused many times once it is built. Full-text search is dominated by queries, so the time spent building the index is well worth it!
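To make the cost of sequential scanning concrete, here is a minimal sketch in plain Java. It is not Lucene code; the class name SequentialScan and the sample file names are made up for illustration. Every query has to walk through the full text of every document, which is exactly the work an index avoids.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: plain Java, not Lucene API.
class SequentialScan {
    // Checks every document in turn, so each query costs time
    // proportional to the total size of all documents.
    static List<String> scan(Map<String, String> docs, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            if (doc.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample documents: file name -> contents.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "Lucene is a search toolkit");
        docs.put("b.txt", "Solr wraps Lucene");
        docs.put("c.txt", "SQL queries structured data");
        System.out.println(scan(docs, "lucene"));
    }
}
```

With three short documents this is instant, but the loop repeats in full for every query, so the cost grows with both the number of documents and the number of searches.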
Third, the search process
Indexing: 1. read the document contents using a stream; 2. build a concrete object (bean) from the document contents; 3. tokenize the document contents; 4. create the index.
Index library: stores both the index and the documents themselves. (It can be viewed as a dictionary-like structure: there is a table of contents, and there is the specific content.)
User query interface: the keyword input box shown to the user. It does not refer to a Java interface that classes implement.
Creating an index
1. Obtain the original documents
The original documents: the content to be indexed and searched. They take many forms, including web pages, files on disk, and data in a database ...
2. Create a document object
Lucene document object: contains one or more fields (sometimes translated as "domains"); each document has a unique number, the document id.
- Each document can have multiple fields
- Different documents may have different fields
- The same document can have several identical fields (same field name and same field value)
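The three rules above can be sketched with a small stand-in model. SimpleDocument and SimpleField are hypothetical classes written for this note, not Lucene's own classes; the point is only that a document is an id plus a list of name/value pairs, and that a list (rather than a map) allows repeated fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a field: a name plus a value.
class SimpleField {
    final String name;
    final String value;

    SimpleField(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical stand-in for a document: a unique id plus a list of fields.
class SimpleDocument {
    final int id; // unique document number
    private final List<SimpleField> fields = new ArrayList<>();

    SimpleDocument(int id) {
        this.id = id;
    }

    // A list keeps duplicates, so the same name/value pair may be added twice.
    SimpleDocument add(String name, String value) {
        fields.add(new SimpleField(name, value));
        return this;
    }

    // Returns every value stored under the given field name, in insertion order.
    List<String> values(String name) {
        List<String> result = new ArrayList<>();
        for (SimpleField f : fields) {
            if (f.name.equals(name)) {
                result.add(f.value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SimpleDocument doc = new SimpleDocument(1)
                .add("fileName", "a.txt")
                .add("tag", "search")
                .add("tag", "search"); // identical field repeated in one document
        System.out.println(doc.values("tag"));
    }
}
```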
3. Analyze the document
Tokenization: split the original document into words, remove punctuation, remove stop words, and convert uppercase characters to lowercase, finally producing language units (individual words).
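The tokenization steps can be sketched as follows. This is a simplified illustration in plain Java, not a Lucene analyzer: the class name SimpleAnalyzer and the tiny stop-word list are assumptions, and splitting on non-letter characters only suits English-like text (Chinese needs a dedicated word segmenter).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: plain Java, not a Lucene analyzer.
class SimpleAnalyzer {
    // A tiny, assumed stop-word list for illustration.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Splitting on anything that is not a letter or digit
        // separates the words and drops punctuation in one step.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading separator yields one empty piece
            }
            String token = raw.toLowerCase(Locale.ROOT); // uppercase -> lowercase
            if (!STOP_WORDS.contains(token)) {           // remove stop words
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a Search Toolkit!"));
        // prints [lucene, search, toolkit]
    }
}
```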
Term: each word produced by tokenization is called a term. The same word split out of different fields is a different term!
Term structure: similar to a key-value pair: the field name (K) and the term value (V).
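A minimal way to picture this is a small value class whose equality depends on both parts. SimpleTerm below is a hypothetical class for illustration, not Lucene's own term class; it shows why the same word in two fields counts as two terms.

```java
import java.util.Objects;

// Hypothetical stand-in for a term: field name (K) plus word (V).
class SimpleTerm {
    final String field;
    final String text;

    SimpleTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Two terms are equal only when both the field and the word match.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof SimpleTerm)) {
            return false;
        }
        SimpleTerm that = (SimpleTerm) other;
        return field.equals(that.field) && text.equals(that.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, text);
    }

    @Override
    public String toString() {
        return field + ":" + text;
    }

    public static void main(String[] args) {
        SimpleTerm inTitle = new SimpleTerm("title", "lucene");
        SimpleTerm inContent = new SimpleTerm("content", "lucene");
        System.out.println(inTitle.equals(inContent)); // prints false
    }
}
```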
4. Create an index
Index structure: an inverted index, made up of two parts: the index and the documents.
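The two parts can be sketched in plain Java as two maps. This is only an illustration of the idea, not how Lucene stores its index on disk; the class InvertedIndex, its method names, and the "field:word" key format are assumptions made for this example. The index part maps each term to the ids of the documents that contain it, and the document part keeps the original field text so a hit can be shown to the user.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Illustrative sketch only: plain Java, not Lucene's index format.
class InvertedIndex {
    // Index part: "field:word" -> sorted ids of documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // Document part: document id -> (field name -> original text).
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void add(int docId, String field, String text) {
        stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, text);
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(field + ":" + word, k -> new TreeSet<>()).add(docId);
        }
    }

    // Looks the term up directly instead of scanning every document.
    List<Integer> search(String field, String word) {
        TreeSet<Integer> ids = postings.get(field + ":" + word.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    // Returns the stored text of one field, or null if it was never added.
    String stored(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "content", "Lucene is a search toolkit");
        index.add(2, "content", "Solr wraps Lucene");
        index.add(3, "title", "Lucene");
        System.out.println(index.search("content", "lucene")); // prints [1, 2]
        System.out.println(index.search("title", "lucene"));   // prints [3]
    }
}
```

The index is built once in add and reused by every call to search, which is the trade-off described earlier: slower to build, much faster to query.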