Lucene full-text index

2. Lucene full-text search process
2.1. Indexing and searching flowchart


Green indicates the indexing process: an index library is built from the original content to be searched. Indexing consists of: determine the original content to be searched -> acquire the documents -> create Document objects -> analyze the documents -> index the documents.
Red indicates the search process: content is searched from the index library. Searching consists of: the user enters a query through the search interface -> create a query -> execute the search against the index library -> render the search results.
2.2. Creating an index
Steps:

Obtain the original documents

The original documents: the data you want to search against.
A search engine: uses crawlers to obtain the original documents.
A site search: the data in the database.
This case: read the files on disk directly with an IO stream.
Build Document objects

Create a Document object for each original document.
Each Document object contains multiple fields (Field).
The data of the original document is stored in the fields.
A field has a name and a value.
Each document has a unique number, the document id.
Note: each Document can have multiple Fields; different Documents can have different Fields; different Documents can also have the same Field (same field name and field value).

Analyze the documents

Analysis is the process of tokenization:
split the string on whitespace to get a list of words;
convert the words to lowercase;
remove punctuation;
remove stop words (meaningless words);
wrap each keyword in a Term object.
A Term contains two parts:
the field the keyword comes from,
and the keyword itself.
The same keyword produced in different fields yields different Terms.
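The analysis steps above (split, lowercase, strip punctuation, drop stop words) can be sketched in plain Java. This is only an illustration of the idea, not Lucene's actual analyzer code, and the stop-word list here is a made-up sample:

```java
import java.util.*;
import java.util.stream.*;

public class SimpleAnalysis {
    // A tiny sample stop-word list; real analyzers ship much larger ones.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "and", "of");

    public static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))           // split on whitespace
                .map(String::toLowerCase)                  // normalize to lowercase
                .map(t -> t.replaceAll("\\p{Punct}", ""))  // strip punctuation
                .filter(t -> !t.isEmpty())
                .filter(t -> !STOP_WORDS.contains(t))      // drop stop words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // → [spring, framework, provides, comprehensive, programming, model]
        System.out.println(analyze("The Spring Framework provides a comprehensive programming model."));
    }
}
```

Lucene's real analyzers implement the same kind of pipeline as a chain of Tokenizer and TokenFilter components.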
Create the index

Create an index from the keyword list and store it in the index library.
The index library contains:
the index itself,
the Document objects,
and the mapping between keywords and documents.
Because documents are found through the keywords (rather than by scanning every document), this structure is called an inverted index. As shown below:

The inverted index structure has two parts, the index and the documents: the index is the vocabulary, which is small, while the document collection is large.
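The keyword-to-document mapping described above can be sketched with a plain Java map. A real Lucene index additionally stores term frequencies, positions, and more, so this is only a toy illustration:

```java
import java.util.*;

public class InvertedIndex {
    // keyword -> ids of the documents containing it
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        // Very naive analysis: lowercase and split on whitespace
        for (String term : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the documents for one keyword without scanning any document
    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "spring mybatis tutorial");
        idx.addDocument(2, "spring boot guide");
        System.out.println(idx.search("spring")); // → [1, 2]
    }
}
```

The vocabulary (the map keys) stays small even when the document collection grows large, which is exactly why the lookup is fast.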
2.3. Query Index
User query interface

Where the user enters the query conditions

For example: Baidu's search box


Wrap the keyword in a query object (create a query)

The field to query,
the keyword to search for.

Execute the query

Search the index of the corresponding field by keyword.
Once the keyword is found, find the documents associated with it.

Render the results

Find the Document objects by document id,
highlight the keywords,
paginate the results,
and finally present them to the user.
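The paging step above boils down to simple offset arithmetic over the hit list; a minimal sketch (the page size and document ids here are illustrative, and Lucene offers its own mechanisms for deep paging):

```java
import java.util.*;

public class Paging {
    // Return one page of results from the full hit list (1-based page numbers).
    public static <T> List<T> page(List<T> hits, int page, int pageSize) {
        int from = (page - 1) * pageSize;
        int to = Math.min(from + pageSize, hits.size());
        if (from >= hits.size()) return List.of(); // past the last page
        return hits.subList(from, to);
    }

    public static void main(String[] args) {
        List<Integer> docIds = List.of(10, 11, 12, 13, 14);
        System.out.println(page(docIds, 2, 2)); // → [12, 13]
    }
}
```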
3. A first program
3.1. Configuring the development environment
Lucene download

Lucene is a toolkit for developing full-text search functionality. Download lucene-7.4.0 from the official website and extract it.


Official website: http://lucene.apache.org/
Version: lucene-7.4.0
JDK requirement: 1.8 or higher
Jar packages used:

lucene-core-7.4.0.jar

lucene-analyzers-common-7.4.0.jar

3.2. Requirements
Implement a document search: find files by keyword. Any file whose name or content contains the keyword must be found. It should also support queries on Chinese words and queries with multiple conditions.
In this case the original content is the set of files on disk, as shown below:


3.3. Creating the index
Implementation steps:

Step 1: create a Java project and import the jar packages.
Step 2: create an IndexWriter object.
1) A Directory object specifying the location of the index library.
2) An IndexWriterConfig object.
Step 3: create Document objects.
Step 4: create Field objects and add the fields to the Document objects.
Step 5: use the IndexWriter object to write the Document objects into the index library; this is the index-creation process. Both the index and the Document objects are written to the index library.
Step 6: close the IndexWriter object.

Code:

/**
 * @Auther: lss
 * @Date: 2019/5/7 17:27
 * @Description:
 */
public class LuceneFirst {

    @Test
    public void createIndex() throws IOException {
        // Create a Directory object that specifies the location of the index library
        // Keep the index library in memory:
        // Directory directory = new RAMDirectory();
        // Store the index library on disk:
        Directory directory = FSDirectory.open(new File("D:\\IDEA1\\lelucene\\index").toPath());
        // Create an IndexWriter object based on the Directory object
        IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig());
        // Read the files on disk and create a Document object for each file
        File dir = new File("D:\\searchsource");
        File[] files = dir.listFiles();
        for (File file : files) {
            // Get the file name
            String fileName = file.getName();
            // The path of the file
            String filePath = file.getPath();
            // The content of the file
            String fileContent = FileUtils.readFileToString(file, "utf-8");
            // The size of the file
            long fileSize = FileUtils.sizeOf(file);
            // Create the fields
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new TextField("path", filePath, Field.Store.YES);
            Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
            Field fieldSize = new TextField("size", fileSize + "", Field.Store.YES);
            // Create a Document object
            Document document = new Document();
            // Add the fields to the Document object
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSize);
            // Write the Document object into the index library
            writer.addDocument(document);
        }
        // Close the IndexWriter object
        writer.close();
    }
}

Run the test: the index library is generated.

The generated files cannot be read with an ordinary text editor. Fortunately there is a small tool, Luke, that can view the index library. (The tool is available for download.)

Use Luke to view the contents of the index library

The version we use is luke-7.4.0, matching the Lucene version, so it can open the index library created by Lucene 7.4.0. Note that this version of Luke is compiled with JDK 9, so running the tool also requires JDK 9 (PS: JDK 1.8 seems to work too).

3.4. Querying the index
Implementation steps:

Step 1: create a Directory object pointing at the location of the index library.
Step 2: create an IndexReader object; it needs the Directory object.
Step 3: create an IndexSearcher object; it needs the IndexReader object.
Step 4: create a TermQuery object specifying the field to query and the query keyword.
Step 5: execute the query.
Step 6: get the query results, then traverse and print them.
Step 7: close the IndexReader object.

Code:

@Test
public void searchIndex() throws Exception {
    // Create a Directory object specifying the location of the index library
    Directory directory = FSDirectory.open(new File("D:\\IDEA1\\lelucene\\index").toPath());
    // Create an IndexReader object
    IndexReader indexReader = DirectoryReader.open(directory);
    // Create an IndexSearcher object; its constructor takes the IndexReader object
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Create a Query object, here a TermQuery
    Query query = new TermQuery(new Term("name", "spring"));
    // Execute the query and get a TopDocs object
    // Parameter 1: the query object; parameter 2: the maximum number of records to return
    TopDocs topDocs = indexSearcher.search(query, 10);
    // Get the total number of records in the query result
    System.out.println("Total number of records: " + topDocs.totalHits);
    // Get the list of documents
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    // Print the documents
    for (ScoreDoc doc : scoreDocs) {
        // Get the document id
        int docId = doc.doc;
        // Get the Document object by id
        Document document = indexSearcher.doc(docId);
        System.out.println(document.get("name"));
        System.out.println(document.get("path"));
        // System.out.println(document.get("content"));
        System.out.println(document.get("size"));
        System.out.println("------------------ dividing line");
    }
    // Close the IndexReader object
    indexReader.close();
}
Run the test:

The output is not reproduced here (a screenshot would be unwieldy); you can run the test yourself.

When we inspect the tokenization results with Luke, we find it is unfriendly to Chinese: English is split into words, but Chinese is split character by character.

Searching in English works fine, but for Chinese only single characters can be searched. This problem is what motivates the analyzer introduced next.


4. Analyzer
4.1. The analyzer's tokenization effect
Code:

@Test
public void testTokenStream() throws Exception {
    // Create a standard analyzer object
    Analyzer analyzer = new StandardAnalyzer();
    // Get a TokenStream object
    // First parameter: the field name (anything works here)
    // Second parameter: the text to analyze
    TokenStream tokenStream = analyzer.tokenStream("", "The Spring Framework provides a comprehensive programming and configuration model.");
    // TokenStream tokenStream = analyzer.tokenStream("", "character-by-character segmentation: Chinese is split one character at a time, e.g. 'I love China'");
    // Add a reference so each keyword can be obtained
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // Add an offset reference recording the start and end position of each keyword
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    // Move the pointer to the head of the list
    tokenStream.reset();
    // Iterate over the keyword list; incrementToken() returns false when the list ends
    while (tokenStream.incrementToken()) {
        // Start position of the keyword
        // System.out.println("start->" + offsetAttribute.startOffset());
        // Get the keyword
        System.out.println(charTermAttribute);
        // End position
        // System.out.println("end->" + offsetAttribute.endOffset());
    }
    // Close the TokenStream object
    tokenStream.close();
}
**Run the test: English**

Chinese:

It is unfriendly to Chinese, which won't do. Looking back at the index-creation code, we find that no analyzer was specified, so the default StandardAnalyzer is used.

Chinese analyzers are introduced below.

4.2. Chinese analyzers
4.2.1. Lucene's built-in Chinese analyzers
StandardAnalyzer:

Character-by-character segmentation: Chinese is split one character at a time. For example, "I love China" is split into
"I", "love", "mid", "country" (one Chinese character per token).

SmartChineseAnalyzer:

Good support for Chinese, but poor extensibility: extending the vocabulary, the stop-word dictionary, and similar customizations are hard to handle.

4.2.2. IKAnalyzer

Usage:
Step 1: add the jar package to the project.
Step 2: add the configuration file, the extension dictionary, and the stop-word dictionary to the classpath (hotword.dic and stopword.dic, together with IKAnalyzer.cfg.xml).

Note: hotword.dic and ext_stopword.dic must be encoded as UTF-8 without a BOM. In particular, do not edit the dictionary files with Windows Notepad.

Use EditPlus.exe to save them as UTF-8 without a BOM, as shown below:

- **Extension dictionary:** adds some new words
- **Stop-word dictionary:** meaningless or sensitive words
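To guard against the BOM problem mentioned above, a dictionary loader can strip a stray byte-order mark when reading the file. This is only an illustrative sketch, not IKAnalyzer's actual loading code (the hotword.dic file name is reused from the text; the sample words are made up):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class DictLoader {
    // Load a dictionary file as UTF-8, skipping blank lines and stripping
    // a byte-order mark if an editor such as Notepad added one.
    public static List<String> load(Path file) throws IOException {
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String w = line.replace("\uFEFF", "").trim(); // strip a stray BOM
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        Path dict = Files.createTempFile("hotword", ".dic");
        // Simulate a file saved with a BOM by a Windows editor
        Files.write(dict, "\uFEFFnewword1\nnewword2\n".getBytes(StandardCharsets.UTF_8));
        // The BOM does not leak into the first word
        System.out.println(load(dict));
        Files.delete(dict);
    }
}
```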

4.3. Using a custom analyzer
Code:

@Test
public void addDocument() throws Exception {
    // Create a Directory object pointing at the index library
    Directory directory = FSDirectory.open(new File("D:\\IDEA1\\lelucene\\index").toPath());
    // Create an IndexWriter object, using IKAnalyzer as the analyzer
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
    IndexWriter writer = new IndexWriter(directory, indexWriterConfig);
    // Create a Document object
    Document document = new Document();
    // Add fields to the Document object
    document.add(new TextField("name", "newly added file", Field.Store.YES));
    document.add(new TextField("content", "content of the newly added file", Field.Store.NO));
    document.add(new StoredField("path", "C:/temp/hello"));
    // Write the document into the index library
    writer.addDocument(document);
    // Close the index library
    writer.close();
}

Test the tokenization effect of the code above:

From now on, when creating an index library, use IKAnalyzer.
Origin www.cnblogs.com/hyhy904/p/10961716.html