Lucene: Preliminary Understanding and Learning Notes

I. What is full-text search

1. Data classification

  1. Structured data

    Fixed format, fixed length, fixed data types.

    Example: data in a database.

  2. Unstructured data

    Word documents, PDF documents, e-mail, HTML, txt files.

    The format is not fixed, the length is not fixed, and the data types are not fixed.

2. Data query

  1. Structured data query

    SQL statements query structured data. Simple and fast.

  2. Unstructured data query

    Example: find the files that contain the word "spring" among a set of text files.

    1. Visual inspection.

    2. Program code: read each file into memory and scan it sequentially for a matching string.

    3. Convert the unstructured data into structured data:

      (First split the text on spaces to get a list of words, then create an index based on that word list.)

       Index: a collection of data structures built over part of the data in order to speed up searching.

       Then query the index and find the list of matching documents through the correspondence between words and documents. This process is called full-text search.

3. Full-text search

  Creating an index and then querying that index is called full-text search. (Creating the index takes a long time, but once created it can be used many times, so on average query speed is greatly improved.)

II. Application scenarios of full-text search

1. Search engines

    Baidu, 360 Search, Google, Sogou

2. Site search

    Forum search, microblog search, article search

3. E-commerce search

    Taobao search, Jingdong (JD) search

4. Wherever there is a search feature, full-text search technology can be used.

III. What is Lucene

  Lucene is a Java-based full-text search toolkit (a development kit); it is the natural choice for full-text search in Java development.

IV. The Lucene full-text search process

1. Creating an index

    1) Obtain the original documents

        Original documents: the data that is to be searched.

        Search engines: use crawlers to obtain the original documents.

        Site search: data in a database, fetched with JDBC.

        Our example case: read the files on disk directly with IO streams.

    2) Build Document objects

        For each original document, create a Document object.

        Each Document object contains multiple fields (Field).

        The fields hold the data of the original document:

           field name; field value.

        Each document has a document number, the document id.

    3) Analyze the documents

        This is the tokenization (word segmentation) process:

        1. Split on spaces to get a list of words.

        2. Convert the words uniformly to lowercase (or uppercase).

        3. Remove punctuation.

        4. Remove stop words (stop words: meaningless words, such as "and", "the").

        Each resulting keyword is wrapped in a Term object. A Term contains two parts: the field that contains the keyword, and the keyword itself. (The same keyword split out of different fields yields different Terms.)
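The four analysis steps above can be sketched in plain Java, without Lucene. This is only an illustration of the pipeline; the stop-word list and the input sentence are invented for the example (note that punctuation removal also strips hyphens here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SimpleAnalyzer {
    // A tiny, made-up stop-word list for illustration.
    private static final Set<String> STOP_WORDS = Set.of("and", "the", "a", "is");

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("\\s+")) {                 // 1. split on spaces
            String term = token.toLowerCase()                     // 2. lowercase uniformly
                    .replaceAll("\\p{Punct}", "");                // 3. remove punctuation
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {  // 4. drop stop words
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // → [lucene, fulltext, search, library, it, fast]
        System.out.println(analyze("Lucene is a full-text search library, and it is fast."));
    }
}
```

A real analyzer (StandardAnalyzer, IKAnalyzer) does the same kind of work, just with proper language rules instead of this naive splitting.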

    4) Create the index

        Create an index based on the keyword list and save it to disk, in the index library.

        The index library contains:

              the index

              the Document objects

              the correspondence between keywords and documents

        An index structure that finds documents through words like this is called an inverted index.
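The inverted index can be pictured as a map from each term to the list of ids of the documents that contain it. A minimal sketch, with fields, positions, and scoring omitted and the sample documents invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {
    // term -> ids of the documents that contain it (the "posting list")
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId); // record each document at most once per term
            }
        }
    }

    public List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.addDocument(0, "spring framework tutorial");
        index.addDocument(1, "lucene fulltext search");
        index.addDocument(2, "spring and lucene together");
        System.out.println(index.search("spring")); // → [0, 2]
    }
}
```

This is why querying is fast: a search is a map lookup over terms rather than a sequential scan of every document.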

2. Querying the index

    1) User query interface

        Where the user enters the query conditions.

        For example: Baidu's search box.

    2) Package the keyword as a query object

        The field to be queried.

        The keyword to search for.

    3) Execute the query

        Search the corresponding field by keyword.

        Find the keyword, then find the corresponding documents through the keyword.

    4) Render the results

        Find the target documents by document id.

        Highlight the keywords.

        Paginate.

        Finally display the results to the user.
V. A first program

  1. Create an index

      Environment:
        Download Lucene from
        http://lucene.apache.org/
        Version used here: Lucene 7.4.0
        Minimum requirement: JDK 1.8

      Project setup:
        Create a Java project and add the jars:
        lucene-analyzers-common-7.4.0.jar
        lucene-core-7.4.0.jar
        commons-io.jar

      Steps:
        1. Create a Directory object and specify where the index library is stored.
        2. Create an IndexWriter object based on the Directory object.
        3. Read the files on disk and create a Document object for each file.
        4. Add fields to the Document object.
        5. Write the Document object into the index library.
        6. Close the IndexWriter object.

public void createIndex() throws Exception {
    // 1. Create a Directory object and specify where the index library is saved.
    // Keep the index library in memory:
    //Directory directory = new RAMDirectory();
    // Keep the index library on disk:
    Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
    // 2. Create an IndexWriter object based on the Directory object.
    IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
    IndexWriter indexWriter = new IndexWriter(directory, config);
    // 3. Read the files on disk and create a Document object for each file.
    File dir = new File("C:\\A0.lucene2018\\05.reference\\searchsource");
    File[] files = dir.listFiles();
    for (File f : files) {
        // file name
        String fileName = f.getName();
        // file path
        String filePath = f.getPath();
        // file contents
        String fileContent = FileUtils.readFileToString(f, "utf-8");
        // file size
        long fileSize = FileUtils.sizeOf(f);
        // Create the fields.
        // Parameter 1: field name; parameter 2: field value; parameter 3: whether to store the value
        Field fieldName = new TextField("name", fileName, Field.Store.YES);
        //Field fieldPath = new TextField("path", filePath, Field.Store.YES);
        Field fieldPath = new StoredField("path", filePath);
        Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
        //Field fieldSize = new TextField("size", fileSize + "", Field.Store.YES);
        Field fieldSizeValue = new LongPoint("size", fileSize);
        Field fieldSizeStore = new StoredField("size", fileSize);
        // 4. Create a Document object and add the fields to it.
        Document document = new Document();
        document.add(fieldName);
        document.add(fieldPath);
        document.add(fieldContent);
        //document.add(fieldSize);
        document.add(fieldSizeValue);
        document.add(fieldSizeStore);
        // 5. Write the Document object into the index library.
        indexWriter.addDocument(document);
    }
    // 6. Close the IndexWriter object.
    indexWriter.close();
}

  

  2. Use Luke to view the contents of the index library (Luke requires a matching JDK version).
        Version: luke-7.4.0 (requires JDK 1.9)
  3. Query the index library

      Steps:
        1. Create a Directory object that specifies the location of the index library.
        2. Create an IndexReader object.
        3. Create an IndexSearcher object; its constructor takes the IndexReader object as a parameter.
        4. Create a query object, a TermQuery object.
        5. Execute the query to get a TopDocs object.
        6. Get the total number of records in the query result.
        7. Get the list of documents.
        8. Print the contents of the documents.
        9. Close the IndexReader object.

public void searchIndex() throws Exception {
    // 1. Create a Directory object and specify the location of the index library.
    Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
    // 2. Create an IndexReader object.
    IndexReader indexReader = DirectoryReader.open(directory);
    // 3. Create an IndexSearcher object; pass the IndexReader to the constructor.
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // 4. Create a Query object, a TermQuery.
    Query query = new TermQuery(new Term("name", "spring"));
    // 5. Execute the query to get a TopDocs object.
    // Parameter 1: the query object; parameter 2: the maximum number of records to return
    TopDocs topDocs = indexSearcher.search(query, 10);
    // 6. Get the total number of records in the query result.
    System.out.println("Total number of records: " + topDocs.totalHits);
    // 7. Get the list of documents.
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    // 8. Print the contents of the documents.
    for (ScoreDoc doc : scoreDocs) {
        // document id
        int docId = doc.doc;
        // get the document object by id
        Document document = indexSearcher.doc(docId);
        System.out.println(document.get("name"));
        System.out.println(document.get("path"));
        System.out.println(document.get("size"));
        //System.out.println(document.get("content"));
        System.out.println("-----------------");
    }
    // 9. Close the IndexReader object.
    indexReader.close();
}

 

VI. Analyzers

By default the standard analyzer, StandardAnalyzer, is used.

  1. Viewing the effect of an analyzer

    The Analyzer object's tokenStream method returns a TokenStream object, which contains the final word-segmentation results.

      Steps:
        1) Create an Analyzer object, e.g. a StandardAnalyzer object.
        2) Call the analyzer's tokenStream method to get a TokenStream object.
        3) Add an attribute reference to the TokenStream object; this is equivalent to setting a pointer into the stream.
        4) Call the TokenStream object's reset method; skipping this call causes an exception.
        5) Traverse the TokenStream object with a while loop.
        6) Close the TokenStream object.

public void testTokenStream() throws Exception {
    // 1) Create an Analyzer object, a StandardAnalyzer object.
    // Analyzer analyzer = new StandardAnalyzer();
    Analyzer analyzer = new IKAnalyzer();
    // 2) Call the analyzer's tokenStream method to get a TokenStream object.
    TokenStream tokenStream = analyzer.tokenStream("",
        "Lucene is a high-performance, scalable information retrieval (IR) toolkit. "
        + "Information retrieval means searching documents, searching information within documents, "
        + "searching document metadata, and other such operations.");
    // 3) Add an attribute reference to the TokenStream object; it acts as a pointer into the stream.
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // 4) Call the TokenStream object's reset method; skipping this call causes an exception.
    tokenStream.reset();
    // 5) Traverse the TokenStream object with a while loop.
    while (tokenStream.incrementToken()) {
        System.out.println(charTermAttribute.toString());
    }
    // 6) Close the TokenStream object.
    tokenStream.close();
}

  

  2. Using IKAnalyzer

    1) Add the IKAnalyzer jar package to the project.
    2) Add the configuration file and the extension dictionaries to the project's classpath.
    Note: never edit the extension dictionaries with Windows Notepad; the dictionary files must be encoded in UTF-8.
    Extension dictionary: adds newly coined words.
    Stop-word dictionary: meaningless or sensitive words.
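For reference, a typical IKAnalyzer.cfg.xml on the classpath looks roughly like the following sketch. The dictionary file names ext.dic and stopword.dic are only examples; use the names of your own dictionary files:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary (UTF-8, one word per line) -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- extension stop-word dictionary -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```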

The index-creation code is identical to the createIndex() method shown in section V above; the only IK-specific part is constructing the IndexWriterConfig with new IKAnalyzer() instead of the default StandardAnalyzer.

  


VII. Maintaining the index library

1. Adding documents

private IndexWriter indexWriter;

  @Before
public void init() throws Exception {
    // Create an IndexWriter object, using IKAnalyzer as the analyzer.
    indexWriter =
        new IndexWriter(FSDirectory.open(new File("C:\\temp\\index").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
}

  @Test
public void addDocument() throws Exception {
    // Create a Document object (the indexWriter field is created in init()).
    Document document = new Document();
    // Add fields to the Document object.
    document.add(new TextField("name", "newly added file", Field.Store.YES));
    document.add(new TextField("content", "content of the newly added file", Field.Store.NO));
    document.add(new StoredField("path", "c:/temp/helo"));
    // Write the document into the index library.
    indexWriter.addDocument(document);
    // Close the index library.
    indexWriter.close();
}


2. Deleting documents

1) Delete all documents

  @Test
public void deleteAllDocument() throws Exception {
    // Delete all documents.
    indexWriter.deleteAll();
    // Close the index library.
    indexWriter.close();
}

  



2) Delete documents by query / keyword

  @Test
public void deleteDocumentByQuery() throws Exception {
    indexWriter.deleteDocuments(new Term("name", "apache"));
    indexWriter.close();
}

  

3. Updating documents

  An update is implemented as a delete followed by an add.

  @Test
public void updateDocument() throws Exception {
    // Create a new Document object.
    Document document = new Document();
    // Add fields to the Document object.
    document.add(new TextField("name", "updated document", Field.Store.YES));
    document.add(new TextField("name1", "updated document 2", Field.Store.YES));
    document.add(new TextField("name2", "updated document 3", Field.Store.YES));
    // Update: documents matching the term are deleted and the new document is added.
    indexWriter.updateDocument(new Term("name", "spring"), document);
    // Close the index library.
    indexWriter.close();
}

  



VIII. Querying the index library

  1. Using Query subclasses

    1) TermQuery

        Queries by keyword.

        You must specify the field to query and the keyword.

    2) Range queries

        In Lucene 7, numeric range queries are built with LongPoint.newRangeQuery.

    private IndexReader indexReader;
    private IndexSearcher indexSearcher;

    @Before
    public void init() throws Exception {
        indexReader = DirectoryReader.open(FSDirectory.open(new File("C:\\temp\\index").toPath()));
        indexSearcher = new IndexSearcher(indexReader);
    }

    @Test
    public void testRangeQuery() throws Exception {
        // Create a Query object covering sizes 0 to 100 on the "size" field.
        Query query = LongPoint.newRangeQuery("size", 0L, 100L);
        printResult(query);
    }

    private void printResult(Query query) throws Exception {
        // Execute the query.
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total number of records: " + topDocs.totalHits);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc doc : scoreDocs) {
            // document id
            int docId = doc.doc;
            // get the document object by id
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            //System.out.println(document.get("content"));
            System.out.println("-----------------");
        }
        indexReader.close();
    }

  

  2. QueryParser queries

      The query string is first tokenized, and the search is then based on the resulting terms.

      Add the jar package:
      lucene-queryparser-7.4.0.jar

    @Test
    public void testQueryParser() throws Exception {
        // Create a QueryParser object; it takes two parameters.
        // Parameter 1: the default search field; parameter 2: the analyzer object
        QueryParser queryParser = new QueryParser("name", new IKAnalyzer());
        // Create a Query object using the QueryParser object.
        Query query = queryParser.parse("Lucene is a full-text search toolkit developed in Java");
        // Execute the query.
        printResult(query);
    }

  

Origin www.cnblogs.com/ketty/p/12106922.html