Lucene Full-Text Search (Part 1)

Lucene full-text search process

Creating an index

  • Obtaining the original documents
    • The original documents are the data you want to search over.
    • Search engines: use crawlers to fetch the original documents.
    • Site search: data stored in a database.
    • Local search: files read directly from disk with I/O streams.
  • Analyzing the documents (split each document into fields, then tokenize each field)
  • Building Term objects
    • Each keyword is wrapped in a Term object. A Term has two parts: the field the keyword belongs to (the field name) and the keyword itself (the field value).
    • Terms are produced by splitting the text on whitespace into a word list, converting every word to lowercase, and removing punctuation and stop words.
  • Building Document objects
    • Create one Document object for each original document.
    • Each Document object contains multiple Terms (the Terms from the previous step that are to be stored or indexed).
  • Creating the index
    • Build the index from the keyword list and save it to the index library.
    • The index library holds the index, the Document objects, and the mapping between keywords and documents.
    • Finding documents by the words they contain uses an index structure called an inverted index (see the sketch after this list).
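
As a toy illustration (plain Java, not Lucene's actual data structures), an inverted index maps each term to the ids of the documents that contain it:

import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // two already-tokenized "documents"
        String[][] docs = {
                {"spring", "mvc", "tutorial"},   // document id 0
                {"spring", "boot", "tutorial"}   // document id 1
        };
        // term -> ids of the documents containing that term
        Map<String, Set<Integer>> invertedIndex = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId]) {
                invertedIndex.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        // prints: {boot=[1], mvc=[0], spring=[0, 1], tutorial=[0, 1]}
        System.out.println(invertedIndex);
    }
}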

Querying the Index

  • The user query interface
    • Where the user enters the search conditions (for example, the Baidu search box).
  • Wrapping the keywords in a query object
    • The query object specifies the field to query and the keywords to search for.
  • Executing the query
    • Look the keywords up in the index of the corresponding field.
    • Once the keywords are found, fetch the documents that contain them.
  • Rendering the results
    • Locate each Document object by its document id.
    • Highlight the keywords, paginate the results, and present them to the user (a minimal query sketch follows this list).
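
A minimal query sketch using the Lucene 7 API, assuming the index library that addDocument() below creates at C:\temp\index (the method name and the queried term are illustrative):

public void searchIndex() throws Exception {
        // open the index library from disk
        Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // TermQuery matches documents whose "content" field contains the term "lucene"
        Query query = new TermQuery(new Term("content", "lucene"));
        // run the query and keep the 10 best-scoring hits
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("total hits: " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            // scoreDoc.doc is the document id; fetch the stored fields by id
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.println(document.get("name") + " -> " + document.get("path"));
        }
        indexReader.close();
    }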

Note:

  • Lucene does not store data in structured tables. Instead it organizes the Terms into an index tree; every query walks this index tree to find the matching Documents (each document has a unique number, its document id).
  • Structured, relational data is still better kept in a relational database.

Preparing the Environment

  • Dependencies

     <dependency>
         <groupId>org.apache.lucene</groupId>
         <artifactId>lucene-core</artifactId>
         <version>7.4.0</version>
     </dependency>
     <dependency>
         <groupId>org.apache.lucene</groupId>
         <artifactId>lucene-queryparser</artifactId>
         <version>7.4.0</version>
     </dependency>
     <dependency>
         <groupId>org.apache.lucene</groupId>
         <artifactId>lucene-analyzers-common</artifactId>
         <version>7.4.0</version>
     </dependency>
     <!-- for FileUtils -->
     <dependency>
         <groupId>org.apache.commons</groupId>
         <artifactId>commons-io</artifactId>
         <version>1.3.2</version>
     </dependency>
    
  • IK analyzer (a Chinese tokenizer)

    Link: https://pan.baidu.com/s/1Rsled6iBdgOZAbjzxS_BCA
    Extraction code: f5wh

    IKAnalyzer.cfg.xml must be on the classpath, and hotword.dic (hot words) and stopword.dic (stop words) are loaded from the paths specified in IKAnalyzer.cfg.xml.
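
    For reference, a typical IKAnalyzer.cfg.xml looks like the sketch below; the entry keys follow the IK Analyzer distribution, and the dictionary file names are the ones used in this post:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- extension dictionary (hot words), path relative to the classpath -->
        <entry key="ext_dict">hotword.dic</entry>
        <!-- extension stop-word dictionary -->
        <entry key="ext_stopwords">stopword.dic</entry>
    </properties>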

Adding Documents to the Index Library

Steps

  • Create a Directory object that specifies where the index library is stored.
  • Create an IndexWriter object based on the Directory object.
  • Read the files on disk and create a Document object for each file.
  • Add fields to each Document object.
  • Write the Document objects to the index library.
  • Close the IndexWriter object.
public void addDocument() throws Exception {
        /**
         * 1. Create a Directory object that specifies where the index library is stored.
         * To keep the index library in memory: Directory directory = new RAMDirectory();
         * To keep the index library on disk:
         */
        Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
        /**
         * 2. Create an IndexWriter object based on the Directory object.
         * new IndexWriterConfig() without an argument defaults to the built-in
         * StandardAnalyzer, which is designed for English text.
         */
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
        IndexWriter indexWriter = new IndexWriter(directory, config);
        /**
         * 3. Read the files on disk and create a Document object for each file.
         */
        File dir = new File("C:\\temp\\searchsource");
        File[] files = dir.listFiles();
        for (File f : files) {
            // file name
            String fileName = f.getName();
            // file path
            String filePath = f.getPath();
            // file content
            String fileContent = FileUtils.readFileToString(f, "utf-8");
            // file size
            long fileSize = FileUtils.sizeOf(f);
            /**
             * 4. Create the fields.
             * Argument 1: field name; argument 2: field value; argument 3: whether to store the value.
             */
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new StoredField("path", filePath);
            Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
            // LongPoint is indexed for range queries, but its value is not stored
            Field fieldSizeValue = new LongPoint("size", fileSize);
            // create the Document object
            Document document = new Document();
            // add the fields to the Document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSizeValue);
            /**
             * 5. Write the Document object to the index library.
             */
            indexWriter.addDocument(document);
        }
        /**
         * 6. Close the IndexWriter object.
         */
        indexWriter.close();
 }
  • Common Field types

    Field type     Value type       Analyzed               Indexed   Stored
    StringField    String           No (single token)      Yes       optional
    TextField      String/Reader    Yes                    Yes       optional
    LongPoint      long             No (point encoding)    Yes       No
    StoredField    any value        No                     No        Yes

  • Stored attribute

    • Whether the field's value is saved in the index library. If a field only needs to be searchable, and its value never has to be read back from a matched Document, it does not need to be stored; calling document.get() on an unstored field returns nothing.
  • Indexed attribute

    • Whether the field is indexed so that it can be queried. Fields such as a URL or file path usually never need to be searched, so a StoredField (stored but not indexed) is the right choice for them.
  • Analyzed attribute

    • Whether the field's value is tokenized. Some values must not be split (an ID number, for example) and should be indexed as a single token (see the sketch below).
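
    A short sketch of how these three attributes map onto the field classes used above (the field values are placeholders):

    Document document = new Document();
    // analyzed + indexed, not stored: searchable, but document.get("content") returns null
    document.add(new TextField("content", "full text to search ...", Field.Store.NO));
    // stored only, not indexed: retrievable but not searchable
    document.add(new StoredField("path", "C:\\temp\\a.txt"));
    // indexed as a single un-analyzed token, and stored: for exact-match values such as an ID
    document.add(new StringField("id", "4101971234567", Field.Store.YES));
    // indexed for range queries (point encoding), value not stored
    document.add(new LongPoint("size", 1024L));
    // add a StoredField with the same name if the size must also be retrievable
    document.add(new StoredField("size", 1024L));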

Tokenizer Notes

  • Testing the tokenizer
 public void testTokenStream() throws Exception {
        // 1) Create an Analyzer object.
        // Analyzer analyzer = new StandardAnalyzer(); // the built-in English-oriented analyzer
        Analyzer analyzer = new IKAnalyzer();
        // 2) Get a TokenStream object from the analyzer's tokenStream method.
        // The first argument is the field name (it can be left empty when no document is involved);
        // the second argument is the text to tokenize.
        TokenStream tokenStream = analyzer.tokenStream("", "Lucene是一款高性能的、可扩展的信息检索(IR)工具库。信息检索是指文档搜索、文档内信息搜索或者文档相关的元数据搜索等操作。");
        // 3) Add a CharTermAttribute to the TokenStream; it acts as a pointer to the current token.
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 4) Call the TokenStream's reset() method; skipping it causes an exception.
        tokenStream.reset();
        // 5) Iterate over the TokenStream with a while loop.
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        // 6) Close the TokenStream.
        tokenStream.close();
    }
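
    For Chinese text such as the sample sentence above, the built-in StandardAnalyzer emits CJK characters one by one, while IKAnalyzer groups them into whole words; that difference is the reason for using the IK dependency here.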
  • Extending the dictionaries with hot words and stop words

    • Add hot words (kept as a single token during tokenization) to hotword.dic.
    • Add stop words (removed from the token stream during tokenization) to stopword.dic. Sample files follow at the end of this section.
  • Note

    • hotword.dic and ext_stopword.dic must be encoded as UTF-8, and specifically UTF-8 without a BOM.
      That rules out editing the dictionary files with Windows Notepad.

    • Use an editor such as EditPlus.exe to save them as UTF-8 without a BOM.

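    As an illustration, hypothetical dictionary contents, one entry per line:

    hotword.dic:
    全文检索
    倒排索引

    stopword.dic:
    的
    了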

Deleting from the Index Library

  • Delete the entire index library
 public void deleteAllDocument() throws Exception {
        // create an IndexWriter object, using IKAnalyzer as the analyzer
        IndexWriter indexWriter = new IndexWriter(
                FSDirectory.open(new File("C:\\temp\\index").toPath()),
                new IndexWriterConfig(new IKAnalyzer()));
        // delete all documents
        indexWriter.deleteAll();
        // close the index library
        indexWriter.close();
    }
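
    As an aside: IndexWriter changes only become permanent once commit() or close() is called; before that, calling indexWriter.rollback() instead would discard the uncommitted deletion (and close the writer).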
  • Delete by query
public void deleteDocumentByQuery() throws Exception {
        // create an IndexWriter object, using IKAnalyzer as the analyzer
        IndexWriter indexWriter = new IndexWriter(
                FSDirectory.open(new File("C:\\temp\\index").toPath()),
                new IndexWriterConfig(new IKAnalyzer()));
        // delete every document whose "name" field contains the term "apache"
        indexWriter.deleteDocuments(new Term("name", "apache"));
        indexWriter.close();
    }
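
    deleteDocuments also accepts Query objects, so, as a sketch, the documents whose size field (indexed with LongPoint above) falls within a range could be deleted like this:

    // delete every document whose "size" value lies in [0, 100]
    indexWriter.deleteDocuments(LongPoint.newRangeQuery("size", 0L, 100L));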

Updating the Index Library

 public void updateDocument() throws Exception {
        // create an IndexWriter object, using IKAnalyzer as the analyzer
        IndexWriter indexWriter = new IndexWriter(
                FSDirectory.open(new File("C:\\temp\\index").toPath()),
                new IndexWriterConfig(new IKAnalyzer()));
        // create a new Document object
        Document document = new Document();
        // add fields to the new Document; the values are the replacement content
        // (document.get() cannot be used here: this document is new and empty)
        document.add(new TextField("name", "updated name", Field.Store.YES));
        document.add(new TextField("name1", "new field name2", Field.Store.YES));
        // perform the update: delete every document matching the Term, then add this one
        indexWriter.updateDocument(new Term("name", "spring"), document);
        // close the index library
        indexWriter.close();
    }

Explanation

  • In Lucene, an update is a delete followed by an add: find the matching Documents, delete them, then add the new Document.
  • A Lucene Document has no fixed schema, so any Term (field) can be added to it; likewise, a whole Document can be found through any of its indexed Terms.