[Hands-on full-text search] [Add, delete, update, query] operations on a Lucene index

Foreword

  Anyone who works on search has probably come across Lucene. It is open source and easy to use, and the official API is enough to write small demos. Because it is built on an inverted index, retrieval is fast. This article implements four basic operations: adding documents to an index incrementally, deleting from an index, querying by keyword, and updating an index.
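
To make the inverted-index idea concrete, here is a minimal sketch in plain Java. It is not how Lucene stores its index; the class name `ToyInvertedIndex` and its crude tokenizer are assumptions made for illustration. The point is only that a search becomes a single map lookup instead of a scan over every document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted ids of the documents containing it.
// Illustration only, not Lucene's on-disk format.
class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Split on anything that is not a letter or digit and lowercase the pieces
    // (a rough stand-in for what an analyzer does).
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // Index one document and return the id assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Look the term up directly instead of scanning every document.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("hello,man!");      // doc 0
        index.add("goodbye,man!");    // doc 1
        index.add("hello,woman!");    // doc 2
        index.add("goodbye,woman!");  // doc 3
        System.out.println(index.search("man"));   // [0, 1]
        System.out.println(index.search("hello")); // [0, 2]
    }
}
```

Note that "man" does not match "woman": the lookup is on whole terms, which is also why the choice of analyzer matters so much in a real index.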

  So far, the part I find inconvenient is that to run full-text search over file contents you have to write the file-reading code yourself (Solr does this for you out of the box). Creating an index is also relatively slow and leaves a lot of room for optimization, which deserves further study.
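
For plain-text files, the reading step mentioned above can be quite small. The helper below is a sketch under assumptions: the class name `FileTextReader` is hypothetical, the files are assumed to be UTF-8, and binary formats would need a parser first. The returned string could then be passed to a field such as the "content" field used later in this article.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: loads a whole text file into a String so that it can be
// handed to an indexing field.
class FileTextReader {
    // Reads the file as UTF-8; the encoding is an assumption about the input files.
    static String readText(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file, read it back, then clean up.
        Path sample = Files.createTempFile("lucene-demo", ".txt");
        Files.write(sample, "hello,man!".getBytes(StandardCharsets.UTF_8));
        System.out.println(readText(sample));
        Files.delete(sample);
    }
}
```

Reading the whole file into memory keeps the sketch short; for very large files a streaming reader would be the more careful choice.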

 

Creating the index

  The previous post explained the general process of creating a Lucene index. Here is a brief recap:

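Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// FSDirectory.open takes a File here, as in the later examples
Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc); // without this call the document never reaches the index
iwriter.close();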

  1 Create a Directory that points to the index directory

  2 Create an analyzer, then create an IndexWriter

  3 Create a Document to hold the data and add it to the IndexWriter

  4 Close the IndexWriter, which commits the changes

    /**
     * Build the index from four sample documents.
     *
     * @throws Exception if the index cannot be opened or written
     */
    public static void index() throws Exception {
        
        String text1 = "hello,man!";
        String text2 = "goodbye,man!";
        String text3 = "hello,woman!";
        String text4 = "goodbye,woman!";
        
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        Document doc1 = new Document();
        doc1.add(new TextField("filename", "text1", Store.YES));
        doc1.add(new TextField("content", text1, Store.YES));
        indexWriter.addDocument(doc1);
        
        Document doc2 = new Document();
        doc2.add(new TextField("filename", "text2", Store.YES));
        doc2.add(new TextField("content", text2, Store.YES));
        indexWriter.addDocument(doc2);
        
        Document doc3 = new Document();
        doc3.add(new TextField("filename", "text3", Store.YES));
        doc3.add(new TextField("content", text3, Store.YES));
        indexWriter.addDocument(doc3);
        
        Document doc4 = new Document();
        doc4.add(new TextField("filename", "text4", Store.YES));
        doc4.add(new TextField("content", text4, Store.YES));
        indexWriter.addDocument(doc4);
        
        indexWriter.commit();
        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Time to create index: " + (date2.getTime() - date1.getTime()) + "ms\n");
    }

 

Adding to the index incrementally

  Lucene supports incremental indexing: new documents can be added without affecting what is already in the index, and Lucene merges the index files (segments) automatically as needed.

    /**
     * Add one more document to the existing index.
     *
     * @throws Exception if the index cannot be opened or written
     */
    public static void insert() throws Exception {
        String text5 = "hello,goodbye,man,woman";
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        Document doc1 = new Document();
        doc1.add(new TextField("filename", "text5", Store.YES));
        doc1.add(new TextField("content", text5, Store.YES));
        indexWriter.addDocument(doc1);

        indexWriter.commit();
        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Time to add to index: " + (date2.getTime() - date1.getTime()) + "ms\n");
    }

 

Deleting from the index

  Deletion also goes through IndexWriter, using its deleteDocuments method. Given a term, it deletes every document that contains that term. If you only want to delete a single document, it is best to give each document a unique ID field and delete by that field.

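    /**
     * Delete documents from the index.
     *
     * @param str value of the filename term; every document containing it is deleted
     * @throws Exception if the index cannot be opened or written
     */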
    public static void delete(String str) throws Exception {
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);
        
        indexWriter.deleteDocuments(new Term("filename",str));  
        
        indexWriter.close();
        
        Date date2 = new Date();
        System.out.println("Time to delete from index: " + (date2.getTime() - date1.getTime()) + "ms\n");
    }
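
The difference between deleting by a shared keyword and deleting by a unique ID can be shown with a small plain-Java model. This is only a sketch of the semantics described above, not Lucene code; the class `ToyDeleteByTerm` and the "id" field are assumptions introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of delete-by-term: every document whose field contains the term is removed.
class ToyDeleteByTerm {
    final List<Map<String, String>> docs = new ArrayList<>();

    void add(String id, String content) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", id);
        doc.put("content", content);
        docs.add(doc);
    }

    // True if the field value, split on non-alphanumerics and lowercased, contains the term.
    static boolean containsTerm(String value, String term) {
        if (value == null) {
            return false;
        }
        String[] pieces = value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        return Arrays.asList(pieces).contains(term);
    }

    // Removes all matching documents and returns how many were removed.
    int deleteByTerm(String field, String term) {
        int before = docs.size();
        docs.removeIf(doc -> containsTerm(doc.get(field), term));
        return before - docs.size();
    }

    public static void main(String[] args) {
        ToyDeleteByTerm index = new ToyDeleteByTerm();
        index.add("1", "hello,man!");
        index.add("2", "goodbye,man!");
        index.add("3", "hello,woman!");
        // Deleting by the unique id removes exactly one document.
        System.out.println(index.deleteByTerm("id", "2"));          // 1
        // Deleting by a content keyword removes every document containing it.
        System.out.println(index.deleteByTerm("content", "hello")); // 2
    }
}
```

A keyword delete is convenient for bulk cleanup, but it is easy to remove more than intended, which is why the unique ID field is the safer handle for single documents.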

 

Updating the index

  Lucene has no true in-place update. updateDocument takes a term and a new document. In essence, it first deletes every document that matches the term and then adds the new document.

    /**
     * Update the index: replace the document whose filename is "text1".
     *
     * @throws Exception if the index cannot be opened or written
     */
    public static void update() throws Exception {
        String text1 = "update,hello,man!";
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        Document doc1 = new Document();
        doc1.add(new TextField("filename", "text1", Store.YES));
        doc1.add(new TextField("content", text1, Store.YES));

        // Deletes every document matching the term, then adds doc1
        indexWriter.updateDocument(new Term("filename", "text1"), doc1);

        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Time to update index: " + (date2.getTime() - date1.getTime()) + "ms\n");
    }
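
The delete-then-add behaviour has two consequences that are easy to miss, and a plain-Java model makes them visible. This is a sketch of the semantics only, not Lucene's implementation; `ToyUpdate` and its helper `doc` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of update-as-delete-then-add.
class ToyUpdate {
    final List<Map<String, String>> docs = new ArrayList<>();

    // Builds a document with the two fields used throughout the article.
    static Map<String, String> doc(String filename, String content) {
        Map<String, String> d = new HashMap<>();
        d.put("filename", filename);
        d.put("content", content);
        return d;
    }

    // Step 1: drop every document whose field equals the value.
    // Step 2: append the replacement as a brand-new document.
    void update(String field, String value, Map<String, String> replacement) {
        docs.removeIf(d -> value.equals(d.get(field)));
        docs.add(replacement);
    }

    public static void main(String[] args) {
        ToyUpdate index = new ToyUpdate();
        index.docs.add(doc("text1", "hello,man!"));
        index.docs.add(doc("text2", "goodbye,man!"));
        index.update("filename", "text1", doc("text1", "update,hello,man!"));
        // The old text1 is gone and the new one sits at the end of the list.
        System.out.println(index.docs.size());                // 2
        System.out.println(index.docs.get(1).get("content")); // update,hello,man!
    }
}
```

First, if nothing matches the term, the update simply behaves like an add. Second, if several documents match, all of them are replaced by the single new document, so the term used for updating should identify exactly one document.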

 

Querying the index by keyword

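  Lucene offers many kinds of queries, which are not covered in detail here. A search returns an array of ScoreDoc hits, somewhat like a ResultSet. Each hit identifies a document, and we can then read the content we want from that document by field name.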

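    /**
     * Keyword query against the "content" field.
     *
     * @param str query string
     * @throws Exception if the index cannot be read or the query cannot be parsed
     */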
    public static void search(String str) throws Exception {
        directory = FSDirectory.open(new File(INDEX_DIR));
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);

        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "content",analyzer);
        Query query = parser.parse(str);

        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = isearcher.doc(hits[i].doc);
            System.out.println(hitDoc.get("filename"));
            System.out.println(hitDoc.get("content"));
        }
        ireader.close();
        directory.close();
    }
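
The two-step pattern in the loop above (take a hit, load the document it points to, read a field by name) can be modelled in plain Java. The sketch below is an assumption-laden toy: `ToyHits` and `Hit` are hypothetical names, and the score is a naive occurrence count, whereas Lucene's real scoring is far more elaborate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of search hits: each hit holds only a document number and a score.
class ToyHits {
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Stored fields, addressed by document number.
    final List<Map<String, String>> stored = new ArrayList<>();

    void add(String filename, String content) {
        Map<String, String> d = new HashMap<>();
        d.put("filename", filename);
        d.put("content", content);
        stored.add(d);
    }

    // Score = number of occurrences of the term in "content" (deliberately naive).
    List<Hit> search(String term, int maxHits) {
        List<Hit> hits = new ArrayList<>();
        for (int doc = 0; doc < stored.size(); doc++) {
            String content = stored.get(doc).get("content").toLowerCase(Locale.ROOT);
            int count = 0;
            for (String piece : content.split("[^\\p{L}\\p{N}]+")) {
                if (piece.equals(term)) {
                    count++;
                }
            }
            if (count > 0) {
                hits.add(new Hit(doc, count));
            }
        }
        // Best score first; the sort is stable, so ties keep document order.
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits.size() > maxHits ? hits.subList(0, maxHits) : hits;
    }

    public static void main(String[] args) {
        ToyHits index = new ToyHits();
        index.add("text1", "hello,man!");
        index.add("text2", "man,man,man");
        index.add("text3", "hello,woman!");
        for (Hit hit : index.search("man", 1000)) {
            // Step 1: resolve the hit to its stored document; step 2: read a field by name.
            System.out.println(index.stored.get(hit.doc).get("filename"));
        }
        // prints text2, then text1
    }
}
```

Keeping hits this small is what makes it cheap to collect many of them and load the stored fields only for the ones that are actually displayed.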

 

Full code

package test;

import java.io.File;
import java.util.Date;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TestLucene {
    // Directory where the index is stored
    private static String INDEX_DIR = "D:\\luceneIndex";
    private static Analyzer analyzer = null;
    private static Directory directory = null;
    private static IndexWriter indexWriter = null;

    public static void main(String[] args) {
        try {
//            index();
            search("man");
//            insert();
//            delete("text5");
//            update();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    /**
     * Update the index: replace the document whose filename is "text1".
     *
     * @throws Exception if the index cannot be opened or written
     */
    public static void update() throws Exception {
        String text1 = "update,hello,man!";
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        Document doc1 = new Document();
        doc1.add(new TextField("filename", "text1", Store.YES));
        doc1.add(new TextField("content", text1, Store.YES));

        // Deletes every document matching the term, then adds doc1
        indexWriter.updateDocument(new Term("filename", "text1"), doc1);

        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Time to update index: " + (date2.getTime() - date1.getTime()) + "ms\n");
    }
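    /**
     * Delete documents from the index.
     *
     * @param str value of the filename term; every document containing it is deleted
     * @throws Exception if the index cannot be opened or written
     */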
    public static void delete(String str) throws Exception {
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);
        
        indexWriter.deleteDocuments(new Term("filename",str));  
        
        indexWriter.close();
        
        Date date2 = new Date();
        System.out.println("Time to delete from index: " + (date2.getTime() - date1.getTime()) + "ms\n");
    }
    /**
     * Add one more document to the existing index.
     *
     * @throws Exception if the index cannot be opened or written
     */
    public static void insert() throws Exception {
        String text5 = "hello,goodbye,man,woman";
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        Document doc1 = new Document();
        doc1.add(new TextField("filename", "text5", Store.YES));
        doc1.add(new TextField("content", text5, Store.YES));
        indexWriter.addDocument(doc1);

        indexWriter.commit();
        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Time to add to index: " + (date2.getTime() - date1.getTime()) + "ms\n");
    }
    /**
     * Build the index from four sample documents.
     *
     * @throws Exception if the index cannot be opened or written
     */
    public static void index() throws Exception {
        
        String text1 = "hello,man!";
        String text2 = "goodbye,man!";
        String text3 = "hello,woman!";
        String text4 = "goodbye,woman!";
        
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        Document doc1 = new Document();
        doc1.add(new TextField("filename", "text1", Store.YES));
        doc1.add(new TextField("content", text1, Store.YES));
        indexWriter.addDocument(doc1);
        
        Document doc2 = new Document();
        doc2.add(new TextField("filename", "text2", Store.YES));
        doc2.add(new TextField("content", text2, Store.YES));
        indexWriter.addDocument(doc2);
        
        Document doc3 = new Document();
        doc3.add(new TextField("filename", "text3", Store.YES));
        doc3.add(new TextField("content", text3, Store.YES));
        indexWriter.addDocument(doc3);
        
        Document doc4 = new Document();
        doc4.add(new TextField("filename", "text4", Store.YES));
        doc4.add(new TextField("content", text4, Store.YES));
        indexWriter.addDocument(doc4);
        
        indexWriter.commit();
        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Time to create index: " + (date2.getTime() - date1.getTime()) + "ms\n");
    }

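    /**
     * Keyword query against the "content" field.
     *
     * @param str query string
     * @throws Exception if the index cannot be read or the query cannot be parsed
     */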
    public static void search(String str) throws Exception {
        directory = FSDirectory.open(new File(INDEX_DIR));
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);

        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "content",analyzer);
        Query query = parser.parse(str);

        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = isearcher.doc(hits[i].doc);
            System.out.println(hitDoc.get("filename"));
            System.out.println(hitDoc.get("content"));
        }
        ireader.close();
        directory.close();
    }
}

 

References

  http://www.cnblogs.com/xing901022/p/3933675.html
