Lucene 6.0 in Practice (1): Creating an Index


Introduction

Lucene 6.0 was released on April 8, 2016; the minimum Java version it requires is Java 8.

Most companies' databases eventually adopt strategies such as sharding across databases and tables, and for certain business requirements it becomes cumbersome to retrieve specific data scattered over different tables in different databases. Lucene can help here: first build a full index over the data in those tables, then write the data to be retrieved into a separate table for other business consumers to query. After that, only incremental indexing and writes to that table are needed.

Since I have been working with Lucene recently, prefer to use the latest version, and found very few resources of this kind online, I am writing up some key points and examples. This article introduces Lucene 6.0 from a practical perspective; it does not go deep into the underlying principles, but some core points are mentioned along the way.

Why is Lucene so popular?

Lucene is an efficient, Java-based full-text retrieval library. Data comes in two main forms: structured data and unstructured data. XML, JSON, databases, and the like are structured data; unstructured data is also called full-text data, and this is where Lucene comes in. Full-text retrieval consists of two main processes: index creation (indexing) and index search (searching).

Lucene is the foundation of many search engines and is used by many large companies, such as Netflix, MySpace, LinkedIn, Twitter, and IBM. The following characteristics give a general sense of Lucene:

  • 150GB of data can be indexed per hour on modern hardware
  • Indexing a 20GB text file produces an index of roughly 4-6GB
  • Only 1MB of heap memory is required
  • Customizable sorting model
  • Supports multiple query types
  • Search by specific field
  • Sort by a specific field
  • Near real-time indexing and searching
  • Faceting, grouping, highlighting, suggestions, etc.

Given Lucene's power and popularity, many search technologies are built on top of it; the two most popular are Apache Solr and Elasticsearch. There are also Lucene implementations (ports) in a number of other languages.

Storing the index

Lucene creates the index in its own specific format, and the index must be stored somewhere, typically on the file system. The basic abstract class for storing an index is BaseDirectory, which extends Directory. BaseDirectory has two main families of implementations:

  • FSDirectory: stores index files on the file system; it has six subclasses, three of which are commonly used:
    • SimpleFSDirectory: implemented with Files.newByteChannel; it handles concurrency poorly, synchronizing when multiple threads read the same file
    • NIOFSDirectory: uses FileChannel from Java NIO so that multiple threads can read the same file without synchronization; because of a Sun JRE bug on Windows, it is not recommended on that platform
    • MMapDirectory: uses memory-mapped IO for reads; a great option if your virtual address space is large enough to hold the index files
  • RAMDirectory: keeps the index in heap memory; only suitable for small indexes, since large ones cause frequent GC

Normally, when the index lives on the file system, there is no need to pick an FSDirectory subclass yourself; just call FSDirectory.open(Path path). Its Javadoc reads:

Creates an FSDirectory instance, trying to pick the best implementation given the current environment. The directory returned uses the NativeFSLockFactory. The directory is created at the named location if it does not yet exist.

This method automatically selects the best implementation subclass for the current environment. The selection strategy is:

  • Returns MMapDirectory for Linux, MacOSX, Solaris, Windows 64-bit JREs
  • For other non-Windows JREs, return NIOFSDirectory
  • For other JREs on Windows, return SimpleFSDirectory
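
A minimal sketch of this in action (the index path below is just an example):

import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DirectoryDemo {
    public static void main(String[] args) throws Exception {
        // FSDirectory.open picks the best Directory implementation for this platform
        Directory directory = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        // on a 64-bit Linux/macOS/Windows JRE this typically prints "MMapDirectory"
        System.out.println(directory.getClass().getSimpleName());
        directory.close();
    }
}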

For now, MMapDirectory is the best implementation. It uses virtual memory and the mmap system call to access files on disk. The conventional approach relies on system calls to copy data between the file system cache and the Java heap; so how can we access the file system cache directly? That is exactly what mmap does.

Simply put, MMapDirectory lets the OS treat the Lucene index much like it treats a swap file. The mmap() system call asks the OS to map the entire index file into the virtual address space, so from Lucene's point of view the index is in memory, and Lucene can read the file on disk as if it were one large byte[] (in Java this is exposed through the ByteBuffer interface).

When Lucene accesses the mapped index, no system calls are needed: the CPU's MMU (memory management unit) and TLB (translation lookaside buffer, which caches recently used page translations) handle all the address mapping. If a page is still on disk, the MMU raises a page fault and the OS loads the data into the file system cache; if the page is already cached, the access is an ordinary memory read and is very fast. Programmers do not need to care about paging data in or out; the OS takes care of all of it, and concurrent readers do not interfere with each other.

The only drawback is that Java's ByteBuffer wrapper is slightly slower than a raw byte[], but it is the only way to use mmap from Java. Another great advantage is that all of this memory is managed by the OS, so there is no GC pressure.
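
If you want to force memory-mapped IO rather than rely on the auto-selection, MMapDirectory can be constructed directly. A small sketch (the path is again an example); MMapDirectory.UNMAP_SUPPORTED reports whether this JVM lets Lucene unmap the buffers when index files are closed, which matters on Windows because mapped files cannot be deleted while still mapped:

import java.nio.file.Paths;
import org.apache.lucene.store.MMapDirectory;

public class MMapDemo {
    public static void main(String[] args) throws Exception {
        // whether mapped byte buffers can be unmapped eagerly on close
        System.out.println("unmap supported: " + MMapDirectory.UNMAP_SUPPORTED);
        // force mmap instead of letting FSDirectory.open choose
        MMapDirectory directory = new MMapDirectory(Paths.get("/tmp/lucene-index"));
        directory.close();
    }
}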

Index core classes

The following classes are required to perform a simple indexing process:

  • IndexWriter: responsible for creating a new index or opening an existing one
  • IndexWriterConfig: holds all the configuration options for creating an IndexWriter
  • Directory: describes the storage location of a Lucene index; its subclasses determine where the index is stored
  • Analyzer: responsible for text analysis, extracting tokens (lexical units) from the text being indexed. Make sure the same Analyzer is used both when creating the index and when querying it; otherwise the query results will be wrong (see the sketch after this list)
  • Document: represents a collection of fields (Field); you can think of a Document as a virtual document such as a web page, an e-mail message, or a text file
  • Field: each document in the index contains one or more named fields, each with a field name and a corresponding value
  • FieldType: describes the properties of a Field; needed when you are not using a concrete Field type such as StringField or TextField
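
To illustrate the Analyzer caveat, here is a minimal sketch (it assumes the lucene-queryparser module is on the classpath): StandardAnalyzer lowercases tokens at index time, so parsing the query string with the same analyzer produces a term that matches what is actually stored in the index.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class AnalyzerConsistencyDemo {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer indexed "Amsterdam" as the token "amsterdam";
        // analyzing the query with the same analyzer yields the same token
        QueryParser parser = new QueryParser("city", new StandardAnalyzer());
        Query query = parser.parse("Amsterdam");
        System.out.println(query); // prints city:amsterdam
    }
}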

Creating the index

There are three modes for creating or opening an index, specified via IndexWriterConfig.OpenMode:

  • CREATE: creates a new index, overwriting any existing one
  • APPEND: opens an existing index
  • CREATE_OR_APPEND: creates a new index if none exists, otherwise appends to the existing one
/**
 * Creates an index writer.
 *
 * @param indexPath path where the index is stored
 * @param create    whether to create a new index, overwriting any existing one
 * @throws IOException
 */
public IndexWriter getIndexWriter(String indexPath, boolean create) throws IOException {
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
    if (create) {
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    } else {
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    }
    Directory directory = FSDirectory.open(Paths.get(indexPath));
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
    return indexWriter;
}


If the index is only needed for testing, it can also be kept in memory; in that case, use RAMDirectory:

public class LuceneDemo {
    private Directory directory;
    private String[] ids = {"1", "2"};
    private String[] unIndex = {"Netherlands", "Italy"};
    private String[] unStored = {"Amsterdam has lots of bridges", "Venice has lots of canals"};
    private String[] text = {"Amsterdam", "Venice"};
    private IndexWriter indexWriter;
    private IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
    @Test
    public void createIndex() throws IOException {
        directory = new RAMDirectory();
        // print index creation details to the console
        indexWriterConfig.setInfoStream(System.out);
        indexWriter = new IndexWriter(directory, indexWriterConfig);
        // getConfig() returns the (live) configuration actually used by the writer
        indexWriterConfig = (IndexWriterConfig) indexWriter.getConfig();
        FieldType fieldType = new FieldType();
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        fieldType.setStored(true); // store the field value
        fieldType.setTokenized(true); // tokenize (analyze) the field value
        for (int i = 0; i < ids.length; i++) {
            Document document = new Document();
            document.add(new Field("id", ids[i], fieldType));
            document.add(new Field("country", unIndex[i], fieldType));
            document.add(new Field("contents", unStored[i], fieldType));
            document.add(new Field("city", text[i], fieldType));
            indexWriter.addDocument(document);
        }
        indexWriter.commit();
    }
}

NOTES: IndexWriter.close() automatically calls commit(), and commit() automatically calls flush(), so there is generally no need to spell all three out like this:

indexWriter.flush();
indexWriter.commit();
indexWriter.close();
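
Since IndexWriter implements Closeable, a try-with-resources block is usually all you need. A sketch reusing directory and indexWriterConfig from the example above (this relies on commitOnClose being left at its default of true):

// with the default commitOnClose=true, close() commits pending changes itself
try (IndexWriter writer = new IndexWriter(directory, indexWriterConfig)) {
    Document document = new Document();
    // hypothetical extra document, just to have something to commit
    document.add(new StringField("id", "3", Field.Store.YES));
    writer.addDocument(document);
} // close() runs here: flush, commit, release the write lock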


The index creation information printed to the console looks like this:

IFD 0 [2016-05-19T07:10:21.127Z; main]: init: current segments file is "segments"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@691a7f8f
IFD 0 [2016-05-19T07:10:21.167Z; main]: delete []
IFD 0 [2016-05-19T07:10:21.167Z; main]: now checkpoint "" [0 segments ; isCommit = false]
IFD 0 [2016-05-19T07:10:21.167Z; main]: delete []
IFD 0 [2016-05-19T07:10:21.167Z; main]: 0 msec to checkpoint
IW 0 [2016-05-19T07:10:21.167Z; main]: init: create=true
IW 0 [2016-05-19T07:10:21.168Z; main]:


DW 0 [2016-05-19T07:10:21.271Z; main]: main finishFullFlush success=true
IW 0 [2016-05-19T07:10:21.271Z; main]: startCommit(): start
IW 0 [2016-05-19T07:10:21.271Z; main]: skip startCommit(): no changes pending
IFD 0 [2016-05-19T07:10:21.271Z; main]: delete []
IW 0 [2016-05-19T07:10:21.271Z; main]: commit: pendingCommit == null; skip
IW 0 [2016-05-19T07:10:21.271Z; main]: commit: took 0.4 msec
IW 0 [2016-05-19T07:10:21.271Z; main]: commit: done
IW 0 [2016-05-19T07:10:21.271Z; main]: rollback
IW 0 [2016-05-19T07:10:21.271Z; main]: all running merges have aborted
IW 0 [2016-05-19T07:10:21.271Z; main]: rollback: done finish merges
DW 0 [2016-05-19T07:10:21.271Z; main]: abort
DW 0 [2016-05-19T07:10:21.271Z; main]: done abort success=true
IW 0 [2016-05-19T07:10:21.271Z; main]: rollback: infos=_0(6.0.0):c2
IFD 0 [2016-05-19T07:10:21.271Z; main]: now checkpoint "_0(6.0.0):c2" [1 segments ; isCommit = false]
IFD 0 [2016-05-19T07:10:21.272Z; main]: delete []
IFD 0 [2016-05-19T07:10:21.272Z; main]: 0 msec to checkpoint
IFD 0 [2016-05-19T07:10:21.272Z; main]: delete []
IFD 0 [2016-05-19T07:10:21.272Z; main]: delete []

Deleting documents

IndexWriter provides several methods for deleting Documents from the index:

  • deleteDocuments(Query... queries): deletes all Documents matching any of the given queries
  • deleteDocuments(Term... terms): deletes all Documents containing any of the given terms
  • deleteAll(): deletes all Documents in the index
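
As a quick illustration (reusing indexWriter and the String-valued id field from the earlier example), deleting by Term and by Query looks like this; note that no analysis is applied to a Term, so it must match the indexed token exactly:

// a Term is matched literally against the indexed tokens
indexWriter.deleteDocuments(new Term("id", "1"));                // by Term
indexWriter.deleteDocuments(new TermQuery(new Term("id", "2"))); // by Query
indexWriter.commit(); // make the deletes visible to newly opened readers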

NOTES: deleteDocuments(Term... terms) accepts only Term arguments, and Term provides only the following four constructors:

  • Term(String fld, BytesRef bytes)
  • Term(String fld, BytesRefBuilder bytesBuilder)
  • Term(String fld, String text)
  • Term(String fld)

So deleteDocuments(Term... terms) cannot delete Documents by fields whose values are not Strings, such as IntPoint, LongPoint, FloatPoint, or DoublePoint. To delete Documents containing such fields, pass a Query instance instead:

@Test
public void testDelete() throws IOException {
    RAMDirectory ramDirectory = new RAMDirectory();
    IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(new StandardAnalyzer()));
    Document document = new Document();
    document.add(new IntPoint("ID", 1));
    indexWriter.addDocument(document);
    indexWriter.commit();
    // this Term-based call cannot delete the document whose ID is 1
    indexWriter.deleteDocuments(new Term("ID", "1"));
    indexWriter.commit();
    DirectoryReader open = DirectoryReader.open(ramDirectory);
    IndexSearcher indexSearcher = new IndexSearcher(open);
    Query query = IntPoint.newExactQuery("ID", 1);
    TopDocs search = indexSearcher.search(query, 10);
    // prints 1 (a hit), so the document was not deleted
    System.out.println(search.totalHits);
    // delete using a Query instead
    indexWriter.deleteDocuments(query);
    indexWriter.commit();
    indexSearcher = new IndexSearcher(DirectoryReader.openIfChanged(open));
    search = indexSearcher.search(query, 10);
    // prints 0 (no hits), so the document has been deleted
    System.out.println(search.totalHits);
}
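
One more thing worth knowing: a delete only marks documents as deleted, and the space is reclaimed when segments merge. A small sketch continuing the test above shows how to observe and, if desired, force the reclamation:

// numDocs() counts only live documents; maxDoc() also counts deleted slots
// that have not been reclaimed yet (fully deleted segments are dropped eagerly)
DirectoryReader reader = DirectoryReader.open(ramDirectory);
System.out.println(reader.maxDoc() - reader.numDocs()); // number of tombstoned documents
reader.close();
indexWriter.forceMergeDeletes(); // optionally force merging away deleted documents now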

