Lucene: An Index Database

Lucene's reputation as an enterprise-grade search engine can make it seem intimidating, but at heart it is a program meant to be used, not something to leave you at a loss. This article walks through Lucene from the source code up, using the latest version at the time of writing (August 11, 2018), so please bear with me if parts of it become outdated later.

Lucene is a subproject of the Apache Software Foundation's Jakarta project: an open-source full-text search toolkit. It is not a complete full-text search engine but an architecture for one, providing a complete query engine and indexing engine plus partial text-analysis support (for English and German, two Western languages). Its goal is to give developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete search engine on top of it. Lucene exposes a simple yet powerful API for full-text indexing and search, and in the Java world it is a mature, free, open-source tool; it has been the most popular free Java information-retrieval library for years. Information-retrieval libraries are often mentioned in the same breath as search engines, but the two should not be confused.

The core Lucene packages are:

analysis: tokenizers and analyzers

codecs: encoding and decoding of the on-disk index format

document: the storage classes (Document, Field)

geo: described on the official site as one of Lucene's low-level data structures; I may analyze it in detail when time allows

index: the indexing classes (IndexWriter and friends)

search: the search classes (IndexSearcher and the Query family)

store: the Directory storage implementations, most commonly

RAMDirectory: in-memory storage

FSDirectory: on-disk storage

MMapDirectory: memory-mapped (virtual memory) storage

util: utility classes (including BytesRef, which we will meet below)

Below is the example code that ships with the official documentation; let's walk through it briefly.
First we define an analyzer. This part can be swapped for an analyzer of your own; Lucene also
provides a jar with OpenNLP support, though I have not tried it.
    Analyzer analyzer = new StandardAnalyzer();
Here we choose a storage implementation. Three types cover most use cases:

RAMDirectory: in-memory storage

FSDirectory: on-disk storage; note that this one must be given the path where the index files live

MMapDirectory: memory-mapped (virtual memory) storage

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
   
Next, define a writer configuration and hand it the analyzer we defined above:
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
   
The IndexWriter is the class that actually writes to the index:
    IndexWriter iwriter = new IndexWriter(directory, config);
Now we reach the key part: the Document.

    Document doc = new Document();
    String text = "This is the text to be indexed.";
The first argument determines which field name queries will run against, the second is the
field's content, and the third controls whether the value is stored in Lucene (I will cover
the third argument in detail later).

    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
With the data added to the Document, the IndexWriter defined above can write it into our
Lucene "database":

    iwriter.addDocument(doc);
One thing to watch: always close the writer as soon as you are done with it. Leaving it open
holds on to resources, and JVM monitoring will show problems the program can ill afford:

    iwriter.close();
Everything below is Lucene's query side.
    // Now search the index:
First, open a reader over the Directory that the index was written to:

    DirectoryReader ireader = DirectoryReader.open(directory);
 
With the reader open, we can create the IndexSearcher that will run our searches:
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    
QueryParser is the query interface Lucene wraps for us; it covers many query forms. Its
constructor takes the default field name to query and an analyzer. Choose the analyzer with
care here; for Chinese text, segmenters such as Hylanda (海量), word, or HanLP are worth
considering.

    QueryParser parser = new QueryParser("fieldname", analyzer);

This is the point where your business logic supplies the query string:
    Query query = parser.parse("text");

Note the query syntax here: it allows composing multiple clauses into one query, which I will
cover in detail later. The call returns the matching documents from our index, each paired
with a relevance score:
    ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
    
Assert that exactly one document came back:
    assertEquals(1, hits.length);
    // Iterate through the results:
Iterate over the hits, converting each one back into a Document:

    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
Always close the reader and the directory objects, or you risk leaking resources:

    ireader.close();
    directory.close();
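
Putting the fragments together, here is a consolidated, runnable version of the walkthrough above. It is a sketch assuming Lucene 7.x with the lucene-core, lucene-analyzers-common and lucene-queryparser jars on the classpath, and it prints the hit instead of asserting:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class HelloLuceneDemo {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            Directory directory = new RAMDirectory(); // in-memory; swap in FSDirectory for disk

            // Index a single document; try-with-resources closes the writer for us.
            try (IndexWriter iwriter = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new Field("fieldname", "This is the text to be indexed.", TextField.TYPE_STORED));
                iwriter.addDocument(doc);
            }

            // Search it back.
            try (DirectoryReader ireader = DirectoryReader.open(directory)) {
                IndexSearcher isearcher = new IndexSearcher(ireader);
                Query query = new QueryParser("fieldname", analyzer).parse("text");
                ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(isearcher.doc(hit.doc).get("fieldname"));
                }
            }
            directory.close();
        }
    }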

Now let me paste the relevant Lucene source, which you can adapt to your own needs.

All technology exists to serve people, not to sit out of reach, and Lucene is no exception. At its core, Lucene is a database, and the heart of that database is the Document, a simple data structure. Here is the source of org.apache.lucene.document.Document:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.document;


import java.util.*;

import org.apache.lucene.index.IndexReader;  // for javadoc
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.search.IndexSearcher;  // for javadoc
import org.apache.lucene.search.ScoreDoc; // for javadoc
import org.apache.lucene.util.BytesRef;

/** Documents are the unit of indexing and search.
 *
 * A Document is a set of fields.  Each field has a name and a textual value.
 * A field may be {@link org.apache.lucene.index.IndexableFieldType#stored() stored} with the document, in which
 * case it is returned with search hits on the document.  Thus each document
 * should typically contain one or more stored fields which uniquely identify
 * it.
 *
 * <p>Note that fields which are <i>not</i> {@link org.apache.lucene.index.IndexableFieldType#stored() stored} are
 * <i>not</i> available in documents retrieved from the index, e.g. with {@link
 * ScoreDoc#doc} or {@link IndexReader#document(int)}.
 */

public final class Document implements Iterable<IndexableField> {

  private final List<IndexableField> fields = new ArrayList<>();

  /** Constructs a new document with no fields. */
  public Document() {}

  @Override
  public Iterator<IndexableField> iterator() {
    return fields.iterator();
  }

  /**
   * <p>Adds a field to a document.  Several fields may be added with
   * the same name.  In this case, if the fields are indexed, their text is
   * treated as though appended for the purposes of search.</p>
   * <p> Note that add like the removeField(s) methods only makes sense 
   * prior to adding a document to an index. These methods cannot
   * be used to change the content of an existing index! In order to achieve this,
   * a document has to be deleted from an index and a new changed version of that
   * document has to be added.</p>
   */
  public final void add(IndexableField field) {
    fields.add(field);
  }
  
  /**
   * <p>Removes field with the specified name from the document.
   * If multiple fields exist with this name, this method removes the first field that has been added.
   * If there is no field with the specified name, the document remains unchanged.</p>
   * <p> Note that the removeField(s) methods like the add method only make sense 
   * prior to adding a document to an index. These methods cannot
   * be used to change the content of an existing index! In order to achieve this,
   * a document has to be deleted from an index and a new changed version of that
   * document has to be added.</p>
   */
  public final void removeField(String name) {
    Iterator<IndexableField> it = fields.iterator();
    while (it.hasNext()) {
      IndexableField field = it.next();
      if (field.name().equals(name)) {
        it.remove();
        return;
      }
    }
  }
  
  /**
   * <p>Removes all fields with the given name from the document.
   * If there is no field with the specified name, the document remains unchanged.</p>
   * <p> Note that the removeField(s) methods like the add method only make sense 
   * prior to adding a document to an index. These methods cannot
   * be used to change the content of an existing index! In order to achieve this,
   * a document has to be deleted from an index and a new changed version of that
   * document has to be added.</p>
   */
  public final void removeFields(String name) {
    Iterator<IndexableField> it = fields.iterator();
    while (it.hasNext()) {
      IndexableField field = it.next();
      if (field.name().equals(name)) {
        it.remove();
      }
    }
  }


  /**
  * Returns an array of byte arrays for of the fields that have the name specified
  * as the method parameter.  This method returns an empty
  * array when there are no matching fields.  It never
  * returns null.
  *
  * @param name the name of the field
  * @return a <code>BytesRef[]</code> of binary field values
  */
  public final BytesRef[] getBinaryValues(String name) {
    final List<BytesRef> result = new ArrayList<>();
    for (IndexableField field : fields) {
      if (field.name().equals(name)) {
        final BytesRef bytes = field.binaryValue();
        if (bytes != null) {
          result.add(bytes);
        }
      }
    }
  
    return result.toArray(new BytesRef[result.size()]);
  }
  
  /**
  * Returns an array of bytes for the first (or only) field that has the name
  * specified as the method parameter. This method will return <code>null</code>
  * if no binary fields with the specified name are available.
  * There may be non-binary fields with the same name.
  *
  * @param name the name of the field.
  * @return a <code>BytesRef</code> containing the binary field value or <code>null</code>
  */
  public final BytesRef getBinaryValue(String name) {
    for (IndexableField field : fields) {
      if (field.name().equals(name)) {
        final BytesRef bytes = field.binaryValue();
        if (bytes != null) {
          return bytes;
        }
      }
    }
    return null;
  }

  /** Returns a field with the given name if any exist in this document, or
   * null.  If multiple fields exists with this name, this method returns the
   * first value added.
   */
  public final IndexableField getField(String name) {
    for (IndexableField field : fields) {
      if (field.name().equals(name)) {
        return field;
      }
    }
    return null;
  }

  /**
   * Returns an array of {@link IndexableField}s with the given name.
   * This method returns an empty array when there are no
   * matching fields.  It never returns null.
   *
   * @param name the name of the field
   * @return a <code>IndexableField[]</code> array
   */
  public IndexableField[] getFields(String name) {
    List<IndexableField> result = new ArrayList<>();
    for (IndexableField field : fields) {
      if (field.name().equals(name)) {
        result.add(field);
      }
    }

    return result.toArray(new IndexableField[result.size()]);
  }
  
  /** Returns a List of all the fields in a document.
   * <p>Note that fields which are <i>not</i> stored are
   * <i>not</i> available in documents retrieved from the
   * index, e.g. {@link IndexSearcher#doc(int)} or {@link
   * IndexReader#document(int)}.
   */
  public final List<IndexableField> getFields() {
    return fields;
  }
  
   private final static String[] NO_STRINGS = new String[0];

  /**
   * Returns an array of values of the field specified as the method parameter.
   * This method returns an empty array when there are no
   * matching fields.  It never returns null.
   * For {@link IntField}, {@link LongField}, {@link
   * FloatField} and {@link DoubleField} it returns the string value of the number. If you want
   * the actual numeric field instances back, use {@link #getFields}.
   * @param name the name of the field
   * @return a <code>String[]</code> of field values
   */
  public final String[] getValues(String name) {
    List<String> result = new ArrayList<>();
    for (IndexableField field : fields) {
      if (field.name().equals(name) && field.stringValue() != null) {
        result.add(field.stringValue());
      }
    }
    
    if (result.size() == 0) {
      return NO_STRINGS;
    }
    
    return result.toArray(new String[result.size()]);
  }

  /** Returns the string value of the field with the given name if any exist in
   * this document, or null.  If multiple fields exist with this name, this
   * method returns the first value added. If only binary fields with this name
   * exist, returns null.
   * For {@link IntField}, {@link LongField}, {@link
   * FloatField} and {@link DoubleField} it returns the string value of the number. If you want
   * the actual numeric field instance back, use {@link #getField}.
   */
  public final String get(String name) {
    for (IndexableField field : fields) {
      if (field.name().equals(name) && field.stringValue() != null) {
        return field.stringValue();
      }
    }
    return null;
  }
  
  /** Prints the fields of a document for human consumption. */
  @Override
  public final String toString() {
    StringBuilder buffer = new StringBuilder();
    buffer.append("Document<");
    for (int i = 0; i < fields.size(); i++) {
      IndexableField field = fields.get(i);
      buffer.append(field.toString());
      if (i != fields.size()-1) {
        buffer.append(" ");
      }
    }
    buffer.append(">");
    return buffer.toString();
  }
}

From the source we can see that Document is iterable over its fields and involves two rather special classes, which we look at next.

A book has a table of contents and a body, and Lucene happens to implement exactly that structure.

Field

A Field is the unit of storage: into it we write the field name to be indexed, the field's data, and whether the value should be stored.
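
As a quick sketch of the common choices (the field names here are just illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class FieldChoices {
        static Document bookDoc() {
            Document doc = new Document();
            // Tokenized, indexed and stored: searchable, and hitDoc.get("title") returns the value.
            doc.add(new Field("title", "Lucene in Action", TextField.TYPE_STORED));
            // Tokenized and indexed but NOT stored: searchable, yet hitDoc.get("body") returns null.
            doc.add(new Field("body", "a long body text ...", TextField.TYPE_NOT_STORED));
            // Indexed as a single untokenized term: good for exact-match identifiers.
            doc.add(new StringField("isbn", "978-1933988177", Field.Store.YES));
            return doc;
        }
    }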

getBinaryValues

This one is plain enough: it returns the collection of binary values stored under the field name we ask for.
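
For example (a sketch; StoredField stores raw bytes on the document without indexing them):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.util.BytesRef;

    public class BinaryValuesDemo {
        public static void main(String[] args) {
            Document doc = new Document();
            // The same field name may be added more than once.
            doc.add(new StoredField("payload", new byte[]{1, 2, 3}));
            doc.add(new StoredField("payload", new byte[]{4, 5, 6}));

            BytesRef[] values = doc.getBinaryValues("payload"); // length 2, never null
            for (BytesRef ref : values) {
                // The valid bytes are ref.bytes[ref.offset .. ref.offset + ref.length - 1].
                System.out.println(ref); // prints a hex dump, e.g. [1 2 3]
            }
        }
    }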

BytesRef[]

A bit puzzled by this return type? Let's look at its source.

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.util;


import java.util.Arrays;
import java.util.Comparator;

/** Represents byte[], as a slice (offset + length) into an
 *  existing byte[].  The {@link #bytes} member should never be null;
 *  use {@link #EMPTY_BYTES} if necessary.
 *
 * <p><b>Important note:</b> Unless otherwise noted, Lucene uses this class to
 * represent terms that are encoded as <b>UTF8</b> bytes in the index. To
 * convert them to a Java {@link String} (which is UTF16), use {@link #utf8ToString}.
 * Using code like {@code new String(bytes, offset, length)} to do this
 * is <b>wrong</b>, as it does not respect the correct character set
 * and may return wrong results (depending on the platform's defaults)!
 */
public final class BytesRef implements Comparable<BytesRef>,Cloneable {
  /** An empty byte array for convenience */
  public static final byte[] EMPTY_BYTES = new byte[0]; 

  /** The contents of the BytesRef. Should never be {@code null}. */
  public byte[] bytes;

  /** Offset of first valid byte. */
  public int offset;

  /** Length of used bytes. */
  public int length;

  /** Create a BytesRef with {@link #EMPTY_BYTES} */
  public BytesRef() {
    this(EMPTY_BYTES);
  }

  /** This instance will directly reference bytes w/o making a copy.
   * bytes should not be null.
   */
  public BytesRef(byte[] bytes, int offset, int length) {
    this.bytes = bytes;
    this.offset = offset;
    this.length = length;
    assert isValid();
  }

  /** This instance will directly reference bytes w/o making a copy.
   * bytes should not be null */
  public BytesRef(byte[] bytes) {
    this(bytes, 0, bytes.length);
  }

  /** 
   * Create a BytesRef pointing to a new array of size <code>capacity</code>.
   * Offset and length will both be zero.
   */
  public BytesRef(int capacity) {
    this.bytes = new byte[capacity];
  }

  /**
   * Initialize the byte[] from the UTF8 bytes
   * for the provided String.  
   * 
   * @param text This must be well-formed
   * unicode text, with no unpaired surrogates.
   */
  public BytesRef(CharSequence text) {
    this(new byte[UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR * text.length()]);
    length = UnicodeUtil.UTF16toUTF8(text, 0, text.length(), bytes);
  }
  
  /**
   * Expert: compares the bytes against another BytesRef,
   * returning true if the bytes are equal.
   * 
   * @param other Another BytesRef, should not be null.
   * @lucene.internal
   */
  public boolean bytesEquals(BytesRef other) {
    assert other != null;
    if (length == other.length) {
      int otherUpto = other.offset;
      final byte[] otherBytes = other.bytes;
      final int end = offset + length;
      for(int upto=offset;upto<end;upto++,otherUpto++) {
        if (bytes[upto] != otherBytes[otherUpto]) {
          return false;
        }
      }
      return true;
    } else {
      return false;
    }
  }

  /**
   * Returns a shallow clone of this instance (the underlying bytes are
   * <b>not</b> copied and will be shared by both the returned object and this
   * object.
   * 
   * @see #deepCopyOf
   */
  @Override
  public BytesRef clone() {
    return new BytesRef(bytes, offset, length);
  }
  
  /** Calculates the hash code as required by TermsHash during indexing.
   *  <p> This is currently implemented as MurmurHash3 (32
   *  bit), using the seed from {@link
   *  StringHelper#GOOD_FAST_HASH_SEED}, but is subject to
   *  change from release to release. */
  @Override
  public int hashCode() {
    return StringHelper.murmurhash3_x86_32(this, StringHelper.GOOD_FAST_HASH_SEED);
  }

  @Override
  public boolean equals(Object other) {
    if (other == null) {
      return false;
    }
    if (other instanceof BytesRef) {
      return this.bytesEquals((BytesRef) other);
    }
    return false;
  }

  /** Interprets stored bytes as UTF8 bytes, returning the
   *  resulting string */
  public String utf8ToString() {
    final char[] ref = new char[length];
    final int len = UnicodeUtil.UTF8toUTF16(bytes, offset, length, ref);
    return new String(ref, 0, len);
  }

  /** Returns hex encoded bytes, eg [0x6c 0x75 0x63 0x65 0x6e 0x65] */
  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append('[');
    final int end = offset + length;
    for(int i=offset;i<end;i++) {
      if (i > offset) {
        sb.append(' ');
      }
      sb.append(Integer.toHexString(bytes[i]&0xff));
    }
    sb.append(']');
    return sb.toString();
  }

  /** Unsigned byte order comparison */
  @Override
  public int compareTo(BytesRef other) {
    return utf8SortedAsUnicodeSortOrder.compare(this, other);
  }
  
  private final static Comparator<BytesRef> utf8SortedAsUnicodeSortOrder = new UTF8SortedAsUnicodeComparator();

  public static Comparator<BytesRef> getUTF8SortedAsUnicodeComparator() {
    return utf8SortedAsUnicodeSortOrder;
  }

  private static class UTF8SortedAsUnicodeComparator implements Comparator<BytesRef> {
    // Only singleton
    private UTF8SortedAsUnicodeComparator() {};

    @Override
    public int compare(BytesRef a, BytesRef b) {
      final byte[] aBytes = a.bytes;
      int aUpto = a.offset;
      final byte[] bBytes = b.bytes;
      int bUpto = b.offset;
      
      final int aStop = aUpto + Math.min(a.length, b.length);
      while(aUpto < aStop) {
        int aByte = aBytes[aUpto++] & 0xff;
        int bByte = bBytes[bUpto++] & 0xff;

        int diff = aByte - bByte;
        if (diff != 0) {
          return diff;
        }
      }

      // One is a prefix of the other, or, they are equal:
      return a.length - b.length;
    }    
  }

  /** @deprecated This comparator is only a transition mechanism */
  @Deprecated
  private final static Comparator<BytesRef> utf8SortedAsUTF16SortOrder = new UTF8SortedAsUTF16Comparator();

  /** @deprecated This comparator is only a transition mechanism */
  @Deprecated
  public static Comparator<BytesRef> getUTF8SortedAsUTF16Comparator() {
    return utf8SortedAsUTF16SortOrder;
  }

  /** @deprecated This comparator is only a transition mechanism */
  @Deprecated
  private static class UTF8SortedAsUTF16Comparator implements Comparator<BytesRef> {
    // Only singleton
    private UTF8SortedAsUTF16Comparator() {};

    @Override
    public int compare(BytesRef a, BytesRef b) {

      final byte[] aBytes = a.bytes;
      int aUpto = a.offset;
      final byte[] bBytes = b.bytes;
      int bUpto = b.offset;
      
      final int aStop;
      if (a.length < b.length) {
        aStop = aUpto + a.length;
      } else {
        aStop = aUpto + b.length;
      }

      while(aUpto < aStop) {
        int aByte = aBytes[aUpto++] & 0xff;
        int bByte = bBytes[bUpto++] & 0xff;

        if (aByte != bByte) {

          // See http://icu-project.org/docs/papers/utf16_code_point_order.html#utf-8-in-utf-16-order

          // We know the terms are not equal, but, we may
          // have to carefully fixup the bytes at the
          // difference to match UTF16's sort order:
          
          // NOTE: instead of moving supplementary code points (0xee and 0xef) to the unused 0xfe and 0xff, 
          // we move them to the unused 0xfc and 0xfd [reserved for future 6-byte character sequences]
          // this reserves 0xff for preflex's term reordering (surrogate dance), and if unicode grows such
          // that 6-byte sequences are needed we have much bigger problems anyway.
          if (aByte >= 0xee && bByte >= 0xee) {
            if ((aByte & 0xfe) == 0xee) {
              aByte += 0xe;
            }
            if ((bByte&0xfe) == 0xee) {
              bByte += 0xe;
            }
          }
          return aByte - bByte;
        }
      }

      // One is a prefix of the other, or, they are equal:
      return a.length - b.length;
    }
  }
  
  /**
   * Creates a new BytesRef that points to a copy of the bytes from 
   * <code>other</code>
   * <p>
   * The returned BytesRef will have a length of other.length
   * and an offset of zero.
   */
  public static BytesRef deepCopyOf(BytesRef other) {
    BytesRef copy = new BytesRef();
    copy.bytes = Arrays.copyOfRange(other.bytes, other.offset, other.offset + other.length);
    copy.offset = 0;
    copy.length = other.length;
    return copy;
  }
  
  /** 
   * Performs internal consistency checks.
   * Always returns true (or throws IllegalStateException) 
   */
  public boolean isValid() {
    if (bytes == null) {
      throw new IllegalStateException("bytes is null");
    }
    if (length < 0) {
      throw new IllegalStateException("length is negative: " + length);
    }
    if (length > bytes.length) {
      throw new IllegalStateException("length is out of bounds: " + length + ",bytes.length=" + bytes.length);
    }
    if (offset < 0) {
      throw new IllegalStateException("offset is negative: " + offset);
    }
    if (offset > bytes.length) {
      throw new IllegalStateException("offset out of bounds: " + offset + ",bytes.length=" + bytes.length);
    }
    if (offset + length < 0) {
      throw new IllegalStateException("offset+length is negative: offset=" + offset + ",length=" + length);
    }
    if (offset + length > bytes.length) {
      throw new IllegalStateException("offset+length out of bounds: offset=" + offset + ",length=" + length + ",bytes.length=" + bytes.length);
    }
    return true;
  }
}

Here we can see that BytesRef is the tool that encodes and decodes the byte slices in our "database"; honestly, this is the layer I am currently stuck on myself.
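
A tiny sketch of the point that javadoc is making:

    import org.apache.lucene.util.BytesRef;

    public class BytesRefDemo {
        public static void main(String[] args) {
            BytesRef ref = new BytesRef("lucene");   // encodes the String as UTF-8 bytes
            System.out.println(ref);                 // hex dump: [6c 75 63 65 6e 65]
            System.out.println(ref.utf8ToString());  // "lucene" -- the right way back to a String

            // Wrong: new String(ref.bytes, ref.offset, ref.length) uses the platform's
            // default charset and can mangle non-ASCII terms, exactly as the javadoc warns.
        }
    }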

Next we will go into detail on how QueryParser should sensibly be used.

The official site introduces it as the query parser and parsing framework; so how does QueryParser behave in practice?
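
Here is a brief sketch (assuming the lucene-queryparser module) of what the classic QueryParser does with a few query strings; each one parses into one of the concrete Query types the class below constructs by hand:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;

    public class QueryParserDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
            System.out.println(parser.parse("lucene"));               // TermQuery
            System.out.println(parser.parse("+title:lucene +index")); // BooleanQuery (two MUST clauses)
            System.out.println(parser.parse("name:[a TO c]"));        // TermRangeQuery
            System.out.println(parser.parse("luc*"));                 // PrefixQuery
            System.out.println(parser.parse("te?t"));                 // WildcardQuery
            System.out.println(parser.parse("lucene~2"));             // FuzzyQuery, max edit distance 2
            System.out.println(parser.parse("\"apache lucene\"~3"));  // PhraseQuery with slop 3
        }
    }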

The class below complements that: it hand-builds the same concrete Query types that QueryParser produces from query strings.

package HelloLucene;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.sandbox.queries.regex.RegexQuery;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public final class SelectLucene {
    private final static String dir = "./lucene";

    /**
     * Term search (look up a single keyword).
     * The key class is TermQuery, used like this:
     *   Term term = new Term(fieldName, keyword);
     *   Query query = new TermQuery(term);
     *   TopDocs hits = searcher.search(query, n);
     *
     * @param file    field name to search
     * @param keyWord keyword to search for
     * @return list of matching documents
     * @throws Exception on index access errors
     */
    public List<Document> termQuery(String file, String keyWord) throws Exception {
        Directory directory = FSDirectory.open( Paths.get( dir ) );
        IndexReader reader = DirectoryReader.open( directory );
        IndexSearcher searcher = new IndexSearcher( reader );
        Term term = new Term( file, keyWord );
        Query query = new TermQuery( term );
        TopDocs topDocs = searcher.search( query, 1000 );
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // Hits already arrive sorted by relevance score (scDoc.score). A map keyed
        // by score would drop documents that share a score, so collect into a List.
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scDoc : scoreDocs) {
            documents.add( searcher.doc( scDoc.doc ) );
        }
        reader.close();
        return documents;
    }

    /**
     * Boolean search (combine several keyword clauses).
     *
     * @param combinatorialSearches the clauses to combine
     * @param backCount             maximum number of hits to return
     * @throws Exception on index access errors
     */
    public List<Document> booleanQuery(List<CombinatorialSearch> combinatorialSearches, int backCount) throws Exception {
        Directory directory = FSDirectory.open( Paths.get( dir ) );
        IndexReader reader = DirectoryReader.open( directory );
        IndexSearcher searcher = new IndexSearcher( reader );
        List<BooleanClause> booleanClauses = new ArrayList<>();
        combinatorialSearches.forEach( combinatorialSearch -> {
            Query query1 = new TermQuery( new Term( combinatorialSearch.getFileName(), combinatorialSearch.getContext() ) );
            BooleanClause bc1 = new BooleanClause( query1, combinatorialSearch.getStrategy() );
            booleanClauses.add( bc1 );

        } );
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        booleanClauses.forEach( builder::add );
        BooleanQuery boolQuery = builder.build();
        TopDocs topDocs = searcher.search( boolQuery, backCount );
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // As in termQuery: collect the hits in score order into a List.
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scDoc : scoreDocs) {
            documents.add( searcher.doc( scDoc.doc ) );
        }

        reader.close();
        return documents;
    }

    /**
     * Range search (match terms within a given range).
     * The key class is TermRangeQuery, used like this:
     *   TermRangeQuery rangeQuery = new TermRangeQuery(fieldName, lowerTerm, upperTerm,
     *                                                  includeLower, includeUpper);
     * The two boolean parameters control whether each bound is inclusive:
     *   true  = bound included
     *   false = bound excluded
     *
     * @param rangeSearch range query parameters
     * @throws Exception on index access errors
     */

    public List<Document> rangeQuery(RangeSearch rangeSearch) throws Exception {
        Directory directory = FSDirectory.open( Paths.get( dir ) );
        IndexReader reader = DirectoryReader.open( directory );
        IndexSearcher searcher = new IndexSearcher( reader );
        TermRangeQuery timeQuery = new TermRangeQuery( rangeSearch.getSearchName(), new BytesRef( rangeSearch.getLimitLow() ), new BytesRef( rangeSearch.getLimitHigh() ), rangeSearch.getLowerBoundBoundary(), rangeSearch.getUpperBoundBoundary() );
        TopDocs topDocs = searcher.search( timeQuery, 1000 );
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scDoc : scoreDocs) {
            Document document = searcher.doc( scDoc.doc );
            documents.add( document );
        }
        reader.close();
        return documents;
    }

    /**
     * Prefix search (match terms that start with the given text).
     * The key class is PrefixQuery, used like this:
     *   Term term = new Term(fieldName, prefix);
     *   PrefixQuery prefixQuery = new PrefixQuery(term);
     *
     * @param fileName field name to search
     * @param text     prefix to match
     * @throws Exception on index access errors
     */

    public List<Document> prefixQuery(String fileName, String text) throws Exception {
        Directory directory = FSDirectory.open( Paths.get( dir ) );
        IndexReader reader = DirectoryReader.open( directory );
        IndexSearcher searcher = new IndexSearcher( reader );
        Term term = new Term( fileName, text );
        PrefixQuery prefixquery = new PrefixQuery( term );
        TopDocs topDocs = searcher.search( prefixquery, 1000 );
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scDoc : scoreDocs) {
            Document document = searcher.doc( scDoc.doc );
            documents.add( document );
        }
        reader.close();
        return documents;

    }

    /**
     * Phrase search (combine fragments into a phrase).
     * The slop (here, the position argument given to Builder.add) bounds the
     * gap allowed between two keywords.
     *
     * @param RecoverCount maximum number of hits to return
     * @throws Exception on index access errors
     */

    public List<Document> phraseQuery(List<PhraseSearch> phraseSearches, int RecoverCount) throws Exception {
        Directory directory = FSDirectory.open( Paths.get( dir ) );
        IndexReader reader = DirectoryReader.open( directory );
        IndexSearcher searcher = new IndexSearcher( reader );
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        phraseSearches.forEach( phraseSearch -> builder.add( new Term( phraseSearch.getFileName(), phraseSearch.getContext() ), phraseSearch.getInterval() ) );
        PhraseQuery pq = builder.build();
        TopDocs topDocs = searcher.search( pq, RecoverCount );
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scDoc : scoreDocs) {
            Document document = searcher.doc( scDoc.doc );
            documents.add( document );
        }
        reader.close();
        return documents;
    }

    /**
     * Multi-phrase search (fix a leading term, then allow several alternatives
     * for the position that follows it).
     * The key class is MultiPhraseQuery, built via its Builder in current Lucene:
     *   Term term  = new Term(fieldName, leadingTerm);
     *   Term term1 = new Term(fieldName, keyword1);
     *   Term term2 = new Term(fieldName, keyword2);
     *   MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
     *   builder.add(term);
     *   builder.add(new Term[]{term1, term2});
     *   MultiPhraseQuery query = builder.build();
     *
     * @throws Exception on index access errors
     */

    public List<Document> multiPhraseQuery() throws Exception {
        Directory directory = FSDirectory.open( Paths.get( dir ) );
        IndexReader reader = DirectoryReader.open( directory );
        IndexSearcher searcher = new IndexSearcher( reader );
        // Search for the combinations "计张" and "计钦": fix a leading term, then
        // allow either of the other terms in the position that follows it.
        Term term = new Term( "name", "计" ); // leading term
        Term term1 = new Term( "name", "张" ); // alternative following term
        Term term2 = new Term( "name", "钦" ); // alternative following term
        // MultiPhraseQuery is immutable in current Lucene; build it via its Builder.
        MultiPhraseQuery.Builder mpqBuilder = new MultiPhraseQuery.Builder();
        mpqBuilder.add( term );
        mpqBuilder.add( new Term[]{term1, term2} );
        MultiPhraseQuery multiPhraseQuery = mpqBuilder.build();
        TopDocs topDocs = searcher.search( multiPhraseQuery, 1000 );
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scDoc : scoreDocs) {
            Document document = searcher.doc( scDoc.doc );
            documents.add( document );
        }
        reader.close();
        return documents;
    }

    /**
     * Fuzzy search (as the name suggests).
     * The key class is FuzzyQuery, used like this:
     *   Term term = new Term(fieldName, keyword);
     *   FuzzyQuery fuzzyQuery = new FuzzyQuery(term, maxEdits);
     * Note: in old Lucene versions the second argument was a float similarity
     * below 1 (e.g. 0.5f); in current versions it is the maximum edit distance,
     * an int no greater than 2.
     *
     * @throws Exception on index access errors
     */

    public List<Document> fuzzyQuery(FuzzySearch fuzzySearch, int returnCount) throws Exception {
        Directory directory = FSDirectory.open( Paths.get( dir ) );
        IndexReader reader = DirectoryReader.open( directory );
        IndexSearcher searcher = new IndexSearcher( reader );
        Term term = new Term( fuzzySearch.getFileName(), fuzzySearch.getContext() );
        FuzzyQuery fuzzyquery = new FuzzyQuery( term, fuzzySearch.getFuzziness() );
        TopDocs topDocs = searcher.search( fuzzyquery, returnCount );
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scDoc : scoreDocs) {
            Document document = searcher.doc( scDoc.doc );
            documents.add( document );
        }
        reader.close();
        return documents;
    }

    /**
     * Wildcard search (as the name suggests).
     * The key class is WildcardQuery, used like this:
     *   Term term = new Term(fieldName, keyword + wildcard);
     *   WildcardQuery wildcardQuery = new WildcardQuery(term);
     * Two wildcards are supported:
     *   * matches any number of characters
     *   ? matches exactly one character
     *
     * @throws Exception on index access errors
     */
    public List<Document> wildcardQuery() throws Exception {
        Directory directory = FSDirectory.open( Paths.get( dir ) );
        IndexReader reader = DirectoryReader.open( directory );
        IndexSearcher searcher = new IndexSearcher( reader );
        Term term = new Term( "name", "三?" );
        WildcardQuery wildcardQuery = new WildcardQuery( term );
        TopDocs topDocs = searcher.search( wildcardQuery, 1000 );
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scDoc : scoreDocs) {
            Document document = searcher.doc( scDoc.doc );
            documents.add( document );
        }
        reader.close();
        return documents;
    }

    /**
     * Regular-expression search (the class lives in the sandbox module; older
     * articles point at lucene-queries-3.5.0.jar).
     * The key class is RegexQuery, used like this:
     *   String regex = ".*";
     *   Term term = new Term(fieldName, regex);
     *   RegexQuery query = new RegexQuery(term);
     *   TopDocs hits = searcher.search(query, 100);
     *
     * @throws Exception on index access errors
     */

    public List<Document> regexQuery() throws Exception {
        Directory directory = FSDirectory.open( Paths.get( dir ) );
        IndexReader reader = DirectoryReader.open( directory );
        IndexSearcher searcher = new IndexSearcher( reader );

        String regex = "林*";
        Term term = new Term( "name", regex );
        RegexQuery query = new RegexQuery( term );

        TopDocs topDocs = searcher.search( query, 1000 );

        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scDoc : scoreDocs) {
            Document document = searcher.doc( scDoc.doc );
            documents.add( document );
        }
        reader.close();
        return documents;
    }
}
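
And a hedged usage sketch: CombinatorialSearch, RangeSearch, PhraseSearch and FuzzySearch are the author's own parameter beans rather than Lucene classes, so their getters are assumed from how they are called above:

    import java.util.List;
    import org.apache.lucene.document.Document;

    public class SelectLuceneDemo {
        public static void main(String[] args) throws Exception {
            SelectLucene select = new SelectLucene();
            // Single-keyword lookup against the "name" field of the ./lucene index.
            List<Document> hits = select.termQuery("name", "lucene");
            for (Document d : hits) {
                System.out.println(d.get("name"));
            }
        }
    }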

Reposted from blog.csdn.net/weixin_41046245/article/details/81608187