Lucene Summary (4): Chinese Word Segmentation and Highlighting with Lucene

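The earlier posts only covered English queries, but in practice Chinese queries are more common for us. The underlying principles are the same, yet Chinese text has no spaces between words, so it needs its own analyzer. This post covers Chinese word segmentation and highlighting.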

Chinese word segmentation

First, add the jar for a Chinese analyzer.

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>5.3.1</version>
</dependency>

Prepare the data:

private Integer ids[] = {1,2,3};
private String dates[] = {"昨天","今天","明天"};
private String descs[] = {"昨天是雨天","今天是阴天","明天是晴天"};

Build the index:

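import java.nio.file.Paths;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexTest3 {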

    private Directory dir;

    private Integer ids[] = {1,2,3};
    private String dates[] = {"昨天","今天","明天"};
    private String descs[] = {"昨天是雨天","今天是阴天","明天是晴天"};

    public void index(String indexDir) throws Exception{
        dir = FSDirectory.open(Paths.get(indexDir));
        IndexWriter writer = getWriter();
        for(int i = 0; i < ids.length; i++){
            Document document = new Document();
            document.add(new IntField("id", ids[i], Field.Store.YES));
            document.add(new StringField("date",dates[i], Field.Store.YES));
            document.add(new TextField("desc",descs[i], Field.Store.YES));
            writer.addDocument(document);
        }
        writer.close();
    }

    public IndexWriter getWriter()throws Exception{
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter writer = new IndexWriter(dir, config);
        return writer;
    }

    public static void main(String[] args) throws Exception {
        new IndexTest3().index("D:\\masterSpring\\code\\chapter18\\src\\resource"); // same directory the searcher reads
    }
}

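This is almost the same as in the earlier posts; the only differences are that the data is Chinese and the analyzer is a Chinese analyzer. With the index built, we can now search it.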

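import java.nio.file.Paths;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser; // from the lucene-queryparser artifact
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class testChineseSearcher {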

    public static void search(String indexDir, String q) throws Exception{

        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        QueryParser parser = new QueryParser("desc", analyzer);
        Query query = parser.parse(q);

        long startTime = System.currentTimeMillis();
        TopDocs hits = searcher.search(query, 10);
        long endTime = System.currentTimeMillis();
        System.out.println("匹配" + q + "共耗时" + (endTime-startTime) + "毫秒"); // "matching <q> took <n> ms"
        System.out.println("查询到" + hits.totalHits + "条记录"); // "found <n> records"

        for(ScoreDoc scoreDoc : hits.scoreDocs){ // iterate over each hit
            Document doc = searcher.doc(scoreDoc.doc); // scoreDoc.doc is the docID; use it to load the document
            System.out.println(doc.get("date"));
            System.out.println(doc.get("desc"));
        }
        reader.close();

    }
    public static void main(String[] args) {
        String indexDir = "D:\\masterSpring\\code\\chapter18\\src\\resource";
        String q = "今天"; // the query string ("today")
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The result is as follows (the first two lines read "matching 今天 took 41 ms" and "found 1 record"):

匹配今天共耗时41毫秒
查询到1条记录
今天
今天是阴天
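
The query matches because the analyzer split 今天是阴天 into words at index time, and 今天 is one of them. To make that idea concrete, here is a minimal, stdlib-only sketch. It is not how SmartChineseAnalyzer works internally (that analyzer uses a statistical model); it swaps in a much simpler technique, forward maximum matching, and the class name `ToySegmenter` and its tiny dictionary are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward-maximum-matching segmenter; NOT the algorithm SmartChineseAnalyzer uses. */
class ToySegmenter {
    // Hypothetical mini dictionary, just large enough for the sample data.
    private static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "昨天", "今天", "明天", "雨天", "阴天", "晴天"));
    private static final int MAX_WORD_LEN = 2; // length of the longest dictionary entry

    /** At each position take the longest dictionary word, else a single character. */
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + MAX_WORD_LEN);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > i + 1 && !DICT.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segment("今天是阴天")); // [今天, 是, 阴天]
    }
}
```

With tokens like these in the index, a query for 今天 hits only the second document, which is consistent with the single record found above.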

Next, highlighting.

Highlighting

As before, first add the jar:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>5.3.1</version>
</dependency>

Highlighting is tested by extending the search code above.

public class testChineseSearcher {

    public static void search(String indexDir, String q) throws Exception{

        // (the code above this point is unchanged and omitted here)
        System.out.println("查询到" + hits.totalHits + "条记录"); // "found <n> records"

        SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<b><font color=green>","</font></b>"); // with no arguments the default is bold, i.e. <B></B>
        QueryScorer scorer = new QueryScorer(query); // scores text by the query terms it contains
        Fragmenter fragmenter = new SimpleSpanFragmenter(scorer); // splits the text into fragments, using the scorer's span information
        Highlighter highlighter = new Highlighter(simpleHTMLFormatter, scorer);
        highlighter.setTextFragmenter(fragmenter); // set the fragmenter that cuts the snippet to display

        for(ScoreDoc scoreDoc : hits.scoreDocs){ // iterate over each hit
            Document doc = searcher.doc(scoreDoc.doc); // scoreDoc.doc is the docID; use it to load the document
            System.out.println(doc.get("date"));
            System.out.println(doc.get("desc"));
            String desc = doc.get("desc");
            // highlight the matched terms
            if(desc != null) {
                TokenStream tokenStream = analyzer.tokenStream("desc", new StringReader(desc));
                String summary = highlighter.getBestFragment(tokenStream, desc);
                System.out.println(summary);
            }
        }
        reader.close();

    }
    public static void main(String[] args) {
        String indexDir = "D:\\masterSpring\\code\\chapter18\\src\\resource";
        String q = "今天"; // the query string ("today")
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The result is as follows (the last line is the highlighted snippet):

匹配今天共耗时35毫秒
查询到1条记录
今天
今天是阴天
<b><font color=green>今天</font></b>是阴天

As the code shows, SimpleHTMLFormatter defines how a highlight is rendered; the default is bold. Here is its source.

public class SimpleHTMLFormatter implements Formatter {
    private static final String DEFAULT_PRE_TAG = "<B>";
    private static final String DEFAULT_POST_TAG = "</B>";
    private String preTag;
    private String postTag;

    public SimpleHTMLFormatter(String preTag, String postTag) {
        this.preTag = preTag;
        this.postTag = postTag;
    }

    public SimpleHTMLFormatter() {
        this("<B>", "</B>");
    }

    public String highlightTerm(String originalText, TokenGroup tokenGroup) {
        if (tokenGroup.getTotalScore() <= 0.0F) {
            return originalText;
        } else {
            StringBuilder returnBuffer = new StringBuilder(this.preTag.length() + originalText.length() + this.postTag.length());
            returnBuffer.append(this.preTag);
            returnBuffer.append(originalText);
            returnBuffer.append(this.postTag);
            return returnBuffer.toString();
        }
    }
}

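The source shows how SimpleHTMLFormatter works: it checks the total score of each token group. If the score is greater than 0, meaning the text matched a query term, it returns preTag + original text + postTag; otherwise it returns the original text unchanged.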
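
The rule can be tried without Lucene on the classpath. The following is a minimal stand-alone sketch, not the Lucene class: the name `TagWrapSketch` is made up, and the scores passed in are assumed values (1.0 for the matching token, 0 for the others) rather than scores computed by a real QueryScorer.

```java
/** Stand-alone sketch of the tag-wrapping rule; not the Lucene class itself. */
class TagWrapSketch {
    /** Wraps text in the tags only when the score is strictly positive. */
    static String highlightTerm(String text, float totalScore, String preTag, String postTag) {
        if (totalScore <= 0.0f) {
            return text; // no query term matched: leave the text untouched
        }
        return preTag + text + postTag;
    }

    public static void main(String[] args) {
        String pre = "<b><font color=green>";
        String post = "</font></b>";
        // Scores are made up for illustration: 1.0 for the matching token, 0 for the rest.
        String out = highlightTerm("今天", 1.0f, pre, post)
                + highlightTerm("是", 0.0f, pre, post)
                + highlightTerm("阴天", 0.0f, pre, post);
        System.out.println(out); // <b><font color=green>今天</font></b>是阴天
    }
}
```

Note that a score of exactly 0 leaves the text alone, which is why only the matching token ends up wrapped.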

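The score here comes from QueryScorer. For each token in the text it advances a position counter and looks the token text up among the weighted query terms. It returns 0 if the token is not a query term, or if the term is position-sensitive and the position does not match. Otherwise the token's score is the term's weight, and that weight is added to the running total only the first time the term is seen. The method that scores each token is: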

public float getTokenScore() {
    position += posIncAtt.getPositionIncrement(); // track the position of the current token
    String termText = termAtt.toString();

    WeightedSpanTerm weightedSpanTerm;

    if ((weightedSpanTerm = fieldWeightedSpanTerms.get(
            termText)) == null) {
        return 0;
    }

    if (weightedSpanTerm.positionSensitive &&
            !weightedSpanTerm.checkPosition(position)) {
        return 0;
    }

    float score = weightedSpanTerm.getWeight(); // weight of the matched query term

    // found a query term - is it unique in this doc?
    if (!foundTerms.contains(termText)) { // count each distinct term only once
        totalScore += score;
        foundTerms.add(termText);
    }

    return score;
}
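
To see the de-duplication at work, here is a simplified stand-alone sketch of the same bookkeeping. It is not Lucene code: the class `TokenScoreSketch` is hypothetical, the weight of 1.0 is an assumed value, and the position-sensitivity check is left out to keep it short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Stand-alone sketch of per-token scoring with a de-duplicated total; not Lucene code. */
class TokenScoreSketch {
    private final Map<String, Float> termWeights = new HashMap<>(); // query term -> weight
    private final Set<String> foundTerms = new HashSet<>();         // terms already counted
    private float totalScore = 0f;

    TokenScoreSketch(Map<String, Float> weights) {
        termWeights.putAll(weights);
    }

    /** Returns the token's own score and adds it to the total only on first sight. */
    float scoreToken(String termText) {
        Float weight = termWeights.get(termText);
        if (weight == null) {
            return 0f; // not a query term
        }
        if (foundTerms.add(termText)) { // add() returns true only the first time
            totalScore += weight;
        }
        return weight;
    }

    float getTotalScore() {
        return totalScore;
    }

    public static void main(String[] args) {
        Map<String, Float> weights = new HashMap<>();
        weights.put("今天", 1.0f); // hypothetical weight
        TokenScoreSketch scorer = new TokenScoreSketch(weights);
        for (String token : new String[] {"今天", "是", "阴天", "今天"}) {
            System.out.println(token + " -> " + scorer.scoreToken(token));
        }
        System.out.println("total = " + scorer.getTotalScore()); // total = 1.0
    }
}
```

Both occurrences of 今天 get a token score of 1.0, so both would be highlighted, but the total stays at 1.0 because the term is counted once.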

That covers Chinese word segmentation and highlighting.


I hope this helps. If you have questions or ideas, feel free to share them so we can improve together.


Reposted from blog.csdn.net/RebelHero/article/details/80231966