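The previous posts all used English queries, but in practice Chinese queries are what we need most often. Chinese has to be handled differently from English because Chinese text has no spaces between words, so it needs a dedicated analyzer; the underlying principles are the same. This post therefore covers Chinese word segmentation and highlighting.
Chinese word segmentation
First, add the jar for a Chinese analyzer: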
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-smartcn</artifactId>
<version>5.3.1</version>
</dependency>
Prepare the data:
private Integer ids[] = {1,2,3};
private String dates[] = {"昨天","今天","明天"};
private String descs[] = {"昨天是雨天","今天是阴天","明天是晴天"};
Build the index:
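import java.nio.file.Paths;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexTest3 {
    private Directory dir;
    private Integer ids[] = {1,2,3};
    private String dates[] = {"昨天","今天","明天"};
    private String descs[] = {"昨天是雨天","今天是阴天","明天是晴天"};

    // Not annotated with @Test: a JUnit test method cannot take parameters, and main() calls this directly.
    public void index(String indexDir) throws Exception {
        dir = FSDirectory.open(Paths.get(indexDir));
        IndexWriter writer = getWriter();
        for (int i = 0; i < ids.length; i++) {
            Document document = new Document();
            document.add(new IntField("id", ids[i], Field.Store.YES));
            // StringField is indexed as a single token; TextField is run through the analyzer.
            document.add(new StringField("date", dates[i], Field.Store.YES));
            document.add(new TextField("desc", descs[i], Field.Store.YES));
            writer.addDocument(document);
        }
        writer.close();
    }

    public IndexWriter getWriter() throws Exception {
        // Use the Chinese analyzer so that the desc field is split into Chinese words.
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter writer = new IndexWriter(dir, config);
        return writer;
    }

    public static void main(String[] args) throws Exception {
        // The searcher must later be pointed at this same directory.
        new IndexTest3().index("D:\\resource");
    }
}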
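This is much the same as in the earlier posts; the only differences are that the data is Chinese and the analyzer is a Chinese one. With the index built, we can now search it.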
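import java.nio.file.Paths;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser; // from the lucene-queryparser module
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class testChineseSearcher {
    public static void search(String indexDir, String q) throws Exception {
        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        // Parse the query with the same analyzer that was used for indexing.
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        QueryParser parser = new QueryParser("desc", analyzer);
        Query query = parser.parse(q);
        long startTime = System.currentTimeMillis();
        TopDocs hits = searcher.search(query, 10);
        long endTime = System.currentTimeMillis();
        System.out.println("匹配" + q + "共耗时" + (endTime-startTime) + "毫秒");
        System.out.println("查询到" + hits.totalHits + "条记录");
        for (ScoreDoc scoreDoc : hits.scoreDocs) { // iterate over the hits
            // scoreDoc.doc is the docID; use it to load the stored document
            Document doc = searcher.doc(scoreDoc.doc);
            System.out.println(doc.get("date"));
            System.out.println(doc.get("desc"));
        }
        reader.close();
    }

    public static void main(String[] args) {
        String indexDir = "D:\\masterSpring\\code\\chapter18\\src\\resource";
        String q = "今天"; // the text to search for
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}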
The output is:
匹配今天共耗时41毫秒
查询到1条记录
今天
今天是阴天
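The query 今天 can only match if the analyzer split 今天是阴天 into separate words at index time. SmartChineseAnalyzer uses a statistical model for this; the sketch below is not its algorithm. It is a much simpler dictionary-based forward maximum matching segmenter, shown only to illustrate what "segmenting" means. The class name, the tiny dictionary, and the method names are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: forward maximum matching against a small dictionary.
class MaxMatchSegmenterSketch {
    private final Set<String> dictionary;
    private final int maxWordLength;

    MaxMatchSegmenterSketch(Set<String> dictionary) {
        this.dictionary = dictionary;
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            // Start with the longest candidate and shrink it until it is a
            // dictionary word or only a single character is left.
            int end = Math.min(text.length(), start + maxWordLength);
            while (end > start + 1 && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            tokens.add(text.substring(start, end));
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("昨天", "今天", "明天", "雨天", "阴天", "晴天"));
        MaxMatchSegmenterSketch segmenter = new MaxMatchSegmenterSketch(dict);
        System.out.println(segmenter.segment("今天是阴天")); // prints [今天, 是, 阴天]
    }
}
```

With the text split this way, a query for 今天 has an exact term to match, which is why only the second document is returned.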
Next, highlighting.
Highlighting
As before, add the jar first:
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-highlighter</artifactId>
<version>5.3.1</version>
</dependency>
The highlighting is tested on top of the search code above.
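import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
// ...plus the imports used by the search code above

public class testChineseSearcher {
    public static void search(String indexDir, String q) throws Exception {
        // The setup and the query are the same as above and are omitted here.
        System.out.println("查询到" + hits.totalHits + "条记录");
        // Tags wrapped around each matched term; the no-arg constructor defaults to <B> and </B>.
        SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<b><font color=green>", "</font></b>");
        // Scores each text fragment by the query terms it contains.
        QueryScorer scorer = new QueryScorer(query);
        // Splits the text into fragments without breaking up a matched span.
        Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
        Highlighter highlighter = new Highlighter(simpleHTMLFormatter, scorer);
        highlighter.setTextFragmenter(fragmenter); // use this fragmenter to pick the fragment to display
        for (ScoreDoc scoreDoc : hits.scoreDocs) { // iterate over the hits
            // scoreDoc.doc is the docID; use it to load the stored document
            Document doc = searcher.doc(scoreDoc.doc);
            System.out.println(doc.get("date")); // the field is named "date", not "data"
            System.out.println(doc.get("desc"));
            String desc = doc.get("desc");
            // Print the highlighted fragment
            if (desc != null) {
                TokenStream tokenStream = analyzer.tokenStream("desc", new StringReader(desc));
                String summary = highlighter.getBestFragment(tokenStream, desc);
                System.out.println(summary);
            }
        }
        reader.close();
    }

    public static void main(String[] args) {
        String indexDir = "D:\\masterSpring\\code\\chapter18\\src\\resource";
        String q = "今天"; // the text to search for
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}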
The output is:
匹配今天共耗时35毫秒
查询到1条记录
今天
今天是阴天
<b><font color=green>今天</font></b>是阴天
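To make the roles of the fragmenter and the scorer more concrete, here is a Lucene-free sketch of the idea behind picking a "best fragment": cut the text into pieces, score each piece by how many distinct query terms it contains, and keep the highest-scoring one. The fixed-size character windows and the plain substring matching are simplifying assumptions of this sketch (the real classes work on analyzed tokens), and the class and method names are hypothetical.

```java
// Hypothetical illustration of "split into fragments, score, keep the best".
class BestFragmentSketch {

    static String bestFragment(String text, String[] queryTerms, int fragmentSize) {
        String best = null;
        int bestScore = 0;
        for (int start = 0; start < text.length(); start += fragmentSize) {
            int end = Math.min(text.length(), start + fragmentSize);
            String fragment = text.substring(start, end);
            // Score = number of distinct query terms present in this fragment.
            int score = 0;
            for (String term : queryTerms) {
                if (fragment.contains(term)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = fragment;
            }
        }
        return best; // in this sketch, null means no fragment contained a query term
    }

    public static void main(String[] args) {
        String text = "昨天是雨天。今天是阴天。明天是晴天。";
        System.out.println(bestFragment(text, new String[] {"今天"}, 6)); // prints 今天是阴天。
    }
}
```

In the example data each desc value is shorter than one fragment, so the whole field comes back as the best fragment.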
As the code shows, SimpleHTMLFormatter defines how a highlighted term is rendered; the default is bold. Here is its source:
public class SimpleHTMLFormatter implements Formatter {
private static final String DEFAULT_PRE_TAG = "<B>";
private static final String DEFAULT_POST_TAG = "</B>";
private String preTag;
private String postTag;
public SimpleHTMLFormatter(String preTag, String postTag) {
this.preTag = preTag;
this.postTag = postTag;
}
public SimpleHTMLFormatter() {
this("<B>", "</B>");
}
public String highlightTerm(String originalText, TokenGroup tokenGroup) {
if (tokenGroup.getTotalScore() <= 0.0F) {
return originalText;
} else {
StringBuilder returnBuffer = new StringBuilder(this.preTag.length() + originalText.length() + this.postTag.length());
returnBuffer.append(this.preTag);
returnBuffer.append(originalText);
returnBuffer.append(this.postTag);
return returnBuffer.toString();
}
}
}
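The source shows how SimpleHTMLFormatter works: it checks the token group's total score as computed by the scorer. If that score is not greater than 0, the text did not match the query and is returned unchanged; otherwise the text is returned as preTag + original text + postTag.
The scoring itself is done by QueryScorer. For each token in the text it records the token's position, looks the token up among the weighted query terms, and returns that term's weight as the token's score (0 if the token is not a query term or fails the position check). Each distinct term is added to the fragment's total score only once. The method that scores a token is: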
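The branch in highlightTerm can be tried out without Lucene. The following is a minimal sketch that passes the score in as a plain float instead of a TokenGroup; the class name, the token split, and the scores are assumptions made for this example only.

```java
// Hypothetical stand-in for the formatter logic: wrap only tokens with a positive score.
class TagFormatterSketch {
    private final String preTag;
    private final String postTag;

    TagFormatterSketch(String preTag, String postTag) {
        this.preTag = preTag;
        this.postTag = postTag;
    }

    // Same decision as highlightTerm: a score of 0 or less leaves the text untouched.
    String highlightTerm(String originalText, float totalScore) {
        if (totalScore <= 0.0f) {
            return originalText;
        }
        return preTag + originalText + postTag;
    }

    // Formats every token and joins the results back into one string.
    String highlight(String[] tokens, float[] scores) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            out.append(highlightTerm(tokens[i], scores[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TagFormatterSketch formatter = new TagFormatterSketch("<b><font color=green>", "</font></b>");
        String[] tokens = {"今天", "是", "阴天"}; // assumed segmentation of the desc field
        float[] scores = {1.0f, 0.0f, 0.0f};     // only the query term gets a positive score
        System.out.println(formatter.highlight(tokens, scores));
        // prints <b><font color=green>今天</font></b>是阴天
    }
}
```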
public float getTokenScore() {
position += posIncAtt.getPositionIncrement(); // track the position of the current token
String termText = termAtt.toString();
WeightedSpanTerm weightedSpanTerm;
if ((weightedSpanTerm = fieldWeightedSpanTerms.get(
termText)) == null) {
return 0;
}
if (weightedSpanTerm.positionSensitive &&
!weightedSpanTerm.checkPosition(position)) {
return 0;
}
float score = weightedSpanTerm.getWeight(); // the weight of the matched query term
// found a query term - is it unique in this doc?
if (!foundTerms.contains(termText)) { // count each distinct term only once
totalScore += score;
foundTerms.add(termText);
}
return score;
}
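The effect of the foundTerms check is easiest to see with a repeated term. Below is a simplified, Lucene-free sketch of the same bookkeeping; it leaves out the position-sensitive check, and the class name, field names, and weights are invented for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified version of the per-token scoring shown above.
class FragmentScorerSketch {
    private final Map<String, Float> queryTermWeights;
    private final Set<String> foundTerms = new HashSet<>();
    private float totalScore;

    FragmentScorerSketch(Map<String, Float> queryTermWeights) {
        this.queryTermWeights = queryTermWeights;
    }

    // Returns the token's own score and updates the fragment total as a side effect.
    float getTokenScore(String termText) {
        Float weight = queryTermWeights.get(termText);
        if (weight == null) {
            return 0f; // not a query term
        }
        // Set.add returns false for a repeat, so the total only grows once per term.
        if (foundTerms.add(termText)) {
            totalScore += weight;
        }
        return weight;
    }

    float getFragmentScore() {
        return totalScore;
    }

    public static void main(String[] args) {
        Map<String, Float> weights = new HashMap<>();
        weights.put("今天", 1.0f);
        FragmentScorerSketch scorer = new FragmentScorerSketch(weights);
        for (String token : new String[] {"今天", "是", "今天"}) {
            System.out.println(token + " -> " + scorer.getTokenScore(token));
        }
        System.out.println("fragment score: " + scorer.getFragmentScore()); // prints fragment score: 1.0
    }
}
```

Every occurrence of a query term still gets a positive token score, so each one is highlighted, while the fragment total reflects how many different query terms the fragment covers.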
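That covers Chinese word segmentation and highlighting.
I hope this helps. If you have questions or ideas, feel free to share them so we can improve together.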