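Lucene Notes, Part 1

This note covers the structure of index files, the steps for building and searching an index, updating and deleting indexed documents, and how analysis (tokenization) works, comparing the standard analyzer with Chinese analyzers.

Overview: what is full-text retrieval?

In full-text retrieval, an indexing program scans every word of a text and builds an index entry for each word, recording how many times the word occurs and at which positions. When a user submits a query, the search program looks the query up in that prebuilt index and returns the matches, without rescanning the original text.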
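The idea of "one index entry per word, with count and positions" can be sketched in plain Java. This is a toy model and not how Lucene stores its index; the class name `ToyInvertedIndex` and its methods are made up for illustration, and words are split naively on whitespace.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: word -> (document id -> positions of the word in that document).
final class ToyInvertedIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Lowercases the text, splits it on whitespace, and records each word's position.
    void add(int docId, String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            postings.computeIfAbsent(words[pos], k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Lookup consults only the index; the original text is never rescanned.
    Map<Integer, List<Integer>> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptyMap());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "to be or not to be");
        index.add(2, "let it be");
        System.out.println(index.search("be")); // prints {1=[1, 5], 2=[2]}
    }
}
```

The occurrence count of a word in a document is simply the length of its position list, so both pieces of information named above come from the same structure.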

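Lucene started as a subproject of the Apache Software Foundation's Jakarta project and is now a top-level Apache project. It is an open-source full-text search toolkit. It is not a complete search application but a library for building one: it provides a complete query engine and indexing engine, plus part of a text analysis engine (for two Western languages, English and German).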

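Index structure

Index

Segment

Document

A Document describes one item to be indexed, such as an HTML page, an email, or a text file. A Document object is made up of several Field objects. You can think of a Document as a record in a database table, with each Field as one column of that record.

Field

A Field describes one attribute of a document. For example, the subject and the body of an email can be described by two separate Field objects.

Term

A Term is one of the units that a Field's content is split into by the tokenization rules. The index is built on terms.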
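To make the Document, Field, and Term relationship concrete, here is a small plain-Java sketch. It does not use Lucene's classes: a document is modeled as a map from field name to value, the name `ToyTerms` is hypothetical, and whitespace splitting stands in for a real analyzer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyTerms {
    // A "document" is modeled as field name -> field value, like one database row.
    // Each value is split on whitespace; every piece becomes a term written "field:text".
    static Set<String> termsOf(Map<String, String> document) {
        Set<String> terms = new TreeSet<>();
        for (Map.Entry<String, String> field : document.entrySet()) {
            for (String word : field.getValue().toLowerCase().trim().split("\\s+")) {
                terms.add(field.getKey() + ":" + word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, String> email = new LinkedHashMap<>();
        email.put("title", "Meeting notes");
        email.put("content", "Notes from the meeting");
        // "notes" appears in both fields and yields two distinct terms.
        System.out.println(termsOf(email));
    }
}
```

The point of the sketch is that a term is tied to its field: the same word in the title and in the content is indexed as two different terms, which is why a query has to name the field it searches.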

1. Demo: creating, searching, updating, and deleting an index [simple demo]

Add the Maven coordinates

<properties>  
        <lucene.version>5.5.0</lucene.version>  
    </properties>  

<dependencies>  
    <dependency>  
        <groupId>junit</groupId>  
        <artifactId>junit</artifactId>  
        <version>4.9</version>  
        <scope>test</scope>  
    </dependency>  
    <dependency>  
        <groupId>org.apache.lucene</groupId>  
        <artifactId>lucene-core</artifactId>  
        <version>${lucene.version}</version>  
    </dependency>  
    <dependency>  
        <groupId>org.apache.lucene</groupId>  
        <artifactId>lucene-analyzers-common</artifactId>  
        <version>${lucene.version}</version>  
    </dependency>  
    <dependency>  
        <groupId>org.apache.lucene</groupId>  
        <artifactId>lucene-queryparser</artifactId>  
        <version>${lucene.version}</version>  
    </dependency>  
    <dependency>  
        <groupId>org.apache.lucene</groupId>  
        <artifactId>lucene-highlighter</artifactId>  
        <version>${lucene.version}</version>  
    </dependency>  
    <dependency>
		<groupId>org.wltea.analyzer</groupId>
		<artifactId>IK-Analyzer</artifactId>
		<version>5.0</version>
	</dependency>

</dependencies>

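Creating an index for full-text search involves three things: the data to be searched (Document), an analyzer (Analyzer), and the index that gets written (index).

The example uses Lucene's built-in StandardAnalyzer as the tokenization rule.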

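import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

/**
 * @Description Builds an index of two documents with the StandardAnalyzer
 * @Author Maps
 * @Time 2018/8/4 15:20
 * @Version 1.0
 */

public class IndexCreate {
	public static void main(String[] args) throws Exception {
		// 1. Choose where the index files are stored (on disk)
		FSDirectory fsDirectory = FSDirectory.open(Paths.get("D:\\Lucene\\001"));
		// 2. Create the analyzer (the standard analyzer here)
		StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
		// 3. Create the index writer configuration
		IndexWriterConfig indexWriterConfig = new IndexWriterConfig(standardAnalyzer);
		// 4. Create the index writer
		IndexWriter indexWriter = new IndexWriter(fsDirectory, indexWriterConfig);
		// 5. Create a document and add fields, i.e. the data to be stored and indexed
		Document document = new Document();
		document.add(new IntField("id",3, Field.Store.YES));
		document.add(new TextField("title","中国简述", Field.Store.YES));
		document.add(new TextField("content","中国是个迷人的国家，如果你从小喜欢书法、诗词、历史、美食，你就会发现这个国家从三千年前到现在，其实没有变过。", Field.Store.YES));
		indexWriter.addDocument(document);
		Document document1 = new Document();
		document1.add(new IntField("id",4,Field.Store.YES));
		document1.add(new TextField("title"," 迷人的中国 ",Field.Store.YES));
		document1.add(new TextField("content","1949年10月1日，在北京天安门广场举行开国大典，毛泽东在天安门城楼上宣告中华人民共和国中央人民政府成立，中华人民共和国正式成立。", Field.Store.YES));
		indexWriter.addDocument(document1);
		// commit() flushes pending changes itself, so no separate flush() call is needed
		indexWriter.commit();
		// Close the writer to release the write lock on the index directory
		indexWriter.close();
		fsDirectory.close();
	}
}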

Searching an index involves four steps: the keyword (Keyword), the analyzer (Analyzer), the index searcher (Searcher), and the returned results (Results).

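import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

/**
 * @Description Searches the index with a single-term query
 * @Author Maps
 * @Time 2018/8/4 15:20
 * @Version 1.0
 */

public class IndexSearch {
	public static void main(String[] args) throws Exception {
		// Index directory
		FSDirectory fsDirectory = FSDirectory.open(Paths.get("D:\\Lucene\\001"));
		// Create the reader
		DirectoryReader directoryReader = DirectoryReader.open(fsDirectory);
		// Create the index searcher
		IndexSearcher indexSearcher = new IndexSearcher(directoryReader);
		// 1. Search keyword
		String keyword ="中";
		// 2. Analysis: TermQuery does not analyze its input, so the keyword must already be
		//    one indexed term (StandardAnalyzer indexes each Chinese character as its own term)
		// 3. Search the index
		// Create a Term: first argument is the field name, second is the term text to look for
		Term term = new Term("content",keyword);
		// Create a TermQuery, i.e. a query for exactly that term
		TermQuery query = new TermQuery(term);
		// 4. Results: search(query, n) takes the query condition and the maximum number of hits;
		//    TopDocs wraps the number of matching documents and the top hits
		TopDocs topDocs = indexSearcher.search(query, 8);
		System.out.println("Total hits: "+topDocs.totalHits);
		// scoreDocs holds each hit's document number and score
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;
		for (ScoreDoc scoreDoc : scoreDocs) {
			System.out.println("Score: "+scoreDoc.score);
			System.out.println("Document number: "+scoreDoc.doc);
			// Load the document by its document number
			Document document = indexSearcher.doc(scoreDoc.doc);
			// Read stored field values by field name
			System.out.println(document.get("id")+" "+document.get("title")+" "+document.get("content"));
		}
		directoryReader.close();
		fsDirectory.close();
	}
}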

[Screenshot: results of searching the content field for the character '中']

Updating the index

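import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

/**
 * @Description Replaces the documents that match a term with a new document
 * @Author Maps
 * @Time 2018/8/5 21:02
 * @Version 1.0
 */

public class IndexUpdate {
	public static void main(String[] args) throws Exception {
		// Index directory
		FSDirectory fsDirectory = FSDirectory.open(Paths.get("D:\\Lucene\\001"));
		IndexWriter indexWriter = new IndexWriter(fsDirectory,new IndexWriterConfig(new StandardAnalyzer()));
		// Build the new document content
		Document document = new Document();
		document.add(new IntField("id",1,Field.Store.YES));
		document.add(new TextField("title"," 我是最可爱的人 ",Field.Store.YES));
		document.add(new TextField("content"," 北京欢迎您，欢迎你来到中国的首都。", Field.Store.YES));
		// The Term selects the documents to replace: those whose title field contains 中
		Term term = new Term("title","中");
		// updateDocument(term, document) first deletes every document containing the term,
		// then adds the new document. Both sample titles contain 中, so both are replaced by one.
		indexWriter.updateDocument(term,document);
		// Commit the change (commit() flushes pending changes itself)
		indexWriter.commit();
		indexWriter.close();
		fsDirectory.close();
	}
}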
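Because `updateDocument` deletes by term and then adds, it can replace several documents with one. The plain-Java sketch below models only that delete-then-add behaviour; it is not Lucene code. `ToyUpdate` is a hypothetical name, each document is reduced to its title, and the titles are English stand-ins for the sample titles used in this demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ToyUpdate {
    // Delete-then-add: drop every title that contains the term, then append the replacement.
    static List<String> update(List<String> titles, String term, String replacement) {
        List<String> result = new ArrayList<>();
        for (String title : titles) {
            boolean matches = Arrays.asList(title.split("\\s+")).contains(term);
            if (!matches) {
                result.add(title);
            }
        }
        result.add(replacement);
        return result;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("china overview", "charming china");
        List<String> once = update(titles, "china", "the loveliest people");
        System.out.println(once);  // prints [the loveliest people]
        // The replacement does not contain the term, so a second run deletes nothing and adds a copy.
        List<String> twice = update(once, "china", "the loveliest people");
        System.out.println(twice); // prints [the loveliest people, the loveliest people]
    }
}
```

This is why updates are usually keyed on a term that identifies exactly one document, such as a unique id field, rather than on a word from the title.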

Deleting from the index

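import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

/**
 * @Description Deletes documents from the index
 * @Author Maps
 * @Time 2018/8/5 21:09
 * @Version 1.0
 */

public class IndexDelete {
	
	public static void main(String[] args) throws Exception{
		// Index directory
		FSDirectory fsDirectory = FSDirectory.open(Paths.get("D:\\Lucene\\001"));
		IndexWriter indexWriter = new IndexWriter(fsDirectory,new IndexWriterConfig(new StandardAnalyzer()));
		// Delete every document in the index
		indexWriter.deleteAll();
		// Alternatively, delete only the documents whose title contains a given term
		//Term term = new Term("title","中");
		//indexWriter.deleteDocuments(term);
		// Commit the deletion (commit() flushes pending changes itself)
		indexWriter.commit();
		indexWriter.close();
		fsDirectory.close();
	}
	
}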

2. Exploring how analysis works [the Filter chain]

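import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * @Description An analyzer tokenizes by chaining Filters; the same chain is where custom extension words and custom stop words plug in
 * @Author Maps
 * @Time 2018/8/5 21:13
 * @Version 1.0
 */

public class WordAnalyzer {
	
	public static void main(String[] args)throws Exception {
		try {
			Analyzer analyzer = null;
			analyzer = new StandardAnalyzer();
			//analyzer = new SimpleAnalyzer();

			// A TokenStream carries, for each token: the term text, position, offsets, and type.
			// This stream is what gets written into the index files.
			String text = "I am Map then, 北京欢迎您！ Woking";
			// text ---> Reader ---> components: a Tokenizer splits the text, then Filters
			// (StandardFilter, LowerCaseFilter, StopFilter) post-process it ---> TokenStream
			// A custom analyzer or a custom filter (e.g. a sensitive-word filter) uses the same chain:
			// for each token (e.g. 毛主席), look it up in a sensitive-word list (a Map<String,String>)
			// and throw a RuntimeException on a hit.
			TokenStream tokenStream = analyzer.tokenStream("content", text);
			tokenStream.reset();
			// Register the attribute that exposes each token's text
			CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
			while (tokenStream.incrementToken()) {
				// incrementToken() advances to the next token: true if there is one, false at the end
				System.out.println(charTermAttribute + " | ");
			}
			// Finish and release the stream so the analyzer can be reused
			tokenStream.end();
			tokenStream.close();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}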
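The tokenizer-plus-filters chain described in the comments above can be sketched without Lucene. The class below is a simplified model, not Lucene's implementation: `ToyAnalyzer`, its stop-word list, and its sensitive-word list are all assumptions made for illustration, and the input is the ASCII part of the sample text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ToyAnalyzer {
    // Hypothetical word lists for this sketch only (not Lucene's defaults).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("i", "am", "then"));
    static final Set<String> SENSITIVE_WORDS = new HashSet<>(Arrays.asList("forbidden"));

    // Tokenizer stage: split on anything that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Filter stage 1: lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) out.add(token.toLowerCase(Locale.ROOT));
        return out;
    }

    // Filter stage 2: drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) out.add(token);
        }
        return out;
    }

    // Filter stage 3: reject the whole text if a sensitive word is found.
    static List<String> rejectSensitive(List<String> tokens) {
        for (String token : tokens) {
            if (SENSITIVE_WORDS.contains(token)) {
                throw new IllegalArgumentException("sensitive word: " + token);
            }
        }
        return tokens;
    }

    // Each stage wraps the previous one, the way each filter wraps the stream before it.
    static List<String> analyze(String text) {
        return rejectSensitive(removeStopWords(lowerCase(tokenize(text))));
    }

    public static void main(String[] args) {
        System.out.println(analyze("I am Map then, Woking")); // prints [map, woking]
    }
}
```

Order matters in such a chain: the stop-word and sensitive-word lists are lowercase, so they only work because the lowercase stage runs first.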

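3. Chinese tokenization [IK, Paoding (庖丁解牛), SmartChinese]

The standard analyzer handles Chinese poorly: it emits every Chinese character as a separate term instead of recognizing words.

smartCN (SmartChineseAnalyzer) is the Chinese analyzer that Lucene itself provides, but it does not support extension. It ships in the separate lucene-analyzers-smartcn artifact, which is not among the dependencies listed above.

The IK analyzer segments Chinese well and also supports extension, such as stop words and custom dictionary words. It offers a smart mode as well: with smart mode off (the default, false) IK uses fine-grained segmentation and emits every word it can find, including shorter words inside longer ones; with smart mode on it resolves ambiguity and emits coarser-grained tokens. [The example uses IK 5.0.] Download the IK 5.0 jar and install it into the local Maven repository by hand, for example with Maven's install:install-file goal and the coordinates declared in the dependency list.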

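import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class ChineseWord {
	/* @Description: Uses IK with its configuration file, which declares the extension words and stop words
	 * @Param: 
	 * @Return: 
	 * @Author: Maps
	 * @Date: 2018/8/5 21:02
	 */
			
	public static void main(String[] args) throws Exception{
		// Chinese analyzers: mmseg4j, IKAnalyzer, Paoding (庖丁解牛)
		try {
			Analyzer analyzer = null;
			// smartCN is provided by Lucene but does not support extension
			//analyzer = new SmartChineseAnalyzer();
			// IK defaults to fine-grained mode (useSmart = false), which emits every word it finds;
			// useSmart = true turns on smart mode, which emits coarser-grained tokens
			analyzer = new IKAnalyzer(true);
			String text = "山治是个海贼中的绅士。小草帽爱好和平，喜爱玩梦幻西游，英雄联盟等游戏。";
			TokenStream tokenStream = analyzer.tokenStream("content", text);
			tokenStream.reset();
			// Register the attribute that exposes each token's text
			CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
			// incrementToken() advances to the next token: true if there is one, false at the end
			while (tokenStream.incrementToken()) {
				System.out.println(charTermAttribute + " | ");
			}
			// Finish and release the stream
			tokenStream.end();
			tokenStream.close();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
}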
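Dictionary-based segmentation and the difference between coarse and fine granularity can be illustrated with a short plain-Java sketch. This is forward maximum matching, a common textbook technique, and it is not IK's actual algorithm. The class `ToySegmenter` and its dictionary are hypothetical, and an unspaced Latin string stands in for Chinese text, which likewise has no spaces between words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ToySegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    ToySegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int longest = 1;
        for (String word : dictionary) longest = Math.max(longest, word.length());
        this.maxWordLength = longest;
    }

    // Coarse mode: at each position take the longest dictionary word, then jump past it.
    List<String> coarse(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxWordLength);
            while (end > start + 1 && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            tokens.add(text.substring(start, end)); // falls back to a single character
            start = end;
        }
        return tokens;
    }

    // Fine mode: emit every dictionary word starting at each position (tokens may overlap).
    // Characters that belong to no dictionary word are dropped in this simplified version.
    List<String> fine(String text) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int limit = Math.min(text.length(), start + maxWordLength);
            for (int end = start + 1; end <= limit; end++) {
                String candidate = text.substring(start, end);
                if (dictionary.contains(candidate)) tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("full", "text", "fulltext", "search"));
        ToySegmenter segmenter = new ToySegmenter(words);
        System.out.println(segmenter.coarse("fulltextsearch")); // prints [fulltext, search]
        System.out.println(segmenter.fine("fulltextsearch"));   // prints [full, fulltext, text, search]
    }
}
```

Adding a word to the dictionary changes the output of both modes, which is the kind of extension that a custom dictionary provides. Fine-grained output helps recall at search time because a query for a shorter word still matches, while coarse output keeps the index smaller.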
