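1. Introduction to Tika

Tika is a project started under the Apache Software Foundation in 2008. Its main job is to open documents in many different formats and extract their contents. Tika can be downloaded from the Apache Tika website. When building an index you will sometimes run into content that is not plain text, such as PDF, HTML, or Word files. How do you index those? This is where Tika comes in: it acts as a bridge between IndexWriter and the various file types above it.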

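2. Using Tika

Run the jar you just downloaded with the java -jar command, which opens the Tika GUI. Open a Word document, then under the View menu click Formatted Text to see the extracted text. The other entries in the same menu show the document in other forms; click through them to see what each one gives you.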

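3. Using Tika in a project

Add the tika-app-1.19.1.jar downloaded above to the project. First, look at what happens when a document is indexed without Tika. Indexing succeeds and no error is reported, so let's open the index in Luke to see what was actually indexed. It turns out the index has little value, because the indexed terms are unreadable: the PDF's bytes were fed to the analyzer as if they were text. After that, we write a Tika test example that extracts the text from a PDF.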
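To see why the index built without Tika is unreadable, here is a minimal standalone sketch that uses only the JDK. It rests on one assumption worth stating: text inside a PDF is commonly stored in Flate-compressed streams, so Deflate is used here as a stand-in for the real file format. The class name RawBytesDemo and its methods are made up for this illustration and are not part of Tika or Lucene.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: shows that decoding stored bytes directly as
// characters does not give back the text that was written into the file.
class RawBytesDemo {

    // Stand-in for "how a binary format stores text": Deflate-compress it.
    static byte[] compress(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Stand-in for a format-aware parser: undo the storage encoding first.
    static String decompress(byte[] stored) {
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input, stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid Deflate data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // What a plain character reader does: treat the stored bytes as text.
    static String readAsText(byte[] stored) {
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "rental contract rental contract rental contract rental contract";
        byte[] stored = compress(text);
        // Does the word survive when the stored bytes are read as characters?
        System.out.println(readAsText(stored).contains("contract"));
        // Does it survive when the storage encoding is undone first?
        System.out.println(decompress(stored).equals(text));
    }
}
```

The analyzer in the no-Tika case sees something like the output of readAsText, which is why Luke shows unreadable terms. A parser that understands the format plays the role of decompress here.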

package com.wsy;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import java.io.*;

public class TikaTest {
    private static File file = new File("C:\\Users\\Chris\\Desktop\\自如租房合同.pdf");

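    // Index the raw file without Tika: the file's bytes are handed to the analyzer as if they were text.
    public void indexWithOutTika() {
        try {
            Directory directory = FSDirectory.open(new File("E:\\Lucene\\IndexLibrary"));
            // try-with-resources closes the writer even if addDocument fails.
            try (IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_35, new MMSegAnalyzer()))) {
                Document document = new Document();
                // A Reader-valued field is tokenized and indexed, but not stored.
                document.add(new Field("content", new FileReader(file)));
                indexWriter.addDocument(document);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }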

    // Extract the text of a file with Tika and print the metadata collected during parsing.
    public String fileToText(File file) {
        // AutoDetectParser detects the file type and delegates to the matching parser.
        Parser parser = new AutoDetectParser();
        // try-with-resources closes the stream on every exit path.
        try (InputStream inputStream = new FileInputStream(file)) {
            // Note: the no-argument BodyContentHandler stops at 100,000 characters;
            // new BodyContentHandler(-1) removes that limit.
            ContentHandler contentHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            // Metadata values can be set by hand; many other keys can be set the same way (not shown here).
            // Keys that already appear in the output below cannot be changed this way;
            // keys that do not appear there can be set freely.
            metadata.set(Metadata.AUTHOR, "王劭阳");
            ParseContext parseContext = new ParseContext();
            parseContext.set(Parser.class, parser);
            parser.parse(inputStream, contentHandler, metadata, parseContext);
            for (String name : metadata.names()) {
                System.out.println(name + "-->" + metadata.get(name));
            }
            return contentHandler.toString();
        } catch (IOException | SAXException | TikaException e) {
            // FileNotFoundException is an IOException, so it is covered here too.
            e.printStackTrace();
        }
        return null;
    }

    public static void main(String[] args) {
        TikaTest tikaTest = new TikaTest();
        tikaTest.indexWithOutTika();
        String text = tikaTest.fileToText(file);
        System.out.println(text);
    }
}
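The listing above relies on AutoDetectParser to work out what kind of file it has been given before choosing a parser. As a rough illustration of that idea only, the sketch below guesses a file's kind from its first few bytes. This is not Tika's detection code: Tika uses a much larger signature database and other hints such as the file name. The class MagicSniffer, the enum Kind, and the handful of signatures checked here are assumptions chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: guesses a coarse file kind from leading "magic" bytes.
class MagicSniffer {

    enum Kind { PDF, ZIP_CONTAINER, OLE2_CONTAINER, HTML, UNKNOWN }

    static Kind sniff(byte[] head) {
        // PDF files begin with the ASCII marker "%PDF-".
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F', '-'})) {
            return Kind.PDF;
        }
        // ZIP local file header; .docx and .xlsx are ZIP containers.
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return Kind.ZIP_CONTAINER;
        }
        // OLE2 compound file header, used by legacy .doc and .xls files.
        if (startsWith(head, new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0})) {
            return Kind.OLE2_CONTAINER;
        }
        // Very rough HTML check: leading whitespace is ignored, case is ignored.
        String text = new String(head, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (text.startsWith("<!doctype html") || text.startsWith("<html")) {
            return Kind.HTML;
        }
        return Kind.UNKNOWN;
    }

    // True if data begins with exactly the bytes in prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sniff("%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(sniff("  <HTML><body>hi</body></HTML>".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(sniff("plain text".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A real detector has to go further than this. For instance, a ZIP container still has to be opened to tell a .docx from an ordinary archive, which is one reason to leave detection to Tika rather than hand-rolling it.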

Lucene Notes 31 - Lucene Extensions - Introduction to Tika
