TextExtract(1) Tika Basic

1. Introduction
Apache Tika detects and extracts text and metadata from many different file formats, including audio, video, images and text documents.
The tika-app bundle is a single runnable jar that works as both a GUI and a command-line tool.

Command-line interface + GUI
Language identifier + Tika Facade + MIME Type
Parser
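
As a small illustration of the MIME type component listed above, here is a minimal sketch that asks the Tika facade for a media type based on a file name alone. It assumes tika-core (or the tika-app jar) is on the classpath; the class name DetectTypeSketch and the sample file names are made up for this example.

```java
import org.apache.tika.Tika;

public class DetectTypeSketch {

    // Name-based detection: only the file name is consulted, nothing is read from disk
    static String detectByName(String fileName) {
        return new Tika().detect(fileName);
    }

    public static void main(String[] args) {
        // Hypothetical file names, used only to show the call
        System.out.println(detectByName("resume.pdf"));
        System.out.println(detectByName("notes.txt"));
    }
}
```

The facade also has a detect(File) variant that looks at the file content as well, which is more reliable when the extension is missing or wrong.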

There are 3 files to download:
http://mirrors.sonic.net/apache/tika/tika-server-1.10.jar
http://apache.mirrors.hoobly.com/tika/tika-app-1.10.jar
http://ftp.wayne.edu/apache/tika/tika-1.10-src.zip
The source code is managed by Maven, so I can build it directly (skipping the tests):
> mvn clean install -DskipTests=true

The tika-app jar can be started from the command line or by double-clicking it.
> java -jar tika-app-1.10.jar --gui

In the GUI we can choose a file and switch the view to see the different kinds of content Tika extracts from it.
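
The same jar also works without the GUI: tika-app accepts options such as --text (plain text) and --metadata (name/value pairs) followed by a document path. A minimal sketch of driving it from Java with ProcessBuilder follows; the class name TikaCliSketch and the buildCommand helper are hypothetical, and it assumes a java executable is on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class TikaCliSketch {

    // Assemble: java -jar <jar> <option> <document>
    static List<String> buildCommand(String jarPath, String option, String document) {
        return Arrays.asList("java", "-jar", jarPath, option, document);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 2) {
            System.out.println("usage: TikaCliSketch <tika-app.jar> <document>");
            return;
        }
        // --text prints the extracted plain text; swap in --metadata for the metadata
        Process process = new ProcessBuilder(buildCommand(args[0], "--text", args[1]))
                .inheritIO() // forward the tool's output to this console
                .start();
        System.exit(process.waitFor());
    }
}
```

For example, passing tika-app-1.10.jar and a PDF path as the two arguments is equivalent to typing java -jar tika-app-1.10.jar --text followed by that path in a shell.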

2. Try the Packages in Java Code
The simplest Java code to fetch the content of a file:
package com.sillycat.resumeparse;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-resume.pdf";

    public static void main(String[] args) {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();
        // Parse the file and print out the extracted text content
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        System.out.print(text);
    }
}

Fetch the Metadata and Identify the Language
package com.sillycat.resumeparse;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-duffy.pdf";

    public static void main(String[] args) {
        Tika tika = new Tika();
        String text = null;
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        Metadata metadata = new Metadata();

        // fetch the content
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.print(text);

        // fetch the metadata; try-with-resources closes the stream afterwards
        try (FileInputStream input = new FileInputStream(file)) {
            parser.parse(input, handler, metadata, context);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.println(handler.toString());

        String[] metadataNames = metadata.names();

        for (String name : metadataNames) {
            // System.out.println(name + ": " + metadata.get(name));
        }

        // identify language, reusing the text the handler collected above;
        // a second parse into the same handler would only append a duplicate copy
        LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
        System.out.println("Language name: " + identifier.getLanguage());
    }
}
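
One thing to watch with the parser approach: a BodyContentHandler created with the no-argument constructor stops at 100,000 characters and signals that with a SAXException, so long documents may fail. A minimal sketch of lifting that limit is below. It assumes the Tika parsers are on the classpath (for example via the tika-app jar); the class name UnlimitedParseSketch, the extract helper and the in-memory sample text are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class UnlimitedParseSketch {

    // Parse a stream and return its body text without the default size cap
    static String extract(InputStream input, Metadata metadata) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        try (InputStream in = input) { // closed even if parsing fails
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // A tiny in-memory document stands in for a real file here
        byte[] sample = "Hello Tika".getBytes(StandardCharsets.UTF_8);
        Metadata metadata = new Metadata();
        String text = extract(new ByteArrayInputStream(sample), metadata);
        System.out.println(text.trim());
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```

The facade has a matching knob: tika.setMaxStringLength(-1) removes the same 100,000 character default from parseToString.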

References:
https://tika.apache.org/
https://github.com/luohuazju/sillycat-resume-parse
http://itindex.net/detail/41933-apache-tika-%E9%80%9A%E7%94%A8

http://m.yiibai.com/tika/tika_content_extraction.html

Books:
Tika in Action

Reposted from sillycat.iteye.com/blog/2248936