TextExtract(1)Tika Basic
1. Introduction
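Tika extracts text and metadata from many file formats, including audio, video, images, and text documents.
The Tika distribution includes tika-app, a single jar that works as a library, a GUI, and a command-line tool. It is made up of these components:
Command-line interface + GUI
Language identifier + Tika facade + MIME type detection
Parser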
Three files are available for download:
http://mirrors.sonic.net/apache/tika/tika-server-1.10.jar
http://apache.mirrors.hoobly.com/tika/tika-app-1.10.jar
http://ftp.wayne.edu/apache/tika/tika-1.10-src.zip
The source code is managed with Maven, so it can be built directly:
> mvn clean install -DskipTests=true
The tika-app GUI can be started from the command line or by double-clicking the jar.
> java -jar tika-app-1.10.jar --gui
In the GUI we can open a file and switch between views to see the different content Tika extracts from it.
2. Try the Packages in Java Code
The simplest Java code to fetch the content of a file:
package com.sillycat.resumeparse;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-resume.pdf";

    public static void main(String[] args) {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();
        // Parse the file and print out the extracted text content
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        System.out.print(text);
    }
}
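The introduction lists MIME type detection among the components, and the same `Tika` facade exposes it. Below is a minimal sketch, assuming the tika-app jar is on the classpath. The class name `DetectTypeSketch`, the helper `detectByName`, and the sample file names are made up for illustration. `Tika.detect(String)` looks only at the file name, so nothing is read from disk.

```java
import org.apache.tika.Tika;

public class DetectTypeSketch {

    // Guess the media type from the file name alone; the file is never opened.
    static String detectByName(String fileName) {
        Tika tika = new Tika();
        return tika.detect(fileName);
    }

    public static void main(String[] args) {
        // Hypothetical file names, used only for illustration
        String[] names = { "3-resume.pdf", "cover-letter.txt", "photo.png" };
        for (String name : names) {
            System.out.println(name + " -> " + detectByName(name));
        }
    }
}
```

Name-based detection is cheap but easy to fool with a wrong extension. Passing a `File` or an `InputStream` to `detect` lets Tika also inspect the leading bytes of the content.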
Fetch the Metadata and Identify the Language
package com.sillycat.resumeparse;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-duffy.pdf";

    public static void main(String[] args) {
        Tika tika = new Tika();
        String text = null;
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        Metadata metadata = new Metadata();

        // fetch the content through the facade
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.print(text);

        // fetch the metadata; the handler collects the body text in the same pass
        // and try-with-resources closes the stream afterwards
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, context);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.println(handler.toString());
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }

        // identify the language from the text the handler already holds;
        // parsing a second time into the same handler would append the text again
        LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
        System.out.println("Language name: " + identifier.getLanguage());
    }
}
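The language identifier always returns its best guess, even for short or mixed text where that guess may be unreliable. The sketch below shows one way to guard against that, assuming the `LanguageIdentifier` class from Tika 1.10 used above. The class name `LanguageCheckSketch`, the helper `languageOrUnknown`, the sample sentence, and the "unknown" fallback are all made up for illustration.

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageCheckSketch {

    // Return the detected language code, or "unknown" when the
    // identifier does not consider its own guess reasonably certain.
    static String languageOrUnknown(String text) {
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        return identifier.isReasonablyCertain() ? identifier.getLanguage() : "unknown";
    }

    public static void main(String[] args) {
        // Hypothetical sample text; in the example above this would be handler.toString()
        String sample = "Apache Tika is a toolkit for detecting and extracting "
                + "metadata and structured text content from various documents.";
        System.out.println("Language name: " + languageOrUnknown(sample));
    }
}
```

Longer inputs generally give the identifier more to work with, so passing the full extracted text is likely to be more reliable than passing a single line.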
References:
https://tika.apache.org/
https://github.com/luohuazju/sillycat-resume-parse
http://itindex.net/detail/41933-apache-tika-%E9%80%9A%E7%94%A8
Book: Tika in Action
http://m.yiibai.com/tika/tika_content_extraction.html
Reposted from sillycat.iteye.com/blog/2248936