I want to parse the content of a JPG picture, not just its metadata.
The following code is the test class:
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.ocr.TesseractOCRParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    public class JpegParse {

        public static void main(final String[] args)
                throws IOException, SAXException, TikaException, InterruptedException {
            File file = new File("/path/to/menu.jpg");
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext pcontext = new ParseContext();

            // Configure Tesseract: OCR language and installation directory.
            TesseractOCRConfig config = new TesseractOCRConfig();
            // The language code must match a <code>.traineddata file in tessdata.
            config.setLanguage("chi");
            config.setTesseractPath("/path/to/tesseract-ocr");
            pcontext.set(TesseractOCRConfig.class, config);

            TesseractOCRParser jpegParser = new TesseractOCRParser();
            pcontext.set(TesseractOCRParser.class, jpegParser);

            // try-with-resources closes the stream once parsing is done.
            try (FileInputStream inputstream = new FileInputStream(file)) {
                jpegParser.parse(inputstream, handler, metadata, pcontext);
            }

            System.out.println("Metadata of the document:");
            for (String name : metadata.names()) {
                System.out.println(name + ": " + metadata.get(name));
            }
            System.out.println("Contents of the document:" + handler.toString());
        }
    }
Note:
The path passed to
    config.setTesseractPath("/path/to/tesseract-ocr");
must be the parent directory that contains the tessdata directory.
The tesseract command must also be linked into this directory:
    # ln -s /usr/local/bin/tesseract /path/to/tesseract-ocr
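To check that layout before running the parser, something like the following sketch may help. It is not part of Tika; the class name TesseractPathCheck and the method findProblems are hypothetical, and it assumes a Unix-like system where the executable is named tesseract.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Tika class: checks the directory layout
// described in the note above.
public class TesseractPathCheck {

    // Returns a list of problems found; an empty list means the layout looks usable.
    static List<String> findProblems(Path tesseractDir) {
        List<String> problems = new ArrayList<>();
        if (!Files.isDirectory(tesseractDir)) {
            problems.add(tesseractDir + " is not a directory");
            return problems;
        }
        // The language data is expected in a tessdata subdirectory.
        if (!Files.isDirectory(tesseractDir.resolve("tessdata"))) {
            problems.add("missing tessdata directory in " + tesseractDir);
        }
        // Assumption: the executable (or a symlink to it) is named "tesseract".
        // Both checks follow symlinks, so a link created with ln -s passes.
        Path exe = tesseractDir.resolve("tesseract");
        if (!Files.isRegularFile(exe) || !Files.isExecutable(exe)) {
            problems.add("missing or non-executable tesseract command in " + tesseractDir);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Placeholder path from the example above; pass the real one as an argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/path/to/tesseract-ocr");
        List<String> problems = findProblems(dir);
        if (problems.isEmpty()) {
            System.out.println("Layout looks OK: " + dir);
        } else {
            problems.forEach(System.out::println);
        }
    }
}
```

Running it with the same path that is passed to setTesseractPath should show whether the tessdata directory or the symlink is the missing piece.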
References
https://wiki.apache.org/tika/TikaOCR
http://www.kaiyuanba.cn/html/1/131/227/7891.htm