Apache Tika 1.24 released: a content extraction toolkit

Apache Tika 1.24 has been released. Tika is a content extraction toolkit. It integrates libraries such as POI and PDFBox and provides a unified interface for text extraction. Tika also offers a convenient extension API for adding support for third-party file formats.
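
As a rough illustration of the unified interface described above, here is a minimal sketch using Tika's Tika facade class. It assumes tika-core and tika-parsers 1.24 are on the classpath. The class name TikaFacadeSketch, the helper method names, and the sample string are made up for this example. The same two calls should apply to PDF or Office content, with Tika choosing the underlying parser.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of Tika itself.
public class TikaFacadeSketch {

    // Guess the media type from the leading bytes of the content.
    static String detectType(byte[] content) {
        return new Tika().detect(content);
    }

    // Extract plain text; Tika picks a parser based on the detected type.
    // Assumes tika-parsers is available so that concrete parsers can be found.
    static String extractText(byte[] content) throws IOException, TikaException {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return new Tika().parseToString(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for this sketch; real use would read a file.
        byte[] sample = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        System.out.println("type: " + detectType(sample));
        System.out.println("text: " + extractText(sample));
    }
}
```

The caller never names POI or PDFBox directly, which is the point of the facade: format-specific libraries stay behind one entry point.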

The main updates are as follows:

  • Updated Drew Noakes' metadata-extractor
  • Enabled optional extraction of structure tags in PDFs (alpha level)
  • The --extract mode of the Tika application can now write to STDOUT
  • Added an optional parser for PDF Preflight
  • Some improvements to detection of zip-based formats
  • Upgraded metadata-extractor to 2.13.0
  • Upgraded POI to 4.1.2
  • XMP is now extracted from PSD files
  • Added XMLProfiler as an optional parser to profile XFA and XMP in PDFs
  • Inline images that depend on the DCT filter are now extracted from PDFs
  • Upgraded PDFBox to 2.0.19
  • Fixed a configuration error in the ASM parser
  • Upgraded java-libpst to 0.9.3
  • Fixed XLIFF12Parser failure with ToXMLHandler

Release notes: https://downloads.apache.org/tika/CHANGES-1.24.txt

Origin www.oschina.net/news/114241/apache-tika-1-24-released