Apache Tika 1.22 released: a content extraction toolkit

Apache Tika 1.22 has been released. Tika is a content extraction toolkit for pulling text out of documents. It integrates POI and PDFBox and provides a unified interface for text extraction. Tika also provides a convenient extension API that can be used to add support for third-party file formats.

The new version contains many improvements and bug fixes. The main updates are as follows:

  • Note: known regression: PDFBOX-4587 - a PDF password containing code points between 0xF000 and 0xF0000 will cause an exception
  • Add a parser for HWP v5 files (TIKA-2909)
  • Fix the order in which streams are closed to avoid the "could not close temporary resources" exception in TesseractOCRParser (TIKA-2908)
  • Improve the performance of AutoDetectReader by caching the encoding detector (TIKA-1568)
  • Prevent invalid combinations of tags in RTFParser output (TIKA-2889)
  • Fix RereadableInputStream to release all resources (TIKA-2903)
  • Switch the language detector in the tika-eval module to a custom language identifier based on OpenNLP; add 18 languages, and add common-word lists for all 121 languages (TIKA-2790)
  • Fix an NPE in MimeTypesReader.releaseParser (TIKA-2896)
  • Fix RTFParser so that it extracts more content (TIKA-2883)
  • Add ClientSubmitTime to the metadata extracted from PST files (TIKA-2898)
  • Improve StreamingZipContainerDetector for xltx, xltm and several other file formats (TIKA-2886)
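
Because of the known regression noted at the top of the list, an application may want to screen PDF passwords before passing them to the parser. The following is a minimal sketch in plain Java and is not part of Tika or PDFBox. The class name PdfPasswordCheck and the method hasAffectedCodePoint are hypothetical, and treating both ends of the range as inclusive is an assumption; PDFBOX-4587 remains the authoritative description of which passwords are affected.

```java
// Hypothetical helper, not part of Tika or PDFBox.
final class PdfPasswordCheck {

    // Range taken from the release note; inclusive bounds are an assumption.
    static final int LOW = 0xF000;
    static final int HIGH = 0xF0000;

    // Returns true if any Unicode code point of the password lies in [LOW, HIGH].
    static boolean hasAffectedCodePoint(String password) {
        // codePoints() yields full code points, so surrogate pairs are combined
        // before the comparison instead of being checked as separate chars.
        return password.codePoints().anyMatch(cp -> cp >= LOW && cp <= HIGH);
    }

    public static void main(String[] args) {
        System.out.println(hasAffectedCodePoint("secret"));   // prints false
        System.out.println(hasAffectedCodePoint("pw\uF123")); // prints true
    }
}
```

A caller could use such a check to warn the user or log the case before attempting to open an encrypted PDF with Tika 1.22.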

Download: https://tika.apache.org/download.html

Origin www.oschina.net/news/108785/apache-tika-1-22-released