Apache Tika 1.22 has been released. Tika is a content extraction toolkit. It integrates POI and PDFBox and provides a unified interface for text extraction. Tika also offers a convenient extension API for adding support for third-party file formats.
The new version contains many improvements and bug fixes. The main updates are as follows:
- Note: known regression: PDFBOX-4587 - PDF passwords containing code points between 0xF000 and 0XF0000 will cause an exception
- Add a parser for HWP v5 files (TIKA-2909)
- Fix the order in which streams are closed, to avoid the "Failed to close temporary resource" exception in TesseractOCRParser (TIKA-2908)
- Improve the performance of AutoDetectReader by caching encoding detectors (TIKA-1568)
- Prevent illegal tag combinations in RTFParser output (TIKA-2889)
- Fix RereadableInputStream to release all resources (TIKA-2903)
- Implement a custom language identifier in the tika-eval module, based on the OpenNLP language detector; add 18 languages and add common-word lists for all 121 languages (TIKA-2790)
- Fix an NPE in MimeTypesReader.releaseParser (TIKA-2896)
- Fix RTFParser to extract more content (TIKA-2883)
- Add ClientSubmitTime to the metadata extracted from PST files (TIKA-2898)
- Improve StreamingZipContainerDetector for xltx, xltm and several other file formats (TIKA-2886)
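For the known PDFBOX-4587 regression listed above, a quick pre-check of a PDF password might look like the sketch below. It is not part of Tika or PDFBox: the class and method names are hypothetical, and treating both bounds of the range as inclusive is an assumption, since the note only says "between".

```java
public class PasswordCodePointCheck {
    // Bounds quoted from the release note; inclusive handling is an assumption.
    static final int LOW = 0xF000;
    static final int HIGH = 0xF0000;

    // Returns true if any code point of the password falls inside the range
    // named in the PDFBOX-4587 note. Uses code points, not chars, so that
    // supplementary characters (surrogate pairs) are checked correctly.
    static boolean mayTriggerRegression(String password) {
        return password.codePoints().anyMatch(cp -> cp >= LOW && cp <= HIGH);
    }

    public static void main(String[] args) {
        System.out.println(mayTriggerRegression("secret"));         // false
        System.out.println(mayTriggerRegression("pass\uF123word")); // true
    }
}
```

A password flagged by this check is only a candidate for the problem; whether it actually fails depends on the PDFBox version bundled with the release.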
Download: https://tika.apache.org/download.html