Luke, a Lucene index viewer, and Tika, a text extraction tool

  Luke makes it easy to inspect a Lucene index. Because Solr and Elasticsearch (ES) are built on Lucene, it can open their indexes as well.

       Before opening an index, check the Lucene version: an index written by a newer version of Lucene may not open in an older version of Luke.

       As I recall, Luke could also be used to repair a broken index. The repair deletes the segments that contain errors, so back up the index before using it.

       Notes on how to use Luke will be added later.

 

       Tika is a text extraction tool that can pull content out of Word, PDF, Excel and other files, which makes it a useful data source for ES and similar systems. For images it only extracts information such as the title and size; RGB color data is not recorded.

       Tika identifies a file's type and encoding from its "magic number". Java class files, for example, all start with the bytes CA FE BA BE. Documents in other standard formats can be identified from their leading bytes in the same way.
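
       The idea of magic-number detection can be sketched in a few lines of Python. This is only an illustration of the technique, not Tika's actual detector: the signature table, the labels and the function name sniff_type are all made up for this example, and a real detector knows far more formats and looks beyond the first few bytes.

```python
# Illustrative signature table (an assumption for this sketch, not Tika's registry).
# Each entry maps the leading bytes of a file to a human-readable label.
MAGIC_SIGNATURES = [
    (bytes.fromhex("CAFEBABE"), "Java class file"),
    (b"%PDF-", "PDF document"),
    (bytes.fromhex("D0CF11E0A1B11AE1"), "OLE2 container (legacy .doc/.xls)"),
    (b"PK\x03\x04", "ZIP container (also .docx/.xlsx)"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
]


def sniff_type(data: bytes) -> str:
    """Return a label for the first signature that matches the leading bytes."""
    for magic, label in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


if __name__ == "__main__":
    # Hypothetical file headers, truncated to a few bytes for the demo.
    print(sniff_type(bytes.fromhex("CAFEBABE0000003D")))
    print(sniff_type(b"%PDF-1.7\n"))
    print(sniff_type(b"plain text"))
```

       In practice only the first few bytes of each file need to be read, for example with open(path, "rb").read(8).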

       Tika may produce garbled characters when it processes Chinese text. As I recall, the documentation mentions that detection of the GB2312 character set can occasionally be wrong, which may be the cause. I will look into this in more detail when I get the chance.
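
       The effect of a wrong charset guess is easy to reproduce without Tika. The sketch below uses only Python's built-in codecs; the sample string and the helper name decode_as are assumptions for illustration, and it says nothing about how Tika's own detector decides.

```python
def decode_as(raw: bytes, charset: str) -> str:
    """Decode bytes with the given charset, replacing undecodable sequences."""
    return raw.decode(charset, errors="replace")


# Sample Chinese text (hypothetical), stored as GB2312 bytes: two bytes per character.
text = "中文文档"
raw = text.encode("gb2312")

right = decode_as(raw, "gb2312")       # correct guess: original text comes back
wrong_latin = decode_as(raw, "latin-1")  # wrong guess: every byte becomes one Latin character
wrong_utf8 = decode_as(raw, "utf-8")     # wrong guess: invalid sequences become U+FFFD

if __name__ == "__main__":
    print(right)
    print(wrong_latin)
    print(wrong_utf8)
```

       If the charset is already known, stating it explicitly rather than relying on detection avoids this class of problem.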
