Getting basic PDF information with Tika

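Copyright notice: This is an original article by the blogger. Reposting is welcome, but please include a link to the original article in a prominent place. https://blog.csdn.net/john1337/article/details/85228527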

The following code retrieves the basic information (text content and metadata) of a PDF file:

        import java.io.FileInputStream;
        import java.io.InputStream;

        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.parser.pdf.PDFParser;
        import org.apache.tika.sax.BodyContentHandler;

        public class PdfInfo {
            public static void main(String[] args) {
                String path = "D:/apache_software/solr/solr-7.5.0/example/exampledocs/solr-word.pdf";
                // try-with-resources closes the stream even if parsing fails
                try (InputStream inputstream = new FileInputStream(path)) {
                    BodyContentHandler handler = new BodyContentHandler();
                    Metadata metadata = new Metadata();
                    ParseContext pcontext = new ParseContext();

                    // parse the document using the PDF parser
                    PDFParser pdfparser = new PDFParser();
                    pdfparser.parse(inputstream, handler, metadata, pcontext);

                    // extracted text content of the document
                    System.out.println("Contents of the PDF :" + handler.toString());

                    // metadata of the document
                    System.out.println("Metadata of the PDF:");
                    for (String name : metadata.names()) {
                        System.out.println(name + " : " + metadata.get(name));
                    }
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        }

Output (metadata part):

Metadata of the PDF:
date : 2008-11-13T13:35:51Z
pdf:PDFVersion : 1.3
xmp:CreatorTool : Microsoft Word
Keywords : solr, word, pdf
subject : solr word
AAPL:Keywords : solr, word, pdf
dc:creator : Grant Ingersoll
dcterms:created : 2008-11-13T13:35:51Z
Last-Modified : 2008-11-13T13:35:51Z
dcterms:modified : 2008-11-13T13:35:51Z
dc:format : application/pdf; version=1.3
title : solr-word
Last-Save-Date : 2008-11-13T13:35:51Z
meta:save-date : 2008-11-13T13:35:51Z
dc:title : solr-word
pdf:encrypted : false
modified : 2008-11-13T13:35:51Z
cp:subject : solr word
Content-Type : application/pdf
creator : Grant Ingersoll
meta:author : Grant Ingersoll
dc:subject : solr, word, pdf
meta:creation-date : 2008-11-13T13:35:51Z
created : Thu Nov 13 21:35:51 CST 2008
xmpTPg:NPages : 1
Creation-Date : 2008-11-13T13:35:51Z
meta:keyword : solr, word, pdf
Author : Grant Ingersoll
producer : Mac OS X 10.5.5 Quartz PDFContext
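
Several of these names are aliases that carry the same value (for example `Author`, `creator`, `dc:creator`, and `meta:author`). As a rough, JDK-only sketch, the snippet below groups a few name/value pairs copied from the output above by their value. The class `MetadataAliases` and its methods `groupByValue` and `sample` are made up for illustration and are not part of Tika.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataAliases {

    // Groups metadata names by their value, keeping first-seen order.
    static Map<String, List<String>> groupByValue(Map<String, String> metadata) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            groups.computeIfAbsent(e.getValue(), v -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    // A few name/value pairs copied by hand from the output above.
    static Map<String, String> sample() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("dc:creator", "Grant Ingersoll");
        m.put("title", "solr-word");
        m.put("dc:title", "solr-word");
        m.put("creator", "Grant Ingersoll");
        m.put("meta:author", "Grant Ingersoll");
        m.put("Author", "Grant Ingersoll");
        return m;
    }

    public static void main(String[] args) {
        // Prints each distinct value followed by the names that share it.
        groupByValue(sample()).forEach((value, names) ->
                System.out.println(value + " <- " + names));
    }
}
```

In a real program the map would be filled from `metadata.names()` and `metadata.get(name)` instead of the hand-copied sample.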

These metadata names are also why the configuration for importing PDF files through Tika looks like this:

      <entity name="pdf" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">

        <field column="Author" name="author" meta="true"/>
        <!-- in the original PDF, the Author meta-field name is upper-cased,
          but in Solr schema it is lower-cased
         -->

        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>

        <field column="text" name="text"/>

      </entity>
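
To make the role of `column` and `name` concrete, here is a minimal JDK-only sketch of the mapping this configuration describes. It assumes that a field with `meta="true"` is looked up in the Tika metadata by its exact, case-sensitive `column` name and then stored under the Solr field given by `name`. The class `MetaFieldMapping` and the method `toSolrFields` are hypothetical and do not reproduce the actual `TikaEntityProcessor` code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MetaFieldMapping {

    // Copies each configured metadata column into its Solr field name.
    // Columns that are missing from the metadata are skipped.
    static Map<String, String> toSolrFields(Map<String, String> metadata,
                                            Map<String, String> columnToField) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : columnToField.entrySet()) {
            String value = metadata.get(e.getKey()); // exact, case-sensitive match (assumption)
            if (value != null) {
                doc.put(e.getValue(), value);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Values copied from the metadata output shown earlier.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Author", "Grant Ingersoll");
        metadata.put("title", "solr-word");
        metadata.put("dc:format", "application/pdf; version=1.3");

        // column -> name pairs taken from the <field ... meta="true"/> elements.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("Author", "author");
        mapping.put("title", "title");
        mapping.put("dc:format", "format");

        System.out.println(toSolrFields(metadata, mapping));
    }
}
```

With a lower-case `author` as the column, the lookup in this sketch would find nothing, because the metadata output contains `Author` but no `author`. That is the mismatch the XML comment about the upper-cased name points out.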
