Reading Word documents with POI

I recently wrote a Word document import feature, but because the project was urgent, the implementation was very rough. Over the weekend I finally had time to go through the code on my own and turn it into a generic utility that can be reused directly in future projects.

Overview

POI origin

Apache POI is an open source project whose original goal was to handle the various document file formats based on the Office Open XML (OOXML) standard and Microsoft's OLE 2 Compound Document format (OLE2), with support for both reading and writing. It is arguably the tool of choice for working with Office documents in Java.

HWPF and XWPF

POI provides two main modules for working with Word documents: HWPF and XWPF.
HWPF is the standard API entry point for Microsoft Word 97 (-2007) files, that is, the binary .doc format. It also offers limited read-only support for the older Word 6 and Word 95 file formats.
XWPF is the standard API entry point for Microsoft Word 2007 documents, that is, the OOXML-based .docx format.
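
Because the two modules correspond to two different container formats, the format can also be recognized from the first bytes of a file rather than from its extension. The following is only a sketch using the standard library: the class name `WordFormatSniffer`, the `Format` enum, and the `classify` and `detect` methods are hypothetical names made up for this example, not part of POI. It relies on two facts: an OLE2 compound document starts with the 8-byte signature `D0 CF 11 E0 A1 B1 1A E1`, and a .docx file is a ZIP package, so it normally starts with the ZIP local file header `50 4B 03 04`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class WordFormatSniffer {

    /** Container formats that can be told apart from the leading bytes. */
    public enum Format { OLE2, ZIP_PACKAGE, UNKNOWN }

    // Signature at the start of every OLE2 compound document (.doc, but also .xls and .ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header; .docx is a ZIP package, as are .xlsx and ordinary ZIP archives.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a buffer that holds the first bytes of a file. */
    public static Format classify(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Format.ZIP_PACKAGE;
        return Format.UNKNOWN;
    }

    /** Peeks at up to 8 bytes of the stream, then resets it so a parser can read from the start. */
    public static Format detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8);
        byte[] header = new byte[8];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) break; // end of stream before 8 bytes were available
            read += n;
        }
        in.reset();
        return classify(Arrays.copyOf(header, read));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                      (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        byte[] docx = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        System.out.println(detect(new ByteArrayInputStream(doc)));
        System.out.println(detect(new ByteArrayInputStream(docx)));
    }
}
```

The signature only identifies the container, so a match tells you which module is worth trying (HWPF for OLE2, XWPF for a ZIP package), not that the file is a valid Word document.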

Reading a Word document

POI actually provides many APIs for reading and writing Word documents. This demo covers only the simplest case, reading the text paragraph by paragraph. Reading images and tables will be covered in a later update.


Maven dependencies

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>24.0-jre</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.17</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.17</version>
        </dependency>
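
The utility below reads a .doc or .docx file and returns its paragraphs with all whitespace removed:

    import com.google.common.base.CharMatcher;
    import com.google.common.collect.Lists;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;

    // The class name is illustrative; the original snippet showed only the method.
    public final class WordReader {

        // Assumption: LOGGER is an SLF4J logger. The original used it without declaring it,
        // and SLF4J is not among the Maven dependencies listed above.
        private static final Logger LOGGER = LoggerFactory.getLogger(WordReader.class);

        /** Returns one entry per paragraph; the list is empty if the file cannot be read. */
        public static List<String> readWordFile(String path) {
            List<String> contextList = Lists.newArrayList();
            // try-with-resources closes the stream and the documents even when parsing fails
            try (InputStream stream = new FileInputStream(path)) {
                if (path.endsWith(".doc")) {
                    // Binary Word format, handled by HWPF
                    try (HWPFDocument document = new HWPFDocument(stream);
                         WordExtractor extractor = new WordExtractor(document)) {
                        for (String context : extractor.getParagraphText()) {
                            // removeFrom drops every whitespace character, including spaces between words
                            contextList.add(CharMatcher.whitespace().removeFrom(context));
                        }
                    }
                } else if (path.endsWith(".docx")) {
                    // OOXML format, handled by XWPF
                    try (XWPFDocument document = new XWPFDocument(stream)) {
                        for (XWPFParagraph paragraph : document.getParagraphs()) {
                            contextList.add(CharMatcher.whitespace().removeFrom(paragraph.getParagraphText()));
                        }
                    }
                } else {
                    LOGGER.debug("File {} is not a Word file", path);
                }
            } catch (IOException e) {
                LOGGER.error("Failed to read Word file {}", path, e);
            }
            return contextList;
        }
    }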
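
One detail of the method worth seeing in isolation is `CharMatcher.whitespace().removeFrom(...)`: it deletes every whitespace character in a paragraph, including the spaces between words. The sketch below uses only the standard library to show the effect and one possible alternative. The class `WhitespaceDemo` and its two methods are hypothetical, and the regular expression is an approximation of Guava's whitespace matcher rather than an exact equivalent.

```java
public class WhitespaceDemo {

    /** Approximates CharMatcher.whitespace().removeFrom(text): drops every whitespace character. */
    public static String removeAll(String text) {
        // (?U) switches \s to Unicode-aware matching
        return text.replaceAll("(?U)\\s+", "");
    }

    /** Alternative: collapse each run of whitespace to one space and trim both ends. */
    public static String collapse(String text) {
        return text.replaceAll("(?U)\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String paragraph = "  Apache POI\treads  Word files \r\n";
        System.out.println(removeAll(paragraph));
        System.out.println(collapse(paragraph));
    }
}
```

Removing all whitespace may be fine for text where spaces carry no meaning, but for documents in languages that separate words with spaces, collapsing is probably closer to what a caller expects.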

  


Origin www.cnblogs.com/EarlyBridVic/p/12143272.html