Reading Word files with POI (Word 2003 and Word 2007)

 

    While building a system for a customer recently, we were asked to support importing Word files. Microsoft Word exists in several versions (97, 2003, and 2007), and the storage formats differ, most notably between the binary .doc format of Word 2003 and the XML-based .docx format of Word 2007. Word 97 has largely disappeared from the market and almost no one uses it anymore, so our system only considers the 2003 and 2007 versions. We only need to read the text content of a Word file; text styles, pictures, and other information can be ignored, and there is no need to modify the file itself. For these reasons we chose Apache POI for reading.

 

    Reading a Word 2003 file (.doc) is relatively simple and needs only two jar packages: poi-3.5-beta6-20090622.jar and poi-scratchpad-3.5-beta6-20090622.jar. The 2007 version (.docx) is more troublesome. The trouble is not in the code we write but in the number of jar packages that must be imported, seven in total:
 1. openxml4j-bin-beta.jar
 2. poi-3.5-beta6-20090622.jar
 3. poi-ooxml-3.5-beta6-20090622.jar
 4. dom4j-1.6.1.jar
 5. geronimo-stax-api_1.0_spec-1.0.jar
 6. ooxml-schemas-1.0.jar
 7. xmlbeans-2.3.0.jar
Items 4-7 are the jar packages that poi-ooxml-3.5-beta6-20090622.jar depends on; they are in the ooxml-lib directory of poi-bin-3.5-beta6-20090622.tar.gz.
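
With this many jar packages it is easy to forget one, and the resulting error only shows up when a file is actually imported. The following is a minimal sketch of a startup check, assuming it is enough to confirm that the two extractor classes used later in this article can be loaded. The class PoiClasspathCheck and its methods are hypothetical helpers, not part of POI, and the check does not prove that every dependency jar is present.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of POI): reports which class names cannot be loaded. */
final class PoiClasspathCheck {

    // Extractor classes used in this article, one per Word format.
    static final String DOC_EXTRACTOR = "org.apache.poi.hwpf.extractor.WordExtractor";
    static final String DOCX_EXTRACTOR = "org.apache.poi.xwpf.extractor.XWPFWordExtractor";

    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: look the class up without running its static initializers
            Class.forName(className, false, PoiClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the subset of the given class names that could not be loaded. */
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<String>();
        for (String name : classNames) {
            if (!isAvailable(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one line per extractor class that is not on the classpath.
        for (String name : missing(DOC_EXTRACTOR, DOCX_EXTRACTOR)) {
            System.out.println("Not on the classpath: " + name);
        }
    }
}
```

Running the main method before the first import gives a readable message instead of a NoClassDefFoundError in the middle of processing a customer's file.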

 

    Before writing the code, we have to download the required jar packages. Only poi-bin-3.5-beta6-20090622.tar.gz and openxml4j-bin-beta.jar need to be downloaded, because all the other required jar packages can be found inside poi-bin-3.5-beta6-20090622.tar.gz. The download addresses are:
poi-bin-3.5-beta6-20090622.tar.gz: http://apache.etoak.com/poi/dev/bin/poi-bin-3.5-beta6-20090622.tar.gz
openxml4j-bin-beta.jar: http://mirror.optus.net/sourceforge/o/op/openxml4j/openxml4j-bin-beta.jar
 
    Below is the Java code for reading the Word files. Two things are worth noting: POI does not read picture information when it extracts text from a Word file, and for Word 2007 files (.docx) that contain tables, all of the table data appears at the end of the extracted string.

 

 

 

    import java.io.File;  
    import java.io.FileInputStream;  
    import java.io.InputStream;  
      
    import org.apache.poi.POIXMLDocument;  
    import org.apache.poi.POIXMLTextExtractor;  
    import org.apache.poi.hwpf.extractor.WordExtractor;  
    import org.apache.poi.openxml4j.opc.OPCPackage;  
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;  
      
    /**
     * Test class that uses POI to read the text content of Word 2003 and Word 2007 files.<br />
     * @createDate 2009-07-25
     * @author Carl He
     */  
    public class Test {  
        public static void main(String[] args) {  
            try {  
                // Word 2003 (.doc): pictures are not read
                InputStream is = new FileInputStream(new File("c://files//2003.doc"));
                WordExtractor ex = new WordExtractor(is);
                String text2003 = ex.getText();
                is.close(); // release the file handle once the text has been extracted
                System.out.println(text2003);
      
                // Word 2007 (.docx): pictures are not read, and table data is placed at the end of the string
                OPCPackage opcPackage = POIXMLDocument.openPackage("c://files//2007.docx");  
                POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);  
                String text2007 = extractor.getText();  
                System.out.println(text2007);  
                  
            } catch (Exception e) {  
                e.printStackTrace();
            }  
        }  
    }  
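
The test class above knows in advance which file is .doc and which is .docx. In an import feature the file comes from the user, so the right extractor has to be chosen at runtime. The following is a rough sketch of one way to do that, assuming it is acceptable to look at the first bytes of the file instead of trusting the file extension. It relies on two well-known signatures: .doc files are stored in an OLE2 compound file, and .docx files are ZIP packages. The class WordFormatSniffer and its names are hypothetical and not part of POI, and it uses only the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper (not part of POI): guesses the container format from the file header. */
final class WordFormatSniffer {

    /** OLE2 is the container of .doc, ZIP is the container of .docx. */
    enum Format { OLE2, ZIP, UNKNOWN }

    // First bytes of an OLE2 compound file and of a ZIP local file header.
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a header given as the first bytes of a file. */
    static Format detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.ZIP;
        }
        return Format.UNKNOWN;
    }

    /** Peeks at the first 8 bytes of a stream, then rewinds it so it can still be parsed. */
    static Format detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(8);
        byte[] header = new byte[8];
        int total = 0;
        while (total < header.length) {
            int n = in.read(header, total, header.length - total);
            if (n < 0) {
                break; // the file is shorter than 8 bytes
            }
            total += n;
        }
        in.reset();
        return detect(Arrays.copyOf(header, total));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the start of a .docx file.
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(detect(new ByteArrayInputStream(zipLike)));
    }
}
```

A result of OLE2 would be handed to WordExtractor and a result of ZIP to XWPFWordExtractor. The check only identifies the container: other Office files (.xls, .xlsx) and ordinary zip archives share the same signatures, so the extractor can still reject the file.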

 

If you want the complete sample code, you can download it here. The rar package contains all the jar packages that POI needs to read Word 2003 and Word 2007 files, along with sample files for both versions.

 

 
