Java: Extract Text and Images from Word Documents

This article describes how to extract (read) text and images from a Word document with a Java program. The extraction covers the document body as well as the text and images in headers and footers.

Tool used: Free Spire.Doc for Java (free edition)

Importing the jar file (for reference):

Method 1: Download the jar package. After the download, unzip it and add the Spire.Doc.jar file from the lib folder to your Java project. Once added, the jar appears among the project's referenced libraries.

Method 2: Import the library through Maven, following the Maven import instructions for the library.
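Whichever method is used, it can help to confirm that the library is actually visible to the program before running the examples. The following is a minimal sketch, not part of the original article: the class name ClasspathCheck and the helper isOnClasspath are hypothetical, and the only thing taken from the article is the class name com.spire.doc.Document used in its imports.

```java
class ClasspathCheck {
    // Hypothetical helper: returns true if the named class can be located
    // by this class's class loader, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // com.spire.doc.Document is the class imported by the examples in this article
        System.out.println("Spire.Doc available: " + isOnClasspath("com.spire.doc.Document"));
    }
}
```

If the check reports false, the jar reference (Method 1) or the Maven dependency (Method 2) is probably not set up for the module being run.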

The test document (Test.docx) is as follows:

 

Java code examples (for reference)

[Example 1] Extract text from Word

import com.spire.doc.*;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        // Load the test document
        Document doc = new Document();
        doc.loadFromFile("Test.docx");

        // Get the document text as a String
        String text = doc.getText();

        // Write the String to a txt file
        writeStringToTxt(text, "ExtractedText.txt");
    }
    // Appends the content to the given txt file (the file is created if it does not exist)
    public static void writeStringToTxt(String content, String txtFileName) throws IOException {
        // The second argument (true) opens the file in append mode
        FileWriter fWriter = new FileWriter(txtFileName, true);
        try {
            fWriter.write(content);
        } catch (IOException ex) {
            ex.printStackTrace();
        } finally {
            try {
                fWriter.flush();
                fWriter.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}
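The helper in Example 1 opens the FileWriter in append mode and relies on the platform default charset, so repeated runs add to the same file and non-ASCII text may be encoded differently from one machine to another. As an alternative, here is a minimal sketch that overwrites the file and always writes UTF-8. The class Utf8TextWriter and the method writeUtf8 are hypothetical names introduced for illustration; only standard JDK calls are used, and the sample string stands in for the result of doc.getText().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Utf8TextWriter {
    // Hypothetical helper: replaces the file's content with the given text, encoded as UTF-8.
    // Files.write creates the file if needed and truncates it if it already exists.
    static void writeUtf8(String content, String fileName) throws IOException {
        Path target = Paths.get(fileName);
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Write to a temporary file so the demo leaves nothing behind
        Path tmp = Files.createTempFile("extracted-text", ".txt");
        writeUtf8("Sample text standing in for doc.getText()", tmp.toString());
        System.out.println(new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

In Example 1, this would mean calling writeUtf8(text, "ExtractedText.txt") in place of writeStringToTxt. Whether overwriting or appending is preferable depends on how the output file is used.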

Text extraction result:

 

 

[Example 2] Extract images from Word

import com.spire.doc.Document;
import com.spire.doc.documents.DocumentObjectType;
import com.spire.doc.fields.DocPicture;
import com.spire.doc.interfaces.ICompositeObject;
import com.spire.doc.interfaces.IDocumentObject;
import javax.imageio.ImageIO;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class ExtractImg {
    public static void main(String[] args) throws IOException {
        // Load the Word document
        Document document = new Document();
        document.loadFromFile("Test.docx");

        // Create a queue of objects still to be visited, starting with the document
        Queue<ICompositeObject> nodes = new LinkedList<>();
        nodes.add(document);

        // Create a list to hold the extracted images
        List<Object> images = new ArrayList<>();

        // Traverse the child objects in the document
        while (nodes.size() > 0) {
            ICompositeObject node = nodes.poll();
            for (int i = 0; i < node.getChildObjects().getCount(); i++) {
                IDocumentObject child = node.getChildObjects().get(i);
                if (child instanceof ICompositeObject) {
                    nodes.add((ICompositeObject) child);

                    // Get the picture and add it to the List
                    if (child.getDocumentObjectType() == DocumentObjectType.Picture) {
                        DocPicture picture = (DocPicture) child;
                        images.add(picture.getImage());
                    }
                }
            }
        }

        // Save each image as a PNG file
        for (int i = 0; i < images.size(); i++) {
            File file = new File(String.format("Picture-%d.png", i));
            ImageIO.write((RenderedImage) images.get(i), "PNG", file);
        }

    } 
}
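Example 2 finds the pictures with a queue-based, breadth-first walk over the document's object tree: each object taken from the queue has its children inspected, composite children are queued for a later visit, and matching children are collected. The sketch below illustrates only that traversal pattern with plain Java types so it can run without the library. The Node class, the type strings, and the collect method are hypothetical stand-ins and are not Spire.Doc types.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class TraversalSketch {
    // Hypothetical stand-in for a document object; not a Spire.Doc type.
    static class Node {
        final String type;
        final List<Node> children = new ArrayList<>();

        Node(String type) { this.type = type; }

        // Adds a child and returns this node so calls can be chained
        Node add(Node child) { children.add(child); return this; }
    }

    // Breadth-first walk: queue every child for a later visit and collect the matches.
    static List<Node> collect(Node root, String wantedType) {
        List<Node> found = new ArrayList<>();
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Node child : node.children) {
                queue.add(child);
                if (child.type.equals(wantedType)) {
                    found.add(child);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // A small tree with one picture in the body and one in a header
        Node doc = new Node("Document")
            .add(new Node("Section")
                .add(new Node("Paragraph").add(new Node("Picture")).add(new Node("TextRange")))
                .add(new Node("Header").add(new Node("Paragraph").add(new Node("Picture")))));
        System.out.println(collect(doc, "Picture").size()); // prints 2
    }
}
```

Because the walk starts at the document root and descends into every container, pictures nested in headers and footers are reached the same way as pictures in the body, which matches the behavior described at the start of this article.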

Image extraction result:

 

 

 

(End of article)

 

Please credit the source when reposting!

 


Origin www.cnblogs.com/Yesi/p/11611888.html