This article shows how to read the text and images of a PDF document from a Java program, by calling the extractText() and extractImages() methods respectively.
Tool used: Free Spire.PDF for Java (free version)
Importing the jar file:
Method 1: Download the jar package from the official website. After downloading, unzip the file and import the Spire.Pdf.jar file from the lib folder into the Java program. The result after importing is shown in the following figure:
Method 2: Import it by installing it from the Maven repository.
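Whichever method is used, it can help to confirm that the jar is actually visible to the program before running the extraction code. The following is a minimal sketch using only the standard library. The class name ClasspathCheck and the method isAvailable are hypothetical names chosen for illustration; com.spire.pdf.PdfDocument is the class used by the example in this article.

```java
// Hypothetical helper (not part of Spire.PDF): checks whether a class
// can be found on the current classpath without initializing it.
class ClasspathCheck {

    static boolean isAvailable(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // com.spire.pdf.PdfDocument is the class loaded in the article's example
        String name = "com.spire.pdf.PdfDocument";
        if (isAvailable(name)) {
            System.out.println(name + " found on the classpath");
        } else {
            System.out.println(name + " not found; check the jar import");
        }
    }
}
```

If the check reports that the class is missing, the jar was most likely not added to the project's build path or dependencies.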
Java code examples
import com.spire.pdf.*;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        // Load the test document
        PdfDocument pdf = new PdfDocument("sample.pdf");
        // Create a StringBuilder to collect the extracted text
        StringBuilder sb = new StringBuilder();
        // Counter used to number the output images
        int index = 0;
        // Iterate over the pages of the PDF document
        PdfPageBase page;
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            page = pdf.getPages().get(i);
            // Call extractText() to extract the text of the page
            sb.append(page.extractText(true));
            // Call extractImages() to get the images of the page
            for (BufferedImage image : page.extractImages()) {
                // Specify the output file name and the image format
                File output = new File(String.format("Image_%d.png", index++));
                ImageIO.write(image, "PNG", output);
            }
        }
        // Write the collected text to a TXT file once; the writer is closed automatically
        try (FileWriter writer = new FileWriter("ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdf.close();
    }
}
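The output step of the example above (one TXT file plus numbered PNG files) does not depend on Spire.PDF and can be separated into its own helper. The following is a sketch under that assumption, using only the standard library. The class name ExtractionOutput and the methods imageName and save are hypothetical names introduced for illustration; the file names ExtractText.txt and Image_%d.png follow the example.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper (not part of Spire.PDF): writes already-extracted
// text and images into a target directory.
class ExtractionOutput {

    // Builds the numbered image file name used in the article's example
    static String imageName(int index) {
        return String.format("Image_%d.png", index);
    }

    // Writes the text to ExtractText.txt and each image to Image_<n>.png.
    // Returns the number of images written.
    static int save(Path dir, String text, List<BufferedImage> images) throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve("ExtractText.txt"), text.getBytes(StandardCharsets.UTF_8));
        int index = 0;
        for (BufferedImage image : images) {
            File output = dir.resolve(imageName(index)).toFile();
            // ImageIO.write returns false when no suitable writer is found
            if (!ImageIO.write(image, "PNG", output)) {
                throw new IOException("No PNG writer available for " + output);
            }
            index++;
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        // Demonstration with placeholder data instead of real PDF content
        Path dir = Files.createTempDirectory("pdf-extract");
        BufferedImage blank = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        int count = save(dir, "sample text", Collections.singletonList(blank));
        System.out.println("Wrote " + count + " image(s) to " + dir);
    }
}
```

In the extraction loop, the text from extractText() and the images from extractImages() would be collected first and then passed to save() once, so that the text file is written a single time.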
Result of reading the text and images:
(End of article)