PDF documents are hard to manipulate and edit, so working with them often takes more time than it should. As developers, we need a convenient way to operate on PDF documents in code. So how do we extract the text and images from a PDF? This article shows how to do both with Free Spire.PDF, a free PDF control. The control is available here.
Note: After downloading and installing the component, you can find the DLL file in the Bin folder of the decompressed package. Remember to add a reference to it in your project.
Original document:
1. Extract PDF text
C#
using System.IO;
using System.Text;
using Spire.Pdf;

// Create a PdfDocument object and load the sample PDF
PdfDocument doc = new PdfDocument();
doc.LoadFromFile("sample.pdf");

// Instantiate a StringBuilder to collect the extracted text
StringBuilder buffer = new StringBuilder();

// Traverse the pages and extract the text of each one
foreach (PdfPageBase page in doc.Pages)
{
    buffer.Append(page.ExtractText());
}
doc.Close();

// Write the extracted text to a text file
String fileName = "TextInPdf.txt";
File.WriteAllText(fileName, buffer.ToString());
Running the program writes the extracted text to TextInPdf.txt.
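If the text is needed page by page rather than as one file, the same loop can write a separate file per page. The following is only a minimal sketch that reuses the calls shown above (PdfDocument, LoadFromFile, Pages, ExtractText, Close). The class name and the Page-{0}.txt naming scheme are assumptions made for illustration, and the null check is a defensive guess rather than documented behavior.

```csharp
using System.IO;
using Spire.Pdf;

class PerPageTextSketch
{
    static void Main()
    {
        // Load the same sample document used in the article
        PdfDocument doc = new PdfDocument();
        doc.LoadFromFile("sample.pdf");

        // Page numbers start at 1 in the output file names (hypothetical naming scheme)
        int pageNumber = 1;
        foreach (PdfPageBase page in doc.Pages)
        {
            // Fall back to an empty string in case a page yields no text (assumption)
            string text = page.ExtractText() ?? string.Empty;

            string fileName = string.Format("Page-{0}.txt", pageNumber++);
            File.WriteAllText(fileName, text);
        }

        doc.Close();
    }
}
```

Keeping one file per page makes it easier to trace a piece of text back to its position in the original document.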
2. Extract pictures
C#
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using Spire.Pdf;

// Create a PdfDocument object and load the sample PDF
PdfDocument doc = new PdfDocument();
doc.LoadFromFile("sample.pdf");

// Declare a list to hold the extracted images
IList<Image> images = new List<Image>();

// Traverse the pages, check whether each one contains images, and collect them
foreach (PdfPageBase page in doc.Pages)
{
    // Extract once per page instead of calling ExtractImages() twice
    var pageImages = page.ExtractImages();
    if (pageImages != null)
    {
        foreach (Image image in pageImages)
        {
            images.Add(image);
        }
    }
}
doc.Close();

// Traverse the extracted images and save each one under a numbered name
int index = 0;
foreach (Image image in images)
{
    String imageFileName = String.Format("Image-{0}.png", index++);
    image.Save(imageFileName, ImageFormat.Png);
}
After the program runs, the extracted images are saved as Image-0.png, Image-1.png, and so on.
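For documents with many images, it may be preferable to save each image as soon as it is extracted and release it right away, instead of holding all of them in a list. The sketch below is one possible variation, not part of the original article: the ExtractedImages folder and the class name are hypothetical, and it assumes the extracted Image objects can be safely disposed once saved.

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using Spire.Pdf;

class SaveImagesSketch
{
    static void Main()
    {
        // Hypothetical output folder; CreateDirectory does nothing if it already exists
        string outputDir = "ExtractedImages";
        Directory.CreateDirectory(outputDir);

        PdfDocument doc = new PdfDocument();
        doc.LoadFromFile("sample.pdf");

        int index = 0;
        foreach (PdfPageBase page in doc.Pages)
        {
            var pageImages = page.ExtractImages();
            if (pageImages == null)
            {
                continue; // no images on this page
            }

            foreach (Image image in pageImages)
            {
                try
                {
                    string fileName = string.Format("Image-{0}.png", index++);
                    image.Save(Path.Combine(outputDir, fileName), ImageFormat.Png);
                }
                finally
                {
                    // Release the GDI+ resources held by the image (assumed safe after saving)
                    image.Dispose();
                }
            }
        }

        doc.Close();
    }
}
```

Writing into a dedicated folder keeps the output separate from the project files, and disposing each image after saving keeps memory use roughly constant regardless of how many images the document contains.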
(This article is reproduced from http://www.cnblogs.com/Yesi/p/4203686.html )