Aspose.PDF feature demonstration: converting between PDF and TXT formats with C#

Aspose.PDF is an advanced PDF processing API for generating, modifying, converting, rendering, securing, and printing documents in cross-platform applications, with no need for Adobe Acrobat. The API also provides compression options, table creation and manipulation, graphics and image functions, extensive hyperlink support, stamp and watermark tasks, extended security controls, and custom font handling.

Aspose APIs handle popular file formats and can export or convert many types of documents to fixed-layout file formats and to the most commonly used image and multimedia formats.

PDF files are popular because they can hold text, images, animations, videos, and annotations.

However, text is the most important part of most PDF documents. In this article, we use C# to convert a PDF to a TXT file and a TXT file to a PDF. The article covers:

  • Convert PDF to TXT file without formatting using C# or VB.NET
  • Convert PDF to TXT file with formatting routines using C# or VB.NET
  • Convert TXT file to PDF programmatically using C# or VB.NET

Aspose.PDF for .NET is currently at v20.9, which improves TIFF-to-PDF conversion performance and fixes a number of bugs, such as an LZW decoder failure.

Convert PDF to TXT file without formatting using C# or VB.NET

First, we convert the PDF to text without any formatting routines. The text content is extracted as-is, so the output does not preserve the layout of the input PDF. The conversion takes the following steps:

  • Load the input PDF document
  • Initialize an instance of the StringBuilder class
  • Iterate through each page of the PDF document
  • Read text using TextDevice and Raw mode
  • Save the output text as a TXT file

The code snippet below shows how to convert a PDF to a TXT file using C#:

using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.Text;

// dataDir is assumed to hold the working folder path, including a trailing separator

// Open document
Document pdfDocument = new Document(dataDir + "MultiColumnPdf.pdf");
// Accumulates the text extracted from every page
StringBuilder builder = new StringBuilder();

foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        // Create text device
        TextDevice textDevice = new TextDevice();

        // Extract in Raw mode: text is taken as-is, without formatting routines
        TextExtractionOptions options =
            new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
        textDevice.ExtractionOptions = options;

        // Convert the page and save text to the stream
        textDevice.Process(pdfPage, textStream);

        // Decode the stream contents and append them to the result
        builder.Append(Encoding.Unicode.GetString(textStream.ToArray()));
    }
}

// Save the text file
string outputPath = dataDir + "PDF_to_TXT_Raw.txt";
File.WriteAllText(outputPath, builder.ToString());
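
The per-page step above writes text into a MemoryStream and then decodes the bytes with Encoding.Unicode, which is UTF-16 little-endian in .NET. The following is a minimal sketch of just that round trip, using only the .NET base class library; the class name StreamDecodingSketch and the sample strings are hypothetical stand-ins for what TextDevice would write, not Aspose.PDF output.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamDecodingSketch // hypothetical name, for illustration only
{
    // Decodes the full contents of a stream as UTF-16 little-endian text.
    public static string DecodeUtf16(MemoryStream stream)
    {
        // ToArray returns a copy of the whole buffer, regardless of the
        // stream position, and still works after the stream is closed.
        return Encoding.Unicode.GetString(stream.ToArray());
    }

    public static void Main()
    {
        var builder = new StringBuilder();
        // Stand-ins for the text that would be extracted from each page
        string[] pages = { "Page one\n", "Page two\n" };

        foreach (string pageText in pages)
        {
            using (var stream = new MemoryStream())
            {
                byte[] bytes = Encoding.Unicode.GetBytes(pageText);
                stream.Write(bytes, 0, bytes.Length);
                builder.Append(DecodeUtf16(stream));
            }
        }

        Console.Write(builder.ToString());
    }
}
```

Because ToArray remains valid on a closed MemoryStream, an explicit Close call inside the using block is not needed. Note also that File.WriteAllText, when called without an encoding argument, writes the result as UTF-8.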

Convert PDF to TXT file with formatting routines using C# or VB.NET

To render the text content of a PDF document as a TXT file that keeps the formatting, follow these steps:

  • Load the source PDF file
  • Initialize a StringBuilder to collect the text
  • Read the text of each page using TextFormattingMode.Pure
  • Save the converted TXT file

The following code snippet shows how to convert a PDF to a TXT file in Pure mode using C#:

using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.Text;

// dataDir is assumed to hold the working folder path, including a trailing separator

// Open document
Document pdfDocument = new Document(dataDir + "MultiColumnPdf.pdf");
// Accumulates the text extracted from every page
StringBuilder builder = new StringBuilder();

foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        // Create text device
        TextDevice textDevice = new TextDevice();

        // Extract in Pure mode: formatting routines are applied to the text
        TextExtractionOptions options =
            new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = options;

        // Convert the page and save text to the stream
        textDevice.Process(pdfPage, textStream);

        // Decode the stream contents and append them to the result
        builder.Append(Encoding.Unicode.GetString(textStream.ToArray()));
    }
}

// Save the text file
string outputPath = dataDir + "PDF_to_TXT_Pure.txt";
File.WriteAllText(outputPath, builder.ToString());
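
The Raw and Pure snippets differ only in the formatting mode passed to TextExtractionOptions, so the shared logic can be folded into one helper. The following is a sketch under stated assumptions: it uses only the Aspose.PDF calls already shown in this article, the class and method names (PdfTextExtractor, ExtractText) are hypothetical, and the namespaces are the ones used by recent Aspose.PDF for .NET releases.

```csharp
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.Text;

public static class PdfTextExtractor // hypothetical helper, not part of Aspose.PDF
{
    // Extracts the text of every page using the given formatting mode.
    public static string ExtractText(
        string pdfPath,
        TextExtractionOptions.TextFormattingMode mode)
    {
        Document pdfDocument = new Document(pdfPath);
        StringBuilder builder = new StringBuilder();

        foreach (Page pdfPage in pdfDocument.Pages)
        {
            using (MemoryStream textStream = new MemoryStream())
            {
                // Same device setup as in the snippets above; only the mode varies
                TextDevice textDevice = new TextDevice();
                textDevice.ExtractionOptions = new TextExtractionOptions(mode);
                textDevice.Process(pdfPage, textStream);

                builder.Append(Encoding.Unicode.GetString(textStream.ToArray()));
            }
        }

        return builder.ToString();
    }
}
```

With such a helper, both outputs can be produced from the same input for a side-by-side check, for example by calling ExtractText once with TextExtractionOptions.TextFormattingMode.Raw and once with TextExtractionOptions.TextFormattingMode.Pure and writing each result with File.WriteAllText.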

Visual comparison of Pure and Raw text conversions

The screenshot below is a visual comparison of the two methods we just discussed. You'll notice that Pure mode (rightmost window) displays text in the same format as the PDF file (leftmost window).

[Screenshot: the source PDF and the extracted text outputs shown side by side]

Convert TXT file to PDF programmatically using C# or VB.NET

A TXT file can be converted to a PDF with the Aspose.PDF for .NET API by writing its lines onto PDF pages one at a time. Follow the steps below to perform the text to PDF conversion:

  • Create a TextReader (here a StreamReader) for the input TXT file
  • Initialize the PDF document and add a blank page
  • Instantiate the TextBuilder object
  • Read each line of text from the input TXT file and add it to the page
  • Save the output PDF file

The code snippet below illustrates how to programmatically convert a TXT file to a PDF document using C#:

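using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is assumed to hold the working folder path, including a trailing separator

// Read input TXT file; the using statement closes the reader when done
using (TextReader tr = new StreamReader(dataDir + "Test.txt", Encoding.UTF8, true))
{
    // Initialize new Document
    Document doc = new Document();

    // Add blank page
    Page page = doc.Pages.Add();

    // Initiate TextBuilder object
    TextBuilder builder = new TextBuilder(page);

    const double margin = 72;     // top and bottom margin, in points
    const double lineStep = 15;   // vertical distance between lines, in points
    double x = 100;
    // PDF coordinates start at the bottom-left corner of the page,
    // so start near the top and move down line by line
    double y = page.PageInfo.Height - margin;

    string strLine;
    while ((strLine = tr.ReadLine()) != null)
    {
        // Start a new page once the bottom margin is reached
        if (y < margin)
        {
            page = doc.Pages.Add();
            builder = new TextBuilder(page);
            y = page.PageInfo.Height - margin;
        }

        TextFragment text = new TextFragment(strLine);
        text.Position = new Position(x, y);
        builder.AppendText(text);
        y -= lineStep;
    }

    // Save output PDF file
    doc.Save(dataDir + "TexttoPDF.pdf");
}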
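
The page-break check in the loop above is plain arithmetic on the vertical cursor, so the number of lines per page can be estimated ahead of time. The following is a rough sketch, assuming equal top and bottom margins and a fixed line step; the 72-point margin and 15-point step come from the snippet above, while the helper names (PageLayoutSketch, LinesPerPage, PagesNeeded) are hypothetical and nothing here calls Aspose.PDF.

```csharp
using System;

public static class PageLayoutSketch // hypothetical helper, for illustration only
{
    // Number of text lines that fit between the top and bottom margins
    // when consecutive lines are lineStep points apart.
    public static int LinesPerPage(double pageHeight, double margin, double lineStep)
    {
        double usable = pageHeight - 2 * margin;
        if (usable < 0 || lineStep <= 0)
        {
            return 0;
        }
        // +1 because the first line sits exactly on the top margin
        return (int)Math.Floor(usable / lineStep) + 1;
    }

    // Number of pages needed to hold lineCount lines of text.
    public static int PagesNeeded(int lineCount, int linesPerPage)
    {
        if (lineCount <= 0 || linesPerPage <= 0)
        {
            return 0;
        }
        // Integer ceiling division
        return (lineCount + linesPerPage - 1) / linesPerPage;
    }
}
```

For a page height of 842 points (an A4 page, taken here as an assumption), LinesPerPage(842, 72, 15) gives 47, so a 100-line text file would need PagesNeeded(100, 47), that is, 3 pages.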


Origin blog.csdn.net/m0_67129275/article/details/132226937