[ruby] OCR text recognition for PDFs

  • Install the Docsplit gem
gem install docsplit
  • Install GraphicsMagick
# mac
brew install graphicsmagick
  • Install Poppler
# ubuntu
apt-get install poppler-utils poppler-data
# mac
brew install poppler
  • (Optional) Install Ghostscript
    Ghostscript is required to convert PDF and Postscript files.
# mac
brew install ghostscript
  • (Optional) Install Tesseract:
    Without Tesseract installed, you’ll still be able to extract text from documents, but you won’t be able to automatically OCR them.
# mac
brew install tesseract
  • (Optional) Install pdftk
    Without pdftk installed, you can use Docsplit, but won’t be able to split apart a multi-page PDF into single-page PDFs.
# ubuntu
apt-get install pdftk
  • (Optional) Install LibreOffice.
# ubuntu
apt-get install libreoffice
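
Once the packages above are installed, it can help to confirm which tools are actually reachable before running Docsplit. The following is a minimal sketch, not part of Docsplit: the helper names are hypothetical, and the executable names (gm, pdftotext, gs, tesseract, pdftk, soffice) are the usual ones for each package but may differ on your system. On macOS in particular, soffice often lives inside the LibreOffice application bundle rather than on the PATH.

```ruby
# Usual executable name for each package (an assumption; some systems
# install these under different names or outside the PATH).
DOCSPLIT_TOOLS = {
  'GraphicsMagick' => 'gm',
  'Poppler'        => 'pdftotext',
  'Ghostscript'    => 'gs',
  'Tesseract'      => 'tesseract',
  'pdftk'          => 'pdftk',
  'LibreOffice'    => 'soffice'
}.freeze

# True if an executable file with this name exists in any PATH directory.
def executable_on_path?(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).reject(&:empty?).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Maps each package name to true (found) or false (missing).
def docsplit_dependency_report(path = ENV.fetch('PATH', ''))
  DOCSPLIT_TOOLS.transform_values { |exe| executable_on_path?(exe, path) }
end

docsplit_dependency_report.each do |package, found|
  puts format('%-15s %s', package, found ? 'found' : 'missing')
end
```

A missing optional tool only disables the feature described in its bullet above, so a "missing" line is not necessarily an error.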

Usage

The Docsplit gem includes both the docsplit command-line utility as well as a Ruby API. The available commands and options are identical in both.
--output or -o can be passed to any command in order to store the generated files in a directory of your choosing.

images --size --format --pages --density Ruby: extract_images
Generates an image for each page in the document at the specified resolution and format. Pass --pages or -p to choose the specific pages to image. Passing --size or -s will specify the desired image resolution, --density or -d will specify the DPI to rasterize the images at during conversion by GraphicsMagick, and --format or -f will select the format of the final images.

docsplit images example.pdf
docsplit images docs/*.pdf --size 700x,50x50 --format gif --pages 3,10-15,42
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])
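
To make the --pages syntax in the example above concrete, here is a small sketch that expands a specification such as 3,10-15,42 into individual page numbers. The helper expand_pages is hypothetical and written only for illustration; it mirrors the syntax shown in the example and is not Docsplit's own parser.

```ruby
# Expand a page specification such as "3,10-15,42" into a sorted list of
# page numbers. Illustrative helper only, not Docsplit's internal parser.
def expand_pages(spec)
  spec.split(',').flat_map do |part|
    part = part.strip
    case part
    when /\A(\d+)-(\d+)\z/
      first, last = $1.to_i, $2.to_i
      raise ArgumentError, "descending range: #{part}" if first > last
      (first..last).to_a
    when /\A\d+\z/
      [part.to_i]
    else
      raise ArgumentError, "unrecognized page spec: #{part.inspect}"
    end
  end.uniq.sort
end

p expand_pages('3,10-15,42')
# => [3, 10, 11, 12, 13, 14, 15, 42]
```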

text --pages --ocr --no-ocr --no-clean --language --no-orientation-detection Ruby: extract_text
Extract the complete UTF-8-encoded plain text of a document to a single file. If you’d like to extract the text for each page separately, pass --pages all. You can use the --ocr and --no-ocr flags to force OCR, or disable it, respectively. By default (if Tesseract is installed) Docsplit will OCR the text of each page for which it fails to extract text directly from the document. Docsplit will also attempt to clean up garbage characters in the OCR’d text — to disable this, pass the --no-clean flag.

By default Tesseract ships only with English extraction data. If any additional language models are installed you can select one using the --language flag. If Tesseract’s orientation detection model is installed, Docsplit will automatically use it unless you pass the --no-orientation-detection flag.
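
The interaction between --ocr, --no-ocr and the default fallback can be sketched as a small decision function. This is a simplified illustration of the behaviour described above, under stated assumptions: the helper pages_needing_ocr is hypothetical, and treating a page as "failed" when its directly extracted text is blank is a guess at the criterion, not Docsplit's actual heuristic.

```ruby
# Decide which pages would be sent to OCR, given the text extracted directly
# from each page (a Hash of page number => text or nil).
#   ocr: true  corresponds to --ocr (force OCR on every page)
#   ocr: false corresponds to --no-ocr (never OCR)
#   ocr: nil   is the default: OCR only pages with no extractable text
# The blank-text check is an assumption made for this sketch.
def pages_needing_ocr(page_texts, ocr: nil, tesseract_installed: true)
  return [] if ocr == false
  return page_texts.keys.sort if ocr == true
  return [] unless tesseract_installed

  page_texts.select { |_page, text| text.to_s.strip.empty? }.keys.sort
end

p pages_needing_ocr({ 1 => 'Hello', 2 => "  \n", 3 => nil })
# => [2, 3]
```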

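Extracting Chinese text

# You may need to install tesseract-lang for the Chinese language data
# extract_text_from_pdfs.rb
require 'docsplit'

# Collect every PDF in the pdfs/ directory next to this script
docs = Dir[File.join(__dir__, 'pdfs', '*.pdf')]
# Force OCR with the Simplified Chinese model and write the results to ./text/
Docsplit.extract_text(docs, ocr: true, language: 'chi_sim', output: './text/')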

docsplit text path/to/doc.pdf --pages all --language deu
docs = Dir['storage/originals/*.doc']
Docsplit.extract_text(docs, :ocr => false, :output => 'storage/text')

pages --pages Ruby: extract_pages
Burst apart a document into single-page PDFs. Use --pages to specify the individual pages (or ranges of pages) you’d like to generate.

docsplit pages path/to/doc.pdf --pages 1-10
Docsplit.extract_pages('path/to/presentation.ppt')
Docsplit.extract_pages('doc.pdf', :pages => 1..10)

pdf Ruby: extract_pdf
Convert documents into PDFs. Any type of document that LibreOffice can read may be converted. These include the Microsoft Office formats: doc, docx, ppt, xls and so on, as well as html, odf, rtf, swf, svg, and wpd. The first time that you convert a new file type, LibreOffice will lazy-load the code that processes it — subsequent conversions will be much faster.

docsplit pdf documentation/*.html
Docsplit.extract_pdf('expense_report.xls')

author, date, creator, keywords, producer, subject, title, length
Ruby: extract_…
Retrieve a piece of metadata about the document. The docsplit utility will print to stdout, the Ruby API will return the value.

docsplit title path/to/stooges.pdf
=> Disorder in the Court
Docsplit.extract_length('path/to/stooges.pdf')
=> 36
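
If you need several metadata fields at once, the per-field calls can be gathered into one Hash. This is a minimal sketch: document_metadata is a hypothetical helper, and passing the API object as an argument is a choice made here so the function can be tried without the gem installed. In real use you would pass Docsplit itself.

```ruby
# The metadata fields listed above; each has a matching extract_<field> call.
METADATA_FIELDS = %i[author date creator keywords producer subject title length].freeze

# Collect all metadata fields for one document into a Hash.
# `api` is any object that responds to extract_<field>(path); pass Docsplit
# after `require 'docsplit'`. Injecting it is an assumption of this sketch.
def document_metadata(path, api)
  METADATA_FIELDS.each_with_object({}) do |field, result|
    result[field] = api.public_send("extract_#{field}", path)
  end
end

# Usage (needs the docsplit gem and its dependencies installed):
#   require 'docsplit'
#   document_metadata('path/to/stooges.pdf', Docsplit)
```

Each field is a separate call, so on a large batch of documents this runs the underlying extraction eight times per file.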

Internals

Under the hood, Docsplit is a thin wrapper around the excellent GraphicsMagick, Poppler, PDFTK, Tesseract, and LibreOffice libraries. Poppler is used to extract text and metadata from PDF documents, PDFTK is used to split them apart into pages, and GraphicsMagick is used to generate the page images (internally, it’s rendering them with Ghostscript). LibreOffice and GraphicsMagick convert documents and images to PDF. Tesseract provides the transparent OCR fallback when the document is a simple scan and the file doesn’t contain any embedded text.

Because documents need to be in PDF format before any metadata, text, or images are extracted, it’s faster to run docsplit pdf on a non-PDF document up front if you’re planning to run more than one extraction. Otherwise Docsplit will write out the PDF version to a temporary file before proceeding with each command.
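
The convert-once pattern might look like the following in Ruby. This is a sketch under stated assumptions: convert_once_then_extract is a hypothetical helper, the API object is passed in (use Docsplit after require 'docsplit'), and the code assumes extract_pdf writes a file named after the source's basename with a .pdf extension into the :output directory.

```ruby
# Convert a document to PDF once, then run every extraction against that PDF
# so the conversion is not repeated for each command.
# Assumptions: `api` is Docsplit or a stand-in with the same methods, and
# extract_pdf writes "<basename>.pdf" into the :output directory.
def convert_once_then_extract(source, workdir, api)
  api.extract_pdf(source, :output => workdir)
  pdf = File.join(workdir, File.basename(source, '.*') + '.pdf')

  api.extract_text(pdf, :output => File.join(workdir, 'text'))
  api.extract_images(pdf, :output => File.join(workdir, 'images'), :format => [:png])
  { :pdf => pdf, :length => api.extract_length(pdf) }
end

# Usage (needs the docsplit gem and LibreOffice installed):
#   require 'docsplit'
#   convert_once_then_extract('storage/originals/report.doc', 'storage/work', Docsplit)
```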

Documentation link

Origin blog.csdn.net/qq_41037744/article/details/114818270