Extract text from a PDF (I have a link to the PDF) in Ruby
I have a link like
http://www.downloads.com/help.pdf
I want to download this, and parse it to get the text content.
How do I go about this? I also plan to tag-ize (if there is a word like that) the extracted text.
You can either use the pdf-reader gem (the example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader
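For reference, a minimal sketch of the pdf-reader approach, assuming the gem is installed (gem install pdf-reader). The helper names pages_text and pdf_text are made up for illustration; only PDF::Reader.new, #pages and #text come from the gem itself.

```ruby
# Join the text of every page of a reader-like object.
# `reader` must respond to #pages and each page to #text,
# which is the interface PDF::Reader exposes.
def pages_text(reader)
  reader.pages.map(&:text).join("\n")
end

# Hypothetical helper: open a PDF on disk and return all of its text.
def pdf_text(path)
  require 'pdf/reader' # loaded lazily; needs the pdf-reader gem
  pages_text(PDF::Reader.new(path))
end
```

Calling pdf_text('/path/to/help.pdf') should then return the whole document as one string, with a newline between pages.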
Or the command-line utility pdftotext.
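If you go the pdftotext route, something like the following should work from Ruby, assuming pdftotext (part of poppler-utils / xpdf) is on your PATH. The method names here are my own, not part of any library.

```ruby
require 'open3'

# Build the argument list for pdftotext. Passing "-" as the output
# file tells pdftotext to write the text to standard output.
def pdftotext_command(pdf_path)
  ['pdftotext', pdf_path, '-']
end

# Run pdftotext on a local file and return the extracted text.
# Raises if the tool exits with a non-zero status.
def pdftotext(pdf_path)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end
```

Passing the arguments as an array (rather than one shell string) avoids quoting problems with file names that contain spaces.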
The Yomu gem will also be able to extract the text from a PDF (as well as other MIME types) for you.
require 'yomu'
Yomu.new(file_path).text
You can also take a look at DocRipper, a gem I maintain that provides a Ruby interface for text extraction from a number of document formats, including PDF, doc, docx and sketch.
DocRipper uses pdftotext under the hood and avoids Java dependencies.
require 'doc_ripper'
DocRipper::rip('/path/to/file.pdf') # => "Pdf text"
You can read remote files using the Ruby standard library:
require 'open-uri'
require 'doc_ripper'
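tmp_file = URI.open("some_uri") # on Ruby 3.0+, plain open() no longer fetches URLs
DocRipper::rip(tmp_file.path)   # note: small responses come back as a StringIO, which has no #path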
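One caveat with open-uri is that small downloads are returned as a StringIO rather than a Tempfile, so there is no path to hand to a file-based extractor. A rough sketch that always gives you a real file on disk (save_to_tempfile and download_pdf are hypothetical helper names, not part of open-uri or DocRipper):

```ruby
require 'open-uri'
require 'tempfile'

# Copy any readable IO (Tempfile or StringIO) into a fresh Tempfile
# with the given extension, and return that Tempfile.
def save_to_tempfile(io, ext = '.pdf')
  file = Tempfile.new(['download', ext])
  file.binmode          # PDFs are binary; avoid newline/encoding conversion
  file.write(io.read)
  file.flush            # make sure the bytes are on disk before others read the path
  file
end

# Download a URL and return a Tempfile holding its contents.
def download_pdf(url)
  URI.open(url) { |remote| save_to_tempfile(remote) }
end
```

You could then pass download_pdf("http://www.downloads.com/help.pdf").path to whichever extractor you picked above.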