
Extract text from a PDF (I have a link to the PDF) in Ruby

I have a link like

      http://www.downloads.com/help.pdf

I want to download this, and parse it to get the text content.

How do I go about this? I also plan to tag-ize (if there is a word like that) the extracted text.


You can either use the pdf-reader gem (the examples/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader

Or the command-line utility pdftotext.
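
A minimal sketch of the pdf-reader route, assuming the gem is installed (gem install pdf-reader) and the link is a plain http(s) URL. The method name pdf_text_from_url is made up for illustration; PDF::Reader.new, #pages and Page#text are the calls the gem documents.

```ruby
require "open-uri"

# Download a PDF over HTTP(S) and return its text, page after page.
# Hypothetical helper name; assumes the pdf-reader gem is installed.
def pdf_text_from_url(url)
  uri = URI.parse(url)
  # Only fetch http/https links (URI::HTTPS is a subclass of URI::HTTP).
  raise ArgumentError, "expected an http(s) URL, got #{url.inspect}" unless uri.is_a?(URI::HTTP)

  require "pdf/reader" # the gem is only needed once a PDF is actually read

  # open-uri yields a Tempfile or a StringIO; PDF::Reader accepts either IO.
  uri.open("rb") do |io|
    PDF::Reader.new(io).pages.map(&:text).join("\n")
  end
end
```

With the link from the question this would be called as pdf_text_from_url("http://www.downloads.com/help.pdf"). Scanned PDFs that contain only images will yield little or no text this way.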


The Yomu gem can also extract the text from a PDF (as well as other MIME types) for you.

require 'yomu'
Yomu.new(file_path).text


You can also take a look at DocRipper, a gem I maintain, which provides a Ruby interface for text extraction from a number of document formats, including PDF, doc, docx and sketch.

DocRipper uses pdftotext under the hood and avoids Java dependencies.

require 'doc_ripper'

DocRipper::rip('/path/to/file.pdf') # => "Pdf text"

You can read remote files using the Ruby standard library:

require 'open-uri'
require 'doc_ripper'

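# Kernel#open no longer opens URLs on Ruby 3.0+; use URI.open instead.
# open-uri returns a StringIO (which has no #path) for responses under 10 KB.
tmp_file = URI.open("some_uri")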
DocRipper::rip(tmp_file.path)
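
For the "tag-ize" step mentioned in the question, one simple approach is to treat the most frequent non-trivial words as tags. The sketch below is only an illustration in plain Ruby: the method name, the stop-word list and the thresholds are all assumptions, not part of any gem named above.

```ruby
# Hypothetical, deliberately tiny stop-word list; extend it for real use.
STOP_WORDS = %w[the and for with that this from are was you have not but].freeze

# Return up to `limit` tags: the most frequent words of at least
# `min_length` letters that are not stop words.
def tags_from_text(text, limit: 10, min_length: 3)
  words = text.downcase.scan(/[[:alpha:]]+/)
  words = words.select { |w| w.length >= min_length && !STOP_WORDS.include?(w) }

  # Count occurrences of each remaining word.
  counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }

  # Most frequent first; ties are broken alphabetically for a stable order.
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end
```

It takes whatever string the extraction step returned, for example tags_from_text(extracted_text, limit: 5), where extracted_text is a placeholder for that string.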