Grabbing text from various document formats in Ruby on Rails

I'm new to Rails but am developing a web app that requires taking text from a large collection of text files and displaying it as HTML. The files are in .doc, .docx, .wps, and .pages formats, and are currently just sitting on a hard drive. There are few enough .wps and .pages files that I could convert them to .doc manually, but the question remains: how do I get at the text inside a .doc or .docx file so that I can save it into a SQLite database for later use?

Thanks!


Take a look at Yomu. It's a gem that wraps Apache Tika, and it supports a variety of document formats, including the following:

  • Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
  • OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
  • Apple iWork Formats
  • Rich Text Format (.rtf)
  • Portable Document Format (.pdf)
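As a rough sketch of how that could look for a folder of files: the snippet below walks a directory and collects the text of each document. `Yomu.new(path).text` is the call shown in the gem's README; everything else (`DOC_EXTENSIONS`, `YOMU_EXTRACTOR`, `extract_texts`) is a name made up for this example. It assumes the yomu gem and a Java runtime are installed, and I haven't checked whether Tika handles .wps, so treat that extension as a guess.

```ruby
require "find"

# Extensions mentioned in the question (hypothetical constant; adjust as needed).
DOC_EXTENSIONS = %w[.doc .docx .wps .pages].freeze

# Default extractor. Assumes the yomu gem is installed and Java is available,
# since Yomu shells out to Apache Tika. The gem is only loaded when called.
YOMU_EXTRACTOR = lambda do |path|
  require "yomu"
  Yomu.new(path).text
end

# Walk `root` recursively and return a hash of { file path => extracted text }
# for every file whose extension is in DOC_EXTENSIONS (case-insensitive).
def extract_texts(root, extractor: YOMU_EXTRACTOR)
  results = {}
  Find.find(root) do |path|
    next unless File.file?(path)
    next unless DOC_EXTENSIONS.include?(File.extname(path).downcase)

    # Store the text with leading/trailing whitespace removed.
    results[path] = extractor.call(path).to_s.strip
  end
  results
end
```

From there, each pair in the hash could be saved through whatever model you set up, for example a hypothetical `Document` with `path` and `body` columns. The extractor is passed in as an argument so you can swap in a different tool, or a stub while testing, without touching the directory walk.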


It's a long, roundabout way, but OpenOffice can convert files, and there are programmatic ways to drive it: http://railstech.com/2010/08/convert-open-office-document-to-another-open-office-format/

That may not be the best way yet, but maybe it will grease the wheels a bit.
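One way this might look from Ruby, as a sketch rather than what the linked article does: shell out to the office suite in headless mode and have it write a plain-text copy. The `--headless`, `--convert-to`, and `--outdir` flags are the ones current LibreOffice builds of `soffice` accept; older OpenOffice releases used single-dash options, so check your install. The helper names `soffice_command` and `convert_to_text` are invented for this example.

```ruby
# Build the argument list for a headless conversion.
# Assumes a `soffice` binary on the PATH; "txt:Text" asks for plain-text output.
def soffice_command(input_path, out_dir, format: "txt:Text", binary: "soffice")
  [binary, "--headless", "--convert-to", format, "--outdir", out_dir, input_path]
end

# Run the conversion and return the expected path of the .txt file,
# or nil if the command failed or could not be started.
def convert_to_text(input_path, out_dir)
  ok = system(*soffice_command(input_path, out_dir))
  return nil unless ok

  File.join(out_dir, File.basename(input_path, ".*") + ".txt")
end
```

Passing the command as an array to `system` avoids shell quoting problems with file names that contain spaces. Once the .txt file exists, `File.read` gives you the text to store.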
