Convert Microsoft Office documents to Text [closed]
I'm looking for a library (or command-line tool) to turn MS Office documents into either plain text or HTML (for conversion to text).
It must run on Linux (not via Wine!).
I found antiword, but its last release was in 2005, so it won't read the new Office 2007 formats.
I need it to read Word, Excel and PowerPoint documents.
The new Office 2007 formats are just ZIP-compressed XML.
For .docx at least, the body text is in word/document.xml once you decompress the file. Strip out all the XML tags and you get the text. You lose the formatting, of course, but for text indexing or something similar the formatting isn't relevant anyway. The order of the text is preserved.
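As a rough sketch of the approach described above, here is one way to do it in Python with only the standard library. The function name `docx_to_text` is made up for illustration. It parses the XML instead of stripping tags with a regex, so each paragraph ends up on its own line. It only reads word/document.xml, so headers, footers and footnotes (which live in other parts of the archive) are not included, and text inside a text box may show up twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    # A .docx file is a ZIP archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of text within the paragraph.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (the file name is just a placeholder) returns a single string that can be handed to an indexer.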
I haven't analyzed Excel and PowerPoint, but the approach should be similar. Excel might be trickier, depending on how the cells are stored in the XML file.
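For what it's worth, in the Office Open XML layout slide text sits in ppt/slides/slideN.xml and the string values of Excel cells sit in xl/sharedStrings.xml, so a similar sketch works for both. The function names `pptx_to_text` and `xlsx_shared_strings` are hypothetical. Two assumptions to be aware of: slides are sorted by the number in their file name, which is not guaranteed to match the order shown in PowerPoint, and numeric cells (stored in the worksheet parts) are not extracted at all.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace: slide text lives in <a:t> elements.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# SpreadsheetML namespace: shared cell strings live in <t> elements.
S_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def pptx_to_text(path):
    """Return slide text, one line per non-empty paragraph."""
    lines = []
    with zipfile.ZipFile(path) as archive:
        slides = [name for name in archive.namelist()
                  if re.fullmatch(r"ppt/slides/slide\d+\.xml", name)]
        # Numeric sort so slide10 follows slide9 (assumed display order).
        slides.sort(key=lambda name: int(re.search(r"\d+", name).group()))
        for name in slides:
            root = ET.fromstring(archive.read(name))
            for para in root.iter(A_NS + "p"):
                text = "".join(t.text or "" for t in para.iter(A_NS + "t"))
                if text:
                    lines.append(text)
    return "\n".join(lines)


def xlsx_shared_strings(path):
    """Return the workbook's shared strings (string cell values only)."""
    with zipfile.ZipFile(path) as archive:
        # A workbook without any string cells may lack this part.
        if "xl/sharedStrings.xml" not in archive.namelist():
            return []
        root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
    # One <si> per distinct string; rich text splits it across several <t>.
    return ["".join(t.text or "" for t in si.iter(S_NS + "t"))
            for si in root.iter(S_NS + "si")]
```

The shared strings come back in the order they are stored, not in cell order, which is usually good enough for indexing but not for rebuilding the sheet.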
The Apache POI library can extract text from Office formats. It is used by Tika (which started out under Lucene), and Tika can be run as a command-line tool:
curl http://.../document.doc \
| java -jar tika-app-x.y.jar --text \
| grep -q keyword
PyODConverter automates OpenOffice; use it to do the conversions.
The OONinja example converts DOC to PDF, but any import or export format that OpenOffice supports should work. It also has the advantage of running headless if required.
Other options include AbiWord or, if you really just want a command-line tool, wvWare, but I don't think it supports .docx.
You can use Autonomy KeyView in your application, given the appropriate licence. It seems to be extremely powerful and can extract text from almost everything; we use it to identify text within files of arbitrary format.
I've no idea what the licensing terms are, but they're available from your account manager :)