Is there a Windows program that can convert Word (.doc and .docx) into text?
I need a Windows program to convert a Word file (.doc) into text. Something like "antiword" for Windows.
I need it because I need to convert Word files into text and use Lucene to index them, and I am in a Windows environment :(
Thanks for all your help!!!
Yes. That program is called MS Word.
Open the file in Word via COM, and save it as text programmatically. On the other hand, is Lucene not able to read Word documents natively?
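For what it's worth, here is a rough sketch of the COM route in Python. It assumes Word is installed and that the pywin32 package (`win32com`) is available; the function names `text_output_path` and `convert_with_word` are made up for illustration. The value 2 is Word's `wdFormatText` save format.

```python
import os
from pathlib import Path

WD_FORMAT_TEXT = 2  # Word's wdFormatText constant (WdSaveFormat enumeration)


def text_output_path(doc_path):
    """Return the absolute .txt path that sits next to the given Word file."""
    return Path(os.path.abspath(doc_path)).with_suffix(".txt")


def convert_with_word(doc_path):
    """Open a .doc/.docx in Word through COM and save it as plain text.

    Hypothetical helper: needs Windows, an installed copy of Word, and pywin32.
    """
    # Imported lazily so this module still loads where pywin32 is missing.
    import win32com.client

    source = Path(os.path.abspath(doc_path))  # Word wants absolute paths
    target = text_output_path(source)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(str(source))
        try:
            doc.SaveAs(str(target), FileFormat=WD_FORMAT_TEXT)
        finally:
            doc.Close(False)  # close without saving changes to the original
    finally:
        word.Quit()  # always shut the Word process down
    return target
```

Calling `convert_with_word("report.doc")` should leave a `report.txt` next to the original, which can then be handed to Lucene. Starting Word for every file is slow, so for a large batch it is probably better to keep one Word instance open and loop over the documents.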
If you really need a program, here's one. I have not tried it, but you can give it a shot. Otherwise, you can just use COM / VBScript.
Using POI (http://poi.apache.org/) you should be able to index the old binary DOC formats. Relevant code snippets can be found on http://kalanir.blogspot.com/2008/08/how-to-index-microsoft-format-documents.html.
And for DOCX, since that's basically a ZIP file which contains a bunch of XML and resource files, it should be relatively easy to find the XML file containing the actual text (I think it's word/document.xml) and index the text contained in it (after stripping off all the XML markup)...
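To illustrate the ZIP idea, here is a minimal sketch in Python using only the standard library. It assumes the main text lives in word/document.xml and only collects the `w:t` text nodes paragraph by paragraph; the function name `docx_to_text` is made up for this example. Headers, footnotes and the like sit in other parts of the archive and are ignored, and text inside text boxes may show up twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the body text of a .docx given a path or a file-like object."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each w:p is a paragraph; its visible text is split across w:t nodes.
    for para in root.iter(f"{{{W_NS}}}p"):
        pieces = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Something like `docx_to_text("report.docx")` then returns one line per paragraph, ready to be fed to the indexer. This does not handle the old binary .doc format, which is not a ZIP archive.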
You can use the OpenXML SDK to easily strip the text out of DOCX files. Does not work with .doc though--you probably need to use MS Word and COM for that.