How to convert .doc or .docx files to .txt
I'm wondering how you can convert Wo开发者_StackOverflowrd .doc/.docx files to text files through Java. I understand that there's an option where I can do this through Word itself but I would like to be able to do something like this:
java DocConvert somedocfile.doc converted.txt
Thanks.
If you're interested in a Java library that deals with Word document files, you might want to look at e.g. Apache POI. A quote from the website:
Why should I use Apache POI?
A major use of the Apache POI api is for Text Extraction applications such as web spiders, index builders, and content management systems.
P.S.: If, on the other hand, you're simply looking for a conversion utility, Stack Overflow may not be the most appropriate place to ask for this.
Edit: If you don't want to use an existing library but do all the hard work yourself, you'll be glad to hear that Microsoft has published the required file format specifications. (The Microsoft Open Specification Promise lists the available specifications. Just google for any of them that you're interested in. In your case, you'd need e.g. the OLE2 Compound File Format, the Word 97 binary file format, and the Open XML formats.)
Use command line utility Apache Tika. Tika suports a wide number of formats (ex: doc, docx, pdf, html, rtf ...)
java -jar tika-app-1.3.jar -t somedocfile.doc > converted.txt
Programatically:
File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);
You can use Apache POI too. They have a tool to extract text from doc/docx Text Extraction. If you want to extract only the text, you can use the code below. If you want to extract Rich Text (such as formatting and styling), you can use Apache Tika.
Extract doc:
InputStream fis = new FileInputStream(...);
POITextExtractor extractor;
// if docx
if (fileName.toLowerCase().endsWith(".docx")) {
XWPFDocument doc = new XWPFDocument(fis);
extractor = new XWPFWordExtractor(doc);
} else {
// if doc
POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
extractor = ExtractorFactory.createExtractor(fileSystem);
}
String extractedText = extractor.getText();
You should consider using this library. Its Apache POI
Excerpt from the website
In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.
Docmosis can read a doc and spit out the text in it. Requires some infrastructure to be installed (such as OpenOffice). You can also use JODConverter.
精彩评论