How to Extract docx (Word 2007 above) using Apache POI
Hai, i'm using Apache POI 3.6 I've already created some code..
XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
wordxExtractor = new XWPFWordExtractor(doc);
text = wordxExtractor.getText();
System.out.println("adding docx " + file);
d.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
unfortunately, it generated error..
Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/Documen开发者_开发技巧tException
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149)
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136)
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:98)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:199)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:178)
at org.apache.poi.util.PackageHelper.open(PackageHelper.java:53)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:98)
at org.apache.lucene.demo.Indexer.indexDocs(Indexer.java:153)
at org.apache.lucene.demo.Indexer.main(Indexer.java:88)
It seemed that it used Constructor
XWPFWordExtractor(OPCPackage container)
but not this one ->
XWPFWordExtractor(XWPFDocument document)
Any wondering why?? Or any idea how I can extract the .docx then convert it into a String?
You need to Add dom4j Library to your claspath or your project libraries
It looks like you don't have all of the dependencies on your classpath.
If you look at http://poi.apache.org/overview.html you'll see that dom4j is a required library when working with the OOXML files. From the exception you got, it seems that you don't have it... If you look in the POI binary download, you should find it in the ooxml-libs subdirectory.
You could try docx4j instead; see http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java
精彩评论