How to extract formatting information from a Word document using Apache POI?
I am using Apache POI to extract formatting information from MS Word files.
I want to extract information such as whether a paragraph has a bullet, its background color, foreground color, alignment, etc.
There are few tutorials and little documentation available for this, and the Javadoc does not contain much helpful information either.
Where can I find tutorials or good documentation that would help me learn the Apache POI API?
For HWPF (.doc), the classes you probably want are:
- http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/ParagraphProperties.html
- http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/CharacterProperties.html
- http://poi.apache.org/apidocs/org/apache/poi/hwpf/model/StyleDescription.html
Depending on the exact property you want, it may be on the paragraph or the character properties.
The best example I can think of for reading a Word document with HWPF, getting the text, and checking styles and formatting is WordExtractor from Apache Tika: https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
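To make the paragraph/character split concrete, here is a minimal sketch of walking a .doc file with HWPF and printing a few properties. It assumes the POI scratchpad jar (which contains HWPF) is on the classpath. The class name DocFormattingDump and the helper alignmentName are made up for illustration, and the mapping of justification codes to names (0 left, 1 center, 2 right, 3 justified) is my reading of the Paragraph Javadoc, so check it against your POI version.

```java
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public class DocFormattingDump {

    // Hypothetical helper: turns the numeric justification code into a label.
    // Assumed mapping: 0 = left, 1 = center, 2 = right, 3 = justified.
    static String alignmentName(int jc) {
        switch (jc) {
            case 0: return "left";
            case 1: return "center";
            case 2: return "right";
            case 3: return "justified";
            default: return "other(" + jc + ")";
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .doc file
        try (FileInputStream in = new FileInputStream(args[0]);
             HWPFDocument doc = new HWPFDocument(in)) {
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph p = range.getParagraph(i);

                // Paragraph-level properties: style, alignment, list membership
                StyleDescription style =
                        doc.getStyleSheet().getStyleDescription(p.getStyleIndex());
                String styleName = (style == null) ? "(unknown)" : style.getName();
                System.out.printf("Paragraph %d: style=%s, alignment=%s, inList=%b%n",
                        i, styleName, alignmentName(p.getJustification()), p.isInList());

                // Character-level properties live on the runs inside the paragraph
                for (int j = 0; j < p.numCharacterRuns(); j++) {
                    CharacterRun run = p.getCharacterRun(j);
                    // getColor() is, as far as I can tell, a palette index rather
                    // than an RGB value; getFontSize() is in half-points.
                    System.out.printf("  run %d: bold=%b, italic=%b, color=%d, "
                                    + "font=%s, sizeHalfPt=%d, text=%s%n",
                            j, run.isBold(), run.isItalic(), run.getColor(),
                            run.getFontName(), run.getFontSize(), run.text().trim());
                }
            }
        }
    }
}
```

Run it with the path to a .doc file as the only argument. Properties not printed here, such as paragraph shading for the background color, should be reachable in the same way from Paragraph or CharacterRun.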
(XWPF for .docx is similar)
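For comparison, a rough XWPF equivalent for .docx files might look like the sketch below. It assumes the poi-ooxml jar is on the classpath. The class name DocxFormattingDump and the helper isListItem are invented for this example, and treating a non-null numbering ID as "has a bullet or number" is a simplification that does not distinguish bullets from numbered lists.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.math.BigInteger;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class DocxFormattingDump {

    // Hypothetical helper: a paragraph that carries a numbering ID belongs
    // to some list (bulleted or numbered).
    static boolean isListItem(BigInteger numId) {
        return numId != null;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .docx file
        try (FileInputStream in = new FileInputStream(args[0]);
             XWPFDocument doc = new XWPFDocument(in)) {
            int i = 0;
            for (XWPFParagraph p : doc.getParagraphs()) {
                // Paragraph-level properties: style id, alignment, list membership
                System.out.printf("Paragraph %d: style=%s, alignment=%s, inList=%b%n",
                        i++, p.getStyle(), p.getAlignment(), isListItem(p.getNumID()));

                // Character-level properties live on the runs
                for (XWPFRun run : p.getRuns()) {
                    // getColor() returns a hex string such as "FF0000",
                    // or null when no explicit color is set on the run.
                    System.out.printf("  run: bold=%b, italic=%b, color=%s, font=%s, text=%s%n",
                            run.isBold(), run.isItalic(), run.getColor(),
                            run.getFontFamily(), run.text());
                }
            }
        }
    }
}
```

Note that doc.getParagraphs() only covers body paragraphs; text inside tables, headers, and footers has to be reached through the corresponding XWPF objects.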