How to extract formatting information from a Word document using Apache POI?

I am using Apache POI to extract formatting information from MS Word files.

I want to extract information such as whether a paragraph has a bullet, its background color, foreground color, alignment, etc.

There is not much documentation or many tutorials available for this. The Javadoc does not contain much helpful information either.

Where can I find tutorials or good documentation to help me learn the Apache POI API?


For HWPF (.doc), the classes you probably want are:

  • http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/ParagraphProperties.html
  • http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/CharacterProperties.html
  • http://poi.apache.org/apidocs/org/apache/poi/hwpf/model/StyleDescription.html

Depending on the exact property you want, it may be on the paragraph or the character properties.
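
As a rough illustration of that split, here is a minimal sketch that walks a .doc file and prints a few paragraph-level and run-level properties. The class name `DocFormattingDump` and the helper `alignmentName` are made up for this example, and the mapping of justification codes 0 to 3 onto left/center/right/justified is my reading of the .doc format rather than something POI names for you. Paragraph shading (background color) is left out.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public class DocFormattingDump {

    // Hypothetical helper: turns the raw justification code into a label.
    // The 0..3 mapping is an assumption about the .doc format.
    static String alignmentName(int jc) {
        switch (jc) {
            case 0: return "left";
            case 1: return "center";
            case 2: return "right";
            case 3: return "justified";
            default: return "other(" + jc + ")";
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            HWPFDocument doc = new HWPFDocument(in);
            Range range = doc.getRange();

            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph p = range.getParagraph(i);

                // Paragraph-level properties: style, alignment, list membership.
                StyleDescription sd =
                        doc.getStyleSheet().getStyleDescription(p.getStyleIndex());
                String style = (sd == null) ? "(none)" : sd.getName();
                System.out.println("Paragraph " + i
                        + " style=" + style
                        + " align=" + alignmentName(p.getJustification())
                        + " inList=" + p.isInList()
                        + " listLevel=" + p.getIlvl());

                // Character-level properties live on the runs inside the paragraph.
                for (int j = 0; j < p.numCharacterRuns(); j++) {
                    CharacterRun run = p.getCharacterRun(j);
                    System.out.println("  run bold=" + run.isBold()
                            + " italic=" + run.isItalic()
                            + " font=" + run.getFontName()
                            + " halfPoints=" + run.getFontSize()
                            + " rgb=" + run.getIco24()   // -1 when no explicit color
                            + " text=" + run.text().trim());
                }
            }
        }
    }
}
```

Run it with the path to a .doc file as the only argument. `isInList()` only tells you the paragraph belongs to a list; working out whether that list is bulleted or numbered needs the list tables, which is the sort of thing the Tika class mentioned below handles.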

The best example I can think of for reading a Word document with HWPF, getting the text, and checking styles and formatting is WordExtractor from Apache Tika: https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

(XWPF for .docx is similar)
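
For comparison, a minimal sketch of the same idea on the XWPF side. The class name `DocxFormattingDump` and the `describe` helper are invented for this example; the sketch only looks at formatting set directly on the paragraph and its runs, so anything inherited from a style will show up as unset.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class DocxFormattingDump {

    // Hypothetical helper: summarises one paragraph and its runs as text.
    static String describe(XWPFParagraph p) {
        StringBuilder sb = new StringBuilder();

        // Paragraph-level properties. getNumID() is null when the
        // paragraph carries no numbering (bullet or numbered list) reference.
        sb.append("align=").append(p.getAlignment())
          .append(" style=").append(p.getStyle())
          .append(" inList=").append(p.getNumID() != null);

        // Character-level properties are on the runs.
        for (XWPFRun r : p.getRuns()) {
            sb.append(System.lineSeparator())
              .append("  run bold=").append(r.isBold())
              .append(" italic=").append(r.isItalic())
              .append(" font=").append(r.getFontFamily())
              .append(" color=").append(r.getColor())   // hex string, or null if unset
              .append(" text=").append(r.getText(0));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0]);
             XWPFDocument doc = new XWPFDocument(in)) {
            for (XWPFParagraph p : doc.getParagraphs()) {
                System.out.println(describe(p));
            }
        }
    }
}
```

Note that `getParagraphs()` returns only body-level paragraphs; text inside tables has to be reached through the table cells.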
