
Identify hidden text in Word 2003/2007 documents using Apache POI

I am converting Word (2003 and 2007) documents to HTML format. I have managed to read the text, formatting etc. from the Word document. But the document contains some hidden text, such as 'Header Change History', which should not be displayed on the page. Is there any way to identify hidden text in a Word document?

Any help would be much appreciated.


I am not sure if this is a complete (or even accurate) solution, but for files in the DOCX format, it seems that you can check whether a character run is hidden like this:

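import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTRPr;

// cr comes from XWPFParagraph.getRuns()
static boolean isHidden(XWPFRun cr) {
   CTRPr rPr = cr.getCTR().getRPr();   // null when the run has no explicit formatting
   return rPr != null && rPr.getVanish() != null;   // <w:vanish/> present: it is hidden
}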

I got this by reverse-engineering the XML, and it seems to work at least for my documents. I would be very glad of additional (more informed) input, and of a way to do the same thing in the old binary file format.
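To make the XML side of this concrete: as far as I can tell, the check above corresponds to a <w:vanish/> element inside the run's <w:rPr> in word/document.xml. Below is a minimal sketch that uses only the JDK's DOM parser (no POI) to collect the text of the visible runs. The class name VanishCheck and its methods are made up for illustration. It assumes the XML is already available as a string, and it only looks at direct run formatting, so text hidden through a style would not be detected.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Apache POI.
final class VanishCheck {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the first direct child element <w:localName>, or null. */
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** True if the <w:r> element carries <w:rPr><w:vanish/> that is not switched off. */
    static boolean isHiddenRun(Element run) {
        Element rPr = child(run, "rPr");
        if (rPr == null) {
            return false; // no explicit run formatting at all
        }
        Element vanish = child(rPr, "vanish");
        if (vanish == null) {
            return false;
        }
        // w:val is optional; an absent value means "on".
        String val = vanish.getAttributeNS(W_NS, "val");
        return !(val.equals("false") || val.equals("0") || val.equals("off"));
    }

    /** Concatenates the <w:t> text of every run that is not hidden. */
    static String visibleText(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        NodeList runs = doc.getElementsByTagNameNS(W_NS, "r");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < runs.getLength(); i++) {
            Element run = (Element) runs.item(i);
            if (isHiddenRun(run)) {
                continue; // skip hidden text
            }
            for (Node n = run.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element && W_NS.equals(n.getNamespaceURI())
                        && "t".equals(n.getLocalName())) {
                    sb.append(n.getTextContent());
                }
            }
        }
        return sb.toString();
    }
}
```

Note that a run with <w:vanish w:val="false"/> is treated as visible here, which a plain "getVanish() != null" test would not distinguish.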


For the old binary (.doc) format, the following snippet identifies hidden text using HWPF. CharacterRun.isVanished() returns true for a hidden run:

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

    boolean isHidden = false;
    // filesname is the path to the .doc file
    try (FileInputStream in = new FileInputStream(filesname)) {
        HWPFDocument doc = new HWPFDocument(in);
        Range range = doc.getRange();

        System.out.println("Word Document has " + range.numParagraphs()
                + " paragraphs");

        for (int k = 0; k < range.numParagraphs(); k++) {
            Paragraph paragraph = range.getParagraph(k);

            for (int j = 0; j < paragraph.numCharacterRuns(); j++) {
                CharacterRun cr = paragraph.getCharacterRun(j);

                if (cr.isVanished()) {
                    // this run is hidden; skip it when generating the HTML
                    System.out.println("text is hidden: " + cr.text());
                    isHidden = true;
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }