
Malformed XML/HTML parsing

I need to parse a large number (approximately 1,600) of HTML pages and pull out the contents of the following tag from each file.

    textarea name="line" cols="66" rows="5" class="textbox" id="line" style="font-size:12px;" onkeydown="textCounter()" onkeyup="textCounter(); storeCaret(this);" onselect="storeCaret(this);" onclick="storeCaret(this);">TEXT I WANT IS HERE

(This is actually meant to be an HTML textarea tag.) I had thought I could use a DOM parser, but the files contain too many errors. I then came across JTidy in another question here on Stack Overflow, and I have tried to use that...

But JTidy doesn't seem to be able to convert the HTML from any of the pages into XHTML so that I can then use a DOM parser.

I then thought I could use a regex, but I couldn't find the particular expression needed to pull out that text, and I also came across multiple questions and answers that said NOT to use regex to parse HTML...

So essentially my question is: is there any other approach I can take to get the text I need from malformed HTML?


You should be able to parse your documents with JTidy directly, without having to convert them to XHTML first. I have done this on several occasions; admittedly that was a while ago, but it worked fine for me, even with quite ugly HTML.
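As a rough illustration of parsing directly with JTidy, the sketch below builds a DOM from the raw HTML and reads the text of the textarea with `id="line"`. The class name `TextareaExtractor` and the helper `textOf` are made up for this example, and it assumes the JTidy jar is on the classpath. The text is collected from child nodes by hand, on the assumption that JTidy's DOM may not support the DOM Level 3 `getTextContent()` method.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

// Hypothetical helper class, not part of JTidy.
public class TextareaExtractor {

    /** Returns the text of the first textarea with the given id, or null if none is found. */
    public static String extract(InputStream html, String id) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress the summary output
        tidy.setShowWarnings(false);  // malformed pages produce many warnings

        // Passing null as the output stream builds the DOM without writing cleaned markup.
        Document doc = tidy.parseDOM(html, null);

        NodeList areas = doc.getElementsByTagName("textarea");
        for (int i = 0; i < areas.getLength(); i++) {
            Element area = (Element) areas.item(i);
            if (id.equals(area.getAttribute("id"))) {
                return textOf(area);
            }
        }
        return null;
    }

    // Concatenates the direct text children of a node (assumption: getTextContent() is unavailable).
    private static String textOf(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><form><textarea name=\"line\" id=\"line\">"
                + "TEXT I WANT IS HERE</textarea></form></body></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(extract(in, "line"));
    }
}
```

For the 1,600 files, the same `extract` call could be run in a loop over a `FileInputStream` per file; pages where it returns null would be worth inspecting by hand.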

EDIT: Another option that I looked at the last time I needed to parse HTML files was TagSoup. I couldn't use it in a commercial product because of its GPL licence, but if you just need this functionality for an internal tool, it might work for you.
