Malformed XML/HTML parsing

2023-03-30 21:58 问答作者：

I need to parse a multiple(read approx 1600) HTML pages and pull out the contents of the following tag from each file.

    textarea name="line" cols="66" rows="5" class="textbox" id="line" style="font-size:12px;" onkeydown="textCounter()" onkeyup="textCounter(); storeCaret(this);" onselect="storeCaret(this);" onclick="storeCaret(this);">TEXT I WANT IS HERE

(this is actually meant to be a html textarea tag) I had thought I could use a DOMparser but the files contain too many errors, and so I came across JTidy, from another question here on stackoverflow, and I have tried to use that...

But that doesnt seem to be able to convert the html from any of the pages into XHTML so I can then use a DOM parser.

I then thought I could use regex, but I couldnt quite find the particular expression needed to pull that text, and also I came across multiple questions/answers which said NOT to use regex to parse HTML...

SO essentially my question is there any other approach 开发者_运维技巧to take in order to get the text I need from a malformed html?

You should be able to parse your documents wit JTidy directly, without having to convert them to XHTML. I did it on several occasions, granted a while ago, but it worked for me fine and with quite ugly HTML.

EDIT: Another option that I looked at, last time I needed to parse HTML files, was TagSoup. I couldn't use it in a commercial product because of its GPL licence, but if you just need this functionality as an internal tool, it might work for you

继续阅读：dom jtidy xhtml

Malformed XML/HTML parsing

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？