Malformed XML/HTML parsing
I need to parse multiple (approximately 1,600) HTML pages and pull out the contents of the following tag from each file.
<textarea name="line" cols="66" rows="5" class="textbox" id="line" style="font-size:12px;" onkeydown="textCounter()" onkeyup="textCounter(); storeCaret(this);" onselect="storeCaret(this);" onclick="storeCaret(this);">TEXT I WANT IS HERE
(This is meant to be an HTML textarea tag.) I had thought I could use a DOM parser, but the files contain too many errors. I then came across JTidy in another question here on Stack Overflow and tried to use that.
But it doesn't seem to be able to convert the HTML from any of the pages into XHTML so that I can then use a DOM parser.
I then thought I could use a regex, but I couldn't find the expression needed to pull out that text, and I also came across multiple questions and answers that said NOT to use regex to parse HTML.
So essentially my question is: is there any other approach I can take to get the text I need from malformed HTML?
You should be able to parse your documents with JTidy directly, without having to convert them to XHTML. I have done this on several occasions; granted, it was a while ago, but it worked fine for me, even with quite ugly HTML.
EDIT: Another option that I looked at the last time I needed to parse HTML files was TagSoup. I couldn't use it in a commercial product because of its GPL licence at the time, but if you just need this functionality as an internal tool, it might work for you.
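On the regex idea from the question: if no parser copes with the files, a narrowly scoped pattern for this one tag can serve as a last resort. The following is only a minimal sketch, not a general HTML parser. The class name `TextareaExtractor` and the method `extractLine` are hypothetical, and it assumes that `id="line"` appears once per page in double quotes, that no attribute value contains a `>` character, and that the closing `</textarea>` tag is present.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JTidy or TagSoup.
public class TextareaExtractor {

    // Matches <textarea ... id="line" ...>CONTENT</textarea>.
    // CASE_INSENSITIVE tolerates <TEXTAREA>; DOTALL lets CONTENT span lines.
    // The reluctant (.*?) stops at the first closing textarea tag.
    private static final Pattern LINE_TEXTAREA = Pattern.compile(
            "<textarea\\b[^>]*\\bid\\s*=\\s*\"line\"[^>]*>(.*?)</textarea\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed textarea content, or empty if the tag is not found.
    // Entities such as &amp; are returned as-is, not decoded.
    public static Optional<String> extractLine(String html) {
        Matcher m = LINE_TEXTAREA.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Deliberately malformed page: unclosed <p> and <div>, no </html>.
        String page = "<html><body><p>unclosed paragraph"
                + "<textarea name=\"line\" cols=\"66\" rows=\"5\" class=\"textbox\" id=\"line\""
                + " onkeyup=\"textCounter(); storeCaret(this);\">TEXT I WANT IS HERE</textarea>"
                + "<div></body>";
        System.out.println(extractLine(page).orElse("(not found)"));
    }
}
```

Because the pattern only looks at the one tag, errors elsewhere in the page do not matter, but any page that breaks the assumptions above will silently return nothing, so it is worth logging which of the files produce an empty result.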