HTML data extraction

I'm fetching a web page and need to extract some data from it. To be more specific, from this element:

<input type="hidden" value="1" name="d520783895194bd08750e47c744d553d">

I need to extract the value of the "name" attribute. I've heard that regular expressions are not the best solution for this, so I'd like to ask what the best way to get at this piece of data is.


After parsing the page with NekoHTML or TagSoup (either of which should cope with the fact that your input tag is not closed), I suggest using an XPath expression:

//input[@type='hidden'][@value=1]/@name

In Groovy you would apply it in the form of a GPath expression.
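To illustrate the XPath expression above, here is a minimal sketch in plain Java using only the JDK's built-in XML and XPath classes. It assumes the page has already been cleaned into well-formed markup, so a hand-written string with a closed input tag stands in for the output of NekoHTML or TagSoup. The class name HiddenInputName and the method extractName are made up for this example. A real parser may change element-name case or add a namespace, in which case the expression would probably need adjusting.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class HiddenInputName {

    // The XPath from the answer: the name attribute of a hidden input whose value is 1.
    static final String QUERY = "//input[@type='hidden'][@value=1]/@name";

    // Assumption: the markup is already well-formed (e.g. cleaned by an HTML parser).
    // Returns the attribute value, or an empty string if nothing matches.
    static String extractName(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            return XPathFactory.newInstance().newXPath().evaluate(QUERY, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for parser output: the same input element, but properly closed.
        String cleaned = "<html><body><form>"
                + "<input type=\"hidden\" value=\"1\" name=\"d520783895194bd08750e47c744d553d\"/>"
                + "</form></body></html>";
        System.out.println(extractName(cleaned));
    }
}
```

Because the attribute name is the only thing selected, the rest of the page structure does not matter to the query, which is the main advantage over matching the raw text with a regular expression.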


Use an HTML parsing library. These libraries repair malformed HTML and make it easy to navigate the document to find and update elements. Here is a link to a list of Java/Groovy implementations:

http://www.wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/

It looks like NekoHTML and TagSoup are popular, though I haven't used either of them, or Groovy for that matter. I have used HTML parsers in other languages, however.
