
Extraction of HTML Tags using Java

I want to extract the various HTML tags from the source code of a web page. Is there any method in Java to do that, or does an HTML parser support this?

I want to separate out all the HTML tags.


Java comes with an XML parser with similar methods to the DOM in JavaScript:

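// Requires javax.xml.parsers.DocumentBuilder, javax.xml.parsers.DocumentBuilderFactory and org.w3c.dom.Document
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(html); // html: a File, InputStream, InputSource, or URI string
doc.getElementById("someId");       // returns null unless the parser knows which attributes are of type ID
doc.getElementsByTagName("div");    // all div elements, in document order
doc.getChildNodes();                // the document's top-level nodes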

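The document builder can take several kinds of input (an InputStream, a File, an InputSource, or a URI string). To parse raw HTML held in a String, wrap it in an InputSource over a StringReader, because parse(String) treats its argument as a URI. The markup must be well-formed XML, or the parser throws a SAXException.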

http://download.oracle.com/javase/1.5.0/docs/api/org/w3c/dom/Document.html

The CyberNeko HTML parser is also good if you need more, for example for pages that are not well-formed XML.
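
The DOM approach above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes the page is well-formed XML (XHTML without a DOCTYPE), and the class name DomTagLister and method name listTagNames are made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the distinct tag names of a well-formed XHTML string.
class DomTagLister {
    static Set<String> listTagNames(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the string in an InputSource; parse(String) would treat it as a URI.
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // "*" matches every element, returned in document order.
            NodeList all = doc.getElementsByTagName("*");
            Set<String> names = new LinkedHashSet<>();
            for (int i = 0; i < all.getLength(); i++) {
                names.add(all.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            // Malformed markup ends up here (assumption: callers want an unchecked exception).
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Calling DomTagLister.listTagNames(...) on a small XHTML string returns each tag name once, in the order it first appears. Typical real-world HTML is not well-formed XML and would be rejected here, which is where a lenient HTML parser helps.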


Check out CyberNeko HTML Parser.


You can use regular expressions. If your HTML is valid XML, you can use an XML parser instead.
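
A rough sketch of the regular-expression route, using java.util.regex. The pattern and the names RegexTagExtractor and extractTags are assumptions made for this example. It is only an approximation: it will misfire on comments, script contents, and attribute values that contain a > character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects everything that looks like an HTML tag.
class RegexTagExtractor {
    // Matches an opening, closing, or self-closing tag; group 1 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group()); // the full tag text, attributes included
        }
        return tags;
    }
}
```

If only the tag names are wanted, m.group(1) can be collected instead of m.group().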


I've used HTMLParser in one project and was pretty happy with it.

Edit: If you check the samples page, the parser sample does pretty much what you're asking for.


You can write your own utility method to extract tags.

Scan for < and the matching > (or /> for a self-closing tag) to find each complete tag, and write those tags to another file.
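
A hand-rolled scan along those lines might look like the sketch below. The names TagScanner and scanTags are made up for the example, and the scan is deliberately naive: it treats every < as the start of a tag, so a stray < in text, a comment, or script content will produce wrong results.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: returns every substring that runs from '<' to the next '>'.
class TagScanner {
    static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        int open = html.indexOf('<');
        while (open >= 0) {
            int close = html.indexOf('>', open + 1);
            if (close < 0) {
                break; // unterminated tag at end of input
            }
            tags.add(html.substring(open, close + 1));
            open = html.indexOf('<', close + 1);
        }
        return tags;
    }
}
```

To write the result to another file, the returned list can be passed to java.nio.file.Files.write together with a target path of your choosing, which writes one tag per line.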
