Extraction of HTML Tags using Java
I want to extract the various HTML tags from the source code of a web page. Is there any method in Java to do that, or do HTML parsers support this?
I want to separate out all the HTML tags.
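Java comes with an XML parser whose methods resemble the DOM API in JavaScript:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
// The factory must be instantiated first; newDocumentBuilder() is not static.
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(html); // html: an InputStream, File, InputSource, or URI string
doc.getElementById("someId"); // returns null unless the id attribute is declared as type ID
doc.getElementsByTagName("div"); // all <div> elements, in document order
doc.getChildNodes(); // direct children of the document node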
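The document builder accepts several kinds of input (an InputStream, a File, an InputSource, or a URI string). A String argument is treated as a URI, not as markup, so to parse raw HTML held in a String, wrap it in an InputSource over a StringReader. Because this is an XML parser, the markup must be well-formed XML (for example XHTML).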
http://download.oracle.com/javase/1.5.0/docs/api/org/w3c/dom/Document.html
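To make the DOM approach concrete, here is a minimal sketch that lists every element name in a document. It assumes the input is well-formed XHTML held in a String; the class name DomTagLister and the method tagNames are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the JDK.
final class DomTagLister {

    /** Returns the element names of a well-formed XHTML string, in document order. */
    static List<String> tagNames(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the markup in an InputSource; parse(String) would treat it as a URI.
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // "*" is the wildcard that matches every element.
            NodeList all = doc.getElementsByTagName("*");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                names.add(all.item(i).getNodeName());
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tagNames("<html><body><div><p>Hi</p></div></body></html>"));
    }
}
```

Typical real-world HTML (unclosed tags, unquoted attributes) will be rejected by this parser, which is where a dedicated HTML parser becomes the better option.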
The CyberNeko parser is also a good choice if you need more, such as handling HTML that is not well-formed XML.
Check out CyberNeko HTML Parser.
You can use regular expressions. If your HTML is valid XML, you can use an XML parser.
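As a rough sketch of the regular-expression route, the snippet below collects every opening, closing, and self-closing tag from a String. The pattern and the names RegexTagExtractor and extractTags are my own illustration, not something from a library, and the pattern is deliberately simple: it skips comments and doctype declarations, and it will cut a tag short if an attribute value contains a ">" character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class RegexTagExtractor {

    // "<", an optional "/", a tag name, then everything up to the next ">".
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    /** Returns each tag exactly as it appears in the input, in order. */
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group()); // m.group(1) would give just the tag name
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(extractTags("<div class=\"x\"><br/>text</div>"));
    }
}
```

This is fine for quick jobs on markup you control; for arbitrary pages a real parser is more reliable.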
I've used HTMLParser in one project and was pretty happy with it.
Edit: If you check the samples page, the parser sample does pretty much what you're asking for.
You can write your own utility method to extract tags: check for < and then /> or > to find each complete tag, and write those tags to another file.
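A minimal sketch of such a utility might look like the following. It assumes the whole page fits in memory as a String and that the input and output file paths arrive as the two command-line arguments; the names TagScanner and scanTags are invented for the example. Since it only looks for < and the next >, it also picks up comments and doctype declarations, and it is fooled by a ">" inside an attribute value.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical utility for illustration only.
final class TagScanner {

    /** Returns every "<...>" span (including "/>" endings) in the input, in order. */
    static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        int open = html.indexOf('<');
        while (open >= 0) {
            int close = html.indexOf('>', open + 1);
            if (close < 0) {
                break; // unterminated tag at end of input: stop scanning
            }
            tags.add(html.substring(open, close + 1));
            open = html.indexOf('<', close + 1);
        }
        return tags;
    }

    // Usage (assumed): java TagScanner input.html tags.txt
    public static void main(String[] args) throws IOException {
        String html = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        // Writes one tag per line to the output file.
        Files.write(Paths.get(args[1]), scanTags(html), StandardCharsets.UTF_8);
    }
}
```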