How to use HTML Parser to get complete information about all tags in the HTML page
I am using HTML Parser to develop an application. The code below does not return the complete set of tags in the page: some tags are missed, along with their attributes and text body. Please help me understand why this is happening, or suggest another way.
    URL url = new URL("...");
    PrintWriter pw = new PrintWriter(new FileWriter("HTMLElements.txt"));
    URLConnection connection = url.openConnection();
    InputStream is = connection.getInputStream();
    InputStreamReader isr = new InputStreamReader(is);
    BufferedReader br = new BufferedReader(isr);
    HTMLEditorKit htmlKit = new HTMLEditorKit();
    HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
    HTMLEditorKit.Parser parser = new ParserDelegator();
    HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
    parser.parse(br, callback, true);
    ElementIterator iterator = new ElementIterator(htmlDoc);
    Element element;
    while ((element = iterator.next()) != null)
    {
        AttributeSet attributes = element.getAttributes();
        Enumeration e = attributes.getAttributeNames();
        pw.println("Element Name :" + element.getName());
        while (e.hasMoreElements())
        {
            Object key = e.nextElement();
            Object val = attributes.getAttribute(key);
            int startOffset = element.getStartOffset();
            int endOffset = element.getEndOffset();
            int length = endOffset - startOffset;
            String text = htmlDoc.getText(startOffset, length);
            pw.println("Key :" + key.toString() + " Value :" + val.toString() + "\r\n" + "Text :" + text + "\r\n");
        }
    }
}
I do this fairly reliably with HTML Parser, provided that the HTML document does not change its structure. A web service with a stable API is much better, but sometimes one simply isn't available.
General idea:
You first have to know which tags (div, meta, span, etc.) contain the information you want, and which attributes identify those tags. Example:
<span class="price"> $7.95</span>
If you are looking for this price, then you are interested in span tags with class "price".
HTML Parser has a filter-by-attribute functionality.
filter = new HasAttributeFilter("class", "price");
When you parse using a filter, you get a list of Nodes. You can do an instanceof check on each one to determine whether it is of the type you are interested in; for span you'd do something like
if (node instanceof Span) // or any other supported element.
See the org.htmlparser.tags package for the list of supported tags.
An example using HTML Parser to grab the meta tag that holds a site's description:
Tag Sample :
<meta name="description" content="Amazon.com: frankenstein: Books"/>
Code:
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        // <meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);
            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");
                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
As per the comments:
Actually, I want to extract information such as product name, price, etc. of all products listed on an online shopping site such as amazon.com. How should I go about it?
Step 1: read their robots file. It's usually found at the root of the site, for example http://amazon.com/robots.txt. If the URL you're trying to access is covered by a Disallow on a User-Agent of *, then stop here. Contact them, explain in detail what you're trying to do, and ask for ways/alternatives/webservices which can provide the information you need. Otherwise you may be violating the law, and you risk getting blacklisted by the site and/or by your ISP, or worse. If the URL is not disallowed, proceed to step 2.
Step 2: check whether the site in question already has a public webservice available, which is much easier to use than parsing a whole HTML page. Using a webservice, you'll get exactly the information you're looking for in a concise format (JSON or XML) based on a simple set of parameters. Look around or contact them for details about any webservices. If there's no way, proceed to step 3.
Step 3: learn how HTML/CSS/JS work, learn how to work with web developer tools like Firebug, and learn how to interpret the HTML/CSS/JS source you see via right-click > View Page Source. My bet is that the site in question uses JS/Ajax to load/populate the information you'd like to gather. In that case, you'll need an HTML parser which is capable of parsing and executing JS as well (the one you're using doesn't do that). This isn't going to be an easy job, so I won't explain it in detail until it's entirely clear what you're trying to achieve, whether that is allowed, and whether easier-to-use webservices are available.
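To make the check in Step 1 concrete, here is a minimal sketch of testing a path against the Disallow rules of the User-agent * group. The class and method names (RobotsCheck, disallowedForAll, isDisallowed) are made up for this illustration, and the logic is deliberately simplified: it only does prefix matching and ignores Allow lines, wildcards and Crawl-delay, so treat it as a first sanity check rather than a complete robots.txt implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: a deliberately simplified robots.txt check. */
final class RobotsCheck {

    /** Collects the Disallow path prefixes that apply to "User-agent: *". */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;   // currently inside a group that names "*"
        boolean lastWasAgent = false;  // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            // Strip trailing comments, then surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;              // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    inStarGroup = false;   // a new group starts here
                }
                if (value.equals("*")) {
                    inStarGroup = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (inStarGroup && field.equals("disallow") && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** True if the path starts with any Disallow prefix of the "*" group. */
    static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

You would download the robots file once, pass its text to RobotsCheck.isDisallowed together with the path part of the URL you intend to fetch, and stop if it returns true.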
You seem to be using the Swing HTMLDocument. That may not be the best choice. I believe you would get better results using, for example, NekoHTML.
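If you would rather stay with the parser that ships with the JDK, one thing worth trying is to skip HTMLDocument entirely and receive the parse events yourself. A likely reason for the missing tags is that the document's reader only builds elements for tags it knows how to model, whereas a plain ParserCallback is told about every tag the parser encounters. The sketch below assumes that this is the cause; TagLister, listTags and describe are names invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: lists every tag the Swing parser reports. */
final class TagLister {

    /** Returns one "tagname attr=value ..." string per reported tag. */
    static List<String> listTags(String html) {
        final List<String> out = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                out.add(describe(tag, attrs));   // tags that have a matching end tag
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                out.add(describe(tag, attrs));   // empty tags such as meta, br, img
            }
        };
        try {
            // true = ignore charset directives found inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    /** Formats the tag name followed by its attributes. */
    private static String describe(HTML.Tag tag, MutableAttributeSet attrs) {
        StringBuilder sb = new StringBuilder(tag.toString());
        Enumeration<?> names = attrs.getAttributeNames();
        while (names.hasMoreElements()) {
            Object key = names.nextElement();
            sb.append(' ').append(key).append('=').append(attrs.getAttribute(key));
        }
        return sb.toString();
    }
}
```

Text content can be collected the same way by overriding handleText. Note that the parser may also report tags it inserts itself (such as an implied html, head or body), and that attribute order is not guaranteed, so filter the result by tag name rather than relying on positions.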
Another simple library you can use is JTidy, which can clean up your HTML before parsing it. Hope this helps.
http://sourceforge.net/projects/jtidy/
Ciao!