Crawl Only HTML Pages
I want to crawl only HTML pages, so I changed the regular expression in the code below, but it is still crawling some XML pages as well. Any suggestions as to why this is happening?
public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile("(.(html))");

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://www.somehost.com/")) {
            return true;
        }
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }
}
The extension is meaningless on the web, especially with newer "SEO"-style paths. You have to analyze the content type instead.
You can do this by requesting each URL (with the HTTP GET, or possibly HEAD, method) and analyzing its response headers. If the Content-Type response header is not what you want, you don't have to download the body; otherwise it's what you want to look at.
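A minimal sketch of that idea, assuming you are free to issue your own HEAD request with java.net.HttpURLConnection before letting the crawler visit a URL; the helper class name, the timeouts, and the host check mentioned afterwards are illustrative, not part of crawler4j's API:

import java.net.HttpURLConnection;
import java.net.URL;

public class ContentTypeCheck {

    // Sends a HEAD request and returns true only if the server reports
    // an HTML content type. Returns false on any error.
    public static boolean isHtml(String urlString) {
        HttpURLConnection connection = null;
        try {
            URL url = new URL(urlString);
            connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("HEAD");
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);

            // e.g. "text/html; charset=UTF-8" or "application/xhtml+xml"
            String contentType = connection.getContentType();
            return contentType != null
                    && (contentType.startsWith("text/html")
                        || contentType.startsWith("application/xhtml+xml"));
        } catch (Exception e) {
            return false;
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }
}

In shouldVisit you could then combine the host check from your code with this test, e.g. return href.startsWith("http://www.somehost.com/") && ContentTypeCheck.isHtml(href); keep in mind this costs one extra request per candidate URL.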
Edit: HTML should have text/html as its content type; XHTML is application/xhtml+xml (but note that the latter may be subject to content negotiation, which usually depends on the Accept header and the user agent in the request).
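If you want to nudge servers that do content negotiation toward serving plain HTML, you can send an explicit Accept header on the request; a small, hedged example on the HttpURLConnection from the sketch above (the preference list is just a common choice, not anything crawler4j sets for you):

// Prefer text/html, accept XHTML as a fallback, and anything else last.
connection.setRequestProperty("Accept",
        "text/html,application/xhtml+xml;q=0.9,*/*;q=0.1");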
You can find all the information about the HTTP headers here.