How to crawl a web page for files of a certain size
I need to crawl a list of several thousand hosts and find at least two files rooted there that are larger than some value, given as an argument. Can any popular (Python-based?) tool possibly help?
Here is an example of how you can get the size of a file on an HTTP server.
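import urllib2

def sizeofURLResource(url):
    """
    Return the size of the resource at 'url' in bytes,
    or None if the server sends no Content-Length header.
    """
    info = urllib2.urlopen(url).info()
    # getheaders() returns a list of strings, empty if the header is absent
    lengths = info.getheaders("Content-Length")
    return int(lengths[0]) if lengths else None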
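If only the size is needed, a HEAD request asks the server for the headers alone, so no body is transferred. Below is a minimal Python 3 sketch of that idea (urllib.request takes the place of urllib2 there). The names parse_content_length and head_size and the 10-second timeout are illustrative assumptions, not part of the answer above. Servers are not required to send Content-Length, and some reject HEAD, so treat a None result as "unknown" rather than "empty".

```python
import urllib.request


def parse_content_length(value):
    """Convert a Content-Length header value to an int.

    Returns None when the header is missing, malformed, or negative.
    """
    if value is None:
        return None
    try:
        size = int(value.strip())
    except ValueError:
        return None
    return size if size >= 0 else None


def head_size(url, timeout=10):
    """Send a HEAD request and return the advertised size in bytes, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_content_length(response.headers.get("Content-Length"))
```

Calling head_size on a URL then gives an integer you can compare against the threshold passed as an argument.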
There is also a library for building web scrapers here: http://dev.scrapy.org/ but I don't know much about it (I just googled it, honestly).
Here is how I did it: download the page and take the length of its body.
import urllib2

url = 'http://www.ueseo.org'
r = urllib2.urlopen(url)
# read() downloads the whole body; its length is the size in bytes
print len(r.read())
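Reading the whole body works even when the server sends no Content-Length, but for several thousand hosts it downloads far more than needed. Here is a rough Python 3 sketch that streams the body in chunks and stops as soon as the threshold is crossed, then collects the first two matching URLs. The function names, the chunk size, the timeout, and the assumption that a list of candidate URLs is already known are all illustrative; discovering the URLs on each host is a separate crawling step not shown here.

```python
import urllib.request


def body_exceeds(stream, min_bytes, chunk_size=64 * 1024):
    """Return True as soon as more than min_bytes have been read from stream."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of body reached without crossing the threshold.
            return False
        total += len(chunk)
        if total > min_bytes:
            return True


def url_exceeds(url, min_bytes, timeout=10):
    """Open url and stream its body until the threshold is crossed."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return body_exceeds(response, min_bytes)


def first_large_files(urls, min_bytes, exceeds=url_exceeds, wanted=2):
    """Return the first `wanted` URLs whose body is larger than min_bytes."""
    found = []
    for url in urls:
        if exceeds(url, min_bytes):
            found.append(url)
            if len(found) == wanted:
                break  # enough matches, skip the remaining URLs
    return found
```

Unreachable hosts make urlopen raise an exception (urllib.error.URLError, for example), so a real run over thousands of hosts would probably wrap the call in a try/except and move on.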