How to scrape images from a web site with JavaScript and servlets
I have a web page that has the following content (I've changed the URL in the src tag for privacy purposes, otherwise viewing the page source is identical):
<HTML>
<BODY>
<script type="text/javascript" src="http://localhost/servlet?publicKey=abcdefg12345678&"></script>
</BODY>
</HTML>
The resulting page displays an image when viewed in a browser, and I'm trying to scrape that image. After scraping the images, I want to index them (see www.tineye.com for an example of an image search engine) and store them. If anybody knows how to scrape images from such web sites, please let me know.
Note: the src does not contain ANY information about the image. It only calls the given servlet with a public key as the parameter. What I've posted above is EXACTLY what I see when I click View->Page Source in my browser (Firefox). Of course I've changed the actual URL and the public key for privacy reasons; otherwise everything is identical.
I've seen similar techniques used for some banners: http://coldjava.hypermart.net/servlets/banner.htm
The JavaScript is probably manipulating the DOM and adding an image. Therefore the image's URL (ending in .jpg, .png or .gif) should appear somewhere inside the JavaScript file, and the code should look something like this:
var image = new Image(); // the constructor takes an optional width and height, not a path
image.src = "/path/to/image.jpg";
You can use regular expressions to extract the path and filename from the JavaScript code.
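As a rough sketch of the regular-expression approach: the helper below (extractImagePaths is a made-up name, not part of any library) collects quoted strings that end in a common image extension. It assumes the servlet returns plain, unobfuscated JavaScript with the image path written as an ordinary string literal, which may not hold for your page. Strings containing escaped quotes or paths assembled by concatenation will be missed.

```javascript
// Hypothetical helper: pull quoted strings that end in a common image
// extension out of a blob of JavaScript source.
function extractImagePaths(jsSource) {
  // A single- or double-quoted string whose content ends in .jpg, .jpeg,
  // .png or .gif, optionally followed by a query string.
  const pattern = /(["'])([^"'\s]+?\.(?:jpe?g|png|gif)(?:\?[^"'\s]*)?)\1/gi;
  const paths = [];
  for (const match of jsSource.matchAll(pattern)) {
    paths.push(match[2]); // group 2 is the string content without the quotes
  }
  return paths;
}

// Made-up input for illustration; the real script will look different.
const sample = 'var image = new Image(); image.src = "/path/to/image.jpg";';
console.log(extractImagePaths(sample));
```

If the script builds the URL from several pieces, a regex alone will not be enough and you will have to read the code to see how the final URL is put together.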
Instead of saving a local copy of the HTML file, you should save a local copy of the JavaScript file to see how exactly it's adding the image to the HTML file's DOM. That should let you figure out how to construct requests to get the images you need.
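A minimal sketch of that step, assuming a runtime with a global fetch (a modern browser or Node.js 18 or later). Both function names are made up for illustration, and the URL is the placeholder from the question, not a real endpoint. Paths found in the script are often relative, so they need to be resolved against the URL the script was loaded from before you can request them.

```javascript
// Hypothetical helper: turn a path found in the script into an absolute URL,
// resolved against the URL the script itself was loaded from.
function resolveImageUrl(path, scriptUrl) {
  return new URL(path, scriptUrl).href;
}

// Hypothetical helper: download the script text so it can be saved and read.
// Not called here, because the placeholder URL below is not a real server.
async function fetchScriptSource(scriptUrl) {
  const response = await fetch(scriptUrl);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Placeholder URL from the question; replace it with the real servlet URL.
const scriptUrl = "http://localhost/servlet?publicKey=abcdefg12345678&";
console.log(resolveImageUrl("/path/to/image.jpg", scriptUrl));
```

Once you have the script text from fetchScriptSource and know which string in it is the image path, resolveImageUrl gives you the address to request for the image itself.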