Automatically refresh and download Asirra images
If you're unfamiliar with Asirra, it's a CAPTCHA technique developed by Microsoft that uses the identification of cats and dogs, rather than a string of text, for human verification.
I'd like to use their database of millions of pictures of cats and dogs for some machine learning experiments, so I'm trying to write a script that will automatically refresh their site and download 12 images at a regular interval. Unfortunately, I'm a novice when it comes to JavaScript.
The problem is that, for obvious security reasons, it's hard to find the actual URL of an image, because it's all behind obfuscated JavaScript. I tried using curl from a terminal to see what HTML was returned, and it's the same deal - just JavaScript. So, using a script, how do I access the actual images? The images are obviously being transferred to my computer, since they're showing up on my screen, but I don't know how to capture them with a script.
Another problem is that I don't want the smaller images that load first; I need the larger ones that only show up when you mouse over them. So I guess I also need to override that JavaScript function to get the larger images via the script.
I'd prefer something in Python or C#, but I'll take anything - thanks!
Edit: Their public corpus doesn't have nearly enough images for my uses, so that won't work. Also, I'm not necessarily asking you to write my script for me, just for some guidance on how to access the full-size images using a script.
Try using their public corpus http://research.microsoft.com/en-us/projects/asirra/corpus.aspx
While waiting for an answer here, I kept digging and eventually figured out a somewhat hacked-together way of getting done what I wanted.
First off, the reason this is a somewhat complicated problem (at least to a JavaScript novice like me) is that the images from Asirra are loaded onto the webpage via JavaScript, which runs client-side. This is a problem when you download the page with something like wget or curl, because those tools don't actually run the JavaScript; they just download the source HTML. Therefore, you don't get the images.
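To illustrate the distinction, here's a minimal Python sketch (my own illustration, not part of the workflow described here) that pulls `<img>` URLs out of HTML using only the standard library. Against the raw source that curl returns it finds nothing, because the `<img>` tags are only injected later by JavaScript; it would only find images in HTML saved after a browser has rendered the page. The two sample strings below are hypothetical stand-ins for those two cases.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def extract_image_urls(html):
    parser = ImgSrcParser()
    parser.feed(html)
    return parser.urls


# Hypothetical raw source as returned by curl: only a script tag, no images.
raw = '<html><head><script src="asirra.js"></script></head><body></body></html>'
# Hypothetical HTML after a browser has run the JavaScript.
rendered = '<html><body><img src="cat1.jpg"><img src="dog2.jpg"></body></html>'

print(extract_image_urls(raw))       # []
print(extract_image_urls(rendered))  # ['cat1.jpg', 'dog2.jpg']
```

This is why the browser-based approach below works where curl fails: the browser executes the JavaScript first, so the saved page actually contains the images.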
However, I realized that Firefox's "Save Page As..." did exactly what I needed: it ran the JavaScript, which loaded the images, and then saved everything into the well-known directory structure on my hard drive. That's exactly what I wanted to automate. So I found a Firefox add-on called iMacros and wrote this macro:
VERSION BUILD=6240709 RECORDER=FX
TAB T=1
URL GOTO=http://www.asirra.com/examples/ExampleService.html
SAVEAS TYPE=CPL FOLDER=C:\Cat-Dog\Downloads FILE=*
Set to loop 10,000 times, it worked perfectly. In fact, since it always saved to the same folder, duplicate images were overwritten (which is what I wanted).
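Since duplicates simply overwrite each other in one folder, a quick way to check how many distinct images have actually accumulated is to hash the files. A small Python sketch (my own addition, standard library only; the commented-out path is the download folder from the macro above):

```python
import hashlib
from pathlib import Path


def unique_image_hashes(folder):
    """Return the set of MD5 digests of all .jpg files under `folder`.

    Because identical downloads overwrite each other on disk, the size of
    this set is the number of distinct images collected so far.
    """
    hashes = set()
    for path in Path(folder).rglob("*.jpg"):
        hashes.add(hashlib.md5(path.read_bytes()).hexdigest())
    return hashes


# Hypothetical usage with the macro's download folder:
# print(len(unique_image_hashes(r"C:\Cat-Dog\Downloads")))
```

Hashing file contents rather than comparing filenames matters here, because the saved files keep the same names on every loop iteration even when the underlying images change.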