What datastore should I use to store temporary data from crawlers?
My crawler is crawling websites and collecting metadata information from them. I will then run a script to sanitize the URLs and store them in Amazon RDS.
My question is which datastore I should use to hold the data while it awaits sanitization (deleting unwanted URLs). I don't want the crawler to hit Amazon RDS directly, which would slow it down.
Should I use Amazon SimpleDB? Then I could read from SimpleDB, sanitize the URLs, and move them to Amazon RDS.
You can always use a database, but the issue is disk access. Every time, you would be doing a disk access to read a bunch of URLs, sanitizing them, and then writing them to another database, which is yet another disk access. This approach is fine if you aren't concerned about performance.
One solution is to use a data structure as simple as a list: store a bunch of URLs in memory, and have a thread that wakes up when the list hits a threshold, cleans up the URLs, and then writes them to Amazon RDS. A sketch of this is shown below.
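A rough sketch of that idea in Python, assuming a hypothetical `sanitize_url()` filter and `save_to_rds()` bulk writer (both placeholders for your own code): the crawler appends URLs to an in-memory list, and once a threshold is reached a background thread sanitizes the batch and writes it to RDS in one bulk operation.

```python
import threading

FLUSH_THRESHOLD = 1000  # assumed batch size; tune for your workload


class UrlBuffer:
    def __init__(self, threshold=FLUSH_THRESHOLD):
        self._urls = []
        self._lock = threading.Lock()
        self._threshold = threshold

    def add(self, url):
        """Called by the crawler; never touches RDS directly."""
        with self._lock:
            self._urls.append(url)
            if len(self._urls) < self._threshold:
                return
            # Swap out the full batch while holding the lock.
            batch, self._urls = self._urls, []
        # Flush outside the lock so the crawler isn't blocked on RDS.
        threading.Thread(target=self._flush, args=(batch,)).start()

    def _flush(self, batch):
        clean = [u for u in batch if sanitize_url(u)]  # drop unwanted URLs
        save_to_rds(clean)                             # single bulk write


def sanitize_url(url):
    # Placeholder filter: keep only http(s) URLs. Replace with real rules.
    return url.startswith(("http://", "https://"))


def save_to_rds(urls):
    # Placeholder for your bulk insert into Amazon RDS.
    print(f"writing {len(urls)} URLs to RDS")
```

The point of the design is that the crawler only ever pays the cost of an in-memory append; the expensive sanitize-and-write step happens in batches on a separate thread, so RDS is hit far less often.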