
Using a web crawler for price comparison

I need an open source, Java-based web crawler that I can extend for price comparison. How do I do the price comparison? Is there any open source code for that?

Take a look at Web-Harvest. You will have to use its slightly odd syntax for processing web pages, but it should be fairly easy to extend it to do some price comparison.


Building something that scrapes price information from a large number of different sites is going to be a lot of work, whether you scrape from the stores themselves or from existing comparison sites.

  • Everyone's website layout will be different, requiring you to configure your crawler separately for each one.

  • Some websites may present the price information in ways that make scraping difficult; e.g. using AJAX.
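To make the "configure separately for each site" point concrete, here is a minimal sketch of per-site extraction rules in plain Java. The host names, the HTML snippets, and the `PriceExtractor` class are all hypothetical, and a regex per site is only an assumption made to keep the example short; a real crawler would more likely use an HTML parser, and none of this helps with prices that are loaded via AJAX.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceExtractor {
    // Hypothetical per-site rules: every store needs its own pattern
    // because every store marks up its prices differently.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("shop-a.example",
                Pattern.compile("<span class=\"price\">\\$([0-9,]+\\.[0-9]{2})</span>"));
        RULES.put("shop-b.example",
                Pattern.compile("data-price=\"([0-9]+\\.[0-9]{2})\""));
    }

    /** Returns the price found in the page, or empty if the host is unknown or nothing matches. */
    static Optional<BigDecimal> extract(String host, String html) {
        Pattern pattern = RULES.get(host);
        if (pattern == null) {
            return Optional.empty();
        }
        Matcher matcher = pattern.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Drop thousands separators before parsing; BigDecimal avoids floating-point rounding.
        return Optional.of(new BigDecimal(matcher.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        // Made-up page fragments standing in for downloaded pages.
        Optional<BigDecimal> a = extract("shop-a.example", "<span class=\"price\">$1,299.00</span>");
        Optional<BigDecimal> b = extract("shop-b.example", "<div data-price=\"1249.50\">");
        if (a.isPresent() && b.isPresent()) {
            // The comparison itself is the easy part once the prices are extracted.
            String cheaper = a.get().compareTo(b.get()) <= 0 ? "shop-a.example" : "shop-b.example";
            System.out.println("Cheaper offer: " + cheaper);
        }
    }
}
```

Note that the rule table is where the real work ends up: every new store means another entry, and every site redesign means fixing one.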

Some website owners will put the relevant pages into their robots.txt files to tell you to stay away. And if you ignore that, there are various things they can do to make life difficult for you.
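If you do go ahead, checking robots.txt before fetching a page is not hard. The following is a rough sketch only: `RobotsRules` is a hypothetical class, and it deliberately handles just the `User-agent: *` group and plain `Disallow` prefixes. It ignores `Allow` lines, wildcards, and groups that list several user agents, so treat it as an illustration and not a complete parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsRules {
    private final List<String> disallowedPrefixes = new ArrayList<>();

    /** Collects the Disallow prefixes that apply to "User-agent: *". */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Strip comments and surrounding whitespace.
            String line = raw.replaceFirst("#.*$", "").trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                rules.disallowedPrefixes.add(value);
            }
        }
        return rules;
    }

    /** True if no collected Disallow prefix matches the start of the path. */
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would download `/robots.txt` once per host, call `RobotsRules.parse` on the body, and skip any URL whose path fails `isAllowed`.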

Scraping lots of people's websites without permission is likely to make you unpopular. It might attract threats of lawsuits, or actual lawsuits from people who perceive that you are harming their business model. Or other responses ...

Are you really sure you want to do this? Really??

Any reason you can't just get your data from one of the hundreds of price comparison sites already out there? Seems like it would be simpler to scrape Nextag or Froogle or whatever instead of writing a crawler to scrape billions of store websites.

Nobody wants their site to get overloaded without getting any benefit. I think you should create a crawler for your own needs. However, be aware that most sites may block you or slow down their responses to you. Your crawler needs to behave like an ordinary visitor and not like a bot eating their bandwidth...
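One way to avoid eating a site's bandwidth is to enforce a minimum gap between requests to the same host. Here is a small sketch of that idea. `PoliteScheduler` is a made-up name, the 2-second gap in the usage note is an arbitrary assumption, and the class is single-threaded for simplicity. The current time is passed in as a parameter so the logic is easy to test.

```java
import java.util.HashMap;
import java.util.Map;

class PoliteScheduler {
    private final long minDelayMillis;
    // Earliest time (ms) at which the next request to each host may start.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PoliteScheduler(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next request slot for the host and returns how many
     * milliseconds the caller should wait before sending the request.
     */
    long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + minDelayMillis);
        return start - nowMillis;
    }
}
```

In a crawl loop you would create `new PoliteScheduler(2000)`, call `reserve(host, System.currentTimeMillis())`, pass the result to `Thread.sleep`, and then fetch. Requests to different hosts do not delay each other.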

Someone here wrote about the legal issues. The legal issues are not simple. Stephen C wrote about lawsuits, but that goes both ways. There is a large body of law related to anti-competitive conduct. If a website does not want its prices reported because it is involved in price fixing or making false claims, then the website itself faces severe penalties. The law is not something to quote lightly. You can google price fixing and see the large fines already imposed on countless companies.






