Best way to generate feeds from pages that don't have RSS support
The best example I have seen so far is http://www.instapaper.com/. They can extract the text from any page.
In my case, I need to get the text and also generate a list of items, given that I will have one page with the news list for each site.
For example, nytimes.com (just an example). I have to get all the links and get the text if it exists. Also, I may need to specify some URL criteria, such as generating feeds only from links where the URL contains something like "/[year]/[month]/[day]/[category]/post-name".
I don't want the complete code, just the concept and the best approach. Any ideas?
This is painful, but the only good solution is to use an HTML parser and extract all the hrefs. I recommend using a library that lets you select all hrefs easily. I have heard of this one, http://code.google.com/p/phpquery/, but have never used it. You would load each page and then select all the hrefs.
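To make the idea concrete, here is a minimal sketch of the "parse all the hrefs, then filter by URL" step. It is written in Python with only the standard library rather than phpQuery, which I have not used. The names `LinkCollector`, `article_links` and `ARTICLE_URL` are made up for illustration, and the regular expression is only a guess at a "/[year]/[month]/[day]/[category]/post-name" layout, so adjust it per site.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/2011/05/..." into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


# Assumed URL shape: /year/month/day/category/post-name (hypothetical, per site).
ARTICLE_URL = re.compile(r"/\d{4}/\d{2}/\d{2}/[^/]+/[^/?#]+")


def article_links(html, base_url, pattern=ARTICLE_URL):
    """Return the unique links in `html` whose URL matches `pattern`, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    seen = set()
    result = []
    for link in collector.links:
        if pattern.search(link) and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

You would call `article_links(page_html, "http://example.com/")` on the news list page of each site, and the returned URLs are the candidates for feed items.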
There is really no easier way. If you switched to something like Java or Python, you could use multiple threads to speed up the process. And of course, once you have analyzed the pages, save the data in a database so you can reference it later.
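As a rough illustration of the threading and storage idea, the sketch below downloads several pages with a thread pool and saves them in SQLite. Threads help here because most of the time is spent waiting on the network. The function names, the `pages` table and the choice of SQLite are my own assumptions, not something the question requires. Pages that fail to download are simply skipped.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_and_store(urls, db_path, fetch_page=fetch, workers=8):
    """Fetch `urls` concurrently and store (url, body) rows in SQLite.

    Returns the open connection so the caller can query it later.
    """
    urls = list(urls)

    def safe_fetch(url):
        # A failed download should not abort the whole batch.
        try:
            return fetch_page(url)
        except Exception:
            return None

    # Downloads run in worker threads; the database is only touched here,
    # in the calling thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = list(pool.map(safe_fetch, urls))

    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    rows = [(u, b) for u, b in zip(urls, bodies) if b is not None]
    with conn:  # commits on success
        conn.executemany(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", rows
        )
    return conn
```

The `fetch_page` parameter is there so you can swap in your own downloader (or a fake one for testing) without changing the rest.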
Hope this helps.