How to programmatically turn any webpage into an RSS feed? [closed]
There are many websites and blogs that provide RSS feeds, but there are also many that do not. I want to turn that kind of web page into an RSS feed.
I found some hosted solutions through Google, such as Feed43, Page2rss, and Dapper, but I want an open source project that can perform this task, or a tutorial explaining how to do it.
Please give me suggestions, and if you can explain further, you are most welcome.
My preferred language is PHP.
There's nothing magic about RSS. I suggest you read this tutorial to understand how to build an RSS feed from scratch:
http://www.xul.fr/en-xml-rss.html
Then use your PHP skills to build one from your content. You can find generic HTML-to-RSS scrapers by searching for "html to rss converter", but most of these are hosted solutions, and the RSS feeds they produce aren't that great. A good RSS feed requires understanding the content that you're syndicating, not just the raw HTML. IMHO.
In general there is not going to be any "one size fits all" solution to something like this. You'll have to examine the HTML structure of the blog you want to build an RSS feed from, then parse out the content you are interested in, and put it into an RSS feed.
Here are some PHP tools to help get you started:
Parsing HTML:
- DOMDocument (Swiss Army knife of HTML/XML parsing)
- SimpleXML (easy to use, but requires well-formed XML)
- Tidy (can be used to clean up bad HTML)
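As a rough sketch of the parsing step with DOMDocument and DOMXPath: the sample markup, the `post` class name, and the `extractPosts()` helper below are all made up for illustration, so the XPath expressions would need adjusting to the real page you are scraping.

```php
<?php
// Hypothetical sample markup; a real page would be fetched first,
// for example with file_get_contents() or cURL.
$html = <<<'HTML'
<html><body>
  <div class="post"><h2><a href="/posts/1">First post</a></h2><p>Intro to the first post.</p></div>
  <div class="post"><h2><a href="/posts/2">Second post</a></h2><p>Intro to the second post.</p></div>
</body></html>
HTML;

// extractPosts() is an illustrative helper, not part of any library.
function extractPosts(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of letting them be printed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $posts = [];
    // Assumes each entry is a <div> whose class attribute is exactly "post".
    foreach ($xpath->query('//div[@class="post"]') as $node) {
        $link = $xpath->query('.//h2/a', $node)->item(0);
        $summary = $xpath->query('.//p', $node)->item(0);
        if ($link === null) {
            continue; // Skip blocks without a linked heading.
        }
        $posts[] = [
            'title' => trim($link->textContent),
            'link' => $link->getAttribute('href'),
            'description' => $summary !== null ? trim($summary->textContent) : '',
        ];
    }
    return $posts;
}

$posts = extractPosts($html);
print_r($posts);
```

Each entry ends up as a plain array with `title`, `link`, and `description` keys, which maps fairly directly onto an RSS item.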
Understanding RSS Feeds:
- http://en.wikipedia.org/wiki/RSS
To construct them with PHP, you can once again use DOMDocument or SimpleXML. Alternatively, depending on the format of the HTML you want to convert into RSS, you may be able to write an XSLT stylesheet to transform it.
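A minimal sketch of the DOMDocument route, assuming the items have already been extracted into arrays. The `buildRssFeed()` name, the array keys, and the optional `timestamp` field are my own choices for illustration. Only the basic RSS 2.0 elements are emitted.

```php
<?php
// buildRssFeed() is an illustrative helper. $channel and each item are
// assumed to carry 'title', 'link' and 'description'; items may also
// carry a Unix 'timestamp'.
function buildRssFeed(array $channel, array $items): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $rss = $doc->createElement('rss');
    $rss->setAttribute('version', '2.0');
    $doc->appendChild($rss);

    $channelNode = $doc->createElement('channel');
    $rss->appendChild($channelNode);

    // Text goes in through createTextNode() so that characters such as
    // "&" and "<" are escaped when the document is serialized.
    $addText = function (DOMElement $parent, string $name, string $value) use ($doc): void {
        $el = $doc->createElement($name);
        $el->appendChild($doc->createTextNode($value));
        $parent->appendChild($el);
    };

    foreach (['title', 'link', 'description'] as $field) {
        $addText($channelNode, $field, $channel[$field]);
    }

    foreach ($items as $item) {
        $itemNode = $doc->createElement('item');
        foreach (['title', 'link', 'description'] as $field) {
            $addText($itemNode, $field, $item[$field]);
        }
        if (isset($item['timestamp'])) {
            // DATE_RSS produces the RFC 822 style date that RSS 2.0 uses.
            $addText($itemNode, 'pubDate', gmdate(DATE_RSS, $item['timestamp']));
        }
        $channelNode->appendChild($itemNode);
    }

    return $doc->saveXML();
}

header('Content-Type: application/rss+xml; charset=UTF-8');
echo buildRssFeed(
    ['title' => 'Example feed', 'link' => 'http://example.com/', 'description' => 'Generated from HTML'],
    [['title' => 'Tom & Jerry', 'link' => 'http://example.com/1', 'description' => 'First item', 'timestamp' => 0]]
);
```

Building the tree rather than concatenating strings means you don't have to escape titles and descriptions by hand.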
There is no simple or concrete answer to this question, but I will get you started.
First, you need to build a crawler of sorts. Typically, you are going to want this to be multi-threaded and run in the background on your server. This might be as simple as forking PHP processes on the server, but you might find a more efficient way, depending on how much traffic you expect.
Now, probably the best way to start is to read the DOM. See http://php.net/manual/en/class.domdocument.php. Look for headings and try to associate them with the paragraphs below them. Beware, though, that probably fewer than half the sites out there (and likely far fewer of the ones that don't already have a feed) structure their pages in an organized way. But it is a place to start.
There are also plenty of element attributes you can use, such as alt text. In time, you may also find many sites using a particular template that you can write code to handle directly.
You should also have something to read existing feeds. If a site has a feed, there's no sense in generating one for it, right? Use SimplePie to get started, but there are alternatives if you don't like it. http://simplepie.org/
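One way the heading-and-paragraph heuristic might look, as a sketch only. It assumes the paragraphs are direct siblings of the heading, which many pages will not satisfy, and `headingsWithText()` is a hypothetical helper name.

```php
<?php
// headingsWithText() is an illustrative helper: it pairs each h1/h2/h3
// with the <p> siblings that follow it, up to the next heading.
function headingsWithText(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate messy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $headingTags = ['h1', 'h2', 'h3'];
    $xpath = new DOMXPath($doc);
    $sections = [];

    // The XPath union returns the headings in document order.
    foreach ($xpath->query('//h1 | //h2 | //h3') as $heading) {
        $paragraphs = [];
        for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeType !== XML_ELEMENT_NODE) {
                continue; // ignore whitespace text nodes and comments
            }
            $name = strtolower($node->nodeName);
            if (in_array($name, $headingTags, true)) {
                break; // the next heading starts a new section
            }
            if ($name === 'p') {
                $paragraphs[] = trim($node->textContent);
            }
        }
        $sections[] = [
            'title' => trim($heading->textContent),
            'description' => implode("\n\n", $paragraphs),
        ];
    }
    return $sections;
}

$sample = '<h2>News</h2><p>One.</p><p>Two.</p><h2>Other</h2><div>skip</div><p>Three.</p>';
print_r(headingsWithText($sample));
```

On sites that wrap each entry in its own container element, you would likely get better results by targeting that container directly instead.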
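Before generating anything, you could check whether the page already advertises a feed through the usual `<link rel="alternate">` autodiscovery tags. This sketch uses plain DOMDocument rather than SimplePie, and `findFeedLinks()` is a hypothetical helper name. It only looks at the RSS and Atom MIME types.

```php
<?php
// findFeedLinks() is an illustrative helper that returns the href of every
// <link rel="alternate"> element whose type is an RSS or Atom MIME type.
function findFeedLinks(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate messy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feedTypes = ['application/rss+xml', 'application/atom+xml'];
    $feeds = [];

    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may hold several space-separated tokens.
        $rels = preg_split('/\s+/', strtolower($link->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        $type = strtolower(trim($link->getAttribute('type')));
        $href = trim($link->getAttribute('href'));

        if (in_array('alternate', $rels, true) && in_array($type, $feedTypes, true) && $href !== '') {
            $feeds[] = $href; // may be relative; resolve against the page URL
        }
    }
    return $feeds;
}

$page = '<html><head><title>Blog</title>'
      . '<link rel="stylesheet" href="/style.css">'
      . '<link rel="alternate" type="application/rss+xml" title="Blog feed" href="/feed.xml">'
      . '</head><body><p>Hello</p></body></html>';
print_r(findFeedLinks($page));
```

If this returns anything, you can hand the URL to a feed reader library instead of scraping the page.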
Once you have parsed the page, you'll want a database backend to keep track of the items you have seen and any changes to them.
From there, you need something to generate the feed. There are plenty of OOP classes for doing this. Oftentimes I just write my own, but that is up to you.
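For the tracking part, here is a sketch using PDO with SQLite. It assumes the pdo_sqlite extension is available, and the table layout plus the `openStore()` and `recordItem()` names are invented for illustration. Changes are detected by comparing a hash of the title and description.

```php
<?php
// openStore() and recordItem() are illustrative helpers. Requires the
// pdo_sqlite extension; the default DSN keeps everything in memory.
function openStore(string $dsn = 'sqlite::memory:'): PDO
{
    $pdo = new PDO($dsn);
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $pdo->exec('CREATE TABLE IF NOT EXISTS items (
        link TEXT PRIMARY KEY,
        title TEXT NOT NULL,
        content_hash TEXT NOT NULL,
        updated_at INTEGER NOT NULL
    )');
    return $pdo;
}

// Stores the item and reports whether it is 'new', 'changed' or 'unchanged'.
function recordItem(PDO $pdo, array $item, int $now): string
{
    $hash = sha1($item['title'] . "\n" . $item['description']);

    $select = $pdo->prepare('SELECT content_hash FROM items WHERE link = ?');
    $select->execute([$item['link']]);
    $existing = $select->fetchColumn(); // false when the link is unknown

    if ($existing === false) {
        $insert = $pdo->prepare(
            'INSERT INTO items (link, title, content_hash, updated_at) VALUES (?, ?, ?, ?)'
        );
        $insert->execute([$item['link'], $item['title'], $hash, $now]);
        return 'new';
    }
    if ($existing === $hash) {
        return 'unchanged';
    }
    $update = $pdo->prepare(
        'UPDATE items SET title = ?, content_hash = ?, updated_at = ? WHERE link = ?'
    );
    $update->execute([$item['title'], $hash, $now, $item['link']]);
    return 'changed';
}

$pdo = openStore();
$item = ['title' => 'First post', 'link' => '/posts/1', 'description' => 'Intro'];
echo recordItem($pdo, $item, time()), "\n";
echo recordItem($pdo, $item, time()), "\n";
```

Pointing the DSN at a file, or at MySQL with the schema adjusted, would make the state survive between crawler runs.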
If you build sites with the simple Symphony CMS, then yes, it's very easy. See this snippet of a tutorial.