Parsing specific content from a URL (web page) in PHP or JavaScript

I use some RSS feeds. Some of them don't have a description for their articles.

Rather than showing just the title with no description for those articles, I would like to show, for example, the first two paragraphs of the actual article.

I experimented with stripos and file_get_contents, but I have a problem. On most pages it works fine, but on others it grabs the first <p> tag (which may be, for example, a paragraph in the sidebar), and that text is irrelevant to the article mentioned in the RSS feed.

Any idea about how to grab the main content from a URL strictly in PHP or JavaScript?

Thanks in advance.


The first idea that comes to mind is to strip the tags from within each <p> and use that paragraph only if the length of the remaining text is greater than a certain threshold. You could also require a minimum number of sentence-ending characters ([.?!]). If a paragraph doesn't meet those criteria, move on to the next one.
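A minimal sketch of that heuristic in JavaScript, assuming the page HTML has already been fetched into a string. The function names (`stripTags`, `extractParagraphs`) and the default thresholds are made up for illustration and would need tuning per feed. Matching `<p>` tags with a regular expression is crude, so a real DOM parser would be more robust on messy markup.

```javascript
// Hypothetical helper: remove tags, decode two common entities,
// and collapse runs of whitespace.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, '')
    .replace(/&nbsp;/g, ' ')
    .replace(/&amp;/g, '&')
    .replace(/\s+/g, ' ')
    .trim();
}

// Hypothetical helper: return up to `limit` paragraphs that look like prose.
// minLength and minSentenceEnds are assumed defaults, not recommended values.
function extractParagraphs(html, options = {}) {
  const { minLength = 80, minSentenceEnds = 2, limit = 2 } = options;
  const paragraphPattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  const result = [];
  let match;
  while (result.length < limit &&
         (match = paragraphPattern.exec(html)) !== null) {
    const text = stripTags(match[1]);
    // Count sentence-ending characters as a rough "is this prose?" signal.
    const sentenceEnds = (text.match(/[.?!]/g) || []).length;
    if (text.length >= minLength && sentenceEnds >= minSentenceEnds) {
      result.push(text);
    }
    // Otherwise skip this paragraph and try the next one.
  }
  return result;
}

// Usage: the sidebar and "Share this" paragraphs are too short to qualify.
const sampleHtml = `
  <div id="sidebar"><p>Subscribe now!</p></div>
  <div id="article">
    <p>The city council voted on Tuesday to approve the new transit plan. The vote followed months of <a href="#">public debate</a>.</p>
    <p>Share this</p>
    <p>Construction is expected to begin next spring. Officials say the first line could open within three years.</p>
  </div>`;
console.log(extractParagraphs(sampleHtml));
```

A long sidebar blurb can still pass these checks, so this only reduces the number of wrong matches. It does not identify the article body.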


You may also want to try a scraping library, which lets you load a page and parse its contents. PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/) has a jQuery-like selector syntax and should quickly let you get just the content you want.

Scraping comes with its own caveats, as some sites may not look kindly on your harvesting of data and may block your attempts. You may want to look into the pluses and minuses of this method, but it is certainly powerful.

There's also info on scraping RSS feeds here: http://blog.5ubliminal.com/posts/rsscraping-scraping-rss-with-php-dom-xpath/, which I haven't tried.

EDIT: Wrikken's link is better than mine. Some good alternatives there.
