Remove Ads in RSS feed
I am developing a local intranet site on which I want to display some RSS feeds from other sites. It is currently built on the Concrete5 CMS, and I am using an RSS displayer plugin to display the feeds. The plugin uses SimplePie to parse the feed. By default, the plugin displays the entire RSS content. I've tweaked the plugin (SimplePie) to display only a title with link, the date, and the first image in each post/entry.
I found these functions, to which I pass $item->get_content() in order to get the first image's source:
// Return the first <img> tag in the item's HTML, or '' if there is none.
function getFirstImage($text) {
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $pattern = '/<img[^>]+>/i';
    if (!preg_match($pattern, $text, $matches)) {
        return '';
    }
    return $matches[0];
}
// Return the src URL of the given <img> tag, or '' if there is none.
function scrapeImage($text) {
    $pattern = '/src=[\'"]?([^\'" >]+)[\'" >]/';
    if (!preg_match($pattern, $text, $link)) {
        return '';
    }
    return urldecode($link[1]);
}
It works fine. The problem is that some of the feeds contain ads, which are sometimes placed before the actual post content, so the function returns the URL of an ad. Obviously these RSS ads are targeted at people who use RSS readers, but when the feeds are displayed on a site, they are very annoying.
If I try to target exact tags besides <img> within preg_match(), I feel it will only work for the specific feed that I've taken the tag from (for example, if I try to use preg_match() to find only images inside <p> tags).
How can I get the first image from the actual post that isn't an ad without having to change the code for each feed I want to display?
I'm not sure if this would work for your situation, but ad images usually come from a different domain or sub-domain than the regular content. You could try to filter out images whose URL has a domain or sub-domain different from that of the RSS feed.
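A minimal sketch of that idea, in case it helps. The helper names baseDomain() and getFirstSameDomainImage() are hypothetical, and the "base domain" here is naively taken as the last two labels of the host, which is an assumption that breaks for suffixes such as co.uk. It also treats relative image URLs as belonging to the feed's own site.

```php
<?php
// Hypothetical helper: naive "base domain" = last two labels of the host.
// Assumption: this is wrong for multi-part suffixes such as co.uk.
function baseDomain($url) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return '';
    }
    $labels = explode('.', strtolower($host));
    return implode('.', array_slice($labels, -2));
}

// Hypothetical helper: return the src of the first <img> whose base domain
// matches the reference URL's base domain, or '' if no image qualifies.
function getFirstSameDomainImage($html, $referenceUrl) {
    $referenceDomain = baseDomain($referenceUrl);
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    // Capture the src attribute of every <img> tag, in document order.
    $pattern = '/<img\b[^>]*?\ssrc=[\'"]?([^\'" >]+)/i';
    if (!preg_match_all($pattern, $html, $matches)) {
        return '';
    }
    foreach ($matches[1] as $src) {
        $domain = baseDomain($src);
        // Relative URLs have no host; assume they belong to the feed's site.
        if ($domain === '' || $domain === $referenceDomain) {
            return $src;
        }
    }
    return '';
}
```

With SimplePie you could pass something like $item->get_content() as the HTML and $item->get_permalink() as the reference URL. Feeds that serve their own images from an unrelated CDN domain would return '' with this check, so you may want to fall back to the first image of any domain in that case.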