How to dynamically filter website content using PHP
I'm currently looking for a solution to dynamically filter website content. By "dynamic" I mean calculating the percentage of bad words (e.g. shit, f**k, etc.) out of all the words on the first page. Say the website is allowed if the percentage is no more than 30%. How do I make it check each word on the first page against the bad words list, then divide the number of matches by the total number of words to get the percentage? The rationale is not to make a content filter but just to block the website should even a single word on the page match the bad words list. I have this so far, but it is static:
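$filename = "filters.txt";
$fp = @fopen($filename, 'r');
if ($fp) {
    // Each line of filters.txt has the form "pattern~replacement".
    $array = explode("\n", fread($fp, filesize($filename)));
    fclose($fp);
    foreach ($array as $key => $val) {
        if (strpos($val, "~") === false) {
            continue; // skip blank or malformed lines
        }
        // split() was removed in PHP 7; explode() handles a literal "~" delimiter.
        list($before, $after) = explode("~", $val);
        $input = preg_replace($before, $after, $input);
    }
}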
*filters.txt contains the list of bad words.
Thanks, Erisco!
I tried this, but it doesn't seem to work:
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    ob_start();
    curl_exec($ch);
    curl_close($ch);
    $string = ob_get_contents();
    ob_end_clean();
    return $string;
}
/* $toLoad is from Browse.php */
$sourceOfWebpage = get_content($toLoad);
$textOfWebpage = strip_tags($sourceOfWebpage);
/* array: Obtained by your filter.txt file */
// Open the filters file and filter all of the results.
$filename = "filters.txt";
$badWords = @fopen($filename, 'r');
if ($badWords) {
    $array = explode("\n", fread($fp, filesize($filename)));
    foreach($array as $key => $val){
        list($before,$after) = split("~",$val);
        $input = preg_replace($before,$after,$input);
    }
}
/* float: Some decimal value */
$allowedBadWordsPercent = 0.30;
$numberOfWords = str_word_count($textOfWebpage);
$numberOfBadWords = 0;
str_ireplace($badWords, '', $sourceOfWebpage, $numberOfBadWords);
if ($numberOfBadWords != 0) {
    $badWordsPercent = $numberOfWords / $numberOfBadWords;
} else {
    $badWordsPercent = 0;
}
if ($badWordsPercent > $allowedBadWordsPercent) {
    echo 'This is a naughty webpage';
}
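One likely reason the attempt above fails is that $badWords holds the file handle returned by fopen() rather than an array of words, and the fread() call reads from $fp, which is never defined in that snippet. A minimal sketch of loading the list into an array follows. It assumes filters.txt holds one plain word per line (not the "pattern~replacement" format of the earlier snippet), and loadBadWords is a hypothetical helper name, not something from the original code.

```php
<?php
// Hypothetical helper: read one bad word per line into an array.
// Assumes a plain word list, not the "pattern~replacement" format.
function loadBadWords(string $filename): array
{
    // file() returns false on failure; '@' suppresses the warning,
    // mirroring the @fopen() call in the original code.
    $lines = @file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }
    // Trim stray whitespace (such as "\r" from Windows line endings)
    // and drop any entries that end up empty.
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}

$badWords = loadBadWords('filters.txt');
```

With this, $badWords is a plain array of strings, which is the kind of value str_ireplace() expects as its search argument.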
This is the rough idea of what I'd do. You could argue that using str_ireplace() purely for the count is devious, though. I am not sure if there is a more direct function without busting out the regexp.
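/* string: Obtained by CURL or similar */
$sourceOfWebpage = ''; // placeholder: replace with the fetched HTML
$textOfWebpage = strip_tags($sourceOfWebpage);
/* array: Obtained by your filters.txt file */
$badWords = array(); // placeholder: replace with the loaded word list
/* float: Some decimal value */
$allowedBadWordsPercent = 0.30;
$numberOfWords = str_word_count($textOfWebpage);
$numberOfBadWords = 0;
// Only the match count matters here; the replaced string is discarded.
// Count in the stripped text so tags and attributes are not matched.
str_ireplace($badWords, '', $textOfWebpage, $numberOfBadWords);
// Guard on the total word count, since that is the divisor.
if ($numberOfWords != 0) {
    $badWordsPercent = $numberOfBadWords / $numberOfWords;
} else {
    $badWordsPercent = 0;
}
if ($badWordsPercent > $allowedBadWordsPercent) {
    echo 'This is a naughty webpage';
}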
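As a possible alternative to counting with str_ireplace(), which matches substrings (so a list entry can also be counted inside a longer, harmless word), the sketch below compares whole words instead. badWordRatio is a hypothetical helper name, and the sample sentence and word list are made up for illustration. Note that str_word_count() only treats letters, apostrophes and hyphens as word characters, so masked entries such as "f**k" would not be matched this way.

```php
<?php
// Hypothetical helper: fraction of the words in $text that appear in $badWords.
function badWordRatio(string $text, array $badWords): float
{
    // With format 1, str_word_count() returns the words as an array.
    $words = str_word_count(strtolower($text), 1);
    if (count($words) === 0) {
        return 0.0; // avoid division by zero on empty pages
    }
    // Flip the list so each lookup is a hash check instead of a scan.
    $lookup = array_flip(array_map('strtolower', $badWords));
    $bad = 0;
    foreach ($words as $word) {
        if (isset($lookup[$word])) {
            $bad++;
        }
    }
    return $bad / count($words);
}

// Illustrative input: 2 of the 6 words are on the list, so the ratio is about 0.33.
$ratio = badWordRatio('this darn page is darn noisy', ['darn']);
echo $ratio > 0.30 ? "blocked\n" : "allowed\n";
```

In practice the text would come from strip_tags() on the fetched page, and the returned ratio would be compared against $allowedBadWordsPercent as in the code above.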