Dynamically create keywords and description from a URL - PHP stopwords problem
I am hoping that you can help me.
I have created the following script. Its purpose is to dynamically create a meta description tag and a meta keywords tag from a page.
I still need to apply a size limit to both the meta description and the keywords, held in $str2 and $key respectively.
$key prints out the meta keywords for me; these are relevant to the page. My question is: how can I remove all the stopwords from the $key variable?
<?php
$url = 'http://localhost/index.asp';
$url_content = file_get_contents($url);
//$url_content = strip_tags($url_content);
$str = $url_content;
/**
* Remove HTML tags, including invisible text such as style and
* script code, and embedded objects. Add line breaks around
* block-level tags to prevent word joining after tag removal.
*/
$str = preg_replace(
array(
// Remove invisible content
'@<head[^>]*?>.*?</head>@siu',
'@<style[^>]*?>.*?</style>@siu',
'@<script[^>]*?.*?</script>@siu',
'@<object[^>]*?.*?</object>@siu',
'@<embed[^>]*?.*?</embed>@siu',
'@<applet[^>]*?.*?</applet>@siu',
'@<noframes[^>]*?.*?</noframes>@siu',
'@<noscript[^>]*?.*?</noscript>@siu',
'@<noembed[^>]*?.*?</noembed>@siu',
'@<h1[^>]*?.*?</h1>@siu',
'@<a[^>]*?.*?</a>@siu',
// Add line breaks before and after blocks
'@</?((address)|(blockquote)|(center)|(del))@iu',
'@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
'@</?((table)|(th)|(td)|(caption))@iu',
'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
'@</?((frameset)|(frame)|(iframe))@iu',
),
array(
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
"\n\$0", "\n\$0", "\n\$0", "\n\$0",
),
$str );
$str1 = strip_tags($str);
$str2 = strip_tags($str);
echo $str2.'<hr />';
$words = str_word_count(strtolower($str1),1);
$numWords = count($words);
//array_count_values()// returns an array using the values of the input array as keys and their frequency in input as values.
$word_count = (array_count_values($words));
arsort($word_count);
foreach ($word_count as $key=>$val) {
echo "$key, ";
}
?>
I have found this function:
function del_stop_words($kw){
$kw = array_map('strtolower',array_diff($kw,array("")));
$sw = explode("\r\n",file_get_contents('http://localhost/stopwords.txt'));
return array_values(array_diff($kw,$sw));
}
but I am not 100% sure how to integrate it with the script above. I have already created the stopwords file. I just need the function to strip the stopwords from the $key variable.
Thanks
Try this:
$key = "I want to remove some bad words from my text, like sex racist etc...";
$swords = explode("\n", str_replace(array("\r\n", "\r"), "\n", file_get_contents('swords.txt')));
$key = str_replace($swords, "", $key );
echo $key; // prints the text with "sex" and "racist" removed, assuming both are listed in swords.txt
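One caveat with str_replace: it removes a stopword wherever it occurs, including inside longer words, so a stopword like "a" would also eat the "a" in "cat". If you want whole-word matching, something along the lines of the del_stop_words function from the question (array_diff on the word list) may suit better. A minimal sketch, assuming the words are already split into an array; remove_stop_words is a made-up helper name and the stopwords are inlined here rather than read from your file:

```php
<?php
// Hypothetical helper (not from the original script): removes whole-word
// stopwords from a list of words.
function remove_stop_words(array $words, array $stopWords): array
{
    // Normalise both lists the same way: trim whitespace and lowercase.
    $normalise = function ($w) {
        return strtolower(trim($w));
    };
    $words = array_map($normalise, $words);
    $stopWords = array_map($normalise, $stopWords);

    // Drop empty entries, then remove every word that equals a stopword.
    $words = array_filter($words, 'strlen');
    return array_values(array_diff($words, $stopWords));
}

// Assumption: stopwords are inlined for illustration; in the real script
// they would come from the stopwords file, one per line.
$stopWords = array('the', 'a', 'and', 'of');

$words = str_word_count(strtolower('The cat and the hat of a magician'), 1);
$keywords = remove_stop_words($words, $stopWords);

echo implode(', ', $keywords); // cat, hat, magician
```

Because array_diff compares complete array values, "cat", "hat" and "magician" survive even though each contains the stopword "a".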
Your complete code will look like this:
<?php
$url = 'http://localhost/index.asp';
$url_content = file_get_contents($url);
//$url_content = strip_tags($url_content);
$str = $url_content;
/**
* Remove HTML tags, including invisible text such as style and
* script code, and embedded objects. Add line breaks around
* block-level tags to prevent word joining after tag removal.
*/
$str = preg_replace(
array(
// Remove invisible content
'@<head[^>]*?>.*?</head>@siu',
'@<style[^>]*?>.*?</style>@siu',
'@<script[^>]*?.*?</script>@siu',
'@<object[^>]*?.*?</object>@siu',
'@<embed[^>]*?.*?</embed>@siu',
'@<applet[^>]*?.*?</applet>@siu',
'@<noframes[^>]*?.*?</noframes>@siu',
'@<noscript[^>]*?.*?</noscript>@siu',
'@<noembed[^>]*?.*?</noembed>@siu',
'@<h1[^>]*?.*?</h1>@siu',
'@<a[^>]*?.*?</a>@siu',
// Add line breaks before and after blocks
'@</?((address)|(blockquote)|(center)|(del))@iu',
'@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
'@</?((table)|(th)|(td)|(caption))@iu',
'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
'@</?((frameset)|(frame)|(iframe))@iu',
),
array(
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
"\n\$0", "\n\$0", "\n\$0", "\n\$0",
),
$str );
$str1 = strip_tags($str);
$str2 = strip_tags($str);
echo $str2.'<hr />';
$words = str_word_count(strtolower($str1),1);
$numWords = count($words);
//array_count_values()// returns an array using the values of the input array as keys and their frequency in input as values.
$word_count = (array_count_values($words));
arsort($word_count);
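// Load the stopwords once, outside the loop: normalise line endings, then trim and lowercase each entry
$swords = array_map('strtolower', array_map('trim', explode("\n", str_replace(array("\r\n", "\r"), "\n", file_get_contents('swords.txt')))));
foreach ($word_count as $key=>$val) {
// Compare whole words; str_replace() would also strip stopwords found inside longer words
if (in_array((string)$key, $swords, true)) continue;
echo "$key, ";
}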
?>
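The question also mentions limiting the size of the description and the keywords, which the code above does not do. A rough sketch of one way to handle it, assuming a 160-character description and the 10 most frequent keywords (both limits are arbitrary choices, and limit_description and limit_keywords are made-up helper names):

```php
<?php
// Hypothetical helper: trims text to at most $max bytes, cutting at a
// word boundary. Uses byte-based functions; for multibyte text the mb_*
// equivalents would be needed.
function limit_description(string $text, int $max = 160): string
{
    // Collapse runs of whitespace (including newlines) into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    if (strlen($text) <= $max) {
        return $text;
    }
    $cut = substr($text, 0, $max);
    $lastSpace = strrpos($cut, ' ');
    // Cut back to the last space, if there is one, so no word is split.
    return $lastSpace === false ? $cut : substr($cut, 0, $lastSpace);
}

// Hypothetical helper: keeps the $maxWords most frequent words from a
// word => count array such as the one array_count_values() returns.
function limit_keywords(array $wordCount, int $maxWords = 10): array
{
    arsort($wordCount);
    return array_slice(array_keys($wordCount), 0, $maxWords);
}

// Sample data standing in for $str2 and $word_count from the script.
$description = limit_description("PHP   is a popular\nscripting language", 20);
$topKeywords = limit_keywords(array('php' => 5, 'script' => 2, 'page' => 7), 2);

echo '<meta name="description" content="' . htmlspecialchars($description) . '">' . "\n";
echo '<meta name="keywords" content="' . htmlspecialchars(implode(', ', $topKeywords)) . '">' . "\n";
```

In the script above you would pass $str2 to limit_description and the stopword-filtered $word_count to limit_keywords; htmlspecialchars keeps quotes in the page text from breaking the attribute.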