
Dynamically create keywords and description from a URL - PHP stopwords problem

I am hoping that you can help me.

I have created the following script. Its purpose is to dynamically create a meta description tag and a meta keywords tag from a page.

I still need to apply a size limit to the meta description and to the keywords tag ($str2 and $key, respectively).

$key prints out the meta keywords for me, and these are relevant to the page. My question is: how can I remove all the stopwords from the $key variable?

   <?php 
    $url = 'http://localhost/index.asp';
    $url_content = file_get_contents($url);
    //$url_content = strip_tags($url_content);
    $str = $url_content;

    /**
     * Remove HTML tags, including invisible text such as style and
     * script code, and embedded objects.  Add line breaks around
     * block-level tags to prevent word joining after tag removal.
     */
        $str = preg_replace(
            array(
              // Remove invisible content
                '@<head[^>]*?>.*?</head>@siu',
                '@<style[^>]*?>.*?</style>@siu',
                '@<script[^>]*?.*?</script>@siu',
                '@<object[^>]*?.*?</object>@siu',
                '@<embed[^>]*?.*?</embed>@siu',
                '@<applet[^>]*?.*?</applet>@siu',
                '@<noframes[^>]*?.*?</noframes>@siu',
                '@<noscript[^>]*?.*?</noscript>@siu',
                '@<noembed[^>]*?.*?</noembed>@siu',
                '@<h1[^>]*?.*?</h1>@siu',
                '@<a[^>]*?.*?</a>@siu',
              // Add line breaks before and after blocks
                '@</?((address)|(blockquote)|(center)|(del))@iu',
                '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
                '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
                '@</?((table)|(th)|(td)|(caption))@iu',
                '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
                '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
                '@</?((frameset)|(frame)|(iframe))@iu',
            ),
            array(
                ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
                "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
                "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            ),
            $str );
        $str1 =  strip_tags($str);
        $str2 =  strip_tags($str);
        echo $str2.'<hr />';

    $words = str_word_count(strtolower($str1),1);
    $numWords = count($words);
    //array_count_values()// returns an array using the values of the input array as keys and their frequency in input as values.
    $word_count = (array_count_values($words));
    arsort($word_count);

    foreach ($word_count as $key=>$val) {
    echo "$key, ";
    }
    ?>

I have found this function:

    function del_stop_words($kw){
        $kw = array_map('strtolower',array_diff($kw,array("")));
        $sw = explode("\r\n",file_get_contents('http://localhost/stopwords.txt'));
        return array_values(array_diff($kw,$sw));
    }

but I am not 100% sure how to integrate it with the script above. I have already created the stopwords file. I just need the function to strip the stopwords from the $key variable.

Thanks


Try this:

$key = "I want to remove some bad words from my text, like sex racist etc...";
$swords = explode("\n", str_replace(array("\r\n", "\r"), "\n", file_get_contents('swords.txt')));
$key = str_replace($swords, "", $key );
echo $key; // prints the text with every string listed in swords.txt removed (extra spaces are left behind)
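
Note that str_replace() matches substrings, not whole words, so a stop word such as "the" is also removed from inside "other". A minimal sketch of whole-word filtering follows; the inline $swords list and the sample sentence are assumptions for illustration, standing in for swords.txt:

```php
<?php
// Hypothetical stop list for illustration; in practice it would be read from swords.txt.
$swords = array('the', 'a', 'to');

$key = 'I want to go to the other theatre';

// Substring replacement also damages longer words that contain a stop word.
$substring = str_replace($swords, '', $key);

// Whole-word filtering: split on whitespace, drop exact (case-insensitive) matches, rejoin.
$tokens = preg_split('/\s+/', $key, -1, PREG_SPLIT_NO_EMPTY);
$kept = array();
foreach ($tokens as $token) {
    if (!in_array(strtolower($token), $swords, true)) {
        $kept[] = $token;
    }
}
$wholeWord = implode(' ', $kept);

echo $wholeWord, "\n"; // prints "I want go other theatre"
```

Here $substring no longer contains the word "other", while $wholeWord keeps it intact.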

Your complete code will look like this:

<?php 
    $url = 'http://localhost/index.asp';
    $url_content = file_get_contents($url);
    //$url_content = strip_tags($url_content);
    $str = $url_content;

    /**
     * Remove HTML tags, including invisible text such as style and
     * script code, and embedded objects.  Add line breaks around
     * block-level tags to prevent word joining after tag removal.
     */
        $str = preg_replace(
            array(
              // Remove invisible content
                '@<head[^>]*?>.*?</head>@siu',
                '@<style[^>]*?>.*?</style>@siu',
                '@<script[^>]*?.*?</script>@siu',
                '@<object[^>]*?.*?</object>@siu',
                '@<embed[^>]*?.*?</embed>@siu',
                '@<applet[^>]*?.*?</applet>@siu',
                '@<noframes[^>]*?.*?</noframes>@siu',
                '@<noscript[^>]*?.*?</noscript>@siu',
                '@<noembed[^>]*?.*?</noembed>@siu',
                '@<h1[^>]*?.*?</h1>@siu',
                '@<a[^>]*?.*?</a>@siu',
              // Add line breaks before and after blocks
                '@</?((address)|(blockquote)|(center)|(del))@iu',
                '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
                '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
                '@</?((table)|(th)|(td)|(caption))@iu',
                '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
                '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
                '@</?((frameset)|(frame)|(iframe))@iu',
            ),
            array(
                ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
                "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
                "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            ),
            $str );
        $str1 =  strip_tags($str);
        $str2 =  strip_tags($str);
        echo $str2.'<hr />';

    $words = str_word_count(strtolower($str1),1);
    $numWords = count($words);
    //array_count_values()// returns an array using the values of the input array as keys and their frequency in input as values.
    $word_count = (array_count_values($words));
    arsort($word_count);

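    // Load the stop words once, outside the loop: lowercase, normalise line endings, trim, drop blank lines.
    $swords = explode("\n", str_replace(array("\r\n", "\r"), "\n", strtolower(file_get_contents('swords.txt'))));
    $swords = array_filter(array_map('trim', $swords), 'strlen');

    foreach ($word_count as $key=>$val) {
        // Skip whole-word matches only; str_replace() would also cut stop words out of longer keywords.
        if (in_array((string) $key, $swords, true)) {
            continue;
        }
        echo "$key, ";
    }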
    ?>
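
The question also mentions size limits for the description and the keywords. The following is only a rough sketch of how the stop-word filtering and both limits could be wrapped into helpers. The function names (parse_stop_words, top_keywords, limit_description), the default of 10 keywords, and the 160-character description length are assumptions chosen for illustration, not part of the original script:

```php
<?php
/**
 * Parse a stop-word list (one word per line) into a lookup table.
 * Accepts \r\n, \r and \n line endings and ignores blank lines.
 */
function parse_stop_words($text)
{
    $lines = preg_split('/\r\n|\r|\n/', strtolower($text));
    $lines = array_filter(array_map('trim', $lines), 'strlen');
    return array_flip($lines); // word => index, so isset() works as a lookup
}

/**
 * Return up to $limit keywords, most frequent first, skipping stop words.
 */
function top_keywords($text, array $stopWords, $limit = 10)
{
    $counts = array_count_values(str_word_count(strtolower($text), 1));
    arsort($counts);

    $keywords = array();
    foreach ($counts as $word => $count) {
        if (isset($stopWords[$word])) {
            continue; // whole-word match against the stop list
        }
        $keywords[] = $word;
        if (count($keywords) >= $limit) {
            break;
        }
    }
    return $keywords;
}

/**
 * Collapse whitespace and cut the text at a word boundary.
 * Byte-based (strlen/substr), so it is not multibyte-aware.
 */
function limit_description($text, $maxLength = 160)
{
    $text = trim(preg_replace('/\s+/', ' ', $text));
    if (strlen($text) <= $maxLength) {
        return $text;
    }
    // Look one character past the limit so a word ending exactly at the limit is kept.
    $cut = substr($text, 0, $maxLength + 1);
    $lastSpace = strrpos($cut, ' ');
    return $lastSpace === false ? substr($text, 0, $maxLength) : substr($cut, 0, $lastSpace);
}

// Inline sample data; a real run would use file_get_contents('swords.txt') and $str1.
$stopWords = parse_stop_words("the\r\nand\r\n\r\na\nof");
$sample = "The cat and the hat. A cat of the hat and the cat.";

echo implode(', ', top_keywords($sample, $stopWords, 2)), "\n";               // prints "cat, hat"
echo limit_description("  Hello   world  this is a test ", 11), "\n";        // prints "Hello world"
```

In the script above, this would roughly amount to calling top_keywords($str1, parse_stop_words(file_get_contents('swords.txt'))) for the keywords tag and limit_description($str2) for the description tag.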