
Automated website categorization

I want to build an engine that categorizes websites based on their meta keywords attribute.

Extracting the keywords from a website was easy, as was connecting to the database. The problem I am facing is the algorithm: how to match the keywords extracted from the website against a predefined set of strings.

Please help me. I am using PHP scripts to implement this.

//say I have $pattern as the meta keywords extracted from a web page (please ignore the syntax)
$pattern=<news, current affairs, breaking news, sports, entertainment, daily news, local news>

// and a set of predefined strings to match against
$keywords=<----something----->

What logic should I use to match $pattern against $keywords? Would preg_match_all() or the ereg() function work for this? Any help would be appreciated.


//$keyword holds the predefined array of strings
$keyword=array('local news','art','local','world','tech','entertainment','news','tech','top stories','in the news','front page','bbc news','week in a glance','week in pictures','top stories');
$all_meta_tags=get_meta_tags("http://abcnews.go.com/");
$array=$all_meta_tags['keywords']; //store the 'keywords' meta attribute value in $array

Now I have to match the contents of $array against $keyword. The result should give me the items of $array that are also present in $keyword.
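For the exact-match case described above, a regular expression is probably not needed. Here is a minimal sketch, assuming the 'keywords' value returned by get_meta_tags() is a single comma-separated string. The helper names normalize_keywords() and match_keywords() are made up for illustration, and the sample string stands in for a live fetch. Note that ereg() was removed in PHP 7, so it is best avoided either way.

```php
<?php
// Hypothetical helper: split a comma-separated meta keywords string into
// a list of trimmed, lower-cased, de-duplicated terms.
function normalize_keywords(string $metaKeywords): array
{
    $terms = array_map('trim', explode(',', strtolower($metaKeywords)));
    $terms = array_filter($terms, function ($t) { return $t !== ''; });
    return array_values(array_unique($terms));
}

// Hypothetical helper: return the page terms that also appear in the
// predefined list (exact, case-insensitive match on whole terms).
function match_keywords(string $metaKeywords, array $keyword): array
{
    $predefined = array_map('strtolower', $keyword);
    return array_values(array_intersect(normalize_keywords($metaKeywords), $predefined));
}

// Sample data taken from the question; no network access is needed here.
$keyword = array('local news', 'art', 'local', 'world', 'tech', 'entertainment', 'news');
$meta = 'news, current affairs, breaking news, sports, entertainment, daily news, local news';

$matched = match_keywords($meta, $keyword);
print_r($matched);
```

With real pages, $meta would be $all_meta_tags['keywords'], and it is worth checking with isset() first, since not every page declares that meta tag.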


You would need to use something like a Naive Bayes classifier: http://en.wikipedia.org/wiki/Naive_Bayes_classifier

I have used this system to classify jobs scraped from job sites before with rather good success. Writing the code was a bitch, have fun :D
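To give a rough idea of what that involves, here is a minimal sketch of a multinomial Naive Bayes classifier over keyword lists with add-one smoothing. It is not the code from the job-site project. The class name KeywordNaiveBayes, the category labels, and the tiny training set are all invented for illustration, and it assumes PHP 7.4 or later for typed properties.

```php
<?php
// Illustrative multinomial Naive Bayes over keyword lists (add-one smoothing).
// Class name, categories, and training data are assumptions, not a real library.
class KeywordNaiveBayes
{
    private array $termCounts = []; // category => [term => count]
    private array $docCounts = [];  // category => number of training examples
    private array $vocabulary = []; // term => true

    public function train(string $category, array $terms): void
    {
        $this->docCounts[$category] = ($this->docCounts[$category] ?? 0) + 1;
        foreach ($terms as $term) {
            $term = strtolower(trim($term));
            if ($term === '') {
                continue;
            }
            $this->termCounts[$category][$term] = ($this->termCounts[$category][$term] ?? 0) + 1;
            $this->vocabulary[$term] = true;
        }
    }

    // Returns the most likely category, or null if nothing was trained.
    public function classify(array $terms): ?string
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = max(1, count($this->vocabulary)); // guard against division by zero
        $best = null;
        $bestScore = -INF;
        foreach ($this->docCounts as $category => $docs) {
            $counts = $this->termCounts[$category] ?? [];
            $totalTerms = array_sum($counts);
            // Log prior plus the sum of smoothed log likelihoods.
            $score = log($docs / $totalDocs);
            foreach ($terms as $term) {
                $term = strtolower(trim($term));
                if ($term === '') {
                    continue;
                }
                $score += log((($counts[$term] ?? 0) + 1) / ($totalTerms + $vocabSize));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best = $category;
            }
        }
        return $best;
    }
}

$nb = new KeywordNaiveBayes();
$nb->train('news', ['news', 'breaking news', 'local news', 'top stories']);
$nb->train('technology', ['tech', 'gadgets', 'software', 'programming']);

$guess = $nb->classify(['news', 'current affairs', 'breaking news', 'sports']);
echo $guess, "\n";
```

In practice the training examples would come from sites you have already labelled by hand, and the more of them you have per category, the better the guesses tend to be.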
