Searching for keywords (from a matrix) in a string (around 500 characters)

Hey, basically what I am trying to do is automatically assign tags to a user input string. I have 5 tags to assign, and each tag will have around 10 keywords. A string can only be assigned one tag. In order to assign a tag to a string, I need to search it for words matching the keywords of all five tags. Example:

TAGS:     Keywords
Drink:    Beer, whiskey, drinks, drink, pint, peg.....
Fitness:  gym, yoga, massage, exercise......
Apparels: men's shirt, shirt, dress......
Music:    classical, western, sing, salsa.....
Food:     meal, grilled, baked, delicious.......

User String: Take first step to reach your fitness goals, Pay Rs 199 for Aerobics, Yoga, Kick Boxing, Bollywood Dance and more worth Rs 1000 at The very Premium F Chisel Bounce, Koramangala.


Now I need to decide on a tag for the above string. I need a time-efficient algorithm for this problem. I don't know how to go about matching keywords against strings, but I do have a thought about deciding the tag: maintain a count for each tag and, as a keyword is matched, increase the count for the respective tag. If at any time the count for any tag reaches 5, we can stop and decide on that tag; this saves us from searching the whole thing.
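A rough sketch of that counting idea in PHP, in case it helps to make it concrete. The function name assignTagEarlyStop, the $threshold parameter and the fallback to the highest count are all assumptions made for illustration, and the sketch only handles single-word keywords (a phrase like "men's shirt" would need separate treatment):

```php
<?php
// Hypothetical helper: the name, signature and fallback rule are illustrative.
function assignTagEarlyStop(string $text, array $tags, int $threshold = 5): ?string
{
    // Invert the tag => keywords table into keyword => tag for O(1) lookups.
    $keywordToTag = [];
    foreach ($tags as $tag => $keywords) {
        foreach ($keywords as $keyword) {
            $keywordToTag[strtolower($keyword)] = $tag;
        }
    }

    // One counter per tag, all starting at zero.
    $counts = array_fill_keys(array_keys($tags), 0);

    // Split the lowercased text into words (letters and apostrophes only).
    preg_match_all("/[a-z']+/", strtolower($text), $matches);

    foreach ($matches[0] as $word) {
        if (!isset($keywordToTag[$word])) {
            continue;
        }
        $tag = $keywordToTag[$word];
        if (++$counts[$tag] >= $threshold) {
            return $tag; // early exit: this tag reached the threshold
        }
    }

    // No tag reached the threshold: fall back to the highest count, if any.
    arsort($counts);
    $best = array_key_first($counts);
    return ($best !== null && $counts[$best] > 0) ? $best : null;
}
```

With only around 50 keywords and 500 characters the early exit probably saves very little; the single pass over the words with a lookup table is likely where most of the efficiency comes from.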

Please give any advice you have on this. I will be using PHP, just so you know. Thanks.


Interesting topic! What you are looking for is something similar to latent semantic indexing. There is a question about it here.


If the number of tags and keywords is small, I would save myself writing a complex algorithm and simply do:

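// Map each tag to its keywords (all lowercase).
$tags = array(
    'drink' => array('beer', 'whiskey', ...),
    ...
);
$string = 'Take first step ...';
// substr_count() is case-sensitive, so compare against a lowercased copy;
// otherwise "Yoga" in the text would not match the keyword "yoga".
$haystack = strtolower($string);
$bestTag = '';
$bestTagCount = 0;
foreach ($tags as $tag => $keywords) {
    // Total number of keyword occurrences for this tag.
    $count = 0;
    foreach ($keywords as $keyword) {
        $count += substr_count($haystack, $keyword);
    }
    // Keep the tag with the most hits; on a tie the earlier tag wins.
    if ($count > $bestTagCount) {
        $bestTagCount = $count;
        $bestTag = $tag;
    }
}
var_dump($bestTag);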

The algorithm is pretty obvious, but only suited for a small number of tags/keywords.
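One caveat with substr_count() is that it matches substrings, so "peg" would also be counted inside "pegged", and "drinks" would score once for "drinks" and again for "drink". If that matters, a possible variant is to build one whole-word, case-insensitive regular expression per tag. This is only a sketch: the function name bestTagByWholeWords is made up for illustration, and it assumes every keyword starts and ends with a letter or digit so that \b behaves as expected:

```php
<?php
// Hypothetical helper (not part of the answer above): counts whole-word,
// case-insensitive keyword hits per tag and returns the best-scoring tag.
function bestTagByWholeWords(string $text, array $tags): ?string
{
    $bestTag = null;
    $bestCount = 0;
    foreach ($tags as $tag => $keywords) {
        if ($keywords === []) {
            continue; // an empty alternation would match everywhere
        }
        // Longer keywords first, so "men's shirt" is tried before "shirt".
        usort($keywords, fn($a, $b) => strlen($b) <=> strlen($a));
        // Escape regex metacharacters in each keyword.
        $escaped = array_map(fn($k) => preg_quote($k, '/'), $keywords);
        $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';
        // preg_match_all() returns the number of non-overlapping matches.
        $count = preg_match_all($pattern, $text);
        if ($count > $bestCount) {
            $bestCount = $count;
            $bestTag = $tag;
        }
    }
    return $bestTag; // null when no keyword matched at all
}
```

Called as bestTagByWholeWords($string, $tags), it returns null when nothing matches, and on a tie the tag listed first wins, as in the loop above.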


If you don't mind using an external API, you should try one of these:

  • http://www.zemanta.com/
  • http://www.opencalais.com/
  • Benjamin Nowack: Linked Data Entity Extraction with Zemanta and OpenCalais

To give an example, Zemanta will return the following tags (among other things) for your User String:

Bollywood, Kickboxing, Koramangala, Aerobics, Boxing, Sports, India, Asia

OpenCalais will return:

Sports, Hospitality Recreation, Health, Recreation, Human behavior, Kick, Yoga, Chisel Aerobics, Meditation, Indian philosophy, Combat sports, Aerobic exercise, Exercise
