How do I get tags/keywords from a webpage/feed?
I have to build a tag cloud from a webpage/feed. Once you have the word-frequency table of tags, building the tag cloud is easy. My question is: how do I retrieve the tags/keywords from the webpage/feed in the first place?
This is what I'm doing now:
Get the content -> strip HTML -> split on whitespace (\s, \n, \t: space, newline, tab) -> keyword list
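For concreteness, that pipeline could be sketched in Python roughly as follows. This is only an illustration using the standard library; the names `TextExtractor`, `strip_html`, and `word_frequencies` are made up for this sketch, and fetching the page is left out.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def word_frequencies(html):
    """Strip HTML, split on whitespace, and count each lowercased word."""
    words = strip_html(html).lower().split()
    return Counter(words)


freq = word_frequencies("<p>Tag cloud, tag <b>cloud</b></p><script>var x;</script>")
print(freq.most_common())
```

Note that a plain whitespace split leaves punctuation attached, so "cloud," and "cloud" are counted as different words here.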
But this does not work very well.
Is there a better way?
What you have is a rough first-order approximation. I think if you then go back through the data and count the frequency of 2-word phrases, then 3-word phrases, and so on up to the maximum number of words that can be considered a tag, you'll get a better representation of keyword frequency.
You can refine this rough search pattern by specifying certain words that can be contained as part of a phrase (pronouns, etc.).
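A minimal sketch of that idea in Python, under some assumptions: the `STOP_WORDS` set, the `tokenize` regex, and the `phrase_frequencies` function are all made up for illustration, and the refinement is interpreted here as "listed words may appear inside a phrase but not at its start or end" (which also drops them as single-word tags). Phrases are counted across sentence boundaries, which is a simplification.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is",
              "it", "i", "you", "he", "she", "they", "we"}


def tokenize(text):
    """Lowercase the text and pull out runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_frequencies(text, max_words=3, stop_words=STOP_WORDS):
    """Count every 1..max_words-word phrase that does not start or end
    with a stop word. Stop words are still allowed inside a phrase."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            counts[" ".join(gram)] += 1
    return counts


counts = phrase_frequencies("The tag cloud shows the tag cloud of the feed")
print(counts.most_common(5))
```

The resulting counter can be fed to the tag cloud directly; in practice you would probably also drop phrases that occur only once, or weight longer phrases differently from single words.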