
How to find keywords (useful words) in text?

I am doing an experimental project.

What I am trying to achieve is to find the keywords in a given text.

My approach is to build a list of how many times each word appears in the text, sorted with the most frequent words at the top.

The problem is that common words like "is", "was", and "were" always end up at the top, and these are clearly not useful as keywords.

Can anyone suggest a good approach so that it reliably finds relevant keywords?


Use something like a Brill tagger (a part-of-speech tagger) to identify the different parts of speech, such as nouns. Then extract only the nouns and sort them by frequency.


Well, you could use preg_split to get the list of words and then count how often each one occurs; I'm assuming that's the bit you've got working so far.

The only thing I can think of for stripping the unimportant words is to keep a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc., and use it to filter out the unwanted words.
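The ignore-list idea above can be sketched roughly as follows. This is a minimal illustration in Python rather than PHP, and the `STOP_WORDS` set and the `keyword_counts` helper are hypothetical names made up for the example; a real ignore list would be far longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny ignore list; a real one would be much longer.
STOP_WORDS = {"a", "i", "the", "and", "is", "was", "were", "on", "of", "to", "in", "it"}

def keyword_counts(text, stop_words=STOP_WORDS):
    """Count the words in text, skipping anything in the ignore list."""
    # Lowercase the text, then extract runs of letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

counts = keyword_counts("The cat was on the mat, and the cat is happy.")
# Most frequent remaining words first.
print(counts.most_common(3))
```

Filtering before counting keeps the frequency table small; filtering after counting would give the same ranking.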

Why are you doing this? Is it for searching page content? If so, most back-end databases offer some kind of text search functionality. Both MySQL and Postgres, for example, have a full-text search engine that automatically discards the unimportant words. I'd recommend using the full-text features of whichever database you're using, as chances are it already implements something that meets your requirements.


My first approach to something like this would be more mathematical modeling than pure programming.

There are two "simple" ways you can attack a problem like this: a) an exclusion list (penalize a collection of words which you deem useless), or b) a weight function which, for example, builds on word length, so that short words such as prepositions (in, at, ...) and pronouns (I, you, me, his, ...) are penalized and hopefully fall to mid-table.

I am not sure if this is what you were looking for, but I hope it helps. By the way, contextual text processing is a subject of active research, so you might find a number of projects that are interesting.
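Option b) might look something like the sketch below. The quadratic weight and the `full_len=5` cutoff are arbitrary assumptions chosen for illustration, not something this answer prescribes, and `length_weight` and `weighted_scores` are hypothetical helper names.

```python
import re
from collections import Counter

def length_weight(word, full_len=5):
    """Hypothetical weight in (0, 1]: words of full_len letters or more
    get weight 1; shorter words are penalized quadratically."""
    return (min(len(word), full_len) / full_len) ** 2

def weighted_scores(text):
    """Score each word as its raw frequency times its length-based weight."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {w: n * length_weight(w) for w, n in counts.items()}

scores = weighted_scores("in the garden the roses in the garden bloom")
# Words ordered from highest to lowest weighted score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here "the" occurs most often but no longer ranks first, because its short length scales its score down. Word length alone is a crude signal, though: short but meaningful words get penalized too.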
