How to extract keywords from text according to context

I want to extract the relevant words from a text statement provided by the user. For example, for the question "How many sides are there in a rectangle?" the words should be 'rectangle', 'sides', 'many', and 'how'.

I've since realized that what I'm really aiming for is an NLP question answering system. For now, though, I only want to extract the required keywords from the question. The domain of the questions is not very broad.

I've come across various data mining tools, but I'm not sure they would actually be useful for this. They seem either too advanced or not quite related.

Please let me know if there is any tool that suits the requirement, or whether I should go ahead and try coding it myself.

Please provide any pointers that you think might help.


If all you have is the questions themselves, you can try part-of-speech (POS) tagging and named entity recognition (NER). The nouns in particular would be of interest. There are a number of open source tools for this: Brill's POS tagger, LingPipe, OpenNLP, etc. However, if you also have a corpus from the domain you are interested in, you can extract key words and phrases from it by measuring how much their frequencies differ from those in some other base corpus. Given a question, you can then look for those key words and phrases in it.
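
To make the corpus-comparison idea concrete, here is a minimal sketch in plain Python. It assumes a smoothed log-ratio of relative frequencies as the "how different" measure, which is only one of several possible choices; the function names (`domain_keywords`, `keywords_in_question`) and the regex tokenizer are made up for illustration and do not come from any of the tools named above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def domain_keywords(domain_docs, base_docs, top_n=5):
    """Rank words by how much more frequent they are in the domain corpus
    than in the base corpus (assumed measure: smoothed log frequency ratio)."""
    domain = Counter(t for d in domain_docs for t in tokenize(d))
    base = Counter(t for d in base_docs for t in tokenize(d))
    domain_total = sum(domain.values())
    base_total = sum(base.values())
    vocab = len(set(domain) | set(base))
    scores = {}
    for word, count in domain.items():
        # Add-one smoothing so words missing from the base corpus
        # do not cause a division by zero.
        p_domain = (count + 1) / (domain_total + vocab)
        p_base = (base[word] + 1) / (base_total + vocab)
        scores[word] = math.log(p_domain / p_base)
    # Highest score first; ties broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]

def keywords_in_question(question, keywords):
    """Return the keywords that occur in the question, in question order."""
    wanted = set(keywords)
    found = []
    for token in tokenize(question):
        if token in wanted and token not in found:
            found.append(token)
    return found
```

With a real domain corpus you would call `domain_keywords` once, keep the result, and then run `keywords_in_question` on each incoming question. Multi-word phrases would need n-gram counting on top of this, which the sketch leaves out.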


Apart from srean's advice to use POS tagging and NER, many people use search engine tools (specifically Lucene, but several others exist) to do question answering. They index a set of documents that should contain the answer, use the question as a query, retrieve a set of documents, and filter those to find the answer. Search engine tools have built-in term weighting.
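
The index-then-query flow can be sketched without a search engine. The following is a toy TF-IDF ranking in plain Python that stands in for what a tool like Lucene does; it is not Lucene's API or its actual scoring formula, and the names `build_index` and `search` are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Return per-document term counts and an IDF weight for every term."""
    doc_terms = [Counter(tokenize(d)) for d in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for terms in doc_terms for term in terms)
    n = len(docs)
    # Smoothed IDF: rarer terms get a larger weight.
    idf = {term: math.log((n + 1) / (df + 1)) + 1 for term, df in doc_freq.items()}
    return doc_terms, idf

def search(question, docs, top_n=3):
    """Use the question as a query and return the best-matching documents."""
    doc_terms, idf = build_index(docs)
    query_terms = set(tokenize(question))
    scored = []
    for i, terms in enumerate(doc_terms):
        # Sum term count times IDF over the query terms found in the document.
        score = sum(terms[t] * idf[t] for t in query_terms if t in terms)
        if score > 0:
            scored.append((score, i))
    # Best score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [docs[i] for _, i in scored[:top_n]]
```

The term weighting is what makes a rare word such as "rectangle" count for more than a common word such as "a" when the question is used unmodified as the query. Filtering the retrieved documents down to an actual answer is a separate step that this sketch does not attempt.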

That's the baseline setup; more advanced systems do all kinds of preprocessing on the question and the documents, including stop word filtering, POS tagging, parsing, NER, genetic algorithms, etc.
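
Of these preprocessing steps, stop word filtering is the simplest to try on its own. Below is a minimal sketch, assuming a tiny hand-written stop list (`STOP_WORDS` is illustrative, not a standard list). Question words such as "how" and "many" are deliberately left out of the list, since the asker wants to keep them.

```python
import re

# A tiny illustrative stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "there", "to", "do", "does"}

def content_words(question):
    """Return the lowercase tokens of the question that are not stop words."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(content_words("How many sides are there in a rectangle?"))
# ['how', 'many', 'sides', 'rectangle']
```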

See this paper for an example of this setup.
