
Find sentences with similar relative meaning from a list of sentences against an example one

I want to be able to find sentences with the same meaning. I have a query sentence and a long list of millions of other sentences. Sentences are made of words, plus a special type of word called a symbol, which simply stands for some object being talked about.

For example, my query sentence is:

Example: add (x) to (y) giving (z)

There may be a list of sentences already existing in my database, such as:

1. the sum of (x) and (y) is (z)
2. (x) plus (y) equals (z)
3. (x) multiplied by (y) does not equal (z)
4. (z) is the sum of (x) and (y)

The example should match sentences 1, 2 and 4 in my database, but not 3. There should also be some weight for each sentence match.

It's not just math sentences; it's any sentence that can be compared to any other sentence based on the meaning of its words. I need some way to compare a sentence against many other sentences to find the ones with the closest relative meaning, i.e. a mapping between sentences based on their meaning.

Thanks! (The tag is language-design because I couldn't create a new tag.)


First off: what you're trying to solve is a very hard problem. Depending on what's in your dataset, it may be AI-complete.

You'll need your program to know or learn that add, plus and sum refer to the same concept, while multiplied refers to a different one. You may be able to do this by measuring the distance between the words' synsets in WordNet/FrameNet, though your distance calculation will have to be quite refined if you don't want multiplied to match as well. Otherwise, you may want to manually establish some word-concept mappings (such as {'add': 'addition', 'plus': 'addition', 'sum': 'addition', 'times': 'multiplication'}).
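A minimal sketch of the manual word-concept mapping idea. The concept labels and table entries below are illustrative, not a fixed vocabulary:

```python
# Hand-built word -> concept table (illustrative entries only).
CONCEPTS = {
    'add': 'addition', 'plus': 'addition', 'sum': 'addition',
    'times': 'multiplication', 'multiplied': 'multiplication',
}

def to_concepts(sentence):
    """Replace every known word with its concept label; keep the rest."""
    return [CONCEPTS.get(w, w) for w in sentence.lower().split()]

# Both phrasings now share the token 'addition':
#   to_concepts("add (x) to (y) giving (z)")
#   to_concepts("the sum of (x) and (y) is (z)")
```

Once both sentences are mapped into concept space, a simple token overlap already separates the addition sentences from the multiplication one.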

If you want full sentence semantics, you will in addition have to parse the sentences and derive the meaning from the parse trees/dependency graphs. The Stanford parser is a popular choice for parsing.

You can also find inspiration for this problem in Question Answering research. There, a common approach is to parse sentences, then store fragments of the parse tree in an index and search for them by common search engines techniques (e.g. tf-idf, as implemented in Lucene). That will also give you a score for each sentence.
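A toy tf-idf plus cosine-similarity scorer, in pure Python, standing in for what a real engine like Lucene does at scale. The corpus is the question's database, and the query is assumed to have already been synonym-normalized ("add ... giving" rewritten to "sum ... equals"):

```python
import math
from collections import Counter

corpus = [
    "the sum of x and y is z",             # 1
    "x plus y equals z",                   # 2
    "x multiplied by y does not equal z",  # 3
    "z is the sum of x and y",             # 4
]

def tfidf(text):
    words = text.split()
    tf = Counter(words)
    n = len(corpus)
    # Smoothed idf so words unseen in the corpus don't divide by zero.
    return {w: (tf[w] / len(words)) *
               math.log((1 + n) / (1 + sum(w in d.split() for d in corpus)))
            for w in tf}

def cosine(a, b):
    dot = sum(wa * b.get(w, 0.0) for w, wa in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = tfidf("the sum of x and y equals z")
scores = [cosine(query, tfidf(d)) for d in corpus]
# Sentence 3 shares only x, y, z with the query, and those occur in every
# document, so their idf (and sentence 3's score) is zero.
```

Note how idf does part of the stopword filter's job for free: x, y and z appear everywhere, so they carry no weight.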


You will need to stem the words in your sentences down to a common synonym, then compare those stems and use the ratio of stem matches in a sentence (e.g. 5 out of 10 words) against some threshold to decide whether the sentence is a match, for example all sentences with a word match of over 80% (or whatever percentage you deem accurate). At least, that's one way to do it.
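A rough sketch of that stem-and-ratio idea. The suffix stripper and synonym table are toy stand-ins (a real system might use a Porter stemmer and WordNet):

```python
# Illustrative synonym table: stems that should count as the same word.
SYNONYMS = {'add': 'sum', 'plus': 'sum'}

def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[:-len(suffix)]
    return word

def normalize(sentence):
    stems = [stem(w) for w in sentence.lower().split()]
    return {SYNONYMS.get(s, s) for s in stems}

def match_ratio(query, candidate):
    """Fraction of the query's normalized stems found in the candidate."""
    q = normalize(query)
    return len(q & normalize(candidate)) / len(q)

# match_ratio("add (x) to (y) giving (z)", "(x) plus (y) equals (z)")
# scores higher than the same query against the "multiplied" sentence.
```

You would then keep every database sentence whose ratio clears your chosen threshold, sorted by the ratio itself as the match weight.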


Write a function which creates some kind of hash, or "expression", from a sentence, which can easily be compared with other sentences' hashes.

E.g.:
1. "the sum of (x) and (y) is (z)" => x + y = z
4. "(z) is the sum of (x) and (y)" => z = x + y

Some tips for the transformation: omit "the", convert two-word terms into a single word ("sum of" => "sumof"), find the operator word and replace "and" with it.
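Those tips can be sketched as a small normalizer. The stopword, prefix-operator and infix-operator tables are illustrative and would need to grow with the vocabulary:

```python
PREFIX = {'sumof': '+'}             # "sumof x and y" -> x + y
INFIX = {'plus': '+', 'times': '*'}  # operator sits between operands

def to_expression(sentence):
    s = sentence.lower().replace('sum of', 'sumof')
    out, pending = [], None
    for w in s.split():
        if w == 'the':
            continue                 # omit articles
        if w in PREFIX:
            pending = PREFIX[w]      # remember operator, drop the word
        elif w in INFIX:
            out.append(INFIX[w])
        elif w == 'and' and pending:
            out.append(pending)      # "and" joins the two operands
        elif w in ('is', 'equals'):
            out.append('=')
        else:
            out.append(w)
    return ' '.join(out)

# to_expression("the sum of (x) and (y) is (z)")  -> "(x) + (y) = (z)"
# to_expression("(z) is the sum of (x) and (y)")  -> "(z) = (x) + (y)"
```

Note that sentences 1 and 4 still normalize to different strings, so the comparison step would additionally need to know that = is symmetric.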


Not that easy ^^ You should use a stopword filter first, to get the non-information-bearing words out. Here are some good ones

Then you want to handle synonyms. That's actually a really complex topic, because you need some kind of word sense disambiguation (WSD) to do it properly, and most state-of-the-art methods are only slightly better than the simplest solution: just take the most frequently used sense of each word. You can do that with WordNet. It gives you synsets for a word, each containing all of its synonyms. You can then also generalize the word (to its hypernym) and replace the search term with that.

Just to say it: handling synonyms is pretty hard in NLP. If you only want to handle different word forms, like add and adding for example, a stemmer will do, but no stemmer will get you from add to sum (WSD is the only way there).
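To make the synset idea concrete without pulling in WordNet itself (NLTK's `nltk.corpus.wordnet` needs a corpus download), here is a tiny hand-built stand-in: each word maps to one synset, and every member of a synset normalizes to a single canonical lemma. Real WordNet has many senses per word, which is exactly why WSD is needed:

```python
# Toy synsets, illustrative only; real WordNet is sense-ambiguous.
SYNSETS = [
    {'add', 'sum', 'plus', 'total'},
    {'multiply', 'times'},
]
# Pick one canonical lemma per synset (alphabetically first, arbitrarily).
CANONICAL = {w: sorted(s)[0] for s in SYNSETS for w in s}

def canonical(word):
    """Collapse a word onto its synset's canonical lemma."""
    return CANONICAL.get(word, word)

# canonical('add') == canonical('sum') -- no stemmer could do this.
```

With this normalization in place, add and sum finally compare equal, which is what the stemmer could never achieve.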

And then you have different word orderings in your sentences, which shouldn't be ignored either if you want exact answers (x + y = z is different from x + z = y). So you also need word dependencies, so you can see which words depend on each other. The Stanford Parser is actually the best tool for that task if you're working with English.

Perhaps you should just extract the nouns and verbs from a sentence, do all the preprocessing on them, and query your search index for the dependencies. A dependency would look like

x (sum, y)
y (sum, x)
sum (x, y)

which you could use for your search.
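A sketch of indexing such dependency triples and searching over them. The triples are hand-written here; in practice a parser (e.g. the Stanford Parser) would produce them from the raw sentences:

```python
from collections import defaultdict

# Inverted index: dependency triple -> set of sentence ids.
index = defaultdict(set)

def add_sentence(sent_id, triples):
    for head, deps in triples:
        # Sort dependents so sum(x, y) and sum(y, x) share one key.
        index[(head,) + tuple(sorted(deps))].add(sent_id)

# Hand-written triples for sentences 1, 3 and 4 from the question.
add_sentence(1, [('sum', ('x', 'y')), ('equal', ('sum', 'z'))])
add_sentence(3, [('multiply', ('x', 'y')), ('not-equal', ('multiply', 'z'))])
add_sentence(4, [('sum', ('x', 'y')), ('equal', ('z', 'sum'))])

def search(triples):
    """Score each stored sentence by how many query triples it shares."""
    scores = defaultdict(int)
    for head, deps in triples:
        for sid in index.get((head,) + tuple(sorted(deps)), ()):
            scores[sid] += 1
    return dict(scores)

# search([('sum', ('x', 'y'))]) finds sentences 1 and 4, but not 3.
```

The per-sentence counts double as the match weight the question asks for; sorting the triple's dependents is a design choice that makes = symmetric, which may or may not be what you want for non-commutative relations.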

So you need to tokenize, generalize, get dependencies, and filter unimportant words to get your result. And if you want to do this in German, you'll need a word decompounder as well.
