
classifier4J with compound words

I'm using the BayesianClassifier class to classify spam. The problem is that compound words aren't being recognized.

For instance, if I add led zeppelin as a match, a sentence containing it won't be recognized as a match even though it should.

To add a match I'm using addMatch() of SimpleWordsDataSource.

To test for a match I'm using isMatch() of BayesianClassifier.

Any ideas on how to fix this?


Ok, thanks for the insight. I'm attaching more source code.

SimpleWordsDataSource wds = new SimpleWordsDataSource();
BayesianClassifier classifier = new BayesianClassifier(wds);

wds.addMatch("queen");
wds.addMatch("led zeppelin");
wds.addMatch("the beatles");

classifier.isMatch("i listen to queen");        // recognized as a match
classifier.isMatch("i listen to led zeppelin"); // NOT recognized as a match
classifier.isMatch("i listen to the beatles");  // NOT recognized as a match

Now I'm using the teachMatch method of BayesianClassifier and I get different results. A sentence containing led zeppelin is classified as a match, which is correct. But a sentence containing only led is also classified as a match, which is wrong.

Here's the relevant code:

BayesianClassifier classifier = new BayesianClassifier();
classifier.teachMatch("led zeppelin");
classifier.isMatch("I listen to led zeppelin"); // true
classifier.isMatch("I listen to led");          // true


(I wrote classifier4j)

You need to train it with more data.

Bayesian classifiers work by creating statistical models of what is considered a match and what isn't.

If you give it enough data, it will learn that "led" and "zeppelin" together are a match, but "led" by itself isn't.
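
To illustrate why more training data helps, here is a minimal, self-contained sketch of a word-level Bayesian classifier. It is not Classifier4J's actual implementation: the TinyBayes class, the 0.9 cutoff, the 0.5 score for unknown words, and the clamping to [0.01, 0.99] are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy classifier, for illustration only (not the Classifier4J API).
class TinyBayes {
    // Per-word counts: index 0 = seen in matches, index 1 = seen in non-matches.
    private final Map<String, int[]> counts = new HashMap<>();

    void teachMatch(String text) { teach(text, 0); }

    void teachNonMatch(String text) { teach(text, 1); }

    private void teach(String text, int slot) {
        for (String word : tokenize(text)) {
            counts.computeIfAbsent(word, k -> new int[2])[slot]++;
        }
    }

    // Text is always split into single words, so a phrase is never one unit.
    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\W+");
    }

    // Fraction of a word's sightings that were matches; unknown words are neutral.
    private double wordProbability(String word) {
        int[] c = counts.get(word);
        if (c == null) {
            return 0.5; // assumption: unseen words carry no evidence either way
        }
        double p = (double) c[0] / (c[0] + c[1]);
        return Math.max(0.01, Math.min(0.99, p)); // assumption: avoid hard 0 or 1
    }

    // Combine the per-word probabilities into one score between 0 and 1.
    double classify(String text) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (String word : tokenize(text)) {
            double p = wordProbability(word);
            match *= p;
            nonMatch *= (1.0 - p);
        }
        return match / (match + nonMatch);
    }

    boolean isMatch(String text) {
        return classify(text) >= 0.9; // assumption: cutoff chosen for this sketch
    }
}
```

With only teachMatch("led zeppelin"), the word led has been seen exclusively in matches, so "I listen to led" scores high in this sketch. After also calling teachNonMatch on a couple of sentences that use led in another sense (for example "the led display is broken"), led alone no longer clears the cutoff, while led together with zeppelin still does.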
