Classifier4J with compound words
I'm using the BayesianClassifier class to classify spam. The problem is that compound words aren't being recognized.
For instance, if I add "led zeppelin" as a match, a sentence containing it isn't recognized as a match even though it should be.
To add a match I'm using addMatch() of SimpleWordsDataSource,
and to check for a match I'm using isMatch() of BayesianClassifier.
Any ideas on how to fix this?
Ok, thanks for the insight. I'm attaching more source code.
SimpleWordsDataSource wds = new SimpleWordsDataSource();
BayesianClassifier classifier = new BayesianClassifier(wds);
wds.addMatch("queen");
wds.addMatch("led zeppelin");
wds.addMatch("the beatles");
classifier.isMatch("i listen to queen");        // recognized as a match
classifier.isMatch("i listen to led zeppelin"); // NOT recognized as a match
classifier.isMatch("i listen to the beatles");  // NOT recognized as a match
Now I'm using the teachMatch method of BayesianClassifier and I get different results. A sentence containing "led zeppelin" is classified as a match, which is correct. But a sentence containing only "led" is also classified as a match, which is wrong.
Here's the relevant code:
BayesianClassifier classifier = new BayesianClassifier();
classifier.teachMatch("led zeppelin");
classifier.isMatch("I listen to led zeppelin"); // true
classifier.isMatch("I listen to led");          // true
(I wrote classifier4j)
You need to train it with more data.
Bayesian classifiers work by creating statistical models of what is considered a match and what isn't.
If you give it enough data, it will learn that "led" and "zeppelin" together are a match, but "led" by itself isn't.
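To illustrate why more training data helps, here is a minimal toy sketch of a word-level Bayesian classifier. It is not Classifier4J's actual code: the ToyBayes class, the 0.5 neutral probability for unseen words, the 0.01/0.99 clamp and the 0.9 cutoff are all assumptions chosen for illustration. With a single teachMatch-style example, "led" looks like a near-certain match on its own; once a non-matching example containing "led" is added, only "zeppelin" still carries strong evidence.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model for illustration only; NOT Classifier4J's implementation.
class ToyBayes {
    // word -> {times seen in matches, times seen in non-matches}
    private final Map<String, int[]> counts = new HashMap<>();

    // Record every word of the text as evidence for or against a match.
    void teach(String text, boolean isMatch) {
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            int[] c = counts.computeIfAbsent(w, k -> new int[2]);
            c[isMatch ? 0 : 1]++;
        }
    }

    // Per-word match probability; unseen words are neutral (assumed 0.5).
    double wordProbability(String w) {
        int[] c = counts.get(w);
        if (c == null) return 0.5;
        double p = (double) c[0] / (c[0] + c[1]);
        return Math.min(0.99, Math.max(0.01, p)); // assumed clamp
    }

    // Combine word probabilities: prod(p) / (prod(p) + prod(1 - p)).
    double classify(String text) {
        double pMatch = 1.0, pNon = 1.0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            double p = wordProbability(w);
            pMatch *= p;
            pNon *= 1.0 - p;
        }
        return pMatch / (pMatch + pNon);
    }

    boolean isMatch(String text) {
        return classify(text) >= 0.9; // assumed cutoff
    }

    public static void main(String[] args) {
        ToyBayes c = new ToyBayes();
        c.teach("led zeppelin", true);
        System.out.println(c.isMatch("I listen to led"));          // true

        // One counter-example is enough to make "led" ambiguous.
        c.teach("the led display is bright", false);
        System.out.println(c.isMatch("I listen to led"));          // false
        System.out.println(c.isMatch("I listen to led zeppelin")); // true
    }
}
```

In practice you would teach many matching and non-matching sentences rather than a single counter-example, so that the per-word statistics reflect real usage.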