Bayesian filter to mark duplicate items
I collect news for certain topics and then run a Bayesian classifier on them to mark them as interesting or non-interesting. I see that there are different articles which are essentially the same news, e.g.:
- Ben Kingsley visits Taj Mahal with wife
- Kingsley romances wife in Taj's lawns
How do I teach the system to mark all these as duplicates?
Thanks Sanjay
Interesting idea. I would guess this has been studied before; a look in some comp-sci journal should turn up a few good pointers. That said, here are a few ideas I have:
Method
You could find the most-unique key phrases and see how well they match the key phrases of the other articles. I would imagine the data published by Google on the frequency of phrases on the web would give you a baseline.
You somehow need to pick up on the fact that "in the" is a very common phrase but "Kingsley visits" is important. Once you have filtered down all the text to just the key phrases you could see how many of them match.
key phrases:
- set of all verbs, nouns, names, and novel (new/mis-spelt) words
- you could grab phrases that are, say, between one and five words long
- remove all that are very common (you could have a classifier for common phrases)
- see how many of them match between articles.
- have a controllable slider to set the matching threshold (a rough sketch of this pipeline follows below)
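Here is a minimal sketch in Python of what that pipeline could look like. The common-phrase set is a hypothetical stand-in for real frequency data such as the Google n-gram counts, and the default threshold is arbitrary:

```
import re
from itertools import combinations

# Hypothetical stand-in for a "too common" list; in practice this would be
# built from corpus statistics such as the Google n-gram frequency data.
COMMON_PHRASES = {"in", "the", "with", "a", "of", "ben", "in the", "with wife"}

def key_phrases(text, max_len=5):
    """Collect all 1..max_len word n-grams, minus the very common ones."""
    words = re.findall(r"[a-z']+", text.lower())
    phrases = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase not in COMMON_PHRASES:
                phrases.add(phrase)
    return phrases

def similarity(a, b):
    """Fraction of key phrases shared between two texts (Jaccard)."""
    pa, pb = key_phrases(a), key_phrases(b)
    if not pa or not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)

def mark_duplicates(articles, threshold=0.2):
    """The 'controllable slider': pairs scoring above the threshold
    are marked as duplicates."""
    return [(i, j) for (i, a), (j, b) in combinations(enumerate(articles), 2)
            if similarity(a, b) >= threshold]
```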
It's not going to be easy if you write this yourself, but I would say it's a very interesting problem area.
Example
Let's just use the titles and follow the method through by hand.
The title "Ben Kingsley visits Taj Mahal with wife" will create the following keywords:
- Ben Kingsley
- Kingsley
- Kingsley visits
- wife
- Mahal
- ... etc ...
but phrases like these should be removed, as they are too common (and hence don't help to uniquely identify the content):
- Ben
- with wife
Once the same is done with the other title, "Kingsley romances wife in Taj's lawns", you can compare the two sets and find that quite a few key phrases match. Hence the articles are on the same subject.
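Running the sketch above on the two example titles (reusing key_phrases and similarity from the earlier block) might look like this:

```
a = "Ben Kingsley visits Taj Mahal with wife"
b = "Kingsley romances wife in Taj's lawns"

print(key_phrases(a) & key_phrases(b))  # e.g. {'kingsley', 'wife'}
print(similarity(a, b))                 # compare against the slider threshold
```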
Even though this is already a large undertaking, there are many things you could do to improve the matching.
Extensions
These are all ways to trim the keyword set down once it is created.
WordNet would be a great place to start for getting a match between, say, "longer" and "extend". This would be useful because articles won't use the same lexicon for their writing.
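A sketch of the WordNet idea using NLTK (assuming nltk and its wordnet corpus are installed). Treating two words as related when their synsets overlap or sit close in the hypernym hierarchy is just one reasonable choice, and the similarity cutoff here is arbitrary:

```
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def related(word_a, word_b, min_similarity=0.3):
    """True if the words share a synset or have two senses that are
    close together in the WordNet hierarchy."""
    syns_a, syns_b = wn.synsets(word_a), wn.synsets(word_b)
    if set(syns_a) & set(syns_b):  # direct synonym overlap
        return True
    for sa in syns_a:
        for sb in syns_b:
            sim = sa.path_similarity(sb)  # None when no path exists
            if sim is not None and sim >= min_similarity:
                return True
    return False

# True if WordNet links the senses closely enough (depends on the senses)
print(related("extend", "lengthen"))
```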
You could run a Bayesian classifier on what counts as a key phrase. It could be trained on the set of all matching/non-matching articles and their key phrases. You would have to be careful about how you deal with unseen phrases, as these are likely to be the most important things you come across. It might even be better to run it on what isn't a key phrase.
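One way that classifier might look, sketched with scikit-learn (an assumption; no particular library is implied above). The labeled phrases here are hypothetical; in practice they would come from your matching/non-matching articles, and the smoothing (alpha) is what keeps unseen phrases from breaking the model:

```
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = key phrase, 0 = too common
phrases = ["kingsley visits", "taj mahal", "romances wife",
           "in the", "with a", "of the"]
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # words and word pairs as features
    MultinomialNB(alpha=1.0),             # Laplace smoothing for unseen words
)
model.fit(phrases, labels)
print(model.predict(["ben kingsley", "on the"]))  # e.g. [1 0]
```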
It might even be an idea to calculate the Levenshtein distance between some of the key phrases if nothing else found a match. I'm guessing it is likely that there will always be some matches found.
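For the Levenshtein fallback, a standard dynamic-programming implementation; nothing here is specific to this problem:

```
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kingsley", "kingsly"))  # 1
```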
I have a feeling that this is one of those things where a very good answer will get you a PhD. Then again, I suppose it has already been done before (Google must have some automatic way to scrape all those news sites and fit them into categories and similar articles).
good luck with it.
This is a classification problem, but a harder one given the number of distinct classes you will have. One option is to reduce the size of each document using feature selection (more info). Feature selection involves selecting the top n terms (excluding stop words, and possibly applying stemming to each word as well): for each document, calculate the mutual information (more info) of each term, order the terms by that score, and keep the top n. This reduced feature set of top n terms per document can then form the basis of your duplicate detection (for example, flag any pair of documents with more than x% of terms in common, with x again tuned through backtesting).
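A sketch of the reduce-then-compare idea. As a stand-in for the per-document mutual-information scoring (which needs labeled data), this version ranks terms by tf-idf; the top-n set per document is then compared pairwise, with more than x% overlap flagging a duplicate:

```
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms(docs, n=20):
    """Top-n terms per document (tf-idf as a proxy for the MI scoring)."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)
    terms = vec.get_feature_names_out()  # needs scikit-learn >= 1.0
    tops = []
    for row in tfidf:
        scores = row.toarray().ravel()
        best = scores.argsort()[::-1][:n]
        tops.append({terms[i] for i in best if scores[i] > 0})
    return tops

def duplicates(docs, n=20, x=0.5):
    """Pairs whose reduced term sets overlap by more than x."""
    tops = top_terms(docs, n)
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if tops[i] and tops[j]
            and len(tops[i] & tops[j]) / min(len(tops[i]), len(tops[j])) > x]
```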
Most of this is covered in this free book on information retrieval.