Create a Occurrence Vector Using Apache Lucene
We are developing an application to detect plagiarism. We are using Apache lucene for document indexing. I have a need to create an开发者_C百科 occurrence vector for each document using the index we created. I would like to know whether there is a way to do this using apache lucene. I tried to use TermFreqVectors but I couldn't find a proper way. Any suggestion or help is highly appreciated.
Thanks.
The TermFreqVector
class does what you'd like, I think. It can even give you term positions so that you can detect ordered sequences of words. To generate the vector, you need to specify this at indexing time like this:
String text = "text you want to index; you could also use a Reader here";
Document doc = new Document();
doc.add(new Field("text", text, Store.NO, Index.ANALYZED, TermVector.WITH_POSITIONS));
At retrieval time, you can run phrase queries (e.g, "a b c"~25) or SpanQuery
s (which you have to construct programmatically).
To get term frequency and position information from the index, do something like this:
TermPositionVector v = (TermPositionVector) this.reader.getTermFreqVector(docnum, this.textField);
int wordIndex = v.indexOf("want");
int[] positions = v.getTermPositions(wordIndex); // should return the position(s) of the word "want" in your text
If you want to achieve this you could use a RAMDirectory to store your document (assuming you only want to do this for one document). Then you can use IndexReader.termDocs(Term term) to fetch the TermDocs for this directory, containing the document id (only one if you store one doc) and the frequency of the term in the document. You can then do this for each term to create your occurance vector.
You could off course also do this for more than one document and create multiple occurance vectors at once.
http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/index/IndexReader.html
As I'm sure you are looking to find similarities in documents => similar documents, you might want to have a look on the MoreLikeThis implementation of Lucene: http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/search/similar/MoreLikeThis.html
精彩评论