Most efficient way to generate a list of Unigrams from a text field in MongoDB

I need to generate a vector of unigrams, i.e. a vector of all the unique words which appear in a specific text field that I have stored as part of a broader JSON object in MongoDB.

I'm not sure what the easiest and most efficient way to generate this vector is. I was thinking of writing a simple Java app to handle the tokenization (using something like OpenNLP), but a better approach may be to tackle this with Mongo's Map-Reduce feature. However, I'm not sure how to go about this.

Another option would be to use Apache Lucene indexing, but I'd still need to export the data document by document, which is really the same issue I would have with the custom Java or Ruby approach.

Map-Reduce sounds good, but the Mongo data is growing by the day as more documents are inserted. This isn't a one-off task: new documents are added all the time, while updates are very rare. I really don't want to run a Map-Reduce over millions of documents every time I want to update my unigram vector, as I fear this would be a very inefficient use of resources.

What would be the most efficient way to generate the unigram vector and then keep it updated?

Thanks!


Since you have not provided a sample document (object) format, take the following as a sample collection called 'stories'.

{ "_id" : ObjectId("4eafd693627b738f69f8f1e3"), "body" : "There was a king", "author" : "tom" }
{ "_id" : ObjectId("4eafd69c627b738f69f8f1e4"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd72c627b738f69f8f1e5"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd74e627b738f69f8f1e6"), "body" : "There was a jack", "author" : "tom" }
{ "_id" : ObjectId("4eafd785627b738f69f8f1e7"), "body" : "There was a humpty and dumpty . Humtpy was tall . Dumpty was short .", "author" : "jane" }
{ "_id" : ObjectId("4eafd7cc627b738f69f8f1e8"), "body" : "There was a cat called Mini . Mini was clever cat . ", "author" : "jane" }

For the given dataset, you can use the following JavaScript code to get to your solution. The collection "authors_unigrams" contains the result. All the code is meant to be run from the mongo shell (http://www.mongodb.org/display/DOCS/mongo+-+The+Interactive+Shell).

First, we need to mark all the documents that are new in the 'stories' collection. We do it using the following command. It adds a new attribute called "mr_status" to each document that does not have one yet and assigns it the value "inprocess". Later, we will see that the map-reduce operation only takes into account documents whose "mr_status" field has the value "inprocess". This way, documents that were already handled in a previous run are not processed again, which keeps the operation efficient, as asked.

db.stories.update({mr_status:{$exists:false}},{$set:{mr_status:"inprocess"}},false,true);

Second, we define the map() and reduce() functions.

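var map = function() {
    // Return the distinct whitespace-separated tokens of a string, in first-seen order.
    var uniqueWords = function (words) {
        var arrWords = words.split(/\s+/);
        var arrNewWords = [];
        var seenWords = {};
        for (var i = 0; i < arrWords.length; i++) {
            var word = arrWords[i];
            // Skip empty tokens (leading/trailing whitespace) and words already seen.
            if (word !== "" && !seenWords.hasOwnProperty(word)) {
                seenWords[word] = true;
                arrNewWords.push(word);
            }
        }
        return arrNewWords;
    };
    // Emit one value per story, keyed by author.
    emit(this.author, {unigrams: uniqueWords(this.body)});
};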
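
Note that map() treats "There" and "there" as different words and keeps "." as a token. If that is not what you want, the tokenizer can be made stricter. The following is only a rough sketch in plain JavaScript (runnable outside the database, e.g. in Node.js); the helper name normalizedUnigrams is made up for illustration, and the pattern only handles ASCII letters, digits and apostrophes, which is an assumption about your data.

```javascript
// Hypothetical helper (not part of the answer's map()): lowercases the text,
// treats every run of characters other than ASCII letters, digits and
// apostrophes as a separator, and returns the unique tokens in first-seen order.
function normalizedUnigrams(text) {
    var tokens = text.toLowerCase().split(/[^a-z0-9']+/);
    var seen = {};
    var result = [];
    for (var i = 0; i < tokens.length; i++) {
        var token = tokens[i];
        if (token.length === 0) { continue; } // produced by leading/trailing separators
        if (!Object.prototype.hasOwnProperty.call(seen, token)) {
            seen[token] = true;
            result.push(token);
        }
    }
    return result;
}

var words = normalizedUnigrams("There was a cat called Mini . Mini was clever cat . ");
// words: ["there", "was", "a", "cat", "called", "mini", "clever"]
```

If you go this way, the body of such a helper would have to be defined inside map(), as uniqueWords is, because map() runs on the server.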

var reduce = function(key, values) {
    // Append to target every element of source that target does not already contain.
    var uniqueMerge = function (target, source) {
        for (var i = 0; i < source.length; i++) {
            if (target.indexOf(source[i]) === -1) {
                target.push(source[i]);
            }
        }
        return target;
    };

    var unigrams = [];
    values.forEach(function (value) {
        unigrams = uniqueMerge(unigrams, value.unigrams);
    });
    return {unigrams: unigrams};
};

Third, we actually run the map-reduce function.

var result  = db.stories.mapReduce( map,
                                  reduce,
                                  {query:{author:{$exists:true},mr_status:"inprocess"},
                                   out: {reduce:"authors_unigrams"}
                                  });
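
The out: {reduce: ...} option is what makes the incremental runs work: when a key already exists in "authors_unigrams", the stored value and the new value are passed through reduce() again. That is only safe because merging word lists as a set union gives the same result whether it is done in one go or in several steps. Here is a small stand-alone sketch of that property in plain JavaScript; mergeUnigrams is a hypothetical stand-in for the merge logic, not the reduce() function itself.

```javascript
// Hypothetical stand-in for the merge done by reduce(): builds the union of
// the "unigrams" arrays in values, keeping first-seen order.
function mergeUnigrams(values) {
    var merged = [];
    var seen = {};
    values.forEach(function (value) {
        value.unigrams.forEach(function (word) {
            if (!Object.prototype.hasOwnProperty.call(seen, word)) {
                seen[word] = true;
                merged.push(word);
            }
        });
    });
    return { unigrams: merged };
}

// Values as map() would emit them for two of tom's stories.
var first = { unigrams: ["There", "was", "a", "king"] };
var second = { unigrams: ["There", "was", "a", "queen"] };

// One run over both stories.
var allAtOnce = mergeUnigrams([first, second]);

// Two runs: the result of the first run is stored, then merged with the
// value emitted in the second run.
var stored = mergeUnigrams([first]);
var incremental = mergeUnigrams([stored, second]);

// Both give unigrams ["There", "was", "a", "king", "queen"].
```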

Fourth, we mark all the documents that were considered in the last map-reduce run as processed by setting "mr_status" to "processed".

db.stories.update({mr_status:"inprocess"},{$set:{mr_status:"processed"}},false,true);
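
Since this has to be repeated whenever new documents have arrived, the steps can be wrapped in one function and run periodically. This is just a sketch: the function name refreshUnigrams is made up, and it assumes it is given a collection object such as db.stories together with the map and reduce functions defined above.

```javascript
// Hypothetical wrapper around the steps above. "stories" is expected to be a
// collection object such as db.stories; mapFn and reduceFn are the map and
// reduce functions.
function refreshUnigrams(stories, mapFn, reduceFn) {
    // Tag documents that have never been processed.
    stories.update({ mr_status: { $exists: false } },
                   { $set: { mr_status: "inprocess" } }, false, true);

    // Fold only the tagged documents into the existing results.
    var result = stories.mapReduce(mapFn, reduceFn, {
        query: { author: { $exists: true }, mr_status: "inprocess" },
        out: { reduce: "authors_unigrams" }
    });

    // Mark the batch as done so that the next run skips it.
    stories.update({ mr_status: "inprocess" },
                   { $set: { mr_status: "processed" } }, false, true);

    return result;
}
```

In the shell it would be called as refreshUnigrams(db.stories, map, reduce). If the map-reduce step throws, the documents stay tagged "inprocess" and are picked up again by the next call; merging them twice is harmless because the merge is a set union.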

Optionally, you can inspect the result collection "authors_unigrams" by running the following command.

db.authors_unigrams.find();
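
The result holds one word list per author, in documents of the form { _id: <author>, value: { unigrams: [...] } }. If you need a single vector for the whole collection, the per-author lists can be merged on the client. A minimal sketch in plain JavaScript follows; the helper name globalUnigrams is made up, and the rows are hand-written, shortened stand-ins for what db.authors_unigrams.find().toArray() would return, not actual output.

```javascript
// Hypothetical helper: rows are documents shaped like the map-reduce output,
// { _id: <author>, value: { unigrams: [...] } }. Returns the union of all
// word lists in first-seen order.
function globalUnigrams(rows) {
    var seen = {};
    var vector = [];
    rows.forEach(function (row) {
        row.value.unigrams.forEach(function (word) {
            if (!Object.prototype.hasOwnProperty.call(seen, word)) {
                seen[word] = true;
                vector.push(word);
            }
        });
    });
    return vector;
}

// Hand-written sample rows (shortened), standing in for the real result documents.
var rows = [
    { _id: "tom", value: { unigrams: ["There", "was", "a", "king", "queen", "jack"] } },
    { _id: "jane", value: { unigrams: ["There", "was", "a", "cat", "called", "Mini"] } }
];
var vector = globalUnigrams(rows);
// vector: ["There", "was", "a", "king", "queen", "jack", "cat", "called", "Mini"]
```

Alternatively, emitting a constant key instead of this.author in map() would produce one list for the whole collection directly.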