Most efficient way to generate a list of Unigrams from a text field in MongoDB

2023-03-19 09:05 问答作者：

I need to generate a vector of u开发者_JAVA百科nigrams, i.e. a vector of all the unique words which appear in a specific text field that I have stored as part of a broader JSON object in MongoDB.

I'm not really sure what's the easiest and most efficient way to generate this vector. I was thinking of writing a simple Java app which could handle the tokenization (using something like OpenNLP), however I think that a better approach may be to try to tackle this using Mongo's Map-Reduce feature... However I'm not really sure how I could go about this.

Another option would be to use Apache Lucene indexing, but it would mean I'd still need to export this data in one by one. Which is really the same issue I would have with the custom Java or Ruby approach...

Map reduce sounds good however the Mongo data is growing by the day as more document are inserted. This isn't really a one off task as there are new documents being added all the time. Updates are very rare. I really don't want to run a Map-Reduce over the millions of documents every time I want to update my Unigram vector as I fear this will be very inefficient use of resources...

What would be the most efficient way to generate the unigram vector and then keep it updated?

Thanks!

Since you have not provided a sample document (object) format take this as a sample collection called 'stories'.

{ "_id" : ObjectId("4eafd693627b738f69f8f1e3"), "body" : "There was a king", "author" : "tom" }
{ "_id" : ObjectId("4eafd69c627b738f69f8f1e4"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd72c627b738f69f8f1e5"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd74e627b738f69f8f1e6"), "body" : "There was a jack", "author" : "tom" }
{ "_id" : ObjectId("4eafd785627b738f69f8f1e7"), "body" : "There was a humpty and dumpty . Humtpy was tall . Dumpty was short .", "author" : "jane" }
{ "_id" : ObjectId("4eafd7cc627b738f69f8f1e8"), "body" : "There was a cat called Mini . Mini was clever cat . ", "author" : "jane" }

For the given dataset, you can use the following javascript code to get to your solution. The collection "authors_unigrams" contains the result. All the code is supposed to be run using mongo console (http://www.mongodb.org/display/DOCS/mongo+-+The+Interactive+Shell).

First, we need to mark of all the new documents that have come afresh into the 'stories' collection. We do it using following command. It will add a new attribute called "mr_status" into each document and assign value "inprocess". Later, we will see that map-reduce operation will only take those documents in account which are having the value "inprocess" for the field "mr_status". This way, we can avoid reconsidering all the documents for map-reduce operation that have been already considered in any of the previous attempt, making the operation efficient as asked.

db.stories.update({mr_status:{$exists:false}},{$set:{mr_status:"inprocess"}},false,true);

Second, we define both map() and reduce() function.

var map = function() {
        uniqueWords = function (words){
            var arrWords = words.split(" ");
            var arrNewWords = [];
            var seenWords = {};
            for(var i=0;i<arrWords.length;i++) {
                if (!seenWords[arrWords[i]]) {
                    seenWords[arrWords[i]]=true;
                    arrNewWords.push(arrWords[i]);
                }
            }
            return arrNewWords;
        }
      var unigrams =  uniqueWords(this.body) ;
      emit(this.author, {unigrams:unigrams});
};

var reduce = function(key,values){

    Array.prototype.uniqueMerge = function( a ) {
        for ( var nonDuplicates = [], i = 0, l = a.length; i<l; ++i ) {
            if ( this.indexOf( a[i] ) === -1 ) {
                nonDuplicates.push( a[i] );
            }
        }
        return this.concat( nonDuplicates )
    };

    unigrams = [];
    values.forEach(function(i){
        unigrams = unigrams.uniqueMerge(i.unigrams);
    });
    return { unigrams:unigrams};
};

Third, we actually run the map-reduce function.

var result  = db.stories.mapReduce( map,
                                  reduce,
                                  {query:{author:{$exists:true},mr_status:"inprocess"},
                                   out: {reduce:"authors_unigrams"}
                                  });

Fourth, we mark all the records that have been considered for map-reduce in last run as processed by setting "mr_status" as "processed".

db.stories.update({mr_status:"inprocess"},{$set:{mr_status:"processed"}},false,true);

Optionally, you can see the result collection "authors_unigrams" by firing following command.

db.authors_unigrams.find();

继续阅读：lucene mapreduce mongodb opennlp

Most efficient way to generate a list of Unigrams from a text field in MongoDB

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？