Trying to get a count of each word in a MongoDB field: is this a job for MapReduce?

I've got a collection with a bunch of posts in it, each with a body field. For example:

posts = [ { id: 0, body: "foo bar baz", otherstuff: {...} },
          { id: 1, body: "baz bar oof", otherstuff: {...} },
          { id: 2, body: "baz foo oof", otherstuff: {...} }
        ];

I'd like to figure out how to loop through each document in the collection and carry a count of each word in each post body.

post_word_frequency = { foo: 2,
                        bar: 2,
                        baz: 3,
                        oof: 2
                      };

I've never used MapReduce and I'm still really new to MongoDB, but I'm looking at the documentation at http://cookbook.mongodb.org/patterns/unique_items_map_reduce/

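map = function() {
    // `this` is the current document; split its body into words
    var words = this.body.split(' ');
    for (var i = 0; i < words.length; i++) {
       // emit the word itself as the key, with a partial count of 1
       emit(words[i], {count: 1});
    }
};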

reduce = function(key, values) {
     var count = 0;
     values.forEach(function(v) {
          count += v['count'];
     });
     return {count: count};
};

db.posts.mapReduce(map, reduce, {out: post_word_frequency});

As a bit of an added difficulty, I'm doing it in node.js (with node-mongodb-native, though I'm willing to switch to do the reduce query if there's an easier way).

    var db = new Db('mydb', new Server('localhost', 27017, {}), {native_parser:false});
    db.open(function(err, db){
            db.collection('posts', function(err, col) {
                db.col.mapReduce(map, reduce, {out: post_word_frequency});
            });
    });

So far, I'm having difficulty in that node's telling me ReferenceError: post_word_frequency is not defined (I tried creating it in the shell, but that still didn't help).

So has anyone done a mapreduce with node.js? Is this the wrong use for map reduce? Maybe another way to do it? (perhaps just loop and upsert into another collection?)

Thanks for any feedback and advice! :)

EDIT: Ryanos below was correct (thanks!). One thing that was missing from my MongoDB-based solution was finding the collection and converting it to an array.

 db.open(function(err, db){
    db.collection('posts', function(err, col) {
            col.find({}).toArray(function(err, posts){    // this line creates the 'posts' array as needed by the MAPreduce functions.
                    var words= _.flatten(_.map(posts, function(val) {


There's a bug with {out: post_word_frequency}: you probably want {out: "post_word_frequency"} (a string rather than an undefined variable), but it should work without the out option.
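To check the map and reduce logic without a running MongoDB, here is a minimal sketch that runs the two functions locally. The runLocally harness and its stand-in emit are hypothetical helpers written for this illustration; they are not part of MongoDB or of any driver, and they only approximate what the server does (group emitted values by key, then reduce each group).

```javascript
// Map step: emit each word as the key with a partial count of 1.
// Inside MongoDB, `this` is the current document and `emit` is supplied by the server.
function map() {
    var words = this.body.split(' ');
    for (var i = 0; i < words.length; i++) {
        emit(words[i], { count: 1 });
    }
}

// Reduce step: sum the partial counts emitted for one word.
function reduce(key, values) {
    var count = 0;
    values.forEach(function (v) {
        count += v.count;
    });
    return { count: count };
}

// Hypothetical local harness (an assumption for illustration, not a MongoDB API).
function runLocally(docs) {
    var groups = Object.create(null); // word -> array of emitted values

    // Stand-in for the server-provided emit().
    globalThis.emit = function (key, value) {
        if (!groups[key]) {
            groups[key] = [];
        }
        groups[key].push(value);
    };

    docs.forEach(function (doc) {
        map.call(doc); // bind `this` to the document, as the server would
    });
    delete globalThis.emit;

    var result = {};
    Object.keys(groups).forEach(function (key) {
        result[key] = reduce(key, groups[key]).count;
    });
    return result;
}

var samplePosts = [
    { id: 0, body: "foo bar baz" },
    { id: 1, body: "baz bar oof" },
    { id: 2, body: "baz foo oof" }
];
console.log(runLocally(samplePosts));
```

With the three sample posts from the question this gives foo: 2, bar: 2, baz: 3, oof: 2. Against a real server the same map and reduce would be passed to mapReduce with the output collection named as a string, e.g. {out: "post_word_frequency"}.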

With Underscore this can be done simply, once the posts have been loaded into an array:

/*
  [{"word": "foo", "count": 1}, ...]
*/
var words = _.flatten(_.map(posts, function(val) {
    return _.map(val.body.split(" "), function(val) {
        return {"word": val, "count": 1};
    });
}));

/*
  {
    "foo": n, ...
  }
*/
var count = _.reduce(words, function(memo, val) {
    if (_.isNaN(++memo[val.word])) {
        memo[val.word] = 1;
    }
    return memo;
}, {});
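For comparison, the same tally can be sketched in plain JavaScript without Underscore. This is only an illustrative variant, not the approach given above: countWords is a hypothetical name, and it assumes posts is an array of objects that each have a string body, split on whitespace.

```javascript
// Count how often each word occurs across all post bodies.
// Assumes `posts` is an array of objects with a string `body` field.
function countWords(posts) {
    var counts = {};
    posts.forEach(function (post) {
        post.body.split(/\s+/).forEach(function (word) {
            if (word === '') {
                return; // skip empty tokens from leading/trailing whitespace
            }
            // hasOwnProperty avoids picking up inherited keys such as "constructor"
            if (Object.prototype.hasOwnProperty.call(counts, word)) {
                counts[word] += 1;
            } else {
                counts[word] = 1;
            }
        });
    });
    return counts;
}

var examplePosts = [
    { id: 0, body: "foo bar baz" },
    { id: 1, body: "baz bar oof" },
    { id: 2, body: "baz foo oof" }
];
console.log(countWords(examplePosts)); // { foo: 2, bar: 2, baz: 3, oof: 2 }
```

Like the Underscore version, this runs in the application process, so every post has to be fetched from the collection first; for a large collection that trade-off is worth weighing against doing the work in the database.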


Underscore functions used: _.reduce, _.map, _.isNaN, _.flatten
