Trying to get a count of each word in a MongoDB field: is this a job for MapReduce?
I've got a collection of posts, each with a body field. For example:
posts = [ { id: 0, body: "foo bar baz", otherstuff: {...} },
          { id: 1, body: "baz bar oof", otherstuff: {...} },
          { id: 2, body: "baz foo oof", otherstuff: {...} }
        ];
I'd like to figure out how to loop through each document in the collection and keep a running count of each word across all the post bodies.
post_word_frequency = [ { foo: 2 },
                        { bar: 2 },
                        { baz: 3 },
                        { oof: 2 }
                      ];
I've never used MapReduce and I'm still really new to MongoDB, but I've been looking at the documentation at http://cookbook.mongodb.org/patterns/unique_items_map_reduce/ and came up with this:
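map = function() {
    // Split the body on spaces and emit each word with a count of 1.
    var words = this.body.split(' ');
    for (var i = 0; i < words.length; i++) {
        emit(words[i], {count: 1});
    }
};
reduce = function(key, values) {
    // Sum the counts emitted for this word.
    var count = 0;
    values.forEach(function(v) {
        count += v['count'];
    });
    return {count: count};
};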
db.posts.mapReduce(map, reduce, {out: post_word_frequency});
As a bit of an added difficulty, I'm doing it in node.js (with node-mongo-native, though I'm willing to switch to something else for the reduce query if there's an easier way).
var db = new Db('mydb', new Server('localhost', 27017, {}), {native_parser:false});
db.open(function(err, db){
    db.collection('posts', function(err, col) {
        col.mapReduce(map, reduce, {out: post_word_frequency});
    });
});
So far, I'm having difficulty in that node's telling me ReferenceError: post_word_frequency is not defined (I tried creating it in the shell, but that still didn't help).
So has anyone done a mapreduce with node.js? Is this the wrong use for map reduce? Maybe there's another way to do it? (Perhaps just loop and upsert into another collection?)
Thanks for any feedback and advice! :)
EDIT: Ryanos below was correct (thanks!). One thing that was missing from my MongoDB-based solution was finding the collection and converting it to an array.
db.open(function(err, db){
    db.collection('posts', function(err, col) {
        // This line creates the 'posts' array needed by the map/reduce functions.
        col.find({}).toArray(function(err, posts){
            var words = _.flatten(_.map(posts, function(val) {
                // ... (continues as in the answer below)
            }));
        });
    });
});
There's a bug with {out: post_word_frequency}: post_word_frequency is used as a variable there, and it was never defined. Maybe you want {out: "post_word_frequency"}, but it should work without this out variable.
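To check what the question's map and reduce functions should produce, independent of any driver, here is a minimal sketch that simulates the emit, group, and reduce cycle in plain JavaScript. The simulateMapReduce helper, the global emit function, and the in-memory posts array are hypothetical stand-ins for MongoDB, not part of any library, and the map function emits the word itself as the key.

```javascript
// Sample documents from the question (otherstuff omitted).
var posts = [
  { id: 0, body: "foo bar baz" },
  { id: 1, body: "baz bar oof" },
  { id: 2, body: "baz foo oof" }
];

// Map: emit each word with a count of 1. `this` is the current document.
var map = function() {
  var words = this.body.split(' ');
  for (var i = 0; i < words.length; i++) {
    emit(words[i], { count: 1 });
  }
};

// Reduce: sum the counts emitted for one word.
var reduce = function(key, values) {
  var count = 0;
  values.forEach(function(v) { count += v.count; });
  return { count: count };
};

// Hypothetical stand-in for MongoDB's emit: group emitted values by key.
var emitted = Object.create(null);
function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

// Hypothetical helper: run map over every document, then reduce each group.
function simulateMapReduce(docs, mapFn, reduceFn) {
  emitted = Object.create(null);
  docs.forEach(function(doc) { mapFn.call(doc); });
  var result = {};
  Object.keys(emitted).forEach(function(key) {
    result[key] = reduceFn(key, emitted[key]);
  });
  return result;
}

var wordCounts = simulateMapReduce(posts, map, reduce);
console.log(wordCounts);
```

With the real driver, the same two functions would be passed to mapReduce together with an out option such as the string "post_word_frequency" or { inline: 1 }; the exact call shape depends on the driver version.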
Using underscore, it can be done simply.
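/*
    Flatten every post body into one list of word records:
    [{"word": "foo", "count": 1}, ...]
*/
var words = _.flatten(_.map(posts, function(val) {
    return _.map(val.body.split(" "), function(val) {
        return {"word": val, "count": 1};
    });
}));
/*
    Fold the list into a single object of totals:
    {
        "foo": n, ...
    }
*/
var count = _.reduce(words, function(memo, val) {
    // ++undefined is NaN, so the first sighting of a word starts it at 1.
    if (_.isNaN(++memo[val.word])) {
        memo[val.word] = 1;
    }
    return memo;
}, {});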
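For comparison, the same count can be sketched in a single pass with plain JavaScript and no underscore. The countWords helper name and the sample array are assumptions made for illustration, and unlike the snippet above this version splits on any run of whitespace and skips empty strings.

```javascript
// Hypothetical helper: count how often each word appears across all bodies.
function countWords(posts) {
  // A prototype-less object, so words like "constructor" are counted safely.
  var counts = Object.create(null);
  posts.forEach(function(post) {
    post.body.split(/\s+/).forEach(function(word) {
      if (word === '') return; // skip empties from leading/trailing whitespace
      counts[word] = (counts[word] || 0) + 1;
    });
  });
  return counts;
}

// Sample documents from the question (otherstuff omitted).
var posts = [
  { id: 0, body: "foo bar baz" },
  { id: 1, body: "baz bar oof" },
  { id: 2, body: "baz foo oof" }
];

var count = countWords(posts);
console.log(count.foo, count.bar, count.baz, count.oof);
```

Both versions load every post into memory first, which is fine for a small collection but is the main reason to consider a server-side map/reduce for a large one.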
Underscore functions used: _.reduce, _.map, _.isNaN, _.flatten