Merging cached GQL queries instead of using IN

I'm generating a feed that merges the comments of many users, so your feed might contain comments by user1+user2+user1000 whereas mine might contain only user1+user2. So I have the line:

some_comments = Comment.gql("WHERE username IN :1",user_list)

I can't just memcache the whole thing since everyone will have different feeds, even if the feeds for user1 and user2 would be common to many viewers. According to the documentation:

...the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.

Is there a library function to merge some sorted and cached queries, or am I going to have to:

for user in user_list:
  if memcached(user):
    add it to the results
  else:
    add Comment.gql("WHERE username = :1",user) to the results 
    cache it too
sort the results

(In the worst case (nothing is cached) I expect sending 30 GQL queries off is slower than one giant IN query.)


There's nothing built in to do this, but you can do it yourself, with one caveat: if you run an IN query and fetch 30 results, you get the 30 records that sort first according to your sort criteria across all the subqueries. To assemble the same result set from cached per-user queries, you either have to cache as many results for each user as the total result set (e.g., 30) and throw away most of them, or store fewer results per user and accept that you will sometimes drop newer results from one user in favor of older results from another.
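To make the caveat concrete, here is a small sketch with made-up data. The names `alice`, `bob`, and `top_n` are hypothetical, and plain integers stand in for comment timestamps (larger means newer); nothing here touches the datastore.

```python
import heapq
from itertools import islice

def top_n(per_user_lists, n):
    # Each input list must already be sorted newest-first (descending).
    # heapq.merge lazily merges the sorted inputs; islice keeps the first n.
    return list(islice(heapq.merge(*per_user_lists, reverse=True), n))

# Hypothetical timestamps: alice posted recently, bob long ago.
alice = [10, 9, 8]
bob = [3, 2, 1]
n = 3

# Caching n results per user reproduces the true top n.
full = top_n([alice[:n], bob[:n]], n)

# Caching only 2 per user loses alice's third comment (8),
# so an older comment from bob (3) takes its place.
shallow = top_n([alice[:2], bob[:2]], n)
```

With these inputs `full` holds only alice's comments, while `shallow` ends with one of bob's older comments, which is the trade-off described above.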

That said, here's how you can do this:

  1. Do a memcache.get_multi to retrieve the cached result sets for all the users.
  2. For each user that doesn't have a result set cached, execute the individual query, fetching as many results as you could need. Use memcache.set_multi to cache the new result sets.
  3. Do a merge-join on all the result sets and take the top n results as your final result set. Because username is presumably not a list field (that is, every comment has a single author), you don't need to worry about duplicates.
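The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not App Engine code: `FakeCache` is a hypothetical in-memory stand-in that only mimics the dict-in, dict-out shape of `memcache.get_multi` and `memcache.set_multi`, `fetch_user_comments` is a hypothetical callback standing in for the per-user query, and comments are modeled as `(timestamp, username, text)` tuples sorted newest-first.

```python
import heapq
from itertools import islice

class FakeCache:
    """Hypothetical in-memory stand-in for memcache (get_multi/set_multi only)."""
    def __init__(self):
        self._data = {}

    def get_multi(self, keys):
        # Like memcache.get_multi: returns a dict containing only the hits.
        return {k: self._data[k] for k in keys if k in self._data}

    def set_multi(self, mapping):
        self._data.update(mapping)

def merged_feed(users, cache, fetch_user_comments, n):
    # Step 1: one round trip for every user's cached result set.
    results = cache.get_multi(users)
    # Step 2: query only the users that missed, then cache them together.
    missing = [u for u in users if u not in results]
    fresh = {u: fetch_user_comments(u, n) for u in missing}
    if fresh:
        cache.set_multi(fresh)
    results.update(fresh)
    # Step 3: merge the per-user lists (each sorted newest-first) and keep
    # the top n. Each comment has one author, so no de-duplication is needed.
    merged = heapq.merge(*results.values(), key=lambda c: c[0], reverse=True)
    return list(islice(merged, n))
```

In a real handler the callback would run something like the per-user `Comment.gql("WHERE username = :1", user)` query with whatever ordering the feed uses, and the cache keys would probably need a prefix so they don't collide with other entries.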

Currently, IN queries are executed serially, so this approach won't be any slower than executing an IN query, even when none of the results are cached. This may change in the future, though. If you want to improve performance now, you'll probably want to use Guido's NDB project, which will allow you to execute all the subqueries in parallel.


You can use memcache.get_multi() to see which of the users' feeds are already in memcache. Then use set().difference() on the original user list vs. the users found in memcache to find out which weren't retrieved. Finally, fetch the missing user feeds from the datastore in a batch get.

From there you can combine the two lists and, if it isn't too long, sort it in memory. If you're working on something Ajaxy, you could hand off sorting to the client.
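A rough sketch of the set-difference and in-memory sort, assuming the cached and freshly fetched feeds are both dicts mapping a username to a list of comments. The helper names and the `created` field are hypothetical; comments are plain dicts here rather than datastore entities.

```python
def find_missing(user_list, cached):
    # Users requested but not present among the cache hits.
    return set(user_list).difference(cached)

def combine_and_sort(cached, fetched, limit):
    # Flatten both sources into one list, then sort newest-first in memory.
    feeds = list(cached.values()) + list(fetched.values())
    comments = [c for feed in feeds for c in feed]
    comments.sort(key=lambda c: c["created"], reverse=True)
    return comments[:limit]

# Hypothetical data: user1 was a cache hit, user2 was not.
cached = {"user1": [{"created": 7, "text": "hi"}]}
missing = find_missing(["user1", "user2"], cached)
fetched = {"user2": [{"created": 9, "text": "yo"}, {"created": 2, "text": "old"}]}
feed = combine_and_sort(cached, fetched, limit=2)
```

Sorting the whole combined list is simpler than a merge but does more work, so it suits the case mentioned above where the list isn't too long.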
