Avoiding exploding indices and entity-group write-rate limits with appengine
I have an application in which there are Course
s, Topic
s, and Tag
s. Each Topic
can be in many Course
s and have many Tag
s. I want to look up every Topic
that has a specific Tag
x and is in specific Course
y.
Naively, I give each standard a list of
Course
ids andTag
ids, so I canselect * from Topic where tagIds = x && courseIds = y
. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.I could just
select * from Topic where tagIds = x
, and then use a for loop to go through the result, choosing onlyTopic
s whosecourseIds.contain(y)
. This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.I could
select __KEY__ from Topic where tagIds = x
ANDselect __KEY__ from Topic where courseIds = y
and find the intersection in my application code. If the sets are small this might not be unreasonab开发者_开发百科le.I could make a sort of join table,
TopicTagLookup
with atagId
andcourseId
field. The parent key of these entities would point to the relevantTopic
. Then I would need to make one of theseTopicTagLookup
entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!I could precalculate each query. A
TopicTagQueryCache
entity would hold atagId
,courseId
, and aList<TopicId>
. Then the query looks likeselect * from TopicTagQueryCache where tagId=x && courseId = y
, fetching the list of topic ids, and then using agetAllById
call on the list. Similar to #3, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
Appengine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topic
s in the intersection of a Course
and a Tag
?
Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
Options 4 and 5 are similar to the relation index pattern documented in this talk.
I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic
entity will require an extra query (first you query TopicTagQueryCache
to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache
you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally - this "index" will just be stale for a short period of time when you update a Topic
(at worst, your Topic
will temporarily show up in results it should no longer show up in, and perhaps take a moment before it shows up in new results which it should show up it - this doesn't seem so bad). You can even do this update on the task queue (to make sure this potentially large number of database writes all succeed, and so that you can quickly finish the request so your user isn't waiting).
As you said yourself you should arrange your data to facilitate the scaling of your app, thus in the question of What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
You can hold your own indexes of these sets by creating objects of CourseRef and TopicRef which consist of Key only, with the ID portion being an actual Key of the corresponding entity. These "Ref" entities will be under a specific tag, thus no actual Key duplicates. So the structure for a given Tag is : Tag\CourseRef...\TopicRef...
This way given a Tag and Course, you construct the Key Tag\CourseRef and do an ancestor Query which gets you a set of keys you can fetch. This is extremely fast as it is actually a direct access, and this should handle large lists of courses or topics without the issues of List properties.
This method will require you to use the DataStore API to some extent. As you can see this gives answer to a specific question, and the model will do no good for other type of Set operations.
精彩评论