Looking for Denormalization Advice for Google App Engine
I am working on a system, which will run on GAE, which will have several related entities and I am not sure of the best way to store the data. This post is a request for advice from others who may have similar experience....
The system will have users, with profile data and an image. Those users will be able to create "events" and add journal entries to it. For the purpose of the system, the "events" will likely have 1 or 2 journal entries in them, and anything over 10 would likely never happen. Other users will be able to add comments to users' entries as well, where popular ones may have hundreds or even thousands of comments. When a random visitor uses the system, they should be able to see the latest events (latest, being defined by those with latest journal entries in them), search by tag, and a very perform basic text search. Then upon selecting an event to view, it should be displayed with all journal entries, and all user comments, with user images alongside comments. A user should also have a kind of self-admin page, to view/modify/delete their events and to view/modify/delete comments they have made on other events. So, doing all this on a normal RDBMS would just queries with some big joins across several tables. On GAE it would obviously need to work differently. Here are my init开发者_运维技巧ial thoughts on the design of the entities:
- Event entity - id, name, timstamp, list property of tags, view count, creator's username, creator's profile image id, number of journal entries it contains, number of total comments it contains, timestamp of last update to contained journal entries, list property of index words for search (built/updated from text from contained journal entries)
- JournalEntry entity - timestamp, journal text, name of event, creator's username, creator's profile image id, list property of comments (containing commenter username and image id)
- User entity - username, password hash, email, list property of subscribed events, timestamp of create date, image id, number of comments posted, number of events created, number of journal entries created, timestamp of last journal activity
- UserComment entity - username, id of event commented on, title of event commented on
- TagData entity - tag name, count of events with tag on them
So, I'd like to hear what people here think about the design and what changes should be made to help it scale well. Thanks!
- Rather than store
Event.id
as a property, use the id automatically embedded in each entity's key, or set unique key names on entities as you create them. - You have lots of options for modeling the relationship between
Event
andJournalEntry
: you could use aReferenceProperty
, you could parentJournalEntries
toEvents
and retrieve them with ancestor queries, or you could store a list ofJournalEntry
key ids or names onEvent
and retrieve them in batch with a key query. Try some things out with realistically-distributed dummy data, and use appstats to see what works best. UserComment
references anEvent
, whileJournalEntry
references a list ofUserComments
, which is a little confusing. Is there a relationship betweenUserComment
andJournalEntry
? or just betweenUserComment
andEvent
?- Persisting so many counts is expensive. When I post a comment, you're going to write a new
UserComment
entity and also update myUser
entity and aJournalEntry
entity and anEvent
entity. The number ofUserComments
you expect perEvent
makes it unwise to include everything in the same entity group, which means you can't do these writes transactionally, so you'll do them serially, and the entities might be stored across different network nodes, making the whole operation slow; and you'll also be open to consistency problems. Can you do without some of these counts and consider storing others in memcache? - When you fetch an
Event
from the datastore, you don't actually care about its list of search index words, and retrieving and deserializing them from protocol buffers has a cost. You can get around this by splitting eachEvent
's search index words into a separate childEventIndex
entity. Then you can queryEventIndex
on your search term, fetch just theEventIndex
keys forEventIndexes
that match your search, derive the correspondingEvents
' keys withkey.parent()
, and fetch theEvents
by key, never paying for the retrieval or deserialization of your search index word lists. Brett Slatkin explains this strategy here at 14:35. - Updating
Event.viewCount
will fail if you have a lot of views for anyEvent
in rapid succession, so you should try out counter sharding.
Good luck, and tell us what you learn by trying stuff out.
精彩评论