Best method to determine which of a set of keys exist in the datastore
I have a few hundred keys, all of the same Model, which I have pre-computed:
candidate_keys = [db.Key(...), db.Key(...), db.Key(...), ...]
Some of these keys refer to actual entities in the datastore, and some do not. I wish to determine which keys do correspond to entities.
It is not necessary to know the data within the entities, just whether they exist.
One solution would be to use db.get():
keys_with_entities = set()
for entity in db.get(candidate_keys):
if entity:
keys_with_entities.add(entity.key())
However this procedure would fetch all entity data from the store which is unnecessary and costly.
A second idea is to use a Query with an IN
filter on key_name
, manually fetching in chunks of 30 to fit the requirements of the IN
pseudo-filter. However keys-only querie开发者_如何学Cs are not allowed with the IN
filter.
Is there a better way?
IN filters are not supported directly by the App Engine datastore; they're a convenience that's implemented in the client library. An IN query with 30 values is translated into 30 equality queries on one value each, resulting in 30 regular queries!
Due to round-trip times and the expense of even keys-only queries, I suspect you'll find that simply attempting to fetch all the entities in one batch fetch is the most efficient. If your entities are large, however, you can make a further optimization: For every entity you insert, insert an empty 'presence' entity as a child of that entity, and use that in queries. For example:
foo = AnEntity(...)
foo.put()
presence = PresenceEntity(key_name='x', parent=foo)
presence.put()
...
def exists(keys):
test_keys = [db.Key.from_path('PresenceEntity', 'x', parent=x) for x in keys)
return [x is not None for x in db.get(test_keys)]
At this point, the only solution I have is to manually query by key with keys_only=True
, once per key.
for key in candidate_keys:
if MyModel.all(keys_only=True).filter('__key__ =', key).count():
keys_with_entities.add(key)
This may in fact be slower then just loading the entities in batch and discarding them, although the batch load also hammers the Data Received from API
quota.
How not to do it (update based on Nick Johnson's answer):
I am also considering adding a parameter specifically for the purpose of being able to scan for it with an IN
filter.
class MyModel(db.Model):
"""Some model"""
# ... all the old stuff
the_key = db.StringProperty(required=True) # just a duplicate of the key_name
#... meanwhile back in the example
for key_batch in batches_of_30(candidate_keys):
key_names = [x.name() for x in key_batch]
found_keys = MyModel.all(keys_only=True).filter('the_key IN', key_names)
keys_with_entities.update(found_keys)
The reason this should be avoided is that the IN filter on a property sequentially performs an index scan, plus lookup once per item in your IN
set. Each lookup takes 160-200ms so that very quickly becomes a very slow operation.
精彩评论