How do you efficiently bulk index lookups?

2023-01-03 19:57 问答作者：

I have these entity kinds:

Molecule
Atom
MoleculeAtom

Given a list(molecule_ids) whose lengths is in the hundreds, I need to get a dict of t开发者_StackOverflowhe form {molecule_id: list(atom_ids)}. Likewise, given a list(atom_ids) whose length is in the hunreds, I need to get a dict of the form {atom_id: list(molecule_ids)}.

Both of these bulk lookups need to happen really fast. Right now I'm doing something like:

atom_ids_by_molecule_id = {}

for molecule_id in molecule_ids:
    moleculeatoms = MoleculeAtom.all().filter('molecule =', db.Key.from_path('molecule', molecule_id)).fetch(1000)
    atom_ids_by_molecule_id[molecule_id] = [
        MoleculeAtom.atom.get_value_for_datastore(ma).id() for ma in moleculeatoms
    ]

Like I said, len(molecule_ids) is in the hundreds. I need to do this kind of bulk index lookup on almost every single request, and I need it to be FAST, and right now it's too slow.

Ideas:

Will using a Molecule.atoms ListProperty do what I need? Consider that I am storing additional data on the MoleculeAtom node, and remember it's equally important for me to do the lookup in the molecule->atom and atom->molecule directions.
Caching? I tried memcaching lists of atom IDs keyed by molecule ID, but I have tons of atoms and molecules, and the cache can't fit it.
How about denormalizing the data by creating a new entity kind whose key name is a molecule ID and whose value is a list of atom IDs? The idea is, calling db.get on 500 keys is probably faster than looping through 500 fetches with filters, right?

Your third approach (denormalizing the data) is, generally speaking, the right one. In particular, db.get by keys is indeed about as fast as the datastore gets.

Of course, you'll need to denormalize the other way around too (entity with key name atom ID, value a list of molecule IDs) and will need to update everything carefully when atoms or molecules are altered, added, or deleted -- if you need that to be transactional (multiple such modifications being potentially in play at the same time) you need to arrange ancestor relationships.. but I don't see how to do it for both molecules and atoms at the same time, so maybe that could be a problem. Maybe, if modifications are rare enough (and depending on other aspects of your application), you could serialize the modifications in queued tasks.

继续阅读：google-app-engine indexing scalability

How do you efficiently bulk index lookups?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？