Load globally accessible singleton on app start in Google App Engine using Python
Using Google App Engine, is it possible to initialize a globally accessible singleton on app startup? I have a large static tree structure that I need to use on every request and want to initialize it beforehand. The tree structure is too large (20+ MB) to be put into memcache, and I am trying to figure out what other alternatives I have.
EDIT: Just to add some clarity based on the answers I've received so far. I'm loading a dictionary of words into a trie/prefix tree structure. The trie is immutable, as the dictionary of words is fixed. I'm generating anagrams based on an input string of characters, so a single request may access a fair amount of the trie, possibly more than 1 MB, though I'm not certain.
Here is the Python structure that I'm loading the dictionary of words into as well.
class Node(object):
    def __init__(self, letter='', final=False):
        self.letter = letter
        self.final = final
        self.children = {}
    def add(self, letters):
        node = self
        for index, letter in enumerate(letters):
            if letter not in node.children:
                node.children[letter] = Node(letter, index==len(letters)-1)
            node = node.children[letter]
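For what it's worth, a request then walks the trie with the available letters roughly like this (a simplified sketch; it yields every word that can be spelled from the given letters, and the real lookup does a bit more bookkeeping):

def anagrams(node, counts, prefix=''):
    # counts: dict mapping each letter to how many of it are still unused
    if node.final:
        yield prefix
    for letter, child in node.children.items():
        if counts.get(letter, 0) > 0:
            counts[letter] -= 1
            for word in anagrams(child, counts, prefix + letter):
                yield word
            counts[letter] += 1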
Each request might be served from a completely different process, on a different server, which might even be in a separate datacenter (hey, maybe on a different continent). There is nothing that's guaranteed to be "globally accessible" to the handlers of different requests to the same app except the datastore (even memcache's entries might disappear at any time if things get too busy: it's a cache, after all!-).
Perhaps you can keep your "static tree structure" in a data file that you upload together with the application's code, and access it from disk in lieu of "initializing".
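For instance (a minimal sketch; trie.pickle and get_trie are just illustrative names): build and pickle the trie once on your own machine, upload the pickle file alongside the application's code, and load it lazily into a module-level variable so each process reads the file at most once:

import cPickle as pickle   # plain `pickle` on Python 3

_trie = None

def get_trie():
    # load from the uploaded data file at most once per process,
    # then reuse the same object for later requests
    global _trie
    if _trie is None:
        f = open('trie.pickle', 'rb')
        _trie = pickle.load(f)
        f.close()
    return _trie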
Edit: as requested, here's a rough-and-ready example of the "lightweight class mapping a tree into an array" approach I mentioned in a comment -- neither tuned nor finely tested. I'm taking as the example a binary search tree with integer payloads and assuming that, for some reason, it's important to keep exactly the same structure in the "light" tree as in the "heavy" tree it represents. Even with these simplifications it's still a fair amount of code, but here it comes:
import array
import random

def _doinsert(tree, payload):
    if tree is None: return HeavyTree(payload)
    tree.insert(payload)
    return tree

class HeavyTree(object):
    def __init__(self, payload):
        self.payload = payload
        self.left = self.right = None
    def insert(self, other):
        if other <= self.payload:
            self.left = _doinsert(self.left, other)
        else:
            self.right = _doinsert(self.right, other)
    def walk(self):
        if self.left:
            for x in self.left.walk(): yield x
        yield self.payload
        if self.right:
            for x in self.right.walk(): yield x
    def walknodes(self):
        yield self
        if self.left:
            for x in self.left.walknodes(): yield x
        if self.right:
            for x in self.right.walknodes(): yield x
data = [random.randint(0, 99) for _ in range(9)]
print 'data: ',
for x in data: print x,
print
theiter = iter(data)
thetree = HeavyTree(next(theiter))
for x in theiter: thetree.insert(x)
print
print 'Heavy tree:'
print 'nodes:',
for x in thetree.walknodes(): print x.payload,
print
print 'inord:',
for x in thetree.walk(): print x,
print
class LightTree(HeavyTree):
    def __init__(self, base, offset):
        self.base = base
        self.offset = offset
    @property
    def payload(self):
        return self.base[self.offset]
    @property
    def left(self):
        return self._astree(self.offset+1)
    @property
    def right(self):
        return self._astree(self.offset+2)
    def _astree(self, i):
        offset = self.base[i]
        if offset < 0: return None
        return LightTree(self.base, offset)

def heavy_to_light(heavy):
    # each node occupies 3 consecutive array slots: payload, offset of
    # left child, offset of right child (-1 meaning "no child")
    for i, node in enumerate(heavy.walknodes()):
        node.id = i * 3
    base = array.array('l', (i+1) * 3 * [-1])
    for node in heavy.walknodes():
        base[node.id] = node.payload
        if node.left: base[node.id+1] = node.left.id
        if node.right: base[node.id+2] = node.right.id
    return LightTree(base, 0)
print
print 'Light tree:'
light = heavy_to_light(thetree)
print 'nodes:',
for x in light.walknodes(): print x.payload,
print
print 'base :',
for x in light.base: print x,
print
print 'inord:',
for x in light.walk(): print x,
print
A typical run would show:
data: 27 79 90 60 82 80 3 94 76
Heavy tree:
nodes: 27 3 79 60 76 90 82 80 94
inord: 3 27 60 76 79 80 82 90 94
Light tree:
nodes: 27 3 79 60 76 90 82 80 94
base : 27 3 6 3 -1 -1 79 9 15 60 -1 12 76 -1 -1 90 18 24 82 21 -1 80 -1 -1 94 -1 -1
inord: 3 27 60 76 79 80 82 90 94
The details vary on every run, of course, since the data are generated randomly.
Maybe this sort of thing is just too cumbersome for anybody who didn't start with good old Fortran (and thus inevitably learned how to represent logical pointers as indices into an array), as I did back in EE school many decades ago;-). But loading such arrays from a file straight into memory is blazingly fast (compared to unpickling and the like)...!-)
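For example, the LightTree's backing array above could be dumped to a binary file offline, uploaded with the app, and read back in a single call on the server (save_base and load_base are just illustrative names):

import array

def save_base(light, filename):
    # write the flat array of longs as raw bytes (done offline, then
    # uploaded together with the app's code)
    f = open(filename, 'wb')
    light.base.tofile(f)
    f.close()

def load_base(filename, slots):
    # `slots` is the number of longs that were written, i.e. 3 per node
    base = array.array('l')
    f = open(filename, 'rb')
    base.fromfile(f, slots)
    f.close()
    return LightTree(base, 0)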
How much of this tree do you need to access on a single request? In what manner do you query it? Does it ever change?
If it's immutable, you don't really need a 'singleton' - which implies mutability - just a way to access the data on each instance. Depending on how you need to access it, you could store it as a data file, a blob, or as data in the datastore.
I think Google gives you 300MB of local memory for each instance that is running your project. So all you have to do is store the tree structure into a variable in some module.
Whenever Google spins up a new process for your app, it will run the code to build your tree once, and then you can access it for future requests that are handled by that process. Just make sure that building the tree takes less than 30 seconds, because it has to happen within the time frame of whatever random request makes Google decide to spin up a new process.
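Concretely, something along these lines (a minimal sketch; words.txt stands in for your uploaded word list):

# trie_module.py
def _build_trie():
    root = Node()
    for line in open('words.txt'):
        root.add(line.strip())
    return root

# Built once per process, when this module is first imported; every
# later request served by the same process reuses the same object.
TRIE = _build_trie()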
Memcache and the datastore are your best bets for truly global access across all your instances. However, a global variable can still be used to cache your data structure within each instance. Wouldn't a trie structure be pretty easy to break up into chunks such that each chunk will fit into memcache? Once you get your trie into memcache, anytime you access a chunk of the trie from memcache, you could store it in a global variable on that instance. Over the course of a few requests, you will build up a full copy of the trie on each instance that you have running. This is a bit complicated, but could ultimately give you the best performance.
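Sketched out, that might look something like this (build_subtrie and the 'trie:' key prefix are made up for illustration; each per-letter chunk is assumed to pickle to under memcache's value-size limit of about 1 MB):

from google.appengine.api import memcache

_subtries = {}   # per-instance (per-process) cache, filled lazily

def get_subtrie(first_letter):
    subtrie = _subtries.get(first_letter)
    if subtrie is None:
        subtrie = memcache.get('trie:' + first_letter)
        if subtrie is None:
            # rebuild the chunk if memcache has evicted it
            subtrie = build_subtrie(first_letter)
            memcache.set('trie:' + first_letter, subtrie)
        _subtries[first_letter] = subtrie
    return subtrie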