
Load globally accessible singleton on app start in Google App Engine using Python

Using Google App Engine, is it possible to initialize a globally accessible singleton on app startup? I have a large static tree structure that I need to use on every request and want to initialize it beforehand. The tree structure is too large (20+MB) to be put into Memcache and I am trying to figure out what other alternatives I have.

EDIT: Just to add some clarity based on the answers I've received so far. I'm loading a dictionary of words into a trie/prefix tree structure. The trie is immutable since the dictionary of words is fixed. I'm generating anagrams based on an input string of characters, so a single request may access a fair amount of the trie, possibly more than 1MB, though I'm not certain.

Here is the Python structure that I'm loading the dictionary of words into:

class Node(object):

    def __init__(self, letter='', final=False):
        self.letter = letter
        self.final = final          # True if a word ends at this node
        self.children = {}          # maps letter -> child Node

    def add(self, letters):
        node = self
        for letter in letters:
            if letter not in node.children:
                node.children[letter] = Node(letter)
            node = node.children[letter]
        node.final = True           # mark the end of the added word
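
For context, here is a minimal sketch of how such a trie might be searched, using a hypothetical find_anagrams helper that collects every dictionary word spellable from (a subset of) the input letters:

from collections import Counter

def find_anagrams(node, letters, prefix='', results=None):
    # Depth-first walk of the trie, consuming letters from a multiset
    # (Counter) of the available input characters.
    if results is None:
        results = set()
    if node.final:
        results.add(prefix)
    for letter, child in node.children.items():
        if letters[letter] > 0:
            letters[letter] -= 1
            find_anagrams(child, letters, prefix + letter, results)
            letters[letter] += 1
    return results

Usage sketch: build the trie with root.add(word) for each dictionary word, then call find_anagrams(root, Counter('cat')).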


Each request might be served from a completely different process, on a different server, which might even be in a separate datacenter (hey, maybe on a different continent). There is nothing that's guaranteed to be "globally accessible" to the handlers of different requests to the same app except the datastore (even memcache's entries might disappear at any time if things get too busy: it's a cache, after all!-).

Perhaps you can keep your "static tree structure" in a data file that you upload together with the application's code, and access it from disk in lieu of "initializing".
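
A minimal sketch of that idea, assuming the trie has been pickled into a file named words.trie that is deployed with the code (the filename and the get_trie helper are hypothetical):

import cPickle as pickle   # Python 2, as in the App Engine runtime of the time

_TRIE = None

def get_trie():
    # Lazily unpickle the trie from the bundled file; the module-level cache
    # means each process pays the load cost once, not once per request.
    global _TRIE
    if _TRIE is None:
        with open('words.trie', 'rb') as f:
            _TRIE = pickle.load(f)
    return _TRIE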

Edit: as requested, here's a rough and ready example of the "lightweight class mapping tree into array" approach which I mentioned in a comment -- not tuned nor finely tested. I'm taking as the example a binary search tree with integer payloads and assuming that for some reason it's important to keep the exact structure in the "light" tree as in the "heavy" tree it represents. Even with these simplifications it's still a lot of code, but, here comes:

import array
import random

# Insert `payload` into a possibly-empty subtree and return its (new) root.
def _doinsert(tree, payload):
  if tree is None: return HeavyTree(payload)
  tree.insert(payload)
  return tree

class HeavyTree(object):
  def __init__(self, payload):
    self.payload = payload
    self.left = self.right = None
  def insert(self, other):
    if other <= self.payload:
      self.left = _doinsert(self.left, other)
    else:
      self.right = _doinsert(self.right, other)
  def walk(self):
    if self.left:
      for x in self.left.walk(): yield x
    yield self.payload
    if self.right:
      for x in self.right.walk(): yield x
  def walknodes(self):
    yield self
    if self.left:
      for x in self.left.walknodes(): yield x
    if self.right:
      for x in self.right.walknodes(): yield x

data = [random.randint(0, 99) for _ in range(9)]
print 'data: ',
for x in data: print x,
print
theiter = iter(data)
thetree = HeavyTree(next(theiter))
for x in theiter: thetree.insert(x)

print
print 'Heavy tree:'
print 'nodes:',
for x in thetree.walknodes(): print x.payload,
print
print 'inord:',
for x in thetree.walk(): print x,
print

# The same tree, viewed through a flat array of longs: each node occupies
# three consecutive slots (payload, offset of left child, offset of right
# child, with -1 meaning "no child").
class LightTree(HeavyTree):
  def __init__(self, base, offset):
    self.base = base
    self.offset = offset
  @property
  def payload(self):
    return self.base[self.offset]
  @property
  def left(self):
    return self._astree(self.offset+1)
  @property
  def right(self):
    return self._astree(self.offset+2)
  def _astree(self, i):
    offset = self.base[i]
    if offset < 0: return None
    return LightTree(self.base, offset)

# Flatten a HeavyTree into the array-backed LightTree representation.
def heavy_to_light(heavy):
  for i, node in enumerate(heavy.walknodes()):
    node.id = i * 3
  base = array.array('l', (i+1) * 3 * [-1])
  for node in heavy.walknodes():
    base[node.id] = node.payload
    if node.left: base[node.id+1] = node.left.id
    if node.right: base[node.id+2] = node.right.id
  return LightTree(base, 0)

print
print 'Light tree:'
light = heavy_to_light(thetree)
print 'nodes:',
for x in light.walknodes(): print x.payload,
print
print 'base :',
for x in light.base: print x,
print
print 'inord:',
for x in light.walk(): print x,
print

A typical run would show:

data:  27 79 90 60 82 80 3 94 76

Heavy tree:
nodes: 27 3 79 60 76 90 82 80 94
inord: 3 27 60 76 79 80 82 90 94

Light tree:
nodes: 27 3 79 60 76 90 82 80 94
base : 27 3 6 3 -1 -1 79 9 15 60 -1 12 76 -1 -1 90 18 24 82 21 -1 80 -1 -1 94 -1 -1
inord: 3 27 60 76 79 80 82 90 94

The details vary on every run, of course, since the data are generated randomly.

Maybe this sort of thing is just too cumbersome for anybody who didn't start with good old Fortran (and thus inevitably learned how to represent logical pointers as indices into an array), as I did back in EE school many decades ago;-). But loading such arrays from a file straight into memory is blazingly fast (compared to unpickling and the like)...!-)
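
Concretely, and continuing the example above, the flattened array could be dumped once at build time and read straight back at instance startup; here is a sketch (the tree.bin filename is illustrative, and it assumes the file is written and read on platforms with the same array item size):

import os

# At build/deploy time: dump the flat array to a binary file.
with open('tree.bin', 'wb') as f:
    light.base.tofile(f)

# At instance startup: read it straight back and wrap it in a LightTree.
reloaded_base = array.array('l')
with open('tree.bin', 'rb') as f:
    reloaded_base.fromfile(f, os.path.getsize('tree.bin') // reloaded_base.itemsize)
reloaded = LightTree(reloaded_base, 0)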


How much of this tree do you need to access on a single request? In what manner do you query it? Does it ever change?

If it's immutable, you don't really need a 'singleton' - which implies mutability - just a way to access the data on each instance. Depending on how you need to access it, you could store it as a data file, a blob, or as data in the datastore.
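
If the datastore route is taken, a rough sketch of what storing the pickled structure could look like (the TrieBlob model and its key names are hypothetical; datastore entities are capped at roughly 1MB, so a 20+MB trie would have to be split across several of them):

from google.appengine.ext import db
import cPickle as pickle

class TrieBlob(db.Model):
    # One entity holding a pickled chunk of the trie.
    data = db.BlobProperty()

def load_trie_chunk(key_name):
    # key_name identifies which chunk to fetch (naming scheme left open).
    entity = TrieBlob.get_by_key_name(key_name)
    return pickle.loads(entity.data) if entity else None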


I think Google gives you 300MB of local memory for each instance that is running your project. So all you have to do is store the tree structure in a module-level variable.

Whenever Google spins up a new process for your app, it will run the code to build your tree once, and then you can access it for all future requests handled by that process. Just make sure that building the tree takes less than 30 seconds, because it has to happen within the time frame of whichever request causes Google to spin up the new process.
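
A minimal sketch of that per-process caching, assuming the Node class from the question is importable from a module called trie and that a word list ships with the app as words.txt (both names are hypothetical):

# trie_cache.py: runs once per process, when the module is first imported.
from trie import Node   # hypothetical module holding the question's Node class

ROOT = Node()
with open('words.txt') as f:   # word list bundled with the app
    for line in f:
        word = line.strip()
        if word:
            ROOT.add(word)

Request handlers just do "from trie_cache import ROOT"; each new process builds the trie once, on the first request it serves.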


Memcache and the datastore are your best bets for truly global access across all your instances. However, a global variable can still be used to cache your data structure within each instance. Wouldn't a trie structure be pretty easy to break up into chunks such that each chunk will fit into memcache? Once you get your trie into memcache, anytime you access a chunk of the trie from memcache, you could store it in a global variable on that instance. Over the course of a few requests, you will build up a full copy of the trie on each instance that you have running. This is a bit complicated, but could ultimately give you the best performance.
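
A rough sketch of that two-level scheme (the chunk keys, the way the trie is split, and the load_chunk_from_datastore fallback are all hypothetical; each stored chunk would need to stay under memcache's roughly 1MB value limit):

from google.appengine.api import memcache

_CHUNKS = {}   # instance-global cache: chunk_key -> sub-trie

def get_chunk(chunk_key):
    # Look in this instance's memory first, then in memcache (shared across
    # instances but evictable), and finally fall back to fetching the chunk.
    subtrie = _CHUNKS.get(chunk_key)
    if subtrie is None:
        subtrie = memcache.get(chunk_key)
        if subtrie is None:
            subtrie = load_chunk_from_datastore(chunk_key)   # hypothetical fallback
            memcache.set(chunk_key, subtrie)
        _CHUNKS[chunk_key] = subtrie
    return subtrie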
