Google App Engine - Caching generated HTML
I have written a Google App Engine application that programmatically generates a bunch of HTML that is really the same output for each user who logs into my system, and I know that this is going to be inefficient when the code goes into production. So, I am trying to figure out the best way to cache the generated pages.
The most probable option is to generate the pages and write them into the database, and then check the time of the database put operation for a given page against the time that the code was last updated. Then, if the code is newer than the last put to the database (for a particular HTML request), new HTML will be generated and served, and cached to the database. If the code is older than the last put to the database, then I will just get the HTML direct from the database and serve it (therefore avoiding all the CPU wastage of generating the HTML). I am not only looking to minimize load times, but to minimize CPU usage.
However, one issue that I am having is that I can't figure out how to programmatically check when the version of code uploaded to App Engine was last updated.
I am open to any suggestions on this approach, or other approaches for caching generated html.
Note that while memcache could help in this situation, I believe that it is not the final solution since I really only need to re-generate html when the code is updated (as opposed to every time the memcache expires).
In order of speed:
- memcache
- cached HTML in data store
- full page generation
Your caching solution should take this into account. Essentially, I would probably recommend using memcache anyways. It will be faster than accessing the data store in most cases and when you're generating a large block of HTML, one of the main benefits of caching is that you potentially didn't have to incur the I/O penalty of accessing the data store. If you cache using the data store, you still have the I/O penalty. The difference between regenerating everything and pulling from cached html in the data store is likely to be fairly small unless you have a very complex page. It's probably better to get a bunch of very fast cache hits off memcache and do a full regenerate every once in a while than to make a call out to the data store every time. There's nothing stopping you from invalidating the cached HTML in memcache when you update, and if your traffic is high enough to warrant it, you can always do a multi-level caching system.
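As a rough sketch of the multi-level idea, the cache key can include the deployed code version, so a new deploy simply misses the old entries and nothing has to be invalidated by hand. The two dicts below are stand-ins for memcache and the data store so the example runs anywhere, and `get_page`, `render` and `current_version` are hypothetical names. Reading the version from a `CURRENT_VERSION_ID` environment variable is an assumption you should verify against your runtime.

```python
import os

# Stand-ins so the sketch runs anywhere. On App Engine these would be
# replaced by memcache (fast, may be evicted) and the data store (persistent).
fast_cache = {}
slow_cache = {}


def current_version():
    # Assumption: the runtime exposes the deployed version in this variable.
    return os.environ.get('CURRENT_VERSION_ID', 'dev')


def get_page(path, render, version=None):
    """Return HTML for path, regenerating only when no cached copy
    exists for the given (or currently running) code version."""
    key = '%s:%s' % (version or current_version(), path)

    html = fast_cache.get(key)
    if html is not None:
        return html                  # fastest case: first-level hit

    html = slow_cache.get(key)
    if html is None:
        html = render(path)          # slowest case: full page generation
        slow_cache[key] = html

    fast_cache[key] = html           # repopulate the fast level
    return html
```

With this layout an eviction from the fast level costs one read from the slow level rather than a regeneration, and a full regenerate only happens once per page per deployed version.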
However, my main concern is that this is premature optimization. If you don't have the traffic yet, keep caching to a minimum. App Engine provides a set of really convenient performance analysis tools, and you should be using those to identify bottlenecks after you've got at least a few QPS of traffic.
Anytime you're doing performance optimization, measure first! A lot of performance "optimizations" turn out to either be slower than the original, exactly the same, or they have negative user experience characteristics (like stale data). Don't optimize until you're certain you have to.
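Even before you have production traffic, a crude local measurement can tell you whether page generation is expensive enough to be worth caching. The sketch below uses the standard library's `timeit`; `render_page` is a hypothetical stand-in for your real generator, and `average_seconds` is just an illustrative helper.

```python
import timeit


def render_page():
    # Hypothetical stand-in for the real page generator.
    rows = ['<li>item %d</li>' % i for i in range(1000)]
    return '<ul>' + ''.join(rows) + '</ul>'


def average_seconds(func, runs=100):
    """Average wall-clock time of one call to func, taken over `runs` calls."""
    return timeit.timeit(func, number=runs) / runs


if __name__ == '__main__':
    print('average render time: %.6f s' % average_seconds(render_page))
```

A local number like this is only a rough guide; the figures that matter are the ones you measure on the deployed app under real traffic.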
A while ago I wrote a series of blog posts about writing a blogging system on App Engine. You may find the post on static generation of HTML pages of particular interest.
This is not a complete solution, but it might offer an interesting option for caching.
Google App Engine's frontend caching gives you a way to cache pages without using memcache.
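As far as I understand it, frontend caching is driven by ordinary HTTP caching headers: a response marked as publicly cacheable with a lifetime can be served from Google's frontends without reaching your handler. A minimal sketch of building such headers follows. `cache_headers` is a hypothetical helper, and how you attach the headers to a response depends on your framework.

```python
def cache_headers(max_age_seconds):
    """Build response headers that mark a page as publicly cacheable.

    A non-positive lifetime yields a header asking caches to revalidate.
    """
    if max_age_seconds <= 0:
        return {'Cache-Control': 'no-cache'}
    return {'Cache-Control': 'public, max-age=%d' % max_age_seconds}
```

For example, `cache_headers(3600)` asks caches to keep the page for an hour. Note that pages cached this way stay cached until they expire, even if you deploy new code in the meantime.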
Just serve a static version of your site
It's actually a lot easier than you think.
If you already have a file that contains all of the URLs for your site (e.g. urls.py), half the work is already done.
Here's the structure:
+-/website
+--/static
+---/html
+--/app/urls.py
+--/app/routes.py
+-/deploy.py
/html is where the static files will be served from. urls.py contains a list of all the urls for your site. routes.py (if you moved the routes out of main.py) will need to be modified so you can see the dynamically generated version locally but serve the static version in production. deploy.py is your one-stop static site generator.
How you lay out your urls module is up to you. I personally use it as a one-stop shop to fetch all the metadata for a page, but YMMV.
Example:
main = [
    { 'uri':'about-us', 'url':'/', 'template':'about-us.html', 'title':'About Us' }
]
With all of the URLs for the site in a structured format, crawling your own site is easy as pie.
The route configuration is a little more complicated. I won't go into detail because there are just too many different ways this could be accomplished. The important piece is the code required to detect whether you're running on a development or production server.
Here it is:
import os

# Detect whether this is the 'Development' server
DEV = os.environ['SERVER_SOFTWARE'].startswith('Dev')
I prefer to put this in main.py and expose it globally because I use it to turn on/off other things like logging but, once again, YMMV.
Last, you need the crawler/compiler:
import os
import sys
import urllib2

from app.urls import main

# Port the local development server is listening on
port = '8080'
local_folder = os.getcwd() + os.sep + 'static' + os.sep + 'html' + os.sep

print 'Outputting to: ' + local_folder
print '\nCompiling:'

for page in main:
    # Fetch the dynamically generated page from the local dev server
    http = urllib2.urlopen('http://localhost:' + port + page['url'])
    file_name = page['template']
    path = local_folder + file_name

    # Save the rendered HTML as a static file
    local_file = open(path, 'w')
    local_file.write(http.read())
    local_file.close()
    print ' - ' + file_name + ' compiled successfully...'
This is really rudimentary stuff. I was actually stunned with how easy it was when I created it. This is literally the equivalent of opening your site page-by-page in the browser, saving as html, and copying that file into the /static/html folder.
The best part is, the /html folder works like any other static folder so it will automatically be cached and the cache expiration will be the same as all the rest of your static files.
Note: This handles a site where the pages are all served from the root folder level. If you need deeper nesting of folders it'll need a slight modification to handle that.
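One possible shape for that modification is sketched below: derive the output path from the URL and create any missing folders before writing. Note that this differs from the script above, which names files after page['template']. `output_path` and `write_page` are hypothetical helpers, and mapping '/' to index.html is an assumption.

```python
import os


def output_path(root, url):
    """Map a site URL such as '/docs/setup/' to a file path under root."""
    parts = [p for p in url.split('/') if p]
    if not parts:
        parts = ['index']            # assumption: the root URL maps to index.html
    return os.path.join(root, *parts) + '.html'


def write_page(root, url, html):
    """Write html to the path for url, creating nested folders as needed."""
    path = output_path(root, url)
    folder = os.path.dirname(path)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    with open(path, 'w') as f:
        f.write(html)
    return path
```

In the crawler loop, the open/write/close lines would be replaced by a call like write_page(local_folder, page['url'], http.read()).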
Old thread, but I'll comment anyway as technology has progressed a little... Another idea that may or may not be appropriate for you is to generate the HTML and store it on Google Cloud Storage, then access the HTML via the CDN link that Cloud Storage provides for you. There is no need to check memcache or wait for the datastore to wake up on new requests. I've started storing all my JavaScript, CSS, and other static content (images, downloads, etc.) like this for my App Engine apps, and it's working well for me.