Hashing a python function to regenerate output when the function is modified
I have a python function that has a deterministic result. It takes a long time to run and generates a large output:
def time_consuming_function():
# lots_of_computing_time to come up with the_result
return the_result
I modify time_consuming_function
from time to time, but I would like to avoid having it run again while it's unchanged. [time_consuming_function
only depends on functions that are immutable for the purposes considered here; i.e. it might have functions from Python libraries but not from o开发者_开发问答ther pieces of my code that I'd change.] The solution that suggests itself to me is to cache the output and also cache some "hash" of the function. If the hash changes, the function will have been modified, and we have to re-generate the output.
Is this possible or ridiculous?
Updated: based on the answers, it looks like what I want to do is to "memoize" time_consuming_function
, except instead of (or in addition to) arguments passed into an invariant function, I want to account for a function that itself will change.
If I understand your problem, I think I'd tackle it like this. It's a touch evil, but I think it's more reliable and on-point than the other solutions I see here.
import inspect
import functools
import json
def memoize_zeroadic_function_to_disk(memo_filename):
def decorator(f):
try:
with open(memo_filename, 'r') as fp:
cache = json.load(fp)
except IOError:
# file doesn't exist yet
cache = {}
source = inspect.getsource(f)
@functools.wraps(f)
def wrapper():
if source not in cache:
cache[source] = f()
with open(memo_filename, 'w') as fp:
json.dump(cache, fp)
return cache[source]
return wrapper
return decorator
@memoize_zeroadic_function_to_disk(...SOME PATH HERE...)
def time_consuming_function():
# lots_of_computing_time to come up with the_result
return the_result
Rather than putting the function in a string, I would put the function in its own file. Call it time_consuming.py, for example. It would look something like this:
def time_consuming_method():
# your existing method here
# Is the cached data older than this file?
if (not os.path.exists(data_file_name)
or os.stat(data_file_name).st_mtime < os.stat(__file__).st_mtime):
data = time_consuming_method()
save_data(data_file_name, data)
else:
data = load_data(data_file_name)
# redefine method
def time_consuming_method():
return data
While testing the infrastructure for this to work, I'd comment out the slow parts. Make a simple function that just returns 0, get all of the save/load stuff working to your satisfaction, then put the slow bits back in.
The first part is memoization and serialization of your lookup table. That should be straightforward enough based on some python serialization library. The second part is that you want to delete your serialized lookup table when the source code changes. Perhaps this is being overthought into some fancy solution. Presumably when you change the code you check it in somewhere? Why not add a hook to your checkin routine that deletes your serialized table? Or if this is not research data and is in production, make it part of your release process that if the revision number of your file (put this function in it's own file) has changed, your release script deletes the serialzed lookup table.
So, here is a really neat trick using decorators:
def memoize(f): cache={}; def result(*args): if args not in cache: cache[args]=f(*args); return cache[args]; return result;
With the above, you can then use:
@memoize def myfunc(x,y,z): # Some really long running computation
When you invoke myfunc, you will actually be invoking the memoized version of it. Pretty neat, huh? Whenever you want to redefine your function, simply use "@memoize" again, or explicitly write:
myfunc = memoize(new_definition_for_myfunc);
Edit
I didn't realize that you wanted to cache between multiple runs. In that case, you can do the following:
import os; import os.path; import cPickle; class MemoizedFunction(object): def __init__(self,f): self.function=f; self.filename=str(hash(f))+".cache"; self.cache={}; if os.path.exists(self.filename): with open(filename,'rb') as file: self.cache=cPickle.load(file); def __call__(self,*args): if args not in self.cache: self.cache[args]=self.function(*args); return self.cache[args]; def __del__(self): with open(self.filename,'wb') as file: cPickle.dump(self.cache,file,cPickle.HIGHEST_PROTOCOL); def memoize(f): return MemoizedFunction(f);
What you describe is effectively memoization. Most common functions can be memoized by defining a decorator.
A (overly simplified) example:
def memoized(f):
cache={}
def memo(*args):
if args in cache:
return cache[args]
else:
ret=f(*args)
cache[args]=ret
return ret
return memo
@memoized
def time_consuming_method():
# lots_of_computing_time to come up with the_result
return the_result
Edit:
From Mike Graham's comment and the OP's update, it is now clear that values need to be cached over different runs of the program. This can be done by using some of of persistent storage for the cache (e.g. something as simple as using Pickle
or a simple text file, or maybe using a full blown database, or anything in between). The choice of which method to use depends on what the OP needs. Several other answers already give some solutions to this, so I'm not going to repeat that here.
精彩评论