Special considerations when performing file I/O on an NFS share via a Python-based daemon?
I have a python-based daemon that provides a REST-like interface over HTTP to some command line tools. The general nature of the tool is to take in a request, perform some command-line action, store a pickled data structure to disk, and return some data to the caller. There's a secondary thread spawned on daemon startup that looks at that pickled data on disk periodically and does some cleanup based on what's in the data.
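Roughly, the persistence step amounts to the sketch below (simplified; the real file names and data structures are in the code linked further down, the ones here are made up):
import pickle

def save_state(state, path="/tech/condor_logs/submit/state.pkl"):
    # Persist the request's bookkeeping data so the cleanup thread can read it later
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_state(path="/tech/condor_logs/submit/state.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)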
This works just fine if the disk where the pickled data resides happens to be local disk on a Linux machine. If you switch to an NFS-mounted disk, the daemon starts life just fine, but over time the NFS-mounted share "disappears" and the daemon can no longer tell where it is on disk with calls like os.getcwd(). You'll start to see it log errors like:
2011-07-13 09:19:36,238 INFO Retrieved submit directory '/tech/condor_logs/submit'
2011-07-13 09:19:36,239 DEBUG CondorAgent.post_submit.do_submit(): handler.path: /condor/submit?queue=Q2%40scheduler
2011-07-13 09:19:36,239 DEBUG CondorAgent.post_submit.do_submit(): submitting from temporary submission directory '/tech/condor_logs/submit/tmpoF8YXk'
2011-07-13 09:19:36,240 ERROR Caught un-handled exception: [Errno 2] No such file or directory
2011-07-13 09:19:36,241 INFO submitter - - [13/Jul/2011 09:19:36] "POST /condor/submit?queue=Q2%40scheduler HTTP/1.1" 500 -
The un-handled exception resolves to the daemon being unable to see the disk any more. Any attempt to figure out the daemon's current working directory with os.getcwd() at this point will fail. Even trying to change to the root of the NFS mount, /tech, will fail. All the while the logger.logging.* methods are happily writing out log and debug messages to a log file located on the NFS-mounted share at /tech/condor_logs/logs/CondorAgentLog.
The disk is most definitely still available. There are other, C++-based daemons reading and writing on this share at a much higher frequency at the same time as the Python-based daemon.
I've come to an impasse debugging this problem. Since it works on local disk, the general structure of the code must be good, right? There's something about NFS-mounted shares and my code that is incompatible, but I can't tell what it might be.
Are there special considerations one must take into account when writing a long-running Python daemon that will be reading and writing frequently to an NFS-mounted file share?
If anyone wants to see the code, the portion that handles the HTTP request and writes the pickled object to disk is on GitHub here. The portion that the sub-thread uses to do periodic cleanup by reading the pickled objects back from disk is here.
I have the answer to my problem, and it had nothing to do with the fact that I was doing file I/O on an NFS share. It turns out the problem just showed up faster if the I/O was over an NFS mount versus local disk.
A key piece of information is that the code was running threaded via the SocketServer.ThreadingMixIn and HTTPServer classes.
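For context, the server setup boils down to the standard-library threading pattern below (a simplified Python 2-era sketch; the port number and handler body are placeholders, not the real code):
import SocketServer
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

class ThreadedHTTPServer(SocketServer.ThreadingMixIn, HTTPServer):
    """Dispatch each incoming HTTP request to its own thread."""
    pass

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # the submit handling described below runs here, one thread per request
        self.send_response(200)
        self.end_headers()

ThreadedHTTPServer(("", 8080), Handler).serve_forever()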
My handler code was doing something close to the following:
import os
import tempfile

base_dir = getBaseDirFromConfigFile()           # static base dir from config
current_dir = os.getcwd()                       # remember where we started
temporary_dir = tempfile.mkdtemp(dir=base_dir)  # per-request scratch space
os.chdir(temporary_dir)                         # process-wide directory change!
doSomething()
os.chdir(current_dir)                           # "go back" to where we started
cleanUp(temporary_dir)
That's the flow, more or less.
The problem wasn't that the I/O was being done on NFS. The problem was that os.getcwd() isn't thread-local, it's a process global. So as one thread issued a chdir() to move to the temporary space it had just created under base_dir, the next thread calling os.getcwd() would get the other thread's temporary_dir instead of the static base directory where the HTTP server was started.
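A standalone way to see this (a small demonstration script, not part of the daemon): one thread calls chdir() and every other thread's idea of the current directory moves with it.
import os
import tempfile
import threading
import time

start_dir = os.getcwd()

def hop():
    # chdir() changes the working directory of the whole process,
    # not just of this thread
    tmp = tempfile.mkdtemp()
    os.chdir(tmp)
    time.sleep(0.5)
    os.chdir(start_dir)
    os.rmdir(tmp)

t = threading.Thread(target=hop)
t.start()
time.sleep(0.1)
# The main thread never called chdir(), yet its "current directory" has moved:
print(os.getcwd() == start_dir)   # False while the other thread is inside tmp
t.join()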
Other people have reported similar issues here and here.
The solution was to get rid of the chdir() and getcwd() calls: start up in one directory, stay there, and access everything else through absolute paths.
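A minimal sketch of what the fixed flow looks like, reusing the placeholder names from the snippet above (getBaseDirFromConfigFile(), doSomething(), buildSubmitContents() and the work_dir parameter are illustrative, not the real API):
import os
import shutil
import tempfile

base_dir = getBaseDirFromConfigFile()     # placeholder, as above

# Create per-request scratch space but never chdir() into it;
# everything inside it is addressed with absolute paths.
temporary_dir = tempfile.mkdtemp(dir=base_dir)
submit_path = os.path.join(temporary_dir, "job.submit")
with open(submit_path, "w") as f:
    f.write(buildSubmitContents())        # placeholder
doSomething(work_dir=temporary_dir)       # placeholder; pass the directory explicitly
shutil.rmtree(temporary_dir)              # cleanup never touches the process-wide cwd
If the command-line step genuinely needs a working directory, subprocess.Popen(..., cwd=temporary_dir) confines the change to the child process rather than the whole daemon.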
The NFS vs. local disk behavior threw me for a loop. It turns out my block:
os.chdir(temporary_dir)
doSomething()
os.chdir(current_dir)
cleanUp(temporary_dir)
was running much slower when the filesystem was NFS versus local. That made the problem occur much sooner, because it increased the chances that one thread was still in doSomething() while another thread was running the current_dir = os.getcwd() part of the code block. On local disk the threads moved through the entire code block so quickly that they rarely intersected like that. But given enough time (about a week), the problem would crop up even on local disk.
So a big lesson learned about thread-safe operations in Python!
To answer the question literally: yes, there are some gotchas with NFS. For example:
NFS is not cache coherent, so if several clients are accessing a file they might get stale data.
In particular, you cannot rely on O_APPEND to atomically append to files.
Depending on the NFS server, O_CREAT|O_EXCL might not work properly (it does work properly on modern Linux, at least); a link()-based workaround is sketched after this list.
Especially older NFS servers have deficient or completely non-working locking support. Even on more modern servers, lock recovery can be a problem after server and/or client reboot. NFSv4, a stateful protocol, ought to be more robust here than older protocol versions.
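For the O_CREAT|O_EXCL point, the classic workaround described in the open(2) man page is to create a uniquely named file and then link(2) it to the lock name; a rough Python sketch of that technique (the naming scheme here is just one possibility):
import os
import socket

def acquire_nfs_lock(lockfile):
    """link()-based lock that does not depend on O_EXCL working over NFS.
    Returns True if the lock was acquired; release it with os.unlink(lockfile)."""
    unique = "%s.%s.%d" % (lockfile, socket.gethostname(), os.getpid())
    fd = os.open(unique, os.O_WRONLY | os.O_CREAT, 0o644)
    os.close(fd)
    try:
        os.link(unique, lockfile)          # atomic even on older NFS servers
        return True
    except OSError:
        # link() can report failure even though it succeeded on the server;
        # the link count on the unique file tells the truth.
        return os.stat(unique).st_nlink == 2
    finally:
        os.unlink(unique)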
All this being said, it sounds like your problem isn't really related to any of the above. In my experience, the Condor daemons will at some point, depending on the configuration, clean up files left over from jobs that have finished. My guess would be to look for the suspect there.