Delayed Write errors
For the past few months, we've been losing data to Delayed Write errors. I've experienced the error with both custom code and shrink-wrap applications. For example, the error message below came from Visual Studio 2008 while building a solution:
Windows - Delayed Write Failed : Windows was unable to save all the data for the file \Vital\Source\Other\OCHSHP\Done07\LHFTInstaller\Release\LHFAI.CAB. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
When it occurs in Adobe, Visual Studio, or Word, for example, no harm is done. The major problem is when it occurs in our custom applications (straight C apps that write data to dBase files on a network share).
From the program's perspective, the write succeeds. It deletes the source data, and goes on to the next record. A few minutes later, Windows pops up an error message saying that a delayed write occurred and the data was lost.
My question is: what can we do to help our networking/server teams isolate and correct the problem (read: convince them the problem is real; telling them many, many times hasn't convinced them yet), and do you have any suggestions for how we can write our code to avoid the data loss?
Writes on Windows, like any modern operating system, are not actually sent to the disk until the OS gets around to it. This is a big performance win, but the problem (as you have found) is that you cannot detect errors at the time of the write.
Every operating system that does asynchronous writes also provides mechanisms for forcing data to disk. On Windows, the FlushFileBuffers or _commit function will do the trick. (The former is for HANDLEs, the latter for file descriptors.)
Note that you must check the return value of every disk write, and the return value of these synchronizing functions, in order to be certain the data made it to disk. Also note that these functions block and wait for the data to reach disk -- even if you are writing to a network server -- so they can be slow. Do not call them until you really need to push the data to stable storage.
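As a rough illustration of the advice above, here is a minimal sketch in C of a write path that checks every return value and commits before reporting success. The function name append_record_durably and the COMMIT_FD macro are made up for this example, and it assumes the application writes through stdio (FILE *); code that uses Win32 HANDLEs directly would call FlushFileBuffers instead. On non-Windows systems the sketch falls back to fsync so it can be compiled anywhere.

```c
#ifndef _WIN32
#define _POSIX_C_SOURCE 200809L  /* expose fileno() and fsync() */
#endif

#include <stdio.h>

#ifdef _WIN32
#include <io.h>                  /* _commit */
#define COMMIT_FD(fp) _commit(_fileno(fp))
#else
#include <unistd.h>              /* fsync */
#define COMMIT_FD(fp) fsync(fileno(fp))
#endif

/* Hypothetical helper: append one record to `path` and force it to
 * stable storage. Returns 0 only if every step succeeded, -1 otherwise. */
int append_record_durably(const char *path, const void *rec, size_t len)
{
    FILE *fp = fopen(path, "ab");
    if (fp == NULL)
        return -1;

    int ok = fwrite(rec, 1, len, fp) == len  /* into the stdio buffer   */
          && fflush(fp) == 0                 /* stdio buffer -> OS cache */
          && COMMIT_FD(fp) == 0;             /* OS cache -> disk/server  */

    /* fclose can also report a failed write, so check it as well. */
    if (fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

The caller would delete the source record only when this returns 0. Note the fflush call: _commit only pushes what the OS already has, so data still sitting in the C runtime's own buffer has to be flushed first. Opening and committing per record is the slow-but-safe extreme; batching several records per commit is a reasonable trade-off if throughput matters.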
For more, see fsync() Across Platforms.
You have a corrupted file system or a hard disk that is failing. The networking/server team should scan the disk to fix the former and detect the latter. Also check the error log to see if it tells you anything. If the error log indicates a failure to write to the hardware, then you need to replace the disk.