Possible to detect bit errors in memory in software?
A friend and I were curious as to whether you could detect levels of ionizing radiation by looking at rates of single bit errors in memory. I did a little research and I guess most errors are caught and fixe开发者_如何学运维d at the hardware level. Would there be any way to detect errors in software (say, in c code on a pc)?
I'm sure it depends on the architecture you're running on, but I'm pretty certain you won't be detecting any single bit errors in your memory any time soon. Most if not all RAM controllers should have implemented some form of ECC protection to safeguard against the rare bit problems RAM chips have. DDR RAM, for example, is VERY reliable compared to crap mediums like flash memory, which will be spec'd to REQUIRE X number of bits of ECC protection (somewhere between 8 and 16 or so) before they guarantee functionality. As long as you have under a certain number of bit errors, the bad bits will be corrected and probably unreported before even reaching the CPU software level.
Silent (Unreported) data corruption from something as simple as a single bit error is considered a huge "no-no" in the storage industry, so your memory manufacturer has probably done their darndest to prevent your application from seeing it, much less making you deal with it!
In any case, one common way to detect problems in any sort of memory is to run simple write compare loops over the address space. Write 0's to all your memory and read it back to detect stuck '1' data lines, write-read-compare F's to memory to detect stuck '0' data lines, and run a data ramp to help detect addressing problems. The width of the data ramp should adjust according to the address size. (i.e. 0x00, 0x01, 0x02... or 0x0000, 0x0001, 0x0002, etc). You can easily do these types of things using storage performance benchmarking tools like Iometer or similar, although it may be just as easy to write yourself.
Realistically, unless you're going to dedicate a lot of time to the problem, you might as well quit before you start. Even if you do detect an error, chances are pretty fair it's due to something like a power problem, not ionizing radiation (and you normally won't have any way to tell which you've encountered).
If you do decide to go ahead anyway, the obvious way to test is allocate some memory, write values to it, and read them back. You want to follow sufficiently predictable patterns that you can figure out the expected value is without reading from other memory (at least if you want to be able to isolate the error, and not just identify that something bad has happened).
If you really want to differentiate between ionizing radiation and other errors, it should at least be theoretically possible. Run your test on a number of computers at different altitudes simultaneously, and see if you see a higher rate at higher altitude.
If the errors are frequent enough that you have any chance of detecting them, you'd be in big trouble already - nothing would work. Or at least you'd feel like you were using Win95 all over again. I suspect you'd need a whole datacenter to have a chance of measuring this kind of error.
精彩评论