> and then when errors are detected you are supposed to restore from a replica/b...

toast0 · on Sept 26, 2018

> finally, ECC is a bit useless if your whole drive or device fails, which is a much more common failure mode.

I'm not sure that this is true. Almost all the platter-based drives I've had fail gave some warning signs before they became completely inaccessible. (The one exception I can remember was a drive whose circuit board I damaged with the wrong type of screw.) My personal experience is just anecdotes, but I managed a couple thousand disks at work, and we ended up swapping about one disk a week. SMART defect counts were very predictive of future disk failures, although even then the disks would usually be partially readable.

SSDs, on the other hand, had a much lower failure rate, but the failure mode was always that the disk disappears from the bus. We could never figure out a way to predict that (our write volumes aren't very high). Occasionally we'd see a large increase in defect count and a slowdown in access for a while as a drive reallocated a large block, but if we waited for it to settle, everything would be fine afterwards.

FooBarWidget · on Sept 26, 2018

> how do you distribute the ECC sensibly? you can't.

I am not an expert on ECC. Are you saying that you can't store the ECC just anywhere? Just storing it alongside the data is not good enough? Why does ECC need special treatment?

> storage is cheap nowadays, and even double or triple redundancy is cheaper and more straight-forward than trying to be clever.

Tell that to Apple, who charges $500+ for a 1 TB SSD upgrade in a MacBook Pro. :-( "Cheap" is relative. I am worried about bitrot on my laptop, but I also don't want to halve my disk space in order to protect against that.

pwg · on Sept 26, 2018

> > how do you distribute the ECC sensibly? you can't.

> I am not an expert on ECC. Are you saying that you can't store the ECC just anywhere?

Well, you can put it "just anywhere", but /where/ you put it determines /what failure types/ you can recover from.

> Just storing it alongside the data is not good enough? Why does ECC need special treatment?

If you want to recover from bitrot, then putting the ECC data for a sector alongside the data in the same sector is sufficient (you'll have fewer bytes stored per sector, but if a bit flips, you can recover the original data).
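To make the "ECC alongside the data" case concrete, here is a minimal sketch using a Hamming(7,4) code: 4 data bits are stored together with 3 check bits, and any single flipped bit in the 7-bit codeword can be corrected. This is only a stand-in for illustration; real drives use much stronger codes, and the function names below are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (hypothetical helper).

    Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit (0 = no error).
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate bitrot: one bit flips on the medium
recovered = hamming74_decode(stored)
```

Note the cost: 7 bits on disk for every 4 bits of payload in this toy code. That is the "fewer bytes stored per sector" tradeoff, just exaggerated.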

But storing the ECC in the same sector as the data it protects will not protect against losing the entire sector (a "drive can't read whole sector" error). In that case both the data and the ECC are lost simultaneously, so the ECC cannot help. So if you want to protect against loss of a sector, you need your ECC stored somewhere else (i.e., on a different sector whose failure is unlikely to be correlated with the lost one) so that you still have the ECC available when the sector you are protecting goes away.

And if you are protecting against loss of an entire physical drive, then the ECC for the drive needs to be on another physical drive (the same reasons apply as for a sector, just at the level of a whole physical disk).
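The whole-drive case can be sketched with plain XOR parity, which is the idea behind RAID-style parity. The redundancy lives on a separate "drive", so losing any one drive still leaves enough to rebuild it. The byte strings and helper names here are hypothetical, and this assumes you know which drive was lost (an erasure), not that you have to detect which one is silently wrong.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(drives):
    """Compute the parity block that would be stored on a separate drive."""
    return xor_blocks(drives)


def rebuild_lost_drive(surviving_drives, parity):
    """Reconstruct the one missing drive from the survivors plus parity."""
    return xor_blocks(list(surviving_drives) + [parity])


# Three hypothetical data drives, each holding one 4-byte block.
drives = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(drives)

# Drive 1 dies completely; rebuild it from the other two and the parity.
rebuilt = rebuild_lost_drive([drives[0], drives[2]], parity)
```

If the parity were kept on the same drive as the data it covers, that drive dying would take both with it, which is the same argument as for sectors.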

It is all tradeoffs. You /can/ put it anywhere, but where you choose to store it determines which failure types you can recover from.

blattimwind · on Sept 26, 2018

Disk drives use sector-level ECC, so silent sector corruption should be (and IME is) rarer than unrecoverable sectors.

TheAceOfHearts · on Sept 26, 2018

Buy an external HDD and sync important files when you're at home. It's unlikely you'd suffer from bitrot in files created while you're away from home.

blattimwind · on Sept 26, 2018

> many storage technologies do protect against errors, CD/DVD/HDD all have ECC in the physical layer. but without control/knowledge of the physical medium (so at filesystem level), how do you distribute the ECC sensibly? you can't.

SSDs and other flash storage heavily rely on ECC as well.