Up until recently, most of my data at home hadn’t been living in the safest environment. You see, like many people, I kept all my data on single hard drives, their only real protection being that most of them spent their lives unplugged, sitting next to my hard drive docking bay. Of course, tragedy struck one day when my playful feline companion decided that the power cord for one of the portable hard drives looked like something to play with and promptly pulled the drive onto the floor. Luckily nothing of real importance was on there (apart from my music collection, which had some of the oldest files I had ever managed to keep), but it did get me thinking about making my data a little more secure.
The easiest way to provide at least some level of protection was to get my data onto a RAID set so that at least a single disk failure wouldn’t take out my data again. I figured that if I put one large RAID in my media box and a second in my main PC (which I was planning to do anyway) then I could keep copies of the data on each of them, as RAID on its own is not a backup solution. A couple thousand dollars and a weekend later I was in possession of a new main PC and all the fixings of a new RAID set on my media PC ready to hold my data. Everything was looking pretty rosy for a while, but then the problems started.
Now the media PC that I had built was something of a beast, sporting enough RAM and a good enough graphics card to be able to play most recent games at high settings. Soon after I had finished building it I was going to a LAN with a bunch of mates of mine, one of whom was travelling from Melbourne and wasn’t able to bring his PC with him. Too easy, I thought: he can just use this new awesome beast of a box to play games with us and everything shall be good. In all honesty it was, until I saw him reboot it once and the RAID controller flashed up a warning about the RAID being critical, which sent chills down my spine.
Looking at the RAID UI in Windows I found that yes indeed one of the disks had dropped out of the RAID set, but there didn’t seem to be anything wrong with it. Confused I started the rebuild on the RAID set and it managed to complete successfully after a few hours, leaving me to think that I might have bumped a cable or something to trigger the “failure”. When I got it home however the problem kept recurring, but it was random and never seemed to follow a distinct pattern, except for it being the same disk every time. Eventually however it stabilized and so I figured that it was just a transient problem and left it at that.
Unfortunately for me it happened again last night, but it wasn’t the same disk this time. Figuring it was a bung RAID controller I was preparing to siphon my data off it in order to rebuild it as a software RAID when my wife asked me if I had actually tried Googling around to see if others had had the same issue. I had done so in the past but I hadn’t been very thorough with it so I decided that it was probably worth the effort, especially if it could save me another 4 hours of babying the copy process. What I found has made me deeply frustrated, not just with certain companies but also myself for not researching this properly.
The drives I bought all those months ago were Seagate ST2000DL003 2TB Green drives, which are cheap, low-power drives that seemed perfect for a large amount of RAID storage. However, there’s a slight problem with these kinds of drives when they’re put into a RAID set. You see, the hard drives have error correction built into them, but thanks to their “green” rating this process can be quite slow, anywhere from 10 seconds to minutes if the drive is under heavy load. RAID controllers are programmed to mark disks as failed if they stop responding for a certain period of time, usually a couple of seconds or so. That means that should a drive start correcting itself and not respond quickly enough, the RAID controller will mark the disk as failed and remove it, putting the array into a critical state.
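To make that failure mode a bit more concrete, here is a rough sketch in Python. It is a toy model only: the timeout value and the names (`CONTROLLER_TIMEOUT_S`, `controller_verdict`) are made up for illustration and don't come from any real controller or drive.

```python
# Toy model of a RAID controller's patience versus a drive's error recovery.
# All values are illustrative assumptions, not measured from real hardware.

CONTROLLER_TIMEOUT_S = 8.0  # assumed: how long the controller waits for a reply


def controller_verdict(drive_recovery_s: float,
                       timeout_s: float = CONTROLLER_TIMEOUT_S) -> str:
    """Return what the controller does when a drive takes this long to answer."""
    if drive_recovery_s > timeout_s:
        # The drive is still busy retrying the bad sector, so from the
        # controller's point of view it has simply stopped responding.
        return "dropped"
    return "ok"


print(controller_verdict(0.5))   # a healthy read
print(controller_verdict(30.0))  # a drive grinding through a slow recovery
```

The point is that the controller can't tell a drive that is slowly fixing itself from one that has died; all it sees is silence past the deadline.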
Seeing the potential for this to cause issues for everyone, hard drive manufacturers have developed a feature called Time-Limited Error Recovery (or Error Recovery Control for Seagate). TLER limits the amount of time the hard drive will spend attempting to recover from an error, so if the error can’t be dealt with within that time frame the drive reports it to the RAID controller instead, which leaves the disk in the RAID set and lets the controller handle the recovery. For the drives I had bought this setting is off by default, and a quick Google has shown that any attempts to change it are futile. Most other brands let you change this particular value, but these particular Seagate drives are unfortunately locked in this state.
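Here is a small Python sketch of what a recovery time limit changes. Again this is a toy model under my own assumptions: the 8 second controller timeout, the 7 second limit and the function names are hypothetical, chosen only so that the limit sits below the controller's timeout.

```python
# Toy model of time-limited error recovery. Numbers are illustrative assumptions.

CONTROLLER_TIMEOUT_S = 8.0  # assumed controller patience
ERC_LIMIT_S = 7.0           # assumed recovery cap, set below the controller timeout


def drive_response_time(recovery_needed_s, erc_limit_s=None):
    """Seconds until the drive answers, either with data or with an error."""
    if erc_limit_s is None:
        return recovery_needed_s  # no cap: the drive keeps retrying until done
    return min(recovery_needed_s, erc_limit_s)


def array_outcome(recovery_needed_s, erc_limit_s=None,
                  timeout_s=CONTROLLER_TIMEOUT_S):
    """Describe what happens to the array for one troublesome read."""
    responded_after = drive_response_time(recovery_needed_s, erc_limit_s)
    if responded_after > timeout_s:
        return "disk dropped, array critical"
    if erc_limit_s is not None and recovery_needed_s > erc_limit_s:
        # The drive gave up in time and reported the error instead of stalling.
        return "read error reported, controller repairs from redundancy"
    return "read succeeded"


print(array_outcome(30.0))               # limit off, as on my drives
print(array_outcome(30.0, ERC_LIMIT_S))  # limit on
```

With the limit off, a 30 second recovery blows straight past the controller's deadline. With it on, the drive answers with an error before the deadline and stays in the set.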
So where does this leave me? Well, apart from hoping that Seagate releases a firmware update that allows me to change that particular value, I’m up the proverbial creek without a paddle. Replacing these drives with similar drives from another manufacturer will set me back another $400 and a weekend’s worth of work, so it’s not something I’m going to do immediately. I’m going to pester Seagate and hope that they’ll release a fix for this, because other than that one issue they’ve been fantastic drives and I’d hate to have to get rid of them because of it. Hopefully they’re responsive about it, though judging by what people are saying on the Seagate forums I shouldn’t hold my breath. Still, it’s all I’ve got right now.