There are plenty of useful articles and guides on managing Linux software RAID floating around. I recently had a disk failure in my 8-bay eSATA array and thought I'd add to the mix. Here's what I did, including some notes specific to the Sans Digital 8-bay eSATA tower and recovery on RHEL6.
Fileserver and Disk Setup
I’ve consolidated all my various home machines onto a single, low-power fanless Hypervisor with an external 8-bay eSATA tower for NFS and local VM storage.
- Fanless Core i7 Hypervisor (i7-3630QM CPU @ 2.40GHz, 16G RAM)
- Sans Digital 8-bay eSATA tower
- Addonics ADSA3GPX1-2E eSATA PCI-E (in JBOD, using native sata_sil24 kernel module)
- 8 x 2TB Samsung Spinpoint F4 (mdadm RAID6 via XFS – 11TB usable)
- RHEL 6.7 x86_64
I set up mdadm to email me on any failure events. This array has been very good to me, with no failures since I purchased it back in 2010, until now. Linux mdadm does a nice job of summarizing what went wrong and the most likely culprit. Note that /dev/sde is going to be our failed disk, and in mdadm parlance "_" denotes a failed disk while "U" denotes an active, healthy one.
This is an automatically generated mail message from mdadm
running on poopsock.example.com

A Fail event had been detected on md device /dev/md1.

It could be related to component device /dev/sde.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md1 : active raid6 sdh sdb sde(F) sdg sdd sdf sdi sdc
      11721080448 blocks super 1.2 level 6, 64k chunk, algorithm 2 [8/7] [_UUUUUUU]
      bitmap: 5/15 pages [20KB], 65536KB chunk
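If you watch more than one array, pulling the failed member names out of /proc/mdstat programmatically can be handy. Here's a minimal sketch; the function just parses mdstat-formatted text from stdin, so you can dry-run it against a saved copy of the output above:

```shell
#!/bin/sh
# Sketch: list the members flagged (F) in mdstat-formatted text on stdin.
failed_members() {
  tr ' ' '\n' |                # one token per line
    grep '(F)' |               # (F) marks a failed member
    sed 's/\[.*//; s/(F).*//'  # strip any [slot] index and the (F) flag
}

# On a live system:
#   failed_members < /proc/mdstat
```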
Here’s my /etc/mdadm.conf for reference:
MAILADDR root,email@example.com
DEVICE partitions
ARRAY /dev/md1 level=raid6 num-devices=8 UUID=ae9b6c81-db90-4476
Sans Digital Specifics
The Sans Digital 8-bay eSATA tower I have maps each disk to a specific slot. For my model it's pretty straightforward:
- Count starts from bottom-up
- bottom slot = slot 1
- top slot = slot 8
- sata_sil24 kernel module automatically fails out disks
- e.g. (no need for mdadm --manage /dev/md1 --fail /dev/sde)
- Failed disks will not show red/amber LED activity
Replace the Physical Disk
This tower has hot-swappable sleds, making replacement very easy. Simply remove the failed disk, replace it with a disk of the same size or larger, and pop the sled back in. You should then see:
Feb 3 06:21:12 poopsock kernel: scsi 6:3:0:0: Direct-Access ATA ST32000542AS CC35 PQ: 0 ANSI: 5
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] Write Protect off
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: Attached scsi generic sg4 type 0
Feb 3 06:21:12 poopsock kernel: sde: unknown partition table
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] Attached SCSI disk
Check mdadm status
One overlooked gotcha with mdadm is that the sync_action flag can sometimes be busy, causing rebuilds or recoveries to fail. Let's quickly check that it is idle. If it's not, you can echo "idle" into it to set it as such.
cat /sys/block/md1/md/sync_action
idle
Now you can re-add your replacement disk to the array. I don't bother with pre-partitioning anything: it's a relic of the SCSI LUN days, a waste of time, and not needed for most modern filesystems. I simply use the entire disk and let mdadm and XFS figure it out. If you're using disks > 3TB you may need GPT labels for some setups, but I've not hit any issues yet.
mdadm --manage /dev/md1 --add /dev/sde
mdadm: re-added /dev/sde
Check Rebuild Status
You should be set now, but let’s check that it’s rebuilding as it should. First, check that sync_action reports recovery:
cat /sys/block/md1/md/sync_action
recover
Great, now let's check the actual progress via mdstat as well as with mdadm --detail:
Personalities : [raid6] [raid5] [raid4]
md1 : active raid6 sde sdg sdf sdi sdb sdc sdh sdd
      11721080448 blocks super 1.2 level 6, 64k chunk, algorithm 2 [8/7] [_UUUUUUU]
      [====>...............]  recovery = 20.3% (397693684/1953513408) finish=1344.3min speed=19287K/sec
      bitmap: 13/15 pages [52KB], 65536KB chunk
mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Tue Mar 8 20:15:00 2011
     Raid Level : raid6
     Array Size : 11721080448 (11178.09 GiB 12002.39 GB)
  Used Dev Size : 1953513408 (1863.02 GiB 2000.40 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Feb 3 12:40:39 2016
          State : clean, degraded, recovering
 Active Devices : 7
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 20% complete

           Name : localhost.localdomain:1
           UUID : ae9b6c81:db904476:a418a6df:d91356ae
         Events : 1370671

    Number   Major   Minor   RaidDevice State
       9       8       64        0      spare rebuilding   /dev/sde
       1       8       48        1      active sync   /dev/sdd
       2       8       32        2      active sync   /dev/sdc
       6       8      128        3      active sync   /dev/sdi
       4       8       16        4      active sync   /dev/sdb
       8       8       80        5      active sync   /dev/sdf
      10       8       96        6      active sync   /dev/sdg
       5       8      112        7      active sync   /dev/sdh
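When a rebuild is going to take the better part of a day, a one-line summary beats scanning mdstat each time. Here's a small sketch that condenses the mdstat progress line; it's pure text parsing and touches nothing on the array itself:

```shell
#!/bin/sh
# Sketch: condense the mdstat recovery line into a one-line summary.
rebuild_summary() {
  awk '/recovery|resync/ {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /%$/)       pct = $i
      if ($i ~ /^finish=/) { sub("finish=", "", $i); eta = $i }
      if ($i ~ /^speed=/)  { sub("speed=", "", $i); spd = $i }
    }
    if (pct != "") printf "%s done, ETA %s at %s\n", pct, eta, spd
  }'
}

# On a live system:
#   rebuild_summary < /proc/mdstat
```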
Adjusting Rebuild & Check Speeds
Normal storage access on this array can sometimes suffer during rebuilds or scrubs, so you can limit how much I/O the rebuild process is allowed to consume via the following:
echo "500" > /proc/sys/dev/raid/speed_limit_min
echo "50000" > /proc/sys/dev/raid/speed_limit_max
Conversely, you can raise the maximum (200M/sec is the kernel default) to speed things up:
echo "200000" > /proc/sys/dev/raid/speed_limit_max
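One gotcha: values echoed into /proc don't survive a reboot. If you want the throttle to stick, the same knobs are exposed as sysctl keys; for example in /etc/sysctl.conf (values in KB/sec, mirroring the echo commands above):

```
# /etc/sysctl.conf -- persistent rebuild throttle (values in KB/sec)
dev.raid.speed_limit_min = 500
dev.raid.speed_limit_max = 50000
```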
Lastly, you might want to adjust the stock raid-check cronjob to run only once a month. The mdadm devs do err on the side of caution, but I've never seen an issue with checking for bad blocks and data once a month instead.
cat /etc/cron.d/raid-check

# Run system wide raid-check once a week on Sunday at 1am by default
#0 1 * * Sun root /usr/sbin/raid-check <-- comment this out
# Once a month is good
0 1 1 * * root /usr/sbin/raid-check
Business as Usual
After some amount of time your recovery should complete. Note that large, slow SATA disks, while great for archival and general-purpose storage, take a really long time to rebuild. This is why I urge everyone using SATA RAID to use RAID6 instead of RAID5 for the extra parity disk: it's not uncommon to experience an additional disk failure during a long rebuild.
If all looks good your mdstat output should be clear again:
Personalities : [raid6] [raid5] [raid4]
md1 : active raid6 sde sdg sdf sdi sdb sdc sdh sdd
      11721080448 blocks super 1.2 level 6, 64k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 4/15 pages [16KB], 65536KB chunk

unused devices: <none>
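Since a clean array is the end goal, a tiny scripted check makes a nice backstop to the mdadm mails. A sketch: it just greps mdstat-formatted text for a "_" in the member-status field, so it's easy to dry-run:

```shell
#!/bin/sh
# Sketch: exit non-zero if mdstat-formatted text on stdin shows a
# missing member ("_" inside the [UU...] status field).
md_healthy() {
  ! grep -q '\[[U_]*_[U_]*\]'
}

# On a live system, e.g. from cron:
#   md_healthy < /proc/mdstat || echo "array degraded!"
```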
More Drives Lost? Failed eSATA Cable
As luck would have it, I had another failure of /dev/sde (my replacement drive was faulty), and during the rebuild an eSATA cable went bad. This took out the additional 4 drives on lane 2, so the array went offline.
The message below was printed for each drive on lane 2, i.e. the 4 drives connected to the second eSATA cable:
Feb 17 06:47:06 poopsock kernel: sd 7:2:0:0: [sdh] Add. Sense: Scsi parity error
Feb 17 06:47:06 poopsock kernel: sd 7:2:0:0: [sdh] CDB: Write(10): 2a 00 00 00 00 08 00 00 02 00
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdh, sector 8
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdh, sector 8
Feb 17 06:47:06 poopsock kernel: md: super_written gets error=-5, uptodate=0
Feb 17 06:47:06 poopsock kernel: md/raid:md1: Disk failure on sdh, disabling device.
Feb 17 06:47:06 poopsock kernel: md/raid:md1: Operation continuing on 4 devices.
Feb 17 06:47:06 poopsock kernel: sd 7:2:0:0: rejecting I/O to offline device
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdh, sector 0
Feb 17 06:47:06 poopsock kernel: sd 7:1:0:0: rejecting I/O to offline device
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdg, sector 0
Feb 17 06:47:06 poopsock kernel: sd 7:1:0:0: rejecting I/O to offline device
Feb 17 06:47:06 poopsock kernel: ata8: EH complete
Feb 17 06:47:06 poopsock kernel: ata8.00: detaching (SCSI 7:0:0:0)
I know /dev/sde in this case had failed and was mid-rebuild, but the other drives are fine. mdadm is conservative and will set a FAULTY flag on good drives when they disappear.
First, I commented the /dev/md1 array out of /etc/fstab so that the host wouldn't try to mount it at boot. Then I powered down the server and array, replaced the faulty eSATA cable, and powered things back up, array first so the drives were spun up.
Next, I force-assembled the array to clear the FAULTY flags, listing only the known-good disks. /dev/sde had not been fully rebuilt yet and so needs to be omitted.
mdadm --assemble --force /dev/md1 /dev/sdd /dev/sdc /dev/sdi /dev/sdb /dev/sdf /dev/sdg /dev/sdh
mdadm: forcing event count in /dev/sdf(5) from 1733689 upto 1733732
mdadm: forcing event count in /dev/sdg(6) from 1733689 upto 1733732
mdadm: forcing event count in /dev/sdh(7) from 1733689 upto 1733732
mdadm: forcing event count in /dev/sdi(3) from 1733688 upto 1733732
mdadm: clearing FAULTY flag for device 2 in /dev/md1 for /dev/sdi
mdadm: clearing FAULTY flag for device 4 in /dev/md1 for /dev/sdf
mdadm: clearing FAULTY flag for device 5 in /dev/md1 for /dev/sdg
mdadm: clearing FAULTY flag for device 6 in /dev/md1 for /dev/sdh
mdadm: Marking array /dev/md1 as 'clean'
mdadm: /dev/md1 has been started with 7 drives (out of 8).
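The "forcing event count" lines are the key detail: members whose event counters are close together missed very little activity, which is what makes a forced assemble reasonably safe. Before forcing, you can eyeball the counters yourself with mdadm --examine on each member. A parsing sketch for that output (the "Events : N" line format is the assumption here):

```shell
#!/bin/sh
# Sketch: pull the Events counter out of "mdadm --examine" output on stdin.
event_count() {
  awk '/^[[:space:]]*Events :/ {print $NF}'
}

# On a live system, compare members before forcing an assemble:
#   for d in /dev/sd[b-i]; do echo "$d $(mdadm --examine $d | event_count)"; done
```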
Now I can rebuild the replaced disk /dev/sde.
mdadm --manage /dev/md1 --add /dev/sde
mdadm: added /dev/sde
Sometimes mdadm will kick a drive out of an array if it doesn't respond within a predetermined amount of time (30 seconds by default for reads). In a lot of cases mdadm is just being conservative and erring on the side of caution. Of note, eSATA cables are notoriously unreliable and known to become flaky or poop out entirely.
Unless you've got spares lying around and want to play it really safe, it's sometimes worth re-inserting the failed disk into the array and letting it rebuild before replacing it, especially if you don't hear any tell-tale failure sounds (clicking noises).
You can also do a simple dd read test; if you get I/O errors like the ones below, your drive has most likely failed.
dd if=/dev/sde of=/root/testdisk bs=5G count=1
dd: reading `/dev/sde': Input/output error
0+0 records in
0+0 records out
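The same idea works as a reusable check. Reading into /dev/null avoids writing a multi-gigabyte scratch file, and taking the device as an argument means you can sanity-test the function itself against an ordinary file first:

```shell
#!/bin/sh
# Sketch: read the first 64MB of a device; exit status is non-zero if
# dd hits an I/O error, which usually means the drive is on its way out.
read_check() {
  dd if="$1" of=/dev/null bs=1M count=64 2>/dev/null
}

# e.g. read_check /dev/sde && echo "reads OK" || echo "read errors"
```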
Lastly, you can use smartctl to investigate more closely, paying attention to the UDMA_CRC_Error_Count value, though smartctl isn't always definitive here either and can produce a lot of false positives.
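smartctl -A prints one row per SMART attribute with the raw value in the last column. A small sketch to pull just the CRC counter (the sample row in the test mirrors typical smartctl output; exact columns can vary by drive):

```shell
#!/bin/sh
# Sketch: print the raw UDMA_CRC_Error_Count from "smartctl -A" output
# on stdin. A steadily rising value points at cabling, not the platters.
crc_errors() {
  awk '/UDMA_CRC_Error_Count/ {print $NF}'
}

# On a live system:
#   smartctl -A /dev/sde | crc_errors
```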
8-Bay Tower and Software RAID Caveats
If you have issues with disks being renamed, not coming up, or being thrown out of the array with the Sans Digital tower, make sure you're doing the following:
Use a write-intent bitmap. This tracks which regions of the array may be out of sync, so a re-added disk only needs a partial resync rather than a full rebuild. You can set this up at any time via:
mdadm /dev/md1 --grow --bitmap=internal
Use DEVICE partitions in your /etc/mdadm.conf; this removes the need for udev or disk-by-id labeling of your disks within mdadm. Here's my /etc/mdadm.conf:
DEVICE partitions
ARRAY /dev/md1 level=raid6 num-devices=8 UUID=ae9b6c81-db90-4476-a418-a6dfd91356ae
Drive Timeouts, Rebuilds and SCTRC / TLER
My older eSATA controller and drive bay can be a bit ornery when it comes to timeouts, which often results in drives being kicked out of the array during a rebuild.
A blog post I found recently does a great job of describing a workaround: either enable the SCT Error Recovery Control setting on your drives (if supported) or, worst case, set a higher timeout within Linux itself.
First, determine if your drive(s) supports SCTRC / TLER functionality:
for drive in /sys/block/sd*; do drive="/dev/$(basename $drive)"; echo "$drive:"; smartctl -l scterc $drive; done
If you see this then they support it:
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
You can then apply this setting via smartctl again on the drives that do:
smartctl -q errorsonly -l scterc,70,70 /dev/sdb
If they all do, cool. If you're like me and they're cheap, older consumer drives that do not, you'd see something like:
SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled
You can instead set a liberal per-drive timeout as a workaround; this got me through my rather old, ornery eSATA controller repeatedly kicking out a drive during replacement and the subsequent rebuild.
for drive in sdh sdf sdi; do echo "180" > /sys/block/$drive/device/timeout; done
NOTE: This needs to run at every boot: use the rc-local systemd service on CentOS 7 and beyond, or /etc/rc.local on EL6 and below. It's also worth noting that mdadm can be conservative about timeout-based failure conditions, but with most modern controllers, even with consumer-level drives, this usually isn't needed.
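As a sketch, something like this dropped into /etc/rc.local reapplies the timeout at boot. It loops over all sd* devices rather than hard-coding drive letters, since names can move between boots; the SYSFS_BLOCK variable is only there so the loop can be dry-run in a scratch directory instead of the real sysfs tree:

```shell
#!/bin/sh
# Sketch for /etc/rc.local: raise the SCSI command timeout on all sd*
# devices at boot. SYSFS_BLOCK defaults to the real sysfs tree.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

set_timeouts() {
  for t in "$SYSFS_BLOCK"/sd*/device/timeout; do
    if [ -w "$t" ]; then
      echo 180 > "$t"
    fi
  done
}

# In rc.local you would simply call:
#   set_timeouts
```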