Linux RAID Disk Replacement with Sans Digital 8-bay eSATA tower

There are plenty of useful articles and guides on managing Linux software RAID floating around.  I recently had a disk failure in my 8-bay eSATA array and thought I’d add to the mix.  Here’s what I did, including some notes specific to the Sans Digital 8-bay eSATA tower and recovery on RHEL6.

 

Fileserver and Disk Setup
I’ve consolidated all my various home machines onto a single, low-power fanless Hypervisor with an external 8-bay eSATA tower for NFS and local VM storage.


Failure Scenario
I set up mdadm to email me on any failure events.  This array has been very good to me, with no failures since I purchased it back in 2010, until now.  Linux mdadm does a nice job of summarizing what went wrong and the most likely culprit.   Note that /dev/sde is going to be our failed disk, and in mdadm parlance “_” denotes a failed or missing disk while “U” denotes an active, healthy one.

This is an automatically generated mail message from mdadm
running on poopsock.example.com

A Fail event had been detected on md device /dev/md1.

It could be related to component device /dev/sde.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md1 : active raid6 sdh[5] sdb[4] sde[9](F) sdg[10] sdd[1] sdf[8] 
      sdi[6] sdc[2]
      11721080448 blocks super 1.2 level 6, 64k chunk, algorithm 2 
      [8/7] [_UUUUUUU]
      bitmap: 5/15 pages [20KB], 65536KB chunk

Here’s my /etc/mdadm.conf for reference:

MAILADDR root,will@example.com
DEVICE partitions
ARRAY /dev/md1 level=raid6 num-devices=8 UUID=ae9b6c81-db90-4476-a418-a6dfd91356ae
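
If you want to confirm that alert mail actually gets through before you need it, mdadm can generate a test event for every array it finds; this is a one-off run rather than something you leave going:

mdadm --monitor --scan --oneshot --test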

Sans Digital Specifics
The Sans Digital 8-bay eSATA tower I have uses a specific mapping between slots and disks.  For my model it’s pretty straightforward:

  • Count starts from bottom-up
    • bottom slot = slot 1
    • top slot = slot 8
  • sata_sil24 kernel module automatically fails out disks
    • e.g. no need to run mdadm --manage /dev/md1 --fail /dev/sde yourself (see the example after this list)
  • Failed disks will not show red/amber LED activity
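
For reference, on setups where the controller doesn’t fail the disk out for you, the manual equivalent before pulling the sled would be something along these lines (with /dev/sde as the suspect disk):

mdadm --manage /dev/md1 --fail /dev/sde
mdadm --manage /dev/md1 --remove /dev/sde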

Note: slot 8 showed a solid green light while slots 1-7 showed red/amber activity.

Replace the Physical Disk
This tower has hot-swappable sleds, making replacement very easy.  Simply pull the failed disk, swap in a disk of the same size or larger, and pop the sled back in.  You should then see something like this in the kernel log:

Feb 3 06:21:12 poopsock kernel: scsi 6:3:0:0: Direct-Access     
ATA      ST32000542AS     CC35 PQ: 0 ANSI: 5
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] 3907029168 
512-byte logical blocks: (2.00 TB/1.81 TiB)
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] Write Protect off
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] Write cache: 
enabled, read cache: enabled, doesn't support DPO or FUA
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: Attached scsi generic 
sg4 type 0
Feb 3 06:21:12 poopsock kernel: sde: unknown partition table
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] Attached SCSI disk

Check mdadm status
One overlooked gotcha with mdadm is that the sync_action flag can sometimes be busy, causing rebuilds or recoveries to fail.  Let’s quickly check that it is idle; if it’s not, you can echo “idle” into it as shown below.

cat /sys/block/md1/md/sync_action
idle
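
If it reports something other than idle (check, resync, etc.), you can nudge it back yourself:

echo idle > /sys/block/md1/md/sync_action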

Start Recovery
Now you can re-add your replacement disk to the array.  I don’t bother with pre-partitioning anything; it’s a waste of time, a relic from the SCSI LUN days, and not needed for most modern filesystems.  I simply use the entire disk and let mdadm and XFS figure it out.  If you’re using disks > 3TB you may need GPT labels for some setups, but I’ve not hit any issues yet.

mdadm --manage /dev/md1 --add /dev/sde
mdadm: re-added /dev/sde
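
As an aside, if you ever do need GPT labels on a larger disk, a minimal sketch with parted looks something like this (the device name is just an example, and you’d then add the resulting partition rather than the whole disk):

parted -s /dev/sde mklabel gpt
parted -s /dev/sde mkpart primary 1MiB 100%
mdadm --manage /dev/md1 --add /dev/sde1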

Check Rebuild Status
You should be set now, but let’s check that it’s rebuilding as it should.  First, check that sync_action reports recovery:

cat /sys/block/md1/md/sync_action
recover

Great, now let’s check the actual progress via mdstat as well as with mdadm --detail:

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid6 sde[9] sdg[10] sdf[8] sdi[6] sdb[4] sdc[2] 
      sdh[5] sdd[1]
      11721080448 blocks super 1.2 level 6, 64k chunk, algorithm 2 
      [8/7] [_UUUUUUU]
      [====>...............] recovery = 20.3% (397693684/1953513408) 
      finish=1344.3min speed=19287K/sec
      bitmap: 13/15 pages [52KB], 65536KB chunk
mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Tue Mar  8 20:15:00 2011
     Raid Level : raid6
     Array Size : 11721080448 (11178.09 GiB 12002.39 GB)
  Used Dev Size : 1953513408 (1863.02 GiB 2000.40 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Feb  3 12:40:39 2016
          State : clean, degraded, recovering 
 Active Devices : 7
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 20% complete

           Name : localhost.localdomain:1
           UUID : ae9b6c81:db904476:a418a6df:d91356ae
         Events : 1370671

    Number   Major   Minor   RaidDevice State
       9       8       64        0      spare rebuilding   /dev/sde
       1       8       48        1      active sync   /dev/sdd
       2       8       32        2      active sync   /dev/sdc
       6       8      128        3      active sync   /dev/sdi
       4       8       16        4      active sync   /dev/sdb
       8       8       80        5      active sync   /dev/sdf
      10       8       96        6      active sync   /dev/sdg
       5       8      112        7      active sync   /dev/sdh
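
If you’d rather not keep re-running these by hand, a simple watch in another terminal works fine:

watch -n 60 cat /proc/mdstat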

Adjusting Rebuild & Check Speeds
You can throttle how much I/O bandwidth gets thrown at the rebuild process via the following (the values are in KB/sec per device); on this array, normal storage access can sometimes be problematic during rebuilds or scrubs:

echo "500" > /proc/sys/dev/raid/speed_limit_min
echo "50000" > /proc/sys/dev/raid/speed_limit_max

Conversely, you can raise the ceiling to speed things up (check /proc/sys/dev/raid/speed_limit_max for your kernel’s default):

echo "200000" > /proc/sys/dev/raid/speed_limit_max

Lastly, you might want to adjust the normal raid-check cron job to run only once a month.  The mdadm devs err on the side of caution with a weekly check, but I’ve never seen an issue with scrubbing for bad blocks and data once a month instead.

cat /etc/cron.d/raid-check 
# Run system wide raid-check once a week on Sunday at 1am by default
#0 1 * * Sun root /usr/sbin/raid-check <-- comment this out
# Once a month (1am on the 1st) is good
0 1 1 * * root /usr/sbin/raid-check

Business as Usual
After some amount of time your recovery should complete.  Note that large, slow SATA disks, while great for archival and general-purpose storage, take a really long time to rebuild.  This is why I urge everyone using SATA RAID to use RAID6 instead of RAID5 for the extra parity disk; it’s not uncommon to experience an additional disk failure during a long rebuild.

If all looks good your mdstat output should be clear again:

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid6 sde[9] sdg[10] sdf[8] sdi[6] sdb[4] sdc[2] sdh[5] 
      sdd[1]
      11721080448 blocks super 1.2 level 6, 64k chunk, algorithm 2 
      [8/8] [UUUUUUUU]
      bitmap: 4/15 pages [16KB], 65536KB chunk

unused devices: <none>

More Drives Lost?  Failed eSATA Cable
As luck would have it, I had another failure of /dev/sde (my replacement drive was faulty), and during the rebuild an eSATA cable went bad.  This took out the four additional drives on lane 2, so the whole array went offline.

The message below was printed for each of the four drives on lane 2, i.e. the drives connected to the second eSATA cable:

Feb 17 06:47:06 poopsock kernel: sd 7:2:0:0: [sdh] Add. Sense: Scsi parity error
Feb 17 06:47:06 poopsock kernel: sd 7:2:0:0: [sdh] CDB: Write(10): 2a 00 00 00 00 08 00 00 02 00
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdh, sector 8
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdh, sector 8
Feb 17 06:47:06 poopsock kernel: md: super_written gets error=-5, uptodate=0
Feb 17 06:47:06 poopsock kernel: md/raid:md1: Disk failure on sdh, disabling device.
Feb 17 06:47:06 poopsock kernel: md/raid:md1: Operation continuing on 4 devices.
Feb 17 06:47:06 poopsock kernel: sd 7:2:0:0: rejecting I/O to offline device
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdh, sector 0
Feb 17 06:47:06 poopsock kernel: sd 7:1:0:0: rejecting I/O to offline device
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdg, sector 0
Feb 17 06:47:06 poopsock kernel: sd 7:1:0:0: rejecting I/O to offline device
Feb 17 06:47:06 poopsock kernel: ata8: EH complete
Feb 17 06:47:06 poopsock kernel: ata8.00: detaching (SCSI 7:0:0:0)

I know /dev/sde in this case is genuinely failed and was mid-rebuild, but the other drives are fine.  mdadm is conservative and will set a FAULTY flag on good drives if they suddenly disappear.

The Fix
First, I commented the /dev/md1 array out of /etc/fstab so the system wouldn’t try to mount it at boot.  Then I powered down the server and the array, replaced the faulty eSATA cable, and powered things back up, array first so the drives were spun up before the server came back.

Next, I force-assembled the array with just the known good disks to clear the FAULTY flags.  /dev/sde had not been fully rebuilt yet and so needed to be omitted.
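
Before forcing anything, it’s worth eyeballing the per-disk event counters with --examine to confirm the kicked drives are only a handful of events behind the rest; a quick sketch (adjust the device glob to your setup):

mdadm --examine /dev/sd[b-i] | egrep '^/dev|Events'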

mdadm --assemble --force /dev/md1 /dev/sdd /dev/sdc /dev/sdi /dev/sdb /dev/sdf /dev/sdg /dev/sdh
mdadm: forcing event count in /dev/sdf(5) from 1733689 upto 1733732
mdadm: forcing event count in /dev/sdg(6) from 1733689 upto 1733732
mdadm: forcing event count in /dev/sdh(7) from 1733689 upto 1733732
mdadm: forcing event count in /dev/sdi(3) from 1733688 upto 1733732
mdadm: clearing FAULTY flag for device 2 in /dev/md1 for /dev/sdi
mdadm: clearing FAULTY flag for device 4 in /dev/md1 for /dev/sdf
mdadm: clearing FAULTY flag for device 5 in /dev/md1 for /dev/sdg
mdadm: clearing FAULTY flag for device 6 in /dev/md1 for /dev/sdh
mdadm: Marking array /dev/md1 as 'clean'
mdadm: /dev/md1 has been started with 7 drives (out of 8).

Rebuild Again
Now I can rebuild the replaced disk /dev/sde.

mdadm --manage /dev/md1 --add /dev/sde
mdadm: added /dev/sde

Further Notes
Sometimes md will kick a drive out of an array if it stops responding; the kernel’s SCSI command timeout is 30 seconds by default, so a drive that hangs longer than that on a read gets errored out and failed.  In a lot of cases mdadm is conservative and errs on the side of caution.  Of note, eSATA cables are notoriously unreliable and are known to become flaky or poop out entirely.
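
If you suspect a slow-to-respond drive rather than a truly dead one, you can check (and cautiously raise) the kernel’s per-device command timeout; /dev/sde below is just an example:

cat /sys/block/sde/device/timeout
echo 60 > /sys/block/sde/device/timeout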

Unless you’ve got spares lying around and want to really play it safe, it’s sometimes worth re-inserting the failed disk into the array and letting it rebuild before replacing it, especially if you don’t hear any tell-tale failure sounds (clicking noises).

You can also do a simple dd read test to check for read errors; if you get these, your drive has most likely failed.

dd if=/dev/sde of=/root/testdisk bs=5G count=1
dd: reading `/dev/sde': Input/output error
0+0 records in
0+0 records out
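
A lighter variant of the same test reads into /dev/null instead of writing a multi-gigabyte file under /root; the block size and count here are arbitrary:

dd if=/dev/sde of=/dev/null bs=1M count=1024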

Lastly, you can use smartctl to investigate more closely, paying attention to the UDMA_CRC_Error_Count value, though SMART isn’t always the best at determining drive health either and can produce false positives.
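
A quick way to pull out the interesting counters is to grep the full attribute dump; the attribute names below are typical for SATA drives but vary somewhat by vendor:

smartctl -a /dev/sde | egrep -i 'crc|reallocated|pending'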

8-Bay Tower and Software RAID Caveats
If you have issues with disks being renamed, not coming up, or being thrown out of the array with the Sans Digital tower, make sure you’re doing the following:

Use a write-intent bitmap; it tracks which regions of the array have outstanding writes, so a disk that briefly drops out can be re-added with a quick partial resync instead of a full rebuild.  You can set this up at any time via:

mdadm /dev/md1 --grow --bitmap=internal
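
You can confirm the bitmap took effect either from the bitmap: line in /proc/mdstat or via --detail:

mdadm --detail /dev/md1 | grep -i bitmap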

Use DEVICE partitions in your /etc/mdadm.conf; this avoids the need for udev rules or disk-by-id labeling for your disks within mdadm.  Here’s my /etc/mdadm.conf:

DEVICE partitions
ARRAY /dev/md1 level=raid6 num-devices=8 UUID=ae9b6c81-db90-4476-a418-a6dfd91356ae
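
If you ever need to regenerate that ARRAY line (say after recreating the array), mdadm will print a suitable one for you that you can paste or append into /etc/mdadm.conf:

mdadm --detail --scan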
