Linux RAID Disk Replacement with Sans Digital 8-bay eSATA tower

There are plenty of useful articles and guides on managing Linux software RAID floating around.  I recently had a disk failure in my 8-bay eSATA array and thought I'd add to the mix.  Here's what I did, including some notes specific to the Sans Digital 8-bay eSATA tower and recovery on RHEL6.

 

Fileserver and Disk Setup
I've consolidated all my various home machines onto a single low-power, fanless hypervisor with an external 8-bay eSATA tower for NFS and local VM storage.

Failure Scenario
I set up mdadm to email me on any failure events.  This array has been very good to me, with no failures since I purchased it back in 2010, until now.  Linux mdadm does a nice job of summarizing what went wrong and the most likely culprit.  Note that /dev/sde is going to be our failed disk, and in mdadm parlance "_" denotes a failed disk while "U" denotes an active, healthy one.

This is an automatically generated mail message from mdadm
running on poopsock.example.com

A Fail event had been detected on md device /dev/md1.

It could be related to component device /dev/sde.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md1 : active raid6 sdh[5] sdb[4] sde[9](F) sdg[10] sdd[1] sdf[8] 
      sdi[6] sdc[2]
      11721080448 blocks super 1.2 level 6, 64k chunk, algorithm 2 
      [8/7] [_UUUUUUU]
      bitmap: 5/15 pages [20KB], 65536KB chunk

Here’s my /etc/mdadm.conf for reference:

MAILADDR root,will@example.com
DEVICE partitions
ARRAY /dev/md1 level=raid6 num-devices=8 UUID=ae9b6c81-db90-4476-a418-a6dfd91356ae

Sans Digital Specifics
The Sans Digital 8-bay eSATA tower I have uses a specific mapping between slots and disks.  For my model it's pretty straightforward:

  • Count starts from bottom-up
    • bottom slot = slot 1
    • top slot = slot 8
  • sata_sil24 kernel module automatically fails out disks
    • e.g. no need for mdadm --manage /dev/md1 --fail /dev/sde (see the example after this list)
  • Failed disks will not show red/amber LED activity
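
For reference, if your controller doesn't fail disks out automatically, you'd mark the disk failed and remove it from the array by hand before pulling it, using the standard mdadm invocations:

mdadm --manage /dev/md1 --fail /dev/sde
mdadm --manage /dev/md1 --remove /dev/sde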

In my failure, slot 8 showed a solid green light while slots 1-7 were red/amber.

Replace the Physical Disk
This tower has hot-swappable sleds, making replacement very easy.  Simply pull the failed disk, swap in a disk of the same size or larger, and pop the sled back in.  You should then see:

Feb 3 06:21:12 poopsock kernel: scsi 6:3:0:0: Direct-Access     
ATA      ST32000542AS     CC35 PQ: 0 ANSI: 5
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] 3907029168 
512-byte logical blocks: (2.00 TB/1.81 TiB)
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] Write Protect off
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] Write cache: 
enabled, read cache: enabled, doesn't support DPO or FUA
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: Attached scsi generic 
sg4 type 0
Feb 3 06:21:12 poopsock kernel: sde: unknown partition table
Feb 3 06:21:12 poopsock kernel: sd 6:3:0:0: [sde] Attached SCSI disk

Check mdadm status
One overlooked gotcha with mdadm is that the sync_action flag may be busy, causing rebuilds or recoveries to fail.  Let's quickly check that it's idle.  If it's not, you can echo "idle" into it to set it as such (shown below).

cat /sys/block/md1/md/sync_action
idle
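
If it reports anything other than idle, echoing idle into the same sysfs file resets it:

echo idle > /sys/block/md1/md/sync_action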

Start Recovery
Now you can re-add your replacement disk to the array.  I don't bother with pre-partitioning anything; it's a relic from the SCSI LUN days and not needed for most modern setups.  I simply use the entire disk and let mdadm and XFS figure it out.  If you do partition disks larger than 2TB you'll need GPT labels, but using whole disks I've not hit any issues yet.

mdadm --manage /dev/md1 --add /dev/sde
mdadm: re-added /dev/sde

Check Rebuild Status
You should be set now, but let’s check that it’s rebuilding as it should.  First, check that sync_action reports recovery:

cat /sys/block/md1/md/sync_action
recover

Great, now let's check the actual progress via mdstat as well as with mdadm --detail:

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid6 sde[9] sdg[10] sdf[8] sdi[6] sdb[4] sdc[2] 
      sdh[5] sdd[1]
      11721080448 blocks super 1.2 level 6, 64k chunk, algorithm 2 
      [8/7] [_UUUUUUU]
      [====>...............] recovery = 20.3% (397693684/1953513408) 
      finish=1344.3min speed=19287K/sec
      bitmap: 13/15 pages [52KB], 65536KB chunk
mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Tue Mar  8 20:15:00 2011
     Raid Level : raid6
     Array Size : 11721080448 (11178.09 GiB 12002.39 GB)
  Used Dev Size : 1953513408 (1863.02 GiB 2000.40 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Feb  3 12:40:39 2016
          State : clean, degraded, recovering 
 Active Devices : 7
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 20% complete

           Name : localhost.localdomain:1
           UUID : ae9b6c81:db904476:a418a6df:d91356ae
         Events : 1370671

    Number   Major   Minor   RaidDevice State
       9       8       64        0      spare rebuilding   /dev/sde
       1       8       48        1      active sync   /dev/sdd
       2       8       32        2      active sync   /dev/sdc
       6       8      128        3      active sync   /dev/sdi
       4       8       16        4      active sync   /dev/sdb
       8       8       80        5      active sync   /dev/sdf
      10       8       96        6      active sync   /dev/sdg
       5       8      112        7      active sync   /dev/sdh
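
To keep an eye on progress without re-running these by hand, something like watch works nicely:

watch -n 60 cat /proc/mdstat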

Adjusting Rebuild & Check Speeds
You can limit how much bandwidth is given to the rebuild process via the following knobs (values are in KB/sec per device); with this array, normal storage access can sometimes be problematic during rebuilds or scrubs:

echo "500" > /proc/sys/dev/raid/speed_limit_min
echo "50000" > /proc/sys/dev/raid/speed_limit_max

Conversely, you can raise the ceiling to speed things up (200000 KB/sec, roughly 200M/sec, is the usual kernel default for speed_limit_max):

echo "200000" > /proc/sys/dev/raid/speed_limit_max

Lastly, you might want to adjust the normal raid-check cronjob to only run once a month.  The mdadm devs do err on the side of caution, but I've never seen an issue with checking for bad blocks and data once a month instead.

cat /etc/cron.d/raid-check 
# Run system wide raid-check once a week on Sunday at 1am by default
#0 1 * * Sun root /usr/sbin/raid-check <-- comment this out
# Once a month is good
0 1 1 * * root /usr/sbin/raid-check

Business as Usual
After some amount of time your recovery should complete.  Note that large, slow SATA disks, while great for archival and general-purpose storage, take a really long time to rebuild.  This is why I urge everyone using SATA RAID to use RAID6 instead of RAID5 for the extra parity disk; it's not uncommon to experience an additional disk failure during long rebuilds.

If all looks good, your mdstat output should be clean again:

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid6 sde[9] sdg[10] sdf[8] sdi[6] sdb[4] sdc[2] sdh[5] 
      sdd[1]
      11721080448 blocks super 1.2 level 6, 64k chunk, algorithm 2 
      [8/8] [UUUUUUUU]
      bitmap: 4/15 pages [16KB], 65536KB chunk

unused devices: <none>

More Drives Lost?  Failed eSATA Cable
As luck would have it, I had another failure of /dev/sde (my replacement drive was faulty), and during the rebuild an eSATA cable went bad.  This caused the loss of the additional four drives on lane 2, so the array went away.

The message below was printed for each drive on lane 2, i.e. the four drives connected to the second eSATA cable:

Feb 17 06:47:06 poopsock kernel: sd 7:2:0:0: [sdh] Add. Sense: Scsi parity error
Feb 17 06:47:06 poopsock kernel: sd 7:2:0:0: [sdh] CDB: Write(10): 2a 00 00 00 00 08 00 00 02 00
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdh, sector 8
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdh, sector 8
Feb 17 06:47:06 poopsock kernel: md: super_written gets error=-5, uptodate=0
Feb 17 06:47:06 poopsock kernel: md/raid:md1: Disk failure on sdh, disabling device.
Feb 17 06:47:06 poopsock kernel: md/raid:md1: Operation continuing on 4 devices.
Feb 17 06:47:06 poopsock kernel: sd 7:2:0:0: rejecting I/O to offline device
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdh, sector 0
Feb 17 06:47:06 poopsock kernel: sd 7:1:0:0: rejecting I/O to offline device
Feb 17 06:47:06 poopsock kernel: end_request: I/O error, dev sdg, sector 0
Feb 17 06:47:06 poopsock kernel: sd 7:1:0:0: rejecting I/O to offline device
Feb 17 06:47:06 poopsock kernel: ata8: EH complete
Feb 17 06:47:06 poopsock kernel: ata8.00: detaching (SCSI 7:0:0:0)

I knew /dev/sde in this case was failed and rebuilding, but the other drives were fine.  mdadm is conservative and will set a FAULTY flag on good drives if they disappear.

The Fix
First, I commented the /dev/md1 array out of /etc/fstab so the server wouldn't try to mount it at boot.  Then I powered down the server and array, replaced the faulty eSATA cable, and powered things back up, array first so the drives were spun up.

Next, I force-assembled the array to clear the FAULTY flags for just the known good disks.  /dev/sde had not been fully rebuilt yet and so needed to be omitted.

mdadm --assemble --force /dev/md1 /dev/sdd /dev/sdc /dev/sdi /dev/sdb /dev/sdf /dev/sdg /dev/sdh
mdadm: forcing event count in /dev/sdf(5) from 1733689 upto 1733732
mdadm: forcing event count in /dev/sdg(6) from 1733689 upto 1733732
mdadm: forcing event count in /dev/sdh(7) from 1733689 upto 1733732
mdadm: forcing event count in /dev/sdi(3) from 1733688 upto 1733732
mdadm: clearing FAULTY flag for device 2 in /dev/md1 for /dev/sdi
mdadm: clearing FAULTY flag for device 4 in /dev/md1 for /dev/sdf
mdadm: clearing FAULTY flag for device 5 in /dev/md1 for /dev/sdg
mdadm: clearing FAULTY flag for device 6 in /dev/md1 for /dev/sdh
mdadm: Marking array /dev/md1 as 'clean'
mdadm: /dev/md1 has been started with 7 drives (out of 8).
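
If you're unsure which members are safe to force in, it's worth comparing superblock event counts first.  A quick check along these lines (the device glob matches my member disks):

mdadm --examine /dev/sd[b-i] | grep -E '/dev/sd|Events'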

Rebuild Again
Now I can rebuild the replaced disk /dev/sde.

mdadm --manage /dev/md1 --add /dev/sde
mdadm: added /dev/sde

Note that if you have data drives that were inadvertently added as spares, you can also use the assemble command above to make them appear as data drives / RAID members again.

Sans Digital and Port Lane Resets
Specific to this type of port-multiplier eSATA unit, occasionally a failed or failing disk will make all the disks in that particular 4-port lane reset entirely.  This can disguise itself as a bad cable or even a faulty array.

kernel: ata8.00: failed to read SCR 1 (Emask=0x40)
kernel: ata8.00: failed to read SCR 0 (Emask=0x40)
kernel: ata8.01: failed to read SCR 1 (Emask=0x40)
kernel: ata8.01: failed to read SCR 0 (Emask=0x40)
kernel: ata8.02: failed to read SCR 1 (Emask=0x40)
kernel: ata8.02: failed to read SCR 0 (Emask=0x40)
kernel: ata8.03: failed to read SCR 1 (Emask=0x40)
kernel: ata8.03: failed to read SCR 0 (Emask=0x40)
kernel: ata8.04: failed to read SCR 1 (Emask=0x40)
kernel: ata8.05: failed to read SCR 1 (Emask=0x40)
kernel: ata8.15: exception Emask 0x100 SAct 0x0 SErr 0x200000 action 0x6 frozen
kernel: ata8.15: irq_stat 0x02060002, PMP DMA CS errata
kernel: ata8.15: SError: { BadCRC }
kernel: ata8.00: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: ata8.00: failed command: FLUSH CACHE EXT
kernel: ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 9
kernel: res 82/31:04:04:00:00/00:00:00:90:82/00 Emask 0x2 (HSM violation)

Typically with this unit you can determine which disk is failing or troublesome by sight: its amber LED, which would normally denote read/write activity, lingers just a little longer than the others'.

Pinpointing a particular drive from the log messages can be difficult because of the port-multiplier setup, where drives show up under one of just two host identifiers (2 x 4 ports).

ls -l /sys/block/sd* | sed -e 's@.*-> \.\..*/ata@/ata@' -e 's@/host@ @' -e 's@/target.*/@ @'

They're all host 6 or host 7, so this isn't terribly useful for pinpointing a drive from the logs above.

/sys/block/sda -> ../devices/pci0000:00/0000:00:1f.2 0 sda
/sys/block/sdb -> ../devices/pci0000:00/0000:00:1f.2 1 sdb
/sys/block/sdc -> ../devices/pci0000:00/0000:00:1c.6/0000:03:00.0 6 sdc
/sys/block/sdd -> ../devices/pci0000:00/0000:00:1c.6/0000:03:00.0 6 sdd
/sys/block/sde -> ../devices/pci0000:00/0000:00:1c.6/0000:03:00.0 6 sde
/sys/block/sdf -> ../devices/pci0000:00/0000:00:1c.6/0000:03:00.0 6 sdf
/sys/block/sdg -> ../devices/pci0000:00/0000:00:1c.6/0000:03:00.0 7 sdg
/sys/block/sdh -> ../devices/pci0000:00/0000:00:1c.6/0000:03:00.0 7 sdh
/sys/block/sdi -> ../devices/pci0000:00/0000:00:1c.6/0000:03:00.0 7 sdi
/sys/block/sdj -> ../devices/pci0000:00/0000:00:1c.6/0000:03:00.0 7 sdj

You can also try reading from each drive individually with dd to see if you can trigger a CRC error, or look for impending failure via smartctl (both covered further below).
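
As a minimal sketch (assuming sdb through sdi are the tower's member disks), a direct read from each drive in turn will often surface a CRC error in dmesg from the offending disk:

for d in /dev/sd[b-i]; do echo "== $d"; dd if=$d of=/dev/null bs=1M count=1024 iflag=direct; done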

The main point here, specific to the Sans Digital style 8-bay port-multiplier arrays, is that what appears to be a flaky cable, card or array when it comes to port resets is often really just a failing disk that hasn't yet been kicked out, but is troublesome enough to trigger port resets for the rest of the member disks in that lane of ports.

kernel: ata8: controller in dubious state, performing PORT_RST
kernel: ata8: softreset failed (SRST command error)
kernel: ata8: reset failed (errno=-5), retrying in 8 secs
kernel: ata8: controller in dubious state, performing PORT_RST
kernel: ata8: softreset failed (SRST command error)
kernel: ata8: reset failed (errno=-5), retrying in 33 secs
kernel: ata8: limiting SATA link speed to 1.5 Gbps
kernel: ata8: controller in dubious state, performing PORT_RST

Further Notes
Sometimes mdadm will kick a drive out of an array if it doesn't respond within the kernel's command timeout (30 seconds by default).  In a lot of cases mdadm is conservative and errs on the side of caution.  Of note, eSATA cables are notoriously unreliable and known to become flaky or poop out entirely.

Unless you’ve got spares lying around and want to really play it safe, sometimes it’s worth trying to re-insert the failed disk back into the array and let it rebuild first before replacing, especially if you don’t hear any tell-tale failure sounds (clicking noises).

You can also do a simple dd test to check for CRC errors; if you get these, your drive has most likely failed.

dd if=/dev/sde of=/root/testdisk bs=5G count=1
dd: reading `/dev/sde': Input/output error
0+0 records in
0+0 records out

Lastly, you can use smartctl to investigate closer, paying attention to the UDMA_CRC_Error_Count value, though smartctl isn't always the best at determining this and can produce a lot of false positives.
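
For reference, pulling just that attribute looks like this (substitute your suspect device):

smartctl -A /dev/sde | grep -i UDMA_CRC_Error_Count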

8-Bay Tower and Software RAID Caveats
If you have issues with disks being renamed, not coming up, or being thrown out of the array with the Sans Digital tower, make sure you're doing the following:

Use a write-intent bitmap.  This records which regions of the array have in-flight writes, so after a crash or a re-added disk only those regions need to be resynced rather than the whole array.  You can set this up at any time via:

mdadm /dev/md1 --grow --bitmap=internal
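
You can confirm it took effect by checking for the bitmap line in /proc/mdstat, or for "Intent Bitmap : Internal" in the detail output as shown earlier:

mdadm --detail /dev/md1 | grep -i bitmap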

Use DEVICE partitions in your /etc/mdadm.conf; this avoids the need for udev or disk-by-id labeling of your disks within mdadm.  Here's my /etc/mdadm.conf:

DEVICE partitions
ARRAY /dev/md1 level=raid6 num-devices=8 UUID=ae9b6c81-db90-4476-a418-a6dfd91356ae
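
If you ever need to regenerate that ARRAY line, mdadm can emit it for a running array:

mdadm --detail --scan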

Drive Timeouts, Rebuilds and SCTRC / TLER
My older eSATA controller and drive bay can be a bit ornery when it comes to timeouts; this often results in drives being kicked out of the array during rebuild.

A blog post I found recently did a great job of providing a workaround here: either enable the SCT Error Recovery Control setting on your drives (if it's supported) or, worst case, set a higher timeout within Linux itself.

First, determine if your drive(s) supports SCTRC / TLER functionality:

for drive in /sys/block/sd*; do drive="/dev/$(basename $drive)"; echo "$drive:"; smartctl -l scterc $drive; done

If you see this then they support it:

SCT Error Recovery Control:
 Read: 70 (7.0 seconds)
 Write: 70 (7.0 seconds)

You can then apply this setting via smartctl again on the drives that do:

smartctl -q errorsonly -l scterc,70,70 /dev/sdb
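
To apply it across all member drives at once, the same style of loop used for detection works (adjust the device glob to your disks):

for drive in /dev/sd[b-i]; do smartctl -q errorsonly -l scterc,70,70 $drive; done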

If they all do, cool.  If you're like me and they're cheap, older consumer drives that don't, you'd see something like:

SCT Error Recovery Control:
 Read: Disabled
 Write: Disabled

You can instead set a liberal per-drive timeout as a workaround; this got me through my rather old, ornery eSATA controller kicking out a drive repeatedly during replacement and subsequent rebuild.

for drive in sdh sdf sdi; do echo "180" > /sys/block/$drive/device/timeout; done

NOTE: This needs to be added to your boot startup so it runs every time; you can use the rc-local systemd service in CentOS 7 and beyond, or /etc/rc.local in EL6 and below (sketch below).  It's also important to note that mdadm can be conservative about timeout-based failure conditions, but with most modern controllers, even with consumer-level drives, this shouldn't usually be needed.
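
As a minimal sketch for EL6, appended to /etc/rc.local (same three drives as above):

# Raise the SCSI command timeout for drives lacking SCTERC/TLER
for drive in sdh sdf sdi; do
    echo "180" > /sys/block/$drive/device/timeout
done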

Lowering SATA Speed
Sometimes it may help to force a lower SATA speed for your drives; this can help you rule out issues with saturation or NCQ, or cases where you have a mix of rotational speeds and/or SATA generations in the same array.

Adding this to your Linux kernel boot line, for example, will force the SATA link speed to 1.5 Gbps:

libata.force=1.5Gbps
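
The parameter also takes a port qualifier if you only want to slow down a single link, e.g. the ata8 lane from the logs above (check your kernel's libata.force documentation for the exact syntax it supports):

libata.force=8:1.5Gbps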

In general, though, Linux will negotiate the best link speed on its own.
