Overview

Short-Desc   High-Memory Analysis Server
OS           Rocky Linux 9.2
CPU          2 x AMD EPYC 9454 (96 cores)
MEM          1.48 TiB
GPU          None
SSD          931.51 GiB MDRAID0 (2x SSDPE2KX010T8O 931.51 GiB)
DISK(S)      65.29 TiB RAIDZ2 (6x WUH721818AL5204 16.37 TiB)
MODEL        Advanced Clustering Technologies ACTserv 2U Genoa
LOCATION     Park Avenue Campus (4075 Park Ave, Memphis, TN 38152)

Maintenance Notes

Repeated Disk Array Problems SOLVED

The solution was to disable the hardware RAID and just use ZFS!
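
A rough sketch of that setup, for future reference. The controller and enclosure IDs (/c0, e252) are the ones used throughout the notes below; the pool name "tank", the ashift value, and the DISK0–DISK5 by-id placeholders are assumptions, not values taken from this machine:

  storcli64 /c0 set jbod=on               # enable JBOD support on the controller
  storcli64 /c0/e252/sall set jbod        # expose each drive directly instead of behind a VD
  zpool create -o ashift=12 tank raidz2 \
      /dev/disk/by-id/DISK0 /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 \
      /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4 /dev/disk/by-id/DISK5
  # 6-wide raidz2 matches the ~65 TiB usable figure in the overview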

Array Failure (September 21, 2024)

During a test of compression speeds and effectiveness for simulation outputs, the array failed once again, with three disks reporting their state as Failed in storcli /c0/dall show. The three failed disks are:

Row/DID   EID:Slot   SN
0         252:0      2VGPS53L
3         252:3      5BH79Y2J
5         252:1      5BH7JWER
Disk 3 is the one that appears to have just now failed; Disk 0 failed on September 12, and Disk 5 failed on September 18. Every failed drive shows no SMART errors but an Other Error Count of 52.
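
For reference, the per-drive state and error counters can be pulled with something like the following; the slot and device numbers come from the table above, but the /dev/sda path (whatever block device the controller exposes to smartctl) is an assumption:

  storcli64 /c0/e252/s3 show all          # drive state, media/other error counters, S.M.A.R.T. alert flag
  smartctl -a -d megaraid,3 /dev/sda      # full SMART/log pages through the LSI controller;
                                          # the number after "megaraid," is the controller's
                                          # device ID for the drive, which may not match the slot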

Looking at the Key Code Qualifier in the event log for S3, the sense data appears to be 2/04/1c, which the SCSI standard (the T10 ASC/ASCQ assignments) lists as:

04h/1Ch  DZT   MAEB     LOGICAL UNIT NOT READY, ADDITIONAL POWER USE NOT YET GRANTED
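
One way to double-check that reading is sg_decode_sense from sg3_utils, fed a hand-built fixed-format sense buffer carrying sense key 02h and ASC/ASCQ 04h/1Ch (the bytes below are constructed for illustration, not copied from the controller log):

  # byte 0: 0x70 = current error, fixed format; byte 2: sense key 02h (NOT READY)
  # byte 7: additional sense length; bytes 12-13: ASC/ASCQ 04h/1Ch
  sg_decode_sense 70 00 02 00 00 00 00 0a 00 00 00 00 04 1c 00 00 00 00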

Attempted Fix (September 25, 2024)

  • The XFS filesystem appears to be badly damaged because bringing all the disks online at once mixes disks that haven’t been written to for days or weeks with the disks that only just failed.
    • ONLY SET THE LAST FAILED DISK ONLINE WHEN FIXING
  • Doing a very heavy read/write will trigger the disk failure! This time it was a different disk, the one in slot 5.
    • Pulling it out and plugging it in again doesn’t fix it.
    • Rebooting after pulling it out and plugging it back in DOES restore the disk.
  • To restore after a failure without physical access (the full sequence is also sketched after this list):
    • Unmount the filesystem.
    • Wait for the disk to move from the Configured shielded state to the Failed state.
    • Set the offending disk to missing with storcli64 /c0/e252/sX set missing
    • Reboot.
    • Set the offending disk to good with storcli64 /c0/e252/sX set good
    • Import the disk from the Foreign DG with storcli64 /c0/fall import
  • Trashed the RAID by swapping the disks and setting the wrong ones to UGood. Fuck.
  • Trying to set the controller into JBOD mode now; hopefully that will let me use ZFS and avoid this whole mess.
  • Currently doing a test for bad blocks, although apparently the badblocks utility doesn’t support drives with more than 2^32 blocks… Need to check manually (see the sketch after this list):
    • Set up a plain dm-crypt layer on top of the device: cryptsetup open /dev/DEVICE NAME --type plain --cipher aes-xts-plain64
    • Fill the now-opened decrypted layer with zeroes, which get written to disk as encrypted data: shred -v -n 0 -z /dev/mapper/NAME
    • Compare fresh zeroes against the decrypted layer: cmp -b /dev/zero /dev/mapper/NAME. If it just stops with a message about end of file, the drive is fine.
  • Strongly starting to suspect this is a software problem rather than bad disks. There are known issues with LSI controllers and big drives where the drives spin down and then refuse to spin back up, which would fit the power-related 04h/1Ch sense code above.
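
The remote-recovery steps above, condensed into one sequence for convenience; slot 3 is only an example (substitute the most recently failed drive), and the /data mount point is an assumption:

  umount /data                            # stop all I/O to the array first
  # wait for the drive to drop from Configured shielded to Failed, then:
  storcli64 /c0/e252/s3 set missing
  reboot
  # after the reboot comes back up:
  storcli64 /c0/e252/s3 set good
  storcli64 /c0/fall import               # pull the drive back in from the foreign config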
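
Likewise the manual bad-block check as one sequence; /dev/sdX and the mapping name "wipetest" are placeholders, and any passphrase works since the data is only ever compared back through the same open mapping. Note that this overwrites the whole drive:

  cryptsetup open --type plain --cipher aes-xts-plain64 /dev/sdX wipetest
  shred -v -n 0 -z /dev/mapper/wipetest   # one full pass of zeroes, stored on disk as ciphertext
  cmp -b /dev/zero /dev/mapper/wipetest   # an EOF message for /dev/mapper/wipetest with no
                                          # "differ" output means every block read back intact
  cryptsetup close wipetest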