Overview

Short-Desc   High-Memory Analysis Server
OS           Rocky Linux 9.2
CPU          2 x AMD EPYC 9454 (96 cores)
MEM          1.48 TiB
GPU          None
SSD          931.51 GiB MDRAID0 (2x SSDPE2KX010T8O 931.51 GiB)
DISK(S)      65.29 TiB RAIDZ2 (6x WUH721818AL5204 16.37 TiB)
MODEL        Advanced Clustering Technologies ACTserv 2U Genoa
LOCATION     Park Avenue Campus (4075 Park Ave, Memphis, TN 38152)

Maintenance Notes

Repeated Disk Array Problems SOLVED

The solution was to disable the hardware RAID and just use ZFS!
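
A rough sketch of that setup, for future reference. The controller and enclosure IDs (/c0, e252) are the ones used throughout the notes below; the pool name "tank", the ashift value, and the DISK0–DISK5 by-id placeholders are assumptions, not values taken from this machine:

  storcli64 /c0 set jbod=on               # enable JBOD support on the controller
  storcli64 /c0/e252/sall set jbod        # expose each drive directly instead of behind a VD
  zpool create -o ashift=12 tank raidz2 \
      /dev/disk/by-id/DISK0 /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 \
      /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4 /dev/disk/by-id/DISK5
  # 6-wide raidz2 matches the ~65 TiB usable figure in the overview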

Array Failure (September 21, 2024)

During a test of compression speeds and effectiveness for simulation outputs, the array failed once again, with three disks reporting their state as Failed in storcli /c0/dall show. The three failed disks are:

Row/DID   EID:Slot   SN
0         252:0      2VGPS53L
3         252:3      5BH79Y2J
5         252:1      5BH7JWER
Disk 3 is the one that appears to have just now failed; Disk 0 failed on September 12, and Disk 5 failed on September 18. Every failed drive shows no SMART errors but an Other Error Count of 52.
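
For reference, the per-drive state and error counters can be pulled with something like the following; the slot and device numbers come from the table above, but the /dev/sda path (whatever block device the controller exposes to smartctl) is an assumption:

  storcli64 /c0/e252/s3 show all          # drive state, media/other error counters, S.M.A.R.T. alert flag
  smartctl -a -d megaraid,3 /dev/sda      # full SMART/log pages through the LSI controller;
                                          # the number after "megaraid," is the controller's
                                          # device ID for the drive, which may not match the slot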

Looking at the Key Code Qualifier in the event log for S3, the sense data appears to be 2/04/1c, which the SCSI standard (the T10 ASC/ASCQ assignments) lists as:

04h/1Ch  DZT   MAEB     LOGICAL UNIT NOT READY, ADDITIONAL POWER USE NOT YET GRANTED
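
One way to double-check that reading is sg_decode_sense from sg3_utils, fed a hand-built fixed-format sense buffer carrying sense key 02h and ASC/ASCQ 04h/1Ch (the bytes below are constructed for illustration, not copied from the controller log):

  # byte 0: 0x70 = current error, fixed format; byte 2: sense key 02h (NOT READY)
  # byte 7: additional sense length; bytes 12-13: ASC/ASCQ 04h/1Ch
  sg_decode_sense 70 00 02 00 00 00 00 0a 00 00 00 00 04 1c 00 00 00 00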

Attempted Fix (September 25, 2024)

  • The XFS filesystem appears to be badly damaged because bringing all the disks online at once mixes disks that haven’t been written to for days or weeks with the disks that only just failed.
    • ONLY SET THE LAST FAILED DISK ONLINE WHEN FIXING
  • Doing a very heavy read/write will trigger the disk failure! This time it was a different disk, the one in slot 5.
    • Pulling it out and plugging it in again doesn’t fix it.
    • Rebooting after pulling it out and plugging it back in DOES restore the disk.
  • To restore after a failure without physical access (the full sequence is also sketched after this list):
    • Unmount the filesystem.
    • Wait for the disk to move from the Configured shielded state to the Failed state.
    • Set the offending disk to missing with storcli64 /c0/e252/sX set missing
    • Reboot.
    • Set the offending disk to good with storcli64 /c0/e252/sX set good
    • Import the disk from the Foreign DG with storcli64 /c0/fall import
  • Trashed the RAID by swapping the disks and setting the wrong ones to UGood. Fuck.
  • Trying to set the controller into JBOD mode now; hopefully that will let me use ZFS and avoid this whole mess.
  • Currently doing a test for bad blocks, although apparently the badblocks utility doesn’t support drives with more than 2^32 blocks… Need to check manually (see the sketch after this list):
    • Set up a plain dm-crypt layer on top of the device: cryptsetup open /dev/DEVICE NAME --type plain --cipher aes-xts-plain64
    • Fill the now-opened decrypted layer with zeroes, which get written to disk as encrypted data: shred -v -n 0 -z /dev/mapper/NAME
    • Compare fresh zeroes against the decrypted layer: cmp -b /dev/zero /dev/mapper/NAME. If it just stops with a message about end of file, the drive is fine.
  • Strongly starting to suspect this is a software problem rather than bad disks. There are known issues with LSI controllers and big drives where the drives spin down and then refuse to spin back up, which would fit the power-related 04h/1Ch sense code above.
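
The remote-recovery steps above, condensed into one sequence for convenience; slot 3 is only an example (substitute the most recently failed drive), and the /data mount point is an assumption:

  umount /data                            # stop all I/O to the array first
  # wait for the drive to drop from Configured shielded to Failed, then:
  storcli64 /c0/e252/s3 set missing
  reboot
  # after the reboot comes back up:
  storcli64 /c0/e252/s3 set good
  storcli64 /c0/fall import               # pull the drive back in from the foreign config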
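
Likewise the manual bad-block check as one sequence; /dev/sdX and the mapping name "wipetest" are placeholders, and any passphrase works since the data is only ever compared back through the same open mapping. Note that this overwrites the whole drive:

  cryptsetup open --type plain --cipher aes-xts-plain64 /dev/sdX wipetest
  shred -v -n 0 -z /dev/mapper/wipetest   # one full pass of zeroes, stored on disk as ciphertext
  cmp -b /dev/zero /dev/mapper/wipetest   # an EOF message for /dev/mapper/wipetest with no
                                          # "differ" output means every block read back intact
  cryptsetup close wipetest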