Question regarding a damaged SSD as part of a RAID1 mirror
https://askubuntu.com/questions/1565508/question-regarding-a-damages-ssd-as-part-of-a-raid1-mirror
Xubuntu 24.04.4
I've been running a new Intel Xeon W7-2595X system for a number of months now. I've experienced a couple of serious problem with a raid consisting two two Samsung 990 PRO 4 TB M.2 PCIe SSDs. The SSD partitioning is:
Partition       Label  FS Type  Mnt Pt     Flags      Size
--------------  -----  -------  ---------  ---------  ---------
nvme[01]n1p1:   esp    fat32    /boot/efi  boot, esp    1.05 GB
nvme[01]n1p2:   root   ext4     /          --         160.00 GB
nvme[01]n1p3:   user   ext4     /u         --           3.48 TB
unallocated:    --     --       --         --           1.84 MB
And the RAID setup consists of two RAID1 arrays of mirrored partitions:
/dev/md0: /dev/nvme0n1p2 & /dev/nvme1n1p2 # root filesystem
/dev/md1: /dev/nvme0n1p3 & /dev/nvme1n1p3 # user filesystem
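For reference, this is how I check the state of the two mirrors. On the live system the check is just `cat /proc/mdstat` and `mdadm --detail /dev/md0`; below I parse a sample mdstat line (contents illustrative, not my actual output) for the member-status field, where [UU] means both halves are active and [_U] or [U_] means one member has dropped out:

```shell
# Real commands on the live system:
#   cat /proc/mdstat
#   mdadm --detail /dev/md0
# Sample mdstat content (illustrative) parsed for the status field:
sample='md0 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      167703552 blocks super 1.2 [2/2] [UU]'
status=$(printf '%s\n' "$sample" | grep -o '\[[U_]\{2\}\]')
echo "$status"   # [UU] when both members are up
```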
A couple of months ago I got RAID errors regarding the nvme1n1 device, and partitions 2 and 3 were detached from the arrays. After a lot of searching through the logs I couldn't identify what happened, and decided that there was nothing seriously wrong with the SSD, so I reattached both partitions; they synced up and everything ran smoothly until yesterday. This time it was nvme0n1 that got detached, and dmesg reported many serious errors. I took relevant excerpts from the dmesg output and placed them in a file that can be reviewed here:
http://cjsa.com/misc/MD_dmesg.txt
My best guess is that this is a hardware failure. The devices still exist:
# ls -l /dev/nvme*
crw------- 1 root root 10, 261 Apr 1 08:50 /dev/nvme-fabrics
crw------- 1 root root 241, 0 Apr 1 08:50 /dev/nvme0
brw-rw---- 1 root disk 259, 0 Apr 6 11:27 /dev/nvme0n1
brw-rw---- 1 root disk 259, 1 Apr 6 11:27 /dev/nvme0n1p1
brw-rw---- 1 root disk 259, 2 Apr 6 11:27 /dev/nvme0n1p2
brw-rw---- 1 root disk 259, 3 Apr 6 11:27 /dev/nvme0n1p3
crw------- 1 root root 241, 1 Apr 1 08:50 /dev/nvme1
brw-rw---- 1 root disk 259, 4 Apr 6 15:11 /dev/nvme1n1
brw-rw---- 1 root disk 259, 5 Apr 6 15:11 /dev/nvme1n1p1
brw-rw---- 1 root disk 259, 6 Apr 6 15:11 /dev/nvme1n1p2
brw-rw---- 1 root disk 259, 7 Apr 6 15:11 /dev/nvme1n1p3
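Since the character devices still respond, the drive's own error counters seem worth capturing while a failure is in progress. The commands below are standard nvme-cli subcommands; the sample output is made up for illustration, not taken from my system:

```shell
# Real commands (require nvme-cli) on the live system:
#   nvme smart-log /dev/nvme0
#   nvme error-log /dev/nvme0
# Sample smart-log fields (values illustrative) parsed for media errors:
sample_log='critical_warning                    : 0
temperature                         : 32 C
media_errors                        : 0
num_err_log_entries                 : 0'
media_errors=$(printf '%s\n' "$sample_log" | awk -F':' '/^media_errors/ {gsub(/ /,"",$2); print $2}')
echo "$media_errors"
```

A nonzero media_errors count while the array is failing would point strongly at the drive rather than mdadm.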
I marked the devices as failed and then removed them from the arrays. When I attempt to reattach them I get the message:
# mdadm --manage /dev/md0 -a /dev/nvme0n1p2 # re-add device
mdadm: Failed to write metadata to /dev/nvme0n1p2
# mdadm --manage /dev/md1 -a /dev/nvme0n1p3 # re-add device
mdadm: Failed to write metadata to /dev/nvme0n1p3
lsblk reports on the devices:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme0n1 259:0 0 0B 0 disk
├─nvme0n1p1 259:1 0 1G 0 part /boot/efi
├─nvme0n1p2 259:2 0 160G 0 part
└─nvme0n1p3 259:3 0 3.5T 0 part
nvme1n1 259:4 0 3.6T 0 disk
├─nvme1n1p1 259:5 0 1G 0 part
├─nvme1n1p2 259:6 0 160G 0 part
│ └─md0 9:0 0 159.9G 0 raid1
│ └─md0p1 259:9 0 159.9G 0 part /
└─nvme1n1p3 259:7 0 3.5T 0 part
└─md1 9:1 0 3.5T 0 raid1
└─md1p1 259:8 0 3.5T 0 part /u
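The telling detail in that output is nvme0n1 showing 0B: the kernel still sees the controller but could no longer read the namespace size, which matches the Identify-namespace failure in the dmesg excerpt further down. A quick way to spot this condition in lsblk output (sample lines abridged from the listing above):

```shell
# Flag any whole disk that lsblk reports as zero-sized:
lsblk_out='nvme0n1     259:0    0     0B  0 disk
nvme1n1     259:4    0   3.6T  0 disk'
zero_dev=$(printf '%s\n' "$lsblk_out" | awk '$4 == "0B" {print $1}')
echo "$zero_dev reports zero size"
```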
Neither gnome-disks nor gparted can see the nvme0n1 device.
As I said, I'm pretty sure this is a hardware failure with the SSD, but I wanted to ask whether there are any known ongoing problems with the mdadm software that I should be aware of. Thanks.
Addendum:
At the request of Andrei Borzenkov I re-ran the command:
# mdadm --manage /dev/md1 -a /dev/nvme0n1p3
mdadm: Failed to write metadata to /dev/nvme0n1p3
And here is the dmesg output:
[524176.526614] Buffer I/O error on dev nvme0n1p3, logical block 934535664, async page read
[524176.526622] Buffer I/O error on dev nvme0n1p3, logical block 934535678, async page read
[524176.526634] Buffer I/O error on dev nvme0n1p3, logical block 0, async page read
[524176.526636] Buffer I/O error on dev nvme0n1p3, logical block 0, async page read
[524176.531059] Buffer I/O error on dev nvme0n1p3, logical block 0, async page read
[524176.533409] Buffer I/O error on dev nvme0n1p3, logical block 0, async page read
[524176.533426] Buffer I/O error on dev nvme0n1p3, logical block 1, async page read
[524176.537695] nvme nvme0: Identify namespace failed (-5)
[524184.848470] [UFW BLOCK] IN=eno2np1 OUT= MAC=01:00:5e:00:00:fb:20:ef:bd:51:d7:15:08:00 SRC=10.10.10.160 DST=224.0.0.251 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF PROTO=2
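Should this happen again, an immediately saved and filtered log would make the hardware-vs-software question easier to argue later. On the live box that would be something like `dmesg -T | grep -E 'nvme|md[01]' > md_failure.log`; demonstrated here on sample lines from the excerpt above:

```shell
# Count the nvme-related lines in a dmesg excerpt (sample lines from above;
# the UFW line is unrelated firewall noise and is correctly excluded):
log='[524176.526634] Buffer I/O error on dev nvme0n1p3, logical block 0, async page read
[524176.537695] nvme nvme0: Identify namespace failed (-5)
[524184.848470] [UFW BLOCK] IN=eno2np1 OUT= ...'
nvme_lines=$(printf '%s\n' "$log" | grep -c 'nvme')
echo "$nvme_lines nvme-related lines"
```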
04-14-26 Follow-up:
After making sure that I could boot from the duplicated EFI partition on the remaining working SSD, I rebooted the system and was very surprised to find that the "damaged" SSD (nvme0n1), reported on above, was once again visible, and both partitions p2 and p3 could be reattached to their respective arrays. They synced up and everything is back to "normal" for the moment.
What I didn't report above was that back in early March I had a similar RAID failure, but in that case it was nvme1n1 that failed. At that time gparted was able to see the device, the format was intact, and there were no significant errors like those reported above, so I simply reattached the two failed partitions and got everything back to normal. This held for a little over a month, until the more serious failure reported here occurred.
So now I've experienced two major RAID failures, one with each SSD member, and in each case partitions p2 and p3 failed simultaneously. This is making me question whether this is actually a hardware failure or possibly something in the mdadm software. I've been closely monitoring the temperature of the SSDs, which holds fairly stable at around 32 degrees C, so overheating doesn't seem likely.
Does this additional information trigger further thoughts? Why would the SSD generate such catastrophic errors and become "invisible" to this degree, but then be fully functional after a reboot?
In case it matters to those with deeper insight into the issue, here is the hardware situation:
Motherboard: ASUS Pro WS W790 SAGE SE
CPU: Intel Xeon W7-2595X
PCIe Card: ASUS Hyper M.2 x16 Gen5 Card (slot #1)
SSDs on Card: Samsung 990 PRO 4 TB SSD (slots #1 and #2)
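One hedged thought given that layout: with both SSDs on a bifurcated Hyper M.2 card, the PCIe link itself is a suspect alongside the drives, and a link that renegotiates only on reboot could explain the "invisible until reboot" behavior. lspci can show the negotiated link per device (bus IDs vary by system); the sample LnkSta line below is illustrative, with values assumed:

```shell
# Real command on the live system (substitute the drive's bus ID):
#   lspci -vv -s <nvme0-bus-id> | grep -E 'LnkCap|LnkSta'
# Sample LnkSta line (illustrative) parsed for the negotiated speed:
lnksta='LnkSta: Speed 16GT/s, Width x4'
speed=$(printf '%s\n' "$lnksta" | sed -n 's/.*Speed \([^,]*\),.*/\1/p')
echo "negotiated link speed: $speed"
```

Comparing LnkSta against LnkCap during a failure would show whether the link has downgraded or dropped.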
Should this happen again, any suggestions beyond what I've done above that would help identify whether this is a hardware or software issue? If it's the SSD(s), I need to get Samsung to replace them while under warranty.
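One thing I'm considering in the meantime is mdadm's monitor mode, so the next Fail/DegradedArray event is mailed and time-stamped the moment it happens: `mdadm --monitor --scan --daemonise`, which needs a MAILADDR line in /etc/mdadm/mdadm.conf. A quick check that the config has a mail target (sample config, illustrative):

```shell
# Verify mdadm.conf has a MAILADDR line (sample content, not my real config):
conf='ARRAY /dev/md0 metadata=1.2 UUID=...
MAILADDR root'
if printf '%s\n' "$conf" | grep -q '^MAILADDR'; then mail_ok=yes; else mail_ok=no; fi
echo "mail alerts configured: $mail_ok"
```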
Thanks again for your insights.