Ubuntu 24.04 freezes with "nvme nvme0: I/O" timeout error
https://askubuntu.com/questions/1557696/ubuntu-24-04-freezes-with-nvme-nvme0-i-o-timeout-error
I am running Ubuntu 24.04.3 LTS (codename "noble") on an ASUS Vivobook 15. My kernel version is 6.14.0-33-generic according to the output of uname -r.
Ever since I got this laptop, my Ubuntu system has sometimes frozen "randomly". This happened rarely enough that it wasn't a real problem. Today, however, I experienced an annoying sequence of repeated freezes, which got me looking into the problem more closely.
I ran dmesg -w and waited for the computer to freeze. It eventually did, precisely after the following output:
nvme nvme0: I/O tag 4 timeout, aborting req (WRITE), QID 7, size 4096
nvme nvme0: I/O tag 7 timeout, aborting req (WRITE), QID 7, size 4096
nvme nvme0: I/O tag 9 timeout, aborting req (WRITE), QID 7, size 4096
nvme nvme0: I/O tag 4 timeout, aborting req (WRITE), QID 0, size 4096
nvme nvme0: I/O tag 10 timeout, aborting req (WRITE), QID 0, size 4096
So it seems the freeze was triggered by 4 KiB write requests timing out. I found two other people with this issue (1, 2), but their final solution was to replace the SSD, which is not economically viable for me.
After a forced shutdown (the only way to recover from the freeze), I ran sudo dmesg -T | grep -i nvme -n -A5 -B5, since the issue was coming from nvme. The boot section of the output looked normal:
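To see how the timeouts are spread across queues, log lines like the ones above can be tallied per queue ID. This is only a minimal sketch, assuming bash and the standard grep, sed, sort, uniq and awk utilities; count_nvme_timeouts is a made-up helper name, not part of any existing tool:

```shell
#!/usr/bin/env bash
# Tally NVMe I/O timeout messages per queue ID (QID).
# Reads kernel log text on stdin and prints one summary line per QID.
# count_nvme_timeouts is a hypothetical helper name used for illustration.
count_nvme_timeouts() {
  grep -E 'nvme[0-9]+: I/O tag [0-9]+ timeout' \
    | sed -E 's/.*QID ([0-9]+).*/\1/' \
    | sort -n \
    | uniq -c \
    | awk '{ printf "QID %s: %s timeouts\n", $2, $1 }'
}

# Typical use on a live system (not run here):
#   sudo dmesg | count_nvme_timeouts
```

For the five lines quoted above, this reports three timeouts on QID 7 and two on QID 0.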
nvme nvme0: pci function 10000:e1:00.0
nvme nvme0: allocated 64 MiB host memory buffer (1 segment).
nvme nvme0: 8/0/0 default/read/poll queues
I tried changing GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub from
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
to
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0"
to be frank, simply because ChatGPT recommended it (I know this is embarrassing). This only made the freezes worse, so I undid it.
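For reference, this kind of edit can also be scripted instead of done by hand. The following is only a sketch, assuming GNU or BSD sed and a GRUB_CMDLINE_LINUX_DEFAULT line that uses double quotes; add_grub_param is a made-up helper name, and the file path defaults to /etc/default/grub:

```shell
#!/usr/bin/env bash
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a GRUB defaults file.
# add_grub_param is a hypothetical helper name used for illustration.
# Usage: add_grub_param PARAM [FILE]   (FILE defaults to /etc/default/grub)
add_grub_param() {
  local param="$1"
  local file="${2:-/etc/default/grub}"

  # Do nothing if the parameter already appears on that line.
  if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
    return 0
  fi

  # Insert the parameter just before the closing double quote.
  # A backup of the original file is kept as FILE.bak.
  sed -E -i.bak \
    "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${param}\"|" "$file"
}

# On a real system this needs root, followed by regenerating the GRUB
# configuration and rebooting (not run here):
#   sudo update-grub
```

Removing the parameter again means restoring the line (or the .bak copy) and running sudo update-grub once more.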
Now I've changed GRUB_CMDLINE_LINUX_DEFAULT to
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
because someone recommended it in one of the links above. I have not experienced any more freezes since, but it has only been a couple of hours, so the issue might still come back.
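Since a GRUB change only takes effect after sudo update-grub and a reboot, it is worth confirming that the parameter actually reached the running kernel. A minimal sketch, assuming bash on Linux; has_kernel_param is a made-up helper name, while /proc/cmdline is the standard place where the kernel exposes its boot command line:

```shell
#!/usr/bin/env bash
# Check whether a kernel parameter is present in a command-line string.
# has_kernel_param is a hypothetical helper name used for illustration.
# Usage: has_kernel_param PARAM [CMDLINE]
# CMDLINE defaults to the running kernel's /proc/cmdline.
has_kernel_param() {
  local param="$1"
  local cmdline="${2:-$(cat /proc/cmdline 2>/dev/null)}"
  # Pad with spaces so only whole, space-separated parameters match.
  case " $cmdline " in
    *" $param "*) return 0 ;;
    *) return 1 ;;
  esac
}

if has_kernel_param "pcie_aspm=off"; then
  echo "pcie_aspm=off is on the kernel command line"
else
  echo "pcie_aspm=off is not on the kernel command line"
fi

# The ASPM policy the kernel reports can also be read from sysfs,
# assuming this file exists on the system.
if [ -r /sys/module/pcie_aspm/parameters/policy ]; then
  cat /sys/module/pcie_aspm/parameters/policy
fi
```

The same helper can be pointed at any string, for example has_kernel_param "nvme_core.default_ps_max_latency_us=0" "$(cat /proc/cmdline)", to check the earlier setting.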
In case it's useful, here's the output of sudo smartctl -a /dev/nvme0:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.14.0-33-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: ADATA LEGEND 710
Serial Number: 2O522L1NCCAF
Firmware Version: VC3S500T
PCI Vendor/Subsystem ID: 0x1cc1
IEEE OUI Identifier: 0x707c18
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 707c18 1b52002a04
Local Time is: Wed Oct 22 18:13:19 2025 -03
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x02): Cmd_Eff_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 100 Celsius
Critical Comp. Temp. Threshold: 110 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0   230000   50000
 1 +     4.00W       -        -    1  1  1  1     4000   50000
 2 +     3.00W       -        -    2  2  2  2     4000  250000
 3 -   0.0300W       -        -    3  3  3  3     5000   10000
 4 -   0.0050W       -        -    4  4  4  4    54000   45000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 30 Celsius
Available Spare: 100%
Available Spare Threshold: 32%
Percentage Used: 0%
Data Units Read: 3,338,108 [1.70 TB]
Data Units Written: 5,326,516 [2.72 TB]
Host Read Commands: 41,296,965
Host Write Commands: 76,772,047
Controller Busy Time: 0
Power Cycles: 503
Power On Hours: 131
Unsafe Shutdowns: 81
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
What's the root cause of my errors? Can I fix them in some way that does not involve buying a new SSD?