M.2 NVMe, CMOS batteries, 2.5GbE adapters, and a wild goose chase

This is the tale of my personal wild-goose chase and the frustration born of misplaced trust in Corsair PSUs while building my new desktop. I thought it would be interesting (and cathartic) for others in the same crappy place; more concretely, it might help a few people avoid this particular wild-goose chase, though there are of course other geese to catch! Strap in, folks, it’s going to be a bit of a lengthy ride!

Hope

The story starts with my college friend (a desktop enthusiast) offering to get me (an even more enthusiastic desktop… err… enthusiast) reasonably priced parts from the US. A bunch of hunting later, we had the perfect config pinned down. I am a minimalist of sorts, so I chose the DRAM-less Samsung SSD 980 NVMe 500 GB as my OS drive. An Asus ROG Strix B550-F Gaming (without WiFi, it’s not a laptop!) complemented the Ryzen 7 5700G APU, since graphics cards are still just too expensive for someone like me who isn’t big on gaming right now. PCIe 3.0 instead of 4.0 was an acceptable trade-off. A TRENDnet TEG-S350 2.5G switch was the perfect complement to the 2.5G Ethernet port built into the motherboard. And finally, just for kicks, I added a 16 GB Intel Optane module (MEMPEK1W016GAXT, I think) as a superfast SSD to play with (technically it should be the fastest of the lot in latency terms). I already had a Corsair CX430M lying around from a previous build, so I didn’t need anything new in the power supply department. All of this detail turns out to be relevant, I promise!

Battle

A few months later, I had the parts and began building the rig. My first problem was that both the Windows and Fedora installers flatly refused to install on the Samsung SSD, giving very weird errors that amounted to “this drive is somehow not possible for me to go ahead with installing”.1 Sometimes the installer simply didn’t see the drive at all. At about this point, I made sure to upgrade the firmware of the motherboard (“BIOS”/”UEFI”).2

Now, there were two M.2 slots available: one connected to the CPU (APU?) and the other to the southbridge (the B550 chipset). I started off with the Samsung in the CPU slot and the Intel in the chipset slot, which didn’t work; the Windows installer just didn’t see the Samsung drive during installation. I switched them around based on some hints online, and then it detected neither (my memory is hazy here, as this was more than 1.5 years ago). Somehow, switching back made the drive show up again. Wary and tired, I seized the chance and finished installing. In hindsight, I suspect I simply got lucky that time.

After installing, the problems didn’t stop, as one might expect. The typical BSODs on Windows were of the CRITICAL_PROCESS_DIED kind; on Fedora I encountered numerous disk read errors (the kind that drop you out of the desktop environment and into a text console).

Preparation

At some point I sat down to root-cause the problem and solve it properly, and quickly ran into a number of documented issues with Samsung NVMe firmware in general, and with ASPM in particular. Specifically, the transition from D3cold (a very low power state with high exit latency) to D3hot (a slightly higher power state with lower latency) was apparently failing to happen in time. Another issue that raised its head around then was that the BIOS failed to retain its settings whenever the desktop was disconnected from the mains. How did the CMOS battery run out so quickly? At some point I replaced the battery and hoped that its failure had been a one-off fluke… Fat chance!

First, I took up the firmware, and found that there was absolutely nothing to do… the SSD and the motherboard were both already on the latest versions. After dicking around disappointedly and aimlessly3, I had to resort to workarounds to disable ASPM entirely. The BIOS was completely unhelpful here. Windows was only a little better: it has a power plan option to disable PCIe Link State Power Management (LSPM). This didn’t help in the slightest.

Fedora was a lot better in that it let me disable ASPM specifically, in two different ways, both as kernel parameters: nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off. Setting these seemed to reduce the incidence of crashes measurably, so I was convinced I was on the right track. I used Linux myself and could live with the lower incidence of crashes.
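If you want to try the same workaround on Fedora, something along these lines should do it (a minimal sketch; grubby is Fedora’s stock tool for editing kernel arguments, so adapt it to your own distro and bootloader):

    # Append the workaround parameters to every installed kernel's command line
    sudo grubby --update-kernel=ALL --args="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

    # After a reboot, confirm that they actually took effect
    cat /proc/cmdline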

Stagnation

But this was not okay with my parents, who were definitely not going to switch happily to Linux. I had given up on fixing the issue in the short term, so I had only one bad option left: install Windows on a SATA drive to sidestep the firmware bug. I had an old 128 GB EVO 750 from my previous desktop, and in short order I installed Windows on it. The bug was definitely gone, and things seemed to be working perfectly. Hurray?

And this is where things stayed for the next year or so, to the point that fixing the main issue became a low-priority matter that only occasionally reminded me of its existence. Somewhere in that period, though, the CMOS battery ran out again, and I knew then that there was definitely a real problem somewhere.

Encounter

One day I finally decided I would actually start using the 2.5G switch (which had hitherto been gathering dust in its original box) with the desktop (see, I told you it would eventually matter in this story!). The moment I switched (pun not intended) the gigabit switch on my desk for its faster cousin, Linux just flat-out refused to boot. That darned PCIe power management firmware bug was back! And now it was not occasional in the slightest: it literally never let Fedora get past the login screen. Windows was a little better in that it booted, but it crashed with near certainty within a few minutes of starting up.

My nemesis was back in full force, and this time I had significantly more time to devote to it. So I attacked it again… and reached the same place as before: ASPM was the only possible cause, according to the Internet. Windows on the SATA SSD was still working, but Windows wasn’t giving me enough information to trace the problem; specifically, I needed dmesg.

Regroup

So I live-booted Linux from a USB flash stick. It booted up fine, as one would expect, but acted up very severely when I checked the contents of the disk: several times I was subjected to the horror of seeing my NVMe SSD show up as a RAW device with size 0. Luckily, the fact that Linux and Windows at least started to boot reassured me that the device still sort of worked and the data still existed. The (USB- or NVMe-installed?) Linux log, meanwhile, pointed squarely at ASPM as the cause. On top of that, I was facing another issue: the network connection was flaky and frequently went down! And this happened in ALL the OSes. Drat, now I was running into even more nutty issues. Was I just unlucky?
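If you want to fish the same kind of evidence out of your own logs, something along these lines should surface the relevant complaints (a rough sketch, assuming a reasonably recent util-linux dmesg):

    # Show only warnings and errors from the kernel log, and keep the NVMe/PCIe power lines
    sudo dmesg --level=err,warn | grep -i -E 'nvme|aspm|power state'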

More hope

I checked again for motherboard firmware (UEFI) updates and found that quite a few revisions had been released in the meantime. There was hope! I installed the latest, and with trepidation checked the options. On the “onboard devices” configuration page there was now an option for ASPM; quick as a flash, I disabled it. I tried booting Linux on the NVMe drive… and it promptly failed to boot, just like all the previous times!

At this point, I learned that the onboard Intel I225-V Ethernet controller was seriously flawed and couldn’t run reliably at 2.5G link speeds, at least in its initial hardware revision. The second revision mitigated the issue and the third fixed it completely. By this time, I was getting the idea that these problems might be connected; some searching revealed that the Ethernet controller was powered by 3.3 V and… drumroll please… so were the M.2 slots4! If you’re already getting clues as to what the real problem was, you’re faster than I was…

How would I know whether I had the fully functional Ethernet chip, the partially fixed one, or the fully flawed one? I found conflicting information: some said the “(3)” appended to the device name in Device Manager was the hardware revision, while others said it was only the firmware (“NVM”) revision. If it was the former, it meant I had the fixed version and the issues I was facing had their origin elsewhere.

Of course, I clung to the possibility that it was the latter, so that I could blame Intel/Asus for this problem. The only way to check the hardware revision, then, was to detach the VRM heatsinks and the I/O shield. I managed to do that with a healthy dose of Internet videos and my own intuition… it was definitely an interesting exercise, to see how the whole thing sat together. Ultimately though, it turned out I had the fixed hardware revision (SLNMH, in my case). Bummer! Back to square 1.5?

Despair again

I was now out of options, and could only think of taking the motherboard to Asus support, while dreading the inevitable “you bought it in the US, we won’t support it here”. I wasn’t even sure I would get paid support. While procrastinating on this, I found a couple of posts on Reddit advising people with NVMe issues to check the power supply.

Ray of light

Now, I was confident that a Corsair power supply would at the very least supply reliable power, and I especially had no reason to suspect it because the kernel was clearly telling me it was a power state transition issue5. Surely it would only say that if it had reliable telemetry to back it up, rather than inferring or guessing? But out of idle curiosity I ran OpenHardwareMonitor. The posts had advised checking the voltages specifically… and lo and behold, the 3.3 V rail was reading around 3.0 V or even lower!
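(On Linux, the rough equivalent of this check is lm-sensors, assuming your board’s monitoring chip has a driver; the 3.3 V rail usually shows up as one of the “inN” voltages. Something like this, on Fedora:)

    # Install lm-sensors, detect the monitoring chips, then watch the voltages live
    sudo dnf install lm_sensors
    sudo sensors-detect        # answer the prompts; the defaults are usually fine
    watch -n 1 sensors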

Now, several Reddit posts strictly advised not to trust the built-in sensors, but to measure with a multimeter. I was wondering how I would do this, but luckily the CX430M is a semi-modular PSU, and I wasn’t using many of the SATA ports or any of the PCIe aux power ports on it. After scouring the internet for my PSU’s pinouts and finding them, I checked6 the 12 V value, which was rock solid according to both the multimeter and the internal sensor.

Victory at last

The 3.3 V reading, however, wandered all over the place, dipping as low as 2.8 V7, AND the dips coincided perfectly with the storage devices starting to misbehave. The kernel WAS indeed guessing about the D3cold transition thing! Along with this, I also suddenly realised why the CMOS battery was dying so quickly: the motherboard apparently connects the 3.3 V from the PSU directly to the CR2032 cell, forming a 3.3 V bus of sorts, so when the PSU was not doing its job, the cell was propping up the voltage. Among other things, this also explains why the NVMe drives fared better for a while in the middle; they were being powered by the cell… No wonder the poor thing ran out of juice so quickly!

I had an older Corsair GS600, which I had been avoiding because it is a non-modular PSU, but voltage stability is infinitely more important than tidy cabling or airflow. Swapping them finally got me the smoothly working desktop that I should have had 1.5 years ago. I installed a new CMOS battery and now all is right with the world once more… well, not exactly. Right around the time I found out about the wandering voltages, the Optane stopped responding to the BIOS or the OS, despite my swapping the drives between M.2 slots multiple times.

But that is a goose to chase for another day.


  1. I will eventually reproduce this by putting the PSU back in, and then give you all the exact error messages I faced. ↩︎
  2. The “BIOS” was originally firmware on IBM PCs, and later was any firmware on PCs that adhered to that de facto standard. The “UEFI” is also firmware, one that adheres to the de jure UEFI specification. (A few de jure specifications like PC-98 and PnP BIOS also built on the de facto BIOS interface) ↩︎
  3. I realised that the SSD support page was bad enough that if there was any bug that was fixed, I would not be able to tell — because it was lacking this little bit of info called RELEASE NOTES! ↩︎
  4. Technically, the M.2 M-key slots specifically. The B-key slots work with SATA M.2 drives, which presumably would need 12V. But B-key slots are rare today I think. Incidentally, this was also probably the reason to make M-key flash devices incompatible with B-key slots — they probably have no capability to accept 12V. This is pure speculation on my part though. ↩︎
  5. The dmesg error messages were “nvme 0000:07:00.0: Unable to change power state from D3Cold to D0, device inaccessible” and “nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff”. What was I supposed to think? ↩︎
  6. It is awkward to hold the probes steadily enough to read a value; you just have to spend a bit of time and use your ingenuity, with no easy shortcuts. If I had had enough patience, I would have crimped a new connector with open wires on the other end, but I didn’t. I also didn’t want to cut up an existing connector. ↩︎
  7. As it happens, the Corsair CX-M series is notorious for poor-quality construction and a poor choice of capacitors. TechPowerUp also noted bad voltage regulation on the 3.3 V rail, but that was under transient conditions, which was not my case. ↩︎