Author Archives: Milind

Why can’t we add two temperatures?: Part #3

In the previous article, we looked at how positions of any kind form an affine space, whose differences are vectors. In this final part of the series, we look for a more lightweight mathematical structure for when we are dealing with scalar quantities. As mentioned earlier, we use real numbers as values for things like temperature, but we can’t bring over to our physical quantities all the familiar operations, like addition, that the reals satisfy due to their field structure.

Well, algebra has no shortage of mathematical structures, with and without strange esoteric properties. But it’s surprisingly hard to track down a suitable mathematical answer beyond vector spaces or fields, even if you knew what you were looking for. And it isn’t wiki’s fault either; they’ve got this nice little box (snapshot below) titled Algebraic Structures, organising many related entities. Expanding all the ‘[show]’s, however, gets you no closer (see below).

The dive

The way algebra works is that there are many structures in which sets are tied up with operations, and axioms are attached to these structures. Each structure then becomes a full-blown topic of study in its own right, like groups (which you’ve likely encountered in school), rings, fields, vector spaces, Banach spaces, Hilbert spaces, and more, forming a dizzying array of entities. Fields are a relatively specialised structure, with a lot of axioms and closure under two different operations. So let’s start dropping axioms and operations from fields till we get to something useful, shall we?

A field is listed under ‘ring-like’ structures in that box. The least restrictive structure in that section is the cheekily named rng, a set closed under addition and multiplication but without the equivalent of the a × 1 = a that we take for granted. But addition is a no-go, so there’s more chopping required still. ‘Group-like’ looks more interesting, because those structures only need one operation, and hopefully something there drops even that one operation? Modules and algebras (yes, they are a type of algebraic structure studied in the subject of abstract algebra, and yes, I agree they really needed to choose a different name) have even more structure and axioms piled on top of fields (modules are almost vector spaces, in fact), so they’re an immediate no-go.

And while group-like certainly delivers on the fewer-axioms front (see above, though only group-like structures get their own detailed table in the “Outline of algebraic structures”), these structures never drop the one operation (usually thought of as addition). Even the exotic goopy-sounding “partial magma”, which is barely more than a set and only requires the operation to be defined on a subset of elements, still needs an operation. And because subtracting one position from another yields a displacement, which is a genuine vector, that operation can’t be subtraction either. Downward we go then, abandoning even more order and structure as we descend into savagery: sets.

Sets. Just collections of numbers, a way to tell if something is in them or isn’t… and, in our case, a total ordering, since it’s pretty important to be able to compare temperatures. But we still need to be able to subtract and then work with the differences. So this feels a little too bare.

Now, for all the physical quantities we care about, we could do a little better by adding the notion of a distance between any two numbers, to get a metric space. And this might have been barely sufficient, but for the fact that the distance has to be positive. And we haven’t even gotten to allowing the quantity differences to be added, subtracted, divided, etc.

Torsors

But waitaminute, surely this has been done before? Why are we doing basic algebra research on an extremely straightforward property of our physical world? Fortunately, there is indeed prior (obscure) art on this.

So it turns out our elusive algebraic structures are called torsors, and it’s extremely likely you’re hearing about them for the first time in your life. It also turns out that dropping properties wasn’t going to lead us much of anywhere. Instead, the differences between values are happily treated as our ordinary numbers in the blessedly-axiom-rich real number field, while the absolute values themselves just form a bare set that is associated with this field. Torsors are also called principal homogeneous spaces, where the “homogeneous space” brings in the group action and the “principal” property implies a kind of linearity. Affine spaces, in fact, are exactly torsors whose group is a vector space.

Just like affine spaces, torsors are also defined as a set (call it Tabs for our temperature test-case) with a group action on it. Here, for our purposes, the group is the set (call it TΔ) of real numbers under addition. Because these represent temperature differences, TΔ occupies the entire range −∞ C° to +∞ C°. The set Tabs spans [−273.15 °C, +∞ °C), and the torsor properties guarantee that a temperature difference exists between any two absolute temperatures. The restriction that an arithmetic operation must not produce an invalid absolute temperature (absolute hot, anyone?) will have to be enforced in the definition of the group action.
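Here is a minimal sketch of that structure in code. (The class name AbsoluteTemp and its exact API are my own invention for illustration, not a standard library.) Absolute temperatures form a bare set, differences are plain reals, and only the point-plus-difference action is defined:

```python
# Sketch of the temperature torsor: T_abs is a bare set of points,
# T_delta is the group/field of reals, and only point +/- difference
# and point - point are defined. Point + point raises an error.

ABSOLUTE_ZERO_C = -273.15  # lower bound of the set T_abs, in degrees Celsius

class AbsoluteTemp:
    """A point in the torsor T_abs; no addition between two AbsoluteTemps."""
    def __init__(self, celsius: float):
        if celsius < ABSOLUTE_ZERO_C:
            raise ValueError("below absolute zero")
        self.celsius = celsius

    def __sub__(self, other):
        if isinstance(other, AbsoluteTemp):
            # point - point -> group element (a plain real, in C-degrees)
            return self.celsius - other.celsius
        # point - difference -> point (the inverse group action)
        return AbsoluteTemp(self.celsius - other)

    def __add__(self, delta):
        if isinstance(delta, AbsoluteTemp):
            raise TypeError("cannot add two absolute temperatures")
        # point + difference -> point (the group action); the constructor
        # enforces the no-invalid-temperature restriction
        return AbsoluteTemp(self.celsius + delta)

morning = AbsoluteTemp(10.0)
noon = AbsoluteTemp(30.0)
warming = noon - morning            # 20.0 C-degrees, an ordinary real
print(warming)                      # 20.0
print((morning + warming).celsius)  # 30.0
```

Trying `noon + morning` raises a TypeError, which is exactly the point of the whole exercise.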

That’s it! In one stroke, we have solved several niggling issues of how to represent these physical quantities. More examples:

  • Voltage, or more accurately electric potential difference, is only ever defined between two points. An electric potential does indeed exist at each point, but is undetermined and possibly unknowable by definition. Potentials at points in Euclidean space are members of a bare set of reals, while potential differences form the group/field of reals; we only ever evaluate the differences, though the calculations are based on electric fields at those points. Absolute potential is an ℝ-torsor.
  • Time is a more valuable case, given the pressing need for picking a zero that is otherwise meaningless. Time instants are members of a bare set of reals, and time intervals are members of the group/field of reals (an interval can be negative, after all, if you subtract a later instant from an earlier one). Time intervals naturally result from subtracting one time instant from another. Timezones would be an interesting concept to model here, and are perhaps better off defined as vectors. Time is an ℝ-torsor.

I have vivid memories of dealing with time programmatically, where the correct handling of, and distinction between, instants and intervals was what sold me on a particular software library.

Torsors also apply perfectly to memory pointers – the type of the pointer can be embedded into the group action as a size parameter which ends up multiplying the offsets. A void pointer would then be a set with no group action at all, essentially hobbling the ability to do pointer arithmetic. Therefore, pointers are ℤ-torsors!
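Here’s a hedged sketch of that idea (TypedPointer and its fields are illustrative names, not any real language’s pointer API): the group is ℤ, and the element size scales the offset inside the group action.

```python
# Sketch of "pointers are Z-torsors": addresses are points, integer offsets
# are the group, and the element size parameterises the group action.

class TypedPointer:
    def __init__(self, address: int, size: int):
        self.address = address  # a point in the torsor (a bare address)
        self.size = size        # bytes per element; 0 models a void pointer

    def __add__(self, n: int):
        # pointer + integer offset -> pointer (the group action)
        if self.size == 0:
            raise TypeError("no arithmetic on void pointers")
        return TypedPointer(self.address + n * self.size, self.size)

    def __sub__(self, other):
        if isinstance(other, TypedPointer):
            # pointer - pointer -> integer offset (same element size only)
            if self.size != other.size or self.size == 0:
                raise TypeError("incompatible pointers")
            return (self.address - other.address) // self.size
        return self + (-other)

p = TypedPointer(0x1000, 4)  # e.g. a pointer to 4-byte ints
q = p + 3                    # advances by 3 elements = 12 bytes
print(hex(q.address))        # 0x100c
print(q - p)                 # 3
```

A `TypedPointer(addr, 0)` behaves like a void pointer: any arithmetic on it raises an error, matching the no-group-action case above.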

To summarise: we represent temperature with a real number and a unit, and we expect to add/subtract/multiply/divide it like the reals, because the set of real numbers comes with closure under addition and multiplication (making it what algebra calls a field). For many physical quantities, however, that doesn’t make sense, and we need a more restrictive algebraic structure than a field. For such quantities, the right way is to classify absolute temperatures as a different type of object from temperature differences (which are just numbers), and to explicitly define addition between an absolute value and a difference. This structure is called a torsor.

Diving deeper

This actually felt like a small book, phew. I’m also linking a couple of further good resources:

  • This MathOverflow answer goes into the origin of the term
  • Matt Baker’s alternate construction of proportion spaces is later shown by commenters to be equivalent to a heap.
  • And finally Terence Tao (perhaps the greatest living mathematician) has written a beautiful piece on the ways to axiomatise physical units and dimension. He also mentions torsors for the same reason that I do (yay!).

The heap construction is also quite cleverly suited to the matter at hand, managing without the need for a separate set and group. Instead, it uses a single ternary operation (our usual arithmetic operations are all binary!) which more or less comes out to a−b+c, with some properties: it satisfies the equivalent of a−a+b = b, and, given five elements of the heap, a−b+(c−d+e) = (a−b+c)−d+e. The elements are the same as the elements of the torsor itself: absolute temperatures, absolute potentials, time instants, absolute pointers, etc. This very much does satisfy the restriction we need. The only reason I prefer the torsor construction is that it explicitly defines the differences/intervals as members of the group, which obviously have their own rightful existence.
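The two identities are easy to check numerically. A quick sketch, treating temperatures as bare numbers and writing the ternary operation as a function h (the sample values are arbitrary):

```python
# The heap's single ternary operation: h(a, b, c) plays the role of a - b + c.
def h(a, b, c):
    return a - b + c

a, b, c, d, e = 30.0, 10.0, 25.0, 5.0, 40.0

# Identity 1: a - a + b = b
assert h(a, a, b) == b

# Identity 2 (para-associativity): a - b + (c - d + e) = (a - b + c) - d + e
assert h(a, b, h(c, d, e)) == h(h(a, b, c), d, e)

print("heap identities hold for this sample")
```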

Epilogue

I want to be clear that there is no original work here, lest someone mistake it for such. I simply scoured the internet and am paraphrasing existing work targeted at the apparently-niche need of merely most physical parameters (!) I have been liberal in hyperlinking multiple good resources, because I needed all of them together to get the picture. I am not exactly a mathematician by training, let alone by profession, so I gladly welcome any corrections or discussions. I haven’t used much mathematical notation here because I am not doing any rigorous work and would not like to give anyone that idea. (Plus I’m hoping to add LaTeX capability to my blog another day.) And if anyone liked the usage of typographical symbols for degrees, sets, and the minus sign, I’ll be happier still. Adiós, amig(o/a)s!

Why can’t we add two temperatures?: Part #2

In the first part, I talked about intuitive reasons why some physical quantities can’t be subjected to the full repertoire of math operations. There’s more fun to be had, however, in the mathematical aspects of this limitation for related quantities like position and pointers (which are not addable for a closely related but distinct reason from that of temperature).

Let’s check out your room

Let’s ask a different but equally simple question as in the last part: why can’t we add positions in 3-D space (or 2-D space for that matter)?

By which I mean: if you fix any given point in your bedroom as your origin, and mark the positions of various objects as 3-tuples (x, y, z), it makes literally no sense to add the position of your unwashed-clothes-bearing chair to that of your underused washbasket (with a caveat for later). It does make sense to subtract them, though, which gives you the 3-tuple displacement between the two (pointing from your washbasket to your chair, if you’re keeping track). It is also probably congruent to too-far-to-move-stuff currently.

A room with clothes on a chair and a separate washbasket
Courtesy Google Gemini


So now that we have brought in displacements, what can we say about them? Pretty straightforward: a displacement is a vector quantity, which just means it has a magnitude and a direction. How do displacements differ from positions (after all, both are 3-tuples)? The general answer on the internet is to distinguish between bound and free vectors. The terminology is fairly widespread in teaching material – I encountered it too, but none of us picked up on the distinctions involved and it never mattered. The biggest problem is that no one tells us which operations can be done on these bound and free vectors. They also tend to be emphasised in the definitions, but eventually the distinction is lost (see here, for e.g., towards the end, even though great care was taken initially) and any ordered sequence of 3 numbers becomes a vector. And then only intuition keeps calculations from becoming nonsensical. Bound vectors require more structure, so we have to go back to some definitions.

What’s a vector anyway?

Given the utter simplicity of the question, for which apparently even STEM graduates, let alone school teachers, can’t give the proper answer, we should probably be more rigorous than “a vector is a quantity with magnitude and direction”.

Source: https://xaktly.com/MathVectors.html

To do that, we first have to figure out what a scalar is. Again, the simple physics answer is a physical quantity (such as temperature) that can be represented by magnitude alone (which is a number) with a unit. But merely being a number hasn’t told us whether certain scalars can be added or multiplied together. In math, though, a scalar is a member of a field, which can be added to, subtracted from, multiplied by, or divided by another (non-zero) scalar to yield another scalar. A very beautiful way of rewording this is that the field is closed under these four operations.

Building on this, a vector space over a scalar field is basically a set of sequences/tuples of scalars together with two binary operations: + (applies to two vectors and results in another vector), and × (applies between a scalar and a vector and results in another vector) that satisfy several axioms which I’ll not get into here. Among many others, the displacements between positions in your room that we talked about above can be added and multiplied/stretched by a scalar, and are clearly vectors.
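As a tiny illustration of those two operations (the coordinates here are made up), displacements stay displacements under both:

```python
# The two vector-space operations on displacements-as-3-tuples of real scalars.

def vadd(u, v):
    # vector + vector -> vector (closure under addition)
    return tuple(a + b for a, b in zip(u, v))

def smul(k, v):
    # scalar x vector -> vector (stretching by a scalar)
    return tuple(k * a for a in v)

step1 = (1.0, 0.0, 0.5)    # a displacement in the room, metres
step2 = (0.0, 2.0, -0.5)   # another displacement
print(vadd(step1, step2))  # (1.0, 2.0, 0.0) -- still a displacement
print(smul(2.0, step1))    # (2.0, 0.0, 1.0) -- still a displacement
```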

Already, you can see that the vector space is also closed under addition, but multiplication of two vectors isn’t part of the deal.

This closure property makes a pretty powerful statement – given any possible vectors and the addition operation, you have no hope of getting out of that vector space with any combination of operations. Sounds scary if you’re in vector jail, but very satisfyingly safe if viruses were vectors. (Unfortunately, viruses are carried by a very different kind of vector, though with surprising etymological commonality.) More to the point, closure is amazing if you’re trying to set boundaries, or prove what you can/can’t do with things and operations on those things. Same for velocities, accelerations, angular velocities, forces, yada yada yada, you get the drift.

But back to our position-addition problem. The main reason I’m not going into the vector space axioms is that they’re utterly irrelevant to what comes next. As we saw above, the scalars that vectors are defined over come from a field. And a field is a set that is closed under addition, multiplication and their inverse operations (no prizes for guessing which those are). But the “scalars” of our positions, i.e. the underlying x-, y-, and z-components of the positions of your chair and washbasket, don’t themselves make sense to add (with the aforementioned caveat). Not least because the result depends nonsensically on whether the origin is the corner of your room or the other side of the country. The components of a position just can’t be meaningfully added, and thus don’t even begin to form a field.

Incidentally, we can now recognise that voltage and time, also “scalar” quantities in physics, do not form fields.

A familiar yet rigorous term

It turns out that geometry, only a little more rigorous than I learned in high school, always had the perfect answer to this issue of position arithmetic. It even has a very familiar-sounding name which pops in and out of common parlance and mathematical boilerplate. The concise answer is that positions in our real world belong to a Euclidean space. And Euclidean spaces in general are NOT vector spaces! Rather, they are a very important type of structure called affine spaces. Affine spaces are sets whose members are called points (sound familiar?), associated with a vector space and an action (we’re almost there). This action is a kind of function that adds a point and a vector to give… another point!

There are further technical conditions that successively add the expected intuitive properties of physical space onto the mathematical object. The Euclidean space also requires the presence of a dot product on its associated vector space, which doesn’t come built-in with a regular vector space.

One of these technical conditions essentially says that by fixing one particular point in the affine space as the origin, there must exist a corresponding vector for every point and vice versa. And this is the origin of the sloppy thinking that got us to this problem in the first place.

We were taught, loosely, that points were some kind of vectors, which some people call “bound vectors” or “position vectors”. The truth is that the vectors from a given point of the affine space to every other point are displacement vectors, or “free vectors”. And adding the displacement vectors corresponding to two points is tantamount to marking the point that completes a parallelogram. And that parallelogram depends completely on what the origin is, and is thus a fairly useless object.
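The point/vector distinction is easiest to see as code. A minimal sketch (the class names Point and Vector are mine, chosen for this illustration): Point + Vector is a Point, Point − Point is a Vector, and Point + Point is simply not defined.

```python
# Sketch of the affine-space action for the room example.

class Vector:
    def __init__(self, *components):
        self.components = components
    def __add__(self, other):
        # vector + vector -> vector, as usual
        return Vector(*(a + b for a, b in
                        zip(self.components, other.components)))

class Point:
    def __init__(self, *coords):
        self.coords = coords
    def __add__(self, v):
        # the action: point + vector -> point
        if isinstance(v, Vector):
            return Point(*(c + d for c, d in zip(self.coords, v.components)))
        raise TypeError("cannot add two points")
    def __sub__(self, other):
        # point - point -> vector (the displacement)
        return Vector(*(a - b for a, b in zip(self.coords, other.coords)))

chair = Point(2.0, 3.0, 0.0)
basket = Point(4.0, 1.0, 0.0)
disp = chair - basket          # a Vector, pointing from basket to chair
print((basket + disp).coords)  # (2.0, 3.0, 0.0) -- back at the chair
```

Trying `chair + basket` raises a TypeError, which is exactly the operation the “position vector” habit sneaks past us.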

The caveat

However, it turns out that a weighted sum of points, provided the weights sum to one, is independent of whichever point was picked as the origin. Such a weighted sum is called the barycentre, or the centre of mass. In the case of your room, this limited form of addition for positions (points, to be precise) does exist, and can, for e.g., put you midway between your washbasket and your overfull chair for maximum dilemma.
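Here is a small numeric check of the caveat (chair and washbasket coordinates invented for illustration): a weighted combination of points is origin-independent exactly when the weights sum to one.

```python
# Shifting the origin translates every point by the same offset; a weighted
# sum of points "moves with" the points only when the weights sum to 1.

def weighted_sum(points, weights):
    return tuple(sum(w * p[i] for w, p in zip(weights, points))
                 for i in range(len(points[0])))

def shift(point, offset):
    return tuple(p + o for p, o in zip(point, offset))

chair = (2.0, 3.0, 0.0)
basket = (4.0, 1.0, 0.0)
move = (10.0, -5.0, 7.0)  # the translation from old origin to new coordinates

# weights 0.5 + 0.5 = 1: the midpoint (barycentre) is origin-independent
midpoint = weighted_sum([chair, basket], [0.5, 0.5])
shifted_mid = weighted_sum([shift(chair, move), shift(basket, move)],
                           [0.5, 0.5])
print(shifted_mid == shift(midpoint, move))  # True

# weights 1 + 1 = 2: the naive "sum of positions" depends on the origin
bad = weighted_sum([chair, basket], [1.0, 1.0])
shifted_bad = weighted_sum([shift(chair, move), shift(basket, move)],
                           [1.0, 1.0])
print(shifted_bad == shift(bad, move))  # False
```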

As you can see, kind of the whole point of affine spaces is that there is no meaningful zero-point for the entire universe. You just have to pick some arbitrary point and call it the origin, but the choice doesn’t affect anything. The tradeoff, however, is that the points themselves have no representation in terms of numbers. You have to resort to the associated vector space, at which point changing the origin changes the components themselves.

Phew!

So that solves our problem of how to treat positions: positions are points in an affine space, vectors can be added to them, two points can only be subtracted from one another, and no other operations are defined on points. There are some decent references for this if you search hard enough, and some that address it in the much more general and mind-bending setting of differential geometry. This finally resolves the mess without requiring “free” and “bound” vectors.

As for memory pointers, they are basically positions of a kind, and thus it makes sense that adding doesn’t work for them either. This is unlike temperature, where finding and using a true zero as in the Kelvin scale (or Rankine scale! there’s more than one way to skin that cat!) enables addition.

So if we now return to our original question in part 1, of adding temperatures, and ask what mathematical structure represents temperature best, do we have an answer? Well… a relatively simple solution is a 1-D affine space, or just the real line with affine structure added. However, it seems excessive to involve a field, a vector space, and an inner product when we want to talk about “scalars”! So, in the next (and final) part, we continue our quest to pare down the confusing cruft around our most basic physical quantities.

Why can’t we add two temperatures?: Part #1

A very simple question took me down the depths of mathematics: Why can’t we add two temperatures like 30 °C and 10 °C and get 40 °C?

Of course temperatures in kelvin can be added, and of course temperature differences can be added. There is a little-used scientific convention denoting temperature differences as, e.g., C°, while absolute temperatures are °C. This difference in notation, and the very significance of the Kelvin scale, presages some fairly fundamental underlying issues which we will explore in this article. (Or rather, we will rediscover what made physicists choose this distinction in the first place.) But… wait… how are temperature differences obtained again?! Yep… we have the weird situation of the Celsius (and Fahrenheit) scales allowing subtraction while not allowing addition.

Importantly, it’s not even the only such physical quantity. Absolute time also cannot be added – I’ll challenge you to add 1st March 2023 and 3rd July 2025 and tell me the result! Absolute electric potential (which gives rise to voltage) also cannot be added – it simply cannot be assigned a unique numeric value (although the potential at infinity is commonly fixed to be zero, that’s not really a privileged choice).

Why exactly do we have this problem? Before we dive in, be aware that “why?” is not always an answerable question. In fact, taken too far, it leads to the problem of infinite epistemic regress. With that caveat, let’s go however far we can.

What do we mean by temperature, first of all? Apparently, the degree of hotness or coldness (I didn’t like this definition in middle school and I still don’t. Edit: “the parameter defining the equilibrium distribution of energies among statistically independent particles” and “the parameter characterising objects in thermal equilibrium” feel much better.). It is written as a number with a unit, in this case Celsius, though any scale would work. We imagine we could put any number in there, which typically means we use real numbers (here with the restriction that T ≥ −273.15 °C), and hence it looks like we could apply all the arithmetic operations to it. But clearly addition, and with it multiplication, don’t fit temperature (recall that multiplication can be viewed as repeated addition).

No physics envy here

It turns out that the most famous theoretical justification comes not from physics or mathematics, but from psychology (with statistics having a claim to it too). And although a few formidable mathematicians such as Tukey (a co-inventor of the extremely consequential Fast Fourier Transform, and apparently also the origin of the words “bit” and “software”!) have contributed to the field, I am not alone in feeling that there’s been a conspicuous lack of mathematical attention (to put it mildly).

The contribution from psychology is the “scales of measurement”, first proposed by Stevens. He divides measurements into the nominal scale (just names, like colours), the ordinal scale (names with ranks, like college grades), the interval scale (numbers with differences but no ratios), and the ratio scale (numbers with a true zero, and hence all arithmetic operations allowed; edit: also called an absolute scale). If you think he was secretly an algebraist LARPing as a psychologist, I’m with you, dear reader.

Stevens’ classification answers our original question by saying that the lack of a true zero denies the Celsius scale the meaning of “temperature is a quantity of something”, making it an interval scale. Since a temperature is not quite a quantity of heat, adding two of them doesn’t give us a real quantity of any kind. The same applies to time (because the Gregorian calendar isn’t counting time from the Big Bang) and to electric potential (because the physical theory simply doesn’t allow a real zero).

And – this will be interesting to some of you who aren’t into the natural sciences (unnatural scientists?) – pointers to memory in languages such as C or C++ also can’t be added. However, there is much subtler and more interesting mathematics underlying pointer arithmetic, which I will cover in one or more followups to this article. Stay tuned!

Postscript: Measurement theory

Stevens’ contributions attracted little attention from natural scientists for a long time. Tukey’s additions frankly don’t feel like a substantial upgrade to the framework. Chrisman’s further expanded framework doesn’t add much either, with one very important exception for cyclic quantities like months or times within a day, which were sorely needed (more in upcoming articles in this series!). Only recently has this field gotten the rigour it deserves, now called Measurement Theory, and it is best accessed via a good review, given that it doesn’t have a wiki article yet.

SCTP on the internet: A (small) happy surprise

You might have heard about HTTP/3, which runs on top of QUIC – a spanking-new transport layer protocol that replaces TCP. (If you’re thinking, “waitaminute, didn’t HTTP/2 come out just recently?!”, you’re not alone; it’s like there’s suddenly an urgency to develop what was previously iterated on slooowwwly and carefully.)

The venerable TCP has been around for around 50 years now, but has had revisions throughout the years, some as recently as 2001. With changing patterns of usage (driven primarily by dominance of web traffic), further evolution was found necessary. The consensus view is that TCP today has been “ossified” in its current configurations by network middle-boxes. Hence QUIC, proposed and developed in large part by Google in the last decade, is intended to replace TCP, at least for web uses. So clearly, there was no other candidate protocol… right?

There was a choice, developed in the 90s and standardised for around 20 years now. The Stream Control Transmission Protocol (SCTP) was originally intended for PSTN (telephone) signalling, but is a general-purpose modern transport protocol. It supports streams, one of the headline features of QUIC (and HTTP/2 as well), and multi-homing1, a feature not yet supported by QUIC. So why did SCTP not replace TCP?

For the same reason QUIC has to be tunnelled over UDP (it is defined as such in the QUIC specification; it’s not a choice). SCTP is one of the poster-children of protocol ossification — network equipment that you don’t control sits between you and your desired content, interfering with protocols in ways you usually can’t change, such that you’re forced to use the lowest-common-denominator protocol. The biggest of these interferences is to pass through only TCP (and, at best, UDP), branding everything else as unusual and simply dropping it in the name of security.

IPv6: the game-changer

But what about IPv6? IPv6 does not need or want NA(P)T. The question arises — would SCTP face the same problem on IPv6? Sometimes, the right answer is to just try it out and see what happens and how bad it is…

To test whether SCTP works or not, I’ll be using iperf3. First step is to use SCTP in a local, private network – good things begin at home.

Using Fedora 38 on both ends, I had to do the following to get iperf to connect and test successfully:

  • Enable kernel SCTP module with modprobe sctp
  • Allow TCP and SCTP port 5201 in the firewall — the TCP port is to allow iperf to do some apparently necessary coordination between client and server before the actual bulk transfer test
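Before reaching for iperf3, the SCTP setup can be sanity-checked from Python. A hedged sketch (this is not anything iperf3 itself does): on Linux, SCTP is reachable through the ordinary sockets API once the kernel module is loaded.

```python
import socket

# IANA assigns SCTP protocol number 132; Python exposes it on Linux as
# socket.IPPROTO_SCTP whether or not the kernel module is loaded.
print(socket.IPPROTO_SCTP)  # 132

try:
    # A one-to-one style SCTP socket, analogous to a TCP SOCK_STREAM socket.
    # Creation typically fails with "Protocol not supported" until the
    # kernel module has been loaded via `modprobe sctp`.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                         socket.IPPROTO_SCTP)
    sock.close()
    print("kernel SCTP support available")
except OSError as err:
    print("no SCTP support:", err)
```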

Here’s how that looks

[milind@desktop-linux ~]$ iperf3 -c 192.168.1.212 --sctp
Connecting to host 192.168.1.212, port 5201
[  5] local 192.168.1.107 port 58950 connected to 192.168.1.212 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  66.2 MBytes   556 Mbits/sec                  
[  5]   1.00-2.00   sec  67.0 MBytes   562 Mbits/sec                  
[  5]   2.00-3.00   sec  72.5 MBytes   608 Mbits/sec                  
[  5]   3.00-4.00   sec  70.1 MBytes   588 Mbits/sec                  
[  5]   4.00-5.00   sec  66.7 MBytes   560 Mbits/sec                  
[  5]   5.00-6.00   sec  65.2 MBytes   547 Mbits/sec                  
[  5]   6.00-7.00   sec  57.9 MBytes   486 Mbits/sec                  
[  5]   7.00-8.00   sec  63.2 MBytes   530 Mbits/sec                  
[  5]   8.00-9.00   sec  68.4 MBytes   574 Mbits/sec                  
[  5]   9.00-10.00  sec  63.2 MBytes   531 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec   661 MBytes   554 Mbits/sec                  sender
[  5]   0.00-10.00  sec   660 MBytes   554 Mbits/sec                  receiver

iperf Done.

Well, it works, phew, but why such low transfer throughput? Here’s how it looks for TCP:

[milind@desktop-linux ~]$ iperf3 -c 192.168.1.212
Connecting to host 192.168.1.212, port 5201
[  5] local 192.168.1.107 port 33508 connected to 192.168.1.212 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   116 MBytes   970 Mbits/sec    0    662 KBytes       
[  5]   1.00-2.00   sec   112 MBytes   944 Mbits/sec    0    662 KBytes       
[  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0    696 KBytes       
[  5]   3.00-4.00   sec   112 MBytes   944 Mbits/sec    0    696 KBytes       
[  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec    0    757 KBytes       
[  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    757 KBytes       
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    757 KBytes       
[  5]   7.00-8.00   sec   112 MBytes   944 Mbits/sec    0    757 KBytes       
[  5]   8.00-9.00   sec   112 MBytes   944 Mbits/sec    0    757 KBytes       
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    757 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   944 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver

iperf Done.

What gives? UDP?

[milind@desktop-linux ~]$ iperf3 -c 192.168.1.212 --udp
Connecting to host 192.168.1.212, port 5201
[  5] local 192.168.1.107 port 51952 connected to 192.168.1.212 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec   129 KBytes  1.05 Mbits/sec  91  
[  5]   1.00-2.00   sec   127 KBytes  1.04 Mbits/sec  90  
[  5]   2.00-3.00   sec   129 KBytes  1.05 Mbits/sec  91  
[  5]   3.00-4.00   sec   127 KBytes  1.04 Mbits/sec  90  
[  5]   4.00-5.00   sec   129 KBytes  1.05 Mbits/sec  91  
[  5]   5.00-6.00   sec   129 KBytes  1.05 Mbits/sec  91  
[  5]   6.00-7.00   sec   127 KBytes  1.04 Mbits/sec  90  
[  5]   7.00-8.00   sec   129 KBytes  1.05 Mbits/sec  91  
[  5]   8.00-9.00   sec   127 KBytes  1.04 Mbits/sec  90  
[  5]   9.00-10.00  sec   129 KBytes  1.05 Mbits/sec  91  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  1.25 MBytes  1.05 Mbits/sec  0.000 ms  0/906 (0%)  sender
[  5]   0.00-10.00  sec  1.25 MBytes  1.05 Mbits/sec  0.027 ms  0/906 (0%)  receiver

iperf Done.

Uhm, so much for a gigabit link through a 2.5g switch. Something is clearly up… though in UDP’s case the culprit is iperf3 itself, whose UDP mode targets a default bitrate of 1 Mbit/s unless you raise it with -b.

Okay, so it works on my local network at home. Big deal. That isn’t the “internet” by any stretch of the imagination. Alright then, on to the next step. I guess I’m lucky that my ISP gives me a single full /64 IPv6 subnet of addresses (although it’s “not official” and I had to figure out how to enable it myself). With that I venture forth with no interest in legacy technologies like IPv4 <flips non-existent long hair backward arrogantly>.

So I got a friend with a Raspberry Pi to hook it up to an LTE USB modem (usually called a “dongle”, with an Airtel SIM), and configured it to not bother with any intermediate routing shenanigans2. I SSHed into it (yep, that was also over IPv6!) to control and observe both sides. I then tried the same experiment over IPv6 using SCTP. With bated breath…

[milind@desktop-linux ~]$ iperf3 -c sctptest.sampleddns.net -6 --sctp -M 1000
Connecting to host sctptest.sampleddns.net, port 5201
[  5] local 2406:7400:9f:xxxx:xxxx:xxxx:xxxx:xxxx port 39980 connected to 2401:4900:yyyy:yyyy:yyyy:yyyy:yyyy:yyyy port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   832 KBytes  6.81 Mbits/sec                  
[  5]   1.00-2.00   sec  1.38 MBytes  11.5 Mbits/sec                  
[  5]   2.00-3.00   sec  1.38 MBytes  11.5 Mbits/sec                  
[  5]   3.00-4.00   sec  1.31 MBytes  11.0 Mbits/sec                  
[  5]   4.00-5.00   sec  1.31 MBytes  11.0 Mbits/sec                  
[  5]   5.00-6.00   sec  1.19 MBytes  9.96 Mbits/sec                  
[  5]   6.00-7.00   sec  1.06 MBytes  8.91 Mbits/sec                  
[  5]   7.00-8.00   sec  1.38 MBytes  11.5 Mbits/sec                  
[  5]   8.00-9.00   sec  1.31 MBytes  11.0 Mbits/sec                  
[  5]   9.00-10.00  sec  1.38 MBytes  11.5 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  12.5 MBytes  10.5 Mbits/sec                  sender
[  5]   0.00-10.02  sec  12.4 MBytes  10.4 Mbits/sec                  receiver

So SCTP over the open internet works!! (Note: hostname shown is faked)
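As an aside: before running tests like these, it is worth confirming that the local kernel can even create SCTP sockets. A minimal probe (my own sketch, not part of the original experiment):

```python
# Check whether this kernel can create an SCTP socket at all (sketch of mine).
# IPPROTO_SCTP is 132 per IANA; Python exposes it on platforms that have it.
import socket

def sctp_supported(family=socket.AF_INET6):
    proto = getattr(socket, "IPPROTO_SCTP", 132)
    try:
        s = socket.socket(family, socket.SOCK_STREAM, proto)
    except OSError:
        # e.g. EPROTONOSUPPORT: no SCTP support in this kernel,
        # or EAFNOSUPPORT: no IPv6 on this host
        return False
    s.close()
    return True

print(sctp_supported())
```

iperf3’s --sctp mode relies on exactly this kind of socket being available on both ends, so a False here rules out the test before any middle-boxes even enter the picture.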

On one of the tries, I had to change the Maximum Segment Size (MSS) to 1000 bytes instead of the 1424 it assumed to begin with. Some experimentation showed the maximum workable MSS to be 1283 — its significance is as yet unknown. However, on another try on a different day it worked without any such parameter. Further digging required, clearly.
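For scale: plain header overhead alone cannot explain a 1283-byte ceiling. My own back-of-the-envelope arithmetic, assuming a standard 1500-byte Ethernet MTU and a single DATA chunk per packet:

```python
# Theoretical SCTP-over-IPv6 payload per packet at a 1500-byte MTU
# (my arithmetic, not a measurement from the experiment above).
mtu = 1500
ipv6_header = 40    # fixed IPv6 header, assuming no extension headers
sctp_common = 12    # SCTP common header
data_chunk = 16     # SCTP DATA chunk header
payload = mtu - ipv6_header - sctp_common - data_chunk
print(payload)  # 1432
```

So the observed 1283 cap sits some 150 bytes below even this conservative figure; something else (tunnelling on the LTE path, perhaps) is eating the difference.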

Another trial, this time using a hotspot from my phone’s LTE (by Vodafone-Idea, or as they glibly call themselves, “Vi”) connection to this Raspberry Pi.

[milind@desktop-linux ~]$ iperf3 -c sctptest.sampleddns.net --sctp
Connecting to host sctptest.sampleddns.net, port 5201
[  5] local 2402:8100:25xx:xxxx:xxxx:xxxx:xxxx:xxxx port 42110 connected to 2401:4900:yyyy:yyyy:yyyy:yyyy:yyyy:yyyy port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   128 KBytes  1.05 Mbits/sec                  
[  5]   1.00-2.00   sec  64.0 KBytes   525 Kbits/sec                  
[  5]   2.00-3.00   sec   128 KBytes  1.05 Mbits/sec                  
[  5]   3.00-4.00   sec   256 KBytes  2.10 Mbits/sec                  
[  5]   4.00-5.00   sec   192 KBytes  1.57 Mbits/sec                  
[  5]   5.00-6.00   sec   192 KBytes  1.57 Mbits/sec                  
[  5]   6.00-7.00   sec   192 KBytes  1.57 Mbits/sec                  
[  5]   7.00-8.00   sec   192 KBytes  1.57 Mbits/sec                  
[  5]   8.00-9.00   sec   128 KBytes  1.05 Mbits/sec                  
[  5]   9.00-10.00  sec   192 KBytes  1.57 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  1.62 MBytes  1.36 Mbits/sec                  sender
[  5]   0.00-10.26  sec  1.56 MBytes  1.28 Mbits/sec                  receiver

iperf Done.

Trying it the other way round (from RPi to desktop) didn’t work at first, either because a firewall somewhere is “protecting” me, or because there truly are middle-boxes at play… ?

To complete the triangle, I also ran the test on its third leg, between my desktop on FTTN and my laptop on the LTE-fed hotspot. I could not get it to work with the desktop as server and the laptop as client, until I tried opening the port on my trusty Ubiquiti EdgeRouter X firewall. (I’m not going to show the output here, both because it occupies space and because I’m embarrassed at the low Wi-Fi speed of the laptop.)

Switching back to the RPi on Airtel as a client and my desktop as a server, it was successful, showing it was indeed my firewall blocking connections.

ISP, connection method            Client ↓ \ Server →    Desktop   Laptop   Raspberry Pi
ACT Fibernet, PPPoE               Desktop                   —        No        Yes
Vi, LTE hotspot hosted by phone   Laptop                   Yes       —         Yes
Airtel, LTE direct                Raspberry Pi             Yes       No        —

Summary of my SCTP experiments over the internet

Overall, the score is 2 out of 3 internet connections working okay, with one that may or may not be affected by Android’s firewall.

So there’s something to this after all, eh? Other than the MSS, I encountered issues only with firewalls on my devices — hardly “middle-boxes”.

The next step, of course, is to verify that Vi LTE also really does support SCTP over IPv6. If so, a good follow-up test (which I hope to do “soon”) would be to check SCTP support across all the major mobile operators and the major wired ISPs that support IPv6.

To be sure, this is a very small sample set, and IPv6 is not exactly dominant on the internet yet. But it does show that sometimes the Right Thing™ happens even without explicit intent, and we should stay attentive to such opportunities and grab them. The real lesson is to not give up before even trying.


  1. I must take a sentence out to forcefully recommend Avery Pennarun’s article on networking. Nominally it is about IPv6, but each digression grants an interested reader deep insight; it is the very best on the topic. ↩︎
  2. Which meant that on IPv4 I got a single “public” address directly from the ISP, which was of course lovingly CGNATed. ↩︎

M.2 NVMe, CMOS batteries, 2.5GbE adapters, and a wild goose chase

This is a tale of my personal wild goose chase and frustration, due to my misplaced trust in Corsair PSUs, while building my new desktop. I thought it would be interesting/cathartic for others in the same (crappy) place — more concretely, it might help a few people avoid this particular wild goose chase; there are of course other geese to catch! Strap in, folks, it’s going to be a bit of a lengthy ride!

Hope

The story starts with my college friend (a desktop enthusiast) who offered to get me (an even more enthusiastic desktop … err… enthusiast) reasonably priced parts from the US. A bunch of hunting later, we had the perfect config pinned down. I am a minimalist of sorts, so I chose the DRAM-less Samsung SSD 980 NVMe 500 GB as my OS drive. An Asus ROG Strix B550-F Gaming (without WiFi — it’s not a laptop!) complemented the Ryzen 7 5700G APU – graphics cards are still just too expensive for anyone like me who’s not too big on gaming right now. PCIe 3.0 instead of 4.0 was an acceptable bargain. A TRENDnet TEG-S350 2.5G switch was a perfect complement to the 2.5G Ethernet port built into the motherboard. And finally, just for kicks, I added an Intel Optane 16 GB (MEMPEK1W016GAXT I think) as a superfast (technically, it should be the fastest in latency terms) SSD to play with. I already had a Corsair CX430M lying around from a previous build, so I didn’t need anything new in that department. All of this detail turns out to be relevant, I promise!

Battle

A few months later, I had the parts and began building the rig. My first problem was that both the Windows and Fedora installers just refused to install to the Samsung SSD — giving very weird errors amounting to “this drive is somehow not possible for me to go ahead with installing”1. Sometimes the installer simply didn’t see the drive at all. At about this point, I made sure to upgrade the firmware of the motherboard (“BIOS”/“UEFI”).2

Now, there were two M.2 slots available: one connected to the CPU (APU?) and the other to the southbridge. I started off with the Samsung in the CPU slot and the Intel in the SB slot, which didn’t work — the Windows installer just didn’t see the Samsung drive while trying to install. I switched them around based on some hints online; now it detected neither (my memory is hazy here — this is more than 1.5 years old at this point). Somehow switching back made the drive available again — wary and tired, I seized the chance and finished installing. I suspect that was a rare window and that I got lucky.

After installing, the problems didn’t stop, as one may expect. Typical BSODs on Windows were of the CRITICAL_PROCESS_DIED kind; on Fedora I encountered numerous disk read errors (the kind that drop you out of the desktop environment and into text-mode).

Preparation

At some point I took up this problem to root-cause and solve it — and quickly ran into a number of documented issues about Samsung NVMe firmware in general, and about ASPM in particular. Specifically, it was about the transition from D3Cold (a very low power mode with high latency) to D3Hot (a slightly higher power mode with lower latency) that was apparently failing to happen in time. Another issue that raised its head at this time was that the BIOS failed to retain settings whenever the desktop was disconnected from the mains — how had the CMOS battery run out so quickly? At some point I replaced the battery and hoped that the battery failure was a weird one-off… Fat chance!
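Incidentally, it is easy to see which ASPM policy a Linux kernel is actually running with before reaching for workarounds. A small probe of mine (Linux-specific; the sysfs path is the pcie_aspm module’s standard parameter):

```python
# Report the kernel's global ASPM policy (my own sketch, not from the post).
# The bracketed entry in the output is the active policy,
# e.g. "[default] performance powersave powersupersave".
from pathlib import Path

def aspm_policy():
    p = Path("/sys/module/pcie_aspm/parameters/policy")
    try:
        return p.read_text().strip()
    except OSError:
        return None  # not Linux, or sysfs unavailable (e.g. in a container)

print(aspm_policy())
```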

First, I took up the firmware — and found that there was absolutely nothing to do… the SSD and the motherboard already had the latest firmware. After dicking around disappointedly and aimlessly3, I had to resort to workarounds to totally disable ASPM. The BIOS was completely unhelpful in this. Windows was only a little better — it has a power plan option to disable PCIe Link State Power Management (LSPM). This didn’t help in the slightest.

Fedora was a lot better, in that it let me disable ASPM specifically, in two different ways, both as kernel parameters: nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off. Setting these seemed to reduce the incidence measurably, so I was convinced I was on the right track. I used Linux myself and could live with the lower incidence of crashes.
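In case it helps anyone: one way to make those parameters persist across reboots on a GRUB-based Fedora install (a sketch, assuming the stock /etc/default/grub layout; regenerate the config afterwards with grub2-mkconfig, or use grubby --update-kernel=ALL --args="…" instead):

```shell
# Fragment of /etc/default/grub; the two trailing parameters are the ASPM
# workarounds. "rhgb quiet" is just Fedora's default command line.
GRUB_CMDLINE_LINUX="rhgb quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
```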

Stagnation

But this was not okay with my parents, who were definitely not going to happily switch to Linux. I had given up on fixing the issue in the short term, and so I had only one bad option left — install Windows on a SATA drive to get around this firmware bug. I had an old 128 GB 750 EVO from my previous desktop, and in short order I installed Windows on it. The bug was definitely gone, and things seemed to be working perfectly. Hurray?

And this is where things stayed for the next year or so, to the point that fixing the main issue had become a low-priority matter that only occasionally reasserted its existence. Somewhere in that period, though, the CMOS battery had run out again, and I knew there was definitely a real problem somewhere.

Encounter

One day I finally decided I would actually start using the 2.5G switch (hitherto gathering dust in its original box) with the desktop (see, I told you it would eventually matter in this story!). The moment I switched (pun not intended) the gigabit err… switch on my desk for its faster cousin, Linux just flat-out refused to boot. That darned PCIe power management firmware bug was back! And now it was not occasional in the slightest — it literally never let Fedora proceed beyond the login screen! Windows was a little better in that it booted — but it would crash with near certainty within a few minutes of starting up.

My nemesis was back in full force, and this time I had significantly more time to solve the issue. So I attacked it again… and reached the same place as before — ASPM was the only possible cause according to the Internet. Windows on the SATA SSD was still working, but Windows was not giving me enough info to trace the problem — specifically, I needed dmesg.

Regroup

So I live-booted Linux from a USB flash stick. It booted up fine, as one would expect, but acted up very severely when checking the contents of the disk — several times I was subjected to the horror of seeing my NVMe SSD show up as a RAW device with size 0. Luckily, the fact that Linux and Windows still started booting reassured me that the device still kind of worked and the data still existed. The (USB- or NVMe-installed?) Linux log meanwhile made it clear that ASPM was the cause. On top of that, I was facing another issue — the network connection was flaky and frequently went down! And this happened in ALL the OSes – drat, now I was running into more nutty issues; was I just unlucky?

More hope

I checked again for motherboard firmware (UEFI) updates and found quite a few revisions had passed — there was hope! I installed the latest and, with trepidation, checked the options. On the “onboard devices” configuration page there was an option regarding ASPM — quick as a flash, I disabled it. I tried booting Linux on the NVMe drive… and it promptly failed to boot, like all the previous times!

At this point, I learned that the onboard Intel I225-V ethernet controller was seriously flawed and couldn’t run at 2.5G link speeds, at least in its initial hardware revision. The 2nd revision mitigated the issue and the 3rd totally fixed it. By this time, I was getting the idea that these problems might be connected — some searching revealed that the ethernet controller was powered by 3.3V and … drumroll please … so were the M.2 slots4! If you’re getting clues as to what the real problem was, you’re faster than me…

How would I know whether I had the fully functional ethernet chip, the partially fixed, or the fully flawed one? I found conflicting information – some said the “(3)” appended to the device name in Device Manager was the hardware revision, and some said it was only the firmware (“NVM”) revision. If it was the former, it meant I had the fixed version and the issues I was facing had their origin elsewhere.

Of course, I clung to the possibility that it was the latter, so that I could blame Intel/Asus for this problem. The only way to check the hardware revision, then, was to detach the VRM heatsinks and the I/O shield. I managed to do that with a healthy dose of Internet videos and my own intuition… it was definitely an interesting exercise, to see how the whole thing sat together. Ultimately though, it turned out I had the fixed hardware revision (SLNMH, in my case). Bummer! Back to square 1.5?

Despair again

I was now out of options, and could only think of taking the motherboard to Asus support — dreading the “you bought it in the US, we won’t give you support here” response. I wasn’t even sure I would get paid support. While procrastinating on this, I found a couple of posts on Reddit advising people with NVMe issues to check the power supply.

Ray of light

Now, I was confident that a Corsair power supply would at the very least supply reliable power, and I especially had no reason to suspect it because the kernel was clearly telling me it was a power state transition issue5 — surely it would only say that if it got reliable telemetry to that effect, rather than inferring/guessing? But out of idle curiosity I ran OpenHardwareMonitor. The post had advised checking the voltages specifically… lo and behold, the 3v3 rail was reading around 3.0 V or even lower!

Several Reddit posts strictly advised not to trust the built-in sensors, but to measure with a multimeter. I was wondering how I would do this, but luckily the CX430M is a semi-modular PSU — and I wasn’t using many of the SATA ports or any of the PCIe aux power ports on it. After scouring the internet for my PSU’s pinouts and finding them, I checked6 the 12 V rail, which was rock solid according to both the multimeter and the internal sensor.

Victory at last

The 3.3 V reading, however, wandered all over the place, down to 2.8 V even7, AND the dips coincided perfectly with the storage devices starting to misbehave. The kernel WAS indeed guessing about the D3Cold transition! Along with this, I also suddenly realised why the CMOS battery was dying so quickly — the motherboard apparently connects the PSU’s 3v3 directly to the CR2032 cell, forming a 3.3 V bus of sorts, so when the PSU was not doing its job, the cell was propping up the voltage. Among other things, this also explains why the NVMe drives fared better for a while in the middle — they were being powered by the cell… No wonder the poor thing ran out of juice quickly!
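For reference, the ATX specification allows a ±5% tolerance on the 3.3 V rail, so judging any reading is simple arithmetic. A trivial helper of my own:

```python
# ATX allows ±5% on the 3.3 V rail, i.e. roughly 3.135 V to 3.465 V.
# Helper of mine (not from the post) to judge a sensor/multimeter reading.
def rail_in_spec(volts, nominal=3.3, tolerance=0.05):
    lo, hi = nominal * (1 - tolerance), nominal * (1 + tolerance)
    return lo <= volts <= hi

print(rail_in_spec(3.3))   # True
print(rail_in_spec(3.0))   # False: the sensor reading was already out of spec
print(rail_in_spec(2.8))   # False: the multimeter reading, far out of spec
```

Both the ~3.0 V sensor reading and the 2.8 V multimeter dips were comfortably outside spec; this was never a marginal case.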

I had an older Corsair GS600, which I had preferred not to use because it is non-modular, but voltage stability is infinitely more important than avoiding redundant cables or prettier airflow. Swapping them finally got me the smoothly working desktop that I should have had 1.5 years ago. I installed a new CMOS battery and now all is right with the world once more… well, not exactly. Right around the time I found out about the wandering voltages, the Optane stopped responding to the BIOS or the OS, despite my swapping the drives between the M.2 slots multiple times.

But that is a goose to chase for another day.


  1. I will eventually reproduce this by putting the old PSU back in, and give you all the exact error messages I faced. ↩︎
  2. The “BIOS” was originally firmware on IBM PCs, and later was any firmware on PCs that adhered to that de facto standard. The “UEFI” is also firmware, one that adheres to the de jure UEFI specification. (A few de jure specifications like PC-98 and PnP BIOS also built on the de facto BIOS interface) ↩︎
  3. I realised that the SSD support page was bad enough that if there was any bug that was fixed, I would not be able to tell — because it was lacking this little bit of info called RELEASE NOTES! ↩︎
  4. Technically, the M.2 M-key slots specifically. The B-key slots work with SATA M.2 drives, which presumably would need 12V. But B-key slots are rare today I think. Incidentally, this was also probably the reason to make M-key flash devices incompatible with B-key slots — they probably have no capability to accept 12V. This is pure speculation on my part though. ↩︎
  5. The dmesg error messages were nvme 0000:07:00.0: Unable to change power state from D3Cold to D0, device inaccessible and nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff . What was I supposed to think? ↩︎
  6. It is awkward to hold the probes stably enough to read a value; you just have to spend a bit of time and use your ingenuity, with no easy shortcuts. If I had had enough patience, I would have crimped a new connector with open wires on the other end, but I didn’t. I also didn’t want to cut an existing connector. ↩︎
  7. As it happens, the Corsair CX-M series is renowned for poor build quality and a poor choice of capacitors. TechPowerUp also noted bad voltage regulation on the 3.3 V rail, but that was under transient conditions, which was not my case. ↩︎