Please see the disclaimer.

Assumed Audience: The ITU, hackers, programmers, and anyone that cares about time and computing. Also, anyone that can tell me if I’m wrong. Discuss on Hacker News, but please don’t post on lobste.rs because I do not have an account.

Epistemic Status: Confident enough that I would be willing to help implement these ideas, and I will implement them in my own code as much as possible even if they are not adopted.

Dear International Telecommunication Union,

I’ve seen news that on 18 November 2022, the General Conference on Weights and Measures (CGPM) voted to ask you to get rid of the leap second by 2035.

I disagree. Keep the leap second.

Or even better: make it first-class.

First, why should we not get rid of leap seconds?

Time, at least how we measure it, was invented to serve humans. Keeping our ticking hardware and computer clocks aligned with the human clock is essential to that end. A human with a healthy circadian rhythm will notice if our clocks are off from solar time.

This is the biggest reason I don’t like Daylight Saving Time: it messes directly with human time. It is also hard on computers. If it’s bad for both, why don’t we get rid of it? I get why it exists, but we’re not in a total war anymore.

“Well, sure,” you might say, “that makes sense. But we’re never going to let the clocks get more than a minute off, and that’s close enough, right?”

You’re both right and wrong. You’re right that never allowing clocks to be more than a minute off is close enough.

But you’re wrong that allowing clocks to be a minute off is a good thing, and that is the second reason: longer periods between adjustments will cause more problems, not fewer.

Let’s take this story:

$work had thousands of full custom, dsp-heavy, location measurement hardware devices widely deployed in the field for UTDOA locating cell phones. It used GPS for time reference – if you know your location, you can get GPS time accurate around the 10’s of nanoseconds. GPS also broadcasts a periodic almanac which includes leap second offsets: if you wanted to apply the offset to GPS you could derive UTC. Anyway there were three models of these units, each with an off-the-shelf GPS chip from one of three location vendors you’ve probably heard of. The chip firmware was responsible for handling leaps.

One day, a leap second arrived from the heavens. We learned the three vendors all exhibited different behaviors! Some chips handled the leap fine. Some ignored it. Some just crashed, chip offline, no bueno, adios. And some went into a state that gave wildly wrong answers. After a flurry of log pulling, debugging, console cabling, and truck rolls, we had a procedure to identify units in bad states and reset them without too many getting bricked.

It seems the less likely an event is to occur, the less likely your vendor put work into handling it.

And a reply to it:

This recalls perhaps the biggest mistake in the GPS specification, the 1024-week rollover period. A timespan long enough to be impractical to test without expensive simulator hardware, short enough to be virtually guaranteed to cause problems in long-term installations… and long enough for OEMs to ignore with impunity. (“Eh, it’s 20 years, by that time I’ll be retired/dead/working somewhere else.”)

Moral [of the story]: timescale rollovers need to be designed to happen very frequently – as in a few days at most – or not at all. Unfortunately the leap second implementers didn’t have that option.

Let’s look at the conclusion the first commenter reached:

It seems the less likely an event is to occur, the less likely your vendor put work into handling it.

And the second:

timescale rollovers need to be designed to happen very frequently – as in a few days at most – or not at all. Unfortunately the leap second implementers didn’t have that option.

These mirror a well-known principle in software engineering: the less likely that a state is reached, the less likely it is to be tested.

I don’t know what the CGPM was thinking when they voted to get rid of the leap second, but they certainly weren’t thinking of this principle.

In essence, they voted to trade a small, once-every-few-years disturbance in the Force that bothers some people who don’t learn and breaks some things that don’t get fixed for a large, once-every-half-century destruction of Alderaan that will break everything in this world that software has eaten.

You thought Y2K was bad? Our world didn’t depend on software then like it does now. A small break is bad, but we’re used to small breaks happening all the time. Another Y2K? Chaos will reign, even if it does not destroy society.

Which it might.

And to help software engineers prepare for that, the CGPM decided to give no one any practice for 50 years! That’s longer than the working life of many adults! And it could be even longer!

This means that the knowledge of how to handle leap minutes will have to be passed down through at least two generations of software engineers. We will have to rely on institutional memory, because no working engineer will have handled one within living memory.

Good luck with that.

Now, leap seconds do not happen “every few days at most,” but leap milliseconds are not a good idea, and once every few years is still in the living memory of working software engineers.

This means that the knowledge can be passed down through practice, and practice makes perfect.

Good enough.

So do not abolish the leap second. Our software engineers, software, and society depend on it.

So what to do instead?

If I may be so bold, I would like to make a suggestion.

I do this for two reasons:

  1. To spark ideas in smarter people than me.
  2. To learn by flushing smarter people than me out of the woodwork using the magical principle of “Someone is wrong on the Internet.”

To that end, may I suggest that you make the leap second first-class?

This is what I mean: currently, UTC is defined as one value. Why not define it as two? Or even three?

The first value, which I will call seconds, is what it already is: a count of seconds from some epoch that stays close to actual solar time. Or however you actually define it.

The exact definition doesn’t matter to me.

The second value, which I will call offset, could be either the sum of all of the leap seconds since the epoch, or the number of positive leap seconds. If the first, no other value is needed. If the second, then a third value, which I will call neg_offset, would be the number of negative leap seconds.

Let’s call the first option (two values) UTC2 and the second option (three values) UTC3.
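
To make this concrete, here is a minimal sketch in C of what I mean. The field names and widths are mine, just for illustration, not a proposal for any wire format:

```c
#include <stdint.h>

/* UTC2: two values. offset is the signed sum of all leap seconds
 * since the epoch, so a negative leap second decrements it. */
typedef struct {
    int64_t seconds; /* seconds since the epoch, not counting leaps */
    int32_t offset;  /* net leap seconds so far (may be negative) */
} utc2;

/* UTC3: three values. Both counters only ever grow. */
typedef struct {
    int64_t  seconds;    /* seconds since the epoch, not counting leaps */
    uint32_t offset;     /* count of positive leap seconds */
    uint32_t neg_offset; /* count of negative leap seconds */
} utc3;
```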

Those would also be convenient names if these were adopted as standards. Easy to remember!

To understand what is better about UTC2 and UTC3, we need to understand why software engineers, Facebook ones at least, hate leap seconds.

There are four reasons I can think of:

  • UTC times are not monotonically increasing or unique.
  • Software is not robust around leap seconds.
  • Conversion is lossy.
  • Storing time as UTC is not a good idea.

To start, computers like time to always be monotonically increasing or at least be unique. It’s easier to write software with those assumptions.

So basically, the software engineers are trying to be lazy.

They should just do their jobs instead of trying to push the problem onto future generations, even if it’s hard.

So let’s see how well UTC2 and UTC3 match those assumptions.

With normal time, obviously, UTC2 and UTC3 are monotonically increasing because seconds is increasing.

When a positive leap second happens, seconds won’t change, but offset will be incremented.

This means that in the presence of only positive leap seconds, both UTC2 and UTC3 are effectively monotonically increasing.

When a negative leap second happens, seconds still won’t change. UTC2’s offset will be decremented, and UTC3’s neg_offset will be incremented.

This means that UTC2 is not monotonically increasing, but every time is unique. This is enough to make it dramatically easier to write good software.

Yes, even though I think software engineers should do their jobs, making it easier for them would make the leap second disturbance in the Force smaller.

But if we really want to make things easy, UTC3 is monotonically increasing. Always. And every time is unique.

This would make it super easy to write software!
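
As a sketch (reusing the struct from above), here is what a leap second does to a UTC3 value, and the total order that falls out of it. The comparison is mine, but any field-by-field compare works, because all three fields only ever grow:

```c
#include <stdint.h>

/* The UTC3 struct from the earlier sketch. */
typedef struct { int64_t seconds; uint32_t offset; uint32_t neg_offset; } utc3;

/* During a leap second, seconds stalls, but one counter still
 * ticks, so no two UTC3 values are ever equal. */
void utc3_positive_leap(utc3 *t) { t->offset++; }
void utc3_negative_leap(utc3 *t) { t->neg_offset++; }

/* A field-by-field comparison gives a total order that never goes
 * backwards: -1 if a came first, 0 if equal, 1 if a came later. */
int utc3_cmp(utc3 a, utc3 b)
{
    if (a.seconds != b.seconds) return a.seconds < b.seconds ? -1 : 1;
    if (a.offset != b.offset) return a.offset < b.offset ? -1 : 1;
    if (a.neg_offset != b.neg_offset) return a.neg_offset < b.neg_offset ? -1 : 1;
    return 0;
}
```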

Another reason software engineers hate leap seconds is that things are not robust around the time of a leap second.

This is related to the principle stated above, that less likely states are not as well tested, but remember, this is better than once every 50 years.

Why are they not robust? It has to do with the various techniques used to deal with the leap second. I will use one as an example: the leap smear.

The Facebook engineers said that if an NTP server doing a leap smear encounters a fault and shuts down, the whole smear is thrown into disarray. This is a valid criticism…of the leap smear.

Let’s look at what an NTP server does with a leap second using UTC2: it converts to TAI by adding offset to seconds and outputs that.

Let’s look at what an NTP server does with a leap second using UTC3: it converts to TAI by adding offset to, and subtracting neg_offset from, seconds and outputs that.

Every time is unique, and after converting to TAI, they are also all monotonically increasing.

Simple. No smear needed. The NTP servers would be the simplest pieces of infrastructure at Facebook.
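
In code, the server’s whole leap-handling job could reduce to arithmetic. This is my sketch, not Facebook’s actual implementation:

```c
#include <stdint.h>

/* The structs from the earlier sketches. */
typedef struct { int64_t seconds; int32_t offset; } utc2;
typedef struct { int64_t seconds; uint32_t offset; uint32_t neg_offset; } utc3;

/* Convert to TAI-like seconds: just arithmetic, no smearing and no
 * stepping, so a server fault cannot throw anything into disarray. */
int64_t utc2_to_tai(utc2 t) { return t.seconds + t.offset; }

int64_t utc3_to_tai(utc3 t) { return t.seconds + t.offset - t.neg_offset; }
```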

That’s not to say that the NTP servers wouldn’t be complex for other reasons, like handling network problems or nanosecond precision or something like that. It’s just that UTC2 and UTC3 would not be the complex parts.

But that actually brings up another thing about leap seconds that software engineers hate: conversion.

As you well know (but others in my audience may not), there are many types of time. There is UTC, of course, but also TAI, UT1, LORAN, GPS, and civil time.

If a time standard is bad, then it’s hard to convert to and from that time. I guess, by definition, this means that UTC is bad because converting to and from UTC is lossy.

It makes sense why software engineers hate UTC.
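
To see the lossiness concretely, consider POSIX time, which defines every day as exactly 86400 seconds. Here is a sketch showing that the 2016 leap second (23:59:60) and the midnight right after it collapse onto the same number, so the conversion cannot be reversed:

```c
#include <stdint.h>
#include <stdio.h>

/* POSIX time pretends every UTC day has exactly 86400 seconds, so
 * a 61-second minute simply cannot be represented. */
static int64_t posix_time(int64_t day, int64_t second_of_day)
{
    return day * 86400 + second_of_day;
}

int main(void)
{
    /* 2016-12-31 was day 17166 after the epoch. Its leap second
     * and the following midnight both map to 1483228800. */
    int64_t leap = posix_time(17166, 86400); /* 2016-12-31T23:59:60Z */
    int64_t midnight = posix_time(17167, 0); /* 2017-01-01T00:00:00Z */
    printf("%lld == %lld\n", (long long)leap, (long long)midnight);
    return 0;
}
```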

But converting to and from UTC2 and UTC3 is not lossy.

Except for converting to and from UTC. Because UTC can go jump in a lake.

In fact, they are so not lossy that the first value, seconds, could be used as a marker for a scheduled time in the future. Say I want to schedule a doctor’s appointment after the next leap second. My software could simply store the seconds. Then, when the time is close, that seconds value is paired with the current offset (and neg_offset). Done.

Okay, okay, if the scheduled time is at a leap second boundary, you need a defined way of handling it. Just use the earlier value of offset or neg_offset. Something scheduled days beforehand can be a second off.
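
Here is a sketch of that scheme, with a hypothetical utc3_now() standing in for however a platform would expose the current counters:

```c
#include <stdint.h>

/* The UTC3 struct from the earlier sketch. */
typedef struct { int64_t seconds; uint32_t offset; uint32_t neg_offset; } utc3;

/* Hypothetical platform call returning the current UTC3 time. */
extern utc3 utc3_now(void);

/* For a future appointment, store only the seconds value... */
typedef struct { int64_t when_seconds; } appointment;

/* ...and pair it with whatever leap counters are in effect when
 * the time actually arrives. Near a leap boundary, this takes the
 * earlier counters, as defined above: something scheduled days
 * ahead can afford to be a second off. */
utc3 appointment_resolve(appointment a)
{
    utc3 now = utc3_now();
    utc3 t = { a.when_seconds, now.offset, now.neg_offset };
    return t;
}
```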

UT1 is tough, but it’s like rubber seconds anyway. The rest are easy.

This ease of conversion brings us to the last reason software engineers hate UTC: storage.

Lossless conversion means that it is always possible to store timestamps as UTC2 or UTC3 and never worry that the timestamp is wrong.

Even better, as alluded to above, it’s possible to use them for future timestamps without worry.

You can’t do that with UTC because the meaning of a timestamp may change with new leap seconds.

Here’s how: if you store the entire data for UTC2 or UTC3, then when that future time comes, take the timestamp’s seconds, add its offset (and subtract its neg_offset), then subtract the current time’s offset (and add its neg_offset).

Voilà! The timestamp is properly converted to the current leap second offsets. It can also be stored again.

Don’t store a partially-converted time!
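
Here is that procedure as a sketch. Note that the result carries the current counters, so it is fully converted and safe to store again:

```c
#include <stdint.h>

/* The UTC3 struct from the earlier sketch. */
typedef struct { int64_t seconds; uint32_t offset; uint32_t neg_offset; } utc3;

/* Re-base a stored timestamp against the leap counters in effect
 * now: go through the absolute (TAI-like) count using the counters
 * stored with the timestamp, then back out the current ones. */
utc3 utc3_rebase(utc3 old, utc3 now)
{
    int64_t absolute = old.seconds + old.offset - old.neg_offset;
    utc3 t = {
        absolute - now.offset + now.neg_offset, /* fully converted */
        now.offset,
        now.neg_offset,
    };
    return t;
}
```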

But a perfect standard is useless if it is not adopted. So how do we get this adopted?

Well, first of all, use a different name. This means that the current UTC standard won’t be touched, which means that platforms can still support UTC (and not break backwards compatibility) while adding support for whichever one becomes a standard.

This is why I suggested UTC2 and UTC3.

And that’s it! With a new name and no backwards-compatibility constraints, it doesn’t matter what else you do! All you have to do is make it a standard, and platforms will adopt it.

In fact, keeping UTC as-is has an additional benefit: the transition period between UTC and UTC2/UTC3 makes it easier to adopt the new standard because software can use old UTC code if UTC2/UTC3 is not supported by the platform.

And good protocols already have some support for it. For example, NTP has leap bits. NTP itself does not need to change; Facebook’s NTP servers, for example, could note when the leap bits are set, and when they are cleared, increment offset instead of seconds (or increment neg_offset for a negative leap second).

No change needed!
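
As a sketch, here is all a UTC3-aware server would have to do with those bits. The leap indicator values are NTP’s real ones (RFC 5905); the rest is my illustration:

```c
#include <stdint.h>

/* The UTC3 struct from the earlier sketch. */
typedef struct { int64_t seconds; uint32_t offset; uint32_t neg_offset; } utc3;

/* NTP's two-bit leap indicator (RFC 5905): 1 announces an inserted
 * second at the end of the day, 2 a deleted one. */
enum { LI_NONE = 0, LI_INSERT = 1, LI_DELETE = 2, LI_ALARM = 3 };

/* When the announced leap actually occurs, bump a counter instead
 * of stepping or smearing seconds. */
void apply_leap(utc3 *clock, int leap_indicator)
{
    switch (leap_indicator) {
    case LI_INSERT: clock->offset++;     break; /* positive leap */
    case LI_DELETE: clock->neg_offset++; break; /* negative leap */
    default:        break;                      /* nothing pending */
    }
}
```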

The rest are details, such as how much space the new values should take. I think four bytes is enough, and those four bytes could be used to encode both offset and neg_offset for UTC3.

I’ll leave the encoding to smarter people.

You could even add nanoseconds as another four-byte value to make “UTC2n” and “UTC3n.”
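
Purely as an illustration that four bytes go a long way, here is one hypothetical packing for UTC3:

```c
#include <stdint.h>

/* 16 bits each for offset and neg_offset in one 32-bit word. At
 * the historical rate of roughly one leap second every couple of
 * years, 65535 of each would last for tens of millennia. */
static inline uint32_t pack_offsets(uint16_t offset, uint16_t neg_offset)
{
    return ((uint32_t)offset << 16) | neg_offset;
}

static inline uint16_t unpack_offset(uint32_t packed)     { return (uint16_t)(packed >> 16); }
static inline uint16_t unpack_neg_offset(uint32_t packed) { return (uint16_t)(packed & 0xFFFF); }
```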

Anyway, I’ll skip the details.

But I hope that you see that UTC2 and UTC3 are a better option than just abolishing leap seconds and leaving a leap minute mess for a future generation that will have neither the practice nor the knowledge to avoid a catastrophe.

So sure, abolish leap seconds in UTC. But only if you add a new standard that makes them first-class.

Sincerely,

Gavin Howard