Introduction
Oh, the irony!
Merely hours after I posted my last post, in which I said, “I have found
more bugs than users have,” I got an email from a FreeBSD user (where my bc
and dc
are default) reporting a bug in my dc
.
What I claimed in my previous post is still true; I found and fixed two bugs in the past two weeks after a month of inactivity.
The Script
The bug was found by running the following script (found here) with my
dc
:
#!/bin/sh
if test $# -lt 1
then
echo usage: $0 year
exit 1
fi
echo $* '[ddsf[lfp[too early
]Pq]s@1583>@
ddd19%1+sg100/1+d3*4/12-sx8*5+25/5-sz5*4/lx-10-sdlg11*20+lz+lx-30%
d[30+]s@0>@d[[1+]s@lg11<@]s@25=@d[1+]s@24=@se44le-d[30+]s@21>@dld+7%-7+
[March ]smd[31-[April ]sm]s@31<@psnlmPpsn1z>p]splpx' | dc | tr '\012' ' '
echo ''
Wow, that is unreadable! The truth is that dc
code is, in my opinion, the
best (or worst) example of a write-only language. It’s even worse than
Perl!
And even though Wikipedia claims that the term is used in humor, I am not
joking. My evidence is just how much work it took to analyze this 233-byte dc
program.
Of course, there is a reason for that: an RPN parser is dead simple.
As proof, compare the size of my dc
’s lexer and parser to my
bc
’s lexer and parser.
This fact meant that dc
was one of the first programs Ken Thompson ran on the
PDP-11, and people have used dc
to implement Diffie-Hellman as a
protest against unconstitutional crypto export restrictions.
In other words, dc
is a historically significant language and calculator; it
might be useful to know it.
So let’s break this script down line-by-line.
Basics
But in order to do that, we need to learn some basics.
dc
Concepts
First, dc
is a stack-oriented language, like Forth.
And it’s not nearly as well-designed.
This means that you push things on the stack, in a certain order, to do things with them.
The way this is done is by Reverse Polish Notation, like old HP calculators.
I still have one of those old HP calculators, the one I used in Junior High, and even though I wrote my first program on a TI-84 (which I also still have), I am still nostalgically fond of that calculator.
Now that I know dc
code, I should probably attempt to write a program on that
HP.
Also, read that link about HP calculators. It’s a treasure.
Second, dc
has variables, called “registers.” In some dc
implementations,
register names can only be one character/byte. In some, you can get access to
actual variable names.
Third, dc
has strings, but they do not look like strings in most programming
languages. They are bounded by square brackets: []
.
Fourth, strings can contain other strings. (The brackets have to be balanced.)
Fifth, strings can be executed. This means that when looking at a string, you have to guess whether it is supposed to be printed, executed, or both. Often, that is easy, but not necessarily.
And finally, the combination of the above two things is how you create if
statements, loops, and functions in dc
.
For more details about how to create those things, see my post about dc
tricks.
Reformatting
But even though we understand dc
code now, it would serve us well to reformat
it.
And reformatting dc
code is no easy task.
The reason is that newlines don’t matter, most of the time. There are times when adding or removing a newline in the wrong place can completely change the meaning of the code.
In dc
’s case, you must know what commands take a register name, and which do
not.
If you are using my dc
, you can check the manpage, but in this case, I will do
the formatting for you.
#!/bin/sh
if test $# -lt 1
then
echo usage: $0 year
exit 1
fi
echo $* \
'[
d
d
sf
[
lf
p
[too early
]
P
q
]
s@
1583
>@
d
d
d
19
%
1
+
sg
100
/
1
+
d
3
*
4
/
12
-
sx
8
*
5
+
25
/
5
-
sz
5
*
4
/
lx
-
10
-
sd
lg
11
*
20
+
lz
+
lx
-
30
%
d
[
30
+
]
s@
0
>@
d
[
[
1
+
]
s@
lg
11
<@
]
s@
25
=@
d
[
1
+
]
s@
24
=@
se
44
le
-
d
[
30
+
]
s@
21
>@
d
ld
+
7
%
-
7
+
[March ]
sm
d
[
31-
[April ]
sm
]
s@
31
<@
p
sn
lm
P
p
sn
1
z
>p
]
sp
lp
x' \
| dc \
| tr '\012' ' '
echo ''
Yes, that is one command per line. That is how many commands there are in that amount of code.
That is why people sometimes use it for code golf.
But I would still like to make one more change. The two lines [too early
and
]P
are ugly, but they are that way because that string is supposed to be
printed, and the programmer wanted it printed with a newline. Adding the newline
directly does that.
However, there is a way to get around that, resulting in this code:
#!/bin/sh
if test $# -lt 1
then
echo usage: $0 year
exit 1
fi
echo $* \
'[
d
d
sf
[
lf
p
[too early]
P
10
P
q
]
s@
1583
>@
d
d
d
19
%
1
+
sg
100
/
1
+
d
3
*
4
/
12
-
sx
8
*
5
+
25
/
5
-
sz
5
*
4
/
lx
-
10
-
sd
lg
11
*
20
+
lz
+
lx
-
30
%
d
[
30
+
]
s@
0
>@
d
[
[
1
+
]
s@
lg
11
<@
]
s@
25
=@
d
[
1
+
]
s@
24
=@
se
44
le
-
d
[
30
+
]
s@
21
>@
d
ld
+
7
%
-
7
+
[March ]
sm
d
[
31-
[April ]
sm
]
s@
31
<@
p
sn
lm
P
p
sn
1
z
>p
]
sp
lp
x' \
| dc \
| tr '\012' ' '
echo ''
Just trust me that 10P
will print a newline.
The full script can be downloaded from here.
Analysis
Now that we have formatted the code, let’s look at it line-by-line.
Shell
First, let’s look at the last two lines of the entire script:
| tr '\012' ' '
Obviously, this translates all newlines dc
prints into spaces.
echo ''
And if it wasn’t obvious, that just prints a newline.
Now, all the shell lines:
#!/bin/sh
if test $# -lt 1
then
echo usage: $0 year
exit 1
fi
echo $* \
The first line is obvious.
The second through sixth lines just check the script argument.
The last line just echoes the year into dc
, which, since dc
is stack-based,
makes the year the first thing on the stack.
dc
The very first line is this:
'[
Simply put, that is just the start of the dc
code ('
) and the start of a
string ([
). When we get to the end, this long string will be executed, so in
essence, this string is still code.
In this case, think of it more like the opening brace of a function.
Preparation
d
d
sf
Next, the d
command copies the item on the top of the stack and pushes the
copy onto the stack. In other words, it “duplicates” the top of the stack.
There are two d
commands, so whatever is on the top of the stack will be
duplicated twice, leaving three total copies on the stack.
And what was on the top of the stack? The year; it was inserted by the shell code.
So at the end of the first two lines, there are three copies of the year on the top of the stack.
But the next command is s
, which stores the top of the stack into a register.
In this case, the script only uses one-character names, so the year is stored
into register f
.
At the end of the s
command; we have two copies of the year on the stack, and
the year is also stored in the f
register.
[
lf
p
[too early]
P
10
P
q
]
s@
This bunch of code is much more complicated, and that’s not just because there is more of it.
This is our first example of a full string (two, actually).
First, the larger outer string is pushed onto the stack.
Then, the s
command stores the string into the @
register.
Because dc
’s register names are only one character, they can be basically
any character, even non-alpha-numeric ones. This includes whitespace,
especially newlines.
This is why inserting a newline into the wrong place can change the meaning of code.
We’ll come back to the contents of the string in a minute because first, let’s talk about these two lines:
1583
>@
This tiny bit of code is the most complicated yet. It introduces string execution. In fact, it introduces conditional execution.
The number is just pushed onto the stack, but the next command is >
. This is a
command that pops the top two items off the stack and conditionally executes the
contents of the register after it.
In this case, the condition is that the value of the top of the stack is greater than the value of the second-to-top item on the stack.
While that makes sense if you don’t think about it, it is actually backwards from how other programming languages do it. So don’t think about it too hard.
Since the top of the stack is 1583
, and the second-to-top is the year, this
command will execute the contents of the @
register if 1583
is greater than
the year.
I suspect that the year 1583 is chosen because it was the first year under the Gregorian Calendar that Easter was celebrated.
What are the contents of the @
register? The string I skipped over earlier, so
let’s look at it now.
lf
p
[too early]
P
10
P
q
The l
command pushes the contents of the given register onto the stack, so the
first command will push the contents of the f
register onto the stack.
f
contains the last duplicate of the year, so the year is pushed onto the
stack again.
The p
command prints the item on top of the stack, but it does not pop the
item off of the stack. It also prints a newline after. This is one of the
newlines that tr
translates.
The [too early]
line pushes that string on the stack, and the P
command
prints it without a newline.
Then the 10
and P
lines print a newline, and the q
command quits dc
.
Of course, since tr
translates all of the newlines into spaces, and because it
comes after the text, and because a newline is printed after dc
exits anyway
(by echo ''
), the 10
and P
lines could be removed.
In essence, this first part of the script is an just an error check, an if
statement, but that is not obvious at first glance.
Algorithm
The next lines are:
d
d
d
The item on the top of the stack is duplicated three times.
What’s on the top of the stack? The year. Again.
There used to be three, but one was used for storing into the f
register, and
one was used to do the comparison with 1583
.
So after the duplications, there are four copies of the year on the stack.
The next lines begin the algorithm in earnest.
19
%
This appears to be calculating the variable a
(year mod 19), which appears in
all of the algorithms on Wikipedia.
Remember the pattern of “number, operation” because it appears again and again.
1
+
I cannot figure out this line. In none of the algorithms listed in Wikipedia is
a
is incremented.
sg
Incremented variable a
is popped from the stack and stored in register g
.
100
/
This divides the year (which is the top of the stack again) by 100, which
appears to calculate k
from Gauss’ algorithm, or b
from the later
algorithms.
1
+
d
Once again, the variable is incremented. This is confusing again because Gauss'
algorithm does not increment k
, and the others do not increment b
.
But whatever the value is, it is also duplicated.
3
*
4
/
The duplicate value is multiplied by 3
, and the result is divided by 4
.
At this point, it is starting to look like Gauss’ algorithm is the one in use, with changes, of course.
12
-
sx
Next, 12
is subtracted from the result, and that result is popped and stored
into the x
register.
Now, before we go any further, we have to answer the question that you always have to answer for stack-based languages: what’s on the top of the stack?
The value that was incremented after getting divided by 100
.
With that in mind, let’s continue:
8
*
5
+
25
/
5
-
sz
That value is multiplied by 8
, added to 5
, divided by 25
, and has 5
subtracted from it before it is popped and stored in the z
register.
Looking at the algorithms, the Gauss algorithm is still in the running, but the 1961 New Scientist algorithm still seems to have a chance.
Again we ask the question: what’s on the top of the stack?
It’s the year again; there are still two copies.
5
*
4
/
The year is multiplied by 5
, and the result is divided by 4
, and it is left
on the top of the stack.
lx
Next, the value in the x
register is pushed onto the stack, and we should
remind ourselves what is in x
: the value that was multiplied by 3
, divided
by 4
, and had 12
subtracted from it.
-
10
-
sd
First, the value loaded from x
is subtracted from our previous value, then
10
is subtracted from that, before that result is stored in the d
register.
Again, what is on the top of the stack?
It’s the last copy of the year, but it turns out that it doesn’t matter because…
lg
11
*
20
+
The value in the g
register is loaded (the one modded by 19
and
incremented), multiplied by 11
, and added to 20
.
lz
+
lx
-
30
%
d
The value in z
(the one divided by 25
and less 5
) is loaded and added to
the value that was already there. Then the value in x
is loaded (mulitplied by
3
, divided by 4
, and less 12
), and that is subtracted from the value that
had been on top before all of that is modded by 30
and duplicated.
At this point, I have lost what the algorithm was…the one used here is certainly not on Wikipedia.
But what is on the stack?
Debugging tip: to figure out what is on the stack at a certain point, I just add
fq
after the command I am interested in.
The f
command prints the contents of the stack, and q
will quit dc
in most
cases. If there are more than 3 strings executing, you would put that number of
strings on the stack and then run the Q
command.
All that’s on the stack is the two copies of the result just calculated and the year.
[
30
+
]
s@
Next, a string is pushed onto the stack. This doesn’t look like a string you
would want to print, so it’s probably code. And it is stored into the @
register.
Another string was stored in the @
register, the error check string at the
beginning. It doesn’t matter now, so the program can safely overwrite it.
0
>@
Then 0
is pushed onto the stack, compared against the former top of the stack,
and if it is greater (i.e., the former top of the stack is less than 0), the
string just stored in @
is executed (just another if
statement), which just
adds 30 to the top of the stack.
Remember that the items that were compared were removed from the top of the
stack, so the current top of the stack is the value that was modded by 30
and
duplicated. This was, in fact, why it was duplicated.
d
[
[
1
+
]
s@
lg
11
<@
]
s@
The value on top of the stack, the value that was just added to 30
is
duplicated and another string that looks like code is pushed on the stack before
being stored in the @
register.
25
=@
Next, the number on top of the stack is compared to 25
, and if they are equal,
the contents of @
are executed, so let’s go back to that.
[
1
+
]
s@
lg
11
<@
First, a string is pushed onto the stack and stored in @
.
It looks like whoever wrote this code loved to use the @
register to store
code to execute in if
statements.
Then the contents of the g
register are loaded, and if 11
is less than that,
that string is executed. Obviously, that string just increments the top of the
stack.
What is the top of the stack at that point? It’s the value that was duplicated before the first outer string was pushed, so that’s the value that could be incremented.
d
[
1
+
]
s@
24
=@
That value, which may have been incremented or not, is duplicated before yet
another if
statement. This time, it compares it against 24
, and if they are
equal, the original (not the duplicate, which is gone) is incremented.
The pattern of “duplication, if
statement” is a good one to remember. It will
pop up again.
se
44
le
-
Next, that value is stored in the e
register before 44
is pushed onto the
stack and the value in e
is pushed onto the stack again before being
subtracted from 44
.
It was pushed onto the stack again because in subtraction, order of the operands matters.
d
[
30
+
]
s@
21
>@
Here is the duplication, if
statement pattern again. I will leave it to the
reader to analyze it in detail.
d
ld
+
7
%
-
7
+
The value on top is duplicated before the contents of the d
register are
pushed. Then they are added before the patter of “number, operation” is done
once. Then, the result is subtracted from the original before yet another
“number, operation” pattern.
[March ]
sm
The string March
is stored into the m
register.
d
[
31-
[April ]
sm
]
s@
The value on top of the stack is duplicated before another string is put into
@
. It looks like it’s going to be another if
statement.
31
<@
Yes, it is. If the top of the stack is greater than 31, then we execute that string.
31-
[April ]
sm
31
is subtracted from the top of the stack (the original because the duplicate
was consumed), and then April
is stored into m
.
In other words, the top of the stack is the day, and if it’s greater than 31
(the number of days in March), then Easter will be in April, and we need to
adjust the day.
p
sn
The p
prints the current item on the stack, which is the day, followed by a
newline (later changed by tr
). It does not, however, consume the day, so next,
it is stored into the n
register.
lm
P
First, the contents of m
are loaded, which is the month, and it is popped off
the stack and printed (without a newline).
p
sn
The item on top of the stack (which is the year, by the way) is printed, popped,
and stored into n
.
1
z
>p
These lines push 1
onto the stack, then they push the number of items on the
stack onto the stack.
It pushes the length of the stack at the start of the command.
Then if the stack length is greater than 1
, the contents of p
are executed.
In other words, these lines are checking that the stack was empty before their
execution, and if they were not, it executes the contents of p
.
What are the contents of p
? We need to continue to find out.
End
]
sp
lp
x' \
Here, that long string is finished and stored into p
. Then it is loaded again
(this is done to ensure there is a copy of it) and executed with the command
x
, which unconditionally executes a string.
So those stack-checking lines above will end up re-executing p
if the stack
was not empty.
And just like that, I know where the bug is in my dc
: it’s not popping items
off the stack properly when printing them with the P
command.
The bug was fixed by commit 34ca7b50d1c0
.
These lines would not have been necessary, by the way, if it had not been for
the stack check, which is the very reason it found a bug in my dc
.
Conclusion
I hope you enjoyed this tour of the historically significant and hacker-like
dc
calculator. I don’t know that there was any point to it other than for me
to add to the oral tradition of software engineering.
And to debug my dc
without gdb
.