I have no clue if this idea is original or not. It may not be.
On Sunday, I had a flash of insight: what if we could use the Greatest Common Divisor (GCD) algorithm to factor an integer N? How could we make it do that?
The answer was clear: make sure all of the possible factors, the primes up to sqrt(N), are in one of the arguments to the GCD algorithm. In other words, we need to run GCD on N and the primorial that includes all primes up to sqrt(N).
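To make that concrete, here is a toy example in Python (using the built-in math.gcd):

from math import gcd

# The primes up to sqrt(77) are 2, 3, 5, and 7, so the primorial we
# need is 2 * 3 * 5 * 7 = 210, and GCD pulls a factor right out:
print(gcd(210, 77))  # prints 7, since 77 = 7 * 11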
Unfortunately, the primorial we want is HUGE, so that won’t work.
Maybe if we split the primes into sections and ran GCD on just the products of the sections, we could avoid the need for huge numbers.
In fact, it turns out that if we do that, the algorithm gains a few nice properties.
First, there is a bad case for the algorithm: when all of N’s factors are contained in a single section, GCD returns N, which is unhelpful. But if we section the primes such that we run GCD on N and the first product that is greater than N, repeating until done, then it turns out that the last prime that was multiplied into the product is a factor: if N divides the full product, but the product without the last prime is still less than N, then N cannot divide that smaller product, which forces the last prime to divide N. So if GCD returns N, we can just take the last prime as a factor.
The whole algorithm is surprisingly small in code:
from math import gcd

def factor(n):
    # Returns a factor of n for n > 1 (n itself when n is prime).
    primes = get_next_prime()
    p = next(primes)
    while True:
        # Build the first product of consecutive primes that reaches n.
        prod = 1
        while prod < n:
            last = p  # remember the last prime multiplied in
            prod *= p
            p = next(primes)
        g = gcd(prod, n)
        if g == n:
            # All of n's factors fell into this one section; the last
            # prime multiplied in must be one of them.
            return last
        if g != 1:
            return g
This code was written by my coworker Broderick Gardner. It has been slightly edited.
Of course, this is assuming that GCD is implemented somewhere else (Python’s built-in math.gcd works), as well as a prime sieve in the get_next_prime() generator.
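For illustration, here is a minimal get_next_prime() that would work; it uses naive trial division rather than a real sieve, so it is fine for toy inputs and far too slow for anything serious:

def get_next_prime():
    # Naive unbounded prime generator; a stand-in for a real sieve.
    primes = []
    candidate = 2
    while True:
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
            yield candidate
        candidate += 1

With that in place, factor(91) builds the section product 2 * 3 * 5 * 7 = 210 and returns gcd(210, 91) = 7.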
Actually, as it turns out, this algorithm has the makings of a prime sieve as well: when you need to test a number P to see if it’s prime, you just run GCD on P and all of the relevant products (up to sqrt(P)).
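Here is a minimal sketch of that test. It builds the section products on the fly from the get_next_prime() above instead of reading them from storage, and the section size of 100 is an arbitrary choice of mine:

from math import gcd, isqrt

def is_prime(n, section_size=100):
    # n > 1 is prime exactly when it shares no factor with any prime
    # up to sqrt(n). Sectioning keeps the products from getting huge.
    limit = isqrt(n)
    prod, count = 1, 0
    for p in get_next_prime():
        if p > limit:
            break
        prod *= p
        count += 1
        if count == section_size:
            if gcd(prod, n) != 1:
                return False
            prod, count = 1, 0
    # Check the last, partially filled section.
    return gcd(prod, n) == 1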
Even better, if all you want is a product of primes, this modified sieve can give it to you directly. Multiply all of the candidate numbers together until you reach a limit, then run GCD on that number and on prime products up to the square root of the largest candidate. If the GCD is 1, all of the candidate numbers are primes. Otherwise, you divide by the GCD and then repeat until GCD does return 1. At that point, you know all of the factors inside the product are primes themselves.
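A sketch of that batched version, under the same assumptions as the is_prime() sketch (a real implementation would read the section products from storage rather than rebuild them):

from math import gcd, isqrt

def strip_small_factors(candidates, section_size=100):
    # Multiply the candidates together, then divide out every prime
    # factor up to sqrt(max(candidates)) using sectioned products.
    # Every factor of whatever survives is itself prime.
    big = 1
    for c in candidates:
        big *= c
    limit = isqrt(max(candidates))

    def divide_out(big, prod):
        # Divide by the GCD repeatedly until it returns 1.
        g = gcd(big, prod)
        while g != 1:
            big //= g
            g = gcd(big, prod)
        return big

    prod, count = 1, 0
    for p in get_next_prime():
        if p > limit:
            break
        prod *= p
        count += 1
        if count == section_size:
            big = divide_out(big, prod)
            prod, count = 1, 0
    return divide_out(big, prod)

For example, strip_small_factors([91, 97]) divides the 7 out of 91 and returns 13 * 97 = 1261, a product whose factors are all prime.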
Of course, wheel factorization will still help.
And with all of this said, I am sure there will be a question or two:
Is it possible that the NSA is using this to crack RSA?
It’s possible, but not very likely. Using the limit on the number of primes below a number x, which is (x / ln(x)) * (1 + (1.2762 / ln(x))), I am able to calculate an estimate of the amount of storage needed to store all of the primorials.
The code (for bc with the -l flag) for calculating the amount of storage is below.
define bound(x) {
    auto s, l
    # Upper limit on the number of primes up to sqrt(x), using
    # (s / ln(s)) * (1 + 1.2762 / ln(s)) with s = sqrt(x).
    s = sqrt(x)
    l = l(s)
    return s / l * (1 + (1.2762 / l))
}

define primorials_data(t) {
    auto i, j, sum, bits[], sizes[]
    # Exact prime counts for the first few bit sizes.
    sizes[0] = 0
    sizes[1] = 2
    sizes[2] = 2
    sizes[3] = 2
    sizes[4] = 5
    # For each larger bit size, take the upper limit and subtract
    # the primes already counted in all previous sizes.
    for (i = 5; i < t; ++i)
    {
        sizes[i] = bound(2^i)
        for (j = 0; j < i; ++j)
        {
            sizes[i] -= sizes[j]
        }
    }
    # Bits needed per size; the sizes[i] / 2 term accounts for
    # products that come out one digit shorter (see below).
    for (i = 0; i < t; ++i)
    {
        bits[i] = sizes[i] * i - sizes[i] / 2
    }
    sum = 0
    for (i = 0; i < t; ++i)
    {
        sum += bits[i]
    }
    # Convert bits to bytes, rounding up.
    return ceil(sum / 8, 0)
}
You need my bc with its extended math library for this code to work.
The function bound(x) just calculates the prime bound I mentioned earlier. The function primorials_data(t) calculates the amount of data needed to store the primorials below 2^t. This is how it works:
First, the number of primes for the first few bit sizes is set as constants. Then, for every bit size between 6 and t, the upper limit on the number of primes of that bit size is calculated by taking the upper limit for the size and subtracting the number of primes in all previous sizes. Then it calculates the number of bits needed for all sizes and sums them together, dividing by 8 at the end to give a result in bytes.
In the calculation of bits, there is a bit of code that may seem weird: sizes[i] / 2. This code is there because of the rules for the size of a product: a product’s size is either the sum of the sizes of its operands or one digit less, and each case happens with probability of roughly 0.5. (For example, 2 * 3 = 6 loses a digit, while 5 * 5 = 25 does not.) The sizes[i] / 2 takes those lost digits into account.
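With the definitions above saved in a file (say primorials.bc, a name of my choosing), the figure below can be reproduced like this:

$ bc -l primorials.bc
primorials_data(128)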
The result: to store the prime products up to 2^128 with no wasted bits would require almost 5 exabytes of storage.
Now, while the NSA almost certainly can store 5 exabytes, that is only enough to generate primes up to 2^256, which is only enough to crack 512-bit RSA keys. That’s not enough to crack even 1024-bit keys. And that’s still a lot of data; going any further is basically infeasible.
What is the computational complexity of this algorithm?
As stated earlier, it’s exponential. (I didn’t care to calculate exactly what it is.)
The reason for this is that the upper limit on the number of primes is (sqrt(N) / ln(sqrt(N))) * (1 + (1.2762 / ln(sqrt(N)))) (because we only need primes up to sqrt(N)), which, while it grows fairly slowly, still grows fast enough to place this algorithm into exponential territory. If it didn’t grow so fast, this algorithm would be polynomial because the biggest limiting factor (heh) is the number of primes.
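To get a feel for that growth, here is a quick evaluation of the bound in Python; the numbers, not the code, are the point:

from math import log

def prime_count_bound(bits):
    # Upper limit on the primes up to sqrt(2^bits), i.e.
    # (x / ln(x)) * (1 + (1.2762 / ln(x))) with x = 2^(bits / 2).
    l = (bits / 2) * log(2)
    return 2.0 ** (bits / 2) / l * (1 + 1.2762 / l)

print(prime_count_bound(128))   # about 4.3e17 primes
print(prime_count_bound(1024))  # about 3.8e151 primes

Doubling the size of N roughly squares the number of primes needed, which is exactly exponential growth in the number of bits.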
Note that this complexity still applies if you use the primorial sieve trick because that trick is still limited by the number of primes.
Can this algorithm be sped up?
Absolutely. Just about every part of this algorithm is ready-made to be embarrassingly parallel. Once you partition the primes into sections, the algorithm can be run on each section in parallel. This goes for the prime product sieve as well, though the primes needed for the sieve do need to be stored either in memory or on disk in order to do so.
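As a sketch of what the parallel version might look like, assuming precomputed section products (each one a product of consecutive primes, together covering all primes up to sqrt(n)):

from math import gcd
from multiprocessing import Pool

def check_section(args):
    # One unit of parallel work: GCD of n against one section product.
    n, prod = args
    g = gcd(prod, n)
    return g if g != 1 else None

def parallel_factor(n, section_products):
    # Returns a proper factor of n when the sections split n's factors;
    # n itself when one section contains them all (that section then
    # needs a finer-grained pass); None when n is prime.
    with Pool() as pool:
        args = ((n, prod) for prod in section_products)
        for g in pool.imap(check_section, args):
            if g is not None:
                return g

(On platforms that spawn worker processes rather than fork, the Pool needs to be created under the usual if __name__ == '__main__' guard.)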
And further, if you have a lot of keys to factor, just factor them all at the same time, using the same data. If you have enough keys, you might bring the amortized complexity of the algorithm into polynomial territory, though that would probably require tons of keys.
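Amortizing across keys would then be one more loop, reusing the same hypothetical section products for the whole batch:

from math import gcd

def factor_keys(keys, section_products):
    # Reuse the same precomputed section products for every key, so the
    # cost of generating them is spread across the whole batch. Keys
    # whose factors all land in a single section are left out and need
    # the finer-grained handling described earlier.
    found = {}
    for prod in section_products:
        for n in keys:
            if n not in found:
                g = gcd(prod, n)
                if g != 1 and g != n:
                    found[n] = g
    return found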
In my opinion, it is that last fact that means there is a chance that the NSA is using this algorithm; the sheer amount of factoring they might want to do could be enough to justify it.
If any of you have further questions or even comments, please don’t hesitate to contact me.