Monday, July 21, 2014

How to efficiently enumerate prime numbers

There's a part in my code that requires me to evaluate a certain sum
$$\sum_{p\text{ prime }< \,M} h_E(p)$$
where $h_E$ is a function related to specified elliptic curve that can be evaluated efficiently, and $M$ is a given bound that I know. That is, I need to evaluate the function h_E at all the prime numbers less than $t$, and then add all those values up.

The question I hope to address in this post is: how can we do this efficiently as $M$ gets bigger and bigger? Specifically, what is the best way to compute a sum over all prime numbers up to a given bound when that bound can be very large?

[For those who have read my previous posts (you can skip this paragraph if you haven't - it's not the main thrust of this post), what I want to compute is, for an elliptic curve $E/\mathbb{Q}$, the analytic rank bounding sum $\sum_{\gamma} \text{sinc}^2(\Delta \gamma)$ over the zeros of $L_E(s)$ for positive parameter $\Delta$; this requires us to evaluate the sum $\sum_{n < \exp(2\pi\Delta)} c_n\cdot(2\pi\Delta-\log n)$. Here the $c_n$  are the logarithmic derivative coefficients of the completed $L$-function of $E$. Crucially $c_n = 0$ whenever $n$ isn't a prime power, and we can lump together all the terms coming from the same prime; we can therefore express the sum in the form you see in the first paragraph.]

As with so many things in mathematical programming, there is a simple but inefficient way to do this, and then there are more complicated and ugly ways that will be much faster. And as has been the case with other aspects of my code, I've initially gone with the first option to make sure that my code is mathematically correct, and then gone back later and reworked the relevant methods in an attempt to speed things up.

METHOD 1: SUCCINCT BUT STUPID

Here's a Python function that will evaluate the sum over primes. The function takes two inputs: a function $h_E$ and an integer $M$, and returns a value equal to the sum of $h_E(p)$ for all primes less than $M$. We're assuming here that the primality testing function is_prime() is predefined.

As you can see, we can achieve the desired outcome in a whopping six lines of code. Nothing mysterious going on here: we simply iterate over all integers less than our bound and test each one for primality; if that integer is prime, then we evaluate the function h_E at that integer and add the result to y. The variable y is then returned at the end.

Why is this a bad way to evaluate the sum? Because there are far more composite integers than there are primes. According to the prime number theorem, the proportion of integers up to $M$ that are prime is approximately $\frac{1}{\log M}$. For my code I want to compute with bounds in the order of $M = e^{8\pi} \sim 10^{11}$; the proportion of integers that are prime up to this bound value is correspondingly about $\frac{1}{8\pi} \sim 0.04$. That is, 96% of the integers we iterate over aren't prime, and we end up throwing that cycle away.

Just how inefficient this method is of course depends on how quickly we can evaluate the primality testing function is_prime(). The best known deterministic primality testing algorithm has running time that scales with (at most) the 6th power of $\log n$, where $n$ is the number being tested. This places primality testing in a class of algorithms called Polynomial Time Complexity Algorithms, which means the runtime of the function scales relatively well with the size of the input. However, what kills us here is the sheer number of times we have to call is_prime() - on all integers up to our bound $M$ - so even if it ran in constant time the prime_sum() function's running time is going to scale with the magnitude of $M$.

METHOD 2: SKIP THOSE $n$ WE KNOW ARE COMPOSITE

We can speed things up considerably by noting that apart from 2, all prime numbers are odd. We are therefore wasting a huge amount of time running primality tests on integers that we know a priori are composite. Assuming is_prime() takes a similar time to execute than our coefficient function h_E(), we could therefore roughly halve the runtime of the prime sum function by skipping the even integers and just checking odd numbers for primality.

We can go further. Apart from 2 and 3, all primes yield a remainder of 1 or 5 when you divide them by 6 (because all primes except for 2 are 1 (modulo 2) and all primes except for 3 are 1 or 2 (modulo 3)). We can therefore skip all integers that are 0, 2, 3 or 4 modulo 6; this means we only have to check for primality on only one third of all the integers less than $M$.

Here's a second version of the prime_sum() function that does this:

Of course we could go even further with the technique by looking at remainders modulo $p$ for more primes $p$ and combining the results: for example, all primes outside of 2, 3 and 5 can only have a remainder of 7, 11, 13, 17, 19, 23 or 29 modulo 30. However, the further you go the more special cases you need to consider, and the uglier your code becomes; as you can see, just looking at cases modulo 6 requires us to write a function about three times as long as previously. This method therefore will only be able to take us so far before the code we'd need to write would become too unwieldy for practicality.

METHOD 3: PRIME SIEVING...

This second prime_sum() version is a rudimentary example of a technique called prime sieving. The idea is to use quick computations to eliminate a large percentage of integers from consideration in a way that doesn't involve direct primality testing, since this is computationally expensive. Sieving techniques are an entire field of research in their own right, so I thought I'd just give as example one of the most famous methods: the Sieve of Eratosthenes (named after the ancient Greek mathematician who is thought to first come up with the idea). This takes as input a positive bound $M$ and returns a list of all prime numbers less than $M$. The method goes as follows:
1. Start with a list of boolean flags indexed by the numbers 2 through $M$, and set all of them to True.
2. Starting at the beginning of the list, let $i$ be the index of the first True entry. Set all entries at indices a multiples of $i$ to False.
3. Repeat step 2 until the first True entry is at index $> \sqrt{M}$.
4. Return a list of all integers $i$ such that the entry at index $i$ is True.
This is definitely a case where a (moving) picture is worth a thousand words:
 A good graphic representation of the Sieve of Eratosthenes being used to generate all primes less than 121. Courtesy Wikipedia: "Sieve of Eratosthenes animation". Licensed under CC BY-SA 3.0 via Wikimedia Commons.
How and why does this work? By mathematical induction, at each iteration the index of the first True entry will always be prime. However any multiple thereof is by definition composite, so we can walk along the list and flag them as not prime. Wash, rinse, repeat. We can stop at $\sqrt{M}$, since all composite numbers at most $M$ in magnitude must have at least one prime factor at most $\sqrt{M}$ in size.

Here is a third version of our prime_sum() function that utilizes the Sieve of Eratosthenes:

Let's see how the three versions stack up against each other time-wise in the Sage terminal. I've saved the three functions in a file called prime_sum_functions.py, which I then import up front (if you want to do the same yourself, you'll need to import or define appropriate is_prime() and sqrt() functions at the top of the file). I've also defined a sample toy function h_E() and bound M:

sage: from prime_sum_functions import *
sage: def h_E(n): return sin(float(n))/float(n)
sage: M = 10000
sage: prime_sum_v1(h_E,M)
0.19365326958140347
sage: prime_sum_v2(h_E,M)
0.19365326958140347
sage: prime_sum_v3(h_E,M)
0.19365326958140347
sage: %timeit prime_sum_v1(h_E,M)
1 loops, best of 3: 363 ms per loop
sage: %timeit prime_sum_v2(h_E,M)
1 loops, best of 3: 206 ms per loop
sage: %timeit prime_sum_v3(h_E,M)
10 loops, best of 3: 86.8 ms per loop

Good news! All three functions (thankfully) produce the same result. And we see version 2 is about 1.8 times faster than version 1, while version 3 is four times as fast. These ratios remained roughly the same when I timed the functions on larger bounds, which indicates that the three versions have the same or similar asymptotic scaling - this should be expected, since no matter what we do we will always have to check something at each integer up to the bound.

METHOD 4: ...AND BEYOND

It should be noted, however, that the Sieve of Eratosthenes as implemented above would be a terrible choice for my GSoC code. This is because in order to enumerate the primes up to $M$ we need to create a list in memory of size $M$. This isn't an issue when $M$ is small, but for my code I need $M \sim 10^{11}$; an array of booleans that size would take up about 12 gigabytes in memory, and any speedups we get from not having to check for primality would be completely obliterated by read/write slowdowns due to working with an array that size. In other words, while the Sieve of Eratosthenes has great time complexity, it has abysmal space complexity.

Thankfully, more memory-efficient sieving methods exist that drastically cut down the space requirements. The best of these - for example, the Sieve of Atkin - need about $\sqrt{M}$ space. For $M \sim 10^{11}$ this translates to only about 40 kilobytes; much more manageable.

Of course, there's always a downside: bleeding edge prime enumeration methods are finicky and intricate, and there are a plethora of ways to get it wrong when implementing them. At some point squeezing an extra epsilon of speedup from your code is no longer worth it in terms of the time and effort it will take to get there. For now, I've implemented a more optimized version of the second prime_sum() function in my code (where we skip over all integers that are obviously not prime), since for now that is my happy middle ground.  If I have time at the end of the project I will revisit the issue of efficient prime enumeration and try implement a more optimized sieving method, but that is a tomorrow problem.

1. This might be a naive question, but why can't you produce once (or download from the internet) the list of primes less than say 10^11, store them in files, and when the computation is needed load the files one at the time, iterate over that part of the list and continue with the next file.
The list (or better the numpy array) would be of about 10^11/log(10^11) < 4*10^9 elements, in my computer running
test = numpy.random.random(10^9)
takes less than 7GB of RAM, so even with a modest 2GB computer, having the whole prime list split into 16 files would be enough.

2. Yep, I've considered this, and it's something I might end up doing. There are two obstacles, however:

1. In order for reading the list of primes from file to be faster than enumerating them from scratch, the file of primes would have to be stored locally i.e. on a local disk, or better yet in RAM. However, I'm writing my code for inclusion in Sage, and bundling it with 7GB worth of text files isn't feasible; the current SAGE source tarball is only 400MB in size (note that I could include the text files as an optional spkg download, so this option is stil possible, albeit a clunky one).

2. Scalability. If I have a list of primes up to ~10^11, then I can only run my code with Delta parameters up to about 4. One day I might want to use a larger Delta value to get better rank estimates, but because this would require even more primes I'd be out of luck. Ultimately I want to write code that can be pushed as far as computational power allows, and from a philosophical point of view, hard coding in an artificial limit is undesireable.

But you do have a point. I should do some crunching and see if precomputing a list of primes will end up being faster/more practical.

3. That is a very interesting point. It is a waste of CPU for every user to be doing the same calculation, but it might be better than the waste of bandwidth that would come from every user downloading the same database. And it would be around 28GB of RAM space for primes less than 10^11.

For SMC, would it be feasible to store the database of files in each of the sage servers, a load from within the same server? And for local Sage installation it could produce the list once, I guess warning the use that she/he is about to lose 28GB of storage =)

4. An efficient sieve implemented in a compiled language can enumerate primes as fast as you can do anything useful with them (a few CPU cycles per prime), so there is little reason to use huge tables. See http://primesieve.org for instance.

5. Fastest way to list all primes below N in python. I wonder how hard would be to turn one of these list returning functions into a generator.

6. I do not exactly know the application, but being able to enumerate primes in an interval [M,M+N] is probably more useful. For instance, with M=10^15 and N=10^8, this should be not much longer (maybe 2x) than all the primes up to 10^8. I had thought there was some code in Sage do this already?

7. Yes, there already is an implementation in Sage to enumerate primes in the interval [M,M+N]. However, it's not optimized: when M=0 it just calls PARI's isprime() function, and is thus accordingly slow. There are two sieving methods implemented in PARI (at least, given the digging I did for my post): one to enumerate primes up to a given bound, and one to enumerate primes in an interval. The former is wrapped in Sage, but the latter isn't. What really should be done is to change Sage's prime_range() function to use the PARI sieving method.

- Simon