Math and Hats

Tuesday, September 23, 2014

Fall quarter at UW begins!

All good things must come to an end! It's been a gloriously warm and sunny summer, but with the advent of august this week the rain is due to set in. Cue the six months in Seattle where you don't see the sun (I've started taking my vitamin D supplements again).

This week also sees the University of Washington start back up again: Wednesday 24 September is the first day of fall quarter classes. I will be teaching two sections of MATH307: Introduction to Differential Equations. My website for the course is up and running at http://www.math.washington.edu/~mlungu/teaching/math307au14/index.html - go take a look (and let me know if there are any typos)!

Friday, August 22, 2014

GSoC: Wrapping Up

Today marks the last day of my Google Summer of Code 2014 project. Evaluations are due midday Friday PDT, and code submissions for successfully passed projects start soon thereafter. The end result of my project can be found at Sage Trac Ticket 16773. In total it's just over 2000 lines of Python and Cython code, to be added to the next major Sage release (6.4) if/when it gets a final thumbs-up review.

When I write just the number of lines of code it doesn't sound like very much at all - I'm sure there are GSoC projects this year that produced at least 10k lines of code! However, given that what I wrote is complex mathematical code that needed a) significant optimization, and b) to be be mathematically sound in the first place, I'd say that isn't too bad. Especially since the code does what the project description aimed for it to do: compute analytic ranks of rational elliptic curves very quickly with a very low rate of failure. Hopefully this functionality can and will be used to produce numerical data for number theory research in the years to come.

The Google Summer of Code has been a thoroughly rewarding experience for me. It's a great way to sharpen one's coding skills and get paid in the process. I recommend it for any university student who's looking to go into any industry that requires programming skills; I'd apply to do it again next year if I wasn't planning to graduate then!

Thursday, August 14, 2014

Things are Better in Parallel

A recent improvement I've implemented in my GSoC code is to allow for parallelized computation. The world has rapidly moved to multi-core as a default, so it makes sense to write code that can exploit this. And it turns out that the zero sum rank estimation method that I've been working on can be parallelized in a very natural way.

THE @parallel DECORATOR IN SAGE

But first: how does one compute in parallel in Sage? Suppose I have written a function in a Sage environment (e.g. a SageMathCloud worksheet, .sage file, Sage console etc.) which takes in some input and returns some other input. The simple example f below takes in a number n and returns the square of that number.

sage: def f(n):

....:     return n*n

....: 

sage: f(2),f(3),f(5)

(4, 9, 25)

This is a fairly straightforward beast; put in a value, get a value out. But what if we have some computation that requires evaluating that function on a large number of inputs? For example, say we want to compute the sum of the first 10 million squares. If you only have one processor to tap, then you're limited to calling f over and over again in series:

sage: def g(N):

....:     y = 0

....:     for n in range(N+1):

....:         y += f(n)

....:     return y

....: 

sage: %time g(10000000)

CPU times: user 17.5 s, sys: 214 ms, total: 17.7 s

Wall time: 17.6 s

333333383333335000000

In this example you could of course invoke the formula for the sum of the first $n$ squares and just write down the answer without having to add anything up, but in general you won't be so lucky. You can optimize the heck out of f, but in general when you're limited to a single processor you're confined to iterating over all the values you need to consider sequentially .

However, if you have 2 processors available one could try write code that splits the work into two roughly equal parts that can run relatively independently. For example, for our function we could compute the sum of all the even squares up to a given bound in one process and the sum of all the odd squares in another, and then add the two results together to get the sum of all square up to our bound.

Sage has a readily available mechanism to do exactly this: the @parallel decorator. To enable parallel computation in your function, put @parallel above your function definition (the decorator can take some parameters; below ncpus=2 tells it that we want to use 2 processors). Note that we also have to modify the function: now it no longer only takes the bound up to which we must add squares, but also a flag indicating whether we should consider even or odd squares.

sage: @parallel(ncpus=2)

....: def h(N,parity):

....:     y = 0

....:     if parity=="even":

....:         nums = range(0,N+1,2)

....:     elif parity=="odd":

....:         nums = range(1,N+1,2)

....:     for n in nums:

....:         y += f(n)

....:     return y

Instead of calling h with its standard sequence of parameters, we can pass it a list of tuples, where each tuple is a valid sequence of inputs. Sage then sends each tuple of inputs off to an available processor and evaluates the function on them there. What's returned is a generator object that can iterate over all the outputs; we can always see the output directly by calling list() on this returned generator:

sage: for tup in list(h([(1000,"even"),(1000,"odd")])):

....:     print(tup)

....: 

(((1000, 'even'), {}), 167167000)

(((1000, 'odd'), {}), 166666500)

Note that the tuple of inputs is also returned. Because we're doing things in parallel, we need to know which output corresponds to which input, especially since processes may finish at different times and return order is not necessarily the same as the order of the input list.

Finally, we can write a wrapper function which calls our parallelized function and combines the returned results:

sage: def i(N):

....:     y = 0

....:     for output in h([(N,"even"),(N,"odd")]):

....:         y += output[1]

....:     return y

....: 

sage: %time i(10000000)

CPU times: user 1.76 ms, sys: 33.2 ms, total: 35 ms

Wall time: 9.26 s

333333383333335000000

Note that i(10000000) produces the same output at g(10000000) but in about half the time, as the work is split between two processes instead of one. This is the basic gist of parallel computation: write code that can be partitioned into parts that can operate (relatively) independently; run those parts on different processors simultaneously; and then collect returned outputs and combine to produce desired result.

PARALLELIZING THE ZERO SUM COMPUTATION

Let's take a look at the rank estimating zero formula again. Let $E$ be a rational elliptic curve with analytic rank $r$. Then

\begin{align*}
r < \sum_{\gamma} \text{sinc}^2(\Delta\gamma) &= \frac{1}{\pi \Delta}\left(-\eta+\log\left(\frac{\sqrt{N}}{2\pi}\right)\right) \\
&+ \frac{1}{2\pi^2\Delta^2}\left(\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right) \right) \\
&+ \frac{1}{\pi \Delta}\sum_{\substack{n = p^k \\ n < e^{2\pi\Delta}}} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)
\end{align*}
where

$\gamma$ ranges over the nontrivial zeros of the $L$-function attached to $E$
$\Delta$ is a positive parameter
$\eta$ is the Euler-Mascheroni constant $=0.5772\ldots$
$N$ is the conductor of $E$
$\text{Li}_2(x)$ is the dilogarithm function, defined by $\text{Li}_2(x) = \sum_{k=1}^{\infty} \frac{x^k}{k^2}$
$c_n$ is the $n$th coefficient of the logarithmic derivative of the $L$-function of $E$, which is zero when $n$ is not a perfect prime power.

The right hand side of the sum, which is what we actually compute, can be broken up into three parts: the first term involving the curve's conductor $N$; the second term involving the dilogarithm function $Li_2(x)$; and the sum over prime powers. The first two parts are quick to compute: evaluating them can basically be done in constant time regardless of the magnitude of $N$ or $\Delta$.

It is therefore not worth considering parallelizing these two components, since the prime power sum dominates computation time for all but the smallest $\Delta$ values. Instead, what I've done is rewritten the zero sum code so that the prime power sum can be evaluated using multiple cores.

As mentioned in this post, we can turn this sum into one indexed by the primes (and not the prime powers); this actually makes parallelization quite straightforward. Recall that all primes except $2$ are odd, and all primes except $2$ and $3$ are either $1$ or $5$ remainder $6$. One can scale this up: given a list of small primes $[p_1,p_2,\ldots,p_n]$, all other primes fall into one of a relatively small number of residue classes modulo $p_1 p_2\cdots p_n$. For example, all primes beyond $2$, $3$, $5$ and $7$ have one of the following 48 remainders when you divide them by $210 = 2\cdot 3\cdot 5 \cdot 7$:
\begin{align*}
&1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79,\\
&83, 89, 97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, \\
&151, 157, 163, 167, 169, 173, 179, 181, 187, 191, 193, 197, 199, 209,
\end{align*}
and no other.

If we had 48 processors available. the natural thing to do would be to get each of them to iterate over all integers in a particular residue class up to $e^{2\pi\Delta}$, evaluating the summand whenever that integer is prime, and returning the sum thereof. For example, if the bound was 1 million, then processor 1 would compute and return $\sum_{n \equiv 1 (\text{mod } 210)}^{1000000} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)$. Processor 2 would do the same with all integers that are $11$ modulo $210$, etcetera.

In reality, we have to figure out a) how many processors are available, and b) partition the work relatively equally among those processors. Thankfully sage.parallel.ncpus.ncpus() succinctly addresses the former, and the latter is achieved by splitting the residue classes into $n$ chunks of approximately equal size (where $n$ is the number of available CPUs) and then getting a given processor to evaluate the sum over those residues in a single chunk.

Here is the method I wrote that computes the $\text{sinc}^2$ zero sum with (the option of) multiple processors:

Note that I've defined a subfunction to compute the prime sum over a given subset of residue classes; this is the part that is parallelized. Obtaining the residue chunks and computing the actual summand at each prime are both farmed out to external methods.

Let's see some timings. The machine I'm running Sage on has 12 available processors:

sage: sage.parallel.ncpus.ncpus()

12

sage: E = EllipticCurve([12838,-51298])

sage: Z = LFunctionZeroSum(E)

sage: %time Z._zerosum_sincsquared_fast(Delta=2.6)

CPU times: user 36.7 s, sys: 62.9 ms, total: 36.8 s

Wall time: 36.8 s

2.8283629046

sage: %time Z._zerosum_sincsquared_parallel(Delta=2.6)

CPU times: user 7.87 ms, sys: 133 ms, total: 141 ms

Wall time: 4.06 s

2.8283629046

Same answer in a ninth of the time! Note that the parallelized method has some extra overhead, so even with 12 processors we're unlikely to get a full factor of 12 speedup. Nevertheless, the speed boost will allow us to run the zero sum code with larger $\Delta$ values, allowing us to investigate elliptic curves with even larger conductors.

Tuesday, August 5, 2014

How big should Delta be?

Let's take a look at the central formula in my GSoC rank estimation code. Let $E$ be a rational elliptic curve with analytic rank $r$. Then
$$ r < \sum_{\gamma} \text{sinc}^2(\Delta\gamma) = \frac{1}{\pi\Delta} \left[ C_0 + \frac{1}{2\pi\Delta}\left(\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right) \right) + \sum_{n=1}^{e^{2\pi\Delta}} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right) \right] $$
where

$\gamma$ ranges over the nontrivial zeros of the $L$-function attached to $E$
$\Delta$ is a positive parameter
$C_0 = -\eta + \log\left(\frac{\sqrt{N}}{2\pi}\right)$; $\eta$ is the Euler-Mascheroni constant $=0.5772\ldots$ and $N$ is the conductor of $E$
$\text{Li}_2(x)$ is the dilogarithm function, defined by $\text{Li}_2(x) = \sum_{k=1}^{\infty} \frac{x^k}{k^2}$
$c_n$ is the $n$th coefficient of the logarithmic derivative of the $L$-function of $E$.

The thing I want to look at in this post is that parameter $\Delta$. The larger you make it, the closer the sum on the left hand side over the zeros is to the analytic rank, so when trying to determine the rank of $E$ we want to pick as large a $\Delta$ as we can. However, the bigger this parameter is the more terms we have to compute in the sum over the $c_n$ on the right hand side; moreover the number of terms - and thus the total computation time - scales exponentially with $\Delta$. This severely constrains how big we can make $\Delta$; generally a value of $\Delta=2$ may take a second or two for a single curve on SageMathCloud, while $\Delta=3$ may take upwards of an hour. For the average rank project I'm working on I ran the code on 12 million curves using $\Delta=1.3$; the total computation time was about 4 days on SMC.

However, it should be clear that using too large a $\Delta$ is overkill: if you run the code on a curve with $\Delta=1$ and get a bound of zero out, you know that curve's rank exactly zero (since it's at most zero, and rank is a non-negative integer). Thus using larger $\Delta$ values on that curve will do nothing except provide you the same bound while taking much longer to do so.

This begs the question: just how big of a $\Delta$ value is good enough? Can we, given some data defining an elliptic curve, decide a priori what size $\Delta$ to use so that a) the computation returns a bound that is likely to be the true of the curve, and b) it will do so in as little time as possible?

The relevant invariant to look at here is conductor $N$ of the elliptic curve; go back to the formula above and you'll see that the zero sum includes a term which is $O\left(\frac{\log(N)}{2\pi\Delta}\right)$ (coming from the $\frac{1}{\pi \Delta} C_0$ term). This means that size of the returned estimate will scale with $\log(N)$: for a given $\Delta$, the bound returned on a curve with 10-digit conductor will be about double that which is returned for a 5-digit conductor curve, for example. However, we can compensate this by increasing $\Delta$ accordingly.

To put it all more concretely we can pose the following questions:

Given an elliptic curve $E$ with conductor $N$, how large does $\Delta$ need to be in terms of $N$ so that the returned bound is guaranteed to be the true analytic rank of the curve?
Given a conductor size $N$ and a proportionality constant $P \in [0,1]$, how big does $\Delta$ have to be in terms of $N$ and $P$ so that at least $P\cdot 100$ percent of all elliptic curves with conductor of size about $N$ will, when the rank estimation code is run on them with that $\Delta$ value, yield returned bounds that are equal to their true analytic rank?

[Note: in both of the above questions we are making the implicit assumption that the returned rank bound is monotonically decreasing for increasing $\Delta$. This is not necessarily the case: the function $y = \text{sinc}^2(x)$ is not a decreasing function in $x$. However, in practice any upwards fluctuation we see in the zero sum is small enough to be eliminated when we take the floor function to get an integer rank bound.]

A $\Delta$ CHOICE GOOD ENOUGH FOR MOST CURVES

The first question is easier to phrase, but more difficult to answer, so we will defer it for now. To answer the second question, it is useful mention what we know about the distribution and density of nontrivial zeros of an elliptic curve $L$-function.

Using some complex analysis we can show that, for the $L$-function of an elliptic curve with conductor $N$, the expected number of zeros in the critical strip with imaginary part at most $T$, is $O(T\log N+ T\log T)$. That is, expected zero density has two distinct components: a part that scales linearly with log of the conductor, and a part that doesn't scale with conductor (but does scale slightly faster than linearly with how far out you go).

Consider the following: if we let

$$\Delta(N) = \frac{C_0}{\pi} = \frac{1}{\pi}\left(-\eta+\log\left(\frac{\sqrt{N}}{2\pi}\right)\right)$$

then the first term in the right hand side of the zero sum formula is precisely 1 - this is the term that comes from the $\log N$ part of the zero density. The next term - the one involving $\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right)$ - is the term that comes from the part independent of $N$; because the right hand side is divided by $\Delta$ it therefore goes to zero as the curve's conductor increases. The third term contains the $c_n$ coefficients which (per Sato-Tate) will be positive half the time and negative half the time, so the entire sum could be positive or negative; we therefore expect its contribution to be small on average when we look at large number of elliptic curves.

It thus stands to reason that for this value of Delta, and when the conductor $N$ is sufficiently large, the zero sum will be about 1, plus or minus a smallish amount coming from the $c_n$ sum. This argument is by no means rigorous, but we might therefore expect the zero sum to be within 2 of the actual analytic rank most of the time. Couple that with knowledge of the root number and you get a rank upper bound out which is equal to the actual analytic rank in all but a few pathological cases.

Empirical evidence bears this out. I ran my rank estimation code with this choice of $\Delta$ scaling on the whole Cremona database, which contains all elliptic curves up to conductor 350000:

The proportion of curves up to conductor $N$ for which the computed rank bound is strictly greater than rank. The $x$-axis denotes conductor; the $y$-axis is the proportion of all curves up to that conductor for which the returned rank bound was not equal to the true rank (assuming BSD and GRH as always).

As you can see, the percentage of curves for which the rank bound is strictly greater than rank holds fairly constant at about 0.25%. That's minuscule: what this is saying is that if you type in a random Weierstrass equation, there is only about a 1 in 1000 chance that the rank bounding code with $\Delta = \frac{C_0}{\pi}$ will return a value that isn't the actual analytic rank. Moreover, the code runs considerably faster than traditional analytic rank techniques, so if you wanted to, for example, compute the ranks of a database of millions of elliptic curves, this would be a good first port of call.

Of course, this $\Delta$ scaling approach is by no means problem-free. Some more detailed analysis will show that that as stated above, the runtime of the code will actually be $O(N)$ (omitting log factors), i.e. asymptotic scaling is actually worse than traditional analytic rank methods, which rely on evaluating the $L$-function directly and thus are $O(\sqrt{N})$. It's just that with this code we have some very small constants sitting in front, so the crossover point is at large enough conductor values that neither method is feasible anyway.

This choice of $\Delta$ scaling works for conductor ranges up to about $10^9$; that corresponds to $\Delta \approx 2.5$, which will give you a total runtime of about 10-20 seconds for a single curve on SMC. Increase the conductor by a factor of 10 and your runtime will also go up tenfold.

For curves of larger conductor, instead of setting $\Delta = \frac{C_0}{\pi}$ we can choose to set $\Delta$ to be $\alpha\cdot \frac{C_0}{\pi}$ for any $\alpha \in [0,1]$; the resulting asymptotic runtime will then be $O(N^{\alpha})$, at the expense of having a reduced proportion of elliptic curves where rank bound is equal to true rank.

HOW LARGE DOES $\Delta$ HAVE TO BE TO GUARANTEE TRUE RANK?

When we use $\Delta = \frac{C_0}{\pi}$, the curves for which the returned rank estimate is strictly larger than the true rank are precisely those which have unusually low-lying zeros. For example, the rank 0 curve with Cremona label 256944c1, has a zero with imaginary part at 0.0256 (see here for a plot), compared to an expected value of 0.824. Using $\Delta = \frac{C_0}{\pi}$ on this curve means $\Delta \approx 1.214$; if we compute the corresponding zero sum with this value of $\Delta$ we get a value of 2.07803. The smallest value of $\Delta$ for which we get a zero sum value of less than 2 is empirically about 2.813; at this point taking the floor and invoking parity tells us that the curve's rank is zero.

The above example demonstrates that if we want to guarantee that the returned rank bound is the true analytic rank, we are forced to increase the size of $\Delta$ to something larger than $\frac{C_0}{\pi}$. Do we need to increase $\Delta$ by a fixed value independent of $N$? Do we need to increase it by some constant factor? Or does it need to scale faster than $\log N$? These are hard questions to answer; it comes down to determining how close to the central point the lowest nontrivial zero can be as a function of the conductor $N$ (or some other invariants of $E$), which in turn is intimately related to estimating the size of the leading coefficient of the $L$-series of $E$ at the central point. This is already the topic of a previous post: it is a question that I hope to make progress in answering in my PhD dissertation.

Tuesday, July 29, 2014

The average rank of elliptic curves

It's about time I should demonstrate the utility of the code I've written - the aim of the game for my GSoC project, after all, is to provide a new suite of tools with which to conduct mathematical research.

First some background. Given an elliptic curve $E$ specified by equation $y^2 = x^3 + Ax + B$ for integers $A$ and $B$, one of the questions we can ask is: what is the average rank of $E$ as $A$ and $B$ are allowed to vary? Because there are an infinite number of choices of $A$ and $B$, we need to formulate this question a bit more carefully. To this end, let us define the height of $E$ to be
$$ h(E) = \max\{|A|^3,|B|^2\} $$
[Aside: The height essentially measures the size of the coefficients of $E$ and is thus a fairly decent measure of the arithmetic complexity of the curve. We need the 3rd and 2nd powers in order to make the height function scale appropriately with the curve's discriminant.]

We can then ask: what is the limiting value of the average rank of all curves up to height $X$, as $X$ gets bigger and bigger? Is it infinity? Is it 0? Is it some non-trivial positive value? Does the limit even exist? It's possible that the average rank, as a function of $X$ could oscillate about and never stabilize.

There are strong heuristic arguments suggesting that the answer should be exactly $\frac{1}{2}$. Specifically, as the height bound $X$ gets very large, half of all curves should have rank 0, half should have rank 1, and a negligible proportion should have rank 2 or greater.

Even as recently as five years ago this there we knew nothing concrete unconditionally about average curve rank. There are some results by Brumer, Heath-Brown and Young providing successively better upper bounds on the average rank of curves ordered by height (2.3, 2 and $\frac{25}{14}$ respectively), but these results are contingent on the Riemann Hypothesis.

However, starting in 2011 Manjul Bhargava, together with Christopher Skinner and Arul Shankar, published a series of landmark papers (see here for a good expository slideshow, and here and here for two of the latest publications) proving unconditionally that average rank - that is, the limiting value of the average rank of all elliptic curves up to height $X$ - is bounded above by 0.885. A consequence of their work too is that at least 66% of all elliptic curves satisfy the rank part of the Birch and Swinnerton-Dyer Conjecture.

To a computationally-minded number theorist, the obvious question to ask is: Does the data support these results? I am by no means the first person to ask this question. Extensive databases of elliptic curves under various orderings have already been compiled, most notably those by Cremona (ordered by conductor) and Stein-Watkins (ordered essentially by discriminant). However, as yet no extensive tabulation of height-ordered elliptic curves has been carried out.

Here is a summarized table of elliptic curves with height at most 10000 - a total of 8638 curves, and the ranks thereof (all computations done in Sage, of course):

Rank	# Curves	%
0	2988	34.59%
1	4307	49.86%
2	1286	14.89%
3	57	0.66%
$\ge$4	0	0.00%
Total:	8638

Thus the average rank of elliptic curves is 0.816 when the height bound is 10000. This is worrisome: the average is significantly different from the value of 0.5 we're hoping to see.

The situation gets even worse when we go up to height bound 100000:

Rank	# Curves	%
0	19492	33.11%
1	28818	48.96%
2	9747	16.56%
3	801	1.36%
$\ge$4	4	0.01%
Total:	58862

This yields an average rank of 0.862 for height bound 100000. Bizarrely, the average rank is getting bigger, not smaller!

[Note: the fact that 0.862 is close to Bhargava's asymptotic bound of 0.885 is coincidental. Run the numbers for height 1 million, for example, and you get an average rank of 0.8854, which is bigger than the above asymptotic bound. Observationally, we see the average rank continue to increase as we push out to even larger height bounds beyond this.]

So what's the issue here? It turns out that a lot of the asymptotic statements we can make about elliptic curves are precisely that: asymptotic, and we don't yet have a good understanding of the associated rates of convergence. Elliptic curves, ornery beasts that they are, can seem quite different from their limiting behaviour when one only considers curves with small coefficients. We expect (hope?) that the average to eventually turn around and start to decrease back down to 0.5, but the exact point at which that happens is as yet unknown.

This is where I come in. One of the projects I've been working on (with Wei Ho, Jen Balakrishnan, Jamie Weigandt, Nathan Kaplan and William Stein) is to compute the average rank of elliptic curves for as large a height bound as possible, in the hope that we will get results a bit more reassuring than the above. The main steps of the project are thus:

Generate an ordered-by-height database of all elliptic curves up to some very large height bound (currently 100 million; about 18.5 million curves);
Use every trick in the book to compute the ranks of said elliptic curves;
Compute the average of said ranks.

Steps 1 and 3 are easy. Step 2 is not. Determining the rank of an elliptic curve is a notoriously hard problem - no unconditional algorithm with known complexity currently exists - especially when you want to do it for millions of curves in a reasonable amount of time. Sage, for example, already has a rank() method attached to their EllipticCurve class; if you pass the right parameters to it, the method will utilize an array of approaches to get a value out that is (assuming standard conjectures) the curve's rank. However, its runtime for curves of height near 100 million is on the order of 20 seconds; set it loose on 18.5 million such curves and you're looking at a total computation time of about 10 CPU years.

Enter GSoC project stage left. At the expense of assuming the Generalized Riemann Hypothesis and the Birch and Swinnerton-Dyer Conjecture, we can use the zero sum rank bounding code I've been working on to quickly compute concrete upper bounds on an elliptic curve's rank. This approach has a couple of major advantages to it:

It's fast. In the time it's taken me to write this post, I've computed rank bounds on 2.5 million curves.
Runtime is essentially constant for any curve in the database; we don't have to worry about how the method scales with height or conductor. If we want to go up to larger height bounds at a later date, no problem.

As always, some Terms and Conditions apply. The rank bounding code only gives you an upper bound on the rank: if, for example, you run the code on a curve and get the number 4 back, there's no way to determine with this method if the curve's rank is 4, or if it is really some non-negative integer less than 4.

Note: there is an invariant attached to an elliptic curve called the root number which is efficiently computable, even for curves with large conductor (it took less than 20 minutes to compute the root number for all 18.5 million curves in our database). The root number is one of two values: -1 or +1; if it's -1 the curve has odd analytic rank, and if it's +1 the curve has even analytic rank. Assuming BSD we can therefore always easily tell if the curve's rank is even or odd. My GSoC rank estimation code takes the root number into account, so in the example above, a returned value of 4 tells us that the curve's true rank is one of three values: 0, 2 or 4.

Even better, if the returned value is 0 or 1, we know this must be the actual algebraic rank of the curve: firstly, there's no ambiguity as to what the analytic rank is - it has to the returned 0 or 1; secondly, the BSD conjecture has been proven in the rank 0 & 1 cases. Thus even though we are a priori only computing analytic rank upper bounds, for some proportion of curves we've found the actual algebraic rank.

[Note that the rank bounding code I've written is predicated on knowing that all nontrivial zeros of an elliptic curve $L$-function lie on the critical line, so we still have to assume the Generalized Riemann Hypothesis.]

Thus running the rank bound code on the entire database of curves is a very sensible first step, and it's what I'm currently doing. It's computationally cheap to do - on SageMathCloud, using a Delta value of 1.0, the runtime for a single curve is about 4 milliseconds. Moreover, for some non-negligible percentage of curves the bounds will be observably sharp - based on some trial runs I'm expecting about 20-30% of the computed bounds to be 0 or 1.

That's about 4 million curves for which we won't have to resort to much more expensive rank finding methods. Huge savings!

Monday, July 21, 2014

How to efficiently enumerate prime numbers

There's a part in my code that requires me to evaluate a certain sum

$$ \sum_{p\text{ prime }< \,M} h_E(p) $$

where $h_E$ is a function related to specified elliptic curve that can be evaluated efficiently, and $M$ is a given bound that I know. That is, I need to evaluate the function h_E at all the prime numbers less than $t$, and then add all those values up.

The question I hope to address in this post is: how can we do this efficiently as $M$ gets bigger and bigger? Specifically, what is the best way to compute a sum over all prime numbers up to a given bound when that bound can be very large?

[For those who have read my previous posts (you can skip this paragraph if you haven't - it's not the main thrust of this post), what I want to compute is, for an elliptic curve $E/\mathbb{Q}$, the analytic rank bounding sum $ \sum_{\gamma} \text{sinc}^2(\Delta \gamma) $ over the zeros of $L_E(s)$ for positive parameter $\Delta$; this requires us to evaluate the sum $ \sum_{n < \exp(2\pi\Delta)} c_n\cdot(2\pi\Delta-\log n)$. Here the $c_n$ are the logarithmic derivative coefficients of the completed $L$-function of $E$. Crucially $c_n = 0$ whenever $n$ isn't a prime power, and we can lump together all the terms coming from the same prime; we can therefore express the sum in the form you see in the first paragraph.]

As with so many things in mathematical programming, there is a simple but inefficient way to do this, and then there are more complicated and ugly ways that will be much faster. And as has been the case with other aspects of my code, I've initially gone with the first option to make sure that my code is mathematically correct, and then gone back later and reworked the relevant methods in an attempt to speed things up.

METHOD 1: SUCCINCT BUT STUPID

Here's a Python function that will evaluate the sum over primes. The function takes two inputs: a function $h_E$ and an integer $M$, and returns a value equal to the sum of $h_E(p)$ for all primes less than $M$. We're assuming here that the primality testing function is_prime() is predefined.

As you can see, we can achieve the desired outcome in a whopping six lines of code. Nothing mysterious going on here: we simply iterate over all integers less than our bound and test each one for primality; if that integer is prime, then we evaluate the function h_E at that integer and add the result to y. The variable y is then returned at the end.

Why is this a bad way to evaluate the sum? Because there are far more composite integers than there are primes. According to the prime number theorem, the proportion of integers up to $M$ that are prime is approximately $\frac{1}{\log M}$. For my code I want to compute with bounds in the order of $M = e^{8\pi} \sim 10^{11}$; the proportion of integers that are prime up to this bound value is correspondingly about $\frac{1}{8\pi} \sim 0.04$. That is, 96% of the integers we iterate over aren't prime, and we end up throwing that cycle away.

Just how inefficient this method is of course depends on how quickly we can evaluate the primality testing function is_prime(). The best known deterministic primality testing algorithm has running time that scales with (at most) the 6th power of $\log n$, where $n$ is the number being tested. This places primality testing in a class of algorithms called Polynomial Time Complexity Algorithms, which means the runtime of the function scales relatively well with the size of the input. However, what kills us here is the sheer number of times we have to call is_prime() - on all integers up to our bound $M$ - so even if it ran in constant time the prime_sum() function's running time is going to scale with the magnitude of $M$.

METHOD 2: SKIP THOSE $n$ WE KNOW ARE COMPOSITE

We can speed things up considerably by noting that apart from 2, all prime numbers are odd. We are therefore wasting a huge amount of time running primality tests on integers that we know a priori are composite. Assuming is_prime() takes a similar time to execute than our coefficient function h_E(), we could therefore roughly halve the runtime of the prime sum function by skipping the even integers and just checking odd numbers for primality.

We can go further. Apart from 2 and 3, all primes yield a remainder of 1 or 5 when you divide them by 6 (because all primes except for 2 are 1 (modulo 2) and all primes except for 3 are 1 or 2 (modulo 3)). We can therefore skip all integers that are 0, 2, 3 or 4 modulo 6; this means we only have to check for primality on only one third of all the integers less than $M$.

Here's a second version of the prime_sum() function that does this:

Of course we could go even further with the technique by looking at remainders modulo $p$ for more primes $p$ and combining the results: for example, all primes outside of 2, 3 and 5 can only have a remainder of 7, 11, 13, 17, 19, 23 or 29 modulo 30. However, the further you go the more special cases you need to consider, and the uglier your code becomes; as you can see, just looking at cases modulo 6 requires us to write a function about three times as long as previously. This method therefore will only be able to take us so far before the code we'd need to write would become too unwieldy for practicality.

METHOD 3: PRIME SIEVING...

This second prime_sum() version is a rudimentary example of a technique called prime sieving. The idea is to use quick computations to eliminate a large percentage of integers from consideration in a way that doesn't involve direct primality testing, since this is computationally expensive. Sieving techniques are an entire field of research in their own right, so I thought I'd just give as example one of the most famous methods: the Sieve of Eratosthenes (named after the ancient Greek mathematician who is thought to first come up with the idea). This takes as input a positive bound $M$ and returns a list of all prime numbers less than $M$. The method goes as follows:

Start with a list of boolean flags indexed by the numbers 2 through $M$, and set all of them to True.
Starting at the beginning of the list, let $i$ be the index of the first True entry. Set all entries at indices a multiples of $i$ to False.
Repeat step 2 until the first True entry is at index $> \sqrt{M}$.
Return a list of all integers $i$ such that the entry at index $i$ is True.

This is definitely a case where a (moving) picture is worth a thousand words:

A good graphic representation of the Sieve of Eratosthenes being used to generate all primes less than 121. Courtesy Wikipedia: "Sieve of Eratosthenes animation". Licensed under CC BY-SA 3.0 via Wikimedia Commons.

How and why does this work? By mathematical induction, at each iteration the index of the first True entry will always be prime. However any multiple thereof is by definition composite, so we can walk along the list and flag them as not prime. Wash, rinse, repeat. We can stop at $\sqrt{M}$, since all composite numbers at most $M$ in magnitude must have at least one prime factor at most $\sqrt{M}$ in size.

Here is a third version of our prime_sum() function that utilizes the Sieve of Eratosthenes:

Let's see how the three versions stack up against each other time-wise in the Sage terminal. I've saved the three functions in a file called prime_sum_functions.py, which I then import up front (if you want to do the same yourself, you'll need to import or define appropriate is_prime() and sqrt() functions at the top of the file). I've also defined a sample toy function h_E() and bound M:

sage: from prime_sum_functions import *

sage: def h_E(n): return sin(float(n))/float(n)

sage: M = 10000

sage: prime_sum_v1(h_E,M)

0.19365326958140347

sage: prime_sum_v2(h_E,M)

0.19365326958140347

sage: prime_sum_v3(h_E,M)

0.19365326958140347

sage: %timeit prime_sum_v1(h_E,M)

1 loops, best of 3: 363 ms per loop

sage: %timeit prime_sum_v2(h_E,M)

1 loops, best of 3: 206 ms per loop

sage: %timeit prime_sum_v3(h_E,M)

10 loops, best of 3: 86.8 ms per loop

Good news! All three functions (thankfully) produce the same result. And we see version 2 is about 1.8 times faster than version 1, while version 3 is four times as fast. These ratios remained roughly the same when I timed the functions on larger bounds, which indicates that the three versions have the same or similar asymptotic scaling - this should be expected, since no matter what we do we will always have to check something at each integer up to the bound.

METHOD 4: ...AND BEYOND

It should be noted, however, that the Sieve of Eratosthenes as implemented above would be a terrible choice for my GSoC code. This is because in order to enumerate the primes up to $M$ we need to create a list in memory of size $M$. This isn't an issue when $M$ is small, but for my code I need $M \sim 10^{11}$; an array of booleans that size would take up about 12 gigabytes in memory, and any speedups we get from not having to check for primality would be completely obliterated by read/write slowdowns due to working with an array that size. In other words, while the Sieve of Eratosthenes has great time complexity, it has abysmal space complexity.

Thankfully, more memory-efficient sieving methods exist that drastically cut down the space requirements. The best of these - for example, the Sieve of Atkin - need about $\sqrt{M}$ space. For $M \sim 10^{11}$ this translates to only about 40 kilobytes; much more manageable.

Of course, there's always a downside: bleeding edge prime enumeration methods are finicky and intricate, and there are a plethora of ways to get it wrong when implementing them. At some point squeezing an extra epsilon of speedup from your code is no longer worth it in terms of the time and effort it will take to get there. For now, I've implemented a more optimized version of the second prime_sum() function in my code (where we skip over all integers that are obviously not prime), since for now that is my happy middle ground. If I have time at the end of the project I will revisit the issue of efficient prime enumeration and try implement a more optimized sieving method, but that is a tomorrow problem.

Monday, July 14, 2014

Cythonize!

I'm at the stage where my code essentially works: it does everthing I initially set out to have it do, including computing the aforementioned zero sums for elliptic curve $L$-functions. However, the code is written in pure Python, and it is therefore not as fast as it could be.

This is often not a problem; Python is designed to be easy to read and maintain, and I'm hoping that the Python code I wrote is both of those. If we were just planning to run it on elliptic curves with small coefficients - for example, the curve represented by the equation $y^2=x^3-16x+16$ - this wouldn't be an issue. Curves with small coefficients have small conductors and thus few low-lying zeros near the central point, which allows us to run the zero sum code on them with small Delta parameter values. A small Delta value means the computation will finish very quickly regardless of how efficiently it's implemented, so it probably wouldn't be worth my while trying to optimize the code in that case.

To illustrate this point, here is the first most high-level, generic version of the method that computes the sum $\sum_{\gamma} \text{sinc}^2(\Delta \gamma)$ over the zeros of a given elliptic curve $L$-function (minus documentation):

[Of course, there's plenty going on in the background here. I have a separate method, self.cn() which computes the logarithmic derivative coefficients, and I call the SciPy function spence() to compute the part of the sum that comes from the Fourier transform of the digamma function $\frac{\Gamma^{\prime}}{\Gamma}$. Nevertheless, the code is simple and straightforward, and (hopefully) it's easy to follow the logic therein.]

However, the whole point of my GSoC project is to produce code that can be used for mathematical research; ultimately we want to push computations as far as we can and run the code on elliptic curves with large conductor, since curves with small conductor are already well-understood. Ergo, it's time I thought about going back over what I've written and seeing how I can speed it up.

There are two distinct ways to achieve speedups. The first is to rewrite the code more cleverly to eliminates unnecessary loops, coercions, function calls etc. Here is a second version I have written of the same function (still in Python):

The major change I've made between the two versions is improving how the sum involving the logarithmic derivative coefficients is computed - captured in the variable y. In the first version, I simply iterated over all integers $n$ up to the bound $t$, calling the method self.cn() each time. However, the logarithmic derivative coefficient $c_n$ is zero whenever $n$ is not a prime power, and knowing its value for $n=p$ a prime allows us to efficiently compute its value for $p^k$ any power of that prime. It therefore makes sense to do everything "in-house": eliminate the method call to self.cn(), iterate only over primes, and compute the logarithmic derivative coefficients for all powers of a given prime together.

Let's see how the methods match up in terms of speed. Below we run the two versions of the zero sum method on the elliptic curve $E: y^2=x^3-16x+16$, which is a rank 1 curve of conductor 37:

sage: import sage.lfunctions.zero_sums as zero_sums

sage: ZS = zero_sums.LFunctionZeroSum(EllipticCurve([-16,16]))

sage: ZS._zerosum_sincsquared(Delta=1)

1.01038406984

sage: ZS._zerosum_sincsquared_fast(Delta=1)

1.01038406984

sage: %timeit ZS._zerosum_sincsquared(Delta=1)

10 loops, best of 3: 20.5 ms per loop

sage: %timeit ZS._zerosum_sincsquared_fast(Delta=1)

100 loops, best of 3: 3.46 ms per loop

That's about a sixfold speedup we've achieved, just by reworking the section of the code that computes the $c_n$ sum.

The downside, of course, is that the code in method version 2 is more complicated - and thus less readable - than that in version 1. This is often the case in software development: you can write code that is elegant and easy to read but slow, or you can write code that is fast but horribly complicated and difficult to maintain. And when it comes to mathematical programming, unfortunately, sometimes the necessity for speed trumps readability.

The second major way to achieve speedups is to abandon pure Python and switch to a more low-level language. I could theoretically take my code and rewrite it in C, for example; if done relatively intelligently I'm sure the resulting code would blow what I have here out the water in terms of speed. However, I have no experience writing C code, and even if I did getting the code to interface with the rest of the Sage codebase would be a major headache.

Thankfully there is a happy middle ground: Cython. Cython is a programming language - technically a superset of Python - that allows direct interfacing with C and C++ data types and structures. Code written in Cython can be as fast as C code and nearly as readable as pure Python. Crucially, because it's so similar to Python it doesn't require rewriting all my code from scratch. And Sage already knows how to deal with Cython, so there aren't any compatibility issues there.

I am therefore in the process of doing exactly that: rewriting my code in Cython. Mostly this is just a cut-and-paste job and is pleasingly hassle-free; however, in order to achieve the desired speedups, the bottleneck methods - such as our $\text{sinc}^2$ zero sum method above - must be modified to make use of C data types.

Here is the third, most recent version of the _zerosum_sincsquared() method for our zero sum class, this time written in Cython:

Definitely longer and uglier. I now must declare my (C) variables up front; previously Python just created them on the fly, which is nice but slower than allocating memory space for the variables a priori. I've eliminated the use of complex arithmetic, so that everything can be done using C integer and float types. I still iterate over all primes up to the bound $t$; however now I deal with those primes that divide the conductor of $E$ (for which the associated summand is calculated slightly differently) beforehand, so that in the main loop I don't have to check at each point if my prime $p$ divides the conductor or not [This last check is expensive: the conductor $N$ can be very large - too large to cast as a $C$ long long even - so we would have to use slower Python or Sage data types to represent it. Better to get rid of the check altogether].

Let's see how this version holds up in a speed test. The Cython code has already been built into Sage and the class loaded into the global namespace, so I can just call it without having to attach or load any file:

sage: ZS = LFunctionZeroSum(EllipticCurve([-16,16]))

sage: ZS._zerosum_sincsquared_fast(Delta=1)

1.01038406984

sage: %timeit ZS._zerosum_sincsquared_fast(Delta=1)

1000 loops, best of 3: 1.72 ms per loop

The good news: the Cythonized version of the method produces the same output as the Python versions, and it's definitely faster. The bad news: the speedup is only about a factor of 2, which isn't hugely impressive given how much uglier the code is.

Why is this? Crucially, we still iterate over all integers up to the bound $t$, testing for primality at each step. This is very inefficient: most integers are not prime (in fact, asymptotically 0 percent of all positive integers are prime); we should be using sieving methods to eliminate primality testing at those integers which we know before checking are composite. For example, we should at the very least only ever iterate over the odd numbers beyond 3. That immediately halves the number of primality tests we have to do, and we should therefore get a comparable speedup if primality testing is what is dominating the runtime in this method.

This is therefore what I hope to implement next: rework the zero sum code yet again to incorporate prime sieving. This has applicability beyond just the $\text{sinc}^2$ method: all explicit formula-type sums at some point invoke summing over primes or prime powers, so having access to code that can do this quickly would be a huge boon.