First some background. Given an elliptic curve $E$ specified by equation $y^2 = x^3 + Ax + B$ for integers $A$ and $B$, one of the questions we can ask is: what is the average rank of $E$ as $A$ and $B$ are allowed to vary? Because there are an infinite number of choices of $A$ and $B$, we need to formulate this question a bit more carefully. To this end, let us define the height of $E$ to be
$$ h(E) = \max\{|A|^3,|B|^2\} $$
[Aside: The height essentially measures the size of the coefficients of $E$, and is thus a fairly decent measure of the arithmetic complexity of the curve. We need the 3rd and 2nd powers in order to make the height function scale appropriately with the curve's discriminant $-16(4A^3 + 27B^2)$, which involves $A^3$ and $B^2$.]
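For instance, the curve $y^2 = x^3 - 2x + 5$ (an arbitrary illustrative choice) has height $h(E) = \max\{|-2|^3, |5|^2\} = \max\{8, 25\} = 25$.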
We can then ask: what is the limiting value of the average rank of all curves up to height $X$, as $X$ gets bigger and bigger? Is it infinity? Is it 0? Is it some non-trivial positive value? Does the limit even exist? It's possible that the average rank, as a function of $X$, could oscillate about and never stabilize.
There are strong heuristic arguments, in the spirit of the so-called minimalist philosophy, suggesting that the answer should be exactly $\frac{1}{2}$: the root number of a curve should be $+1$ and $-1$ equally often, and the rank should be as small as that parity allows. Specifically, as the height bound $X$ gets very large, half of all curves should have rank 0, half should have rank 1, and a negligible proportion should have rank 2 or greater.
Even as recently as five years ago we knew nothing concrete unconditionally about average curve rank. There are some results by Brumer, Heath-Brown and Young providing successively better upper bounds on the average rank of curves ordered by height (2.3, 2 and $\frac{25}{14}$ respectively), but these results are contingent on the Generalized Riemann Hypothesis.
However, starting in 2011 Manjul Bhargava, together with Christopher Skinner and Arul Shankar, published a series of landmark papers (see here for a good expository slideshow, and here and here for two of the latest publications) proving unconditionally that the average rank - that is, the limiting value of the average rank of all elliptic curves up to height $X$ - is bounded above by 0.885. Another consequence of their work is that at least 66% of all elliptic curves satisfy the rank part of the Birch and Swinnerton-Dyer Conjecture.
To a computationally-minded number theorist, the obvious question to ask is: Does the data support these results? I am by no means the first person to ask this question. Extensive databases of elliptic curves under various orderings have already been compiled, most notably those by Cremona (ordered by conductor) and Stein-Watkins (ordered essentially by discriminant). However, as yet no extensive tabulation of height-ordered elliptic curves has been carried out.
Here is a summary table of the ranks of all elliptic curves with height at most 10000 - a total of 8638 curves (all computations done in Sage, of course):
| Rank | # Curves | % |
|:---:|---:|---:|
| 0 | 2988 | 34.59% |
| 1 | 4307 | 49.86% |
| 2 | 1286 | 14.89% |
| 3 | 57 | 0.66% |
| $\ge 4$ | 0 | 0.00% |
| **Total** | **8638** | 100.00% |
Thus the average rank of elliptic curves is 0.816 when the height bound is 10000. This is worrisome: the average is significantly different from the value of 0.5 we're hoping to see.
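Explicitly, this average is just the rank-weighted mean of the counts in the table above:
$$ \frac{0 \cdot 2988 + 1 \cdot 4307 + 2 \cdot 1286 + 3 \cdot 57}{8638} = \frac{7050}{8638} \approx 0.816. $$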
The situation gets even worse when we go up to height bound 100000:
| Rank | # Curves | % |
|:---:|---:|---:|
| 0 | 19492 | 33.11% |
| 1 | 28818 | 48.96% |
| 2 | 9747 | 16.56% |
| 3 | 801 | 1.36% |
| $\ge 4$ | 4 | 0.01% |
| **Total** | **58862** | 100.00% |
This yields an average rank of 0.862 for height bound 100000. Bizarrely, the average rank is getting bigger, not smaller!
[Note: the fact that 0.862 is close to Bhargava's asymptotic bound of 0.885 is coincidental. Run the numbers for height 1 million, for example, and you get an average rank of 0.8854, which is bigger than the above asymptotic bound. Observationally, we see the average rank continue to increase as we push out to even larger height bounds beyond this.]
So what's the issue here? It turns out that a lot of the asymptotic statements we can make about elliptic curves are precisely that: asymptotic, and we don't yet have a good understanding of the associated rates of convergence. Elliptic curves, ornery beasts that they are, can seem quite different from their limiting behaviour when one only considers curves with small coefficients. We expect (hope?) that the average will eventually turn around and start to decrease back down to 0.5, but the exact point at which that happens is as yet unknown.
This is where I come in. One of the projects I've been working on (with Wei Ho, Jen Balakrishnan, Jamie Weigandt, Nathan Kaplan and William Stein) is to compute the average rank of elliptic curves for as large a height bound as possible, in the hope that we will get results a bit more reassuring than the above. The main steps of the project are thus:
1. Generate an ordered-by-height database of all elliptic curves up to some very large height bound (currently 100 million; about 18.5 million curves);
2. Use every trick in the book to compute the ranks of said elliptic curves;
3. Compute the average of said ranks.
Steps 1 and 3 are easy. Step 2 is not. Determining the rank of an elliptic curve is a notoriously hard problem - no unconditional algorithm with known complexity currently exists - especially when you want to do it for millions of curves in a reasonable amount of time. Sage, for example, already has a rank() method attached to its EllipticCurve class; if you pass the right parameters, the method will utilize an array of approaches to produce a value that is (assuming standard conjectures) the curve's rank. However, its runtime for curves of height near 100 million is on the order of 20 seconds; set it loose on 18.5 million such curves and you're looking at a total computation time of about 10 CPU years.
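To make the bottleneck concrete, here is a rough Sage sketch of what a naive version of steps 1-3 looks like (the height bound and the parameter choices are illustrative, not the project's actual pipeline):

```python
# Naive average-rank computation: enumerate all (A, B) with
# max(|A|^3, B^2) <= H, call rank() on each curve, and average.
# Hopelessly slow for H near 10^8, which is the whole point of the post.
H = 10^4
A_max, B_max = floor(H^(1/3)), floor(sqrt(H))
total, count = 0, 0
for A in range(-A_max, A_max + 1):
    for B in range(-B_max, B_max + 1):
        if 4*A^3 + 27*B^2 == 0:      # discard singular equations
            continue
        E = EllipticCurve([A, B])
        # only_use_mwrank=False lets Sage try its full array of approaches
        total += E.rank(only_use_mwrank=False)
        count += 1
print(float(total)/count)
```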
Enter GSoC project stage left. At the expense of assuming the Generalized Riemann Hypothesis and the Birch and Swinnerton-Dyer Conjecture, we can use the zero sum rank bounding code I've been working on to quickly compute concrete upper bounds on an elliptic curve's rank. This approach has a couple of major advantages to it:
- It's fast. In the time it's taken me to write this post, I've computed rank bounds on 2.5 million curves.
- Runtime is essentially constant for any curve in the database; we don't have to worry about how the method scales with height or conductor. If we want to go up to larger height bounds at a later date, no problem.
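For the curious, here is the shape of the inequality underlying the zero sum method - a sketch of the idea rather than a precise statement of what the code computes. Under GRH the nontrivial zeros of the $L$-function of $E$ can be written as $1 + i\gamma$ with $\gamma$ real, and the analytic rank $r_{an}$ is the multiplicity of the zero at $\gamma = 0$. Taking a standard sinc-squared test function with scaling parameter $\Delta > 0$ (the same $\Delta$ that appears later in this post), we get
$$ r_{an} \;\le\; \sum_{\gamma} \left( \frac{\sin(\pi \Delta \gamma)}{\pi \Delta \gamma} \right)^{2}, $$
since each of the $r_{an}$ zeros at the central point contributes exactly 1 to the sum, and every other zero contributes a nonnegative amount. The right-hand side can be evaluated via the explicit formula as a sum over prime powers; increasing $\Delta$ sharpens the bound at the cost of more computation.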
As always, some Terms and Conditions apply. The rank bounding code only gives you an upper bound on the rank: if, for example, you run the code on a curve and get the number 4 back, there's no way to determine with this method if the curve's rank is 4, or if it is really some non-negative integer less than 4.
Note: there is an invariant attached to an elliptic curve called the root number which is efficiently computable, even for curves with large conductor (it took less than 20 minutes to compute the root number for all 18.5 million curves in our database). The root number is one of two values: -1 or +1; if it's -1 the curve has odd analytic rank, and if it's +1 the curve has even analytic rank. Assuming BSD we can therefore always easily tell if the curve's rank is even or odd. My GSoC rank estimation code takes the root number into account, so in the example above, a returned value of 4 tells us that the curve's true rank is one of three values: 0, 2 or 4.
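For illustration, here is how the parity check looks in Sage (the curve coefficients below are an arbitrary choice):

```python
# Sage sketch: read rank parity off the root number. The coefficients
# [-16, 16] are an arbitrary illustrative choice.
E = EllipticCurve([-16, 16])    # y^2 = x^3 - 16x + 16
w = E.root_number()             # +1 or -1; cheap even for large conductor
print("analytic rank is %s" % ("even" if w == 1 else "odd"))
```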
Even better, if the returned value is 0 or 1, we know this must be the actual algebraic rank of the curve: firstly, there's no ambiguity as to what the analytic rank is - it has to be the returned 0 or 1; secondly, the BSD conjecture has been proven in the rank 0 and 1 cases. Thus even though we are a priori only computing analytic rank upper bounds, for some proportion of curves we've found the actual algebraic rank.
[Note that the rank bounding code I've written is predicated on knowing that all nontrivial zeros of an elliptic curve $L$-function lie on the critical line, so we still have to assume the Generalized Riemann Hypothesis.]
Thus running the rank bound code on the entire database of curves is a very sensible first step, and it's what I'm currently doing. It's computationally cheap to do - on SageMathCloud, using a Delta value of 1.0, the runtime for a single curve is about 4 milliseconds. Moreover, for some non-negligible percentage of curves the bounds will be observably sharp - based on some trial runs I'm expecting about 20-30% of the computed bounds to be 0 or 1.
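In terms of interface, the zero sum code is exposed as a method on Sage's EllipticCurve objects; the method name and keyword below match what this code became in Sage, but treat the exact signature as an assumption if your version differs:

```python
# First-pass rank bounding, as described above. Delta = 1.0 matches the
# ~4 ms per curve figure; larger values sharpen the bound but cost more.
E = EllipticCurve([1, -1])                         # illustrative curve
bound = E.analytic_rank_upper_bound(max_Delta=1.0)
if bound <= 1:
    rank = bound    # a bound of 0 or 1 pins down the rank exactly (see above)
else:
    pass            # fall back to more expensive rank-finding machinery
```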
That's about 4 million curves for which we won't have to resort to much more expensive rank finding methods. Huge savings!