
Friday, August 22, 2014

GSoC: Wrapping Up

Today marks the last day of my Google Summer of Code 2014 project. Evaluations are due midday Friday PDT, and code submissions for successfully passed projects start soon thereafter. The end result of my project can be found at Sage Trac Ticket 16773. In total it's just over 2000 lines of Python and Cython code, to be added to the next major Sage release (6.4) if/when it gets a final thumbs-up review.

When I write just the number of lines of code it doesn't sound like very much at all - I'm sure there are GSoC projects this year that produced at least 10k lines of code! However, given that what I wrote is complex mathematical code that needed a) significant optimization, and b) to be mathematically sound in the first place, I'd say that isn't too bad. Especially since the code does what the project description aimed for it to do: compute analytic ranks of rational elliptic curves very quickly with a very low rate of failure. Hopefully this functionality can and will be used to produce numerical data for number theory research in the years to come.

The Google Summer of Code has been a thoroughly rewarding experience for me. It's a great way to sharpen one's coding skills and get paid in the process. I recommend it for any university student who's looking to go into any industry that requires programming skills; I'd apply to do it again next year if I wasn't planning to graduate then!

Thursday, August 14, 2014

Things are Better in Parallel

A recent improvement I've implemented in my GSoC code is to allow for parallelized computation. The world has rapidly moved to multi-core as a default, so it makes sense to write code that can exploit this. And it turns out that the zero sum rank estimation method that I've been working on can be parallelized in a very natural way.

THE @parallel DECORATOR IN SAGE


But first: how does one compute in parallel in Sage? Suppose I have written a function in a Sage environment (e.g. a SageMathCloud worksheet, .sage file, Sage console etc.) which takes in some input and returns some output. The simple example f below takes in a number n and returns the square of that number.

sage: def f(n):
....:     return n*n
....: 
sage: f(2),f(3),f(5)
(4, 9, 25)

This is a fairly straightforward beast; put in a value, get a value out. But what if we have some computation that requires evaluating that function on a large number of inputs? For example, say we want to compute the sum of the first 10 million squares. If you only have one processor to tap, then you're limited to calling f over and over again in series:

sage: def g(N):
....:     y = 0
....:     for n in range(N+1):
....:         y += f(n)
....:     return y
....: 
sage: %time g(10000000)
CPU times: user 17.5 s, sys: 214 ms, total: 17.7 s
Wall time: 17.6 s
333333383333335000000

In this example you could of course invoke the formula for the sum of the first n squares and just write down the answer without having to add anything up, but in general you won't be so lucky. You can optimize the heck out of f, but in general when you're limited to a single processor you're confined to iterating over all the values you need to consider sequentially.

However, if you have 2 processors available you could try to write code that splits the work into two roughly equal parts that can run relatively independently. For example, for our function we could compute the sum of all the even squares up to a given bound in one process and the sum of all the odd squares in another, and then add the two results together to get the sum of all squares up to our bound.

Sage has a readily available mechanism to do exactly this: the @parallel decorator. To enable parallel computation in your function, put @parallel above your function definition (the decorator can take some parameters; below, ncpus=2 tells it that we want to use 2 processors). Note that we also have to modify the function: it now takes not only the bound up to which we must add squares, but also a flag indicating whether we should consider even or odd squares.

sage: @parallel(ncpus=2)
....: def h(N,parity):
....:     y = 0
....:     if parity=="even":
....:         nums = range(0,N+1,2)
....:     elif parity=="odd":
....:         nums = range(1,N+1,2)
....:     for n in nums:
....:         y += f(n)
....:     return y

Instead of calling h with its standard sequence of parameters, we can pass it a list of tuples, where each tuple is a valid sequence of inputs. Sage then sends each tuple of inputs off to an available processor and evaluates the function on them there. What's returned is a generator object that can iterate over all the outputs; we can always see the output directly by calling list() on this returned generator:

sage: for tup in list(h([(1000,"even"),(1000,"odd")])):
....:     print(tup)
....: 
(((1000, 'even'), {}), 167167000)
(((1000, 'odd'), {}), 166666500)

Note that the tuple of inputs is also returned. Because we're doing things in parallel, we need to know which output corresponds to which input, especially since processes may finish at different times and return order is not necessarily the same as the order of the input list.
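
Since the return order isn't guaranteed, one convenient pattern (purely illustrative, not required by the decorator) is to collect the results into a dictionary keyed by the input arguments; the output below is the same as in the transcript above:

sage: results = {args: output for ((args, kwds), output) in h([(1000,"even"),(1000,"odd")])}
sage: results[(1000, "odd")]
166666500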

Finally, we can write a wrapper function which calls our parallelized function and combines the returned results:

sage: def i(N):
....:     y = 0
....:     for output in h([(N,"even"),(N,"odd")]):
....:         y += output[1]
....:     return y
....: 
sage: %time i(10000000)
CPU times: user 1.76 ms, sys: 33.2 ms, total: 35 ms
Wall time: 9.26 s
333333383333335000000

Note that i(10000000) produces the same output as g(10000000), but in about half the time, as the work is split between two processes instead of one. This is the basic gist of parallel computation: write code that can be partitioned into parts that can operate (relatively) independently; run those parts on different processors simultaneously; and then collect the returned outputs and combine them to produce the desired result.

PARALLELIZING THE ZERO SUM COMPUTATION


Let's take a look at the rank estimating zero sum formula again. Let E be a rational elliptic curve with analytic rank r. Then

$$r < \sum_{\gamma} \operatorname{sinc}^2(\Delta\gamma) = \frac{1}{\pi\Delta}\left(-\eta + \log\left(\frac{\sqrt{N}}{2\pi}\right)\right) + \frac{1}{2\pi^2\Delta^2}\left(\frac{\pi^2}{6} - \operatorname{Li}_2\left(e^{-2\pi\Delta}\right)\right) + \frac{1}{\pi\Delta}\sum_{\substack{n=p^k \\ n<e^{2\pi\Delta}}} c_n\left(1-\frac{\log n}{2\pi\Delta}\right)$$
where
  • γ ranges over the nontrivial zeros of the L-function attached to E
  • Δ is a positive parameter
  • η is the Euler-Mascheroni constant = 0.5772…
  • N is the conductor of E
  • $\operatorname{Li}_2(x)$ is the dilogarithm function, defined by $\operatorname{Li}_2(x) = \sum_{k=1}^{\infty} \frac{x^k}{k^2}$
  • $c_n$ is the nth coefficient of the logarithmic derivative of the L-function of E, which is zero when n is not a perfect prime power.
The right hand side of the formula, which is what we actually compute, can be broken up into three parts: the first term, involving the curve's conductor N; the second term, involving the dilogarithm function $\operatorname{Li}_2(x)$; and the sum over prime powers. The first two parts are quick to compute: evaluating them can basically be done in constant time regardless of the magnitude of N or Δ.

It is therefore not worth considering parallelizing these two components, since the prime power sum dominates computation time for all but the smallest Δ values. Instead, what I've done is rewritten the zero sum code so that the prime power sum can be evaluated using multiple cores.

As mentioned in this post, we can turn this sum into one indexed by the primes (and not the prime powers); this actually makes parallelization quite straightforward. Recall that all primes except 2 are odd, and all primes except 2 and 3 are congruent to either 1 or 5 modulo 6. One can scale this up: given a list of small primes $[p_1, p_2, \ldots, p_n]$, all other primes fall into one of a relatively small number of residue classes modulo $p_1 p_2 \cdots p_n$. For example, all primes beyond 2, 3, 5 and 7 have one of the following 48 remainders when you divide them by $210 = 2 \cdot 3 \cdot 5 \cdot 7$:
1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, 151, 157, 163, 167, 169, 173, 179, 181, 187, 191, 193, 197, 199, 209,
and no others.
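
As a quick sanity check, these 48 residues are precisely the integers between 1 and 210 that are coprime to 210 (there are φ(210) = 48 of them). This snippet is an illustration only, not part of the rank estimation code:

sage: M = 2*3*5*7
sage: residues = [r for r in range(1, M) if gcd(r, M) == 1]
sage: len(residues)
48
sage: residues[:5], residues[-2:]
([1, 11, 13, 17, 19], [199, 209])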

If we had 48 processors available, the natural thing to do would be to get each of them to iterate over all integers in a particular residue class up to $e^{2\pi\Delta}$, evaluating the summand whenever that integer is prime, and returning the sum thereof. For example, if the bound were 1 million, then processor 1 would compute and return $\sum_{\substack{n \equiv 1 \pmod{210} \\ n < 1000000}} c_n\left(1-\frac{\log n}{2\pi\Delta}\right)$. Processor 2 would do the same with all integers that are 11 modulo 210, etcetera.
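
In rough Sage pseudocode, the job handed to a single processor looks something like the sketch below, where summand(n) is a stand-in for the actual $c_n\left(1-\frac{\log n}{2\pi\Delta}\right)$ term (this is an illustration only, not the code that was merged):

sage: def sum_over_residue_class(r, bound, summand):
....:     # Sum summand(n) over the primes n congruent to r mod 210 below bound
....:     return sum(summand(n) for n in range(r, bound, 210) if is_prime(n))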

In reality, we have to figure out a) how many processors are available, and b) partition the work relatively equally among those processors. Thankfully sage.parallel.ncpus.ncpus() succinctly addresses the former, and the latter is achieved by splitting the residue classes into n chunks of approximately equal size (where n is the number of available CPUs) and then getting a given processor to evaluate the sum over those residues in a single chunk.
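
The chunking itself can be sketched as follows, reusing the residues list from the snippet above (the real partitioning lives in the private _get_residue_data() method and may differ in detail):

sage: ncpus = sage.parallel.ncpus.ncpus()
sage: residue_chunks = [residues[i::ncpus] for i in range(ncpus)]
sage: all(len(chunk) in (48//ncpus, 48//ncpus + 1) for chunk in residue_chunks)
True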

Here is the method I wrote that computes the $\operatorname{sinc}^2$ zero sum with (the option of) multiple processors:

def _zerosum_sincsquared_parallel(self, Delta=1, bad_primes=None, ncpus=None):
    r"""
    Parallelized implementation of self._zerosum_sincsquared_fast().
    Faster than self._zerosum_sincsquared_fast() when Delta >= ~1.75.

    .. NOTE::

        This will only produce correct output if self._E is given by its
        global minimal model, i.e. if self._E.is_minimal()==True.

    INPUT:

    - ``Delta`` -- positive real parameter defining the tightness of the
      zero sum, and thus the closeness of the returned estimate to the
      actual analytic rank of the form attached to self.

    - ``bad_primes`` -- (default: None) If not None, a list of primes dividing
      the level of the form attached to self. This is passable so that this
      method can be run on curves whose conductor is large enough to warrant
      precomputing bad primes.

    - ``ncpus`` -- (default: None) If not None, a positive integer
      defining the number of CPUs to be used for the computation. If left as
      None, the maximum available number of CPUs will be used.

    OUTPUT:

    A positive real number that bounds the analytic rank of the modular form
    attached to self from above.

    .. SEEALSO::

        :meth:`~sage.lfunctions.zero_sums.LFunctionZeroSum_abstract.zerosum_sincsquared`
        for the more general but slower version of this method.

        :meth:`~sage.lfunctions.zero_sums.LFunctionZeroSum_abstract.zerosum`
        for the public method that calls this private method.

    EXAMPLES::

        sage: E = EllipticCurve('37a')
        sage: Z = LFunctionZeroSum(E)
        sage: print(E.rank(), Z._zerosum_sincsquared_parallel(Delta=1))
        (1, 1.01038406984)
        sage: E = EllipticCurve('121a')
        sage: Z = LFunctionZeroSum(E)
        sage: print(E.rank(), Z._zerosum_sincsquared_parallel(Delta=1.5, ncpus=8))
        (0, 0.0104712060087)
    """
    # If Delta > 6.619, then we will most likely get overflow: some ap values
    # will be too large to fit into a C int
    if Delta > 6.619:
        raise ValueError("Delta value too large; will result in overflow")

    cdef double npi = self._pi
    cdef double twopi = npi*2
    cdef double eg = self._euler_gamma
    cdef double N_double = self._N

    cdef double t, u, w, y, z, expt, bound1, logp, logq
    cdef double thetap, thetaq, sqrtp, sqrtq, p, q
    cdef int ap, aq
    cdef unsigned long n

    t = twopi*Delta
    expt = c_exp(t)
    bound1 = c_exp(t/2)
    u = t*(-eg + c_log(N_double)/2 - c_log(twopi))
    w = npi**2/6 - spence(-expt**(-1) + 1)

    y = 0
    # Do bad primes first. Add correct contributions and subtract
    # incorrect contributions; the latter are added back later on.
    if bad_primes is None:
        bad_primes = self._N.prime_divisors()
    bad_primes = [prime for prime in bad_primes if prime < expt]
    for prime in bad_primes:
        n = prime
        ap = self._e.ellap(n)
        p = n
        sqrtp = c_sqrt(p)
        thetap = c_acos(ap/(2*sqrtp))
        logp = c_log(p)

        q = 1
        sqrtq = 1
        aq = 1
        thetaq = 0
        logq = logp

        z = 0
        while logq < t:
            q *= p
            sqrtq *= sqrtp
            aq *= ap
            thetaq += thetap
            # Actual value of this term
            z += (aq/q)*(t - logq)
            # Incorrect value of this term to be removed below
            z -= 2*c_cos(thetaq)*(t - logq)/sqrtq
            logq += logp
        y -= z*logp

    # Good prime case. Bad primes are treated as good primes, but their
    # contribution here is cancelled out above; this way we don't
    # have to check if each prime divides the level or not.
    if ncpus is None:
        ncpus = num_cpus()
    small_primes, jump, residue_chunks = self._get_residue_data(ncpus)

    # Must deal with small primes separately
    for m in small_primes:
        n = m
        if n < expt:
            y += self._sincsquared_summand_1(n, t, ap, p, logp, thetap, sqrtp,
                                             logq, thetaq, sqrtq, z)

    @parallel(ncpus=ncpus)
    def _sum_over_residues(residues):
        """
        Return the prime power sum over the residues in a residue chunk
        """
        cdef double y = 0
        cdef unsigned long n, k
        n = 0
        # Case: n+jump < sqrt(expt)
        while n + jump < bound1:
            for m in residues:
                k = n + m
                if n_is_prime(k):
                    y += self._sincsquared_summand_1(k, t, ap, p, logp, thetap, sqrtp,
                                                     logq, thetaq, sqrtq, z)
            n += jump
        # Case: n < sqrt(expt) but maybe n+jump > sqrt(expt)
        for m in residues:
            k = n + m
            if n_is_prime(k):
                if k < bound1:
                    y += self._sincsquared_summand_1(k, t, ap, p, logp, thetap, sqrtp,
                                                     logq, thetaq, sqrtq, z)
                elif k < expt:
                    y += self._sincsquared_summand_2(k, t, ap, p, logp)
        n += jump
        # Case: sqrt(expt) <= n < expt-jump
        while n + jump < expt:
            for m in residues:
                k = n + m
                if n_is_prime(k):
                    y += self._sincsquared_summand_2(k, t, ap, p, logp)
            n += jump
        # Case: n < expt but n+jump > expt
        for m in residues:
            k = n + m
            if k >= expt:
                return y
            elif n_is_prime(k):
                y += self._sincsquared_summand_2(k, t, ap, p, logp)
        # Return the partial sum for this chunk if the loop above finishes
        # without hitting the k >= expt cutoff
        return y

    # The _sum_over_residues() function above is the part that is parallelized
    for summand in _sum_over_residues(residue_chunks):
        y += summand[1]

    return RDF(2*(u + w + y)/(t**2))
Note that I've defined a subfunction to compute the prime sum over a given subset of residue classes; this is the part that is parallelized. Obtaining the residue chunks and computing the actual summand at each prime are both farmed out to external methods.

Let's see some timings. The machine I'm running Sage on has 12 available processors:

sage: sage.parallel.ncpus.ncpus()
12
sage: E = EllipticCurve([12838,-51298])
sage: Z = LFunctionZeroSum(E)
sage: %time Z._zerosum_sincsquared_fast(Delta=2.6)
CPU times: user 36.7 s, sys: 62.9 ms, total: 36.8 s
Wall time: 36.8 s
2.8283629046
sage: %time Z._zerosum_sincsquared_parallel(Delta=2.6)
CPU times: user 7.87 ms, sys: 133 ms, total: 141 ms
Wall time: 4.06 s
2.8283629046

Same answer in a ninth of the time! Note that the parallelized method has some extra overhead, so even with 12 processors we're unlikely to get a full factor of 12 speedup. Nevertheless, the speed boost will allow us to run the zero sum code with larger Δ values, allowing us to investigate elliptic curves with even larger conductors.

Tuesday, August 5, 2014

How big should Delta be?

Let's take a look at the central formula in my GSoC rank estimation code. Let E be a rational elliptic curve with analytic rank r. Then
$$r < \sum_{\gamma} \operatorname{sinc}^2(\Delta\gamma) = \frac{1}{\pi\Delta}\left[C_0 + \frac{1}{2\pi\Delta}\left(\frac{\pi^2}{6} - \operatorname{Li}_2\left(e^{-2\pi\Delta}\right)\right) + \sum_{n=1}^{e^{2\pi\Delta}} c_n\left(1-\frac{\log n}{2\pi\Delta}\right)\right]$$
where

  • γ ranges over the nontrivial zeros of the L-function attached to E
  • Δ is a positive parameter
  • $C_0 = -\eta + \log\left(\frac{\sqrt{N}}{2\pi}\right)$; η is the Euler-Mascheroni constant = 0.5772… and N is the conductor of E
  • $\operatorname{Li}_2(x)$ is the dilogarithm function, defined by $\operatorname{Li}_2(x) = \sum_{k=1}^{\infty} \frac{x^k}{k^2}$
  • $c_n$ is the nth coefficient of the logarithmic derivative of the L-function of E.
The thing I want to look at in this post is that parameter Δ. The larger you make it, the closer the sum on the left hand side over the zeros is to the analytic rank, so when trying to determine the rank of E we want to pick as large a Δ as we can. However, the bigger this parameter is the more terms we have to compute in the sum over the cn on the right hand side; moreover the number of terms - and thus the total computation time - scales exponentially with Δ. This severely constrains how big we can make Δ; generally a value of Δ=2 may take a second or two for a single curve on SageMathCloud, while Δ=3 may take upwards of an hour. For the average rank project I'm working on I ran the code on 12 million curves using Δ=1.3; the total computation time was about 4 days on SMC.
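
To get a feel for that exponential scaling: the prime power sum runs over $n < e^{2\pi\Delta}$, so the number of integers that must be scanned grows by a factor of $e^{2\pi} \approx 535$ for every unit increase in Δ. A rough back-of-the-envelope check (illustration only; these are counts of integers scanned, not timings):

sage: [int(exp(2*pi*D).n()) for D in [1.3, 2, 3]]
[3526, 286751, 153552935]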

However, it should be clear that using too large a Δ is overkill: if you run the code on a curve with Δ=1 and get a bound of zero out, you know that curve's rank is exactly zero (since it's at most zero, and rank is a non-negative integer). Thus using larger Δ values on that curve will do nothing except provide you the same bound while taking much longer to do so.

This raises the question: just how big of a Δ value is good enough? Can we, given some data defining an elliptic curve, decide a priori what size Δ to use so that a) the computation returns a bound that is likely to be the true rank of the curve, and b) it will do so in as little time as possible?

The relevant invariant to look at here is the conductor N of the elliptic curve; go back to the formula above and you'll see that the zero sum includes a term which is roughly $\frac{\log N}{2\pi\Delta}$ (coming from the $\frac{1}{\pi\Delta}C_0$ term). This means that the size of the returned estimate will scale with log N: for a given Δ, the bound returned on a curve with a 10-digit conductor will be about double that returned for a curve with a 5-digit conductor, for example. However, we can compensate for this by increasing Δ accordingly.

To put it all more concretely we can pose the following questions:
  • Given an elliptic curve E with conductor N, how large does Δ need to be in terms of N so that the returned bound is guaranteed to be the true analytic rank of the curve?
  • Given a conductor size N and a proportionality constant $P \in [0,1]$, how big does Δ have to be in terms of N and P so that at least $P \cdot 100$ percent of all elliptic curves with conductor of size about N will, when the rank estimation code is run on them with that Δ value, yield returned bounds that are equal to their true analytic rank?
[Note: in both of the above questions we are making the implicit assumption that the returned rank bound is monotonically decreasing for increasing Δ. This is not necessarily the case: the function $y = \operatorname{sinc}^2(x)$ is not a decreasing function in x. However, in practice any upwards fluctuation we see in the zero sum is small enough to be eliminated when we take the floor function to get an integer rank bound.]

A Δ CHOICE GOOD ENOUGH FOR MOST CURVES


The first question is easier to phrase, but more difficult to answer, so we will defer it for now. To answer the second question, it is useful to mention what we know about the distribution and density of nontrivial zeros of an elliptic curve L-function.

Using some complex analysis we can show that, for the L-function of an elliptic curve with conductor N, the expected number of zeros in the critical strip with imaginary part at most T is $O(T\log N + T\log T)$. That is, the expected zero density has two distinct components: a part that scales linearly with the log of the conductor, and a part that doesn't scale with the conductor (but does scale slightly faster than linearly with how far out you go).

Consider the following: if we let
$$\Delta(N) = \frac{C_0}{\pi} = \frac{1}{\pi}\left(-\eta + \log\left(\frac{\sqrt{N}}{2\pi}\right)\right),$$
then the first term on the right hand side of the zero sum formula is precisely 1 - this is the term that comes from the log N part of the zero density. The next term - the one involving $\frac{\pi^2}{6} - \operatorname{Li}_2\left(e^{-2\pi\Delta}\right)$ - is the term that comes from the part independent of N; because the right hand side is divided by Δ it therefore goes to zero as the curve's conductor increases. The third term contains the $c_n$ coefficients, which (per Sato-Tate) will be positive half the time and negative half the time, so the entire sum could be positive or negative; we therefore expect its contribution to be small on average when we look at a large number of elliptic curves.
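
Spelling out that first claim: with this choice of Δ, the conductor-dependent term contributes exactly
$$\frac{1}{\pi\Delta(N)}\, C_0 = \frac{C_0}{\pi \cdot (C_0/\pi)} = 1.$$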

It thus stands to reason that for this value of Δ, and when the conductor N is sufficiently large, the zero sum will be about 1, plus or minus a smallish amount coming from the $c_n$ sum. This argument is by no means rigorous, but we might therefore expect the zero sum to be within 2 of the actual analytic rank most of the time. Couple that with knowledge of the root number and you get a rank upper bound that is equal to the actual analytic rank in all but a few pathological cases.

Empirical evidence bears this out. I ran my rank estimation code with this choice of Δ scaling on the whole Cremona database, which contains all elliptic curves up to conductor 350000:
[Figure: the proportion of curves up to conductor N for which the computed rank bound is strictly greater than the rank. The x-axis denotes the conductor; the y-axis is the proportion of all curves up to that conductor for which the returned rank bound was not equal to the true rank (assuming BSD and GRH as always).]
As you can see, the percentage of curves for which the rank bound is strictly greater than the rank holds fairly constant at about 0.25%. That's minuscule: what this is saying is that if you type in a random Weierstrass equation, there is only about a 1 in 400 chance that the rank bounding code with $\Delta = \frac{C_0}{\pi}$ will return a value that isn't the actual analytic rank. Moreover, the code runs considerably faster than traditional analytic rank techniques, so if you wanted to, for example, compute the ranks of a database of millions of elliptic curves, this would be a good first port of call.

Of course, this Δ scaling approach is by no means problem-free. Some more detailed analysis will show that, with Δ scaled as above, the runtime of the code will actually be $O(N)$ (omitting log factors), i.e. the asymptotic scaling is actually worse than that of traditional analytic rank methods, which rely on evaluating the L-function directly and are thus $O(\sqrt{N})$. It's just that with this code we have some very small constants sitting in front, so the crossover point is at large enough conductor values that neither method is feasible anyway.

This choice of Δ scaling works for conductor ranges up to about $10^9$; that corresponds to $\Delta \approx 2.5$, which will give you a total runtime of about 10-20 seconds for a single curve on SMC. Increase the conductor by a factor of 10 and your runtime will also go up tenfold.
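
As a quick check of that figure (illustration only), plugging $N = 10^9$ into $\Delta(N) = \frac{C_0}{\pi}$ gives:

sage: ((-euler_gamma + log(sqrt(10^9)/(2*pi)))/pi).n(digits=3)
2.53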

For curves of larger conductor, instead of setting $\Delta = \frac{C_0}{\pi}$ we can choose to set Δ to be $\frac{\alpha C_0}{\pi}$ for any $\alpha \in [0,1]$; the resulting asymptotic runtime will then be $O(N^{\alpha})$, at the expense of having a reduced proportion of elliptic curves where the rank bound is equal to the true rank.

HOW LARGE DOES Δ HAVE TO BE TO GUARANTEE TRUE RANK?


When we use $\Delta = \frac{C_0}{\pi}$, the curves for which the returned rank estimate is strictly larger than the true rank are precisely those which have unusually low-lying zeros. For example, the rank 0 curve with Cremona label 256944c1 has a zero with imaginary part at 0.0256 (see here for a plot), compared to an expected value of 0.824. Using $\Delta = \frac{C_0}{\pi}$ on this curve means $\Delta \approx 1.214$; if we compute the corresponding zero sum with this value of Δ we get a value of 2.07803. The smallest value of Δ for which we get a zero sum value of less than 2 is empirically about 2.813; at this point taking the floor and invoking parity tells us that the curve's rank is zero.
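
For the record, that Δ value is just the $\frac{C_0}{\pi}$ scaling evaluated at N = 256944 (a quick check, illustration only):

sage: ((-euler_gamma + log(sqrt(256944)/(2*pi)))/pi).n(digits=4)
1.214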

The above example demonstrates that if we want to guarantee that the returned rank bound is the true analytic rank, we are forced to increase the size of Δ to something larger than $\frac{C_0}{\pi}$. Do we need to increase Δ by a fixed amount independent of N? Do we need to increase it by some constant factor? Or does it need to scale faster than log N? These are hard questions to answer; it comes down to determining how close to the central point the lowest nontrivial zero can be as a function of the conductor N (or some other invariants of E), which in turn is intimately related to estimating the size of the leading coefficient of the L-series of E at the central point. This is already the topic of a previous post: it is a question that I hope to make progress in answering in my PhD dissertation.