THE @parallel DECORATOR IN SAGE
But first: how does one compute in parallel in Sage? Suppose I have written a function in a Sage environment (e.g. a SageMathCloud worksheet, a .sage file, the Sage console, etc.) which takes in some input and returns some output. The simple example f below takes in a number n and returns its square.
sage: def f(n):
....: return n*n
....:
sage: f(2),f(3),f(5)
(4, 9, 25)
This is a fairly straightforward beast; put in a value, get a value out. But what if we have some computation that requires evaluating that function on a large number of inputs? For example, say we want to compute the sum of the first 10 million squares. If you only have one processor to tap, then you're limited to calling f over and over again in series:
sage: def g(N):
....: y = 0
....: for n in range(N+1):
....: y += f(n)
....: return y
....:
sage: %time g(10000000)
CPU times: user 17.5 s, sys: 214 ms, total: 17.7 s
Wall time: 17.6 s
333333383333335000000
In this example you could of course invoke the formula for the sum of the first n squares and just write down the answer without having to add anything up, but in general you won't be so lucky. You can optimize the heck out of f, but when you're limited to a single processor you're confined to iterating over all the values you need to consider sequentially.
However, if you have 2 processors available, you could try writing code that splits the work into two roughly equal parts that can run relatively independently. For example, for our function we could compute the sum of all the even squares up to a given bound in one process and the sum of all the odd squares in another, and then add the two results together to get the sum of all squares up to our bound.
Sage has a readily available mechanism to do exactly this: the @parallel decorator.
sage: @parallel(ncpus=2)
....: def h(N,parity):
....: y = 0
....: if parity=="even":
....: nums = range(0,N+1,2)
....: elif parity=="odd":
....: nums = range(1,N+1,2)
....: for n in nums:
....: y += f(n)
....: return y
Instead of calling h with its standard sequence of parameters, we can pass it a list of tuples, where each tuple is a valid sequence of inputs. Sage then sends each tuple of inputs off to an available processor and evaluates the function on them there. What's returned is a generator object that can iterate over all the outputs; we can always see the output directly by calling list() on this returned generator:
sage: for tup in list(h([(1000,"even"),(1000,"odd")])):
....: print(tup)
....:
(((1000, 'even'), {}), 167167000)
(((1000, 'odd'), {}), 166666500)
Note that the tuple of inputs is also returned. Because we're doing things in parallel, we need to know which output corresponds to which input, especially since processes may finish at different times and return order is not necessarily the same as the order of the input list.
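Since results can arrive in any order, it is often convenient to collect the yielded pairs into a dictionary keyed by the input arguments. A minimal plain-Python sketch, using the exact output format shown above (the list here is just the two pairs printed in the example):

```python
# Each yielded item has the form ((args, kwargs), value); keying a dict by
# the positional args restores the input -> output correspondence even when
# results arrive out of order.
raw = [(((1000, 'odd'), {}), 166666500),
       (((1000, 'even'), {}), 167167000)]
by_input = {args: value for (args, kwargs), value in raw}
# by_input[(1000, 'even')] == 167167000
```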
Finally, we can write a wrapper function which calls our parallelized function and combines the returned results:
sage: def i(N):
....: y = 0
....: for output in h([(N,"even"),(N,"odd")]):
....: y += output[1]
....: return y
....:
sage: %time i(10000000)
CPU times: user 1.76 ms, sys: 33.2 ms, total: 35 ms
Wall time: 9.26 s
333333383333335000000
Note that i(10000000) produces the same output as g(10000000) but in about half the time, as the work is split between two processes instead of one. This is the basic gist of parallel computation: write code that can be partitioned into parts that operate (relatively) independently; run those parts on different processors simultaneously; then collect the returned outputs and combine them to produce the desired result.
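Outside of Sage, the same split-compute-combine pattern can be sketched with Python's standard multiprocessing module (the function and variable names here are my own illustration, not part of Sage):

```python
from multiprocessing import Pool

def partial_sum_of_squares(args):
    # Sum the squares of all even (or all odd) integers up to N
    N, parity = args
    start = 0 if parity == "even" else 1
    return sum(n * n for n in range(start, N + 1, 2))

if __name__ == "__main__":
    # Farm the even and odd halves out to two worker processes, then combine
    with Pool(processes=2) as pool:
        parts = pool.map(partial_sum_of_squares, [(1000, "even"), (1000, "odd")])
    print(sum(parts))  # 333833500, the sum of the first 1000 squares
```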
PARALLELIZING THE ZERO SUM COMPUTATION
Let's take a look at the rank estimating zero formula again. Let E be a rational elliptic curve with analytic rank r. Then
$$r < \sum_{\gamma} \operatorname{sinc}^2(\Delta\gamma) = \frac{1}{\pi\Delta}\left(-\eta+\log\left(\frac{\sqrt{N}}{2\pi}\right)\right) + \frac{1}{2\pi^2\Delta^2}\left(\frac{\pi^2}{6}-\operatorname{Li}_2\left(e^{-2\pi\Delta}\right)\right) + \frac{1}{\pi\Delta}\sum_{\substack{n=p^k \\ n<e^{2\pi\Delta}}} c_n\left(1-\frac{\log n}{2\pi\Delta}\right)$$
where
- γ ranges over the nontrivial zeros of the L-function attached to E
- Δ is a positive parameter
- η is the Euler-Mascheroni constant, η = 0.5772…
- N is the conductor of E
- Li₂(x) is the dilogarithm function, defined by $\operatorname{Li}_2(x)=\sum_{k=1}^{\infty}\frac{x^k}{k^2}$
- cn is the nth coefficient of the logarithmic derivative of the L-function of E, which is zero when n is not a perfect prime power.
The right hand side of the sum, which is what we actually compute, can be broken up into three parts: the first term involving the curve's conductor N; the second term involving the dilogarithm function Li2(x); and the sum over prime powers. The first two parts are quick to compute: evaluating them can basically be done in constant time regardless of the magnitude of N or Δ.
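As a rough illustration (plain Python, not the actual Sage implementation), the two quick terms can be evaluated directly; the dilogarithm series converges fast for the small arguments involved:

```python
import math

EULER_GAMMA = 0.5772156649015329  # the constant eta in the formula above

def dilog(x, terms=200):
    # Li_2(x) = sum_{k>=1} x^k / k^2; the series converges quickly for |x| < 1
    return sum(x**k / k**2 for k in range(1, terms + 1))

def first_two_terms(N, Delta):
    # The conductor term and the dilogarithm term of the zero sum formula;
    # both evaluate in constant time regardless of the size of N or Delta
    term1 = (1 / (math.pi * Delta)) * (-EULER_GAMMA + math.log(math.sqrt(N) / (2 * math.pi)))
    term2 = (1 / (2 * math.pi**2 * Delta**2)) * (math.pi**2 / 6 - dilog(math.exp(-2 * math.pi * Delta)))
    return term1, term2
```

Because these two terms cost essentially nothing, all the optimization effort belongs to the prime power sum.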
It is therefore not worth considering parallelizing these two components, since the prime power sum dominates computation time for all but the smallest Δ values. Instead, what I've done is rewritten the zero sum code so that the prime power sum can be evaluated using multiple cores.
As mentioned in this post, we can turn this sum into one indexed by the primes (and not the prime powers); this actually makes parallelization quite straightforward. Recall that all primes except 2 are odd, and all primes except 2 and 3 are congruent to either 1 or 5 modulo 6. One can scale this up: given a list of small primes [p1,p2,…,pn], all other primes fall into one of a relatively small number of residue classes modulo p1p2⋯pn. For example, all primes beyond 2, 3, 5 and 7 have one of the following 48 remainders when you divide them by 210=2⋅3⋅5⋅7:
1,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,121,127,131,137,139,143,149,151,157,163,167,169,173,179,181,187,191,193,197,199,209,
and no others.
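These residue classes are exactly the integers in [1, 210) that are coprime to 210. A quick sketch (my own helper, not Sage's internal _get_residue_data):

```python
from math import gcd

def prime_residues(small_primes):
    # Residue classes modulo the product of small_primes that can contain
    # primes larger than every prime in small_primes
    m = 1
    for p in small_primes:
        m *= p
    return m, [r for r in range(1, m) if gcd(r, m) == 1]

modulus, residues = prime_residues([2, 3, 5, 7])
print(modulus, len(residues))  # 210 48
```

Note that the list includes composites like 121 = 11² and 169 = 13²: the classes can contain primes, not that every member is prime, so a primality check is still needed within each class.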
If we had 48 processors available, the natural thing to do would be to get each of them to iterate over all integers in a particular residue class up to e^(2πΔ), evaluating the summand whenever that integer is prime, and returning the sum thereof. For example, if the bound was 1 million, then processor 1 would compute and return $\sum_{\substack{n \le 1000000 \\ n \equiv 1 \pmod{210}}} c_n\left(1-\frac{\log n}{2\pi\Delta}\right)$. Processor 2 would do the same with all integers that are 11 modulo 210, etcetera.
In reality, we have to figure out a) how many processors are available, and b) partition the work relatively equally among those processors. Thankfully sage.parallel.ncpus.ncpus() succinctly addresses the former, and the latter is achieved by splitting the residue classes into n chunks of approximately equal size (where n is the number of available CPUs) and then getting a given processor to evaluate the sum over those residues in a single chunk.
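The chunking step can be sketched as follows (again a hypothetical helper, standing in for part of what Sage's _get_residue_data does internally):

```python
def chunkify(items, ncpus):
    # Split items into ncpus chunks whose sizes differ by at most one
    size, extra = divmod(len(items), ncpus)
    chunks, start = [], 0
    for i in range(ncpus):
        step = size + (1 if i < extra else 0)
        chunks.append(items[start:start + step])
        start += step
    return chunks
```

With 48 residue classes and 12 CPUs each chunk gets exactly 4 residues; with 10 CPUs, eight chunks get 5 residues and two get 4.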
Here is the method I wrote that computes the sinc2 zero sum with (the option of) multiple processors:
Note that I've defined a subfunction to compute the prime sum over a given subset of residue classes; this is the part that is parallelized. Obtaining the residue chunks and computing the actual summand at each prime are both farmed out to external methods.
As the timings further down show, the parallelized method returns the same answer in about a ninth of the time on a 12-processor machine. Note that the parallelized method has some extra overhead, so even with 12 processors we're unlikely to get a full factor-of-12 speedup. Nevertheless, the speed boost will allow us to run the zero sum code with larger Δ values, allowing us to investigate elliptic curves with even larger conductors.
def _zerosum_sincsquared_parallel(self, Delta=1, bad_primes=None, ncpus=None):
    r"""
    Parallelized implementation of self._zerosum_sincsquared_fast().
    Faster than self._zerosum_sincsquared_fast() when Delta >= ~1.75.

    .. NOTE::

        This will only produce correct output if self._E is given by its
        global minimal model, i.e. if self._E.is_minimal()==True.

    INPUT:

    - ``Delta`` -- positive real parameter defining the
      tightness of the zero sum, and thus the closeness of the returned
      estimate to the actual analytic rank of the form attached to self.

    - ``bad_primes`` -- (default: None) If not None, a list of primes dividing
      the level of the form attached to self. This is passable so that this
      method can be run on curves whose conductor is large enough to warrant
      precomputing bad primes.

    - ``ncpus`` -- (default: None) If not None, a positive integer
      defining the number of CPUs to be used for the computation. If left as
      None, the maximum available number of CPUs will be used.

    OUTPUT:

    A positive real number that bounds the analytic rank of the modular form
    attached to self from above.

    .. SEEALSO::

        :meth:`~sage.lfunctions.zero_sums.LFunctionZeroSum_abstract.zerosum_sincsquared`
        for the more general but slower version of this method.

        :meth:`~sage.lfunctions.zero_sums.LFunctionZeroSum_abstract.zerosum`
        for the public method that calls this private method.

    EXAMPLES::

        sage: E = EllipticCurve('37a')
        sage: Z = LFunctionZeroSum(E)
        sage: print(E.rank(),Z._zerosum_sincsquared_parallel(Delta=1))
        (1, 1.01038406984)

        sage: E = EllipticCurve('121a')
        sage: Z = LFunctionZeroSum(E)
        sage: print(E.rank(),Z._zerosum_sincsquared_parallel(Delta=1.5,ncpus=8))
        (0, 0.0104712060087)
    """
    # If Delta > 6.619, then we will most likely get overflow: some ap values
    # will be too large to fit into a C int
    if Delta > 6.619:
        raise ValueError("Delta value too large; will result in overflow")

    cdef double npi = self._pi
    cdef double twopi = npi*2
    cdef double eg = self._euler_gamma
    cdef double N_double = self._N
    cdef double t, u, w, y, z, expt, bound1, logp, logq
    cdef double thetap, thetaq, sqrtp, sqrtq, p, q
    cdef int ap, aq
    cdef unsigned long n

    t = twopi*Delta
    expt = c_exp(t)
    bound1 = c_exp(t/2)
    u = t*(-eg + c_log(N_double)/2 - c_log(twopi))
    w = npi**2/6 - spence(-expt**(-1)+1)

    y = 0
    # Do bad primes first. Add correct contributions and subtract
    # incorrect contributions; the latter are added back later on.
    if bad_primes is None:
        bad_primes = self._N.prime_divisors()
    bad_primes = [prime for prime in bad_primes if prime < expt]
    for prime in bad_primes:
        n = prime
        ap = self._e.ellap(n)
        p = n
        sqrtp = c_sqrt(p)
        thetap = c_acos(ap/(2*sqrtp))
        logp = c_log(p)

        q = 1
        sqrtq = 1
        aq = 1
        thetaq = 0
        logq = logp

        z = 0
        while logq < t:
            q *= p
            sqrtq *= sqrtp
            aq *= ap
            thetaq += thetap
            # Actual value of this term
            z += (aq/q)*(t-logq)
            # Incorrect value of this term to be removed below
            z -= 2*c_cos(thetaq)*(t-logq)/sqrtq
            logq += logp
        y -= z*logp

    # Good prime case. Bad primes are treated as good primes, but their
    # contribution here is cancelled out above; this way we don't
    # have to check if each prime divides the level or not.
    if ncpus is None:
        ncpus = num_cpus()
    small_primes, jump, residue_chunks = self._get_residue_data(ncpus)

    # Must deal with small primes separately
    for m in small_primes:
        n = m
        if n < expt:
            y += self._sincsquared_summand_1(n, t, ap, p, logp, thetap, sqrtp,
                                             logq, thetaq, sqrtq, z)

    @parallel(ncpus=ncpus)
    def _sum_over_residues(residues):
        """
        Return the p-power sum over residues in a residue chunk
        """
        cdef double y = 0
        cdef unsigned long n, k
        n = 0
        # Case: n+jump < sqrt(expt)
        while n+jump < bound1:
            for m in residues:
                k = n+m
                if n_is_prime(k):
                    y += self._sincsquared_summand_1(k, t, ap, p, logp, thetap, sqrtp,
                                                     logq, thetaq, sqrtq, z)
            n += jump
        # Case: n < sqrt(expt) but maybe n+jump > sqrt(expt)
        for m in residues:
            k = n+m
            if n_is_prime(k):
                if k < bound1:
                    y += self._sincsquared_summand_1(k, t, ap, p, logp, thetap, sqrtp,
                                                     logq, thetaq, sqrtq, z)
                elif k < expt:
                    y += self._sincsquared_summand_2(k, t, ap, p, logp)
        n += jump
        # Case: sqrt(expt) <= n < expt-jump
        while n+jump < expt:
            for m in residues:
                k = n+m
                if n_is_prime(k):
                    y += self._sincsquared_summand_2(k, t, ap, p, logp)
            n += jump
        # Case: n < expt but n+jump > expt
        for m in residues:
            k = n+m
            if k >= expt:
                return y
            elif n_is_prime(k):
                y += self._sincsquared_summand_2(k, t, ap, p, logp)

    # _sum_over_residues() is run in parallel over the residue chunks
    for summand in _sum_over_residues(residue_chunks):
        y += summand[1]

    return RDF(2*(u+w+y)/(t**2))
Let's see some timings. The machine I'm running Sage on has 12 available processors:
sage: sage.parallel.ncpus.ncpus()
12
sage: E = EllipticCurve([12838,-51298])
sage: Z = LFunctionZeroSum(E)
sage: %time Z._zerosum_sincsquared_fast(Delta=2.6)
CPU times: user 36.7 s, sys: 62.9 ms, total: 36.8 s
Wall time: 36.8 s
2.8283629046
sage: %time Z._zerosum_sincsquared_parallel(Delta=2.6)
CPU times: user 7.87 ms, sys: 133 ms, total: 141 ms
Wall time: 4.06 s
2.8283629046
I know almost nothing about number theory. Is it clear that the numbers of primes in different residue classes are more or less the same?
Also, I'm not sure if it matters in Cython, but usually branching is quite expensive. Wouldn't it be better to just write a function _sum_over(list_of_primes)?
This is not obvious. It's not even obvious that there is at least one prime in each residue class. But, this follows from a more sophisticated version of the prime number theorem:
http://terrytao.wordpress.com/2009/09/24/the-prime-number-theorem-in-arithmetic-progressions-and-dueling-conspiracies/
Thanks!
I've realized that perhaps better code would be _sum_from_to(lower_bound, upper_bound), where the bounds would be approximated from the prime number theorem. The advantage is that the primes would be generated in the different threads, not in only one. Of course, it all depends on the actual method used to generate primes. I see that the code currently uses the FLINT function n_is_prime(). It may be interesting to try to get rid of the overhead of calling an external function. After all, the Miller-Rabin test needs to be run only six times in order to be deterministic for numbers up to 10^12 (see https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test#Deterministic_variants_of_the_test).
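For reference, the deterministic variant mentioned in this comment can be sketched in plain Python; the first six prime bases 2, 3, 5, 7, 11, 13 are known to give a correct answer for all n below about 3.4 × 10^12, which covers the 10^12 range mentioned:

```python
def is_prime_mr(n):
    # Deterministic Miller-Rabin: the bases 2, 3, 5, 7, 11, 13 are
    # correct for all n < 3,474,749,660,383
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2^s with d odd
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in (2, 3, 5, 7, 11, 13):
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # base a witnesses that n is composite
    return True
```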