### THE @parallel DECORATOR IN SAGE

But first: how does one compute in parallel in Sage? Suppose I have written a function in a Sage environment (e.g. a SageMathCloud worksheet, .sage file, Sage console etc.) which takes in some input and returns some other input. The simple example f below takes in a number n and returns the square of that number.

sage: def f(n):

....: return n*n

....:

sage: f(2),f(3),f(5)

(4, 9, 25)

This is a fairly straightforward beast; put in a value, get a value out. But what if we have some computation that requires evaluating that function on a large number of inputs? For example, say we want to compute the sum of the first 10 million squares. If you only have one processor to tap, then you're limited to calling f over and over again in series:

sage: def g(N):

....: y = 0

....: for n in range(N+1):

....: y += f(n)

....: return y

....:

sage: %time g(10000000)

CPU times: user 17.5 s, sys: 214 ms, total: 17.7 s

Wall time: 17.6 s

333333383333335000000

In this example you could of course invoke the formula for the sum of the first $n$ squares and just write down the answer without having to add anything up, but in general you won't be so lucky. You can optimize the heck out of f, but in general when you're limited to a single processor you're confined to iterating over all the values you need to consider sequentially .

However, if you have 2 processors available one could try write code that splits the work into two roughly equal parts that can run relatively independently. For example, for our function we could compute the sum of all the even squares up to a given bound in one process and the sum of all the odd squares in another, and then add the two results together to get the sum of all square up to our bound.

Sage has a readily available mechanism to do exactly this: the

sage: @parallel(ncpus=2)

....: def h(N,parity):

....: y = 0

....: if parity=="even":

....: nums = range(0,N+1,2)

....: elif parity=="odd":

....: nums = range(1,N+1,2)

....: for n in nums:

....: y += f(n)

....: return y

Instead of calling h with its standard sequence of parameters, we can pass it a list of tuples, where each tuple is a valid sequence of inputs. Sage then sends each tuple of inputs off to an available processor and evaluates the function on them there. What's returned is a generator object that can iterate over all the outputs; we can always see the output directly by calling list() on this returned generator:

sage: for tup in list(h([(1000,"even"),(1000,"odd")])):

....: print(tup)

....:

(((1000, 'even'), {}), 167167000)

(((1000, 'odd'), {}), 166666500)

Note that the tuple of inputs is also returned. Because we're doing things in parallel, we need to know which output corresponds to which input, especially since processes may finish at different times and return order is not necessarily the same as the order of the input list.

Finally, we can write a wrapper function which calls our parallelized function and combines the returned results:

sage: def i(N):

....: y = 0

....: for output in h([(N,"even"),(N,"odd")]):

....: y += output[1]

....: return y

....:

sage: %time i(10000000)

CPU times: user 1.76 ms, sys: 33.2 ms, total: 35 ms

Wall time: 9.26 s

333333383333335000000

Note that i(10000000) produces the same output at g(10000000) but in about half the time, as the work is split between two processes instead of one. This is the basic gist of parallel computation: write code that can be partitioned into parts that can operate (relatively) independently; run those parts on different processors simultaneously; and then collect returned outputs and combine to produce desired result.

### PARALLELIZING THE ZERO SUM COMPUTATION

Let's take a look at the rank estimating zero formula again. Let $E$ be a rational elliptic curve with analytic rank $r$. Then

\begin{align*}

r < \sum_{\gamma} \text{sinc}^2(\Delta\gamma) &= \frac{1}{\pi \Delta}\left(-\eta+\log\left(\frac{\sqrt{N}}{2\pi}\right)\right) \\

&+ \frac{1}{2\pi^2\Delta^2}\left(\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right) \right) \\

&+ \frac{1}{\pi \Delta}\sum_{\substack{n = p^k \\ n < e^{2\pi\Delta}}} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)

\end{align*}

where

\begin{align*}

r < \sum_{\gamma} \text{sinc}^2(\Delta\gamma) &= \frac{1}{\pi \Delta}\left(-\eta+\log\left(\frac{\sqrt{N}}{2\pi}\right)\right) \\

&+ \frac{1}{2\pi^2\Delta^2}\left(\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right) \right) \\

&+ \frac{1}{\pi \Delta}\sum_{\substack{n = p^k \\ n < e^{2\pi\Delta}}} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)

\end{align*}

where

- $\gamma$ ranges over the nontrivial zeros of the $L$-function attached to $E$
- $\Delta$ is a positive parameter
- $\eta$ is the Euler-Mascheroni constant $=0.5772\ldots$
- $N$ is the conductor of $E$
- $\text{Li}_2(x)$ is the dilogarithm function, defined by $\text{Li}_2(x) = \sum_{k=1}^{\infty} \frac{x^k}{k^2}$
- $c_n$ is the $n$th coefficient of the logarithmic derivative of the $L$-function of $E$, which is zero when $n$ is not a perfect prime power.

The right hand side of the sum, which is what we actually compute, can be broken up into three parts: the first term involving the curve's conductor $N$; the second term involving the dilogarithm function $Li_2(x)$; and the sum over prime powers. The first two parts are quick to compute: evaluating them can basically be done in constant time regardless of the magnitude of $N$ or $\Delta$.

It is therefore not worth considering parallelizing these two components, since the prime power sum dominates computation time for all but the smallest $\Delta$ values. Instead, what I've done is rewritten the zero sum code so that the prime power sum can be evaluated using multiple cores.

As mentioned in this post, we can turn this sum into one indexed by the primes (and not the prime powers); this actually makes parallelization quite straightforward. Recall that all primes except $2$ are odd, and all primes except $2$ and $3$ are either $1$ or $5$ remainder $6$. One can scale this up: given a list of small primes $[p_1,p_2,\ldots,p_n]$, all other primes fall into one of a relatively small number of residue classes modulo $p_1 p_2\cdots p_n$. For example, all primes beyond $2$, $3$, $5$ and $7$ have one of the following 48 remainders when you divide them by $210 = 2\cdot 3\cdot 5 \cdot 7$:

\begin{align*}

&1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79,\\

&83, 89, 97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, \\

&151, 157, 163, 167, 169, 173, 179, 181, 187, 191, 193, 197, 199, 209,

\end{align*}

and no other.

If we had 48 processors available. the natural thing to do would be to get each of them to iterate over all integers in a particular residue class up to $e^{2\pi\Delta}$, evaluating the summand whenever that integer is prime, and returning the sum thereof. For example, if the bound was 1 million, then processor 1 would compute and return $\sum_{n \equiv 1 (\text{mod } 210)}^{1000000} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)$. Processor 2 would do the same with all integers that are $11$ modulo $210$, etcetera.

In reality, we have to figure out a) how many processors are available, and b) partition the work relatively equally among those processors. Thankfully sage.parallel.ncpus.ncpus() succinctly addresses the former, and the latter is achieved by splitting the residue classes into $n$ chunks of approximately equal size (where $n$ is the number of available CPUs) and then getting a given processor to evaluate the sum over those residues in a single chunk.

Here is the method I wrote that computes the $\text{sinc}^2$ zero sum with (the option of) multiple processors:

Note that I've defined a subfunction to compute the prime sum over a given subset of residue classes; this is the part that is parallelized. Obtaining the residue chunks and computing the actual summand at each prime are both farmed out to external methods.

Let's see some timings. The machine I'm running Sage on has 12 available processors:

Same answer in a ninth of the time! Note that the parallelized method has some extra overhead, so even with 12 processors we're unlikely to get a full factor of 12 speedup. Nevertheless, the speed boost will allow us to run the zero sum code with larger $\Delta$ values, allowing us to investigate elliptic curves with even larger conductors.

It is therefore not worth considering parallelizing these two components, since the prime power sum dominates computation time for all but the smallest $\Delta$ values. Instead, what I've done is rewritten the zero sum code so that the prime power sum can be evaluated using multiple cores.

As mentioned in this post, we can turn this sum into one indexed by the primes (and not the prime powers); this actually makes parallelization quite straightforward. Recall that all primes except $2$ are odd, and all primes except $2$ and $3$ are either $1$ or $5$ remainder $6$. One can scale this up: given a list of small primes $[p_1,p_2,\ldots,p_n]$, all other primes fall into one of a relatively small number of residue classes modulo $p_1 p_2\cdots p_n$. For example, all primes beyond $2$, $3$, $5$ and $7$ have one of the following 48 remainders when you divide them by $210 = 2\cdot 3\cdot 5 \cdot 7$:

\begin{align*}

&1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79,\\

&83, 89, 97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, \\

&151, 157, 163, 167, 169, 173, 179, 181, 187, 191, 193, 197, 199, 209,

\end{align*}

and no other.

If we had 48 processors available. the natural thing to do would be to get each of them to iterate over all integers in a particular residue class up to $e^{2\pi\Delta}$, evaluating the summand whenever that integer is prime, and returning the sum thereof. For example, if the bound was 1 million, then processor 1 would compute and return $\sum_{n \equiv 1 (\text{mod } 210)}^{1000000} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)$. Processor 2 would do the same with all integers that are $11$ modulo $210$, etcetera.

In reality, we have to figure out a) how many processors are available, and b) partition the work relatively equally among those processors. Thankfully sage.parallel.ncpus.ncpus() succinctly addresses the former, and the latter is achieved by splitting the residue classes into $n$ chunks of approximately equal size (where $n$ is the number of available CPUs) and then getting a given processor to evaluate the sum over those residues in a single chunk.

Here is the method I wrote that computes the $\text{sinc}^2$ zero sum with (the option of) multiple processors:

Note that I've defined a subfunction to compute the prime sum over a given subset of residue classes; this is the part that is parallelized. Obtaining the residue chunks and computing the actual summand at each prime are both farmed out to external methods.

Let's see some timings. The machine I'm running Sage on has 12 available processors:

sage: sage.parallel.ncpus.ncpus()

12

12

sage: E = EllipticCurve([12838,-51298])

sage: Z = LFunctionZeroSum(E)

sage: Z = LFunctionZeroSum(E)

sage: %time Z._zerosum_sincsquared_fast(Delta=2.6)

CPU times: user 36.7 s, sys: 62.9 ms, total: 36.8 s

Wall time: 36.8 s

2.8283629046

sage: %time Z._zerosum_sincsquared_parallel(Delta=2.6)

CPU times: user 7.87 ms, sys: 133 ms, total: 141 ms

Wall time: 4.06 s

2.8283629046

I know almost nothing about number theory. Is it clear that the numbers of primes in different residue classes are more or less the same?

ReplyDeleteAlso, I'm not sure if it matters in cython, but usually branching is quite expensive. Wouldn't it be better to just write a function _sum_over(list_of_primes)?

This is not obvious. It's not even obvious that there is at least one prime in each residue class. But, this follows from a more sophisticated version of the prime number theorem:

Deletehttp://terrytao.wordpress.com/2009/09/24/the-prime-number-theorem-in-arithmetic-progressions-and-dueling-conspiracies/

Thanks!

DeleteI've realized that perhaps a better code would be _sum_from_to(lower_bound, upper_bound) where the bounds would be approximated from the prime number theorem. The advantage being that the primes would be generated in different threads and not in only one. Of course, it all depends on the actual method used to generating primes. I see that the code currently uses a FLINT function n_is_prime(). It may be interesting to try get rid of the overhead of calling an external function. After all, Miller-Rabin test needs to be run only six times in order to be deterministic for numbers up to 10^12 (see https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test#Deterministic_variants_of_the_test).