I was person #100 to solve Project Euler problem 451 a while back. The problem and solution are detailed in the source code below. The solution was found in 28 seconds on a 20-core Xeon machine. As noted in the code, using the Chinese Remainder Theorem could help efficiency, but I don’t have that coded currently and so I iterated over potential C.R.T. candidates instead, which provided a quickly coded and self-contained solution.

  /* * Jason B. Hill / jason@jasonbhill.com * pe451.c / solves Project Euler problem 451 * build: gcc pe451.c -fopenmp -O3 -o pe451 * execute: ./pe451 * * I logged in to see that problem 451 had around 90 solutions, and decided to * attempt to be one of the first 100 solvers to enter the proper solution. As * such, I coded this a bit quickly. A "clean" solution would make use of the * Chinese Remainder Theorem, and the timing could probably be brought down * a tiny bit. But, all things considered, this runs very fast itself. * * Problem: Consider the number 15. There are eight positive integers less than * 15 which are coprime to 15: 1, 2, 4, 7, 8, 11, 13, 14. The modular inverses * of these numbers modulo 15 are: 1, 8, 4, 13, 2, 11, 7, 14 because * 1*1 mod 15=1 * 2*8=16 mod 15=1 * 4*4=16 mod 15=1 * 7*13=91 mod 15=1 * 11*11=121 mod 15=1 * 14*14=196 mod 15=1 * Let I(n) be the largest positive number m smaller than n-1 such that the * modular inverse of m modulo n equals m itself. So I(15)=11. Also I(100)=51 * and I(7)=1. Find the sum of I(n) for 3<=n<=2·10**7. * * The solution is documented in the code below. The code is in C with OpenMP. * * The result is found in 28 seconds on a 20-core Xeon and and 2:48 on a 2-core * core i7. Here is some of the output: * prime sieve created - 0.110000 seconds * list of 1270607 primes created - 0.050000 seconds * list of prime factors created for integers <= 20000000 - 5.350000 seconds * prime factor exponents computed for all integers - 0.980000 seconds */   #include <omp.h> #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <time.h>     /*****************************************************************************/ /* The following structs and routines are for integer factorization */ /*****************************************************************************/   /* * Notes on factoring: * * We need to consider the prime factorization of positive integers less than * 2*10^7. How many distinct prime factors can such numbers have? That's easy. * Since the product of the 8 smallest primes (2*3*5*7*11*13*17*19 = 9,699,690) * is less than 2*10^7 and the 9 smallest primes (9,699,690*23 = 223,092,870) * is greater than 2*10^7, we know that each positive integer less than 2*10^7 * can only have 8 distinct prime factors. Let's create a C-struct that can * keep track of this data for each factored integer. */   /* * A struct to hold factor information for an integer. * p holds a list of distinct prime factors. (at most 8) * e holds a list of exponents for the prime factors in p. * num_factors says how long the lists p and e are. */ struct factors_t { uint64_t p; // list of distinct primes uint8_t e; // list of exponents for the primes p[] uint64_t pp; // prime power factor (p[i]**e[i]) uint8_t num_factors; // how many distinct prime factors };     /* * More notes on factoring: * * Every positive integer n has a prime factor <= sqrt(n). We use this fact to * build a prime sieve. First, we construct a function to return the square * root of a uint64_t. Then, we'll use that function to create a sieve (which * is returned as a pointer/list of uint8_ts... effectively Booleans). */   /* * Return the square root of a uint64_t */ uint64_t sqrt_uint64(uint64_t n) { uint8_t shift = 1; uint64_t res, s;   while ((1<<shift) < n) shift += 1; res = (1<<((shift>>1) + 1)); while (1) { s = (n/res + res)/2; if (s >= res) return res; res = s; } }     /* * Return a prime sieve identifying all primes <= limit. * This is just a list of uint8_t's where 0 means the index is composite and * 1 means the index is prime. */ uint8_t * prime_sieve(uint64_t limit) { uint8_t *sieve; // this will be the pointer that is returned uint64_t i,j; uint64_t s = sqrt_uint64(limit);   // allocate memory for sieve sieve = malloc((limit + 1) * sizeof(uint8_t));   // set initial values in the sieve sieve = 0; sieve = 0; sieve = 1;   // set other initial odd values in sieve to 1, even values to 0. for (i=3;i<=limit;i++) { if (i%2==0) sieve[i] = 0; else sieve[i] = 1; }   // unset composite numbers (evens are already unset) for (i=3;i<=(s+1);i=i+2) { if (sieve[i]==1) { j = i*3; // sieve[i] prime, sieve[2*i] already 0 while (j<=limit) { sieve[j] = 0; j += 2*i; } } } return sieve; }     /* * Determine if a value is prime using a provided sieve. */ _Bool is_prime(uint64_t n, uint8_t * sieve) { if ((1 & sieve[n]) == 1) return 1; else return 0; }     /*****************************************************************************/ /* The following functions are specific to Project Euler problem 451 */ /*****************************************************************************/   /* * Notes: * * Given a number n, we're looking for the largest m < n-1 such that m^2 is 1 * modulo n. That's a mouthful. First, field theory tells us that we're only * interested in integers that are relatively prime with n (which is why the * problem asks for m < n-1 ... otherwise it would be a lot easier). * * Now, if n=p^k is a prime power and we have x^2 = 1 mod p^k, we consider: * a) 1 is a solution for x since 1*1=1 mod p^k = 1. * b) p^k-1 is a solution since (p^k-1)^2=p^(2k)-2p^k+1 (mod p^k) = 1. But, * that's too big since we need m < n=p^k-1. :-( * c) Any other solutions? Well... that's sort of complicated. Ugh. Let's first * consider the case when p=2. For n=p^1=2^1, we only have x=1 as a * solution. When n=p^2=2^2, we only have x=1 and x=3. For n=p^k=2^k, we * also find that 2^(k-1)+1 and 2^(k-1)-1 square to 1 modulo 2^k. For other * primes, this doesn't happen and we only have 1 and p^k-1 as solutions. * * Thus, what we're going to do is as follows: * 1) Factor each integer within the given range (factoring a range of numbers * is much faster than factoring each one individually). We'll store that * factorization information using the struct defined above. * 2) Because of step 1, determining if a number is prime or a power of a prime * is easy. In case of primes and prime powers, we consider if the prime is * even or odd, returning the appropriate maximum square root of 1 directly. * 3) When the integer is composite having more than a single distinct prime * factor, we use the Chinese Remainder Theorem "in spirit" and iterate * over potential candidates (instead of computing the C.R.T. result * directly). We're only considering possible m that are both relatively * prime with n AND such that m^2 is +1 or -1 modulo the prime power * factors of n. When a candidate does not satisfy these properties, we * simply move to the next candidate. */   /* * Compute the next candidate (the next largest number that may satisfy the * given equivalence relations). This is done relative to the largest non-2 * prime power in the factorization of n. * * n is the integer for which we're computing the largest square root. * m is the largest odd prime power factor of n. * c is the current candidate (assume uninitialized when set to 0). */ uint64_t next_candidate(uint64_t n, uint64_t m, uint64_t c) { uint64_t r;   // check if c is initialized if (c==0) c = n-1; // we can only consider values of \pm1 mod m r = c % m; if (r==m-1) return c-(m-2); else return c-2; }     /* * Determine if the current candidate is (1) relatively prime to n and, if it * is, (2) a square root of unity modulo n. We can tweak this later for timing * as the test for being relatively prime may cut performance a bit. We'll see. * * n is the integer for which we're computing the largest square root. * cnd is the current candidate. * factors is a list of factors_t structs (prime, exponent info) */ _Bool verify_candidate(uint64_t n, uint64_t cnd, struct factors_t * factors) { uint8_t i,j; uint64_t pp;   // verify that cnd is not divisible by any prime in factors[n].p[] for (i=0;i<factors[n].num_factors;i++) { if (cnd % factors[n].p[i] == 0) return 0; } // verify cnd modulo 2 when exponent of 2 in n is > 2 if (factors[n].p == 2 && factors[n].e > 2) { pp = factors[n].pp>>1; if (cnd % pp != 1 && cnd % pp != pp-1) return 0; } // verify other primes for (i=0;i<factors[n].num_factors;i++) { if (factors[n].p[i] == 2) continue; pp = factors[n].pp[i]; if (cnd % pp != 1 && cnd % pp != pp-1) return 0; } return 1; }     /* * Given a positive integer n, find the largest positive m < n-1 such that * gcd(n,m)=1 and m**2 (mod n) = 1. */ uint64_t largest_sqrt(uint64_t n, struct factors_t * factors) { uint8_t i; uint32_t j;   // if n is an odd prime, or a power of an odd prime, return 1 if (factors[n].num_factors == 1 && factors[n].p != 2) return 1;   // if n is a power of 2 if (factors[n].num_factors == 1) { // if n is 2 or 4 if (factors[n].e < 3) return 1; // if n is 2**e for e >= 3 return (factors[n].pp>>1)+1; }   // find the maximum odd prime power factor of n uint64_t pp = 1; uint64_t cnd;   for (i=0;i<factors[n].num_factors;i++) { if (factors[n].p[i] != 2 && factors[n].pp[i] > pp) { pp = factors[n].pp[i]; } }   // get the first candidate w.r.t. moppf cnd = next_candidate(n, pp, 0);   while (!verify_candidate(n, cnd, factors)) cnd = next_candidate(n, pp, cnd);   return cnd; }     /*****************************************************************************/ /* Execute */ /*****************************************************************************/ int main() { uint64_t s = 0; // final output value uint64_t i,j,k; // iterators uint64_t limit = 20000000; // upper bound for computations uint64_t num_primes = 0; // count for primes <= limit uint8_t e; // exponents for prime factoring int tid; // thread id for openmp double time_count; // timer clock_t start, start_w; // time variables   uint8_t *sieve; // prime sieve uint64_t *primes; // list of primes struct factors_t *factors; // prime factors and exponents     /* start outer timer */ start_w = clock();     /* make the prime sieve */ start = clock(); sieve = prime_sieve(limit); time_count = (double)(clock() - start) / CLOCKS_PER_SEC; printf("prime sieve created - %f seconds\n", time_count);     /* form list of primes from sieve */ start = clock(); // compute the number of primes in sieve for (i=2;i<=limit;i++) { if (is_prime(i, sieve)) { num_primes = num_primes + 1; } } primes = malloc(num_primes * sizeof(uint64_t)); j=0; for (i=2;i<=limit;i++) { if (is_prime(i, sieve)) { primes[j] = i; j++; } } time_count = (double)(clock() - start) / CLOCKS_PER_SEC; printf("list of %llu primes created - %f seconds\n", num_primes, time_count);     /* fill out a factors_t struct for each integer below limit */ start = clock(); // allocate memory for factors_t factors = malloc(sizeof(struct factors_t) * (limit + 1)); // set the initial number of factors for each number to 0 for (i=1;i<=limit;i++) factors[i].num_factors=0; // for each prime, add that prime as a factor to each of its multiples for (i=0;i<num_primes;i++) { j = 1; // start at 1*p for each prime p while (j*primes[i]<=limit) { k = factors[j*primes[i]].num_factors; // get proper index for p factors[j*primes[i]].p[k] = primes[i]; // add prime to p factors[j*primes[i]].num_factors++; // increase index j++; // increase multiple of prime } } time_count = (double)(clock() - start) / CLOCKS_PER_SEC; printf("list of prime factors created for integers <= %llu - %f seconds\n", limit, time_count);     /* compute exponents for each prime in factor lists */ start = clock(); for (i=2;i<=limit;i++) { for (j=0;j<factors[i].num_factors;j++) { e=1; k=factors[i].p[j]; while (i % (k*factors[i].p[j]) == 0) { e++; k*=factors[i].p[j]; } factors[i].e[j] = e; factors[i].pp[j] = k; } } time_count = (double)(clock() - start) / CLOCKS_PER_SEC; printf("prime factor exponents computed for all integers - %f seconds\n", time_count);     /* find largest square root of unity for each integer under limit; add to s */ #pragma omp parallel for reduction(+:s) shared(factors) schedule(dynamic,1000) for (i=3;i<=limit;i++) { s = s+largest_sqrt(i, factors); } time_count = (double)(clock() - start) / CLOCKS_PER_SEC; printf("result for 3<=i<=%llu: %llu - %f seconds\n", limit, s, time_count);     time_count = (double)(clock() - start_w) / CLOCKS_PER_SEC; printf("\ntotal time: %f seconds\n\n", time_count);   free(sieve); free(primes); free(factors);   return 0; }

A triangular pyramid is constructed using spherical balls so that each ball rests on exactly three balls of the next lower level. Then, we calculate the number of paths leading from the apex to each position: A path starts at the apex and progresses downwards to any of the three spheres directly below the current position. Consequently, the number of paths to reach a certain position is the sum of the numbers immediately above it (depending on the position, there are up to three numbers above it).

The result is Pascal’s pyramid and the numbers at each level n are the coefficients of the trinomial expansion $(x + y + z)^n$. How many coefficients in the expansion of $(x + y + z)^{200000}$ are multiples of $10^{12}$?

## Solution Using the Multinomial Theorem

The generalization of the binomial theorem is the multinomial theorem. It says that multinomials raised to exponents can be expanded using the formula
$(x_1+x_2+\cdots+x_m)^n=\sum_{{k_1+k_2+\cdots+k_m=n}\atop{0\le k_i\le n}}\left({n}\atop{k_1,k_2,\ldots,k_m}\right)\prod_{1\le t\le m}x_t^{k_t}$
where
$\left({n}\atop{k_1,k_2,\ldots,k_m}\right)=\frac{n!}{k_1!k_2!\cdots k_m!}.$
Of course, when m=2 this gives the binomial theorem. The sum is taken over all partitions $k_1+k_2+\cdots+k_m=n$ for integers $k_i$. If n=200000 abd m=3, then the terms in the expansion are given by
$\left({200000}\atop{k_1,k_2,k_3}\right)x_1^{k_1}x_2^{k_2}x_3^{k_3}=\frac{200000!}{k_1!k_2!k_3!}x_1^{k_1}x_2^{k_2}x_3^{k_3}$
where $k_1+k_2+k_3=200000$. It’s worth pointing out that each of the coefficients is an integer, and thus has a unique factorization into products of prime integers. Of course, there’s no way that we’re going to calculate these coefficients. We only need to know when they’re divisible by $10^{12}$. Thus, we only need to consider how many factors of 2 and 5 are involved.

First, we’ll create a function $p(n,d)$ that outputs how many factors of $d$ are included in $n!$. We have that
$p(n,d)=\left\lfloor\frac{n}{d}\right\rfloor+\left\lfloor\frac{n}{d^2}\right\rfloor+\left\lfloor\frac{n}{d^3}\right\rfloor+ \cdots+\left\lfloor\frac{n}{d^r}\right\rfloor,$
where $d^r$ is the highest power of $d$ dividing $n$. For instance, there are 199994 factors of 2 in 200000!. Since we’re wondering when our coefficients are divisible by $10^{12}=2^{12}5^{12}$, we’ll be using the values provided by $p(n,d)$ quite a bit for $d=2$ and $d=5$. We’ll store two lists:
$p2=[p(i,2)\text{ for }1\le i\le 200000]\quad\text{and}\quad p5=[p(i,5)\text{ for }1\le i\le 200000].$
For a given $k_1,k_2,k_3$, the corresponding coefficient is divisible by $10^{12}$ precisely when
$p2[k_1]+p2[k_2]+p2[k_3]<199983\ \text{and}\ p5[k_1]+p5[k_2]+p5[k_3]<49987.$
That is, this condition ensures that there are at least 12 more factors of 2 and 5 in the numerator of the fraction defining the coefficients.

Now, we know that $k_1+k_2+k_3=200000$, and we can exploit symmetry and avoid redundant computations if we assume $k_1\le k_2\le k_3$. Under this assumption, we always have
$k_1\le\left\lfloor\frac{200000}{3}\right\rfloor=66666.$
We know that $k_1+k_2+k_3=200000$ is impossible since 200000 isn't divisible by 3. It follows that we can only have (case 1) $k_1=k_2 < k_3$, or (case 2) $k_1 < k_2=k_3$, or (case 3) $k_1 < k_2 < k_3$.

In case 1, we iterate $0\le k_1\le 66666$, setting $k_2=k_1$ and $k_3=200000-k_1-k_2$. We check the condition, and when it is satisfied we record 3 new instances of coefficients (since we may permute the $k_i$ in 3 ways).

In case 2, we iterate $0\le k_1\le 66666$, and when $k_1$ is divisible by 2 we set $k_2=k_3=\frac{200000-k_1}{2}$. When the condition holds, we again record 3 new instance.

In case 3, we iterate $0\le k_1\le 66666$, and we iterate over $k_2=k_1+a$ where $1\le a < \left\lfloor\frac{200000-3k_1}{2}\right\rfloor$. Then $k_3=200000-k_1-k_2$. When the condition holds, we record 6 instances (since there are 6 permutations of 3 objects).

## Cython Solution

I’ll provide two implementations, the first written in Cython inside Sage. Then, I’ll write a parallel solution in C.

%cython   import time from libc.stdlib cimport malloc, free   head_time = time.time()   cdef unsigned long p(unsigned long k, unsigned long d): cdef unsigned long power = d cdef unsigned long exp = 0 while power <= k: exp += k / power power *= d return exp   cdef unsigned long * p_list(unsigned long n, unsigned long d): cdef unsigned long i = 0 cdef unsigned long * powers = <unsigned long *>malloc((n+1)*sizeof(unsigned long)) while i <= n: powers[i] = p(i,d) i += 1 return powers     run_time = time.time()   # form a list of number of times each n! is divisible by 2. cdef unsigned long * p2 = p_list(200000,2)   # form a list of number of times each n! is divisible by 5. cdef unsigned long * p5 = p_list(200000,5)   cdef unsigned long k1, k2, k3, a cdef unsigned long long result = 0   k1 = 0 while k1 <= 66666: # case 1: k1 = k2 < k3 k2 = k1 k3 = 200000 - k1 - k2 if 199982 >= (p2[k1]+p2[k2]+p2[k3]) and 49986 >= (p5[k1]+p5[k2]+p5[k3]): result += 3 # case 2: k1 < k2 = k3 if k1 % 2 == 0: k2 = (200000 - k1)/2 k3 = k2 if 199982 >= (p2[k1]+p2[k2]+p2[k3]) and 49986 >= (p5[k1]+p5[k2]+p5[k3]): result += 3 # case 3: k1 < k2 < k3 a = 1 while 2*a < (200000 - 3*k1): k2 = k1 + a k3 = 200000 - k1 - k2 if 199982 >= (p2[k1]+p2[k2]+p2[k3]) and 49986 >= (p5[k1]+p5[k2]+p5[k3]): result += 6 a += 1 k1 += 1     free(p2) free(p5)     elapsed_run = round(time.time() - run_time, 5) elapsed_head = round(time.time() - head_time, 5)   print "Result: %s" % result print "Runtime: %s seconds (total time: %s seconds)" % (elapsed_run, elapsed_head)

When executed, we find the correct result relatively quickly.

Result: 479742450 Runtime: 14.62538 seconds (total time: 14.62543 seconds)

## C with OpenMP Solution

#include <stdio.h> #include <stdlib.h> #include <malloc.h> #include <omp.h>   /*****************************************************************************/ /* function to determine how many factors of 'd' are in 'k!' */ /*****************************************************************************/ unsigned long p(unsigned long k, unsigned long d) { unsigned long power = d; unsigned long exp = 0; while (power <= k) { exp += k/power; power *= d; } return exp; }   /*****************************************************************************/ /* create a list [p(0,d),p(1,d),p(2,d), ... ,p(n,d)] and return pointer */ /*****************************************************************************/ unsigned long * p_list(unsigned long n, unsigned long d) { unsigned long i; unsigned long * powers = malloc((n+1)*sizeof(unsigned long)); for (i=0;i<=n;i++) powers[i] = p(i,d); return powers; }   /*****************************************************************************/ /* main */ /*****************************************************************************/ int main(int argc, char **argv) { unsigned long k1, k2, k3, a; unsigned long long result = 0;   unsigned long * p2 = p_list(200000, 2); unsigned long * p5 = p_list(200000, 5);     #pragma omp parallel for private(k1,k2,k3,a) reduction(+ : result) for (k1=0;k1<66667;k1++) { // case 1: k1 = k2 < k3 k2 = k1; k3 = 200000 - k1 - k2; if (p2[k1]+p2[k2]+p2[k3]<199983 && p5[k1]+p5[k2]+p5[k3]<49987) { result += 3; } // case 2: k1 < k2 = k3 if (k1 % 2 == 0) { k2 = (200000 - k1)/2; k3 = k2; if (p2[k1]+p2[k2]+p2[k3]<199983 && p5[k1]+p5[k2]+p5[k3]<49987) { result += 3; } } // case 3: k1 < k2 < k3 for (a=1;2*a<(200000-3*k1);a++) { k2 = k1 + a; k3 = 200000 - k1 - k2; if (p2[k1]+p2[k2]+p2[k3]<199983 && p5[k1]+p5[k2]+p5[k3]<49987) { result += 6; } } }   free(p2); free(p5);   printf("result: %lld\n", result);   return 0; }

This can be compiled and optimized using GCC as follows.

$gcc -O3 -fopenmp -o problem-154-omp problem-154-omp.c When executed on a 16-core machine, we get the following result. $ time ./problem-154-omp result: 479742450   real 0m1.487s

This appears to be the fastest solution currently known, according to the forum of solutions on Project Euler. The CPUs on the 16-core machine are pretty weak compared to modern standards. When running on a single core on a new Intel Core i7, the result is returned in about 4.7 seconds.

This problem was posted on the GAP (Groups Algorithms & Programming) Forum some time ago. Roughly a week later, this partial solution was posted.

Knowing that GAP runs on an interpreted language on top of a C kernel, I thought I may be able to do better with C. After prototyping the situation in Python, my fears were realized: The digits associated with the orbit are HUGE. So, I decided to code in C using the GMP library and OpenMP. The main idea is that the orbit can be computed in parallel, going forwards and backwards until the orbit points agree. Not only will this cut the computation time roughly in half (when compared to a non-parallel C/GMP solution), but it should go much faster than GAP. And indeed, it does. My post to the forum announcing the first known solution can be found here. The maximum digit length found in the orbit is 76,785.

### C/GMP/OpenMP Code

Here is the code. It requires the GNU MP Bignum Library (not in GCC) and OpenMP (in GCC). When the orbit points are within a specific digit length difference, only a single core continues the computation. Otherwise, both cores continue to compute the orbit in opposite directions. (There’s no makefile. I’ll give compiler instructions below.)

// Jason B. Hill // Jason.B.Hill@Colorado.edu // www.jasonbhill.com   #include <stdio.h> #include <stdlib.h> #include <gmp.h> #include <omp.h>   //#define p 32 //#define p 736 //#define p 25952 #define p 173176   /*****************************************************************************/ /* Transpositions a,b,c and compositions g,g^-1 */ /*****************************************************************************/   void a(mpz_t omega) { if(mpz_even_p(omega)==1) mpz_add_ui(omega, omega, 1); else mpz_sub_ui(omega, omega, 1); }   void b(mpz_t omega) { if(mpz_congruent_ui_p(omega, 0, 5)==1) mpz_add_ui(omega, omega, 4); else if(mpz_congruent_ui_p(omega, 4, 5)==1) mpz_sub_ui(omega, omega, 4); }   void c(mpz_t omega) { if(mpz_congruent_ui_p(omega, 1, 4)==1) { mpz_sub_ui(omega, omega, 1); mpz_divexact_ui(omega, omega, 4); mpz_mul_ui(omega, omega, 6); } else if(mpz_congruent_ui_p(omega, 0, 6)==1) { mpz_divexact_ui(omega, omega, 6); mpz_mul_ui(omega, omega, 4); mpz_add_ui(omega, omega, 1); } }   void g(mpz_t omega) { a(omega); b(omega); c(omega); }   void ginv(mpz_t omega) { c(omega); b(omega); a(omega); }   /*****************************************************************************/ /* Main */ /*****************************************************************************/   int main(void) { unsigned long n = p; unsigned long long i0 = 0; unsigned long long i1 = 0; int th_id; _Bool sstop = 0; size_t s = 0; size_t c0, c1; mpz_t omega0, omega1;   omp_set_num_threads(2);   mpz_init(omega0); mpz_init(omega1);   mpz_set_ui(omega0, n); mpz_set_ui(omega1, n);   c0 = mpz_sizeinbase(omega0, 10); c1 = mpz_sizeinbase(omega1, 10);   #pragma omp parallel private(th_id) \ shared(omega0,omega1,s,sstop,c0,c1,i0,i1) { th_id = omp_get_thread_num();   if(th_id == 1) { g(omega1); i1++; if(mpz_cmp(omega0,omega1)==0) sstop = 1; }   #pragma omp barrier   while(!sstop) { if(th_id == 0) { if(abs(c0 - c1) > 20) { ginv(omega0); c0 = mpz_sizeinbase(omega0, 10); i0++; } } else if(th_id == 1) { if(abs(c0 - c1) > 5) { g(omega1); c1 = mpz_sizeinbase(omega1, 10); i1++; } else { if(mpz_cmp(omega0,omega1)==0) sstop = 1; else { g(omega1); c1 = mpz_sizeinbase(omega1, 10); i1++; } } }   #pragma omp flush(sstop,c0,c1,i0,i1)   if(th_id == 0) { if(c0 > s) { s = c0; printf("Core 0: digit length increased to %ld\n", s); printf("iterations: %lld (core0) %lld (core1) %lld (total)\ \n\n",i0,i1,i0+i1); } if(c1 > s) { s = c1; printf("Core 1: digit length increased to %ld\n", s); printf("iterations: %lld (core0) %lld (core1) %lld (total)\ \n\n",i0,i1,i0+i1); } if((i0+i1)%100000000==0) { printf("digit length: %ld\n", s); printf("iterations: %lld (core0) %lld (core1) %lld (total)\ \n\n",i0,i1,i0+i1); } } } }   printf("total iterations: %lld\n",i0+i1); mpz_clear(omega0); mpz_clear(omega1);   return 0; }

Using GCC with OpenMP and GMP, one can compile the code on a multicore machine with, for example, the following command.

gcc -O3 -o length-residue-orbit-omp length-residue-orbit-omp.c -lgmp -fopenmp

### Success!

The result: total iterations: 47610700792, is returned in roughly 3.1 days on a 2.9 GHz 3rd generation Intel Core i7 (i7-3520M).

### Problem

Let $d(n)$ be defined as the sum of proper divisors of $n$ (numbers less than $n$ which divide evenly into $n$).
If $d(a) = b$ and $d(b) = a$, where $a\neq b$, then $a$ and $b$ are an amicable pair and each of a and b are called amicable numbers.

For example, the proper divisors of 220 are 1, 2, 4, 5, 10, 11, 20, 22, 44, 55 and 110; therefore $d(220) = 284$. The proper divisors of 284 are 1, 2, 4, 71 and 142; so $d(284) = 220$.

Evaluate the sum of all the amicable numbers under 10000.

### Python Solution

I used the idea of creating a list and populating that list with values corresponding to the sum of divisors of numbers, and then I walk through the list and do some index and value checking.

import time   def sum_divisors(n): s = 0 for i in range(1,n): if n % i == 0: s += i return s   def amicable_pairs_xrange(low,high): L = [sum_divisors(i) for i in range(low,high + 1)] pairs = [] for i in range(high - low + 1): ind = L[i] if i + low < ind and low <= ind and ind <= high and L[ind - low] == i + low: pairs.append([i+low,ind]) return pairs   def sum_pairs(pairs): return sum([sum(pair) for pair in pairs])   start = time.time()   ans = sum_pairs(amicable_pairs_xrange(1,10000))   elapsed = time.time() - start   print("%s found in %s seconds") % (ans,elapsed)

This returns the correct result.

31626 found in 7.86725997925 seconds

### Cython Solution

One of the things I love about using Cython inside the Sage Notebook environment is that one can simply place a Cython statement inside Python code and often achieve some sort of speedup. All I’ve done in this first example is to do just that. I haven’t change the Python code at all yet.

%cython   import time   def sum_divisors(n): s = 0 for i in range(1,n): if n % i == 0: s += i return s   def amicable_pairs_xrange(low,high): L = [sum_divisors(i) for i in range(low,high + 1)] pairs = [] for i in range(high - low + 1): ind = L[i] if i + low < ind and low <= ind and ind <= high and L[ind - low] == i + low: pairs.append([i+low,ind]) return pairs   def sum_pairs(pairs): return sum([sum(pair) for pair in pairs])   start = time.time()   ans = sum_pairs(amicable_pairs_xrange(1,10000))   elapsed = time.time() - start   print("%s found in %s seconds") % (ans,elapsed)

This runs faster than the original, and that blows my mind.

31626 found in 4.98208904266 seconds

We can improve on this obviously, by rewriting the involved functions using C datatypes.

%cython   import time   cdef sum_divisors(unsigned int n): cdef unsigned int s = 0, i = 1 while i < n: if n % i == 0: s += i i += 1 return s   cdef sum_amicable_pairs(unsigned int low, unsigned int high): cdef unsigned int a = low, b, sum = 0 while a <= high: b = sum_divisors(a) if b > a and sum_divisors(b) == a: sum += a + b a += 1   return sum   start = time.time()   ans = sum_amicable_pairs(1,10000)   elapsed = time.time() - start   print("%s found in %s seconds") % (ans,elapsed)

We do achieve somewhat significant speedup now.

31626 found in 1.06677889824 seconds

### C Solution

If I take the fast Cython version and rewrite it in C, I don’t get much better performance.

#include <stdio.h> #include <stdlib.h>   unsigned int sum_divisors(unsigned int n) { unsigned int s = 0, i = 1; while(i < n) { if(n % i == 0) s = s + i; i++; } return s; }   unsigned int sum_amicable_pairs(unsigned int low, unsigned int high) { unsigned int a = low, b, sum = 0; while(a <= high) { b = sum_divisors(a); if(b > a && sum_divisors(b) == a) sum = sum + a + b; a++; } return sum; }   int main(int argc, char **argv) { unsigned int ans = sum_amicable_pairs(1,10000); printf("result: %d\n",ans); return 0; }

The result, compiled with heavy optimization, is only slightly faster than the Cython code.

$gcc -O3 -o problem-21 problem-21.c$ time ./problem-21 result: 31626   real 0m1.032s

About the only way to really improve this, beyond restructuring the algorithm, is to use OpenMP to run the process in parallel.

### C with OpenMP

A parallel version using shared memory via OpenMP looks something like this. (I’ve rewritten it slightly.)

#include <stdio.h> #include <stdlib.h> #include <omp.h>   unsigned int sum_divisors(unsigned int n) { unsigned int s = 0, i = 1; while(i < n) { if(n % i == 0) s = s + i; i++; } return s; }   int main(int argc, char **argv) { unsigned int a, b, nthreads, result = 0;   #pragma omp parallel nthreads = omp_get_num_threads(); printf("no of threads %d\n", nthreads);   #pragma omp parallel for reduction(+:result) private(b) for(a=0; a < 10000; a++) { b = sum_divisors(a); if(b > a && sum_divisors(b) == a) result = result + a + b; }   printf("result: %d\n",result); return 0; }

Now, we get much better performance. I’m running this on a 16-core machine, where 3 of the cores are currently being used.

$gcc -fopenmp -O3 -o problem-21-openmp problem-21-openmp.c$ time ./problem-21-openmp no of threads 16 result: 31626   real 0m0.140s