I was person #100 to solve Project Euler problem 451 a while back. The problem and solution are detailed in the source code below. The solution was found in 28 seconds on a 20-core Xeon machine. As noted in the code, using the Chinese Remainder Theorem could help efficiency, but I don’t have that coded currently and so I iterated over potential C.R.T. candidates instead, which provided a quickly coded and self-contained solution.

 
/*
 * Jason B. Hill / jason@jasonbhill.com
 * pe451.c / solves Project Euler problem 451
 * build: gcc pe451.c -fopenmp -O3 -o pe451
 * execute: ./pe451
 *
 * I logged in to see that problem 451 had around 90 solutions, and decided to
 * attempt to be one of the first 100 solvers to enter the proper solution. As
 * such, I coded this a bit quickly. A "clean" solution would make use of the
 * Chinese Remainder Theorem, and the timing could probably be brought down
 * a tiny bit. But, all things considered, this runs very fast itself.
 *
 * Problem: Consider the number 15. There are eight positive integers less than
 * 15 which are coprime to 15: 1, 2, 4, 7, 8, 11, 13, 14. The modular inverses
 * of these numbers modulo 15 are: 1, 8, 4, 13, 2, 11, 7, 14 because
 * 1*1 mod 15=1
 * 2*8=16 mod 15=1
 * 4*4=16 mod 15=1
 * 7*13=91 mod 15=1
 * 11*11=121 mod 15=1
 * 14*14=196 mod 15=1
 * Let I(n) be the largest positive number m smaller than n-1 such that the
 * modular inverse of m modulo n equals m itself. So I(15)=11. Also I(100)=51
 * and I(7)=1. Find the sum of I(n) for 3<=n<=2·10**7.
 *
 * The solution is documented in the code below. The code is in C with OpenMP.
 *
 * The result is found in 28 seconds on a 20-core Xeon and in 2:48 on a 2-core
 * Core i7. Here is some of the output:
 * prime sieve created - 0.110000 seconds
 * list of 1270607 primes created - 0.050000 seconds
 * list of prime factors created for integers <= 20000000 - 5.350000 seconds
 * prime factor exponents computed for all integers - 0.980000 seconds
 */
 
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
 
 
/*****************************************************************************/
/* The following structs and routines are for integer factorization          */
/*****************************************************************************/
 
/*
 * Notes on factoring:
 *
 * We need to consider the prime factorization of positive integers less than
 * 2*10^7. How many distinct prime factors can such numbers have? That's easy.
 * Since the product of the 8 smallest primes (2*3*5*7*11*13*17*19 = 9,699,690)
 * is less than 2*10^7 and the product of the 9 smallest primes
 * (9,699,690*23 = 223,092,870) is greater than 2*10^7, we know that each
 * positive integer less than 2*10^7 can have at most 8 distinct prime
 * factors. Let's create a C-struct that can
 * keep track of this data for each factored integer.
 */
 
/*
 * A struct to hold factor information for an integer.
 * p holds a list of distinct prime factors. (at most 8)
 * e holds a list of exponents for the prime factors in p.
 * num_factors says how long the lists p and e are.
 */
struct factors_t {
    uint64_t    p[8]; // list of distinct primes
    uint8_t     e[8]; // list of exponents for the primes p[]
    uint64_t    pp[8]; // prime power factor (p[i]**e[i])
    uint8_t     num_factors; // how many distinct prime factors
};
 
 
/*
 * More notes on factoring:
 *
 * Every positive integer n has a prime factor <= sqrt(n). We use this fact to
 * build a prime sieve. First, we construct a function to return the square
 * root of a uint64_t. Then, we'll use that function to create a sieve (which
 * is returned as a pointer/list of uint8_ts... effectively Booleans).
 */
 
/*
 * Return the square root of a uint64_t
 */ 
uint64_t sqrt_uint64(uint64_t n) {
    uint8_t             shift = 1;
    uint64_t            res, s;
 
    while ((1ULL<<shift) < n) shift += 1;
    res = (1ULL<<((shift>>1) + 1));
    while (1) {
        s = (n/res + res)/2;
        if (s >= res) return res;
        res = s;
    }
}
 
 
/*
 * Return a prime sieve identifying all primes <= limit.
 * This is just a list of uint8_t's where 0 means the index is composite and
 * 1 means the index is prime.
 */ 
uint8_t * prime_sieve(uint64_t limit) {
    uint8_t            *sieve; // this will be the pointer that is returned
    uint64_t            i,j;
    uint64_t            s = sqrt_uint64(limit);
 
    // allocate memory for sieve
    sieve = malloc((limit + 1) * sizeof(uint8_t));
 
    // set initial values in the sieve
    sieve[0] = 0; sieve[1] = 0; sieve[2] = 1;
 
    // set other initial odd values in sieve to 1, even values to 0.
    for (i=3;i<=limit;i++) {
        if (i%2==0) sieve[i] = 0;
        else sieve[i] = 1;
    }
 
    // unset composite numbers (evens are already unset)
    for (i=3;i<=(s+1);i=i+2) {
        if (sieve[i]==1) {
            j = i*3; // sieve[i] prime, sieve[2*i] already 0
            while (j<=limit) {
                sieve[j] = 0;
                j += 2*i;
            }
        }
    }
    return sieve;
}
 
 
/*
 * Determine if a value is prime using a provided sieve.
 */
_Bool is_prime(uint64_t n, uint8_t * sieve) {
    if ((1 & sieve[n]) == 1) return 1;
    else return 0;
}
 
 
/*****************************************************************************/
/* The following functions are specific to Project Euler problem 451         */
/*****************************************************************************/
 
/*
 * Notes:
 *
 * Given a number n, we're looking for the largest m < n-1 such that m^2 is 1
 * modulo n. That's a mouthful. First, field theory tells us that we're only
 * interested in integers that are relatively prime with n (which is why the
 * problem asks for m < n-1 ... otherwise it would be a lot easier).
 *
 * Now, if n=p^k is a prime power and we have x^2 = 1 mod p^k, we consider:
 * a) 1 is a solution for x since 1*1=1 mod p^k = 1.
 * b) p^k-1 is a solution since (p^k-1)^2=p^(2k)-2p^k+1 (mod p^k) = 1. But,
 *    that's too big since we need m < n-1 = p^k-1. :-(
 * c) Any other solutions? Well... that's sort of complicated. Ugh. Let's first
 *    consider the case when p=2. For n=p^1=2^1, we only have x=1 as a
 *    solution. When n=p^2=2^2, we only have x=1 and x=3. For n=p^k=2^k, we
 *    also find that 2^(k-1)+1 and 2^(k-1)-1 square to 1 modulo 2^k. For other
 *    primes, this doesn't happen and we only have 1 and p^k-1 as solutions.
 *
 * Thus, what we're going to do is as follows:
 * 1) Factor each integer within the given range (factoring a range of numbers
 *    is much faster than factoring each one individually). We'll store that
 *    factorization information using the struct defined above.
 * 2) Because of step 1, determining if a number is prime or a power of a prime
 *    is easy. In case of primes and prime powers, we consider if the prime is
 *    even or odd, returning the appropriate maximum square root of 1 directly.
 * 3) When the integer is composite having more than a single distinct prime
 *    factor, we use the Chinese Remainder Theorem "in spirit" and iterate
 *    over potential candidates (instead of computing the C.R.T. result
 *    directly). We're only considering possible m that are both relatively
 *    prime with n AND such that m^2 is +1 or -1 modulo the prime power
 *    factors of n. When a candidate does not satisfy these properties, we
 *    simply move to the next candidate.
 */
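 
/*
 * Worked example (n=100): 100 = 4*25. Since the exponent of 2 is only 2, a
 * candidate just needs to be odd, and it must be +-1 modulo 25. The four
 * square roots of 1 mod 100 are therefore 1, 49, 51 and 99, and the largest
 * one below n-1 = 99 is 51 = I(100), matching the value quoted in the
 * problem statement above.
 */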
 
/*
 * Compute the next candidate (the next largest number that may satisfy the
 * given equivalence relations). This is done relative to the largest non-2
 * prime power in the factorization of n.
 *
 * n is the integer for which we're computing the largest square root.
 * m is the largest odd prime power factor of n.
 * c is the current candidate (assume uninitialized when set to 0).
 */
uint64_t next_candidate(uint64_t n, uint64_t m, uint64_t c) {
    uint64_t            r;
 
    // check if c is initialized
    if (c==0) c = n-1;
    // we can only consider values of \pm1 mod m
    r = c % m;
    if (r==m-1) return c-(m-2);
    else return c-2;
}
 
 
/*
 * Determine if the current candidate is (1) relatively prime to n and, if it
 * is, (2) a square root of unity modulo n. We can tweak this later for timing
 * as the test for being relatively prime may cut performance a bit. We'll see.
 *
 * n is the integer for which we're computing the largest square root.
 * cnd is the current candidate.
 * factors is a list of factors_t structs (prime, exponent info)
 */
_Bool verify_candidate(uint64_t n, uint64_t cnd, struct factors_t * factors) {
    uint8_t             i,j; 
    uint64_t            pp;
 
    // verify that cnd is not divisible by any prime in factors[n].p[]
    for (i=0;i<factors[n].num_factors;i++) {
        if (cnd % factors[n].p[i] == 0) return 0;
    }
    // verify cnd modulo 2 when exponent of 2 in n is > 2
    if (factors[n].p[0] == 2 && factors[n].e[0] > 2) {
        pp = factors[n].pp[0]>>1;
        if (cnd % pp != 1 && cnd % pp != pp-1) return 0;
    }
    // verify other primes
    for (i=0;i<factors[n].num_factors;i++) {
        if (factors[n].p[i] == 2) continue;
        pp = factors[n].pp[i];
        if (cnd % pp != 1 && cnd % pp != pp-1) return 0;
    }
    return 1;
}
 
 
/*
 * Given a positive integer n, find the largest positive m < n-1 such that
 * gcd(n,m)=1 and m**2 (mod n) = 1.
 */
uint64_t largest_sqrt(uint64_t n, struct factors_t * factors) {
    uint8_t             i;
    uint32_t            j;
 
    // if n is an odd prime, or a power of an odd prime, return 1
    if (factors[n].num_factors == 1 && factors[n].p[0] != 2) return 1;
 
    // if n is a power of 2
    if (factors[n].num_factors == 1) {
        // if n is 2 or 4
        if (factors[n].e[0] < 3) return 1;
        // if n is 2**e for e >= 3
        return (factors[n].pp[0]>>1)+1;
    }
 
    // find the maximum odd prime power factor of n
    uint64_t            pp = 1;
    uint64_t            cnd;
 
    for (i=0;i<factors[n].num_factors;i++) {
        if (factors[n].p[i] != 2 && factors[n].pp[i] > pp) {
            pp = factors[n].pp[i];
        }
    }
 
    // get the first candidate w.r.t. moppf
    cnd = next_candidate(n, pp, 0);
 
    while (!verify_candidate(n, cnd, factors)) cnd = next_candidate(n, pp, cnd);
 
    return cnd;
}
 
 
/*****************************************************************************/
/* Execute                                                                   */
/*****************************************************************************/
int main() {
    uint64_t            s = 0;              // final output value
    uint64_t            i,j,k;              // iterators
    uint64_t            limit = 20000000;   // upper bound for computations
    uint64_t            num_primes = 0;     // count for primes <= limit
    uint8_t             e;                  // exponents for prime factoring
    int                 tid;                // thread id for openmp
    double              time_count;         // timer
    clock_t             start, start_w;     // time variables
 
    uint8_t            *sieve;              // prime sieve
    uint64_t           *primes;             // list of primes
    struct factors_t   *factors;            // prime factors and exponents
 
 
    /* start outer timer */
    start_w = clock();
 
 
    /* make the prime sieve */
    start = clock();
    sieve = prime_sieve(limit);
    time_count = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("prime sieve created - %f seconds\n", time_count);
 
 
    /* form list of primes from sieve */
    start = clock();
    // compute the number of primes in sieve
    for (i=2;i<=limit;i++) {
        if (is_prime(i, sieve)) {
            num_primes = num_primes + 1;
        }
    }
    primes = malloc(num_primes * sizeof(uint64_t));
    j=0;
    for (i=2;i<=limit;i++) {
        if (is_prime(i, sieve)) {
            primes[j] = i;
            j++;
        }
    }
    time_count = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("list of %llu primes created - %f seconds\n", num_primes, time_count);
 
 
    /* fill out a factors_t struct for each integer below limit */
    start = clock();
    // allocate memory for factors_t
    factors = malloc(sizeof(struct factors_t) * (limit + 1));
    // set the initial number of factors for each number to 0
    for (i=1;i<=limit;i++) factors[i].num_factors=0;
    // for each prime, add that prime as a factor to each of its multiples
    for (i=0;i<num_primes;i++) {
        j = 1; // start at 1*p for each prime p
        while (j*primes[i]<=limit) {
            k = factors[j*primes[i]].num_factors; // get proper index for p
            factors[j*primes[i]].p[k] = primes[i]; // add prime to p
            factors[j*primes[i]].num_factors++; // increase index
            j++; // increase multiple of prime
        }
    }
    time_count = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("list of prime factors created for integers <= %llu - %f seconds\n", limit, time_count);
 
 
    /* compute exponents for each prime in factor lists */
    start = clock();
    for (i=2;i<=limit;i++) {
        for (j=0;j<factors[i].num_factors;j++) {
            e=1; k=factors[i].p[j];
            while (i % (k*factors[i].p[j]) == 0) {
                e++;
                k*=factors[i].p[j];
            }
            factors[i].e[j] = e;
            factors[i].pp[j] = k;
        }
    }
    time_count = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("prime factor exponents computed for all integers - %f seconds\n", time_count);
 
 
    /* find largest square root of unity for each integer under limit; add to s */
    start = clock(); // note: with OpenMP, clock() accumulates CPU time across threads
    #pragma omp parallel for reduction(+:s) shared(factors) schedule(dynamic,1000)
    for (i=3;i<=limit;i++) {
        s = s+largest_sqrt(i, factors);
    }
    time_count = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("result for 3<=i<=%llu: %llu - %f seconds\n", limit, s, time_count);
 
 
    time_count = (double)(clock() - start_w) / CLOCKS_PER_SEC;
    printf("\ntotal time: %f seconds\n\n", time_count);
 
    free(sieve);
    free(primes);
    free(factors);
 
    return 0;
}
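 
As a quick sanity check on the approach (separate from the program above), a brute-force $I(n)$ for small $n$ reproduces the values quoted in the problem statement:

#include <stdio.h>
#include <stdint.h>
 
/* Brute force: the largest m < n-1 with m*m = 1 (mod n). Only sensible for
 * small n; used to confirm I(7)=1, I(15)=11 and I(100)=51. */
uint64_t I_brute(uint64_t n) {
    uint64_t m;
    for (m = n - 2; m >= 1; m--) {
        if ((m * m) % n == 1) return m;
    }
    return 1;
}
 
int main() {
    printf("I(7)=%llu I(15)=%llu I(100)=%llu\n",
           (unsigned long long)I_brute(7),
           (unsigned long long)I_brute(15),
           (unsigned long long)I_brute(100));
    return 0;
}

It prints I(7)=1, I(15)=11 and I(100)=51, as expected.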

While browsing the Project Euler problems, I found problem 97, which is an incredibly straightforward problem that is buried within a sea of much more challenging problems. It’s called “Large non-Mersenne Prime” and it states the following.

The first known prime found to exceed one million digits was discovered in 1999, and is a Mersenne prime of the form $2^{6972593}−1$; it contains exactly 2,098,960 digits. Subsequently other Mersenne primes, of the form $2^p−1$, have been found which contain more digits.

However, in 2004 there was found a massive non-Mersenne prime which contains 2,357,207 digits: $28433\times 2^{7830457}+1$.

Find the last ten digits of this prime number.

Solution in Python

Since we’re only looking for the final ten digits of the number, we can work modulo $10^{10}$ and effectively forget about any higher-order digits. We’ll do all of the powers of two first, then multiply by the constant, then add one. We’ll take the last ten digits of the result and that should be the answer we’re looking for. I’d say that this one is pretty quick and painless.

#!/usr/bin/env python
 
import time
 
start = time.time()
 
n = 2
for i in range(7830456):
    n = (2 * n) % 10000000000
 
n *= 28433
n += 1
 
n = n % 10000000000
 
elapsed = time.time() - start
 
print "Result %s found in %s seconds" % (n, elapsed)

Then, executing, we have:

$ python problem-97.py 
Result 8739992577 found in 1.08037400246 seconds

Solution in C

This is a situation where the C solution is nearly as fast to write as the Python solution, and the code really follows the same structure.

#include <stdio.h>
#include <stdlib.h>
 
int main(int argc, char **argv) {
    unsigned long       i;
    unsigned long long  n = 2;
 
    for (i=0;i<7830456;i++) {
        n = (n << 1) % 10000000000;
    }
 
    n *= 28433;
    n += 1;
 
    n = n % 10000000000;
 
    printf("Result %llu\n", n);
 
    return 0;
}

When we execute, the result is returned very quickly. (I’ve optimized the machine code using the -O3 flag as you can see below.)

$ gcc -O3 problem-97.c -o problem-97
$ time ./problem-97
Result 8739992577
 
real	0m0.030s
user	0m0.028s
sys	0m0.004s

That feels pretty fast, but I think a bottleneck pops up because we bitshift (multiply by 2) AND reduce the result modulo 10000000000 at every single iteration of the for loop. The bitshifting is fast, but the modular arithmetic isn’t so much. Let’s rewrite the code so that we’re bitshifting at every iteration but only reducing modulo 10000000000 every ten iterations. This is safe in 64-bit arithmetic: the running value stays below $10^{10} < 2^{34}$, so ten doublings keep it below $2^{44}$, well within range of an unsigned 64-bit integer.

#include <stdio.h>
#include <stdlib.h>
 
 
int main(int argc, char **argv) {
    unsigned long       i,j;
    unsigned long long  n = 2;
 
    j = 0;
    for (i=0;i<7830456;i++) {
        //n = (n << 1) % 10000000000;
        n <<= 1;
        j++;
        if (j%10==0) {
            j = 0;
            n = n % 10000000000;
        }
    }
 
    n *= 28433;
    n += 1;
 
    n = n % 10000000000;
 
    printf("Result %llu\n", n);
 
    return 0;
}

That runs in roughly 1/3 the time of the original C code.

$ gcc -O3 problem-97.c -o problem-97
$ time ./problem-97
Result 8739992577
 
real	0m0.009s
user	0m0.008s
sys	0m0.000s
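The per-iteration loop can also be sidestepped entirely with binary exponentiation (square-and-multiply), which computes $2^{7830457}\bmod 10^{10}$ in about 23 squarings (plus a handful of multiplies) rather than 7,830,456 doublings. That isn’t what I did above; the following is just a minimal sketch, and it assumes GCC’s unsigned __int128 for the intermediate products, since squaring a ten-digit remainder overflows 64 bits.

#include <stdio.h>
 
int main(int argc, char **argv) {
    const unsigned long long mod = 10000000000ULL; // keep only the last ten digits
    unsigned long long base = 2, result = 1, e = 7830457;
 
    // right-to-left binary exponentiation: result = 2^7830457 (mod 10^10)
    while (e > 0) {
        if (e & 1) result = (unsigned __int128)result * base % mod;
        base = (unsigned __int128)base * base % mod;
        e >>= 1;
    }
 
    result = (result * 28433 + 1) % mod; // last ten digits of 28433 * 2^7830457 + 1
 
    printf("Result %llu\n", result);
    return 0;
}

At this size either version finishes essentially instantly, but the square-and-multiply form scales to far larger exponents.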

A triangular pyramid is constructed using spherical balls so that each ball rests on exactly three balls of the next lower level.

Then, we calculate the number of paths leading from the apex to each position: A path starts at the apex and progresses downwards to any of the three spheres directly below the current position. Consequently, the number of paths to reach a certain position is the sum of the numbers immediately above it (depending on the position, there are up to three numbers above it).

The result is Pascal’s pyramid and the numbers at each level n are the coefficients of the trinomial expansion $(x + y + z)^n$. How many coefficients in the expansion of $(x + y + z)^{200000}$ are multiples of $10^{12}$?

Solution Using the Multinomial Theorem

The generalization of the binomial theorem is the multinomial theorem. It says that multinomials raised to exponents can be expanded using the formula
\[
(x_1+x_2+\cdots+x_m)^n=\sum_{{k_1+k_2+\cdots+k_m=n}\atop{0\le k_i\le n}}\left({n}\atop{k_1,k_2,\ldots,k_m}\right)\prod_{1\le t\le m}x_t^{k_t}
\]
where
\[
\left({n}\atop{k_1,k_2,\ldots,k_m}\right)=\frac{n!}{k_1!k_2!\cdots k_m!}.
\]
Of course, when m=2 this gives the binomial theorem. The sum is taken over all ways of writing $k_1+k_2+\cdots+k_m=n$ with nonnegative integers $k_i$. If $n=200000$ and $m=3$, then the terms in the expansion are given by
\[
\left({200000}\atop{k_1,k_2,k_3}\right)x_1^{k_1}x_2^{k_2}x_3^{k_3}=\frac{200000!}{k_1!k_2!k_3!}x_1^{k_1}x_2^{k_2}x_3^{k_3}
\]
where $k_1+k_2+k_3=200000$. It’s worth pointing out that each of the coefficients is an integer, and thus has a unique factorization into products of prime integers. Of course, there’s no way that we’re going to calculate these coefficients. We only need to know when they’re divisible by $10^{12}$. Thus, we only need to consider how many factors of 2 and 5 are involved.

First, we’ll create a function $p(n,d)$ that outputs how many factors of a prime $d$ are included in $n!$ (this is Legendre’s formula). We have that
\[
p(n,d)=\left\lfloor\frac{n}{d}\right\rfloor+\left\lfloor\frac{n}{d^2}\right\rfloor+\left\lfloor\frac{n}{d^3}\right\rfloor+ \cdots+\left\lfloor\frac{n}{d^r}\right\rfloor,
\]
where $d^r$ is the largest power of $d$ not exceeding $n$. For instance, there are 199994 factors of 2 in 200000!. Since we’re wondering when our coefficients are divisible by $10^{12}=2^{12}5^{12}$, we’ll be using the values provided by $p(n,d)$ quite a bit for $d=2$ and $d=5$. We’ll store two lists:
\[
p2=[p(i,2)\text{ for }1\le i\le 200000]\quad\text{and}\quad p5=[p(i,5)\text{ for }1\le i\le 200000].
\]
For a given $k_1,k_2,k_3$, the corresponding coefficient is divisible by $10^{12}$ precisely when
\[
p2[k_1]+p2[k_2]+p2[k_3]<199983\ \text{and}\ p5[k_1]+p5[k_2]+p5[k_3]<49987.
\]
That is, these conditions ensure that the denominator $k_1!k_2!k_3!$ cancels few enough factors that at least 12 factors of 2 and at least 12 factors of 5 remain in the coefficient.
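For concreteness, the totals behind those thresholds are
\[
p(200000,2)=100000+50000+25000+\cdots+3+1=199994\quad\text{and}\quad p(200000,5)=40000+8000+1600+320+64+12+2=49998,
\]
so a coefficient is divisible by $10^{12}$ exactly when the denominator absorbs at most $199994-12=199982$ factors of 2 and at most $49998-12=49986$ factors of 5.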

Now, we know that $k_1+k_2+k_3=200000$, and we can exploit symmetry and avoid redundant computations if we assume $k_1\le k_2\le k_3$. Under this assumption, we always have
\[
k_1\le\left\lfloor\frac{200000}{3}\right\rfloor=66666.
\]
We know that $k_1=k_2=k_3$ is impossible since 200000 isn't divisible by 3. It follows that we can only have (case 1) $k_1=k_2 < k_3$, or (case 2) $k_1 < k_2=k_3$, or (case 3) $k_1 < k_2 < k_3$.

In case 1, we iterate $0\le k_1\le 66666$, setting $k_2=k_1$ and $k_3=200000-k_1-k_2$. We check the condition, and when it is satisfied we record 3 new instances of coefficients (since we may permute the $k_i$ in 3 ways).

In case 2, we iterate $0\le k_1\le 66666$, and when $k_1$ is even (so that $200000-k_1$ is divisible by 2) we set $k_2=k_3=\frac{200000-k_1}{2}$. When the condition holds, we again record 3 new instances.

In case 3, we iterate $0\le k_1\le 66666$, and we iterate over $k_2=k_1+a$ for $a\ge 1$ with $2a<200000-3k_1$ (which is exactly the requirement $k_2<k_3$). Then $k_3=200000-k_1-k_2$. When the condition holds, we record 6 instances (since there are 6 permutations of 3 distinct objects). A quick way to convince ourselves that these three cases cover every triple exactly once is sketched below.
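The check (not part of the solution itself) just counts 3, 3, or 6 for every representative triple while ignoring the divisibility condition; the total should equal the number of ordered triples of nonnegative integers summing to 200000, which is $\binom{200002}{2}=20{,}000{,}300{,}001$.

#include <stdio.h>
 
int main(int argc, char **argv) {
    const long long n = 200000;
    long long k1, count = 0;
 
    for (k1 = 0; k1 <= 66666; k1++) {
        count += 3;                          // case 1: k1 = k2 < k3
        if (k1 % 2 == 0) count += 3;         // case 2: k1 < k2 = k3
        count += 6 * ((n - 3*k1 - 1) / 2);   // case 3: number of valid a, times 6
    }
 
    printf("enumerated %lld ordered triples (expected 20000300001)\n", count);
    return 0;
}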

Cython Solution

I’ll provide two implementations, the first written in Cython inside Sage. Then, I’ll write a parallel solution in C.

%cython
 
import time
from libc.stdlib cimport malloc, free
 
head_time = time.time()
 
cdef unsigned long p(unsigned long k, unsigned long d):
    cdef unsigned long power = d
    cdef unsigned long exp = 0
    while power <= k:
        exp += k / power
        power *= d
    return exp
 
cdef unsigned long * p_list(unsigned long n, unsigned long d):
    cdef unsigned long i = 0
    cdef unsigned long * powers = <unsigned long *>malloc((n+1)*sizeof(unsigned long))
    while i <= n:
        powers[i] = p(i,d)
        i += 1
    return powers
 
 
run_time = time.time()
 
# form a list of number of times each n! is divisible by 2.
cdef unsigned long * p2 = p_list(200000,2)
 
# form a list of number of times each n! is divisible by 5.
cdef unsigned long * p5 = p_list(200000,5)
 
cdef unsigned long k1, k2, k3, a
cdef unsigned long long result = 0
 
k1 = 0
while k1 <= 66666:
    # case 1: k1 = k2 < k3
    k2 = k1
    k3 = 200000 - k1 - k2
    if 199982 >= (p2[k1]+p2[k2]+p2[k3]) and 49986 >= (p5[k1]+p5[k2]+p5[k3]):
        result += 3
    # case 2: k1 < k2 = k3
    if k1 % 2 == 0:
        k2 = (200000 - k1)/2
        k3 = k2
        if 199982 >= (p2[k1]+p2[k2]+p2[k3]) and 49986 >= (p5[k1]+p5[k2]+p5[k3]):
            result += 3
    # case 3: k1 < k2 < k3
    a = 1
    while 2*a < (200000 - 3*k1):
        k2 = k1 + a
        k3 = 200000 - k1 - k2
        if 199982 >= (p2[k1]+p2[k2]+p2[k3]) and 49986 >= (p5[k1]+p5[k2]+p5[k3]):
            result += 6
        a += 1
    k1 += 1
 
 
free(p2)
free(p5)
 
 
elapsed_run = round(time.time() - run_time, 5)
elapsed_head = round(time.time() - head_time, 5)
 
print "Result: %s" % result
print "Runtime: %s seconds (total time: %s seconds)" % (elapsed_run, elapsed_head)

When executed, we find the correct result relatively quickly.

Result: 479742450
Runtime: 14.62538 seconds (total time: 14.62543 seconds)

C with OpenMP Solution

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <omp.h>
 
/*****************************************************************************/
/* function to determine how many factors of 'd' are in 'k!'                 */
/*****************************************************************************/
unsigned long p(unsigned long k, unsigned long d) {
    unsigned long power = d;
    unsigned long exp = 0;
    while (power <= k) {
        exp += k/power;
        power *= d;
    }
    return exp;
}
 
/*****************************************************************************/
/* create a list [p(0,d),p(1,d),p(2,d), ... ,p(n,d)] and return pointer      */
/*****************************************************************************/
unsigned long * p_list(unsigned long n, unsigned long d) {
    unsigned long i;
    unsigned long * powers = malloc((n+1)*sizeof(unsigned long));
    for (i=0;i<=n;i++) powers[i] = p(i,d);
    return powers;
}
 
/*****************************************************************************/
/* main                                                                      */
/*****************************************************************************/
int main(int argc, char **argv) {
    unsigned long k1, k2, k3, a;
    unsigned long long result = 0;
 
    unsigned long * p2 = p_list(200000, 2);
    unsigned long * p5 = p_list(200000, 5);
 
 
    #pragma omp parallel for private(k1,k2,k3,a) reduction(+ : result)
    for (k1=0;k1<66667;k1++) {
        // case 1: k1 = k2 < k3
        k2 = k1;
        k3 = 200000 - k1 - k2;
        if (p2[k1]+p2[k2]+p2[k3]<199983 && p5[k1]+p5[k2]+p5[k3]<49987) {
            result += 3;
        } 
        // case 2: k1 < k2 = k3
        if (k1 % 2 == 0) {
            k2 = (200000 - k1)/2;
            k3 = k2;
            if (p2[k1]+p2[k2]+p2[k3]<199983 && p5[k1]+p5[k2]+p5[k3]<49987) {
                result += 3;
            }
        }
        // case 3: k1 < k2 < k3
        for (a=1;2*a<(200000-3*k1);a++) {
            k2 = k1 + a;
            k3 = 200000 - k1 - k2;
            if (p2[k1]+p2[k2]+p2[k3]<199983 && p5[k1]+p5[k2]+p5[k3]<49987) {
                result += 6;
            }
        }
    }
 
    free(p2);
    free(p5);
 
    printf("result: %llu\n", result);
 
    return 0;
}

This can be compiled and optimized using GCC as follows.

$ gcc -O3 -fopenmp -o problem-154-omp problem-154-omp.c

When executed on a 16-core machine, we get the following result.

$ time ./problem-154-omp 
result: 479742450
 
real    0m1.487s

This appears to be the fastest solution currently known, according to the solution forum on Project Euler. The CPUs on the 16-core machine are pretty weak by modern standards. When running on a single core of a newer Intel Core i7, the result is returned in about 4.7 seconds.


This problem was posted on the GAP (Groups, Algorithms, Programming) Forum some time ago. Roughly a week later, this partial solution was posted.

Knowing that GAP is an interpreted language running on top of a C kernel, I thought I might be able to do better with C. After prototyping the situation in Python, my fears were realized: the integers appearing in the orbit are HUGE. So, I decided to code in C using the GMP library and OpenMP. The main idea is that the orbit can be computed in parallel, going forwards and backwards until the orbit points agree. Not only does this cut the computation time roughly in half (compared to a non-parallel C/GMP solution), but it should go much faster than GAP. And indeed, it does. My post to the forum announcing the first known solution can be found here. The maximum digit length found in the orbit is 76,785.

C/GMP/OpenMP Code

Here is the code. It requires the GNU MP Bignum Library (a separate install) and OpenMP (which ships with GCC). When the orbit points are within a specific digit-length difference of each other, only a single core continues the computation; otherwise, both cores continue to compute the orbit in opposite directions. (There’s no makefile. I’ll give compiler instructions below.)

// Jason B. Hill
// Jason.B.Hill@Colorado.edu
// www.jasonbhill.com
 
#include <stdio.h>
#include <stdlib.h>
#include <gmp.h>
#include <omp.h>
 
//#define p 32
//#define p 736
//#define p 25952
#define p 173176
 
/*****************************************************************************/
/* Transpositions a,b,c and compositions g,g^-1                              */
/*****************************************************************************/
 
void a(mpz_t omega) {
    if(mpz_even_p(omega)==1) mpz_add_ui(omega, omega, 1);
    else mpz_sub_ui(omega, omega, 1);
}
 
void b(mpz_t omega) {
    if(mpz_congruent_ui_p(omega, 0, 5)==1) mpz_add_ui(omega, omega, 4);
    else if(mpz_congruent_ui_p(omega, 4, 5)==1) mpz_sub_ui(omega, omega, 4);
}
 
void c(mpz_t omega) {
    if(mpz_congruent_ui_p(omega, 1, 4)==1) {
        mpz_sub_ui(omega, omega, 1);
        mpz_divexact_ui(omega, omega, 4);
        mpz_mul_ui(omega, omega, 6);
    } else if(mpz_congruent_ui_p(omega, 0, 6)==1) {
        mpz_divexact_ui(omega, omega, 6);
        mpz_mul_ui(omega, omega, 4);
        mpz_add_ui(omega, omega, 1);
    }
}
 
void g(mpz_t omega) {
    a(omega);
    b(omega);
    c(omega);
}
 
void ginv(mpz_t omega) {
    c(omega);
    b(omega);
    a(omega);
}
 
/*****************************************************************************/
/* Main                                                                      */
/*****************************************************************************/
 
int main(void) {
    unsigned long       n = p;
    unsigned long long  i0 = 0;
    unsigned long long  i1 = 0;
    int                 th_id;
    _Bool               sstop = 0;
    size_t              s = 0;
    size_t              c0, c1;
    mpz_t               omega0, omega1;
 
    omp_set_num_threads(2);
 
    mpz_init(omega0);
    mpz_init(omega1);
 
    mpz_set_ui(omega0, n);
    mpz_set_ui(omega1, n);
 
    c0 = mpz_sizeinbase(omega0, 10);
    c1 = mpz_sizeinbase(omega1, 10);
 
    #pragma omp parallel private(th_id) \
    shared(omega0,omega1,s,sstop,c0,c1,i0,i1)
    {
        th_id = omp_get_thread_num();
 
        if(th_id == 1) {
            g(omega1);
            i1++;
            if(mpz_cmp(omega0,omega1)==0) sstop = 1;
        }
 
        #pragma omp barrier
 
        while(!sstop) {
            if(th_id == 0) {
                if(abs(c0 - c1) > 20) {
                    ginv(omega0);
                    c0 = mpz_sizeinbase(omega0, 10);
                    i0++;
                }
            } else if(th_id == 1) {
                if(abs(c0 - c1) > 5) {
                    g(omega1);
                    c1 = mpz_sizeinbase(omega1, 10);
                    i1++;
                } else {
                    if(mpz_cmp(omega0,omega1)==0) sstop = 1;
                    else {
                        g(omega1);
                        c1 = mpz_sizeinbase(omega1, 10);
                        i1++;
                    }
                }
            }
 
            #pragma omp flush(sstop,c0,c1,i0,i1)
 
            if(th_id == 0) {
                if(c0 > s) {
                    s = c0;
                    printf("Core 0: digit length increased to %ld\n", s);
                    printf("iterations: %lld (core0) %lld (core1) %lld (total)\
\n\n",i0,i1,i0+i1);
                }
                if(c1 > s) {
                    s = c1;
                    printf("Core 1: digit length increased to %ld\n", s);
                    printf("iterations: %lld (core0) %lld (core1) %lld (total)\
\n\n",i0,i1,i0+i1);
                }
                if((i0+i1)%100000000==0) {
                    printf("digit length: %ld\n", s);
                    printf("iterations: %lld (core0) %lld (core1) %lld (total)\
\n\n",i0,i1,i0+i1);
                }
            }
        }
    }
 
    printf("total iterations: %llu\n",i0+i1);
    mpz_clear(omega0);
    mpz_clear(omega1);
 
    return 0;
}

Using GCC with OpenMP and GMP, one can compile the code on a multicore machine with, for example, the following command.

gcc -O3 -o length-residue-orbit-omp length-residue-orbit-omp.c -lgmp -fopenmp

Success!

The result (total iterations: 47610700792) is returned in roughly 3.1 days on a 2.9 GHz third-generation Intel Core i7 (i7-3520M).


This is something I use quite a bit for various problems and programming exercises, so I figured I could post it here. It’s a basic post that isn’t advanced at all, but that doesn’t mean the implementation given below won’t save work for others. The idea is to create a list of primes in C by malloc’ing a sieve, then malloc’ing a list whose length is determined from that sieve. The resulting list contains all the primes below a given limit (defined in the code). The first member of the list is an integer giving the length of the list.

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
 
#define                 bool    _Bool
 
static unsigned long    prime_limit = 1000000;
 
unsigned long sqrtld(unsigned long N) {
    int                 b = 1;
    unsigned long       res,s;
    while((1UL<<b) < N) b += 1;
    res = 1UL<<(b/2 + 1);
    for(;;) {
        s = (N/res + res)/2;
        if(s>=res) return res;
        res = s;
    }
}
 
unsigned long * make_primes(unsigned long limit) {
    unsigned long      *primes;
    unsigned long       i,j;
    unsigned long       s = sqrtld(limit);
    unsigned long       n = 0;
    bool *sieve = malloc((limit + 1) * sizeof(bool));
    sieve[0] = 0;
    sieve[1] = 0;
    for(i=2; i<=limit; i++) sieve[i] = 1;
    j = 4;
    while(j<=limit) {
        sieve[j] = 0;
        j += 2;
    }
    for(i=3; i<=s; i+=2) {
        if(sieve[i] == 1) {
            j = i * 3;
            while(j<=limit) {
                sieve[j] = 0;
                j += 2 * i;
            }
        }
    }
    for(i=2;i<=limit;i++) if(sieve[i]==1) n += 1;
    primes = malloc((n + 1) * sizeof(unsigned long));
    primes[0] = n;
    j = 1;
    for(i=2;i<=limit;i++) if(sieve[i]==1) {
        primes[j] = i;
        j++;
    }
    free(sieve);
    return primes;
}
 
 
int main(void) {
 
    unsigned long * primes = make_primes(prime_limit);
 
    printf("There are %lu primes <= %lu\n",primes[0],prime_limit);
 
 
    free(primes);
    return 0;
}

Say you wanted to form a list of all primes below 1,000,000. That’s what the above program does by default, since “prime_limit = 1000000.” If you compile and execute it, you get something like the following. The timing is relatively respectable.

$ gcc -O3 -o prime-sieve prime-sieve.c 
$ time ./prime-sieve 
There are 78498 primes <= 1000000
 
real	0m0.008s
user	0m0.004s
sys	0m0.000s

The code is linked here: prime-sieve.c
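Since the length is stored in primes[0] and the primes themselves occupy indices 1 through primes[0], a typical use of the returned list looks something like the following (a small sketch that could replace the main() above):

int main(void) {
 
    unsigned long i;
    unsigned long long sum = 0;
    unsigned long * primes = make_primes(prime_limit);
 
    // primes[0] is the count; the primes sit in primes[1] .. primes[primes[0]]
    for (i = 1; i <= primes[0]; i++) sum += primes[i];
 
    printf("There are %lu primes <= %lu; their sum is %llu\n",
           primes[0], prime_limit, sum);
 
    free(primes);
    return 0;
}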


Problem

Let $d(n)$ be defined as the sum of proper divisors of $n$ (numbers less than $n$ which divide evenly into $n$).
If $d(a) = b$ and $d(b) = a$, where $a\neq b$, then $a$ and $b$ are an amicable pair and each of a and b are called amicable numbers.

For example, the proper divisors of 220 are 1, 2, 4, 5, 10, 11, 20, 22, 44, 55 and 110; therefore $d(220) = 284$. The proper divisors of 284 are 1, 2, 4, 71 and 142; so $d(284) = 220$.

Evaluate the sum of all the amicable numbers under 10000.

Python Solution

I create a list populated with the sum of proper divisors of each number in the range, and then walk through the list doing some index and value checking.

import time
 
def sum_divisors(n):
    s = 0
    for i in range(1,n):
        if n % i == 0: s += i
    return s
 
def amicable_pairs_xrange(low,high):
    L = [sum_divisors(i) for i in range(low,high + 1)]
    pairs = []
    for i in range(high - low + 1):
        ind = L[i]
        if i + low < ind and low <= ind and ind <= high and L[ind - low] == i + low:
            pairs.append([i+low,ind])
    return pairs
 
def sum_pairs(pairs):
    return sum([sum(pair) for pair in pairs])
 
start = time.time()
 
ans = sum_pairs(amicable_pairs_xrange(1,10000))
 
elapsed = time.time() - start
 
print("%s found in %s seconds") % (ans,elapsed)

This returns the correct result.

31626 found in 7.86725997925 seconds

Cython Solution

One of the things I love about using Cython inside the Sage Notebook environment is that one can simply compile existing Python code as Cython (via the %cython cell directive) and often achieve some sort of speedup. That’s all I’ve done in this first example; I haven’t changed the Python code at all yet.

%cython
 
import time
 
def sum_divisors(n):
    s = 0
    for i in range(1,n):
        if n % i == 0: s += i
    return s
 
def amicable_pairs_xrange(low,high):
    L = [sum_divisors(i) for i in range(low,high + 1)]
    pairs = []
    for i in range(high - low + 1):
        ind = L[i]
        if i + low < ind and low <= ind and ind <= high and L[ind - low] == i + low:
            pairs.append([i+low,ind])
    return pairs
 
def sum_pairs(pairs):
    return sum([sum(pair) for pair in pairs])
 
start = time.time()
 
ans = sum_pairs(amicable_pairs_xrange(1,10000))
 
elapsed = time.time() - start
 
print("%s found in %s seconds") % (ans,elapsed)

This runs faster than the original, and that blows my mind.

31626 found in 4.98208904266 seconds

We can obviously improve on this by rewriting the involved functions using C datatypes.

%cython
 
import time
 
cdef sum_divisors(unsigned int n):
    cdef unsigned int s = 0, i = 1
    while i < n:
        if n % i == 0: s += i
        i += 1
    return s
 
cdef sum_amicable_pairs(unsigned int low, unsigned int high):
    cdef unsigned int a = low, b, sum = 0
    while a <= high:
        b = sum_divisors(a)
        if b > a and sum_divisors(b) == a:
            sum += a + b
        a += 1
 
    return sum
 
start = time.time()
 
ans = sum_amicable_pairs(1,10000)
 
elapsed = time.time() - start
 
print("%s found in %s seconds") % (ans,elapsed)

We now achieve a fairly significant speedup.

31626 found in 1.06677889824 seconds

C Solution

If I take the fast Cython version and rewrite it in C, I don’t get much better performance.

#include <stdio.h>
#include <stdlib.h>
 
unsigned int sum_divisors(unsigned int n) {
    unsigned int s = 0, i = 1;
    while(i < n) {
        if(n % i == 0) s = s + i;
        i++;
    }
    return s;
}
 
unsigned int sum_amicable_pairs(unsigned int low, unsigned int high) {
    unsigned int a = low, b, sum = 0;
    while(a <= high) {
        b = sum_divisors(a);
        if(b > a && sum_divisors(b) == a) sum = sum + a + b;
        a++;
    }
    return sum;
}
 
int main(int argc, char **argv) {
    unsigned int ans = sum_amicable_pairs(1,10000);
    printf("result: %d\n",ans);
    return 0;
}

The result, compiled with heavy optimization, is only slightly faster than the Cython code.

$ gcc -O3 -o problem-21 problem-21.c 
$ time ./problem-21
result: 31626
 
real	0m1.032s

About the only way to really improve this, short of restructuring the algorithm (a sketch of that idea appears after the OpenMP version below), is to use OpenMP to run the process in parallel.

C with OpenMP

A parallel version using shared memory via OpenMP looks something like this. (I’ve rewritten it slightly.)

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
 
unsigned int sum_divisors(unsigned int n) {
    unsigned int s = 0, i = 1;
    while(i < n) {
        if(n % i == 0) s = s + i;
        i++;
    }
    return s;
}
 
int main(int argc, char **argv) {
    unsigned int a, b, nthreads, result = 0;
 
    #pragma omp parallel
    nthreads = omp_get_num_threads();
    printf("no of threads %d\n", nthreads);
 
    #pragma omp parallel for reduction(+:result) private(b)
    for(a=0; a < 10000; a++) {
        b = sum_divisors(a);
        if(b > a && sum_divisors(b) == a) result = result + a + b;
    }
 
    printf("result: %d\n",result);
    return 0;
}

Now, we get much better performance. I’m running this on a 16-core machine, where 3 of the cores are currently being used.

$ gcc -fopenmp -O3 -o problem-21-openmp problem-21-openmp.c 
$ time ./problem-21-openmp 
no of threads 16
result: 31626
 
real	0m0.140s
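As for the restructuring mentioned above: the expensive part is the trial-division sum_divisors, which could be replaced by a sieve-style pass that adds each integer to the divisor sum of all of its multiples. This is not the approach used in this post, just a minimal single-threaded sketch of the idea (it only counts pairs that lie entirely below the limit, which is enough here):

#include <stdio.h>
#include <stdlib.h>
 
int main(int argc, char **argv) {
    const unsigned int limit = 10000;
    unsigned int i, j, a, b, result = 0;
    unsigned int *d = calloc(limit + 1, sizeof(unsigned int)); // d[n] = sum of proper divisors of n
 
    // add every i as a proper divisor to each of its multiples 2i, 3i, ...
    for (i = 1; i <= limit / 2; i++)
        for (j = 2 * i; j <= limit; j += i)
            d[j] += i;
 
    // sum the amicable pairs (a, b) with a < b <= limit
    for (a = 2; a <= limit; a++) {
        b = d[a];
        if (b > a && b <= limit && d[b] == a) result += a + b;
    }
 
    printf("result: %u\n", result);
    free(d);
    return 0;
}

This computes all of the divisor sums in roughly $O(N\log N)$ additions instead of $O(N^2)$ trial divisions.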

Problem: You are given the following information, but you may prefer to do some research for yourself.

  • 1 Jan 1900 was a Monday.
  • Thirty days has September,
    April, June and November.
    All the rest have thirty-one,
    Saving February alone,
    Which has twenty-eight, rain or shine.
    And on leap years, twenty-nine.
  • A leap year occurs on any year evenly divisible by 4, but not on a century unless it is divisible by 400.

How many Sundays fell on the first of the month during the twentieth century (1 Jan 1901 to 31 Dec 2000)?

Approach

There are several different ways to approach this. The easiest, I think, is to use the Gaussian formula for day of week. It is a purely mathematical formula that I have encoded in the following Python code.
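As a quick check of the formula, take 1 Jan 1901: January belongs to the previous year, so $Y=1900$, $d=1$, $m=11$, $y=0$, $c=19$, and
\[
w=(1+\lfloor 2.6\cdot 11-0.2\rfloor+0+0+4-38)\bmod 7=(-5)\bmod 7=2,
\]
a Tuesday. That agrees with the given fact that 1 Jan 1900 was a Monday, since 1900 is not a leap year and $365\bmod 7=1$.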

Python Solution

import time
from math import floor
 
"""
Gaussian algorithm to determine day of week
"""
def day_of_week(year, month, day):
    """
    w = (d+floor(2.6*m-0.2)+y+floor(y/4)+floor(c/4)-2*c) mod 7
 
    Y = year - 1 for January or February
    Y = year for other months
    d = day (1 to 31)
    m = shifted month (March = 1, February = 12)
    y = last two digits of Y
    c = first two digits of Y
    w = day of week (Sunday = 0, Saturday = 6)
    """
 
    d = day
    m = (month - 3) % 12 + 1
    if m > 10: Y = year - 1
    else: Y = year
    y = Y % 100
    c = (Y - (Y % 100)) / 100
 
    w = (d + floor(2.6 * m - 0.2) + y + floor(y/4) + floor(c/4) - 2*c) % 7
 
    return int(w)
 
"""
Compute the number of months starting on a given day of the week in a century
"""
def months_start_range(day,year_start,year_end):
    total = 0
    for year in range(year_start, year_end + 1):
        for month in range(1,13):
            if day_of_week(year, month, 1) == day: total += 1
    return total
 
start = time.time()
 
total = months_start_range(0,1901,2000)
 
elapsed = time.time() - start
 
print("%s found in %s seconds") % (total,elapsed)

This returns the correct result.

171 found in 0.0681998729706 seconds

That will run faster if executed directly in Python. Remember, I’m using the Sage notebook environment to execute most of the Python here. I’m also writing Cython in the same environment.

Cython Solution

There is a nearly trivial rewriting to Cython of the above Python code.

%cython
 
import time
from math import floor
 
"""
Gaussian algorithm to determine day of week
"""
cdef day_of_week(int year, int month, int day):
    """
    w = (d+floor(2.6*m-0.2)+y+floor(y/4)+floor(c/4)-2*c) mod 7
 
    Y = year - 1 for January or February
    Y = year for other months
    d = day (1 to 31)
    m = shifted month (March = 1, February = 12)
    y = last two digits of Y
    c = first two digits of Y
    w = day of week (Sunday = 0, Saturday = 6)
    """
 
    cdef int d = day
    cdef int m = (month - 3) % 12 + 1
    cdef int Y
    if m > 10: Y = year - 1
    else: Y = year
    y = Y % 100
    cdef int c = (Y - (Y % 100)) / 100
 
    cdef double w
    w = (d + floor(2.6 * m - 0.2) + y + floor(y/4) + floor(c/4) - 2*c) % 7
 
    return int(w)
 
"""
Compute the number of months starting on a given day of the week in range of years
"""
cdef months_start_range(int day, int year_start,int year_end):
    cdef unsigned int total = 0
    cdef int year, month
    for year in range(year_start, year_end + 1):
        for month in range(1,13):
            if day_of_week(year, month, 1) == day: total += 1
    return total
 
start = time.time()
 
total = months_start_range(0,1901,2000)
 
elapsed = time.time() - start
 
print("%s found in %s seconds") % (total,elapsed)

The code is a bit longer, but it executes much faster.

171 found in 0.00387215614319 seconds

The Cython code runs roughly 18 times faster.

C Solution

The Cython code was used as a model to create more efficient C code. The only issue here is in maintaining the correct datatypes (not too hard, but compared to Cython it is a pain).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
 
int day_of_week(int year, int month, int day) {
    // Using the Gaussian algorithm
    int d = day;
 
    int m = (month + 9) % 12 + 1; // shifted month: March = 1, ..., February = 12 (always nonnegative)
    int Y;
    if(m > 10) Y = year - 1;      // January and February belong to the previous year
    else Y = year;
    int y = Y % 100;
    int c = (Y - (Y % 100)) / 100;
 
    int w = (d + (int)floor(2.6*m - 0.2) + y + y/4 + c/4 - 2*c) % 7;
    if(w < 0) w += 7;             // keep the result in 0..6 (Sunday = 0)
 
    return w;
}
 
long months_start_range(int day, int year_start, int year_end) {
    unsigned long total = 0;
    int year, month;
    for(year = year_start; year <= year_end; year++) {
        for(month = 1; month <= 12; month++) {
            if(day_of_week(year, month, 1)==day) total++;
        }
    }
    return total;
 
}
 
int main(int argc, char **argv) {
 
    int iter = 0;
    long total;
    while(iter < 100000) {
        total = months_start_range(0,1901,2000);
        iter++;
    }
    printf("Solution: %ld\n",total);
 
    return 0;
}

Notice that this executes the loop 100,000 times, as I’m trying to get a good idea of what the average runtime is. We compile with optimization and the -lm math option. We get the following result.

gcc -O3 -o problem-19 problem-19.c -lm
$ time ./problem-19
Solution: 171
 
real	0m6.240s

The C code runs roughly 62 times as fast as the Cython and roughly 1124 times as fast as the Python. Each iteration executes in about 6.2000e-5 seconds.


Problem: The following iterative sequence is defined for the set of positive integers:

\[n\rightarrow\begin{cases}n/2 & n \text{ even}\\ 3n+1 & n \text{ odd}\end{cases}\]

Using the rule above and starting with 13, we generate the following sequence:

\[13\rightarrow 40\rightarrow 20\rightarrow 10\rightarrow 5\rightarrow 16\rightarrow 8\rightarrow 4\rightarrow 2\rightarrow 1\]

It can be seen that this sequence (starting at 13 and finishing at 1) contains 10 terms. Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1. Which starting number, under one million, produces the longest chain? Note: Once the chain starts the terms are allowed to go above one million.

Idea Behind a Solution

I’ll refer to the “Collatz length of $n$” as the length of the chain from an integer $n$ to 1 using the above described sequence. If we were to calculate the Collatz length of each integer separately, that would be incredibly inefficient. In Python, that would look something like this.

First Python Solution

import time
 
start = time.time()
 
def collatz(n, count=1):
    while n > 1:
        count += 1
        if n % 2 == 0:
            n = n/2
        else:
            n = 3*n + 1
    return count
 
max = [0,0]
for i in range(1000000):
    c = collatz(i)
    if c > max[0]:
        max[0] = c
        max[1] = i
 
elapsed = (time.time() - start)
print "found %s at length %s in %s seconds" % (max[1],max[0],elapsed)

Now, this will actually determine the solution, but it is going to take a while, as shown when we run the code.

found 837799 at length 525 in 46.6846499443 seconds.

A Better Python Solution

What I’m going to do is cache the Collatz lengths for integers below one million. The idea is that we can use the cached values to make calculations of new Collatz lengths more efficient. But we don’t want to record every single number appearing in the Collatz sequences we’ll encounter, because some of those values reach well into the billions. We’ll make a list called TO_ADD, and we’ll only populate it with numbers whose Collatz lengths are not yet known. Once known, the Collatz lengths will be stored for repeated use.

import time
 
start = time.time()
 
limit = 1000000
collatz_length = [0] * limit
collatz_length[1] = 1
max_length = [1,1]
 
for i in range(1,1000000):
    n,s = i,0
    TO_ADD = [] # collatz_length not yet known
    while n > limit - 1 or collatz_length[n] < 1:
        TO_ADD.append(n)
        if n % 2 == 0: n = n/2
        else: n = 3*n + 1
        s += 1
    # collatz_length now known from previous calculations
    p = collatz_length[n]
    for j in range(s):
        m = TO_ADD[j]
        if m < limit:
            new_length = collatz_length[n] + s - j
            collatz_length[m] = new_length
            if new_length > max_length[1]: max_length = [i,new_length]
 
elapsed = (time.time() - start)
print "found %s at length %s in %s seconds" % (max_length[0],max_length[1],elapsed)

This should return the same result, but in significantly less time.

found 837799 at length 525 in 5.96128201485 seconds

A First Cython Solution

If we take our original approach of computing each Collatz length from scratch, this might actually work slightly better in Cython.

%cython
 
import time
 
cdef collatz(unsigned long long n):
    # n must be 64-bit: intermediate values in these sequences exceed 2**32
    cdef unsigned int count = 1
    while n > 1:
        count += 1
        if n % 2 == 0:
            n = n/2
        else:
            n = 3*n + 1
    return count
 
cdef find_max_collatz(unsigned int min, unsigned int max):
    cdef unsigned int m = 1
    cdef unsigned long num = 1
    cdef unsigned int count = 1
    cdef unsigned long iter = min
    while iter < max:
        count = collatz(iter)
        if count > m:
            m = count
            num = iter
        iter += 1
    return num
 
start = time.time()
max_found = find_max_collatz(1,1000000)   
elapsed = (time.time() - start)
print "found %s in %s seconds" % (max_found,elapsed)

In fact, when executed, we find that it is significantly better than our efficient Python code.

found 837799 in 0.604798078537 seconds

This just goes to show that even a naively structured algorithm, once compiled, can drastically outperform efficient Python. But how far can we take this Cython refinement? What if we were to recode our more efficient algorithm in Cython? It may look something like this.

A Better Cython Solution

%cython
 
import time
from libc.stdlib cimport malloc, free
 
cdef find_max_collatz(unsigned long int max):
    cdef int *collatz_length = <int *>malloc(max * sizeof(int))
    cdef list TO_ADD # holds numbers of unknown collatz length
    cdef unsigned long iter, j, m, n, s, p, ind, new_length, max_length = 0
 
    # set initial collatz lengths
    iter = 0
    while iter < max:
        collatz_length[iter] = 0
        iter += 1
    collatz_length[1] = 1
 
    # iterate to max and find collatz lengths
    iter = 1
    while iter < max:
        n,s = iter,0
        TO_ADD = []
        while n > max - 1 or collatz_length[n] < 1:
            TO_ADD.append(n)
            if n % 2 == 0: n = n/2
            else: n = 3*n + 1
            s += 1
        # collatz length now known from previous calculations
        p = collatz_length[n]
        j = 0
        while j < s:
            m = TO_ADD[j]
            if m < max:
                new_length = collatz_length[n] + s - j
                collatz_length[m] = new_length
                if new_length > max_length:
                    max_length = new_length
                    ind = m
            j += 1
        iter += 1
 
    free(collatz_length)
    return ind
 
start = time.time()
max_collatz = find_max_collatz(1000000)
elapsed = (time.time() - start)
print "found %s in %s seconds" % (max_collatz,elapsed)

This gives us some relatively good results:

found 837799 in 0.46523308754 seconds

Still, it isn’t a great improvement over the naive Cython code. What’s going on? I bet that the TO_ADD data structure could be changed from a Python list (notice the “cdef list” definition) to a malloc’d C array. That will be a bit more work, but my gut instincts tell me that this is probably the bottleneck in our current Cython code. Let’s rewrite it a bit.

%cython
 
import time
from libc.stdlib cimport malloc, free
 
cdef find_max_collatz(unsigned long int max):
    cdef int *collatz_length = <int *>malloc(max * sizeof(int))
    cdef long long *TO_ADD = <long long *>malloc(600 * sizeof(long long))  # orbit values can exceed 32 bits
    cdef unsigned long iter, j, m, n, s, p, ind, new_length, max_length = 0
 
    # set initial collatz lengths and TO_ADD numbers
    iter = 0
    while iter < max:
        collatz_length[iter] = 0
        iter += 1
    collatz_length[1] = 1
    iter = 0
    while iter < 600:
        TO_ADD[iter] = 0
        iter += 1
 
    # iterate to max and find collatz lengths
    iter = 1
    while iter < max:
        n,s = iter,0
        while n > max - 1 or collatz_length[n] < 1:
            TO_ADD[s] = n
            if n % 2 == 0: n = n/2
            else: n = 3*n + 1
            s += 1
        # collatz length now known from previous calculations
        p = collatz_length[n]
        j = 0
        while j < s:
            m = TO_ADD[j]
            TO_ADD[j] = 0
            if m < max:
                new_length = collatz_length[n] + s - j
                collatz_length[m] = new_length
                if new_length > max_length:
                    max_length = new_length
                    ind = m
            j += 1
        iter += 1
 
    free(collatz_length)
    free(TO_ADD)
    return ind
 
start = time.time()
max_collatz = find_max_collatz(1000000)
elapsed = (time.time() - start)
print "found %s in %s seconds" % (max_collatz,elapsed)

Now, when we execute this code we get the following.

found 837799 in 0.0465848445892 seconds

That’s much better. So, by using Cython and writing things a bit more efficiently, the code executes roughly 1000 times as fast as the original Python solution.

C Solution

If I structure the algorithm in the same way, I don’t expect to gain much by rewriting things in C, but I’ll see what happens.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
 
int find_max_collatz(unsigned long max) {
    unsigned int collatz_length[max];
    unsigned long TO_ADD[600]; // orbit values can exceed 32 bits, so use a 64-bit type
    unsigned long iter, j, m, n, s, p, ind, new_length, max_length = 0;
 
    // set initial collatz lengths and TO_ADD numbers
    iter = 0;
    while(iter < max) {
        collatz_length[iter] = 0;
        iter++;
    }
    collatz_length[1] = 1;
    iter = 0;
    while(iter < 600) {
        TO_ADD[iter] = 0;
        iter++;
    }
    // iterate to max and find collatz lengths
    iter = 1;
    while(iter < max) {
        n = iter;
        s = 0;
        while(n > max - 1 || collatz_length[n] < 1) {
            TO_ADD[s] = n;
            if(n % 2 == 0) n = n/2;
            else n = 3*n + 1;
            s++;
        }
        // collatz length now known from previous calculations
        p = collatz_length[n];
        j = 0;
        while(j < s) {
            m = TO_ADD[j];
            TO_ADD[j] = 0;
            if(m < max) {
                new_length = collatz_length[n] + s - j;
                collatz_length[m] = new_length;
                if(new_length > max_length) {
                    max_length = new_length;
                    ind = m;
                }
            }
            j++;
        }
        iter++;
    } 
    return ind;
}
 
 
int main(int argc, char **argv) {
    unsigned int max, i;
    time_t start, end;
    double total_time;
 
    start = time(NULL);
 
    for(i=0;i<1000;i++) max = find_max_collatz(1000000);
 
    end = time(NULL);
    total_time = difftime(end,start);
 
    printf("%d found in %lf seconds.\n",max,total_time);
 
    return 0;
}

We’re using 1,000 iterations to try and get a good idea of how this runs.

$ time ./problem-14
837799 found in 38.000000 seconds.
 
real	0m38.871s
user	0m38.120s

So, in C we’re getting each iteration in roughly 0.038 seconds, which is ever so slightly faster than the Cython code.


Problem: A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009, which is 91 × 99. Find the largest palindrome made from the product of two 3-digit numbers.

Python Solution

There are a couple of ways to do this in Python (that I can think of) and I’m going to test some things out first, just to see which performs better. In the following block of code, I’ve written two functions. One checks if a number is a palindrome by reversing the digits arithmetically. The second function tests the same thing by casting the integer as a string and using a list slice to reverse the characters. Let’s see which one is faster.

import time
 
def is_palindrome1(num):
    reversed = 0
    original = num
 
    if num < 10: return True
    if num % 10 == 0: return False
 
    while num >= 1:
        reversed = (reversed * 10) + (num % 10)
        num = num/10
 
    if original == reversed: return True
    else: return False
 
def is_palindrome2(num):
    if str(num)==str(num)[::-1]: return True
    else: return False
 
start = time.time()
for i in range(1000,1000000): a = is_palindrome1(i)
elapsed = (time.time() - start)
 
print "first function took %s seconds" % elapsed
 
start = time.time()
for i in range(1000,1000000): a = is_palindrome2(i)
elapsed = (time.time() - start)
 
print "second function took %s seconds" % elapsed

Running this little block of code, which tests all numbers between 1,000 and 999,999, we get the following result.

first function took 3.07841110229 seconds
second function took 1.60997390747 seconds

For the Python solution, we’ll stick with the second function. When we attempt the Cython solution, we’ll come back and see which performs better again. For now, here’s a Python solution.

import time
 
def find_max_palindrome(min=100,max=999):
    max_palindrome = 0
    for a in range(min,max + 1):
        for b in range(a + 1, max + 1): # avoid duplicates
            prod = a*b
            if prod > max_palindrome and str(prod)==(str(prod)[::-1]):
                max_palindrome = prod
    return max_palindrome
 
start = time.time()
L = find_max_palindrome()
elapsed = (time.time() - start)
 
print "%s found in %s seconds" % (L,elapsed)

There are a few things to notice here. First, I’m only concerned with the maximum palindrome, so I don’t spend any time storing palindromes once they’re known to be non-maximal. Second, the if statement checks whether the product is larger than the maximum known palindrome before using the string conversion and slice (which are more expensive) to check whether the number is even a palindrome. Since the and operator short-circuits, the cheap greater-than comparison (which usually fails) lets us skip the string work most of the time. When we run this, we get the following.

906609 found in 0.134979963303 seconds
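
To see how much that condition ordering matters on its own, here is a minimal sketch that runs the same double loop twice: once with the cheap comparison first and once with the string check first. Since the and operator short-circuits, the first ordering should spend far less time on string conversions.

import time
 
def search(cheap_first):
    # Same search as above; only the order of the two conditions changes.
    best = 0
    for a in range(100, 1000):
        for b in range(a, 1000):
            prod = a*b
            if cheap_first:
                if prod > best and str(prod) == str(prod)[::-1]:
                    best = prod
            else:
                if str(prod) == str(prod)[::-1] and prod > best:
                    best = prod
    return best
 
for flag in (True, False):
    start = time.time()
    search(flag)
    print("cheap comparison first = %s: %s seconds" % (flag, time.time() - start))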

We could speed this up a tiny bit by replacing the range-based loops with our own while loops that count downward.

import time
 
def find_max_palindrome(min=100,max=999):
    max_palindrome = 0
    a = 999
    while a > 99:
        b = 999
        while b >= a:
            prod = a*b
            if prod > max_palindrome and str(prod)==(str(prod)[::-1]):
                max_palindrome = prod
            b -= 1
        a -= 1
    return max_palindrome
 
start = time.time()
L = find_max_palindrome()
elapsed = (time.time() - start)
 
print "%s found in %s seconds" % (L,elapsed)

This gives us the following, which is roughly 88% of the original runtime.

906609 found in 0.119297981262 seconds

Cython Solution

We’ll take our fastest Python solution and rewrite it in Cython. It looks like this.

%cython
 
import time
 
cdef find_max_palindrome(unsigned int min=100,unsigned int max=999):
    cdef unsigned int max_palindrome = 0
    cdef unsigned int prod, b, a = 999
    while a > 99:
        b = 999
        while b >= a:
            prod = a*b
            if prod > max_palindrome and str(prod)==(str(prod)[::-1]):
                max_palindrome = prod
            b -= 1
        a -= 1
    return max_palindrome
 
start = time.time()
L = find_max_palindrome()
elapsed = (time.time() - start)
 
print "%s found in %s seconds" % (L,elapsed)

Executing that gives the following result.

906609 found in 0.00826597213745 seconds

Thus, the initial Cython version is roughly 14.43 times as fast as the Python version. But here we’re still relying on Python’s string conversion and slicing, and compiled code will probably be more efficient if we perform the palindrome check arithmetically. Let’s see.

%cython
 
import time
 
cdef is_palindrome(unsigned int num):
    cdef unsigned int reversed = 0
    cdef unsigned int original = num
 
    if num < 10: return True
    if num % 10 == 0: return False
 
    while num >= 1:
        reversed = (reversed * 10) + (num % 10)
        num = num/10
 
    if original == reversed: return True
    else: return False
 
cdef find_max_palindrome():
    cdef unsigned int max_palindrome = 0
    cdef unsigned int a = 999
    cdef unsigned int b = 999
    cdef unsigned int prod
 
    while a > 99:
        b = 999
        while b >= a:
            prod = a*b
            if prod > max_palindrome and is_palindrome(prod):
                max_palindrome = prod
            b -= 1
        a -= 1
 
    return max_palindrome
 
start = time.time()
L = find_max_palindrome()
elapsed = (time.time() - start)
 
print "%s found in %s seconds" % (L,elapsed)

Executing this, we obtain the following.

906609 found in 0.00146412849426 seconds

This is basically what I had expected: The arithmetic version is much faster in Cython. Now, our Cython code runs roughly 81.5 times as fast as our Python code.

C Solution

Let’s see exactly how much faster we can make this in C. (I expect it to be significantly faster.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
 
int is_palindrome(unsigned int num) {
    unsigned int reversed = 0;
    unsigned int original = num;
 
    if (num < 10) return 1;
    if (num % 10 == 0) return 0;
 
    while (num >= 1) {
        reversed = (reversed * 10) + (num % 10);
        num = num/10;
    }
 
    if (original == reversed) return 1;
    else return 0;
}
 
int main(int argc, char **argv) {
    float ratio = 1./CLOCKS_PER_SEC;
    clock_t t1 = clock();
    unsigned int max_palindrome = 0;
    unsigned int a, b, prod;
 
    unsigned long int c = 10000;
 
    while (c > 0) {
        a = 999;
        while (a > 99) {
            b = 999;
            while (b >= a) {
                prod = a*b;
                if (prod > max_palindrome && is_palindrome(prod)) {
                    max_palindrome = prod;
                }
                b--;
            }
            a--;
        }
        c--;
    }
 
    clock_t t2 = clock();
    /* elapsed time is the difference of the two clock readings */
    printf("%u found in %f seconds for 10,000 iterations\n",
        max_palindrome, ratio*(double)(t2 - t1));
 
    return 0;
}

Notice that we’re running 10,000 iterations here. When executed, after compiling with GCC using the -O3 optimization flag, we get the following.

906609 found in 3.740000 seconds for 10,000 iterations

So, each iteration runs in about 0.000374 seconds, which makes the C code roughly 3.9 times as fast as the Cython code and about 319 times as fast as the Python code.


Problem: The prime factors of 13195 are 5, 7, 13 and 29. What is the largest prime factor of the number 600851475143?

Solutions: I’ll give Sage, Python, Cython, and C solutions.

Sage Solution

There is a single-line solution in Sage.

sage: max(ZZ(600851475143).prime_factors())
6857

Let’s time this with 100,000 iterations in Sage.

import time
 
start = time.time()
for i in range(100000):
    a = max(ZZ(600851475143).prime_factors())
elapsed = (time.time() - start)
 
print "result %s returned after %s seconds (100,000 iterations)." % (a, elapsed)

We get the following result.

result 6857 returned after 4.41913223267 seconds (100,000 iterations).

So, the calculation in Sage averages around 4.42e-05 seconds per iteration. That’s relatively quick. Let’s now write our own Python code.

Python Solution

We’ll factor the input by first dividing out all factors of 2, and then dividing by odd integers in increasing order until the number is completely factored. This is far from the most efficient factorization algorithm, but it will work. And, since we’re only interested in the largest factor, we don’t need to store much in memory.

import time
 
def largest_prime_factor(n):
 
    largest_factor = 1
 
    # remove any factors of 2 first
    while n % 2 == 0:
        largest_factor = 2
        n = n/2
 
    # now look at odd factors
    p = 3
    while n != 1:
        while n % p == 0:
            largest_factor = p
            n = n/p
        p += 2
 
    return largest_factor
 
start = time.time()
for i in range(100000): a = largest_prime_factor(600851475143)
elapsed = (time.time() - start)
 
print "result %s returned after %s seconds (100,000 iterations)." % (a, elapsed)

We get the following result.

result 6857 returned after 93.1715428829 seconds (100,000 iterations).

This means that each iteration takes roughly 0.0009317 seconds, or about 21 times as long as the Sage code above. That makes sense: Sage is using a much more efficient factorization algorithm. What happens when we move our Python code to Cython?

Cython Solution

We’ll modify our Python code to Cython and see how fast it runs.

%cython
 
import time
 
cdef largest_prime_factor(unsigned long n):
 
    cdef unsigned long    largest_factor = 1
    cdef unsigned long    p = 3
 
    # remove any factors of 2 first
    while n % 2 == 0:
        largest_factor = 2
        n = n/2
 
    # now look at odd factors
    while n != 1:
        while n % p == 0:
            largest_factor = p
            n = n/p
        p += 2
 
    return largest_factor
 
start = time.time()
for i in range(100000): a = largest_prime_factor(600851475143)
elapsed = (time.time() - start)
 
print "result %s returned after %s seconds (100,000 iterations)." % (a, elapsed)

This returns the following result when executed.

result 6857 returned after 5.63884592056 seconds (100,000 iterations).

The Cython code is roughly 16.5 times as fast as the Python code. It still takes roughly 27% longer than the Sage code, which is fine: Sage is calling the PARI C library for its factorization, while our code was cheap to write. For the amount of effort that went into the Cython version, it performs very well. We can make it faster still by writing the same code in C.

C Solution

The C code here has the same basic structure as the Python and Cython code.

#include <stdio.h>
#include <stdlib.h>
 
unsigned long largest_prime_factor(unsigned long n) {
    unsigned long   largest_factor = 1;
    unsigned long   p = 3;
    unsigned long   div = n;
 
    /* remove any factors of 2 first */
    while (div % 2 == 0) {
        largest_factor = 2;
        div = div/2;
    }
 
    /* now look at odd factors */
    while (div != 1) {
        while (div % p == 0) {
            largest_factor = p;
            div = div/p;
        }
        p = p + 2;
    }
 
    return largest_factor;
}
 
int main(int argc, char **argv) {
    unsigned long   factor;
    unsigned long   max = 100000;
    unsigned long   i = 1;
 
    while (i <= max) {
        factor = largest_prime_factor(600851475143);
        i++;
    }
    printf("%ld\n", factor);
 
    return 0;
}

We compile with GCC and then run the executable (which performs 100,000 iterations of the function) to find that it takes 2.576 seconds, which is about 58% of the runtime of the original Sage code.
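
One closing, language-independent note on the factoring itself: the trial division above keeps dividing until n is reduced to 1, but the loop over odd candidates only needs to run while p*p <= n. Once that bound is passed, n has no remaining factor at or below its square root, so whatever is left (if it is bigger than 1) must itself be prime. Here is a minimal Python sketch of that variant; the function name is my own.

def largest_prime_factor_bounded(n):
    # Trial division that stops as soon as p*p > n; any leftover n > 1 is prime.
    largest = 1
    while n % 2 == 0:
        largest = 2
        n //= 2
    p = 3
    while p*p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 2
    if n > 1:
        largest = n # the leftover cofactor is prime and larger than any factor found so far
    return largest
 
print(largest_prime_factor_bounded(600851475143)) # prints 6857

The payoff depends on the input: for a number whose largest prime factor is huge (a large semiprime, say), the unbounded loop above would be dramatically slower than this version.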