[  Tags    10, c, euler, haskell, optimisation, optimise, optimization, optimize, perl, primes, problem, project, speed, summary  ] 
[  Current Location 
  Home  ] 
[  Current Music 
  Klein Four Group  Finite Simple Group of Order Two  ] 
zerothorder told
how
he found a solution for Project Euler's Problem 10 in Haskell. The
problem is "Find the sum of all primes less than 2,000,000". It is given
here below:
primes :: [Integer]
primes = 2 : filter isPrime [3, 5 ..]
where
 only check divisibility of the numbers less than the square root of n
isPrime n = all (not . divides n) $ takeWhile (\p > p*p <= n) primes
divides n p = n `mod` p == 0
result = sum $ takeWhile (< 2000000) primes
main = do putStrLn( show result )
He says "If this doesn't give you a nerdgasm, I don't know what will.". The
problem is that this nerdgasm will last a long time. Benchmarking this
program gives that 20 iterations of it run at 310 seconds  less
than  about 15 seconds each (on my Pentium 4 2.4GHz machine running
Mandriva Linux Cooker). So next I tried a better Haskell
implmenetation that I recalled from a thread I started in the Haskell Café
mailing list about implementing
a sieve of
Eratosthenes in Haskell:
import Data.Int
primes :: Int64 > [Int64]
primes how_much = sieve [2..how_much] where
sieve (p:x) =
p : (if p <= mybound
then sieve (remove (p*p) x)
else x) where
remove what (a:as)  what > how_much = (a:as)
 a < what = a:(remove what as)
 a == what = (remove (what+step) as)
 a > what = a:(remove (what+step) as)
remove what [] = []
step = (if (p == 2) then p else (2*p))
sieve [] = []
mybound = ceiling(sqrt(fromIntegral how_much))
main = print (length (primes 1000000))
main = print (sum (primes 2000000))
This does not involve costly operations such as modulo or division
and 20 iterations of it run at 135 wallclocks seconds  over two times faster
than zeroth's Haskell version.
Now how about Perl? Since Perl has assignment, we have the advantage that
we can create a vector of bits that we will mark with the primes by iterating
over all numbers up to the root of the limit. Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
use Math::BigInt lib => 'GMP';
my $limit = 2_000_000;
my $primes_bitmask = "";
my $loop_to = int(sqrt($limit));
my $sum = 0;
my $total_sum = Math::BigInt>new('0');
for my $p (2 .. $loop_to)
{
if (vec($primes_bitmask, $p, 1) == 0)
{
$sum += $p;
my $i = $p * $p;
while ($i < $limit)
{
vec($primes_bitmask, $i, 1) = 1;
}
continue
{
$i += $p;
}
}
}
for my $p ($loop_to .. $limit)
{
if (vec($primes_bitmask, $p, 1) == 0)
{
if (($sum += $p) > (1 << 30))
{
$total_sum += $sum;
$sum = 0;
}
}
}
$total_sum += $sum;
print "$total_sum\n";
20 runs of it run at 95 walclock seconds  even faster than the Haskell
version. But it gets better. Since all the primes we encounter greater than
2 are not even, we can create a map of their pseduohalves and conserve on
memory and iterations. This is the Perl version:
#!/usr/bin/perl
use strict;
use warnings;
use Math::BigInt lib => 'GMP';
my $limit = 2_000_000;
my $primes_bitmask = "";
my $loop_to = (int(sqrt($limit)))>>1;
my $half_limit = ($limit1)>>1;
my $sum = 0+2;
my $total_sum = Math::BigInt>new('0');
for my $half (1 .. $loop_to)
{
if (vec($primes_bitmask, $half, 1) == 0)
{
my $p = (($half<<1)+1);
$sum += $p;
my $i = ($p * $p)>>1;
while ($i < $limit)
{
vec($primes_bitmask, $i, 1) = 1;
}
continue
{
$i += $p;
}
}
}
for my $half ($loop_to .. $half_limit)
{
if (vec($primes_bitmask, $half, 1) == 0)
{
if (($sum += (($half<<1)+1)) > (1 << 30))
{
$total_sum += $sum;
$sum = 0;
}
}
}
$total_sum += $sum;
print "$total_sum\n";
Running this 20 times takes 73 wallclock seconds, close to half that of my
Haskell version.
Then I wondered how long C will take. Here is a C implementation without
the halving:
#include <string.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#define limit 2000000
int8_t bitmask[(limit+1)/8];
int main(int argc, char * argv[])
{
int p, i;
int mark_limit;
long long sum = 0;
memset(bitmask, '\0', sizeof(bitmask));
mark_limit = (int)sqrt(limit);
for (p=2 ; p <= mark_limit ; p++)
{
if (! ( bitmask[p>>3]&(1 << (p&(81))) ) )
{
/* It is a prime. */
sum += p;
for (i=p*p;i<=limit;i+=p)
{
bitmask[i>>3] = (1 << (i&(81)));
}
}
}
for (; p <= limit; p++)
{
if (! ( bitmask[p>>3]&(1 << (p&(81))) ) )
{
sum += p;
}
}
printf("%lli\n", sum);
return 0;
}
This was too fast to measure with 20 runs alone, so 500 runs of it took 15
seconds, two or three orders of magnitude faster than the fastest Haskell
or Perl versions. But naturally, we can apply the halving paradigm there too:
#include <string.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#define limit 2000000
int8_t bitmask[(limit+1)/8/2];
int main(int argc, char * argv[])
{
int half, p, i;
int half_limit;
int loop_to;
long long sum = 0 + 2;
memset(bitmask, '\0', sizeof(bitmask));
loop_to=(((int)(sqrt(limit)))>>1);
half_limit = (limit1)>>1;
for (half=1 ; half <= loop_to ; half++)
{
if (! ( bitmask[half>>3]&(1 << (half&(81))) ) )
{
/* It is a prime. */
p = (half << 1)+1;
sum += p;
for (i = ((p*p)>>1) ; i < half_limit ; i+=p )
{
bitmask[i>>3] = (1 << (i&(81)));
}
}
}
for( ; half < half_limit ; half++)
{
if (! ( bitmask[half>>3]&(1 << (half&(81))) ) )
{
sum += (half<<1)+1;
}
}
printf("%lli\n", sum);
return 0;
}
500 runs of it take 10 wallclock seconds  54.35 times per second,
and 50% better than the previous C version. And I still haven't applied
platformspecific gcc optimisations.
I should also note that the executables generated by ghc are extremely
large in comparison to their C ones:
$ ls l c_* haskell_*
rwxrxrx 1 shlomi shlomi 6082 20091204 07:46 c_mine
rwxrxrx 1 shlomi shlomi 6103 20091204 07:46 c_mine_half
rwxrxrx 1 shlomi shlomi 6092 20091204 07:46 c_mine_micro_opt
rwxrxrx 1 shlomi shlomi 796825 20091204 07:56 haskell_mine
rwxrxrx 1 shlomi shlomi 571717 20091204 07:46 haskell_zeroth
c_mine_half is less than 1% the size of haskell_mine (and
runs faster). When talking about this to other people, they said that Haskell
has a very optimised primes sequence generator, which I can try using (which
should be over 3 times as fast), and that it has two kinds of integers, which
the other type is faster, and that it has a better way to emulate assignment.
But the bottom line is that the naïve and intuitive way to write such
programs in Haskell is underperformant, even in comparison to Perl, and 100
or 1,000 times as much in comparison to C.
I've written this as a separate post and not as a comment to the original
blog post because I'm very limited with the markup in the commenting there
(I'm going to post a comment there with a link to this blog post, though).
I should note that you can find all the code I mentioned inside
a
dedicated Github repository, and you can experiment with it further.
