The Evolution of Prime

Makefile mkindex.sh prime1.c prime2.c prime3.c prime4.c prime5.c prime6.c prime7.c prime8.c prime9.c primesum.c

The following is a series of prime number generators. Everything started at an CNGW Chaostreff and with following perl prime number generator:

perl -wle '$_ = 1; (1 x $_) !~ /^(11+)\1+$/ && print while $_++'

While this code is pretty cool, it is not very fast. The following is the "evolution" of a fast prime number generator written in C. The programs prime3.c and prime5.c didn't show any improvements above the preview versions and represent "dead ends" in the evolutionary tree of the program ....

The prime9.c program generates all prime numbers in the range from 2 .. 1 000 000 000 (1 billion) in only 3 seconds on an AMD AthlonXP 2500+ (1.8GHz) system with 333 MHz DDR RAM and an MSI K7N2 Delta motherboard.

This is already pretty fast - but much more is possible. Most of this programs are implementing an Erastostenes-Sieve. An Attkins-Sieve would be much better, but wouldn't be as simple as the programs listed here.

All programs are (C) Clifford Wolf (clifford@clifford.at) and are free under the terms of the GNU General Public License.



The CNGW Prime Number Generator Contest

Whoever is the first who commits a self-written prime number generator which is (a) not derived from my code here and (b) runns faster than an optimized gcc3 compiled prime7.c on my laptop (or any other laptop which can be brought to a CNGW Chaostreff) wins this contest.

Currently the only price is a free drink at the Chaostreff and my respect. Other prices are very welcome, if anyone feels like providing them.

Please send your submissions to cngw@cngw.org.



Makefile
ALL: prime1 prime2 prime3 prime4 prime5 prime6 prime7 prime8 prime9

CC = gcc
CFLAGS = -march=pentium3 -O3
CFLAGS += -fprefetch-loop-arrays
CFLAGS += -D'printf(...)='
# CFLAGS += -ggdb -pg  # gprof --no-graph -bl prime

test:
	@for x in ./prime{1,2,3,4,5,6,7,8,9}; do \
		echo; echo "$$x"; eval "time $$x"; \
	done

clean:
	rm -vf prime1 prime2 prime3 prime4 prime5 prime6 prime7 prime8 prime9
	rm -vf *~ core* gmon.out



prime1.c
/* THE EVOLUTION OF PRIME - prime1.c
 *
 * Copyright (C) 2003 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * First prime number generator. Rather slow.
 */

#include <stdio.h>

#define MAX_PRIMES 10000000

int main() {
	int a, b, i=0, p[MAX_PRIMES];
	for (a=2; a<MAX_PRIMES; a++) {
		for (b=0; b<i; b++) {
			if ( p[b]*p[b] > a ) break;
			if ( a % p[b] == 0 ) goto is_not_prime;
		}
		printf("%d\n", a);
		p[i++] = a;
is_not_prime: ;
	}
	return 0;
}



prime2.c
/* THE EVOLUTION OF PRIME - prime2.c
 *
 * Copyright (C) 2003 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This is already much faster - no imul and idiv anymore.
 */

#include <stdio.h>

#define MAX_PRIMES 10000000
char p[MAX_PRIMES]; /* zero-initialized */

int main() {
	int a, b;

	for (a=2; a<MAX_PRIMES; a++) {
		if ( p[a] ) continue;
		for (b=a+a; b<MAX_PRIMES; b+=a) p[b]=1;
		printf("%d\n", a);
	}
	return 0;
}



prime3.c
/* THE EVOLUTION OF PRIME - prime3.c
 *
 * Copyright (C) 2003 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * Rewrite of the last program in IA-32 assembler. Doesn't improve speed
 * at all. I guess the problem is cpu cache misses.
 */

#include <stdio.h>

#define MAX_PRIMES 10000000
char p[MAX_PRIMES+1]; /* zero-initialized, last byte to make sure loop 1 terminates */

int main() {
	int a,b;
	__asm__(
		"	movl	$1, %%edx		\n"
		" 1:	incl	%%edx			\n"
		"	cmpb	$0, p(%%edx)		\n"
		"	jne	1b			\n"
		"	movl	%%edx, %%eax		\n"
		" 2:	movb	$1, p(%%eax)		\n"
		"	addl	%%edx, %%eax		\n"
		"	cmpl	$10000000, %%eax	\n"
		"	jle	2b			\n"
		"	movb	$0, p(%%edx)		\n"
		"	cmpl	$10000000, %%edx	\n"
		"	jle	1b			\n"
		: /* no output */
		: /* no input */
		: "eax", "edx", "memory"
	);
	for (a=2; a<MAX_PRIMES; a++) if (!p[a]) printf("%d\n", a);
	return 0;
}



prime4.c
/* THE EVOLUTION OF PRIME - prime4.c
 *
 * Copyright (C) 2003 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * Use a bitmask instead of whole bytes to store is-prime flags. This is
 * noticeable faster than the last program.
 */

#include <stdio.h>

#define MAX_PRIMES 10000000
unsigned char p[MAX_PRIMES/8+1]; /* zero-initialized */

int main() {
	int a, b;
	for (a=2; a<MAX_PRIMES; a++) {
		if ( p[a>>3] & (1<<(a&7))) continue;
		for (b=a+a; b<MAX_PRIMES; b+=a)
			p[b>>3]=p[b>>3]|(1<<(b&7));
		printf("%d\n", a);
	}
	return 0;
}



prime5.c
/* THE EVOLUTION OF PRIME - prime5.c
 *
 * Copyright (C) 2003 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * Use prefetch assembler statement and "queue" all access to higher pages
 * thru a fifo. Doesn't improve perfomance at all, it is slower than the
 * last programm due to the overhead of managing the fifo and defining
 * prefetch() to a null-statement doesn't change performance at all. I guess
 * that the alogrithm is just accessing the pages to fast so in any case the
 * cpu spends more time fetching the pages than executing real code. So the
 * solution would be to block multiple accesses to the same page to one unit.
 */

#include <stdio.h>

#define MAX_PRIMES 10000000
unsigned char p[MAX_PRIMES/8+1]; /* zero-initialized */

#define prefetch(x) __asm__ __volatile__ ("prefetchnta (%0)" : : "r"(x))

#define PFP_LEN 64 // must be a power of 2

int main() {
	int a, b, t, p1, p2;
	int pfp[PFP_LEN];

	for (a=2; a<MAX_PRIMES; a++) {
		if ( p[a>>3] & (1<<(a&7))) continue;
		p1=p2=0; b=a+a;
		while (b<MAX_PRIMES && p2 < PFP_LEN-2) {
			prefetch(p+(b>>3));
			pfp[(p2++)&(PFP_LEN-1)] = b;
			b+=a;
		}
		while (b<MAX_PRIMES) {
			prefetch(p+(b>>3));
			pfp[(p2++)&(PFP_LEN-1)] = b;
			t = pfp[(p1++)&(PFP_LEN-1)];
			p[t>>3]=p[t>>3]|(1<<(t&7));
			b+=a;
		}
		prefetch(p+(a>>3));
		while (p1 < p2) {
			t = pfp[(p1++)&(PFP_LEN-1)];
			p[t>>3]=p[t>>3]|(1<<(t&7));
		}
		printf("%d\n", a);
	}

	return 0;
}



prime6.c
/* THE EVOLUTION OF PRIME - prime6.c
 *
 * Copyright (C) 2003 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * Hande even numbers explicitely in the program code. This basically cuts
 * down the memory usage to the half and so should reduce the number of cpu
 * cache misses.
 */

#include <stdio.h>

#define MAX_PRIMES 10000000
unsigned char p[MAX_PRIMES/16+1]; /* zero-initialized */

int main() {
	int a, b;
	printf("%d\n", 2);
	for (a=3; a<MAX_PRIMES; a+=2) {
		if ( p[a>>4] & (1<<((a>>1)&7))) continue;
		for (b=a+a; b<MAX_PRIMES; b+=a)
			if (b&1) p[b>>4]=p[b>>4]|(1<<((b>>1)&7));
		printf("%d\n", a);
	}
	return 0;
}



prime7.c
/* THE EVOLUTION OF PRIME - prime7.c
 *
 * Copyright (C) 2003 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * Use groups of prime numbers and increment them in parallel to keep CPU
 * caches hot. This produces much more machine code, but is about 20%
 * faster then prime6.c on my test system.
 *
 * The defines GROUP_SIZE and PAGE_BLOCK might need some fine-tuning for
 * other systems with different bus speeds and/or cache sizes.
 */

#include <stdio.h>

// max number of primes processed in parallel
#define GROUP_SIZE 2048
int group_c[GROUP_SIZE] __attribute__ ((aligned (4096))); // counters
int group_d[GROUP_SIZE] __attribute__ ((aligned (4096))); // deltas

// number of pages processed in one pass
#define PAGE_BLOCK 16

// 1 page = 4096 bytes = 4096*16 prime-flags
// 4096*16 = 65536 = 0x10000
#define PAGE_SIZE (0x10000*PAGE_BLOCK)
#define PAGE_MASK (~(PAGE_SIZE-1))

#define MAX_PRIMES 10000000
/* zero-initialized, add an entire page padding */
unsigned char p[MAX_PRIMES/16+PAGE_SIZE+1] __attribute__ ((aligned (4096)));

int main() {
	int a=3, b, g;
	int group_window, group_roof;
	int current_page, next_page;
	printf("%d\n", 2);
	while (a<MAX_PRIMES) {
		group_window = a+a > MAX_PRIMES ? MAX_PRIMES : a+a;
		for (group_roof=0; a < group_window && group_roof < GROUP_SIZE; a+=2) {
			if ( p[a>>4] & (1<<((a>>1)&7)) ) continue;
			group_c[group_roof] = a+a;
			group_d[group_roof++] = a;
			printf("%d\n", a);
		}
		if ( ! group_roof ) continue;
		current_page = group_c[0] & PAGE_MASK;

		while ( current_page <= (MAX_PRIMES & PAGE_MASK) ) {
			next_page = (MAX_PRIMES & PAGE_MASK) + PAGE_SIZE;
			for (g=0; g<group_roof; g++) {
				for (b=group_c[g]; current_page == (b & PAGE_MASK); b+=group_d[g])
					if (b&1) p[b>>4]=p[b>>4]|(1<<((b>>1)&7));
				if ( b < next_page ) next_page = b & PAGE_MASK;
				group_c[g]=b;
			}
			current_page = next_page;
		}
	}
	return 0;
}



prime8.c
/* THE EVOLUTION OF PRIME - prime8.c
 *
 * Copyright (C) 2003 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * Replace the most important loop with hand-written assembler code. A
 * few more milliseconds performance won.
 */

#include <stdio.h>

// max number of primes processed in parallel
#define GROUP_SIZE 2048
int group_c[GROUP_SIZE] __attribute__ ((aligned (4096))); // counters
int group_d[GROUP_SIZE] __attribute__ ((aligned (4096))); // deltas

// number of pages processed in one pass
#define PAGE_BLOCK 16

// 1 page = 4096 bytes = 4096*16 prime-flags
// 4096*16 = 65536 = 0x10000
#define PAGE_SIZE (0x10000*PAGE_BLOCK)
#define PAGE_MASK (~(PAGE_SIZE-1))

#define MAX_PRIMES 10000000
/* zero-initialized, add an entire page padding */
unsigned char p[MAX_PRIMES/16+PAGE_SIZE+1] __attribute__ ((aligned (4096)));

int main() {
	int a=3, b, g;
	int group_window, group_roof;
	int current_page, next_page;
	printf("%d\n", 2);
	while (a<MAX_PRIMES) {
		group_window = a+a > MAX_PRIMES ? MAX_PRIMES : a+a;
		for (group_roof=0; a < group_window && group_roof < GROUP_SIZE; a+=2) {
			if ( p[a>>4] & (1<<((a>>1)&7)) ) continue;
			group_c[group_roof] = a+a;
			group_d[group_roof++] = a;
			printf("%d\n", a);
		}
		if ( ! group_roof ) continue;
		current_page = group_c[0] & PAGE_MASK;

		while ( current_page <= (MAX_PRIMES & PAGE_MASK) ) {
			next_page = (MAX_PRIMES & PAGE_MASK) + PAGE_SIZE;
			for (g=0; g<group_roof; g++) {
#if 0
				for (b=group_c[g]; current_page == (b & PAGE_MASK); b+=group_d[g])
					if (b&1) p[b>>4]=p[b>>4]|(1<<((b>>1)&7));
#else
				b = group_c[g];
				__asm__ (
					"	jmp	.Lx2			\n"
					".p2align 4				\n"
					".Lx0:					\n"
					"	movl	%[gc],	%%eax		\n"
					"	testb	$1,	%%al		\n"
					"	je	.Lx1			\n"
					"	shrl	%%eax			\n"
					"	btsl	%%eax,	%[pp]		\n"
					".p2align 4				\n"
					".Lx1:					\n"
					"	addl	%[gd],	%[gc]		\n"
					".Lx2:					\n"
					"	cmpl	%[cp],	%[gc]		\n"
					"	jl	.Lx0			\n"
					: /* outputs  */
						[gc] "+r" (b),
						[pp] "+o" (p)
					: /* inputs   */
						[gd] "r" (group_d[g]),
						[cp] "r" (current_page + PAGE_SIZE)
					: /* clobbers */
						"memory", "eax"
				);
#endif
				if ( b < next_page ) next_page = b & PAGE_MASK;
				group_c[g]=b;
			}
			current_page = next_page;
		}
	}
	return 0;
}



prime9.c
/* THE EVOLUTION OF PRIME - prime9.c
 *
 * Copyright (C) 2003 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * Limit range of prime numbers to process to sqrt(MAX_PRIMES) and start
 * with the square of a prime number (instead of it's double). Use even
 * deltas so we don't need the odd check in the main loop and can also
 * reduce the number of main-loop-cycles.
 *
 * Use the C version of the main loop because todays compiler generate
 * better code than this hand written assembler loop.
 */

#include <stdio.h>
#include <math.h>

// max number of primes processed in parallel
#define GROUP_SIZE 2048
int group_c[GROUP_SIZE] __attribute__ ((aligned (4096))); // counters
int group_d[GROUP_SIZE] __attribute__ ((aligned (4096))); // deltas

// number of pages processed in one pass
#define PAGE_BLOCK 16

// 1 page = 4096 bytes = 4096*8 prime-flags
// 4096*8 = 32768 = 0x8000
#define PAGE_SIZE (0x8000*PAGE_BLOCK)
#define PAGE_MASK (~(PAGE_SIZE-1))

#define MAX_PRIMES 10000000
/* zero-initialized, add an entire page padding */
unsigned char p[MAX_PRIMES/16+PAGE_SIZE+1] __attribute__ ((aligned (4096)));

int main() {
	int a=1, b, g;
	int group_window, group_roof;
	int current_page, next_page;
	int maxp_roof = MAX_PRIMES/2;
	int maxp_sqrt = sqrt(MAX_PRIMES)/2;
	printf("%d\n", 2);
	while (a<maxp_sqrt) {
		group_window = a+a > maxp_sqrt ? maxp_sqrt : a+a;
		for (group_roof=0; a < group_window && group_roof < GROUP_SIZE; a++) {
			if ( p[a>>3] & (1<<(a&7)) ) continue;
			group_d[group_roof] = 2*a+1;
			printf("%d\n", group_d[group_roof]);
			group_c[group_roof++] = 2*a*a+2*a;
		}
		if ( ! group_roof ) continue;
		current_page = group_c[0] & PAGE_MASK;

		while ( current_page <= (maxp_roof & PAGE_MASK) ) {
			next_page = (maxp_roof & PAGE_MASK) + PAGE_SIZE;
			for (g=0; g<group_roof; g++) {
#if 1
				for (b=group_c[g]; b < (current_page + PAGE_SIZE); b+=group_d[g])
					p[b>>3]=p[b>>3]|(1<<(b&7));
#else
				b = group_c[g];
				__asm__ (
					"	jmp	.Lx2			\n"
					".p2align 4				\n"
					".Lx0:					\n"
					"	btsl	%[gc],	%[pp]		\n"
					"	addl	%[gd],	%[gc]		\n"
					".Lx2:					\n"
					"	cmpl	%[cp],	%[gc]		\n"
					"	jl	.Lx0			\n"
					: /* outputs  */
						[gc] "+r" (b),
						[pp] "+o" (p)
					: /* inputs   */
						[gd] "r" (group_d[g]),
						[cp] "r" (current_page + PAGE_SIZE)
					: /* clobbers */
						"memory"
				);
#endif
				if ( b < next_page ) next_page = b & PAGE_MASK;
				group_c[g]=b;
			}
			current_page = next_page;
		}
	}
#ifndef printf
	for (; a<maxp_roof; a++)
		if ( ! (p[a>>3] & (1<<(a&7))) ) printf("%d\n", 2*a+1);
#endif
	return 0;
}



primesum.c
/* Primesum - Calculate the sum of all primes smaller than 4.000.000.000
 *
 * Copyright (C) 2003, 2007 Clifford Wolf
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 */

#include <stdio.h>
#include <math.h>

// Overview of the algorithm:
//
// I'm using a Sieve of Eratosthenes to get the prime numbers. A
// big problem with this algorithm is that in its straigt implementation
// the entire prime-number-bitmap is processed for each found prime
// number. Since the prime number bitmap is by far to large to fit in
// the CPU caches the CPU just spends most of the time waiting for the
// memory bus.
//
// Fortunately, every new prime number found in a Sieve of Eratosthenes
// algorithm will not find any new non-primes smaller than it's square
// (because this non-primes must be the product of smaller primes found
// already earlier). So we can simply 'collect' a larger number of primes
// and then sieving all their multiples in one pass keeping the number of
// CPU cache misses low.
//
// This program has two configuration parameters which can be tuned to
// create a best-fit to the CPU cache sizes: GROUP_SIZE and PAGE_BLOCK.
// Note that performance will drop rapidly if this configuration parameters
// have to high values.
//
// The array holding the prime/non-prime flags (p) has only bits for the
// odd numbers since even number are never prime (besides '2' of course,
// which is handled explicitely in the program). This makes addressing
// the bits in the bitmap harder but reduces the size of the bitmap by
// a factor of two which reduces the memory-bus traffic and so incrases
// performance.
//
// For reference and comparison after changes in the source code:
// The sum of all primes < 4.000.000.000 is: 370.412.807.102.643.725
//
// There might be better ways of generating the sum of all prime numbers
// up to a limit but my focus was on the prime number generation.
// Unfortunately this program is so fast that actually printing out the
// primes would cost much more performance than generating them. So I
// decided to only output the sum of the primes to keep this performance
// overhead low and get a better impression of how fast the actual prime
// number generator works.
//
// This program is pretty well-documented. Usually I'm far less-verbose
// with my sourcecode comments, except there is a complicated algorithm
// in place which really needs some comments to be well understood. Some
// time ago I've created a page on my website where I've tried to explain
// my coding-style philosophy: http://www.clifford.at/style.html

#define DEBUG 0

#define MAX_PRIMES 4000000000U

// Set this to one to enable the x86 assembler code. As usual, the compiler
// will generate much better code than this handwritten stuff. Left in s
// reference if someone else feels like optimizing the main loop in assembler
// some time in the future..
#define X86ASM 0

// number of pages (4k blocks) to processes in one pass. this number of pages
// and the group_c, group_d variables (together with the acutal program code)
// must fit into the CPU cache. This must be a power of two.
#define PAGE_BLOCK 32

// 1 page = 4096 bytes = 4096*8 prime-flags
// 4096*8 = 32768 = 0x8000
#define PAGE_SIZE (0x8000*PAGE_BLOCK)
#define PAGE_MASK (~(PAGE_SIZE-1))

// The actual sieving bitmap. There are only bits for the odd numbers here
// and we add one page padding so we can use less-aggressive (and faster)
// termination conditions in the loops below.
//
//	numerical_value = 2 * bit_position_in_p + 1
//
unsigned char p[MAX_PRIMES/16+PAGE_SIZE+1] __attribute__ ((aligned (4096)));

// GROUP_SIZE is the number of primes we process in parallel. We need
// to mark all multiples of this primes as non-primes. So we have a
// counter here holding the current multiple of the prime number
// (initialized to the square of the prime) and increment deltas (set to
// the double of the prime since we are not interested in the even
// multiples anyway). Both values are 'encoded' to apply to bit positions
// in p[] instead of the numerical values.
//
// Obviously the bit-position delta in group_d[] also is the numeric
// value of the original prime. Don't let yourself get confused by that.
#define GROUP_SIZE 8192
unsigned int group_c[GROUP_SIZE] __attribute__ ((aligned (4096))); // counters
unsigned int group_d[GROUP_SIZE] __attribute__ ((aligned (4096))); // deltas

int main()
{
	// the current bit position. we start with a=1 because the first
	// prime number we are interested in is 3
	unsigned int a=1;

	// maxp_roof is the highest bit-number p[bit] we are interested in
	unsigned int maxp_roof = MAX_PRIMES/2;

	// maxp_sqrt is the highest bit number for collecting primes for sieving
	// higher non-primes. when we have reached this limit the remaining primes
	// up to maxp_roof are already sieved.
	unsigned int maxp_sqrt = sqrt(MAX_PRIMES)/2;

	// the sum of the primes. initialized to 2 because we do not generate this
	// number in the actual algorithm.
	unsigned long long primesum = 2;

#if DEBUG
	printf("size of p[] = %.3f MB\n", sizeof(p) / (1024*1024.));
	printf("size of a page = %.3f kB\n", PAGE_SIZE / (8*1024.));
#endif

	// main loop: do the sieving and add up all primes smaller sqrt(MAX_PRIMES)
	while (a<maxp_sqrt)
	{
		// especially for the first few iterations we can't fill the group_* arrays.
		// this variable contains the number of valid entries in group_c and group_d.
		unsigned int group_roof = 0;

		unsigned int current_page = ~0;

		// Yep - this is an evil goto label! It is used to continue filling the group_*
		// variables when there is still free space in group_* and new pages have been
		// processed.
continue_filling_group_arrays:;

		// the maximum p[] bit position for which we have complete prime information
		// we can safely aquire primes up to this bit position for this group. This
		// is recalculated whenever continue_filling_group_arrays is jumped to.
		unsigned int group_window = 2*a*a+2*a;
		if (group_window > current_page)
			group_window = current_page;
		if (group_window > maxp_sqrt)
			group_window = maxp_sqrt;

		// increment a and look for new prime numbers (unset bits in p[]) smaller than 
		// the current group_window value and not more than GROUP_SIZE. Also add all
		// found primes to primesum and set current_page to the smallest group_c[]
		// entry found.
		for (; a < group_window && group_roof < GROUP_SIZE; a++) {
			if ( p[a>>3] & (1<<(a&7)) ) continue;
			primesum += 2*a+1;
			group_d[group_roof] = 2*a+1;
			group_c[group_roof] = 2*a*a+2*a;
			if (current_page > (group_c[0] & PAGE_MASK))
				current_page = group_c[0] & PAGE_MASK;
			group_roof++;
		}

#if DEBUG
		printf("num:%d window_size:%d current_page:%d\n", group_roof, group_window, current_page);
#endif

		// count up the group_c[] variables 'parallel' and mark all the found non-primes
		// by setting their bit in p[]. The 'parallel' increment is done by incrementing
		// the one group_c[] element after another until they exceed the current page.
		// After that the current_page variable is set to the page with the smallest
		// group_c[] value pointing to and group_c[] is incremented again until the end
		// of p[] is reached. Filling group_* is continued if there is free space and
		// we haven't yet reached maxp_sqrt.
		while ( current_page <= (maxp_roof & PAGE_MASK) )
		{
			unsigned int b, g;
			unsigned int next_page = (maxp_roof & PAGE_MASK) + PAGE_SIZE;
			for (g=0; g<group_roof; g++)
			{
#if !X86ASM
				for (b=group_c[g]; b < (current_page + PAGE_SIZE); b+=group_d[g])
					p[b>>3] |= (1<<(b&7));
#else
				b = group_c[g];
				__asm__ (
					"	jmp	.Lx2			\n"
					".p2align 4				\n"
					".Lx0:					\n"
					"	btsl	%[gc],	%[pp]		\n"
					"	addl	%[gd],	%[gc]		\n"
					".Lx2:					\n"
					"	cmpl	%[cp],	%[gc]		\n"
					"	jl	.Lx0			\n"
					: /* outputs  */
						[gc] "+r" (b),
						[pp] "+o" (p)
					: /* inputs   */
						[gd] "r" (group_d[g]),
						[cp] "r" (current_page + PAGE_SIZE)
					: /* clobbers */
						"memory"
				);
#endif
				if ( b < next_page )
					next_page = b & PAGE_MASK;
				group_c[g]=b;
			}
			current_page = next_page;

			if (group_roof < GROUP_SIZE && group_window < maxp_sqrt)
				goto continue_filling_group_arrays;
		}

#if DEBUG
		printf("finished main loop pass\n");
#endif
	}

#if DEBUG
	printf("Now make a scan over the remainging numbers\n");
#endif

	// add up the remaining primes larger or equal sqrt(MAX_PRIMES)
	for (; a<maxp_roof; a++)
		if ( !(p[a>>3] & (1<<(a&7))) )
			primesum += 2*a+1;

	printf("%Ld\n", primesum);

	return 0;
}




Credits
Clifford Wolf (clifford[at]clifford.at):
Most of the coding work and initial perl script.

Peter Meerwald (pmeerw[at]cosy.sbg.ac.at):
Did some final optimizations in prime8.c.

Jule P Riede (jule[at]quintessenz.org):
Wrote a straight-forward prime number generator in perl and so started
the whole "this program does the job faster" thread.