## [Tech Blog] Accelaration of Collatz conjecture

3/17/2015

The Wikipedia entry of the Collatz conjecture describes the simple algorithm used to generate so-called hailstone sequences:

Take any natural number n. If n is even, divide it by 2 to get n / 2. If n is odd, multiply it by 3 and add 1 to obtain 3n + 1. Repeat the process (which has been called “Half Or Triple Plus One”, or HOTPO) indefinitely. The conjecture is that no matter what number you start with, you will always eventually reach 1.

In the examples, it states:

Numbers with a total stopping time longer than any smaller starting value form a sequence beginning with:

1, 2, 3, 6, 7, 9, 18, 25, 27, 54, 73, 97, 129, 171, 231, 313, 327, 649, 703, 871, 1161, 2223, 2463, 2919, 3711, 6171, … (sequence A006877 in OEIS).

By looking up the A006877 sequence (“In the `3x+1′ problem, these values for the starting value set new records for number of steps to reach 1″) in OEIS, and following the link embedded in “T. D. Noe, Table of n, a(n) for n = 1..130 (from Eric Roosendaal’s data)” (under LINKS), one finds this list which — supposedly — lists the numbers with a larger stopping count than of any smaller number:

2

3

6

7

9

18

25

27

54

73

97

129

171

231

313

327

649

703

871

1161

2223

2463

2919

3711

6171

10971

13255

17647

23529

26623

34239

35655

52527

77031

106239

142587

156159

216367

230631

410011

511935

626331

837799

1117065

1501353

1723519

2298025

3064033

3542887

3732423

5649499

6649279

8400511

11200681

14934241

15733191

31466382

36791535

63728127

127456254

169941673

226588897

268549803

537099606

670617279

1341234558

1412987847

1674652263

2610744987

4578853915

4890328815

9780657630

12212032815

12235060455

13371194527

17828259369

31694683323

63389366646

75128138247

133561134663

158294678119

166763117679

202485402111

404970804222

426635908975

568847878633

674190078379

881715740415

989345275647

1122382791663

1444338092271

1899148184679

2081751768559

2775669024745

3700892032993

3743559068799

7487118137598

7887663552367

10516884736489

14022512981985

19536224150271

26262557464201

27667550250351

38903934249727

48575069253735

51173735510107

60650353197163

80867137596217

100759293214567

134345724286089

223656998090055

397612441048987

530149921398649

706866561864865

942488749153153

1256651665537537

1675535554050049

2234047405400065

2978729873866753

3586720916237671

4320515538764287

4861718551722727

6482291402296969

7579309213675935

12769884180266527

17026512240355369

22702016320473825

45404032640947650

46785696846401151

Here is the naive implementation of the hailstone sequence length generation algorithm:

static inline int hailstone(unsigned long n) { int count = 0; while (n > 1) { if (n & 1) n = 3 * n + 1; else n >>= 1; ++count; } return count; }

By mapping this function to the elements of the above list, one obtains the following list:

2: 1

3: 7

6: 8

7: 16

9: 19

18: 20

25: 23

27: 111

54: 112

73: 115

97: 118

129: 121

171: 124

231: 127

313: 130

327: 143

649: 144

703: 170

871: 178

1161: 181

2223: 182

2463: 208

2919: 216

3711: 237

6171: 261

10971: 267

13255: 275

17647: 278

23529: 281

26623: 307

34239: 310

35655: 323

52527: 339

77031: 350

106239: 353

142587: 374

156159: 382

216367: 385

230631: 442

410011: 448

511935: 469

626331: 508

837799: 524

1117065: 527

1501353: 530

1723519: 556

2298025: 559

3064033: 562

3542887: 583

3732423: 596

5649499: 612

6649279: 664

8400511: 685

11200681: 688

14934241: 691

15733191: 704

31466382: 705

36791535: 744

63728127: 949

127456254: 950

169941673: 953

226588897: 956

268549803: 964

537099606: 965

670617279: 986

1341234558: 987

1412987847: 1000

1674652263: 1008

2610744987: 1050

4578853915: 1087

4890328815: 1131

9780657630: 1132

12212032815: 1153

12235060455: 1184

13371194527: 1210

17828259369: 1213

31694683323: 1219

63389366646: 1220

75128138247: 1228

133561134663: 1234

158294678119: 1242

166763117679: 1255

202485402111: 1307

404970804222: 1308

426635908975: 1321

568847878633: 1324

674190078379: 1332

881715740415: 1335

989345275647: 1348

1122382791663: 1356

1444338092271: 1408

1899148184679: 1411

2081751768559: 682

2775669024745: 685

3700892032993: 688

3743559068799: 794

7487118137598: 795

7887663552367: 808

10516884736489: 811

14022512981985: 814

19536224150271: 1585

26262557464201: 833

27667550250351: 846

38903934249727: 1617

48575069253735: 1638

51173735510107: 1651

60650353197163: 1659

80867137596217: 1662

100759293214567: 1820

134345724286089: 1823

223656998090055: 1847

397612441048987: 1853

530149921398649: 1856

706866561864865: 1859

942488749153153: 1862

1256651665537537: 1865

1675535554050049: 1868

2234047405400065: 1871

2978729873866753: 1874

3586720916237671: 1895

4320515538764287: 458

4861718551722727: 470

6482291402296969: 473

7579309213675935: 512

12769884180266527: 445

17026512240355369: 448

22702016320473825: 451

45404032640947650: 452

46785696846401151: 738

As it is easily visible, the first list (which is the left column in the second list) is “broken”: up to and including 1899148184679, the sequence lengths (right column in the second list) are indeed monotonically growing, but afterwards there are some numbers which have shorter sequence lengths than previous ones.

The problem becomes recalculating the sequence lengths for all odd numbers above 1899148184679 and checking if the current sequence length is greater than all previous ones.

Unless otherwise noted, tests were run on an Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz with 4 GiB RAM running Ubuntu Linux 14.04

Using the naive implementation, one obtains ~1.37 M numbers checked per second (NCPS).

Using an inlined assembly language implementation of the function

static inline dword hailstone(qword n) { dword retval; asm volatile("mov %0,%%rax" : : "r"(n) : "rax"); asm volatile(".intel_syntax noprefix"); asm volatile("mov rsi,rax" : : : "rsi"); asm volatile("xor r8d,r8d" : : : "r8"); asm volatile(".align 16"); asm volatile("top:"); asm volatile("shr rax,1" : : : "rax"); asm volatile("test rsi,1"); asm volatile("lea rsi,[rsi + 2 * rsi + 1]" : : : "rsi"); asm volatile("cmovz rsi,rax" : : : "rsi"); asm volatile("inc r8d" : : : "r8"); asm volatile("mov rax,rsi" : : : "rax"); asm volatile("cmp rsi,1"); asm volatile("jnz top"); asm volatile(".att_syntax"); asm volatile("mov %%r8d,%0": "=r"(retval)); return retval; }

which uses conditional move instructions to avoid branch misprediction penalties improves the rate to ~2.02 M NCPS — a 48% improvement.

If we’re interested only in the sequence lengths of odd numbers, then the following function can be made to calculate them:

typedef unsigned int dword; /* 32 bit unsigned integer */ typedef unsigned long qword; /* 64 bit unsigned integer */ static inline dword bsfq(qword n) { qword retval = -1; asm volatile("bsfq %1,%0" : "=r"(retval) : "r"(n)); return retval; } static inline int hailstone(unsigned long n) { int count = 0; do { /* n is odd */ dword shifter = bsfq(n = 3 * n + 1); /* accumulate the number of divisions by 2 and the initial tripling + 1 step */ count += shifter + 1; /* n is even; do the required number of divisions by two */ n >>= shifter; } while (n > 1); return count; }

Using this function results in ~6.58 M NCPS — a 380% improvement over the naive implementation.

As all hailstone sequences eventually — presumably — end with 1, at least some of the numbers in the sequence are smaller than the initial number. Accordingly, it is possible to cache (precalculate and store) sequence lengths for some set of numbers (starting with 1) and once a particular sequence reaches a number which is in cache, accumulate the precalculated sequence length into the current count and terminate the calculation.

It is easily seen that numbers in hailstone sequences aren’t divisible by 3. Accordingly, one stores in the cache the sequence lengths of only those numbers that aren’t divisible by 3. This can be done by placing n’s sequence length in the cache at index n/3.

#define CACHEFILE_SIZE 1012645888 word cache[CACHEFILE_SIZE / sizeof(word)]; void load_cache(void) { FILE *fin = fopen("./hailstone-cache.bin", "rb"); if (CACHEFILE_SIZE != fread(cache, sizeof(byte), CACHEFILE_SIZE, fin)) { fprintf(stderr, "Read error on cache file\n"); exit(-1); } fclose(fin); } static inline dword bsfq(qword n) { qword retval = -1; asm volatile("bsfq %1,%0" : "=r"(retval) : "r"(n)); return retval; } static inline int hailstone(unsigned long n) { int count = 0; do { dword shifter = bsfq(n = 3 * n + 1); count += shifter + 1; n >>= shifter; /* here n is odd and not divisible by 3 */ if (n < sizeof(cache) / sizeof(word) * 3) { count += cache[n / 3]; n = 1; } } while (n > 1); return count; }

Using this ~1 GB sized cache (don’t ask why is the odd size) results in ~17.21 M NCPS — a 1156% improvement over the naive implementation.

The next step is to parallelize the sequence length calculations.

#define qCLKS_PER_SEC 3000000000UL /* processor clock rate */ static inline qword rdtsc(void) { register qword acc128lo asm ("rax"); register qword acc128hi asm ("rdx"); asm volatile ("rdtsc" : "=r"(acc128lo), "=r"(acc128hi)); return (acc128hi << 32) | acc128lo; } /* max_cnt has to be initialized to a nonzero value so it gets put in the DATA section instead of the COMMON section */ volatile unsigned max_cnt = 1; unsigned secs = 1; unsigned word cache[<# of words in cache file>]; typedef struct { qword starting_n; qword current_n; qword cnts; qword cached_cnts; dword increment; } hailstone_t; void *hailstone(void *parm) { qword i; ((hailstone_t *)parm)->cnts = 0; ((hailstone_t *)parm)->cached_cnts = 0; for (i = ((hailstone_t *)parm)->starting_n; 1; i += ((hailstone_t *)parm)->increment) { qword n = i; dword cnt = 0; ((hailstone_t *)parm)->current_n = n; do { qword shifter, idx; n += 2 * n + 1; ++cnt; shifter = bsfq(n); n >>= shifter; cnt += shifter; idx = n / 3; if (idx < sizeof(cache) / sizeof(*cache)) { cnt += cache[idx]; ((hailstone_t *)parm)->cached_cnts += cache[idx]; n = 1; } } while (n != 1); ((hailstone_t *)parm)->cnts += cnt; if (cnt > max_cnt) { max_cnt = cnt; printf("%lu: %u\n", i, cnt); fflush(stdout); } } }

Which can be invoked by:

int main(int argc, char *argv[]) { FILE *fin; dword nthreads = atoi(argv[1]); dword i; pthread_t *pThreads; hailstone_t *pParms; qword next_sec; struct timespec req = { 0, 50000000 }, rem; if (!nthreads) nthreads = 1; if (!(fin = fopen(CACHEFILE, "rb"))) { fprintf(stderr, "Couldn't open cache file '%s'\n", CACHEFILE); exit(-2); } if (1 != fread(cache, sizeof(cache), 1, fin)) { fprintf(stderr, "Read error on cache file '%s'\n", CACHEFILE); exit(-2); } fclose(fin); if (!(pThreads = (pthread_t *)malloc(nthreads * sizeof(pthread_t)))) { fprintf(stderr, "Couldn't allocate %lu bytes\n", nthreads * sizeof(pthread_t)); exit(-3); } if (!(pParms = (hailstone_t *)malloc(nthreads * sizeof(hailstone_t)))) { fprintf(stderr, "Couldn't allocate %lu bytes\n", nthreads * sizeof(hailstone_t)); exit(-4); } next_sec = rdtsc() + qCLKS_PER_SEC; for (i = 0; i < nthreads; ++i) { int rc; pParms[i].starting_n = STARTING_N + i * 2; pParms[i].increment = nthreads * 2; if (rc = pthread_create(pThreads + i, NULL, hailstone, (void *)(pParms + i))) { fprintf(stderr,"Error - pthread_create() return code: %d\n", rc); exit(-100); } } while (1) { nanosleep(&req, &rem); if (rdtsc() >= next_sec) { qword min_n = pParms[0].current_n, all_cnts = pParms[0].cnts, all_cached_cnts = pParms[0].cached_cnts; dword h, m, s; for (i = 1; i < nthreads; ++i) { if (pParms[i].current_n < min_n) min_n = pParms[i].current_n; all_cnts += pParms[i].cnts; all_cached_cnts += pParms[i].cached_cnts; } next_sec += qCLKS_PER_SEC; s = secs++; m = s / 60; s %= 60; h = m / 60; m %= 60; fprintf(stderr, "\r%02u:%02u:%02u %lu %.0f/sec (%4.1f%%)", h, m, s, min_n, (min_n - STARTING_N + 1) / 2. / secs, 100. * all_cached_cnts / all_cnts); } } /* Last thing that main() should do */ pthread_exit(NULL); }

Synchronized access to max_cnt is explicitly avoided as it presents a very significant performance hit (approx. one-third loss of performance) on parallel processing.

Using this (the final code) to run three threads results in ~29 M NCPS — an approx. 2000% improvement over the naive implementation.

On an Intel 4790K CPU with 4 physical cores and 4 virtual (HyperThreading) cores running at 4.6 GHz clock rate, using 32 GiB Crucial Ballistix DDR3 RAM overclocked to 2 GHz the code runs at 80 M NCPS. Unfortunately, a simple calculation shows that it would take 10 years at that rate to check all odd numbers up to 46785696846401151 (the last — wrong — element of the original list).