[Tech Blog] Acceleration of Collatz conjecture


The Wikipedia entry on the Collatz conjecture describes the simple algorithm used to generate so-called hailstone sequences:

Take any natural number n. If n is even, divide it by 2 to get n / 2. If n is odd, multiply it by 3 and add 1 to obtain 3n + 1. Repeat the process (which has been called “Half Or Triple Plus One”, or HOTPO) indefinitely. The conjecture is that no matter what number you start with, you will always eventually reach 1.

In the examples, it states:

Numbers with a total stopping time longer than any smaller starting value form a sequence beginning with:

1, 2, 3, 6, 7, 9, 18, 25, 27, 54, 73, 97, 129, 171, 231, 313, 327, 649, 703, 871, 1161, 2223, 2463, 2919, 3711, 6171, … (sequence A006877 in OEIS).

By looking up sequence A006877 (“In the '3x+1' problem, these values for the starting value set new records for number of steps to reach 1”) in OEIS, and following the link “T. D. Noe, Table of n, a(n) for n = 1..130 (from Eric Roosendaal’s data)” under LINKS, one finds this list, which supposedly contains the numbers whose stopping count is larger than that of any smaller number:


Here is the naive implementation of the hailstone sequence length generation algorithm:

static inline int hailstone(unsigned long n)
{
    int count = 0;

    while (n > 1) {
        if (n & 1)
            n = 3 * n + 1;  /* odd: triple plus one */
        else
            n >>= 1;        /* even: halve */
        ++count;
    }
    return count;
}

By mapping this function to the elements of the above list, one obtains the following list:

1: 0
2: 1
3: 7
6: 8
7: 16
9: 19
18: 20
25: 23
27: 111
54: 112
73: 115
97: 118
129: 121
171: 124
231: 127
313: 130
327: 143
649: 144
703: 170
871: 178
1161: 181
2223: 182
2463: 208
2919: 216
3711: 237
6171: 261
10971: 267
13255: 275
17647: 278
23529: 281
26623: 307
34239: 310
35655: 323
52527: 339
77031: 350
106239: 353
142587: 374
156159: 382
216367: 385
230631: 442
410011: 448
511935: 469
626331: 508
837799: 524
1117065: 527
1501353: 530
1723519: 556
2298025: 559
3064033: 562
3542887: 583
3732423: 596
5649499: 612
6649279: 664
8400511: 685
11200681: 688
14934241: 691
15733191: 704
31466382: 705
36791535: 744
63728127: 949
127456254: 950
169941673: 953
226588897: 956
268549803: 964
537099606: 965
670617279: 986
1341234558: 987
1412987847: 1000
1674652263: 1008
2610744987: 1050
4578853915: 1087
4890328815: 1131
9780657630: 1132
12212032815: 1153
12235060455: 1184
13371194527: 1210
17828259369: 1213
31694683323: 1219
63389366646: 1220
75128138247: 1228
133561134663: 1234
158294678119: 1242
166763117679: 1255
202485402111: 1307
404970804222: 1308
426635908975: 1321
568847878633: 1324
674190078379: 1332
881715740415: 1335
989345275647: 1348
1122382791663: 1356
1444338092271: 1408
1899148184679: 1411
2081751768559: 682
2775669024745: 685
3700892032993: 688
3743559068799: 794
7487118137598: 795
7887663552367: 808
10516884736489: 811
14022512981985: 814
19536224150271: 1585
26262557464201: 833
27667550250351: 846
38903934249727: 1617
48575069253735: 1638
51173735510107: 1651
60650353197163: 1659
80867137596217: 1662
100759293214567: 1820
134345724286089: 1823
223656998090055: 1847
397612441048987: 1853
530149921398649: 1856
706866561864865: 1859
942488749153153: 1862
1256651665537537: 1865
1675535554050049: 1868
2234047405400065: 1871
2978729873866753: 1874
3586720916237671: 1895
4320515538764287: 458
4861718551722727: 470
6482291402296969: 473
7579309213675935: 512
12769884180266527: 445
17026512240355369: 448
22702016320473825: 451
45404032640947650: 452
46785696846401151: 738

As is easily seen, the first list (the left column of the second list) is “broken”: up to and including 1899148184679 the sequence lengths (right column of the second list) grow monotonically, but beyond that point several numbers have shorter sequence lengths than earlier entries.
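The post attributes the shorter counts to the list itself. One alternative explanation that may be worth ruling out (an assumption, not something verified here) is that 3n + 1 exceeds what a 64-bit unsigned long can hold for the larger starting values, in which case the naive code wraps around silently. A minimal sketch of an overflow-aware step counter follows; the name hailstone_checked and its return convention are hypothetical.

```c
#include <stdint.h>

/*
 * Overflow-aware variant of the naive step counter (illustrative sketch,
 * not part of the original post). Returns 0 and stores the step count in
 * *steps on success, or -1 if 3n + 1 would not fit in 64 bits.
 */
int hailstone_checked(uint64_t n, int *steps)
{
    int count = 0;

    while (n > 1) {
        if (n & 1) {
            if (n > (UINT64_MAX - 1) / 3)
                return -1;      /* 3n + 1 would wrap around */
            n = 3 * n + 1;
        } else {
            n >>= 1;
        }
        ++count;
    }
    *steps = count;
    return 0;
}
```

Running the suspicious entries through such a function would show whether any of them hit the -1 path; if they do, a wider integer type would be needed before the counts can be trusted.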

The problem becomes recalculating the sequence lengths for all odd numbers above 1899148184679 and checking if the current sequence length is greater than all previous ones.
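The record search itself can be sketched in a few lines. The version below is only an illustration: it scans every number in a small range starting from scratch (the post scans only odd numbers above the last trusted record), and the names steps_to_one and find_records are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain step count, same rule as the naive implementation. */
int steps_to_one(uint64_t n)
{
    int count = 0;

    while (n > 1) {
        n = (n & 1) ? 3 * n + 1 : n >> 1;
        ++count;
    }
    return count;
}

/*
 * Store in out[] every n in [first, last] whose step count exceeds that of
 * all earlier n in the range; returns how many were found (at most max_out).
 */
size_t find_records(uint64_t first, uint64_t last, uint64_t *out, size_t max_out)
{
    int best = -1;      /* below any real count, so the first n is a record */
    size_t found = 0;

    for (uint64_t n = first; n <= last && found < max_out; ++n) {
        int c = steps_to_one(n);
        if (c > best) {
            best = c;
            out[found++] = n;
        }
    }
    return found;
}
```

Called as find_records(1, 30, out, 16), this yields 1, 2, 3, 6, 7, 9, 18, 25, 27, which matches the start of the sequence quoted from Wikipedia above.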

Unless otherwise noted, tests were run on an Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz with 4 GiB RAM running Ubuntu Linux 14.04.

Using the naive implementation, one obtains ~1.37 M numbers checked per second (NCPS).

Using an inlined assembly language implementation of the function

static inline dword hailstone(qword n)
{
    dword retval;

    /* rax: n / 2, rsi: current n, r8d: step counter.
       The statements are kept separate; this relies on the compiler not
       using rax, rsi or r8 between them. */
    asm volatile("mov    %0,%%rax" : : "r"(n) : "rax");
    asm volatile(".intel_syntax noprefix");
    asm volatile("mov    rsi,rax" : : : "rsi");
    asm volatile("xor    r8d,r8d" : : : "r8");
    asm volatile(".align 16");
    asm volatile("top:");
    asm volatile("shr    rax,1" : : : "rax");
    asm volatile("test    rsi,1");
    /* lea does not touch the flags, so cmovz still sees the test result */
    asm volatile("lea    rsi,[rsi + 2 * rsi + 1]" : : : "rsi");
    asm volatile("cmovz    rsi,rax" : : : "rsi");
    asm volatile("inc    r8d" : : : "r8");
    asm volatile("mov    rax,rsi" : : : "rax");
    asm volatile("cmp    rsi,1");
    asm volatile("jnz    top");
    asm volatile(".att_syntax");
    asm volatile("mov    %%r8d,%0": "=r"(retval));

    return retval;
}

which uses conditional move instructions to avoid branch misprediction penalties improves the rate to ~2.02 M NCPS — a 48% improvement.
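The same idea, selecting between n / 2 and 3n + 1 without a data-dependent branch, can be expressed in plain C with a bit mask. This is only a sketch of the technique: the function name hailstone_branchless is hypothetical, no timing is claimed for it, and it assumes n >= 1.

```c
#include <stdint.h>

/* Step count with a mask-based select instead of an if/else on parity. */
int hailstone_branchless(uint64_t n)
{
    int count = 0;

    while (n != 1) {
        uint64_t half   = n >> 1;
        uint64_t triple = 3 * n + 1;
        uint64_t mask   = -(n & 1);     /* all ones if n is odd, zero if even */

        n = (triple & mask) | (half & ~mask);
        ++count;
    }
    return count;
}
```

Both candidates are computed on every iteration and one is discarded, which is the same trade the cmovz version makes; the only remaining branch is the loop condition.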

If we’re interested only in the sequence lengths of odd numbers, then the following function can be made to calculate them:

typedef unsigned int dword; 	/* 32 bit unsigned integer */
typedef unsigned long qword; 	/* 64 bit unsigned integer */

/* index of the lowest set bit of n (n must be nonzero) */
static inline dword bsfq(qword n)
{
    qword retval = -1;
    asm volatile("bsfq        %1,%0" : "=r"(retval) : "r"(n));
    return retval;
}

static inline int hailstone(unsigned long n)
{
    int count = 0;

    do {
        /* n is odd */
        dword shifter = bsfq(n = 3 * n + 1);
        /* accumulate the number of divisions by 2 and the initial
        tripling + 1 step */
        count += shifter + 1;
        /* n is even; do the required number of divisions by two */
        n >>= shifter;
    } while (n > 1);
    return count;
}

Using this function results in ~6.58 M NCPS — a 380% improvement over the naive implementation.
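For readers who prefer to avoid inline assembly, the same loop can be sketched with the GCC/Clang builtin __builtin_ctzll, which counts trailing zero bits. The function name hailstone_odd is hypothetical, the argument is assumed to be odd and greater than 1, and no performance figure is claimed for this variant.

```c
#include <stdint.h>

/*
 * Step count for an odd starting value n > 1, using the compiler's
 * count-trailing-zeros builtin in place of a hand-written bsfq wrapper.
 */
int hailstone_odd(uint64_t n)
{
    int count = 0;

    do {
        n = 3 * n + 1;                      /* n was odd, so this is even */
        int shifter = __builtin_ctzll(n);   /* number of trailing zero bits */
        count += shifter + 1;               /* the halvings plus the 3n + 1 step */
        n >>= shifter;                      /* n is odd again */
    } while (n > 1);
    return count;
}
```

For example, hailstone_odd(27) gives 111, the same value as in the table above.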

As all hailstone sequences eventually — presumably — end with 1, at least some of the numbers in the sequence are smaller than the initial number. Accordingly, it is possible to cache (precalculate and store) sequence lengths for some set of numbers (starting with 1) and once a particular sequence reaches a number which is in cache, accumulate the precalculated sequence length into the current count and terminate the calculation.

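It is easily seen that, after the first 3n + 1 step, the numbers in a hailstone sequence are never divisible by 3: 3n + 1 leaves remainder 1 modulo 3, and halving cannot introduce a factor of 3. Accordingly, one stores in the cache the sequence lengths of only those odd numbers that aren’t divisible by 3. This can be done by placing n’s sequence length in the cache at index n/3 (integer division), which maps the numbers 1, 5, 7, 11, 13, … to the consecutive indices 0, 1, 2, 3, 4, …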

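#include <stdio.h>
#include <stdlib.h>

#define    CACHEFILE_SIZE    1012645888

typedef unsigned char byte; 	/* assumed: 8 bit unsigned integer */
typedef unsigned short word; 	/* assumed: 16 bit unsigned integer */

word cache[CACHEFILE_SIZE / sizeof(word)];

void load_cache(void)
{
    FILE *fin = fopen("./hailstone-cache.bin", "rb");

    if (!fin) {
        fprintf(stderr, "Couldn't open cache file\n");
        exit(1);
    }
    if (CACHEFILE_SIZE != fread(cache, sizeof(byte), CACHEFILE_SIZE, fin)) {
        fprintf(stderr, "Read error on cache file\n");
        exit(1);
    }
    fclose(fin);
}

static inline dword bsfq(qword n)
{
    qword retval = -1;
    asm volatile("bsfq        %1,%0" : "=r"(retval) : "r"(n));
    return retval;
}

static inline int hailstone(unsigned long n)
{
    int count = 0;

    do {
        dword shifter = bsfq(n = 3 * n + 1);
        count += shifter + 1;
        n >>= shifter;
        /* here n is odd and not divisible by 3 */
        if (n < sizeof(cache) / sizeof(word) * 3) {
            /* the rest of the sequence is already known */
            count += cache[n / 3];
            n = 1;
        }
    } while (n > 1);
    return count;
}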

Using this ~1 GB cache (don’t ask why the size is so odd) results in ~17.21 M NCPS, a 1156% improvement over the naive implementation.
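The post does not show how the cache file is produced. One way the table could be filled in memory is sketched below, assuming 16-bit entries and the n/3 indexing described above; the names steps_to_one, build_cache and hailstone_cached are hypothetical, and writing the table to disk is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain step count, used only to fill the table. */
static int steps_to_one(uint64_t n)
{
    int count = 0;

    while (n > 1) {
        n = (n & 1) ? 3 * n + 1 : n >> 1;
        ++count;
    }
    return count;
}

/*
 * Fill cache[idx] with the step count of the idx-th odd number that is not
 * divisible by 3 (1, 5, 7, 11, 13, ...), so that cache[n / 3] belongs to n.
 */
void build_cache(uint16_t *cache, size_t entries)
{
    for (size_t idx = 0; idx < entries; ++idx) {
        /* even idx -> n = 6k + 1, odd idx -> n = 6k + 5; both give n / 3 == idx */
        uint64_t n = 3 * (uint64_t)idx + ((idx & 1) ? 2 : 1);

        cache[idx] = (uint16_t)steps_to_one(n);
    }
}

/* Step count for odd n > 1, stopping as soon as the table covers n. */
int hailstone_cached(uint64_t n, const uint16_t *cache, size_t entries)
{
    int count = 0;

    do {
        n = 3 * n + 1;
        int shifter = __builtin_ctzll(n);
        count += shifter + 1;
        n >>= shifter;                  /* odd and not divisible by 3 */
        if (n / 3 < entries) {
            count += cache[n / 3];      /* remaining steps are known */
            n = 1;
        }
    } while (n > 1);
    return count;
}
```

With a table of only 64 entries, hailstone_cached(27, cache, 64) already terminates after a single 3n + 1 step, because the next odd value, 41, falls inside the table.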

The next step is to parallelize the sequence length calculations.

#define qCLKS_PER_SEC	3000000000UL	/* processor clock rate */

static inline qword rdtsc(void)
{
    register qword acc128lo asm ("rax");
    register qword acc128hi asm ("rdx");
    asm volatile ("rdtsc" : "=r"(acc128lo), "=r"(acc128hi));
    return (acc128hi << 32) | acc128lo;
}

/* max_cnt has to be initialized to a nonzero value so it gets put in the DATA
section instead of the COMMON section */
volatile unsigned max_cnt = 1;
unsigned secs = 1;

unsigned word cache[<# of words in cache file>];

/* per-thread parameters and progress counters */
typedef struct {
    qword starting_n;
    qword current_n;
    qword cnts;
    qword cached_cnts;
    dword increment;
} hailstone_t;

void *hailstone(void *parm)
{
    qword i;
    ((hailstone_t *)parm)->cnts = 0;
    ((hailstone_t *)parm)->cached_cnts = 0;

    for (i = ((hailstone_t *)parm)->starting_n; 1;
     i += ((hailstone_t *)parm)->increment) {
        qword n = i;
        dword cnt = 0;
        ((hailstone_t *)parm)->current_n = n;
        do {
            qword shifter, idx;

            n += 2 * n + 1;		/* n = 3 * n + 1 */
            shifter = bsfq(n);
            n >>= shifter;
            cnt += shifter + 1;	/* the halvings plus the 3n + 1 step */

            idx = n / 3;
            if (idx < sizeof(cache) / sizeof(*cache)) {
                cnt += cache[idx];
                ((hailstone_t *)parm)->cached_cnts += cache[idx];
                n = 1;
            }
        } while (n != 1);

        ((hailstone_t *)parm)->cnts += cnt;

        /* deliberately unsynchronized, see below */
        if (cnt > max_cnt) {
            max_cnt = cnt;
            printf("%lu: %u\n", i, cnt);
        }
    }
}

Which can be invoked by:

int main(int argc, char *argv[])
{
    FILE *fin;
    dword nthreads = argc > 1 ? atoi(argv[1]) : 1;
    dword i;
    pthread_t *pThreads;
    hailstone_t *pParms;
    qword next_sec;
    struct timespec req = { 0, 50000000 }, rem;

    if (!nthreads)
        nthreads = 1;
    if (!(fin = fopen(CACHEFILE, "rb"))) {
        fprintf(stderr, "Couldn't open cache file '%s'\n", CACHEFILE);
        return 1;
    }
    if (1 != fread(cache, sizeof(cache), 1, fin)) {
        fprintf(stderr, "Read error on cache file '%s'\n", CACHEFILE);
        return 1;
    }
    if (!(pThreads = (pthread_t *)malloc(nthreads * sizeof(pthread_t)))) {
        fprintf(stderr, "Couldn't allocate %lu bytes\n",
         nthreads * sizeof(pthread_t));
        return 1;
    }
    if (!(pParms = (hailstone_t *)malloc(nthreads * sizeof(hailstone_t)))) {
        fprintf(stderr, "Couldn't allocate %lu bytes\n",
         nthreads * sizeof(hailstone_t));
        return 1;
    }
    next_sec = rdtsc() + qCLKS_PER_SEC;

    /* thread i checks STARTING_N + 2 * i, then every (2 * nthreads)-th number */
    for (i = 0; i < nthreads; ++i) {
        int rc;

        pParms[i].starting_n = STARTING_N + i * 2;
        pParms[i].increment  = nthreads * 2;
        if ((rc = pthread_create(pThreads + i, NULL,
                  hailstone, (void *)(pParms + i)))) {
            fprintf(stderr,"Error - pthread_create() return code: %d\n", rc);
            return 1;
        }
    }
    /* once per second, print elapsed time, progress, rate and cache share */
    while (1) {
        nanosleep(&req, &rem);
        if (rdtsc() >= next_sec) {
            qword min_n = pParms[0].current_n, all_cnts = pParms[0].cnts,
              all_cached_cnts = pParms[0].cached_cnts;
            dword h, m, s;
            for (i = 1; i < nthreads; ++i) {
                if (pParms[i].current_n < min_n)
                    min_n = pParms[i].current_n;
                all_cnts += pParms[i].cnts;
                all_cached_cnts += pParms[i].cached_cnts;
            }

            next_sec += qCLKS_PER_SEC;
            s = secs++;
            m = s / 60;
            s %= 60;
            h = m / 60;
            m %= 60;
            fprintf(stderr, "\r%02u:%02u:%02u  %lu  %.0f/sec (%4.1f%%)",
           h, m, s, min_n, (min_n - STARTING_N + 1) / 2. / secs,
           100. * all_cached_cnts / all_cnts);
        }
    }

    /* Last thing that main() should do */
    pthread_exit(NULL);
}

Synchronized access to max_cnt is explicitly avoided as it presents a very significant performance hit (approx. one-third loss of performance) on parallel processing.
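Without synchronization, two threads can both pass the cnt > max_cnt test, and a smaller count may then overwrite a larger one. A possible middle ground, which the post does not use and whose cost was not measured here, is a C11 compare-and-swap loop that only writes when a new record is actually found. The function name update_max is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Raise *max to cnt if cnt is larger. Returns true only for the thread that
 * actually installed the new maximum.
 */
bool update_max(atomic_uint *max, unsigned cnt)
{
    unsigned seen = atomic_load_explicit(max, memory_order_relaxed);

    while (cnt > seen) {
        /* on failure, seen is refreshed with the value another thread stored */
        if (atomic_compare_exchange_weak_explicit(max, &seen, cnt,
                memory_order_relaxed, memory_order_relaxed))
            return true;
    }
    return false;
}
```

Since records are rare, almost every call would end after the initial relaxed load; whether that is cheap enough compared with the unsynchronized version would have to be benchmarked.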

Using this (the final code) to run three threads results in ~29 M NCPS — an approx. 2000% improvement over the naive implementation.

On an Intel 4790K CPU with 4 physical cores and 4 virtual (HyperThreading) cores running at 4.6 GHz clock rate, using 32 GiB Crucial Ballistix DDR3 RAM overclocked to 2 GHz the code runs at 80 M NCPS. Unfortunately, a simple calculation shows that it would take 10 years at that rate to check all odd numbers up to 46785696846401151 (the last — wrong — element of the original list).
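The estimate can be reproduced with a few lines of arithmetic. The helper below is a hypothetical illustration that assumes a constant rate and a 365.25-day year.

```c
#include <stdint.h>

/* Years needed to test every odd number in (first, last] at a given rate. */
double years_to_check(uint64_t first, uint64_t last, double numbers_per_sec)
{
    const double secs_per_year = 365.25 * 24 * 3600;
    double odd_numbers = (double)(last - first) / 2.0;

    return odd_numbers / numbers_per_sec / secs_per_year;
}
```

Calling years_to_check(1899148184679ULL, 46785696846401151ULL, 80e6) gives a little over 9 years, in line with the rough figure of 10 years quoted above.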


Fixstars Launches the World’s Highest Density SSD, “SSD-3000M” For Media and Entertainment Professionals

The Highest Density and Performance Reliability for professionals.

Sunnyvale, CA – Feb 17, 2015 – Fixstars Solutions Inc., an innovator in flash storage solutions, today announced the start of sales in North America of its 3TB SSD, the SSD-3000M, and its 1TB SSD, the SSD-1000M. The products feature enterprise-level reliability and unprecedented sequential read/write performance aimed at professional content creation, advanced driver assistance systems (ADAS), HPC, and datacenters.

The 3TB SSD-3000M has the world’s highest capacity*1 for a 2.5” SATA SSD. High-capacity SSDs help reduce the number of drives required in professional setups, reducing operational costs such as maintenance, energy, and chassis/rack infrastructure. More importantly, a more reliable workflow with fewer handling failures is significantly valuable. These disks integrate Fixstars’ proprietary NAND controller, which prevents latency spikes and performance deterioration and ensures consistently high performance. Applications for which fast and stable disk writes are crucial, such as 4K video recording/editing and encrypted storage for film, will benefit the most from Fixstars solid state disks.


“The SSD-3000M/1000M were released in Japan last November, and have been getting great feedbacks from our customers”, said Satoshi Miki, CEO & Co-Founder of Fixstars Corporation (Tokyo), “As an innovator of storage solutions, we are focused on providing high performance and reliability SSD solutions, to accelerate our customer’s business”.

For more information on the SSD-3000M/1000M, please visit our web site.

*1: As of Nov 18th 2014, according to a survey by Fixstars Corp.

About Fixstars

Fixstars Solutions is a technology company devoted to our philosophy “Speed up your Business”. Through its software parallelization/optimization expertise, its highly effective use of multi-core processors, and application acceleration for the next generation of flash memory technology that delivers high speed IO as well as power savings, Fixstars Solutions provides “Green IT”, while accelerating customers’ business in various fields. Learn more about how Fixstars Solutions can accelerate your business in life science, manufacturing, finance, and media & entertainment. For more information, visit www.fixstars.com.



[Tech Blog] PCIe SSD for Genome assembly



Genome assembly software takes a huge number of small pieces of DNA sequence (called “reads”) and tries to assemble them into a long DNA sequence that represents the original chromosomes. It generally requires not only large computational power but also a large working memory. The required memory depends on the input data size, but is often in the terabytes. Although the price of DRAM has decreased dramatically, a workstation or server with a few TB of memory is still very expensive. For example, an IBM System x3690 X5 can hold 2TB of memory, but its list price is more than $300k.

On the other hand, PCI Express SSD boards are on the rise. Many hardware vendors such as Intel, Fusion-IO (acquired by SanDisk), and OCZ (acquired by Toshiba) offer a variety of PCIe SSD boards. Generally, such a board has lower bandwidth than DRAM but much higher bandwidth than a standard SATA/SAS SSD. The price is somewhat high compared with a standard SATA SSD, but the Fusion-IO ioFX 2.0 offers 1.6TB for $5,418.95 on Amazon.com. Even if you insert this board into a high-end workstation, the total price is still below $10,000, which is much cheaper than a 2TB memory server.

In this blog post, I would like to explore whether using an SSD in place of DRAM yields a viable solution. We will use an open-source genome assembler called “velvet” as the benchmark software.


First, I downloaded, compiled and installed “velvet” from the velvet web site.

$ tar xvfz velvet_1.2.08.tgz
$ cd velvet_1.2.08
$ make 'MAXKMERLENGTH=51' 'OPENMP=1'
$ sudo cp velvetg velveth /usr/local/bin

In this procedure, ‘MAXKMERLENGTH=51’ sets the maximum k-mer length, and ‘OPENMP=1’ enables multi-threading with OpenMP. The k-mer length is a very important parameter in genome assembly, as it affects the quality of the output DNA sequence. For more detail on k-mers, please refer to the velvet user manual.

Velvet consists of two processes, velveth and velvetg. The velveth process creates a graph file to prepare for the genome assembly; it does not require much memory. The velvetg process, which performs the actual assembly, consumes much more memory and computation time.
Before we can start testing, we need input files. In this experiment, we will use two fastq files, SRR000021 and SRR000026, which were downloaded from this site. I processed these data with velveth as follows:

$ velveth SRR2126.k11  11 -short -fastq SRR2126.fastq

The first argument is the output directory name, the second argument is the length of the k-mer, the third argument specifies short read inputs, the 4th argument specifies the “fastq” file type, and the 5th argument is the input file.

The next step is assembly. The command to do so is as follows:

$ velvetg SRR2126.k11

The argument is the output directory generated by velveth. I measured the elapsed time of this command in several different hardware configurations:

1. Memory 4GB + Generic SATA HDD
2. Memory 4GB + Fusion IO ioFX
3. Memory 8GB

In case of configuration #2, I created a swap file on ioFX like this:

# dd if=/dev/zero of=/mnt/iofx/swap0 bs=1024 count=12582912
# chmod 600 /mnt/iofx/swap0
# mkswap /mnt/iofx/swap0 12582912
# swapoff -a
# swapon /mnt/iofx/swap0


The velvetg process uses about 8GB of memory for this input data, so roughly half of the working data spills into swap space in configurations 1 and 2. The figure below shows the elapsed time for each configuration. With the generic SATA HDD, the process had not finished after 2 hours, so we decided to kill it.



So is using a PCIe SSD a viable solution? It is hard to say, as a 3x difference is not small. In this particular experiment, only half of the memory used by velvetg happened to be placed on the PCIe SSD. As mentioned earlier, the real-life data that bioinformaticians deal with can be a few TB. If almost all the memory is on the PCIe SSD, the performance is expected to be much worse.

However, considering that the HDD could not even complete the process in a reasonable amount of time, the PCIe SSD card showed that it can vastly improve performance. This suggests that the SSD can serve as a good compromise, as DRAM is much more expensive than an SSD.



[Tech Blog] I/O Benchmark Test for Windows Azure D-series


Update: Microsoft has just released persistent SSD storage. It is still in the preview phase and available only in a limited region. Here is a link and expected throughput:

Recently Microsoft released a new series of virtual machines called D-Series. D-Series VMs have a relatively new CPU (Xeon E5-2660, 2.2GHz) and local solid state drives (SSDs). Amazon AWS already has SSD instances, and there are several SSD-only cloud vendors. I’m just excited that we can finally use SSDs on Azure. In this blog post, I will show you the I/O benchmark results for the D-series VM.

An Azure VM has two types of local storage: persistent storage and ephemeral storage. Data on the persistent storage is still available after the VM is rebooted, but data on the ephemeral storage may be lost after rebooting. For the D-Series, the SSD storage is “ephemeral” storage. As shown below, the performance of the SSD is outstanding, but you have to save the data to persistent storage before the VM is turned off. The size of the new SSD storage is limited; in the case of D14 (a high-end instance of the D-Series), the local storage is 800GB. If you need more storage, you can use Azure Storage, which is an elastic storage system for the cloud and provides several different ways to access it. An Azure VM simply mounts Azure Storage as a local block device with the “mount” command.

I prepared one D14 instance running CentOS release 6.5 with 30GB of persistent storage, 800GB of ephemeral SSD storage, and 999GB of mounted Azure Storage (Fig. 1). The specification of D14 is shown in Table 1. In this blog post, the I/O performance of these 3 storage types is evaluated.


Fig. 1: Three different storages mounted on Azure VM


# of cores 16
Persistent Storage 30GB
Ephemeral Storage 800GB
Price $1.542/hour
Table 1. Specification of the D14 instance

First, I tried a simple read/write test with “dd” and “hdparm”. This test is very useful for a quick check, since these commands are pre-installed on many Linux distributions. The actual commands are:

– Read Test

# sync
# echo 3 > /proc/sys/vm/drop_caches
# hdparm -t

– Write Test

# dd if=/dev/zero of= ibs=1M obs=1M count=1024 oflag=direct

The commands are simple, but if you omit a command or an argument the result will likely be incorrect. Lines 1 and 2 of the read test flush and drop the Linux buffer cache in RAM so that the actual storage I/O performance is measured. In the write test, 1GB of data is read from /dev/zero and written to the file given by of=. The last argument of the dd command (oflag=direct) requests direct access to the storage medium, bypassing the Linux page cache. The result is shown in Fig. 2. The measurements were conducted 12 times, with an interval of 10 sec between trials. These commands issue I/O operations sequentially.


Fig.2: Result of Simple Read/Write test

For an HDD, the typical sequential read/write performance is around 100MB/s. For a SATA/SAS SSD, it should be around 400MB/s. As shown in Fig. 2, Azure’s ephemeral storage shows remarkably high performance. It is presumably backed by a high-end PCIe SSD, whose typical performance is around 1GB/s. In the read test results, you can see that the performance of the persistent storage (blue line) depends on the trial number. As mentioned, I dropped the Linux OS cache with the sync; echo 3 > /proc/sys/vm/drop_caches commands, but it still looks like a cache effect. I don’t have a concrete explanation for this, but it might be due to caching on the host of the virtual machine, which cannot be controlled from a guest OS.

The simple read/write test is easy to use, but its result is not very reliable, so we also evaluated the 3 types of storage with FIO. FIO is a standard Linux tool for evaluating I/O performance and has a large number of options, so it may be difficult for a beginner to pick the right ones. In this evaluation, I used the job file published by WinKey (http://www.winkey.jp/downloads/index.php/fio-crystaldiskmark, written in Japanese). With this job file, you can run the same types of I/O performance tests as CrystalDiskMark (http://crystalmark.info/software/CrystalDiskMark/index-e.html), which is a well-known benchmark tool for Windows. The command is simply:

# fio crystaldiskmark.fio

The target storage area is /tmp by default. If you want to evaluate a different directory, edit line 7 (directory=/tmp/) in crystaldiskmark.fio. The result is shown in Fig. 3. Note that the Y-axis uses a logarithmic scale. The performance difference between the ephemeral storage and the persistent storage is dramatic, especially for random read/write at Queue Depth* = 32.
* Queue Depth: The number of simultaneous input/output requests for a storage device.

Fig. 3: Result of FIO benchmark

As demonstrated in the FIO test results, the SSD performance on Azure D-Series is great but again, the problem is that SSDs are ephemeral storage. If your application boots up and shuts down instances often, you could develop a management tool or daemon to save your essential data in Azure storage or persistent local storage.



[Tech Blog] Power Management Using the Windows Azure Java SDK


Azure SDKs

Windows Azure Portal offers an intuitive interface for performing tasks such as turning on/off a VM. However, when these tasks need to be performed in bulk, one soon comes to a realization that clicking through the user interface may not be the optimal method.

Fortunately, Windows Azure can be managed through REST APIs. To make these REST APIs easy to use, command-line tools and numerous SDKs exist as wrappers, available for PowerShell, Cross-Platform Command-Line Interface, .NET, PHP, Node.js, Python, Java, and Ruby.

Power Management code sample

After browsing through the javadoc for the SDK, one can find the methods start() and shutdown() implemented in class VirtualMachineOperationsImpl. Both methods require 3 parameters, all of which can be found from the Azure Portal.


But how does one create an instance of VirtualMachineOperationsImpl? This did not seem intuitive, so I downloaded the source code of the SDK and browsed through the test code, which led me to the following 3 lines of code:


Configuration config = PublishSettingsLoader.createManagementConfiguration(publishSettingsFileName, subscriptionId);
ComputeManagementClient computeManagementClient = ComputeManagementService.create(config);
OperationResponse operationResponse = computeManagementClient.getVirtualMachinesOperations().start(serviceName, deploymentName, virtualMachineName);


Basically, getVirtualMachinesOperations() method from class ComputeManagementClient returns an instance of VirtualMachineOperationsImpl. ComputeManagementClient class in turn requires an instance of the Configuration class, which requires the filename of the PublishSettings and the Subscription ID during instantiation.


The entire sample code is as follows:

import com.microsoft.windowsazure.core.*;
import com.microsoft.windowsazure.Configuration;
import com.microsoft.windowsazure.Configuration.*;
import com.microsoft.windowsazure.credentials.*;
import com.microsoft.windowsazure.management.configuration.*;
import com.microsoft.windowsazure.management.compute.*;
import com.microsoft.windowsazure.management.compute.models.*;
import org.apache.http.impl.client.HttpClientBuilder;
import java.util.concurrent.*;
import java.io.File;
import java.io.IOException;
import java.lang.String;

class AzureStartVM {
    public static void main(String[] args) {
        // Path of the publish settings file downloaded from Azure
        String publishSettingsFileName = System.getenv("AZURE_PUBLISH_SETTINGS_FILE");
        if (publishSettingsFileName == null) {
            System.err.println("Set AZURE_PUBLISH_SETTINGS_FILE.");
            System.err.println(" Ex. $ setenv AZURE_PUBLISH_SETTINGS_FILE $HOME/conf/XXX.publishsetting");
            System.exit(1);
        }
        File file = new File(publishSettingsFileName);
        if (!file.exists()) {
            System.err.println("File not found: " + publishSettingsFileName);
            System.exit(1);
        }

        String subscriptionId = "XXXXXXXXXXX";
        if (args.length != 3) {
            System.err.println("Usage: AzureStartVM serviceName deploymentName virtualMachineName");
            System.exit(1);
        }
        String serviceName = args[0];
        String deploymentName = args[1];
        String virtualMachineName = args[2];

        try {
            Configuration config = PublishSettingsLoader.createManagementConfiguration(publishSettingsFileName, subscriptionId);
            ComputeManagementClient computeManagementClient = ComputeManagementService.create(config);

            // Blocking call: returns once the start operation has finished
            OperationResponse operationResponse = computeManagementClient.getVirtualMachinesOperations().start(serviceName, deploymentName, virtualMachineName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Compile source

1)      Add Java SDK jar files to CLASSPATH
2)      Download the publish settings file locally and set its path to AZURE_PUBLISH_SETTINGS_FILE
3)      Set string value of subscriptionId in the source.
4)      Compile

$ javac AzureStartVM.java


Execute Code

$ java AzureStartVM serviceName deploymentName vmName


So Why Java?

The different tools and SDK’s are in different stages of development, and the required management functions may not be available for your language of choice. PowerShell is by far the most complete and the easiest to use, but this is not a solution available for Linux users.

Cross-Platform CLI would probably be the next choice in the chain of tools to try. However, this CLI uses node.js underneath to trigger the REST API requests, and there seems to be either a bug or an incompatibility issue with node.js on the CentOS-based Linux VM for Azure, which caused unreliability in its operation. Microsoft support was able to boil down the problem to the https module in node.js, and that this problem was non-existent in Ubuntu, but switching operating systems just for this purpose seemed rather nonsensical.

PHP and Python do not yet have power management modules implemented, and while implementing these modules for the SDK was an option, Java seemed to be the easier choice.

Why start() and shutdown() instead of the async options?

The implementation of the REST API for power management seems to be thread-unsafe and results in an error when executed concurrently for VMs in the same cloud service. This means that if CloudService1 contained VM1 and VM2, a start command cannot be triggered for both VM1 and VM2 simultaneously. Even on the Azure portal, one has to wait until VM1 is running before VM2 can be started. Therefore, it made sense to implement a blocking start/shutdown.

VMs in different cloud services, on the other hand, can be started or shut down simultaneously, so the code can be run in parallel from multiple terminals. Hence, the async variants were not necessary.


The above command can be executed in a for-loop from shell to start multiple VMs at once.

While it may be trivial for a developer to come up with this solution, doing so is probably not part of the job description of Linux administrators, who would probably benefit from it the most.

Fixstars Establishes Canadian Subsidiary

Sunnyvale, CA – November 3rd, 2014 – Fixstars Solutions, Inc. announced its Canadian subsidiary’s first official day of operations today.

Fixstars Solutions Canada will help to grow its parent company’s business in the United States and abroad by expanding its research and development capabilities in multicore, storage, and cloud computing technologies. The new subsidiary will be headquartered in Victoria, BC. Akihiro Asahara, the CEO of Fixstars Solutions, will assume the role of Chairman, while Owen Stampflee will assume the role of CEO.

About Fixstars Solutions

Fixstars Solutions is a software company devoted to “Speed up your Business”. Through its software parallelization/optimization expertise, its highly effective use of multi-core processors, and application acceleration for the next generation of memory technology that delivers high speed IO as well as power savings, Fixstars Solutions provides “Green IT,” while accelerating customers’ business in various fields. Learn more about how Fixstars Solutions can accelerate your business in life science, manufacturing, finance, and media & entertainment.
For more information, visit www.fixstars.com.

Fixstars Solutions Adopts Microsoft Azure to Boost Life Science Application Performance

Sunnyvale, CA – August 12th, 2014 – Fixstars Solutions Inc. today announced that it has adopted Microsoft Azure to help boost the performance of its life science application. The importance of computer simulation and analysis in the life science field is growing, and some life science applications, such as molecular dynamics simulation and DNA sequencing, consume a vast amount of computing resources. Building a private cluster server system only for these compute-intensive applications can be complex and costly. Fixstars Solutions developed a computing platform for life science applications on Microsoft Azure, designed to provide an elastic and high performance cluster server system.

Fixstars Solutions has extensive knowledge of parallel processing technology, with a track record of numerous successful projects with hyper-scale cluster server systems. For the US Air Force Research Lab, Fixstars Solutions developed a simulation and video processing system made up of 2,016 SONY PS3® nodes. Recently, Fixstars and Mizuho Securities succeeded in accelerating a derivative valuation system 30-fold with Intel Xeon Phi® Coprocessors.

Fixstars Solutions worked closely with Microsoft to efficiently port a life science application to Microsoft Azure in three weeks. The target application is compute-intensive and requires about 1000 CPU cores. Initial benchmarking showed that the Microsoft Azure A8/A9 instances, which are optimized for high performance computing (HPC) applications, outperformed the competing cloud services, which led to the decision to use Microsoft Azure. As a result of Microsoft Azure’s HPC capabilities, Fixstars Solutions achieved a performance boost of more than 4-fold compared with the same software running on the original private cluster server system.

“I am pleased to work with Microsoft to bring solutions to market that we believe will advance the life sciences field,” said Akihiro Asahara, CEO of Fixstars Solutions. “Using Microsoft Azure’s new compute-intensive instances, A8 and A9, we can use a high performance cluster server system having over 1000 CPU cores available at all times.”

“The flexibility, speed and cost-saving of cloud computing has the power to accelerate crucial research in the life sciences field,” said Venkat Gattamneni, Senior Product Marketing Manager, Cloud and Enterprise, Microsoft. “We look forward to continued work with Fixstars Solutions to boost the performance of life sciences applications using the Microsoft Azure platform.”

Fixstars Solutions will release solutions built on Microsoft Azure and provide technical services and consulting for clients using Microsoft Azure. Fixstars will not only aid in porting to Microsoft Azure to achieve speed-ups from parallel processing, but will also help enhance system availability and reduce operating costs by using Azure’s management functions, which are available through the Microsoft Azure SDK and command-line tools.


Fixstars and Mizuho Securities Succeed in Accelerating Derivative Valuation System 30-Fold with Intel Xeon Phi Coprocessor

Mizuho Securities becomes the first financial institution to deploy the coprocessor in a production environment

TOKYO — June 2, 2014 — Fixstars Corporation today announced that Fixstars and Mizuho Securities Co., Ltd. achieved a 30-fold speed-up by porting Mizuho Securities’ Derivative Valuation System to Intel’s many-core processor, the Intel® Xeon Phi™ coprocessor. Mizuho Securities has become the world’s first financial institution to use the Xeon Phi™ coprocessor in a production environment.

The recent low interest rates have increased the demand for structured bonds, which combine bonds with derivatives. To meet this customer demand, Mizuho Securities had been investigating ways to efficiently perform the vast amount of computation required by its Derivative Valuation System, which led it to consider Intel’s high-performance Xeon Phi™ coprocessors.

Mizuho Securities, with the aid of its development partner Fixstars, succeeded in devising an algorithm to efficiently distribute the workloads across the cores of the Intel® Xeon Phi™ coprocessor. The Xeon Phi™ system replaces an 8-core Xeon® system and outperforms it by a factor of 30. In addition, by placing an emphasis on source-code readability, the team created a highly maintainable system.

“I am happy to announce that Fixstars’ expertise in parallel processing, in conjunction with Intel’s technology, has been able to improve upon Mizuho Securities’ Derivative Valuation System,” said Kosuke Hirano, Senior Executive Officer, Intel K.K. “Intel will continue innovating with the highly parallel processing capability of Intel® Xeon Phi™ coprocessors.”

“I am very excited that our expertise in parallel programming techniques has enabled us to aid Mizuho Securities in the deployment of the new Derivative Valuation System,” said Miki Satoshi, CEO, Fixstars. “We have been involved in the project since the research phase, when the hardware was being chosen, and I believe the smooth progress that led to this deployment could not have occurred without the mutual, underlying trust present in the partnership between Mizuho Securities and Fixstars, as well as the high capability and reliability of Intel® Xeon Phi™ coprocessors.”

Mizuho Securities is already using the new system for plain vanilla derivatives, with plans to move exotic derivatives to the new system this summer.

Fixstars will continue to speed up Mizuho Securities’ business by providing technical expertise in parallel processing.

* Intel, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

Initial Public Offering

Fixstars Corporation, the parent company of Fixstars Solutions Inc., announced the pricing of its initial public offering of 100,000 shares of Class A common stock. Fixstars Corp’s Class A common stock will trade on the Tokyo Stock Exchange (Mothers) under the code “3687” starting April 23, 2014.

See more detail at http://www.tse.or.jp/english/listing/companies/index_e.html

[Tech Blog] Super Computing 2013

Super Computing 2013 (SC13) has started in Denver, Colorado! The TOP500 supercomputer list has already been updated, and there are no big changes to the top 5. Tianhe-2, a supercomputer developed in China, retained its position as the world’s No. 1 system.


One important but often overlooked statistic in supercomputing is power consumption. For example, Tianhe-2 reaches 33.86 petaflop/s on the Linpack benchmark, but it consumes 17,808 kW of power to do so! This is almost the entire output of a small thermal power station. The Green500 list ranks supercomputers by energy efficiency instead of raw processing power. In terms of MFLOPS/Watt, TSUBAME 2.5, developed by the Tokyo Institute of Technology, should take the No. 1 spot. The latest Green500 list will be available soon.

There are several techniques we can use to manage power consumption at the system level, but within each computing node we only have a few options. Heterogeneous setups, such as those using an NVIDIA/AMD GPU or Intel’s Xeon Phi, have become the norm – and we think FPGA chips are another promising option. FPGAs offer pipeline parallelism, where several different tasks can be spawned in a push-pull configuration and each task has its data supplied by the previous task – with or without host interaction. Several tech papers report that FPGAs outperform other chips in terms of both latency and energy efficiency. The standard FPGA languages (VHDL and Verilog-HDL) can be a complicated struggle for any software developer (like me!), but the chip vendor Altera has recently announced OpenCL support on their FPGA devices. This will make development faster and cause fewer headaches. Fixstars has been a part of their early access program for their OpenCL SDK since last summer, and we provide OpenCL software and services to our clients.

Unfortunately, we don’t have a booth at SC13 this year, but we can be found at our partner Nallatech’s booth (#3519). We’ll even have coupons for copies of our OpenCL programming book.
Stop by and check out OpenCL on FPGA!