[Tech Blog] FlashAir: Uploading to Google Drive with Lua



In this example, we're going to show you how to upload an image to Google Drive using Lua and Google's OAuth device flow. The idea is for it to be run from something with limited input capabilities, such as Toshiba's "FlashAir" SD card. Unfortunately, we'll still need a device with full input capabilities to set everything up, but once the setup is complete you're left with a headless system that can upload indefinitely.


1. Setting up the Project in Google

The first step is to create a project in Google's "Developers Console". Once this is done, enable the Google Drive API and SDK.

Next, go to the “Credentials” section (under “APIs & auth”) in your project and select “Create new client ID”. Choose “Installed application”, then “Other” for installed application type.

2. Authorizing Our Device

Now that that's set up, we need to authorize our device with Google, which is a two-step process. First we need to send a POST request to "accounts.google.com/o/oauth2/device/code". Set the content type to "application/x-www-form-urlencoded", with two fields: client_id (the Client ID we generated above, under "Client ID for native application") and scope (which should be set to "https://docs.google.com/feeds"). You can use several tools to accomplish this, but the Chrome extension "Postman" makes it super easy.


POST /o/oauth2/device/code HTTP/1.1
Host: accounts.google.com
Cache-Control: no-cache
Content-Type: application/x-www-form-urlencoded

client_id={Your client ID}&scope=https://docs.google.com/feeds

NOTE: Google tells you to use "https://www.googleapis.com/auth/drive" for the scope, but this will return an "Invalid_scope: Not authorized to request the scope" error. Using the docs.google.com/feeds scope instead will grant us the Google Drive authorization we need.

Now for part two! The response will contain a user_code and a verification_url. Navigate to that URL (it's probably https://www.google.com/device), then enter the user_code. Hold on to the device_code too!

Example Response:

"device_code": {Device code},
    "user_code": {Your user code},
    "verification_url": "https://www.google.com/device",
    "expires_in": 1800,
    "interval": 5

3. Getting a Permanent Refresh Token

Now that everything's properly authorized by Google, we still need to get the refresh token that our app is going to use. We'll use that to get yet another token, the temporary "auth" token, which will actually let us upload to Google Drive. There are a lot of tokens. To get the refresh token, send a POST like the following (line breaks added for readability):

POST /o/oauth2/token HTTP/1.1
Host: accounts.google.com
Cache-Control: no-cache
Content-Type: application/x-www-form-urlencoded

client_id={Your full client ID}&
client_secret={Your client secret}&
code={Your device code}&
grant_type=http%3A%2F%2Foauth.net%2Fgrant_type%2Fdevice%2F1.0

The grant type is actually "http://oauth.net/grant_type/device/1.0"; it just has to be URL-encoded, which is why it appears full of %-escapes.
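For illustration only, here is how that body looks once encoded, sketched with Python's standard library (the {placeholder} values are not real credentials):

```python
from urllib.parse import urlencode

# Build the token-request body. urlencode percent-escapes the grant_type
# URL, which is why it shows up full of %'s on the wire.
# The {placeholder} values stand in for your real credentials.
body = urlencode({
    "client_id": "{Your full client ID}",
    "client_secret": "{Your client secret}",
    "code": "{Your device code}",
    "grant_type": "http://oauth.net/grant_type/device/1.0",
})
print(body)
```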

You should receive something that looks like:

"access_token": {Your access token here},
    "token_type": "Bearer",
    "expires_in": 3600,
    "refresh_token": {Your refresh token here}

Finally, we can start scripting our headless upload! The main token we really need is the refresh_token. The access_token will work for a short time, but it will expire. With the refresh token we can always get a fresh access token when we need to (which is pretty much constantly, as the authorization doesn't last long).

4. Required Lua Imports

The only two Lua library imports we need are ssl.https and JSON, a standalone file that can be found at http://regex.info/blog/lua/json

Example code:

local https = require 'ssl.https'
local JSON = require 'JSON'

We will follow these by creating local variables containing all the necessary information.

Example code:

-- Basic account info
local client_id = "{Your full client ID}"
local client_secret = "{Your client secret}"
local refresh_token = "{Your refresh token}"
-- For refresh
local scope = "https://docs.google.com/feeds"
local response_type = "code"
local access_type = "offline"

5. Using the Refresh Token to Re-authenticate

Before we can upload anything, chances are we're going to need to re-authenticate with the server, so let's put together a getAuth() function first. In it, we set the message to be sent, find its length (a required parameter), then make the HTTPS request. The response body arrives as one big string, since Lua can't natively decode JSON. Conveniently, we imported the JSON library earlier, so we can use that to parse the string into a table and retrieve our new access token.

Example function:

local function getAuth()

  -- Build the refresh request body; all four fields are required
  local mes = "client_id="..client_id..
    "&client_secret="..client_secret..
    "&refresh_token="..refresh_token..
    "&grant_type=refresh_token"

  local length = string.len(mes)
  print("Sending: ["..mes.."]")
  print "\n"
  b, c, h = fa.request{
    url = "https://accounts.google.com/o/oauth2/token",
    headers = {
        ["Content-Type"] = "application/x-www-form-urlencoded",
        ["Content-Length"] = length,
    },
    method = "POST",
    body = mes,
  }

  -- The response body is one big JSON string; parse it into a table
  local tempTable = JSON:decode(b)

  access_token = tempTable["access_token"]
end

6. The Lua Function that Does the Upload

Now that we have a new access token, we're ready to upload! We'll do this with another HTTPS POST request. We just need to give it an image file, set our authorization header, and we're done.


local function uploadTest(token)
  --NOTE: You probably want to set your own file here,
  --or maybe even pass it as a parameter!
  local filePath = "/upload.jpg"
  local fileSize = lfs.attributes(filePath, "size")
  b, c, h = fa.request{
    url = "https://www.googleapis.com/upload/drive/v2/files",
    headers = {
      ["Content-Type"] = "image/jpeg",
      ["Content-Length"] = fileSize, -- calculated file size
      ["authorization"] = "Bearer "..token,
    },
    method = "POST",
    file = filePath, -- the file's contents are sent as the request body
  }
end


If you've been following our examples up to this point, combine them in the order presented and run the two functions using the following code at the bottom of the file; the result is a file uploaded to your Google Drive.

Final code to run the functions:

getAuth()
uploadTest(access_token)

[Tech Blog] Acceleration of the Collatz conjecture


The Wikipedia entry of the Collatz conjecture describes the simple algorithm used to generate so-called hailstone sequences:

Take any natural number n. If n is even, divide it by 2 to get n / 2. If n is odd, multiply it by 3 and add 1 to obtain 3n + 1. Repeat the process (which has been called “Half Or Triple Plus One”, or HOTPO) indefinitely. The conjecture is that no matter what number you start with, you will always eventually reach 1.
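For reference, the rule fits in a few lines of Python (my own sketch, not from the Wikipedia article); the well-known starting value 27 takes 111 steps to reach 1:

```python
def hailstone_length(n):
    """Number of HOTPO steps needed to get from n down to 1."""
    count = 0
    while n > 1:
        n = 3 * n + 1 if n & 1 else n // 2
        count += 1
    return count

print(hailstone_length(27))  # prints 111
```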

In the examples, it states:

Numbers with a total stopping time longer than any smaller starting value form a sequence beginning with:

1, 2, 3, 6, 7, 9, 18, 25, 27, 54, 73, 97, 129, 171, 231, 313, 327, 649, 703, 871, 1161, 2223, 2463, 2919, 3711, 6171, … (sequence A006877 in OEIS).

By looking up the A006877 sequence (“In the `3x+1′ problem, these values for the starting value set new records for number of steps to reach 1”) in OEIS, and following the link embedded in “T. D. Noe, Table of n, a(n) for n = 1..130 (from Eric Roosendaal’s data)” (under LINKS), one finds this list which — supposedly — lists the numbers with a larger stopping count than of any smaller number:


Here is the naive implementation of the hailstone sequence length generation algorithm:

static inline int hailstone(unsigned long n)
{
    int count = 0;
    while (n > 1) {
        if (n & 1)
            n = 3 * n + 1;
        else
            n >>= 1;
        ++count;
    }
    return count;
}

By mapping this function to the elements of the above list, one obtains the following list:

1: 0
2: 1
3: 7
6: 8
7: 16
9: 19
18: 20
25: 23
27: 111
54: 112
73: 115
97: 118
129: 121
171: 124
231: 127
313: 130
327: 143
649: 144
703: 170
871: 178
1161: 181
2223: 182
2463: 208
2919: 216
3711: 237
6171: 261
10971: 267
13255: 275
17647: 278
23529: 281
26623: 307
34239: 310
35655: 323
52527: 339
77031: 350
106239: 353
142587: 374
156159: 382
216367: 385
230631: 442
410011: 448
511935: 469
626331: 508
837799: 524
1117065: 527
1501353: 530
1723519: 556
2298025: 559
3064033: 562
3542887: 583
3732423: 596
5649499: 612
6649279: 664
8400511: 685
11200681: 688
14934241: 691
15733191: 704
31466382: 705
36791535: 744
63728127: 949
127456254: 950
169941673: 953
226588897: 956
268549803: 964
537099606: 965
670617279: 986
1341234558: 987
1412987847: 1000
1674652263: 1008
2610744987: 1050
4578853915: 1087
4890328815: 1131
9780657630: 1132
12212032815: 1153
12235060455: 1184
13371194527: 1210
17828259369: 1213
31694683323: 1219
63389366646: 1220
75128138247: 1228
133561134663: 1234
158294678119: 1242
166763117679: 1255
202485402111: 1307
404970804222: 1308
426635908975: 1321
568847878633: 1324
674190078379: 1332
881715740415: 1335
989345275647: 1348
1122382791663: 1356
1444338092271: 1408
1899148184679: 1411
2081751768559: 682
2775669024745: 685
3700892032993: 688
3743559068799: 794
7487118137598: 795
7887663552367: 808
10516884736489: 811
14022512981985: 814
19536224150271: 1585
26262557464201: 833
27667550250351: 846
38903934249727: 1617
48575069253735: 1638
51173735510107: 1651
60650353197163: 1659
80867137596217: 1662
100759293214567: 1820
134345724286089: 1823
223656998090055: 1847
397612441048987: 1853
530149921398649: 1856
706866561864865: 1859
942488749153153: 1862
1256651665537537: 1865
1675535554050049: 1868
2234047405400065: 1871
2978729873866753: 1874
3586720916237671: 1895
4320515538764287: 458
4861718551722727: 470
6482291402296969: 473
7579309213675935: 512
12769884180266527: 445
17026512240355369: 448
22702016320473825: 451
45404032640947650: 452
46785696846401151: 738

As it is easily visible, the first list (which is the left column in the second list) is “broken”: up to and including 1899148184679, the sequence lengths (right column in the second list) are indeed monotonically growing, but afterwards there are some numbers which have shorter sequence lengths than previous ones.

The problem becomes recalculating the sequence lengths for all odd numbers above 1899148184679 and checking if the current sequence length is greater than all previous ones.
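As an illustration of the record check, a small Python sketch of my own (scanning up from 1 rather than from the tail of the list) reproduces the start of A006877:

```python
def hailstone_length(n):
    # Number of HOTPO steps needed to reach 1
    count = 0
    while n > 1:
        n = 3 * n + 1 if n & 1 else n // 2
        count += 1
    return count

def record_setters(limit):
    """Starting values below `limit` whose sequence length beats every smaller one."""
    best, records = -1, []
    for n in range(1, limit):
        length = hailstone_length(n)
        if length > best:
            best = length
            records.append(n)
    return records

print(record_setters(100))  # prints [1, 2, 3, 6, 7, 9, 18, 25, 27, 54, 73, 97]
```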

Unless otherwise noted, tests were run on an Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz with 4 GiB RAM running Ubuntu Linux 14.04.

Using the naive implementation, one obtains ~1.37 M numbers checked per second (NCPS).

Using an inlined assembly language implementation of the function

static inline dword hailstone(qword n)
{
    dword retval;

    asm volatile("mov    %0,%%rax" : : "r"(n) : "rax");
    asm volatile(".intel_syntax noprefix");
    asm volatile("mov    rsi,rax" : : : "rsi");
    asm volatile("xor    r8d,r8d" : : : "r8");
    asm volatile(".align 16");
    asm volatile("top:");
    asm volatile("shr    rax,1" : : : "rax");
    asm volatile("test    rsi,1");
    asm volatile("lea    rsi,[rsi + 2 * rsi + 1]" : : : "rsi");
    asm volatile("cmovz    rsi,rax" : : : "rsi");
    asm volatile("inc    r8d" : : : "r8");
    asm volatile("mov    rax,rsi" : : : "rax");
    asm volatile("cmp    rsi,1");
    asm volatile("jnz    top");
    asm volatile(".att_syntax");
    asm volatile("mov    %%r8d,%0": "=r"(retval));

    return retval;
}
which uses conditional move instructions to avoid branch misprediction penalties improves the rate to ~2.02 M NCPS — a 48% improvement.

If we're interested only in the sequence lengths of odd numbers, then the following function calculates them:

typedef unsigned int dword; 	/* 32 bit unsigned integer */
typedef unsigned long qword; 	/* 64 bit unsigned integer */

static inline dword bsfq(qword n)
{
    qword retval = -1;
    asm volatile("bsfq        %1,%0" : "=r"(retval) : "r"(n));
    return retval;
}

static inline int hailstone(unsigned long n)
{
    int count = 0;
    do {
        /* n is odd */
        dword shifter = bsfq(n = 3 * n + 1);
        /* accumulate the number of divisions by 2 and the initial
           tripling + 1 step */
        count += shifter + 1;
        /* n is even; do the required number of divisions by two */
        n >>= shifter;
    } while (n > 1);
    return count;
}

Using this function results in ~6.58 M NCPS — a 380% improvement over the naive implementation.
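The same batching idea can be sketched in Python (my own illustration; `(n & -n).bit_length() - 1` plays the role of the BSF instruction), and it agrees with the step-by-step count:

```python
def trailing_zeros(n):
    # Software equivalent of x86 BSF for n > 0: index of the lowest set bit
    return (n & -n).bit_length() - 1

def hailstone_odd(n):
    """Sequence length for an odd n > 1, batching all divisions by two."""
    count = 0
    while True:
        n = 3 * n + 1              # n was odd, so 3n + 1 is even
        shift = trailing_zeros(n)
        count += shift + 1         # the tripling step plus `shift` halvings
        n >>= shift                # n is odd again (or 1)
        if n <= 1:
            return count

print(hailstone_odd(27))  # prints 111, matching the table above
```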

As all hailstone sequences eventually — presumably — end with 1, at least some of the numbers in the sequence are smaller than the initial number. Accordingly, it is possible to cache (precalculate and store) sequence lengths for some set of numbers (starting with 1) and once a particular sequence reaches a number which is in cache, accumulate the precalculated sequence length into the current count and terminate the calculation.

It is easily seen that numbers reached in hailstone sequences after a 3n + 1 step aren't divisible by 3. Accordingly, one stores in the cache the sequence lengths of only those numbers that aren't divisible by 3. This can be done by placing n's sequence length in the cache at index n/3.
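Why this works: after a 3n + 1 step the value is congruent to 1 (mod 3), and halving never introduces a factor of 3, so every odd value reached by the batched iteration is congruent to 1 or 5 (mod 6). That also means the indices n/3 are distinct, so no two cached numbers collide. A quick Python check (illustration only, not from the original post):

```python
def odd_iterates(n, steps=50):
    """Odd values reached from odd n by one 3n+1 step plus all halvings."""
    out = []
    for _ in range(steps):
        if n == 1:
            break
        n = 3 * n + 1
        while n % 2 == 0:
            n //= 2
        out.append(n)
    return out

# Every value reached is odd and not divisible by 3 ...
assert all(v % 2 == 1 and v % 3 != 0
           for n in range(1, 500, 2) for v in odd_iterates(n))
# ... and n // 3 maps such numbers onto distinct, dense indices
print([n // 3 for n in [1, 5, 7, 11, 13, 17]])  # prints [0, 1, 2, 3, 4, 5]
```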

#define    CACHEFILE_SIZE    1012645888

word cache[CACHEFILE_SIZE / sizeof(word)];

void load_cache(void)
{
    FILE *fin = fopen("./hailstone-cache.bin", "rb");
    if (CACHEFILE_SIZE != fread(cache, sizeof(byte), CACHEFILE_SIZE, fin))
        fprintf(stderr, "Read error on cache file\n");
    fclose(fin);
}

static inline dword bsfq(qword n)
{
    qword retval = -1;
    asm volatile("bsfq        %1,%0" : "=r"(retval) : "r"(n));
    return retval;
}

static inline int hailstone(unsigned long n)
{
    int count = 0;
    do {
        dword shifter = bsfq(n = 3 * n + 1);
        count += shifter + 1;
        n >>= shifter;
        /* here n is odd and not divisible by 3 */
        if (n < sizeof(cache) / sizeof(word) * 3) {
            count += cache[n / 3];
            n = 1;
        }
    } while (n > 1);
    return count;
}

Using this ~1 GB cache (don't ask about the odd size) results in ~17.21 M NCPS — a 1156% improvement over the naive implementation.

The next step is to parallelize the sequence length calculations.

#define qCLKS_PER_SEC	3000000000UL	/* processor clock rate */

static inline qword rdtsc(void) {
    register qword acc128lo asm ("rax");
    register qword acc128hi asm ("rdx");
    asm volatile ("rdtsc" : "=r"(acc128lo), "=r"(acc128hi));
    return (acc128hi << 32) | acc128lo;

/* max_cnt has to be initialized to a nonzero value so it gets put in the DATA
section instead of the COMMON section */
volatile unsigned max_cnt = 1;
unsigned secs = 1;

word cache[<# of words in cache file>];

typedef struct {
    qword starting_n;
    qword current_n;
    qword cnts;
    qword cached_cnts;
    dword increment;
} hailstone_t;

void *hailstone(void *parm)
{
    qword i;
    ((hailstone_t *)parm)->cnts = 0;
    ((hailstone_t *)parm)->cached_cnts = 0;

    for (i = ((hailstone_t *)parm)->starting_n; 1;
     i += ((hailstone_t *)parm)->increment) {
        qword n = i;
        dword cnt = 0;
        ((hailstone_t *)parm)->current_n = n;
        do {
            qword shifter, idx;

            n += 2 * n + 1;
            shifter = bsfq(n);
            n >>= shifter;
            cnt += shifter + 1;    /* the tripling step plus the halvings */

            idx = n / 3;
            if (idx < sizeof(cache) / sizeof(*cache)) {
                cnt += cache[idx];
                ((hailstone_t *)parm)->cached_cnts += cache[idx];
                n = 1;
            }
        } while (n != 1);

        ((hailstone_t *)parm)->cnts += cnt;

        if (cnt > max_cnt) {
            max_cnt = cnt;
            printf("%lu: %u\n", i, cnt);
        }
    }
}
This can be invoked by:

int main(int argc, char *argv[])
{
    FILE *fin;
    dword nthreads = atoi(argv[1]);
    dword i;
    pthread_t *pThreads;
    hailstone_t *pParms;
    qword next_sec;
    struct timespec req = { 0, 50000000 }, rem;

    if (!nthreads)
        nthreads = 1;
    if (!(fin = fopen(CACHEFILE, "rb"))) {
        fprintf(stderr, "Couldn't open cache file '%s'\n", CACHEFILE);
        return 1;
    }
    if (1 != fread(cache, sizeof(cache), 1, fin)) {
        fprintf(stderr, "Read error on cache file '%s'\n", CACHEFILE);
        return 1;
    }
    if (!(pThreads = (pthread_t *)malloc(nthreads * sizeof(pthread_t)))) {
        fprintf(stderr, "Couldn't allocate %lu bytes\n",
         nthreads * sizeof(pthread_t));
        return 1;
    }
    if (!(pParms = (hailstone_t *)malloc(nthreads * sizeof(hailstone_t)))) {
        fprintf(stderr, "Couldn't allocate %lu bytes\n",
         nthreads * sizeof(hailstone_t));
        return 1;
    }
    next_sec = rdtsc() + qCLKS_PER_SEC;

    for (i = 0; i < nthreads; ++i) {
        int rc;

        pParms[i].starting_n = STARTING_N + i * 2;
        pParms[i].increment  = nthreads * 2;
        if ((rc = pthread_create(pThreads + i, NULL,
                  hailstone, (void *)(pParms + i))))
            fprintf(stderr,"Error - pthread_create() return code: %d\n", rc);
    }
    while (1) {
        nanosleep(&req, &rem);
        if (rdtsc() >= next_sec) {
            qword min_n = pParms[0].current_n, all_cnts = pParms[0].cnts,
              all_cached_cnts = pParms[0].cached_cnts;
            dword h, m, s;
            for (i = 1; i < nthreads; ++i) {
                if (pParms[i].current_n < min_n)
                    min_n = pParms[i].current_n;
                all_cnts += pParms[i].cnts;
                all_cached_cnts += pParms[i].cached_cnts;
            }
            next_sec += qCLKS_PER_SEC;
            s = secs++;
            m = s / 60;
            s %= 60;
            h = m / 60;
            m %= 60;
            fprintf(stderr, "\r%02u:%02u:%02u  %lu  %.0f/sec (%4.1f%%)",
           h, m, s, min_n, (min_n - STARTING_N + 1) / 2. / secs,
           100. * all_cached_cnts / all_cnts);
        }
    }
    /* Last thing that main() should do */
}

Synchronized access to max_cnt is explicitly avoided as it presents a very significant performance hit (approx. one-third loss of performance) on parallel processing.

Using this (the final code) to run three threads results in ~29 M NCPS — an approx. 2000% improvement over the naive implementation.

On an Intel 4790K CPU with 4 physical cores and 4 virtual (Hyper-Threading) cores running at a 4.6 GHz clock rate, with 32 GiB of Crucial Ballistix DDR3 RAM overclocked to 2 GHz, the code runs at ~80 M NCPS. Unfortunately, a simple calculation shows that it would take 10 years at that rate to check all odd numbers up to 46785696846401151 (the last — wrong — element of the original list).


[Tech Blog] PCIe SSD for Genome assembly



Genome assembly software takes a huge number of small pieces of DNA sequence (called "reads") and tries to assemble them into a long DNA sequence that represents the original chromosomes. It generally consumes not only large computational power but also a large working memory space. The required memory depends on the input data size, but it is often in the terabytes. Although it is true that the price of DRAM has dramatically decreased, a workstation or server with a few TB of memory is still very expensive. For example, the IBM System x3690 X5 can hold 2TB of memory, but its list price is more than $300k.

On the other hand, PCI Express SSD boards are on the rise. Many hardware vendors, such as Intel, Fusion-io (acquired by SanDisk), and OCZ (acquired by Toshiba), release a variety of PCIe-SSD boards. They generally have lower bandwidth than DRAM, but much higher bandwidth than a standard SATA/SAS SSD. The price is a little high compared with a standard SATA SSD, but the Fusion-io ioFX 2.0 offers 1.6TB at $5,418.95 on Amazon.com. Even if you add this board to a high-end workstation, the total price is still below $10,000, which is much cheaper than a 2TB-memory server.

In this blog post, I would like to explore whether using an SSD in place of DRAM yields a viable solution. We will use an open-source genome assembler called "velvet" as benchmark software.


First, I downloaded, compiled and installed “velvet” from the velvet web site.

$ tar xvfz velvet_1.2.08.tgz
$ cd velvet_1.2.08
$ make MAXKMERLENGTH=51 OPENMP=1
$ sudo cp velvetg velveth /usr/local/bin

In this procedure, 'MAXKMERLENGTH=51' sets the maximum "k-mer" length, and 'OPENMP=1' enables multi-threading via OpenMP. The k-mer length is a very important parameter in genome assembly, as it affects the quality of the output DNA sequence. For more detail on k-mers, please refer to the velvet user manual.

Velvet has two processes, velveth and velvetg. The velveth process creates a graph file to prepare for the genome assembly; the memory it requires is not so large. The velvetg process, which does the actual assembling, consumes much more memory and computation time.

Before we can start testing, we need input files. In this experiment, we will use two fastq files, SRR000021 and SRR000026, which were downloaded from this site. I processed these data with velveth as follows:

$ velveth SRR2126.k11  11 -short -fastq SRR2126.fastq

The first argument is the output directory name, the second argument is the length of the k-mer, the third argument specifies short read inputs, the 4th argument specifies the “fastq” file type, and the 5th argument is the input file.

The next step is assembly. The command to do so is as follows:

$ velvetg SRR2126.k11

The argument is the output directory generated by velveth. I measured the elapsed time of this command in several different hardware configurations:

1. Memory 4GB + Generic SATA HDD
2. Memory 4GB + Fusion IO ioFX
3. Memory 8GB

In case of configuration #2, I created a swap file on ioFX like this:

# dd if=/dev/zero of=/mnt/iofx/swap0 bs=1024 count=12582912
# chmod 600 /mnt/iofx/swap0
# mkswap /mnt/iofx/swap0 12582912
# swapoff -a
# swapon /mnt/iofx/swap0


The velvetg process uses about 8GB of memory for this input data, so roughly half of the temporary data spills out to swap space in configurations 1 and 2. The figure below shows the elapsed time for each configuration. Using the generic SATA HDD, the process had not finished after 2 hours, so we decided to kill it.



So is using a PCIe-SSD a viable solution? It is hard to say, as a 3x difference is not so small. In this particular experiment, only half of the memory space used by velvetg happened to be placed on the PCIe-SSD. As mentioned earlier, the real-life data that bioinformaticians deal with can be a few TB. If almost all of the memory space is on the PCIe-SSD, the performance is expected to be much worse.

However, considering that the HDD could not even complete the process in a reasonable amount of time, the PCIe-SSD card showed that it can vastly improve performance. The SSD can thus serve as a good compromise, as DRAM is much more expensive than an SSD.



[Tech Blog] I/O Benchmark Test for Windows Azure D-Series


Update: Microsoft has just released persistent SSD storage. It is still in the preview phase and available only in limited regions. Here is a link and the expected throughput:

Recently, Microsoft released a new series of virtual machines called the D-Series. D-Series VMs have a relatively new CPU (Xeon E5-2660, 2.2GHz) and local solid-state drives (SSDs). Amazon AWS already has SSD instances, and there are several SSD-only cloud vendors, so I'm just excited that we can finally use SSDs on Azure. In this blog post, I will show you I/O benchmark results for a D-Series VM.

Azure VMs have two types of local storage: persistent and ephemeral. Data on the persistent storage survives a VM reboot, but data on the ephemeral storage may be lost after rebooting. For the D-Series, the SSD storage is ephemeral. As shown below, the SSD's performance is outstanding, but you have to save your data to persistent storage before the VM is turned off. The size of the new SSD storage is limited; in the case of the D14 (the high-end instance of the D-Series), the local storage is 800GB. If you need more, you can use Azure Storage, an elastic cloud storage system that provides several different ways to access it. An Azure VM can simply mount Azure Storage as a local block device with the "mount" command.

I prepared one D14 instance with CentOS release 6.5 installed, 30GB of persistent storage, 800GB of SSD ephemeral storage, and 999GB of Azure Storage mounted (Fig. 1). The specification of the D14 is shown in Table 1. In this blog post, the I/O performance of the 3 different storage types is evaluated.


Fig. 1: Three different storages mounted on Azure VM


# of cores 16
Persistent Storage 30GB
Ephemeral Storage 800GB
Price $1.542/hour
Table 1. Specification of the D14 instance

At first, I tried a simple read/write test with "dd" and "hdparm". This test is useful for a quick check, since these commands come pre-installed in many Linux distributions. The actual commands are:

– Read Test

# sync
# echo 3 > /proc/sys/vm/drop_caches
# hdparm -t

– Write Test

# dd if=/dev/zero of= ibs=1M obs=1M count=1024 oflag=direct

The commands are simple. However, if you miss a command or an argument, the result will likely be incorrect. Lines 1 & 2 of the read test drop the read buffer cache in RAM, so that we evaluate the performance of the storage I/O itself. In the write test, 1GB of data is read from /dev/zero and written to the target. The last argument to dd (oflag=direct) requests direct access to the storage medium, without Linux caching reads or writes. The result is shown in Fig. 2. The measurements were conducted 12 times, with an interval of 10 sec between trials. These commands issue I/O operations sequentially.


Fig.2: Result of Simple Read/Write test

For an HDD, the typical sequential read/write performance is around 100MB/s. For a SATA/SAS SSD, it should be around 400MB/s. As shown in Fig. 2, Azure's ephemeral storage shows remarkably high performance; they must be using high-end PCIe-SSDs, whose typical performance is around 1GB/s. In the read test results, you can see that the performance of the persistent storage (blue line) depends on the trial number. As I mentioned, I removed the Linux OS cache with the sync; echo 3 > /proc/sys/vm/drop_caches commands, but it still looks like a cache effect. I don't have a concrete explanation, but it might be caching on the virtual machine host, which we cannot control from the guest OS.

The simple read/write test is easy to use, but the result is not so reliable, so we also evaluated the 3 types of storage with FIO. FIO is a standard Linux tool for evaluating I/O performance and has a large number of options; it may be difficult for a beginner to pick the right ones. In this evaluation, I used the job file published by WinKey (http://www.winkey.jp/downloads/index.php/fio-crystaldiskmark, written in Japanese). With this job file, you can run the same types of I/O performance tests as CrystalDiskMark (http://crystalmark.info/software/CrystalDiskMark/index-e.html), a famous benchmark tool on Windows. The command is simply:

# fio crystaldiskmark.fio

The target storage area is /tmp by default. If you want to evaluate a different directory, edit line 7 (directory=/tmp/) in crystaldiskmark.fio. The result is shown in Fig. 3. Note that the Y-axis is on a logarithmic scale. The performance difference between the ephemeral storage and the persistent storage is dramatically large, especially for random read/write at queue depth* = 32.
* Queue Depth: The number of simultaneous input/output requests for a storage device.

Fig. 3: Result of FIO benchmark

As demonstrated by the FIO test results, SSD performance on the Azure D-Series is great, but again, the problem is that the SSD is ephemeral storage. If your application boots up and shuts down instances often, you could develop a management tool or daemon to save your essential data to Azure Storage or persistent local storage.



[Tech Blog] Power Management Using the Windows Azure Java SDK


Azure SDKs

The Windows Azure Portal offers an intuitive interface for performing tasks such as turning a VM on or off. However, when these tasks need to be performed in bulk, one soon realizes that clicking through the user interface may not be the optimal method.

Fortunately, Windows Azure can be managed through REST APIs. And to make these REST APIs easy to use, command-line tools and numerous SDKs exist as wrappers, available for PowerShell, the Cross-Platform Command-Line Interface, .NET, PHP, Node.js, Python, Java, and Ruby.

Power Management code sample

After browsing through the Javadoc for the SDK, one finds the methods start() and shutdown() implemented in the class VirtualMachineOperationsImpl. Both methods require 3 parameters, all of which can be found in the Azure Portal.


But how does one create an instance of VirtualMachineOperationsImpl? This did not seem intuitive, so I downloaded the source code of the SDK and browsed through the test code, which led me to the following 3 lines of code:


Configuration config = PublishSettingsLoader.createManagementConfiguration(publishSettingsFileName, subscriptionId);
ComputeManagementClient computeManagementClient = ComputeManagementService.create(config);
OperationResponse operationResponse = computeManagementClient.getVirtualMachinesOperations().start(serviceName, deploymentName, virtualMachineName);


Basically, the getVirtualMachinesOperations() method of the ComputeManagementClient class returns an instance of VirtualMachineOperationsImpl. The ComputeManagementClient class in turn requires an instance of the Configuration class, which requires the filename of the publish settings file and the subscription ID at instantiation.


The entire sample code is as follows:

import com.microsoft.windowsazure.core.*;
import com.microsoft.windowsazure.Configuration;
import com.microsoft.windowsazure.Configuration.*;
import com.microsoft.windowsazure.credentials.*;
import com.microsoft.windowsazure.management.configuration.*;
import com.microsoft.windowsazure.management.compute.*;
import com.microsoft.windowsazure.management.compute.models.*;
import org.apache.http.impl.client.HttpClientBuilder;
import java.util.concurrent.*;
import java.io.File;
import java.io.IOException;
import java.lang.String;

class AzureStartVM {
    public static void main(String[] args) {
        String publishSettingsFileName = System.getenv("AZURE_PUBLISH_SETTINGS_FILE");
        if (publishSettingsFileName == null) {
            System.err.println("Set AZURE_PUBLISH_SETTINGS_FILE.");
            System.err.println(" Ex. $ setenv AZURE_PUBLISH_SETTINGS_FILE $HOME/conf/XXX.publishsetting");
            System.exit(1);
        }
        File file = new File(publishSettingsFileName);
        if (!file.exists()) {
            System.err.println("File not found: " + publishSettingsFileName);
            System.exit(1);
        }
        String subscriptionId = "XXXXXXXXXXX";
        if (args.length != 3) {
            System.err.println("Usage: AzureStartVM serviceName deploymentName virtualMachineName");
            System.exit(1);
        }
        String serviceName = args[0];
        String deploymentName = args[1];
        String virtualMachineName = args[2];

        try {
            Configuration config = PublishSettingsLoader.createManagementConfiguration(publishSettingsFileName, subscriptionId);
            ComputeManagementClient computeManagementClient = ComputeManagementService.create(config);

            OperationResponse operationResponse = computeManagementClient.getVirtualMachinesOperations().start(serviceName, deploymentName, virtualMachineName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


Compile source

1) Add the Java SDK jar files to CLASSPATH
2) Download the publish settings file locally and set its path in AZURE_PUBLISH_SETTINGS_FILE
3) Set the string value of subscriptionId in the source
4) Compile

$ javac AzureStartVM.java


Execute Code

$ java AzureStartVM serviceName deploymentName vmName


So Why Java?

The different tools and SDKs are at different stages of development, and the required management functions may not be available for your language of choice. PowerShell is by far the most complete and the easiest to use, but it is not available to Linux users.

The Cross-Platform CLI would probably be the next choice in the chain of tools to try. However, this CLI uses Node.js underneath to issue the REST API requests, and there seems to be either a bug or an incompatibility with Node.js on the CentOS-based Linux VM for Azure, which made its operation unreliable. Microsoft support was able to narrow the problem down to the https module in Node.js, and found that the problem did not exist on Ubuntu, but switching operating systems just for this purpose seemed rather nonsensical.

PHP and Python do not yet have power management modules implemented, and while implementing these modules for the SDK was an option, Java seemed to be the easier choice.

Why use start() and stop() instead of the async options?

The implementation of the REST API for power management seems to be thread-unsafe and results in an error when executed for VMs in the same cloud service. This means that if CloudService1 contained VM1 and VM2, a start command cannot be triggered for both VM1 and VM2 simultaneously. Even on the Azure portal, one has to wait until VM1 is running before VM2 can be started. Therefore, it made sense to implement a blocking start/shutdown.

VMs in other cloud services, on the other hand, can be started/shut down simultaneously, so the code can be executed in parallel from multiple terminals. Hence, the async versions were not necessary.


The above command can be executed in a for-loop from shell to start multiple VMs at once.
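A minimal sketch of such a loop (the service, deployment, and VM names here are placeholders; remove the leading echo to actually issue the starts):

```shell
# Placeholder names: swap in your real serviceName/deploymentName/vmName
# triples. Each VM here is assumed to live in a different cloud service,
# so the blocking starts can safely be run one after another.
for vm in vm1 vm2 vm3; do
    echo java AzureStartVM "service-$vm" "deploy-$vm" "$vm"
done
```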

While it may be trivial for a developer to come up with this solution, writing one is probably not part of the job description for the Linux administrators who would benefit from it the most.

[Tech Blog] Super Computing 2013

Super Computing 2013 (SC13) has started in Denver, Colorado! The Super Computing top 500 list has already been updated, and there are no big changes to the top 5. Tianhe-2, a supercomputer developed in China, retained its position as the world’s No. 1 system.


One important but often ignored statistic in supercomputing is power consumption. For example, Tianhe-2 reaches 33.86 petaflop/s on the Linpack benchmark, but it consumes 17,808 kW of power to do so! This is almost the entire output of a small thermal power station. The Green 500 ranks supercomputers by energy efficiency instead of just raw processing power. In terms of MFLOPS/Watt (33.86 petaflop/s divided by 17,808 kW works out to roughly 1,900 MFLOPS/Watt for Tianhe-2), the TSUBAME 2.5, developed by the Tokyo Institute of Technology, should take the No. 1 spot. The latest Green 500 list will be available soon.

There are several techniques we can use to manage power consumption at the system level, but within each computing node we only have a few options. Heterogeneous setups, such as those using an NVIDIA/AMD GPU or Intel’s Xeon Phi, have become the norm, and we think FPGA chips are another promising option. FPGAs offer pipeline parallelism, where several different tasks can be spawned in a push-pull configuration and each task is fed data by the previous task, with or without host interaction. Several tech papers report that FPGAs outperform other chips in terms of both latency and energy efficiency. The standard FPGA languages (VHDL and Verilog-HDL) can be a complicated struggle for any software developer (like me!), but the chip vendor ALTERA has recently announced OpenCL support on their FPGA devices. This will make development faster, and cause fewer headaches. Fixstars has been a part of the early access program for their OpenCL SDK since last summer, and we provide OpenCL software and services to our clients.

Unfortunately, we don’t have a booth at SC13 this year, but we can be found at our partner Nallatech’s booth (#3519). We’ll even have coupons for copies of our OpenCL programming book.
Stop by and check out OpenCL on FPGA!

[Tech Blog] Computational Geometry Engine and Our Patented GPP Technology


Computational geometry engines are central to both emerging and ubiquitous technologies. From online mapping to digital chip manufacturing, computational geometry algorithms reduce enormous sets of data into visualizable results that enable engineers and casual users to make informed decisions.

The plane sweep algorithm, although widely used in computational geometry, does not parallelize efficiently, rendering it incapable of benefiting from recent trends in multicore CPUs and general-purpose GPUs. Instead of the plane sweep, some researchers have proposed a uniform grid as a foundation for parallel algorithms of computational geometry. However, long-standing robustness and performance issues have deterred its wider adoption, at least in the case of overlay analysis. To remedy this, we have developed previously unknown, unique methods to perform snap rounding and to efficiently compute the winding number of overlay faces on a uniform grid. We have implemented them as part of an extensible geometry engine to perform polygon overlay with OpenMP on CPUs and CUDA on GPUs. As previously announced, we have released a software product, called “Geometric Performance Primitives (GPP),” based on this implementation. The overall algorithm works on any polygon configuration, whether degenerate, overlapping, self-overlapping, disjoint, or containing holes. With typical data, it features time and space complexities of O(N + K), where N is the number of edges and K the number of intersections. Its single-threaded performance not only rivals the plane sweep; it also achieves a parallel efficiency of 0.9 on our quad-core CPU, with an additional speedup factor of over 4 on our GPU. These performance results should extrapolate to distributed computing and other geometric operations.

To obtain baseline performance, we compared GPP against equivalent functionality found in the following commonly used GIS software tools: ArcGIS 10.1 for Desktop, GRASS 6.4.2, OpenJUMP 1.6.3 (with JTS 1.13 as backend), and QGIS 1.8.0 (with GEOS 3.3.2 as backend), whose algorithms execute on a single thread only and may or may not exploit the plane sweep. Note, however, that their rich feature sets may add overhead not present in GPP. With each of them, we applied a dissolve (merge) operation and a geometric intersection (boolean AND) to two pairs of layers taken from real-world datasets and recorded the resulting execution times. GPP’s overall performance fared well against the existing solutions, reaching an average speedup of 16x on the CPU alone against ArcGIS, the fastest currently available software application we found.


Performance of ArcGIS versus GPP on real-world datasets. Please refer to the paper for more technical details.

At this year’s ACM SIGSPATIAL, our technical paper was selected out of over 200 very competitive entries. The SIGSPATIAL section of the ACM covers a broad range of innovative designs and implementations, with special attention paid to emerging trends in geospatial information systems (GIS).

So, what’s our next step? GPP has already gained some success in the Electronic Design Automation (EDA) industry. Our next target is object-relational database engines. Computational geometry processing is a necessary component of spatially aware systems, including database management systems (DBMS) such as Oracle Spatial and Graph, and PostGIS for PostgreSQL. For database management systems, spatially aware extensions provide native support for storage and retrieval of geometric objects that are commonly quite large and complex. This turns a standard DBMS into a location-aware engine that can drive business and technology applications, rendering complex, layered, and/or geographical data in an intuitive and visualizable form. In addition to generalized spatial queries for their databases, Oracle’s solution, for example, provides explicit support for geo-referenced images, topologies, and 3D data types, including triangulated irregular networks (TINs) and point clouds with native support for LIDAR data. PostGIS, on the other hand, provides native support and algorithms for ESRI shapefiles, raster map algebra, and spatial reprojection. When combined with GPP’s highly parallelized computational geometry engine, these functions can process the large, multi-layered data sets necessary to perform automated emergency route generation, natural resource management, insurance analysis, petroleum exploration, and financial queries 10 times or more faster than conventional algorithms.

We are excited to attend SIGSPATIAL 2013 in November and look forward to discussing the opportunities for geospatial processing that GPP provides with developers and researchers from around the world. See you in Florida!

[Tech blog] An introduction to SIMD vectorization and NEON

1. What is vectorization?

Vectorization is a type of parallel computing where a single process performs multiple tasks at the same time. This is done by packing a set of elements together into an array and operating on them all at once. It’s especially useful for multimedia purposes such as gaming, image processing, video encoding/decoding and much more.  The process looks something like Figure A.

Fig A: SIMD Sample

So how do we do this in actual code? And how does it compare with a scalar, one-at-a-time approach? Let’s take a look. I’m going to write two implementations of the same addition function, one scalar and one vectorized using ARM’s NEON intrinsics and gcc 4.7.2-2 (on Yellowdog Linux for ARM*).




The scalar function is very simple: it’s just a for loop adding two 16-member arrays.

void add_scalar(uint8_t * A, uint8_t * B, uint8_t * C){
    for(int i=0; i<16; i++){
        C[i] = A[i] + B[i];
    }
}

The NEON function however looks a lot more complicated.

void add_neon(uint8_t * A, uint8_t * B, uint8_t * C){
        //Set up a couple of vectors to hold our data
        uint8x16_t vectorA, vectorB, vectorC;

        //Load our data into the vectors' registers
        vectorA = vld1q_u8(A);
        vectorB = vld1q_u8(B);

        //Add A and B together
        vectorC = vaddq_u8(vectorA, vectorB);

        //Store the result back out to C
        vst1q_u8(C, vectorC);
}

Those strange-looking functions are NEON’s intrinsics; they form an intermediate layer between assembly and C. They’re a bit confusing, but ARM’s infocenter goes into some detail about them and GCC has a great reference available here. So what do they do? Well, “uint8x16_t” is a vector type containing 16 8-bit uints in an array, and “vld1q_u8” loads 8-bit uints into a vector. Finally, “vaddq_u8” adds two vectors of 8-bit uints together all at once and returns the result. If you test this out you’ll notice that the NEON function isn’t really any faster. This is because those load functions take up a lot of time, and we’re doing so little work that the scalar solution catches up. If we could avoid those (by structuring our program to use vectors in the first place) we’d see a greater improvement.

Now let’s take a look at another case where NEON can really shine: matrix multiplication. Specifically, 4×4 matrix multiplication, a common case in computer graphics.

// Our test matrices
        uint16_t matrixA[4][4] = {{1,  2,  3,  4},
                                  {5,  6,  7,  8},
                                  {9,  10, 11, 12},
                                  {13, 14, 15, 16}};

        uint16_t matrixB[4][4] = {{16, 15, 14, 13},
                                  {12, 11, 10, 9},
                                  {8,  7,  6,  5},
                                  {4,  3,  2,  1}};

        uint16_t matrixC[4][4];

Multiplying these together with a scalar function is fairly straightforward: we calculate the dot product of each row in matrixA with each column in matrixB. We can do this in a somewhat efficient manner using for loops:

        for(i=0; i<4; i++){ //For each row in A
                for(j=0; j<4; j++){ //And each column in B
                        dotproduct = 0; //Reset the running total for each entry
                        for(k=0; k<4; k++){ //For each item in that column
                                dotproduct = dotproduct + A[i][k]*B[k][j];
                        }
                        C[i][j] = dotproduct; //Fill in C with our results
                }
        }

Now using NEON…

        //Load matrixB into four vectors
        uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4;

        vectorB1 = vld1_u16 (B[0]);
        vectorB2 = vld1_u16 (B[1]);
        vectorB3 = vld1_u16 (B[2]);
        vectorB4 = vld1_u16 (B[3]);

        //Temporary vectors to use with calculating the dotproduct
        uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4;

        // For each row in A...
        for (i=0; i<4; i++){
                //Multiply the rows in B by each value in A's row
                vectorT1 = vmul_n_u16(vectorB1, A[i][0]);
                vectorT2 = vmul_n_u16(vectorB2, A[i][1]);
                vectorT3 = vmul_n_u16(vectorB3, A[i][2]);
                vectorT4 = vmul_n_u16(vectorB4, A[i][3]);

                //Add them together
                vectorT1 = vadd_u16(vectorT1, vectorT2);
                vectorT1 = vadd_u16(vectorT1, vectorT3);
                vectorT1 = vadd_u16(vectorT1, vectorT4);

                //Store the four dot products as row i of C
                vst1_u16 (C[i], vectorT1);
        }

That looks much more complicated, and in some ways it is. It’s also about three times as fast (including loads) on my test machine. Instead of stepping through each item in matrixA, I’m stepping through each row and calculating four of the dot products at a time. If we break down the matrix multiplication and look at the dot product calculations for the first row, you can hopefully see why this works:

C[0][0] = (1 * 16) + (2 * 12) + (3 * 8) + (4 * 4) (which is 80)
C[0][1] = (1 * 15) + (2 * 11) + (3 * 7) + (4 * 3) (70)
C[0][2] = (1 * 14) + (2 * 10) + (3 * 6) + (4 * 2) (60)
C[0][3] = (1 * 13) + (2 * 9) + (3 * 5) + (4 * 1) (50)

We’re multiplying each row of matrixB by a single value from matrixA at a time, something NEON can do easily using “vmul_n_X”. We hold this data in temp vectors, add those vectors together with “vadd_X” (accumulating the result in vectorT1), then store our new row into matrixC using “vst1_X”. A very different approach than the scalar solution, but with the same results. My test program is attached to the blog post if you’d like to give it a try yourself.

2. Autovectorization

Interested in avoiding all this mess and still getting at least some of the benefits? Luckily there’s something called auto-vectorization, an automatic way to convert a scalar program into a vectorized one. Research into auto-vectorization is still ongoing (and probably will be for quite some time), but there are several implementations available already. GCC’s is by far the most popular, and GCC 4.7(+), which includes support for auto-vectorization, is already included in Yellowdog Linux 7.

Enabling auto-vectorization in GCC is quite simple, but there are several tricks and hints you may need to give the compiler to get an optimal result. In GCC 4.7(+), auto-vectorization can be enabled by adding -O3 or -ftree-vectorize to the command line (or CFLAGS). If you’re planning to use NEON you’ll need to enable it with -mfpu=neon, although there are some issues. GCC’s auto-vectorization will ignore floats with NEON unless you enable -funsafe-math-optimizations, and unfortunately using NEON instructions with floats can lead to a loss of precision, as NEON does not fully adhere to IEEE 754.
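Putting those flags together, a typical compile line might look like the following sketch (the file names are made up, and depending on your toolchain you may also need a float-ABI flag such as -mfloat-abi=hard):

```shell
# -O3 already implies -ftree-vectorize in GCC 4.7; -mfpu=neon makes the
# NEON unit available to the vectorizer. -funsafe-math-optimizations is
# only needed to vectorize float loops (at the cost of strict IEEE 754).
gcc -O3 -mfpu=neon -funsafe-math-optimizations -o matmul matmul.c
```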

Auto-vectorization can, under the right circumstances, significantly speed up a program, but it’s imperfect. Luckily there are ways to structure your code, and hints you can give the compiler, that will make sure it behaves properly. In future posts I’ll cover these tricks and tips, as well as go into detail about NEON assembly and how using it directly instead of intrinsics can give your app an even greater speed boost.

* Yellowdog Linux (YDL) is a Linux distribution developed by Fixstars Solutions. It is based on RHEL/CentOS and used in Fixstars’ products. See http://www.ydl.net/ydl7/ for more details. A version of Yellowdog Linux optimized for ARM servers is currently in development.

[Tech blog] IBM 7R2 Appliance and YellowDog Linux: High Performance Hardware With an Open Software Stack

In July, Fixstars announced the PowerLinux 7R2 YellowDog Appliance as a new open standard platform able to efficiently support basic infrastructure services, large transactional databases, BigData, and HPC workloads. IBM’s latest Power7 offerings are very exciting and with Fixstars’ offering, the high performance hardware is available with an open software stack just like the commodity servers it is replacing.

The 7R2 is a 2U, rack-mountable server that features two 3.5GHz, 8-core Power7 processors. Each core has four threading units, meaning an impressive 64 penguins are displayed on boot if you have CONFIG_LOGO enabled, not to mention the actual performance the processors are able to deliver. The default configuration features 32GB of memory, but up to 256GB is available. With six bays for SAS storage, plenty of fast local storage is available, and with six PCI Express slots (five 8x, one 4x) there is no shortage of expansion available for Fibre Channel, InfiniBand, or any other type of external connectivity you need.

IBM marketing materials claim that the PowerLinux 7R2 is capable of efficiently running thousands of tasks in parallel while achieving massive scale-out flexibility and exploiting extreme memory bandwidth. In practice this means using fewer servers for increased performance at lower costs (http://public.dhe.ibm.com/common/ssi/ecm/en/poc03088usen/POC03088USEN.PDF) and reducing the rate at which costs grow as IT capabilities expand (http://public.dhe.ibm.com/common/ssi/ecm/en/pol03112usen/POL03112USEN.PDF).

You may wonder how the 7R2 Appliance compares to conventional Intel servers; the Power7 processor is quite similar to the Xeon E5 and E7 families, with 256KB of L2 cache per core. While the Intel processors share up to 20MB of L3 cache, the Power7 has a dedicated 4MB of L3 cache per core for a total of 32MB. Comparing the 7R2 to other platforms, it fits into the same segment as the Dell PowerEdge R820 series, which as of October 2012 are priced between $13,000 and $18,000 each depending on processor selection, with 32GB of memory and 2x146GB SAS drives. At $14,990 (or $19,990 with PowerVM Standard Edition) the 7R2 is competitively priced with high-end x86 servers, ending the complaint that the Power architecture is “too expensive” for adoption.

Fixstars is including the new Yellow Dog Linux 7 with the 7R2 appliance, and just like previous versions, it is based on industry-standard Enterprise Linux. Yellow Dog 7 features a few tweaks for better performance on Power that developers will love. The biggest improvement is the inclusion of the GCC 4.7 compiler suite, something that was universally requested by Fixstars’ other project teams. GCC 4.7 features the auto-vectorization support that we’ve documented here (http://ydl.net/ydl7/support/AutoVectorizationDocs.shtml).

PowerVM Standard Edition supports almost any virtualization function needed, while Enterprise Edition, with awe-inspiring features such as Active Memory Sharing and Live Partition Mobility, is also available. Server consolidation is an obvious use for a machine like the 7R2, and we’ve built out 7R2s running 160 virtual machines; the performance of both database and web server workloads didn’t suffer, though processor-intensive workloads of course suffered from such a small slice of CPU time. In more realistic consolidation tests, we found that a 7R2 configured with 16 LPARs had the best overall performance compared to the same configuration with either 1, 4, or 64 LPARs.

The Phoronix Test Suite demonstrates the YellowDog 7R2 Appliance’s capabilities using a variety of CPU and I/O intensive benchmarks. The 7R2 (configured with 64GB of memory and four 146GB SAS drives in a single Logical Volume) has an impressive crafty score (http://openbenchmarking.org/test/pts/crafty) of 111.78, runs the Parallel BZIP test (http://openbenchmarking.org/test/pts/compress-pbzip2) in only 3.18 seconds and also scores 47929 in the PHPBench test (http://openbenchmarking.org/test/pts/phpbench). Overall the results are fairly similar to the high end Intel Xeon offerings, with the 7R2 being faster in some tests and the Xeon being faster in others.

Over the coming months, we’ll be posting more about the 7R2 and how it integrates into different environments with Virtualization and BigData.