[Tech Blog] I/O Benchmark Test for Windows Azure D-Series


Update: Microsoft has just released persistent SSD storage. It is still in the preview phase and available only in limited regions. Here is a link and the expected throughput:

Recently, Microsoft released a new series of virtual machines called the D-Series. D-Series VMs have a relatively new CPU (Xeon E5-2660, 2.2GHz) and local solid-state drives (SSDs). Amazon AWS already has SSD instances, and there are several SSD-only cloud vendors, so I'm excited that we can finally use SSDs on Azure. In this blog post, I will show the I/O benchmark results for a D-Series VM.

An Azure VM has two types of local storage: persistent and ephemeral. Data on persistent storage survives a VM reboot, but data on ephemeral storage may be lost after rebooting. For the D-Series, the SSD storage is ephemeral. As shown below, the performance of the SSD is outstanding, but you have to save your data to persistent storage before the VM is turned off. The size of the new SSD storage is also limited: in the case of the D14 (a high-end D-Series instance), the local storage is 800GB. If you need more space, you can use Azure Storage, an elastic storage system for the cloud that provides several different ways to access it; an Azure VM can mount Azure Storage as a local block device with the mount command.

I prepared a D14 instance running CentOS release 6.5, with 30GB of persistent storage, 800GB of SSD ephemeral storage, and 999GB of Azure Storage mounted (Fig. 1). The specification of the D14 is shown in Table 1. In this blog post, the I/O performance of these three storage types is evaluated.


Fig. 1: Three different storage types mounted on an Azure VM


# of cores           16
Persistent Storage   30GB
Ephemeral Storage    800GB
Price                $1.542/hour
Table 1. Specification of the D14 instance

At first, I tried a simple read/write test with dd and hdparm. This test is very useful for a quick check, since these commands come pre-installed in many Linux distributions. The actual commands are:

– Read Test

# sync
# echo 3 > /proc/sys/vm/drop_caches
# hdparm -t <device>

– Write Test

# dd if=/dev/zero of=<file> ibs=1M obs=1M count=1024 oflag=direct

The commands are simple; however, if you miss a command or an argument, the result will likely be incorrect. Lines 1 and 2 of the read-test commands drop the read cache held in RAM, so that the performance of the storage I/O itself is evaluated. In the write-test command, 1GB of data is read from /dev/zero and written to the target file. The last argument of the dd command (oflag=direct) makes dd access the storage medium directly, without Linux caching the reads or writes. The result is shown in Fig. 2. The measurements were conducted 12 times, with an interval of 10 seconds between trials. These commands issue I/O operations sequentially.
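To reduce noise, it helps to script the trials. Below is a minimal sketch of such a measurement loop; the function name, arguments, and target path are my own, and it omits the drop_caches step (which requires root) and oflag=direct (which some filesystems reject), so it is an illustration of the methodology rather than the exact commands used above.

```shell
#!/bin/sh
# Sketch: repeat the dd write test a given number of times, pausing
# between trials, and print the throughput line of each run.
run_write_test() {
    target=$1   # file to write (placed on the storage under test)
    mb=$2       # size of each trial in MiB
    trials=$3   # number of repetitions (the post used 12)
    interval=$4 # seconds to sleep between trials (the post used 10)
    i=1
    while [ "$i" -le "$trials" ]; do
        sync    # flush dirty pages so earlier writes do not skew this trial
        # dd reports throughput on stderr, hence the 2>&1; the last
        # line looks like "... bytes ... copied, 0.91 s, 1.2 GB/s".
        dd if=/dev/zero of="$target" bs=1M count="$mb" 2>&1 | tail -n 1
        sleep "$interval"
        i=$((i + 1))
    done
    rm -f "$target"
}
```

For example, `run_write_test /mnt/ssd/bench.img 1024 12 10` would reproduce the 12-trial, 10-second-interval setup described above (the path is an assumption).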


Fig.2: Result of Simple Read/Write test

For an HDD, typical sequential read/write performance is around 100MB/s; for a SATA/SAS SSD, it should be around 400MB/s. As shown in Fig. 2, Azure's ephemeral storage shows remarkably high performance. It must be backed by high-end PCIe SSDs, which typically reach around 1GB/s. In the read-test results, you can see that the performance of the persistent storage (blue line) depends on the trial number. As mentioned, I dropped the Linux OS cache with the sync; echo 3 > /proc/sys/vm/drop_caches commands, but it still looks like a caching effect. I don't have a concrete explanation, but it might be due to a cache on the virtualization host, which we cannot control from the guest OS.

The simple read/write test is easy to use, but the result is not very reliable, so I also evaluated the three storage types with FIO. FIO is a standard Linux tool for evaluating I/O performance and has a large number of options, which can make it difficult for a beginner to pick the right ones. In this evaluation, I used the job file published by WinKey (http://www.winkey.jp/downloads/index.php/fio-crystaldiskmark, written in Japanese). With this job file, you can run the same types of I/O performance tests as CrystalDiskMark (http://crystalmark.info/software/CrystalDiskMark/index-e.html), a famous benchmark tool on Windows. The command is simply:

# fio crystaldiskmark.fio

The target storage area is /tmp by default. If you want to evaluate a different directory, edit line 7 (directory=/tmp/) in crystaldiskmark.fio. The result is shown in Fig. 3. Note that the Y-axis is on a logarithmic scale. The performance difference between the ephemeral storage and the persistent storage is dramatically large, especially for random read/write at queue depth* = 32.
* Queue Depth: The number of simultaneous input/output requests for a storage device.
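For readers who prefer to write their own job file, a CrystalDiskMark-style FIO job generally looks like the sketch below. This is a hand-written illustration of the format, not WinKey's actual file; the section names, sizes, and runtime are assumptions.

```ini
[global]
ioengine=libaio   ; asynchronous I/O engine on Linux
direct=1          ; bypass the page cache, like dd's oflag=direct
size=1g           ; amount of data each job touches
directory=/tmp/   ; edit this to point at the storage under test
runtime=30

[seq-read]
rw=read           ; sequential read
bs=1m
iodepth=1
stonewall         ; wait for previous jobs before starting

[rand-read-qd32]
rw=randread       ; random 4KB reads
bs=4k
iodepth=32        ; queue depth 32
stonewall
```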

Fig. 3: Result of FIO benchmark

As demonstrated by the FIO test results, the SSD performance of the Azure D-Series is great, but again, the catch is that the SSDs are ephemeral storage. If your application boots up and shuts down instances often, you could develop a management tool or daemon that saves your essential data to Azure Storage or persistent local storage.



[Tech Blog] Power Management Using the Windows Azure Java SDK


Azure SDKs

The Windows Azure Portal offers an intuitive interface for tasks such as turning a VM on or off. However, when these tasks need to be performed in bulk, one soon realizes that clicking through the user interface may not be the optimal method.

Fortunately, Windows Azure can be managed through REST APIs. To make these REST APIs easy to use, command-line tools and numerous SDKs exist as wrappers, available for PowerShell, the Cross-Platform Command-Line Interface, .NET, PHP, Node.js, Python, Java, and Ruby.

Power Management code sample

After browsing the javadoc for the SDK, one finds the methods start() and shutdown() implemented in the class VirtualMachineOperationsImpl. Both methods require three parameters, all of which can be found in the Azure Portal.


But how does one create an instance of VirtualMachineOperationsImpl? This did not seem intuitive, so I downloaded the source code of the SDK and browsed through the test code, which led me to the following three lines:


Configuration config = PublishSettingsLoader.createManagementConfiguration(publishSettingsFileName, subscriptionId);
ComputeManagementClient computeManagementClient = ComputeManagementService.create(config);
OperationResponse operationResponse = computeManagementClient.getVirtualMachinesOperations().start(serviceName, deploymentName, virtualMachineName);


Basically, the getVirtualMachinesOperations() method of the ComputeManagementClient class returns an instance of VirtualMachineOperationsImpl. ComputeManagementClient in turn requires an instance of the Configuration class, which is created from the PublishSettings filename and the subscription ID.


The entire sample code is as follows:

import com.microsoft.windowsazure.core.*;
import com.microsoft.windowsazure.Configuration;
import com.microsoft.windowsazure.Configuration.*;
import com.microsoft.windowsazure.credentials.*;
import com.microsoft.windowsazure.management.configuration.*;
import com.microsoft.windowsazure.management.compute.*;
import com.microsoft.windowsazure.management.compute.models.*;
import org.apache.http.impl.client.HttpClientBuilder;
import java.util.concurrent.*;
import java.io.File;
import java.io.IOException;
import java.lang.String;

class AzureStartVM {
    public static void main(String[] args) {
        String publishSettingsFileName = System.getenv("AZURE_PUBLISH_SETTINGS_FILE");
        if (publishSettingsFileName == null) {
            System.err.println("Set AZURE_PUBLISH_SETTINGS_FILE.");
            System.err.println(" Ex. $ setenv AZURE_PUBLISH_SETTINGS_FILE $HOME/conf/XXX.publishsetting");
            System.exit(1);
        }
        File file = new File(publishSettingsFileName);
        if (!file.exists()) {
            System.err.println("File not found: " + publishSettingsFileName);
            System.exit(1);
        }

        String subscriptionId = "XXXXXXXXXXX";
        if (args.length != 3) {
            System.err.println("Usage: AzureStartVM serviceName deploymentName virtualMachineName");
            System.exit(1);
        }
        String serviceName = args[0];
        String deploymentName = args[1];
        String virtualMachineName = args[2];

        try {
            Configuration config = PublishSettingsLoader.createManagementConfiguration(publishSettingsFileName, subscriptionId);
            ComputeManagementClient computeManagementClient = ComputeManagementService.create(config);

            OperationResponse operationResponse = computeManagementClient.getVirtualMachinesOperations().start(serviceName, deploymentName, virtualMachineName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


Compile source

1) Add the Java SDK jar files to CLASSPATH
2) Download the publish settings file locally and set its path in AZURE_PUBLISH_SETTINGS_FILE
3) Set the string value of subscriptionId in the source
4) Compile:

$ javac AzureStartVM.java


Execute Code

$ java AzureStartVM serviceName deploymentName vmName


So Why Java?

The various tools and SDKs are at different stages of development, and the required management functions may not be available for your language of choice. PowerShell is by far the most complete and the easiest to use, but it is not available to Linux users.

The Cross-Platform CLI would probably be the next tool to try. However, this CLI uses node.js underneath to issue the REST API requests, and there seems to be either a bug or an incompatibility with node.js on the CentOS-based Linux VM for Azure, which made its operation unreliable. Microsoft support was able to narrow the problem down to the https module in node.js and confirmed that it did not occur on Ubuntu, but switching operating systems just for this purpose seemed rather nonsensical.

PHP and Python do not yet have power management modules implemented, and while implementing these modules for the SDK was an option, Java seemed to be the easier choice.

Why start() and shutdown() instead of the async options?

The implementation of the REST API for power management seems to be thread-unsafe and results in an error when executed concurrently for VMs in the same cloud service. This means that if CloudService1 contains VM1 and VM2, a start command cannot be triggered for both VM1 and VM2 simultaneously; even in the Azure portal, one has to wait until VM1 is running before VM2 can be started. Therefore, it made sense to implement a blocking start/shutdown.

VMs in other cloud services, on the other hand, can be started and shut down simultaneously, so the code can be run concurrently from multiple terminals. Hence, the async versions were not necessary.


The above command can be executed in a for-loop from shell to start multiple VMs at once.
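As a concrete sketch, the loop might look like the following. The service, deployment, and VM names are placeholders, and START_CMD defaults to echo as a dry run; set it to `java AzureStartVM` to issue real start requests.

```shell
#!/bin/sh
# Sketch: start one VM in each of several cloud services concurrently.
# START_CMD defaults to 'echo' (dry run); override with e.g.
#   START_CMD="java AzureStartVM" ./start_all.sh
START_CMD=${START_CMD:-echo}

start_all() {
    for svc in "$@"; do
        # VMs in *different* cloud services can be started in parallel,
        # so each blocking invocation is backgrounded; VMs in the *same*
        # service must still be started one at a time.
        $START_CMD "$svc" "${svc}-deploy" "${svc}-vm1" &
    done
    wait   # block until every start request has returned
}

start_all CloudService1 CloudService2 CloudService3
```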

While it may be trivial for a developer to come up with this solution, it is probably not part of the job description for Linux administrators, who would likely benefit from it the most.

Fixstars Establishes Canadian Subsidiary

Sunnyvale, CA — November 3rd, 2014 — Fixstars Solutions, Inc. announced its Canadian subsidiary's first official day of operations today.

Fixstars Solutions Canada will help grow its parent company's business in the United States and abroad by expanding its research and development capabilities in multicore, storage, and cloud computing technologies. The new subsidiary will be headquartered in Victoria, BC. Akihiro Asahara, the CEO of Fixstars Solutions, will assume the role of Chairman, while Owen Stampflee will assume the role of CEO.

About Fixstars Solutions

Fixstars Solutions is a software company devoted to “Speed up your Business”. Through its software parallelization/optimization expertise, its highly effective use of multi-core processors, and application acceleration for the next generation of memory technology that delivers high speed IO as well as power savings, Fixstars Solutions provides “Green IT,” while accelerating customers’ business in various fields. Learn more about how Fixstars Solutions can accelerate your business in life science, manufacturing, finance, and media & entertainment.
For more information, visit www.fixstars.com.

Fixstars Solutions Adopts Microsoft Azure to Boost Life Science Application Performance

Sunnyvale, CA — August 12th, 2014 — Fixstars Solutions Inc. today announced that it has adopted Microsoft Azure to help boost the performance of its life science applications. The importance of computer simulation and analysis in the life science field is growing, and some life science applications, such as molecular dynamics simulation and DNA sequencing, consume a vast amount of computing resources. Building a private cluster server system solely for these compute-intensive applications can be complex and costly. Fixstars Solutions developed a computing platform for life science applications on Microsoft Azure, designed to provide an elastic, high-performance cluster server system.

Fixstars Solutions has extensive knowledge of parallel processing technology, with a track record of numerous successful projects on hyper-scale cluster server systems. For the US Air Force Research Lab, Fixstars Solutions developed a simulation and video processing system made up of 2,016 SONY PS3® nodes. Recently, Fixstars and Mizuho Securities succeeded in accelerating a derivative valuation system 30-fold with Intel Xeon Phi® coprocessors.

Fixstars Solutions worked closely with Microsoft to efficiently port a life science application to Microsoft Azure in three weeks. The target application is compute-intensive and requires about 1,000 CPU cores. After some initial benchmarking, it was found that the Microsoft Azure A8/A9 instances, which are optimized for high performance computing (HPC) applications, outperformed competing cloud services, which led to the decision to use Microsoft Azure. As a result of Microsoft Azure's HPC capabilities, Fixstars Solutions achieved a performance boost of more than 4-fold compared to the software running on the original private cluster server system.

“I am pleased to work with Microsoft to bring solutions to market that we believe will advance the life sciences field,” said Akihiro Asahara, CEO of Fixstars Solutions. “Using Microsoft Azure’s new compute-intensive instances, A8 and A9, we can use a high performance cluster server system with over 1,000 CPU cores available at all times.”

“The flexibility, speed and cost-saving of cloud computing has the power to accelerate crucial research in the life sciences field,” said Venkat Gattamneni, Senior Product Marketing Manager, Cloud and Enterprise, Microsoft. “We look forward to continued work with Fixstars Solutions to boost the performance of life sciences applications using the Microsoft Azure platform.”

Fixstars Solutions will release solutions built on Microsoft Azure, and provide technical services and consulting for clients using Microsoft Azure. Not only will Fixstars aid in porting to Microsoft Azure to achieve speed-ups from parallel processing, but it will also help enhance system availability and reduce operating costs by using Azure's management functions, made available through the Microsoft Azure SDK and command-line tools.


About Fixstars Solutions

Fixstars Solutions is a software company devoted to “Speed up your Business”. Through its software parallelization/optimization expertise, its highly effective use of multi-core processors, and application acceleration for the next generation of memory technology that delivers high speed IO as well as power savings, Fixstars Solutions provides “Green IT,” while accelerating customers’ business in various fields. Learn more about how Fixstars Solutions can accelerate your business in life science, manufacturing, finance, and media & entertainment.
For more information, visit www.fixstars.com.

Fixstars and Mizuho Securities Succeed in Accelerating Derivative Valuation System 30-Fold with Intel Xeon Phi Coprocessor

Mizuho Securities becomes the first financial institution to deploy the coprocessor in a production environment

TOKYO — June 2, 2014 — Fixstars Corporation today announced that Fixstars and Mizuho Securities Co., Ltd. achieved a 30-fold speed-up by porting Mizuho Securities' derivative valuation system to Intel's many-core processor, the Intel® Xeon Phi™ coprocessor. Mizuho Securities has become the world's first financial institution to use the Xeon Phi™ coprocessor in a production environment.

The recent low interest rates have increased demand for structured bonds, which combine bonds with derivatives. To meet this customer demand, Mizuho Securities had been investigating ways to efficiently perform the vast amount of computation required by its derivative valuation systems, which led it to consider Intel's high-performance Xeon Phi™ coprocessors.

Mizuho Securities, with the aid of its development partner Fixstars, succeeded in devising an algorithm that efficiently distributes the workload across the cores of the Intel® Xeon Phi™ coprocessor. The Xeon Phi™ replaced an 8-core Xeon® system and outperformed it by a factor of 30. In addition, by placing an emphasis on source-code readability, a highly maintainable system was created.

“I am happy to announce that Fixstars’ expertise in parallel processing, in conjunction with Intel’s technology, has improved Mizuho Securities’ derivative valuation system,” said Kosuke Hirano, Senior Executive Officer, Intel K.K. “Intel will continue innovating with the highly parallel processing capability of Intel® Xeon Phi™ coprocessors.”

“I am very excited that our expertise in parallel programming techniques has enabled us to aid Mizuho Securities in the deployment of the New Derivative Valuation System,” said Miki Satoshi, CEO, Fixstars. “We had been involved in the project from the researching phase on what hardware to use, and I believe the smooth progress that led to this deployment could not have occurred without the mutual, underlying trust present in the partnership between Mizuho Securities’ and Fixstars, as well as the high capability and the reliability of Intel® Xeon Phi™ coprocessors.”

Mizuho Securities is already using the new system for plain vanilla derivatives, with plans to move exotic derivatives to it this summer.

Fixstars will continue to speed up Mizuho Securities' business by providing technical expertise in parallel processing.

* Intel, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

Initial Public Offering

Fixstars Corporation, the parent company of Fixstars Solutions, Inc., announced the pricing of its initial public offering of 100,000 shares of Class A common stock. Fixstars Corp's Class A common stock will trade on the Tokyo Stock Exchange (Mothers) under the code "3687" from April 23, 2014.

See more detail at http://www.tse.or.jp/english/listing/companies/index_e.html

[Tech Blog] Super Computing 2013

Supercomputing 2013 (SC13) has started in Denver, Colorado! The TOP500 list has already been updated, and there are no big changes in the top 5: Tianhe-2, a supercomputer developed in China, retained its position as the world's No. 1 system.


One important but often ignored statistic in supercomputing is power consumption. For example, Tianhe-2 reaches 33.86 petaflop/s on the Linpack benchmark, but it consumes 17,808 kW of power to do so! This is almost the entire output of a small thermal power station. The Green500 ranks supercomputers by energy efficiency instead of just raw processing power. In terms of MFLOPS/Watt, TSUBAME 2.5, developed by the Tokyo Institute of Technology, should take the No. 1 spot. The latest Green500 list will be available soon.
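As a quick back-of-the-envelope check of Tianhe-2's efficiency from the figures above (the one-liner below is just illustrative arithmetic, not a Green500 methodology):

```shell
# 33.86 petaflop/s = 33.86e9 MFLOPS; 17,808 kW = 17.808e6 W.
awk 'BEGIN { printf "%.0f MFLOPS/W\n", 33.86e9 / 17.808e6 }'
```

That is on the order of 1.9 GFLOPS per watt, which is what efficiency-focused systems compete against.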

There are several techniques we can use to manage power consumption at the system level, but within each computing node we only have a few options. Heterogeneous setups, such as those using an NVIDIA/AMD GPU or Intel's Xeon Phi, have become the norm, and we think FPGAs are another promising option. FPGAs offer pipeline parallelism, where several different tasks can be spawned in a push-pull configuration and each task consumes data supplied by the previous task, with or without host interaction. Several technical papers report that FPGAs outperform other chips in terms of both latency and energy efficiency. The standard FPGA languages (VHDL and Verilog-HDL) can be a complicated struggle for any software developer (like me!), but the FPGA vendor Altera has recently announced OpenCL support on their FPGA devices. This will make development faster and cause fewer headaches. Fixstars has been part of the early access program for Altera's OpenCL SDK since last summer, and we provide OpenCL software and services to our clients.

Unfortunately, we don’t have a booth at SC13 this year, but we can be found at our partner Nallatech’s booth (#3519). We’ll even have coupons for copies of our OpenCL programming book.
Stop by and check out OpenCL on FPGA!

[Tech Blog] Computational Geometry Engine and Our Patented GPP Technology


Computational geometry engines are central to both emerging and ubiquitous technologies. From online mapping to digital chip manufacturing, computational geometry algorithms reduce enormous sets of data into visualizable results that enable engineers and casual users to make informed decisions.

The plane sweep algorithm, although widely used in computational geometry, does not parallelize efficiently, rendering it incapable of benefiting from recent trends in multicore CPUs and general-purpose GPUs. Instead of the plane sweep, some researchers have proposed a uniform grid as a foundation for parallel computational geometry algorithms. However, long-standing robustness and performance issues have deterred its wider adoption, at least in the case of overlay analysis. To remedy this, we have developed previously unknown, unique methods to perform snap rounding and to efficiently compute the winding number of overlay faces on a uniform grid. We have implemented them as part of an extensible geometry engine that performs polygon overlay with OpenMP on CPUs and CUDA on GPUs. As previously announced, we have released a software product, called "Geometric Performance Primitives (GPP)," based on this implementation. The overall algorithm works on any polygon configuration, whether degenerate, overlapping, self-overlapping, disjoint, or with holes. With typical data, it features time and space complexities of O(N + K), where N is the number of edges and K the number of intersections. Its single-threaded performance not only rivals the plane sweep; it also achieves a parallel efficiency of 0.9 on our quad-core CPU, with an additional speedup factor of over 4 on our GPU. These performance results should extrapolate to distributed computing and other geometric operations.

To obtain a baseline, we compared GPP against equivalent functionality in the following commonly used GIS software tools: ArcGIS 10.1 for Desktop, GRASS 6.4.2, OpenJUMP 1.6.3 (with JTS 1.13 as backend), and QGIS 1.8.0 (with GEOS 3.3.2 as backend), whose algorithms execute on a single thread only and may or may not use the plane sweep. Note, however, that their rich feature sets may create additional overhead not present in GPP. With each of them, we applied a dissolve (merge) operation and a geometric intersection (boolean AND) to two pairs of layers taken from real-world datasets and recorded the execution times. GPP's overall performance fared well against the existing solutions, reaching an average speedup of 16× on the CPU alone against ArcGIS, the fastest currently available software application we found.


Performance of ArcGIS versus GPP on real-world datasets.

At this year’s ACM SIGSPATIAL, our technical paper was selected out of over 200 very competitive entries. Please refer to the paper for more technical details. The SIGSPATIAL section of the ACM covers a broad range of innovative designs and implementations, with special attention paid to emerging trends in geospatial information systems (GIS).

So, what's our next step? GPP has already gained some success in the Electronic Design Automation (EDA) industry. Our next target is object-relational database engines. Computational geometry processing is a necessary component of spatially aware systems, including database management systems (DBMS) such as Oracle Spatial and Graph, and PostGIS for PostgreSQL. For a DBMS, spatially aware extensions provide native support for the storage and retrieval of geometric objects, which are commonly quite large and complex. This turns a standard DBMS into a location-aware engine that can present complex, layered, and/or geographical data to business and technology applications in an intuitive, visualizable form. In addition to generalized spatial queries, Oracle's solution, for example, provides explicit support for geo-referenced images, topologies, 3D data types including triangulated irregular networks (TINs), and point clouds with native support for LIDAR data. PostGIS, on the other hand, provides native support and algorithms for ESRI shapefiles, raster map algebra, and spatial reprojection. When combined with GPP's highly parallelized computational geometry engine, these functions can process the large, multi-layered data sets needed for automated emergency route generation, natural resource management, insurance analysis, petroleum exploration, and financial queries ten or more times faster than conventional algorithms.

We are excited to attend SIGSPATIAL 2013 in November and look forward to discussing the opportunities for geospatial processing that GPP provides with developers and researchers from around the world. See you in Florida!

Fixstars to Launch “FlashAir™ Developers”, a Technical Information Website Supporting App Development Using the Toshiba FlashAir SDHC Memory Card.

Supporting App Development for SDHC Memory Cards with Embedded Wireless LAN Functionality.

FlashAir Developers Website


SUNNYVALE, Calif. — August 1st, 2013 — Fixstars today announces the launch of a technical information website for developers of Toshiba FlashAir apps and services: “FlashAir Developers” http://www.flashair-developers.com/

FlashAir is an SDHC memory card with embedded wireless LAN functionality. Since FlashAir works as a standalone wireless LAN access point, FlashAir-embedded devices are able to both send files to and receive files from each other. In addition to wireless networking, FlashAir retains normal SD card functionality, storing data and allowing typical Wi-Fi devices like PCs and smartphones to access its contents.

With FlashAir, users will be able to browse photos in a digital camera from a smartphone wirelessly, as well as download their favorite photos to any networked device, such as PCs, tablets, and smartphones.


How to use the FlashAir

The FlashAir Developers website provides API information for browsing files on FlashAir cards from typical Wi-Fi devices like PCs and smartphones, as well as providing app development tutorials, sample code, and support information with a FAQ and forum.  By using the APIs from the FlashAir Developers website, app developers will be able to develop apps that work with FlashAir for free.

“I am very pleased to provide the FlashAir API to all potential app developers through the Fixstars Solutions FlashAir Developers Website,” said Hiroto Nakai, Senior Manager, Flash Business Strategy Development, Memory Division, Toshiba Corporation Semiconductor & Storage Products Company. “On one hand, FlashAir is essentially an SDHC memory card with Wireless LAN. However it’s ideally suited to so much more than just digital cameras. I am expecting imaginative solutions to come from the developers who use the FlashAir Developers website.”

“I am very excited to support app developers through the FlashAir Developers website,” said Satoshi Miki, CEO, Fixstars Corporation. “We have been operating developer websites for various platforms and providing valuable support to a large number of software developers. Utilizing this resource, we hope to nurture FlashAir to become a platform for advanced devices and services.”

The FlashAir Developers website is operated by Fixstars in cooperation with Toshiba Corporation.

Additional Resources

The Toshiba FlashAir Product Information:

About Fixstars

Fixstars is a software company devoted to “Speed up your Business”. Through its software parallelization/optimization expertise, its highly effective use of multi-core processors, and application acceleration for the next generation of memory technology that delivers high speed IO as well as power savings, Fixstars provides “Green IT,” while accelerating customers’ business in various fields.  Learn more about how Fixstars can accelerate your business in medical imaging, manufacturing, finance, and media & entertainment.

For more information, visit www.fixstars.com.

Follow Us Online

[Tech Blog] An introduction to SIMD vectorization and NEON

1. What is vectorization?

Vectorization is a type of parallel computing in which a single instruction operates on multiple data elements at the same time (SIMD: single instruction, multiple data). This is done by packing a set of elements together into a vector and operating on them all at once. It's especially useful for multimedia purposes such as gaming, image processing, video encoding/decoding, and much more. The process looks something like Figure A.

Fig A: SIMD Sample


So how do we do this in actual code? And how does it compare with a scalar, one-at-a-time approach? Let's take a look. I'm going to write two implementations of the same addition function: one scalar, and one vectorized using ARM's NEON intrinsics and gcc 4.7.2-2 (on Yellowdog Linux for ARM*).




The scalar function is very simple; it's just a for loop adding two 16-element arrays.

void add_scalar(uint8_t *A, uint8_t *B, uint8_t *C){
    for(int i=0; i<16; i++){
        C[i] = A[i] + B[i];
    }
}

The NEON function, however, looks a lot more complicated.

void add_neon(uint8_t *A, uint8_t *B, uint8_t *C){
        //Set up a few vectors to hold our data
        uint8x16_t vectorA, vectorB, vectorC;

        //Load our data into the vectors' registers
        vectorA = vld1q_u8(A);
        vectorB = vld1q_u8(B);

        //Add A and B together
        vectorC = vaddq_u8(vectorA, vectorB);

        //Store the result back to memory
        vst1q_u8(C, vectorC);
}

Those strange-looking functions are NEON's intrinsics; they form an intermediate layer between assembly and C. They're a bit confusing, but ARM's Infocenter goes into some detail about them, and GCC has a great reference available. So what do they do? Well, "uint8x16_t" is a vector type containing an array of 16 8-bit uints, and "vld1q_u8" loads 8-bit uints into a vector. Finally, "vaddq_u8" adds two vectors of 8-bit uints together all at once and returns the result. If you test this out, you'll notice that the NEON function isn't really any faster. This is because the two load functions take up a lot of time, and we're doing so little work that the scalar solution catches up. If we could avoid the loads (by structuring our program to use vectors in the first place), we'd see a greater improvement.

Now let's take a look at another case where NEON can really shine: matrix multiplication. Specifically, 4×4 matrix multiplication, a common case in computer graphics.

// Our test matrices
        uint16_t matrixA[4][4] = {{ 1,  2,  3,  4},
                                  { 5,  6,  7,  8},
                                  { 9, 10, 11, 12},
                                  {13, 14, 15, 16}};

        uint16_t matrixB[4][4] = {{16, 15, 14, 13},
                                  {12, 11, 10,  9},
                                  { 8,  7,  6,  5},
                                  { 4,  3,  2,  1}};

        uint16_t matrixC[4][4];

Multiplying these together with a scalar function is fairly straightforward: we calculate the dot product of each row in matrixA with each column in matrixB. We can do this in a somewhat efficient manner using for loops:

        for(i=0; i<4; i++){ //For each row in A
                for(j=0; j<4; j++){ //And each column in B
                        dotproduct = 0; //Reset the running total
                        for(k=0; k<4; k++){ //For each item in that column
                                //Use a running total to calculate the dot product
                                dotproduct = dotproduct + A[i][k]*B[k][j];
                        }
                        C[i][j] = dotproduct; //Fill in C with our results
                }
        }

Now using NEON…

        //Load matrixB into four vectors
        uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4;

        vectorB1 = vld1_u16 (B[0]);
        vectorB2 = vld1_u16 (B[1]);
        vectorB3 = vld1_u16 (B[2]);
        vectorB4 = vld1_u16 (B[3]);

        //Temporary vectors to use with calculating the dotproduct
        uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4;

        // For each row in A...
        for (i=0; i<4; i++){
                //Multiply the rows in B by each value in A's row
                vectorT1 = vmul_n_u16(vectorB1, A[i][0]);
                vectorT2 = vmul_n_u16(vectorB2, A[i][1]);
                vectorT3 = vmul_n_u16(vectorB3, A[i][2]);
                vectorT4 = vmul_n_u16(vectorB4, A[i][3]);

                //Add them together
                vectorT1 = vadd_u16(vectorT1, vectorT2);
                vectorT1 = vadd_u16(vectorT1, vectorT3);
                vectorT1 = vadd_u16(vectorT1, vectorT4);

                //Store the resulting row into C
                vst1_u16 (C[i], vectorT1);
        }

That looks much more complicated, and in some ways it is. It’s also about three times as fast (including loads) on my test machine. Instead of stepping through each item in matrixA, I’m stepping through each row and calculating four dot products at a time. If we break down the matrix multiplication and look at the dot product calculations for the first row, you can hopefully see why this works:

C[0][0] = (1 * 16) + (2 * 12) + (3 * 8) + (4 * 4) (which is 80)
C[0][1] = (1 * 15) + (2 * 11) + (3 * 7) + (4 * 3) (70)
C[0][2] = (1 * 14) + (2 * 10) + (3 * 6) + (4 * 2) (60)
C[0][3] = (1 * 13) + (2 * 9) + (3 * 5) + (4 * 1) (50)

We’re multiplying each row of matrixB by a single value from matrixA at a time, something NEON can do easily using “vmul_n_X”. We hold this data in temporary vectors, add those vectors together with “vadd_X” (accumulating the result in vectorT1), then store our new row into matrixC using “vst1_X”. A very different approach from the scalar solution, but with the same results. My test program is attached to the blog post if you’d like to give it a try yourself.

2. Autovectorization

Interested in avoiding all this mess while still getting at least some of the benefits? Luckily there’s something called auto-vectorization, an automatic way to convert a scalar program into a vectorized one. Research into auto-vectorization is still ongoing (and probably will be for quite some time), but several implementations are already available. GCC’s is by far the most popular, and gcc 4.7(+), which includes support for auto-vectorization, is already included in Yellowdog Linux 7.

Enabling auto-vectorization in gcc is quite simple, but there are several tricks and hints you may need to give the compiler to get an optimal result. In gcc 4.7(+), auto-vectorization can be enabled by adding -O3 or -ftree-vectorize to the command line (or CFLAGS). If you’re planning to use NEON you’ll need to enable it with -mfpu=neon, although there are some caveats. GCC’s auto-vectorizer will ignore floats with NEON unless you enable -funsafe-math-optimizations. Unfortunately, using NEON instructions with floats can lead to a loss of precision, as NEON does not fully adhere to IEEE 754.

Auto-vectorization can, under the right circumstances, significantly speed up a program, but it’s imperfect. Luckily there are ways to structure your code, and hints you can give the compiler, that will make sure it behaves properly. In future posts I will be covering these tricks and tips, as well as going into detail about NEON assembly and how using it directly instead of intrinsics can give your app an even greater speed boost.

* Yellowdog Linux (YDL) is a Linux distribution developed by Fixstars Solutions. It is based on RHEL/Centos and used in Fixstars’ products. See http://www.ydl.net/ydl7/ for more details. A version of Yellowdog Linux optimized for ARM servers is currently in development.