Fixstars and Mizuho Securities Succeed in Accelerating Derivative Valuation System 30-fold with Intel Xeon Phi Coprocessor

Mizuho Securities becomes the first financial institution to deploy the coprocessor in a production environment

TOKYO — June 2, 2014 — Fixstars Corporation today announced that Fixstars and Mizuho Securities Co., Ltd. achieved a 30-fold speed-up by porting Mizuho Securities’ derivative valuation system to Intel’s many-core processor, the Intel® Xeon Phi™ coprocessor. Mizuho Securities has become the world’s first financial institution to utilize the Xeon Phi™ coprocessor in a production environment.

The recent low interest rates have increased demand for structured bonds, which combine conventional bonds with derivatives. To meet this demand, Mizuho Securities had been investigating ways to efficiently perform the vast amount of computation required by its derivative valuation system, which led it to consider Intel’s high-performance Xeon Phi™ coprocessors.

Mizuho Securities, with the aid of its development partner Fixstars, devised an algorithm to efficiently distribute workloads across the cores of the Intel® Xeon Phi™ coprocessor. The Xeon Phi™ replaces an 8-core Xeon® system and outperforms it by a factor of 30. In addition, by placing an emphasis on source-code readability, a highly maintainable system was created.

“I am happy to announce that Fixstars’ expertise in parallel processing, in conjunction with Intel’s technology, has improved Mizuho Securities’ derivative valuation system,” said Kosuke Hirano, Senior Executive Officer, Intel K.K. “Intel will continue innovating with the highly parallel processing capability of Intel® Xeon Phi™ coprocessors.”

“I am very excited that our expertise in parallel programming techniques has enabled us to aid Mizuho Securities in the deployment of the new derivative valuation system,” said Satoshi Miki, CEO, Fixstars. “We were involved in the project from the initial research phase on what hardware to use, and I believe the smooth progress that led to this deployment could not have occurred without the mutual trust underlying the partnership between Mizuho Securities and Fixstars, as well as the high capability and reliability of Intel® Xeon Phi™ coprocessors.”

Mizuho Securities is already using the new system for plain-vanilla derivatives, with plans to extend it to exotic derivatives this summer.

Fixstars will continue to speed up Mizuho Securities’ business through the provision of technical expertise in parallel processing.

* Intel, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

Initial Public Offering

Fixstars Corporation, the parent company of Fixstars Solutions, Inc., announced the pricing of its initial public offering of 100,000 shares of Class A common stock. Fixstars’ Class A common stock will trade on the Tokyo Stock Exchange (Mothers) under the code “3687” from April 23, 2014.

See more detail at

[Tech Blog] Super Computing 2013

Supercomputing 2013 (SC13) has started in Denver, Colorado! The Top500 list has already been updated, and there are no big changes to the top 5. Tianhe-2, a supercomputer developed in China, retained its position as the world’s No. 1 system.


One important but often ignored statistic in supercomputing is power consumption. For example, Tianhe-2 reaches 33.86 petaflop/s on the Linpack benchmark, but it consumes 17,808 kW of power to do so! This is almost the entire output of a small thermal power station. The Green 500 ranks supercomputers by energy efficiency instead of raw processing power. In terms of MFLOPS/Watt, the TSUBAME 2.5 developed by the Tokyo Institute of Technology should claim the No. 1 spot. The latest Green 500 list will be available soon.
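To make the MFLOPS/Watt metric concrete, here is a quick back-of-the-envelope calculation using Tianhe-2’s numbers from above (a sketch only; the official Green 500 list applies its own measurement methodology):

```c
#include <stdio.h>

/* Convert a Linpack result in petaflop/s and a power draw in kW
 * into the Green 500's MFLOPS/Watt figure of merit. */
double mflops_per_watt(double petaflops, double kilowatts)
{
    double mflops = petaflops * 1e9; /* 1 petaflop/s = 10^9 MFLOPS */
    double watts  = kilowatts * 1e3;
    return mflops / watts;
}
```

Plugging in Tianhe-2’s figures, mflops_per_watt(33.86, 17808.0), gives roughly 1,900 MFLOPS/Watt.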

There are several techniques we can use to manage power consumption at the system level, but within each computing node we only have a few options. Heterogeneous setups, such as those using an NVIDIA/AMD GPU or Intel’s Xeon Phi, have become the norm, and we think FPGAs are another promising option. FPGAs offer pipeline parallelism, where several different tasks can be spawned in a push-pull configuration and each task has its own data supplied by the previous task, with or without host interaction. Several technical papers report that FPGAs outperform other chips in terms of both latency and energy efficiency. The standard FPGA languages (VHDL and Verilog-HDL) can be a complicated struggle for any software developer (like me!), but the chip vendor ALTERA has recently announced OpenCL support on their FPGA devices. This will make development faster and cause fewer headaches. Fixstars has been part of the early access program for their OpenCL SDK since last summer, and we provide OpenCL software and services to our clients.

Unfortunately, we don’t have a booth at SC13 this year, but we can be found at our partner Nallatech’s booth (#3519). We’ll even have coupons for copies of our OpenCL programming book.
Stop by and check out OpenCL on FPGA!

[Tech Blog] Computational Geometry Engine and Our Patented GPP Technology


Computational geometry engines are central to both emerging and ubiquitous technologies. From online mapping to digital chip manufacturing, computational geometry algorithms reduce enormous sets of data into visualizable results that enable engineers and casual users to make informed decisions.

The plane sweep algorithm, although widely used in computational geometry, does not parallelize efficiently, rendering it incapable of benefiting from recent trends in multicore CPUs and general-purpose GPUs. Instead of the plane sweep, some researchers have proposed a uniform grid as a foundation for parallel algorithms of computational geometry. However, long-standing robustness and performance issues have deterred its wider adoption, at least in the case of overlay analysis. To remedy this, we have developed previously unknown methods to perform snap rounding and to efficiently compute the winding number of overlay faces on a uniform grid. We have implemented them as part of an extensible geometry engine to perform polygon overlay with OpenMP on CPUs and CUDA on GPUs. As previously announced, we have released a software product, called “Geometric Performance Primitives (GPP),” based on this implementation. The overall algorithm works on any polygon configuration, whether degenerate, overlapping, self-overlapping, disjoint, or with holes. With typical data, it features time and space complexities of O(N + K), where N is the number of edges and K the number of intersections. Its single-threaded performance not only rivals that of the plane sweep; it also achieves a parallel efficiency of 0.9 on our quad-core CPU, with an additional speedup factor of over 4 on our GPU. These performance results should extrapolate to distributed computing and other geometric operations.
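GPP’s grid-based method is described in the paper; as background for readers unfamiliar with winding numbers, here is the textbook single-point winding-number computation in plain C. This is illustrative only and is not GPP’s overlay algorithm, which computes winding numbers for whole faces on a uniform grid:

```c
#include <stddef.h>

typedef struct { double x, y; } Point;

/* is_left: > 0 if p2 lies to the left of the directed line p0 -> p1,
 * < 0 if to the right, 0 if collinear. */
static double is_left(Point p0, Point p1, Point p2)
{
    return (p1.x - p0.x) * (p2.y - p0.y) - (p2.x - p0.x) * (p1.y - p0.y);
}

/* Winding number of point p with respect to the polygon v[0..n-1]
 * (closed implicitly by the edge from v[n-1] back to v[0]). */
int winding_number(Point p, const Point *v, size_t n)
{
    int wn = 0;
    for (size_t i = 0; i < n; i++) {
        Point a = v[i], b = v[(i + 1) % n];
        if (a.y <= p.y) {
            if (b.y > p.y && is_left(a, b, p) > 0)
                wn++;   /* upward crossing with p strictly left */
        } else {
            if (b.y <= p.y && is_left(a, b, p) < 0)
                wn--;   /* downward crossing with p strictly right */
        }
    }
    return wn;
}
```

A nonzero winding number means the point is inside under the nonzero fill rule, which is how faces of an overlay are classified as belonging to one input polygon, the other, or both.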

To obtain baseline performance, we compared GPP against equivalent functionality found in the following commonly used GIS software tools: ArcGIS 10.1 for Desktop, GRASS 6.4.2, OpenJUMP 1.6.3 (with JTS 1.13 as backend), and QGIS 1.8.0 (with GEOS 3.3.2 as backend), whose algorithms execute on a single thread only and may or may not exploit the plane sweep. Nevertheless, note that their rich set of features may create additional overhead not yet present in GPP. With each of them, we applied a dissolve (merge) operation and a geometric intersection (boolean AND) to two pairs of layers taken from real-world datasets and measured the resulting execution performance. GPP’s overall performance fared well against the existing solutions, reaching an average speedup of 16× on the CPU alone against ArcGIS, the fastest software application currently available that we found.


Performance of ArcGIS versus GPP on real-world datasets.

At this year’s ACM SIGSPATIAL, our technical paper was selected out of over 200 very competitive entries. Please refer to the paper for more technical details. The SIGSPATIAL section of the ACM covers a broad range of innovative designs and implementations, with special attention paid to emerging trends in geospatial information systems (GIS).

So, what’s our next step? GPP has already gained some success in the Electronic Design Automation (EDA) industry. Our next target is object-relational database engines. Computational geometry processing is a necessary component of spatially aware systems, including database management systems (DBMS) such as Oracle Spatial and Graph, and PostGIS for PostgreSQL. For database management systems, spatially aware extensions provide native support for storage and retrieval of geometric objects that are commonly quite large and complex. This turns a standard DBMS into a location-aware engine that can drive business and technology applications by rendering complex, layered, and/or geographical data in an intuitive, visualizable form. In addition to generalized spatial queries for their databases, Oracle’s solution, for example, provides explicit support for geo-referenced images, topologies, and 3D data types, including triangulated irregular networks (TINs) and point clouds with native support for LIDAR data. PostGIS, on the other hand, provides native support and algorithms for ESRI shapefiles, raster map algebra, and spatial reprojection. When combined with GPP’s highly parallelized computational geometry engine, these functions can process the large, multi-layered data sets necessary to perform automated emergency route generation, natural resource management, insurance analysis, petroleum exploration, and financial queries ten times faster or more than conventional algorithms.

We are excited to attend SIGSPATIAL 2013 in November and look forward to discussing the opportunities for geospatial processing that GPP provides with developers and researchers from around the world. See you in Florida!

Fixstars to Launch “FlashAir™ Developers”, a Technical Information Website Supporting App Development Using the Toshiba FlashAir SDHC Memory Card.

Supporting App Development for SDHC Memory Cards with Embedded Wireless LAN Functionality.

FlashAir Developers Website


SUNNYVALE, Calif. — August 1st, 2013 — Fixstars today announced the launch of a technical information website for developers of Toshiba FlashAir apps and services: “FlashAir Developers”.

FlashAir is an SDHC memory card with embedded wireless LAN functionality. Since FlashAir works as a standalone wireless LAN access point, FlashAir-embedded devices are able to both send files to and receive files from each other. In addition to wireless networking, FlashAir also retains normal SD card functionality, storing data and allowing typical Wi-Fi devices such as PCs and smartphones to access its contents.

With FlashAir, users can wirelessly browse the photos in a digital camera from a smartphone, as well as download their favorite photos to any networked device, such as a PC, tablet, or smartphone.


How to use the FlashAir

The FlashAir Developers website provides API information for browsing files on FlashAir cards from typical Wi-Fi devices like PCs and smartphones, as well as app development tutorials, sample code, and support information with an FAQ and forum. By using the APIs from the FlashAir Developers website, app developers can build apps that work with FlashAir free of charge.

“I am very pleased to provide the FlashAir API to all potential app developers through the Fixstars Solutions FlashAir Developers website,” said Hiroto Nakai, Senior Manager, Flash Business Strategy Development, Memory Division, Toshiba Corporation Semiconductor & Storage Products Company. “On one hand, FlashAir is essentially an SDHC memory card with wireless LAN. However, it’s ideally suited to so much more than just digital cameras. I am expecting imaginative solutions to come from the developers who use the FlashAir Developers website.”

“I am very excited to support app developers through the FlashAir Developers website,” said Satoshi Miki, CEO, Fixstars Corporation. “We have been operating developer websites for various platforms and providing valuable support to a large number of software developers. Utilizing this resource, we hope to nurture FlashAir to become a platform for advanced devices and services.”

The FlashAir Developers website is operated by Fixstars in cooperation with Toshiba Corporation.

Additional Resources

The Toshiba FlashAir Product Information:

About Fixstars

Fixstars is a software company devoted to “Speed up your Business”. Through its software parallelization/optimization expertise, its highly effective use of multi-core processors, and application acceleration for the next generation of memory technology that delivers high-speed I/O as well as power savings, Fixstars provides “Green IT” while accelerating customers’ businesses in various fields. Learn more about how Fixstars can accelerate your business in medical imaging, manufacturing, finance, and media & entertainment.

For more information, visit

Follow Us Online

[Tech blog] An introduction to SIMD vectorization and NEON

1. What is vectorization?

Vectorization is a form of parallel computing in which a single instruction operates on multiple data elements at the same time (SIMD: Single Instruction, Multiple Data). This is done by packing a set of elements together into an array (a vector) and operating on them all at once. It’s especially useful for multimedia purposes such as gaming, image processing, video encoding/decoding, and much more. The process looks something like Figure A.

Fig A: SIMD Sample

Fig A: SIMD Sample

So how do we do this in actual code? And how does it compare with a scalar, one-at-a-time approach? Let’s take a look. I’m going to write two implementations of the same addition function: one scalar, and one vectorized using ARM’s NEON intrinsics and gcc 4.7.2-2 (on Yellowdog Linux for ARM*).




The scalar function is very simple: it’s just a for loop adding two 16-element arrays.

void add_scalar(uint8_t * A, uint8_t * B, uint8_t * C){
    for(int i=0; i<16; i++){
        C[i] = A[i] + B[i];
    }
}

The NEON function, however, looks a lot more complicated.

void add_neon(uint8_t * A, uint8_t * B, uint8_t * C){
        //Set up a couple of vectors to hold our data
        uint8x16_t vectorA, vectorB, vectorC;

        //Load our data into the vectors' registers
        vectorA = vld1q_u8(A);
        vectorB = vld1q_u8(B);

        //Add A and B together
        vectorC = vaddq_u8(vectorA, vectorB);

        //Store the result back into C
        vst1q_u8(C, vectorC);
}

Those strange-looking functions are NEON’s intrinsics; they form an intermediate layer between assembly and C. They’re a bit confusing, but ARM’s infocenter goes into some detail about them and GCC has a great reference available here. So what do they do? Well, “uint8x16_t” is a vector type containing sixteen 8-bit uints in an array, and “vld1q_u8” loads 8-bit uints into a vector. Finally, “vaddq_u8” adds two vectors made of 8-bit uints together all at once and returns the result. If you test this out you’ll notice that the NEON function isn’t really any faster. This is because those two load functions take up a lot of time, and we’re doing so little work that the scalar solution catches up. If we could avoid them (by structuring our program to use vectors in the first place) we’d see a greater improvement.

Now let’s take a look at another case where NEON can really shine: matrix multiplication. Specifically, 4×4 matrix multiplication, a common case in computer graphics.

// Our test matrices
        uint16_t matrixA[4][4] = {{ 1,  2,  3,  4},
                                  { 5,  6,  7,  8},
                                  { 9, 10, 11, 12},
                                  {13, 14, 15, 16}};

        uint16_t matrixB[4][4] = {{16, 15, 14, 13},
                                  {12, 11, 10,  9},
                                  { 8,  7,  6,  5},
                                  { 4,  3,  2,  1}};

        uint16_t matrixC[4][4];

Multiplying these together with a scalar function is fairly straightforward: we calculate the dot product of each row in matrixA with each column in matrixB. We can do this in a somewhat efficient manner using for loops:

        for(i=0; i<4; i++){ //For each row in A
                for(j=0; j<4; j++){ //And each column in B
                        dotproduct = 0;
                        for(k=0; k<4; k++){ //For each item in that column
                                dotproduct = dotproduct + A[i][k]*B[k][j];
                                //use a running total to calculate the dp.
                        }
                        C[i][j] = dotproduct; //fill in C with our results.
                }
        }

Now using NEON…

        //Load matrixB into four vectors
        uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4;

        vectorB1 = vld1_u16 (B[0]);
        vectorB2 = vld1_u16 (B[1]);
        vectorB3 = vld1_u16 (B[2]);
        vectorB4 = vld1_u16 (B[3]);

        //Temporary vectors to use with calculating the dotproduct
        uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4;

        // For each row in A...
        for (i=0; i<4; i++){
                //Multiply the rows in B by each value in A's row
                vectorT1 = vmul_n_u16(vectorB1, A[i][0]);
                vectorT2 = vmul_n_u16(vectorB2, A[i][1]);
                vectorT3 = vmul_n_u16(vectorB3, A[i][2]);
                vectorT4 = vmul_n_u16(vectorB4, A[i][3]);

                //Add them together
                vectorT1 = vadd_u16(vectorT1, vectorT2);
                vectorT1 = vadd_u16(vectorT1, vectorT3);
                vectorT1 = vadd_u16(vectorT1, vectorT4);

                //Output the dotproduct
                vst1_u16 (C[i], vectorT1);
        }

That looks much more complicated, and in some ways it is. It’s also about three times as fast (including loads) on my test machine. Instead of stepping through each item in matrixA I’m stepping through each row, and calculating the dot product for four of them at a time. If we break down the matrix multiplication and look at the dotproduct calculations for the first row, you can hopefully see why this works:

C[0][0] = (1 * 16) + (2 * 12) + (3 * 8) + (4 * 4) (which is 80)
C[0][1] = (1 * 15) + (2 * 11) + (3 * 7) + (4 * 3) (70)
C[0][2] = (1 * 14) + (2 * 10) + (3 * 6) + (4 * 2) (60)
C[0][3] = (1 * 13) + (2 * 9) + (3 * 5) + (4 * 1) (50)

We’re multiplying each row of matrixB by a single value from matrixA at a time, something NEON can do easily using “vmul_n_X”. We hold this data in temp vectors, add those vectors together with “vadd_X” (accumulating the result in vectorT1), then unload our new row into matrixC using “vst1_X”. A very different approach than the scalar solution, but with the same results. My test program is attached to the blog post if you’d like to give it a try yourself.
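If you don’t have an ARM machine handy, you can still check the dot-product breakdown above with a plain-C reference, equivalent to the scalar version (the function name here is mine, for illustration):

```c
#include <stdint.h>

/* Scalar 4x4 matrix multiply: C = A x B (reference implementation,
 * the same triple loop as the scalar version in the post). */
void matmul4x4(uint16_t A[4][4], uint16_t B[4][4], uint16_t C[4][4])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            uint16_t dotproduct = 0;
            for (int k = 0; k < 4; k++)
                dotproduct += A[i][k] * B[k][j];
            C[i][j] = dotproduct;
        }
    }
}
```

Running it on the test matrices reproduces the first row computed above: 80, 70, 60, 50.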

2. Autovectorization

Interested in avoiding all this mess while still getting at least some of the benefits? Luckily there’s something called auto-vectorization, an automatic way to convert a scalar program into a vectorized one. Research into auto-vectorization is still ongoing (and probably will be for quite some time), but there are several implementations available already. GCC is by far the most popular, and GCC 4.7(+), which includes support for auto-vectorization, is already included in Yellowdog Linux 7.

Enabling auto-vectorization in GCC is quite simple, but there are several tricks and hints you may need to give the compiler to get an optimal result. In GCC 4.7(+), auto-vectorization can be enabled by adding -O3 or -ftree-vectorize to the command line (or CFLAGS). If you’re planning to use NEON you’ll need to enable it with -mfpu=neon, although there are some issues. GCC’s auto-vectorization will ignore floats with NEON unless you enable -funsafe-math-optimizations. Unfortunately, using NEON instructions with floats can lead to a loss of precision, as NEON does not adhere to IEEE 754.
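To give the vectorizer its best shot, keep loops in a simple, countable form with non-overlapping arrays. Here is a minimal sketch of the kind of loop GCC 4.7 can auto-vectorize (the function name is mine; the flags in the comment are one plausible build line for an ARM target):

```c
#include <stddef.h>
#include <stdint.h>

/* A vectorizer-friendly loop: unit stride, a simple trip count, no
 * data-dependent branches, and 'restrict' to promise the compiler
 * that the arrays don't overlap.
 * Example build: gcc -O3 -mfpu=neon -ftree-vectorizer-verbose=1 ... */
void add_scaled(uint16_t *restrict c, const uint16_t *restrict a,
                const uint16_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + 2 * b[i];
}
```

Adding -ftree-vectorizer-verbose=1 makes GCC report which loops it vectorized and why it skipped the others.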

Auto-vectorization can, under the right circumstances, significantly speed up a program, but it’s imperfect. Luckily, there are ways to structure your code and hints you can give the compiler to make sure it behaves properly. In future posts I will cover these tricks and tips, as well as going into detail about NEON assembly and how using it directly instead of intrinsics can give your app an even greater speed boost.

* Yellowdog Linux (YDL) is a Linux distribution developed by Fixstars Solutions. It is based on RHEL/CentOS and used in Fixstars’ products. A version of Yellowdog Linux optimized for ARM servers is currently in development.

Geometric Performance Primitives, the world’s fastest multi-core geometry engine

Fixstars is pleased to announce the immediate release of its Geometric Performance Primitives library (GPP), the world’s fastest multi-core geometry engine.

With computational geometry tasks so important and currently established routines so inefficient, the GPP library stands to make dramatic improvements to the EDA and GIS communities. Early testing on parallelized GPU systems shows up to 25 times faster performance than reference CPUs. In addition to advanced parallelism, flexible hardware targeting, and support for boolean operations, snap rounding, and polygon relations, GPP places no artificial limits on data size, performs overlay analysis and geometry-on-geometry checks on all-angle geometry, and allows up to 53-bit coordinates.

GPP is available now. For more information, please visit our web site or contact us.

[Tech blog] IBM 7R2 Appliance and YellowDog Linux: High Performance Hardware With an Open Software Stack

In July, Fixstars announced the PowerLinux 7R2 YellowDog Appliance as a new open standard platform able to efficiently support basic infrastructure services, large transactional databases, BigData, and HPC workloads. IBM’s latest Power7 systems are very exciting, and with Fixstars’ offering, this high-performance hardware is available with an open software stack, just like the commodity servers it replaces.

The 7R2 is a 2U, rack-mountable server that features two 3.5GHz, 8-core Power7 processors. Each core has four threading units, meaning an impressive 64 penguins are displayed on boot if you have CONFIG_LOGO enabled, not to mention the actual performance the processors are able to deliver. The default configuration features 32GB of memory, but up to 256GB is available. With six bays for SAS storage, plenty of fast local storage is available, and with six PCI Express slots (five 8x, one 4x) there is no shortage of expansion available for Fibre Channel, InfiniBand, or any other type of external connectivity you need.

IBM marketing materials claim that the PowerLinux 7R2 is capable of efficiently running thousands of tasks in parallel while achieving massive scale-out flexibility and exploiting extreme memory bandwidth. In practice, this means using fewer servers for increased performance at lower costs, and reducing the rate at which costs grow as IT capabilities expand.

You may wonder how the 7R2 Appliance compares to conventional Intel servers. The Power7 processor is quite similar to the Xeon E5 and E7 families, with 256KB of L2 cache per core. While the Intel processors share up to 20MB of L3 cache, the Power7 has a dedicated 4MB of L3 cache per core for a total of 32MB. Comparing the 7R2 to other platforms, it fits into the same segment as the Dell PowerEdge R820 series, which as of October 2012 is priced between $13,000 and $18,000 depending on processor selection, with 32GB of memory and 2x146GB SAS drives. At $14,990 (or $19,990 with PowerVM Standard Edition), the 7R2 is competitively priced against high-end x86 servers, ending the complaint that the Power architecture is “too expensive” for adoption.

Fixstars is including the new Yellow Dog Linux 7 with the 7R2 appliance, and just like previous versions, it is based on the industry-standard Enterprise Linux. Yellow Dog 7 features a few tweaks for better performance on Power that developers will love. The biggest improvement is the inclusion of the GCC 4.7 compiler suite, something that was universally requested by Fixstars’ other project teams. GCC 4.7 features the auto-vectorization support that we’ve documented here.

PowerVM Standard Edition supports almost any virtualization function needed, while Enterprise Edition, with awe-inspiring features such as Active Memory Sharing and Live Partition Mobility, is also available. Server consolidation is an obvious use for a machine like the 7R2, and we’ve built out 7R2s with 160 running virtual machines; the performance of both database and web server workloads didn’t suffer, although processor-intensive workloads of course suffered from such a small slice of CPU time. In more realistic consolidation tests, we found that a 7R2 configured with 16 LPARs had the best overall performance compared to the same configuration with 1, 4, or 64 LPARs.

The Phoronix Test Suite demonstrates the YellowDog 7R2 Appliance’s capabilities using a variety of CPU- and I/O-intensive benchmarks. The 7R2 (configured with 64GB of memory and four 146GB SAS drives in a single logical volume) has an impressive Crafty score of 111.78, runs the Parallel BZIP2 test in only 3.18 seconds, and scores 47,929 in the PHPBench test. Overall, the results are fairly similar to the high-end Intel Xeon offerings, with the 7R2 being faster in some tests and the Xeon being faster in others.

Over the coming months, we’ll be posting more about the 7R2 and how it integrates into different environments with Virtualization and BigData.

Fixstars M³ Platform meets ALTERA FPGA

Overview of Software Stack of this demo
SALT LAKE CITY — November 12, 2012 — Fixstars Solutions, Inc. is pleased to announce that its image processing demo running on ALTERA’s latest Stratix V FPGA board is being shown at Supercomputing 2012 (SC12).

With a port of several image processing routines using ALTERA’s OpenCL tools, Fixstars demonstrates the power and flexibility of its flagship optimization platform, M³, modified to take full advantage of the new Stratix V FPGA board. Combined, Fixstars M³ and ALTERA’s OpenCL dramatically simplify development, maximizing parallelization and minimizing time to market in applications such as medical imaging, inspection devices, surveillance systems, and media encoder/decoder products.

The most exciting international conference for high performance computing, SC gathers the best and brightest minds in supercomputing. Please join Fixstars Solutions at the ALTERA booth (#430, near the main entrance of the Salt Palace Convention Center) to experience the iPad high performance computing demonstration with M³ and the Stratix V FPGA.

For more information, please visit our web site or contact us.

The OpenCL Programming Book released in Korea

The OpenCL Programming Book has been released in Korea!

The book is the result of a collaboration among our team of engineers to provide a practical and easy-to-use guide to OpenCL programming. It starts with the basics of parallelization, covering the main concepts and techniques and how to set up a development environment for OpenCL. It concludes with clear and useful examples of the FFT and Mersenne Twister algorithms written in OpenCL, walking you through the programming process and providing you with the source code. It is the perfect resource for those wishing to get started with OpenCL programming.

This book can be purchased in bookstores in Korea and online.

For more information, see the publisher’s website: