News

[Tech blog] An introduction to SIMD vectorization and NEON

1. What is vectorization?

Vectorization is a form of parallel computing in which a single instruction operates on multiple data elements at the same time (SIMD: single instruction, multiple data). This is done by packing a set of elements together into a vector and operating on them all at once. It’s especially useful for multimedia workloads such as gaming, image processing, and video encoding/decoding. The process looks something like Figure A.

Fig A: SIMD Sample

So how do we do this in actual code? And how does it compare with a scalar, one-at-a-time approach? Let’s take a look. I’m going to write two implementations of the same addition function, one scalar and one vectorized using ARM’s NEON intrinsics, built with gcc 4.7.2-2 (on Yellowdog Linux for ARM*).

The scalar function is very simple: it’s just a for loop adding two 16-member arrays.

void add_scalar(uint8_t * A, uint8_t * B, uint8_t * C){
    for(int i=0; i<16; i++){
         C[i] = A[i] + B[i];
    }
}

The NEON function however looks a lot more complicated.

#include <arm_neon.h> //NEON intrinsics and vector types

void add_neon(uint8_t * A, uint8_t * B, uint8_t *C){
        //Setup a couple vectors to hold our data
        uint8x16_t vectorA, vectorB, vectorC;

        //Load our data into the vector's register
        vectorA = vld1q_u8(A);
        vectorB = vld1q_u8(B);

        //Add A and B together
        vectorC = vaddq_u8(vectorA, vectorB);

        //Store the result back into C
        vst1q_u8(C, vectorC);
}

Those strange-looking functions are NEON intrinsics, which form an intermediate layer between assembly and C. They’re a bit confusing at first, but ARM’s infocenter goes into some detail about them and GCC’s documentation includes a great reference. So what do they do? Well, “uint8x16_t” is a vector type containing sixteen 8-bit unsigned integers, and “vld1q_u8” loads sixteen 8-bit unsigned integers from memory into such a vector. Finally, “vaddq_u8” adds two of these vectors together lane by lane, all at once, and returns the result. If you test this out you’ll notice that the NEON function isn’t really any faster. This is because the two loads take up a lot of time, and we’re doing so little work on the data that the scalar solution catches up. If we could avoid them (by structuring our program to keep its data in vectors in the first place) we’d see a greater improvement.
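The function above only handles exactly one vector’s worth of data. As a rough sketch of how the same intrinsics might be applied to arrays of any length, the hypothetical helper below (add_u8_array is not part of the original test program) walks the input 16 bytes at a time and finishes any leftover elements with a scalar loop. It does not remove the loads discussed above; it only shows one way to organize the loop. The preprocessor check is an assumption added so the file also compiles on machines without NEON, where only the scalar loop runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: C[i] = A[i] + B[i] for i in [0, n).
 * uint8_t addition wraps modulo 256 in both the NEON and scalar paths. */
void add_u8_array(const uint8_t *A, const uint8_t *B, uint8_t *C, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    size_t i = 0;
#if HAVE_NEON
    /* Main loop: one 16-lane vector add per iteration. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(A + i);
        uint8x16_t vb = vld1q_u8(B + i);
        vst1q_u8(C + i, vaddq_u8(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when NEON is unavailable). */
    for (; i < n; i++) {
        C[i] = (uint8_t)(A[i] + B[i]);
    }
}
```

Calling add_u8_array(A, B, C, 16) should give the same result as the 16-element functions above, while lengths such as 37 exercise both the vector loop and the scalar tail.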

Now let’s take a look at another case where NEON can really shine: matrix multiplication. Specifically 4×4 matrix multiplication, a common case in computer graphics.

// Our test matrices
        uint16_t matrixA[4][4] = {{1,  2,  3,  4},
                                  {5,  6,  7,  8},
                                  {9,  10, 11, 12},
                                  {13, 14, 15, 16}};

        uint16_t matrixB[4][4] = {{16, 15, 14, 13},
                                  {12, 11, 10, 9},
                                  {8,  7,  6,  5},
                                  {4,  3,  2,  1}};

        uint16_t matrixC[4][4];

Multiplying these together with a scalar function is fairly straightforward: we calculate the dot product of each row in matrixA with each column in matrixB. We can do this in a somewhat efficient manner using for loops:

    //...
        for(i=0; i<4; i++){ //For each row in A
                for(j=0; j<4; j++){ //And each column in B
                        dotproduct=0;
                        for(k=0; k<4; k++){ //for each item in that column
                                dotproduct = dotproduct + A[i][k]*B[k][j];
                                //use a running total to calculate the dp.
                        }
                        C[i][j] = dotproduct; //fill in C with our results.
                }
        }
    //...

Now using NEON…

    //...
        //Load matrixB into four vectors
        uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4;

        vectorB1 = vld1_u16 (B[0]);
        vectorB2 = vld1_u16 (B[1]);
        vectorB3 = vld1_u16 (B[2]);
        vectorB4 = vld1_u16 (B[3]);

        //Temporary vectors to use with calculating the dotproduct
        uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4;

        // For each row in A...
        for (i=0; i<4; i++){
                //Multiply the rows in B by each value in A's row
                vectorT1 = vmul_n_u16(vectorB1, A[i][0]);
                vectorT2 = vmul_n_u16(vectorB2, A[i][1]);
                vectorT3 = vmul_n_u16(vectorB3, A[i][2]);
                vectorT4 = vmul_n_u16(vectorB4, A[i][3]);

                //Add them together
                vectorT1 = vadd_u16(vectorT1, vectorT2);
                vectorT1 = vadd_u16(vectorT1, vectorT3);
                vectorT1 = vadd_u16(vectorT1, vectorT4);

                //Output the dotproduct
                vst1_u16 (C[i], vectorT1);
        }
    //...

That looks much more complicated, and in some ways it is. It’s also about three times as fast (including loads) on my test machine. Instead of stepping through each item in matrixA, I’m stepping through each row and calculating four dot products (one whole row of the result) at a time. If we break down the matrix multiplication and look at the dot product calculations for the first row, you can hopefully see why this works:

C[0][0] = (1 * 16) + (2 * 12) + (3 * 8) + (4 * 4) (which is 80)
C[0][1] = (1 * 15) + (2 * 11) + (3 * 7) + (4 * 3) (70)
C[0][2] = (1 * 14) + (2 * 10) + (3 * 6) + (4 * 2) (60)
C[0][3] = (1 * 13) + (2 * 9) + (3 * 5) + (4 * 1) (50)
etc…

We’re multiplying each row of matrixB with a single value from matrixA at a time, something NEON can do easily using “vmul_n_X”. We hold this data in temp vectors, add those vectors together with “vadd_X” (accumulating the result in vectorT1), then store our new row into matrixC using “vst1_X”. A very different approach than the scalar solution, but with the same results. My test program is attached to the blog post if you’d like to give it a try yourself.
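To check that the row-at-a-time formulation really matches the triple loop, here is a small self-contained sketch. The function names matmul4_scalar and matmul4_rows are hypothetical and are not taken from the attached test program. On a NEON target the second function uses the same intrinsics as above; elsewhere it falls back to a plain loop over the four lanes, an assumption added so the comparison can be compiled anywhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Reference version: one dot product per output element. */
void matmul4_scalar(uint16_t A[4][4], uint16_t B[4][4], uint16_t C[4][4])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            uint16_t dotproduct = 0;
            for (int k = 0; k < 4; k++) {
                dotproduct = (uint16_t)(dotproduct + (uint32_t)A[i][k] * B[k][j]);
            }
            C[i][j] = dotproduct;
        }
    }
}

/* Row-at-a-time version: C[i] = sum over k of A[i][k] * (row k of B). */
void matmul4_rows(uint16_t A[4][4], uint16_t B[4][4], uint16_t C[4][4])
{
    for (int i = 0; i < 4; i++) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Scale each row of B by one value from A's row, then accumulate. */
        uint16x4_t acc = vmul_n_u16(vld1_u16(B[0]), A[i][0]);
        for (int k = 1; k < 4; k++) {
            acc = vadd_u16(acc, vmul_n_u16(vld1_u16(B[k]), A[i][k]));
        }
        vst1_u16(C[i], acc);
#else
        /* Portable stand-in: the inner loop plays the role of the 4 lanes. */
        uint16_t acc[4] = {0, 0, 0, 0};
        for (int k = 0; k < 4; k++) {
            for (int lane = 0; lane < 4; lane++) {
                acc[lane] = (uint16_t)(acc[lane] + (uint32_t)A[i][k] * B[k][lane]);
            }
        }
        for (int lane = 0; lane < 4; lane++) {
            C[i][lane] = acc[lane];
        }
#endif
    }
}
```

Running both functions on matrixA and matrixB should fill the first row of the result with 80, 70, 60, 50 in each case, matching the breakdown above.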

2. Autovectorization

Interested in avoiding all this mess and still getting at least some of the benefits? Luckily there’s something called auto-vectorization, an automatic way to convert a scalar program into a vectorized one. Research into auto-vectorization is still ongoing (and probably will be for quite some time), but there are several implementations available already. GCC is by far the most popular, and GCC 4.7(+), which includes support for auto-vectorization, is already included in Yellowdog Linux 7.

Enabling auto-vectorization in GCC is quite simple, but there are several tricks and hints you may need to give the compiler to get an optimal result. In GCC 4.7(+), auto-vectorization can be enabled by adding -O3 or -ftree-vectorize to the command line (or CFLAGS). If you’re planning to use NEON you’ll need to enable it with -mfpu=neon, although there are some issues. GCC’s auto-vectorizer will not use NEON for floating-point code unless you enable -funsafe-math-optimizations. This is because NEON does not fully adhere to IEEE 754, so using NEON instructions with floats can lead to a loss of precision.
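As an illustration, the loop below is the kind of simple, countable loop an auto-vectorizer can usually handle. This is a sketch under assumptions: the function name saxpy and the file name in the comment are made up for the example, and the restrict qualifiers are one possible hint, not something the text above prescribes. Whether the loop is actually vectorized depends on the compiler version and target, and some toolchains may need further target flags.

```c
#include <assert.h>
#include <stddef.h>

/* Example build line using the flags named in the text
 * (saxpy.c is a hypothetical file name):
 *   gcc -std=c99 -O3 -mfpu=neon -funsafe-math-optimizations -c saxpy.c
 */

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * restrict is an assumed hint: it promises the compiler that x and y
 * do not overlap, so iterations can safely be processed in groups. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The source stays plain scalar C; any vector instructions come entirely from the compiler, which is the appeal of this approach compared with hand-written intrinsics.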

Autovectorization can under the right circumstances significantly speed up a program, but it’s imperfect. Luckily there are ways to structure your code and hints you can give the compiler that will make sure it behaves properly. In future posts I will be covering these tricks and tips as well as going into detail about NEON assembly and how using it directly instead of intrinsics can give your app an even greater speed boost.

* Yellowdog Linux (YDL) is a Linux distribution developed by Fixstars Solutions. It is based on RHEL/CentOS and used in Fixstars’ products. See http://www.ydl.net/ydl7/ for more details. A version of Yellowdog Linux optimized for ARM servers is currently in development.

Geometric Performance Primitives, the world’s fastest multi-core geometry engine

Fixstars is pleased to announce the immediate release of its Geometric Performance Primitives library (GPP), the world’s fastest multi-core geometry engine.

With computational geometry tasks so important and currently established routines so inefficient, the GPP library stands to make dramatic improvements to the EDA and GIS communities. Early testing on parallelized GPU systems shows up to 25 times faster performance than reference CPUs. In addition to advanced parallelism, flexible hardware targeting, and support for boolean operations, snap rounding, and polygon relations, GPP places no artificial limits on data size, performs overlay analysis and geometry-on-geometry checks on all-angle geometry, and allows up to 53-bit coordinates.

GPP is available now. For more information, please visit our web site or contact us.

[Tech blog] IBM 7R2 Appliance and YellowDog Linux: High Performance Hardware With an Open Software Stack

In July, Fixstars announced the PowerLinux 7R2 YellowDog Appliance as a new open standard platform able to efficiently support basic infrastructure services, large transactional databases, BigData, and HPC workloads. IBM’s latest Power7 offerings are very exciting and with Fixstars’ offering, the high performance hardware is available with an open software stack just like the commodity servers it is replacing.

The 7R2 is a 2U, rack-mountable server that features two 3.5GHz, 8-core Power7 processors. Each core has four hardware threads, meaning an impressive 64 penguins (2 processors × 8 cores × 4 threads) are displayed on boot if you have CONFIG_LOGO enabled, not to mention the actual performance the processors are able to deliver. The default configuration features 32GB of memory, but up to 256GB is available. With six bays for SAS storage, plenty of fast local storage is available, and with six PCI Express slots (five 8x, one 4x) there is no shortage of expansion available for Fibre Channel, InfiniBand, or any other type of external connectivity you need.

IBM marketing materials claim that the PowerLinux 7R2 is capable of efficiently running thousands of tasks in parallel while achieving massive scale-out flexibility and exploiting extreme memory bandwidth. In practice this means using fewer servers for increased performance at lower costs (http://public.dhe.ibm.com/common/ssi/ecm/en/poc03088usen/POC03088USEN.PDF) and reducing the rate at which costs grow as IT capabilities expand (http://public.dhe.ibm.com/common/ssi/ecm/en/pol03112usen/POL03112USEN.PDF).

You may wonder how the 7R2 Appliance compares to conventional Intel servers. The Power7 processor is quite similar to the Xeon E5 and E7 families, with 256KB of L2 cache per core. While the Intel processors share up to 20MB of L3 cache, the Power7 has a dedicated 4MB of L3 cache per core, for a total of 32MB per processor. Comparing the 7R2 to other platforms, it fits into the same segment as the Dell PowerEdge R820 series, which as of October 2012 is priced between $13,000 and $18,000 depending on processor selection, with 32GB of memory and 2x146GB SAS drives. At $14,990 (or $19,990 with PowerVM Standard Edition) the 7R2 is competitively priced with high-end x86 servers, ending the complaint that the Power architecture is “too expensive” for adoption.

Fixstars is including the new Yellow Dog Linux 7 with the 7R2 appliance and just like previous versions, it is based on the industry standard Enterprise Linux. Yellow Dog 7 features a few tweaks for better performance on Power that developers will love. The biggest improvement is the inclusion of the GCC 4.7 compiler suite, something that was universally requested from Fixstars’ other project teams. GCC 4.7 features auto-vectorization that we’ve documented here (http://ydl.net/ydl7/support/AutoVectorizationDocs.shtml).

PowerVM Standard Edition supports almost any virtualization function needed, while Enterprise Edition, with awe-inspiring features such as Active Memory Sharing and Live Partition Mobility, is also available. Server consolidation is an obvious use for a machine like the 7R2. We’ve built out 7R2s running 160 virtual machines, and the performance of both database and web server workloads didn’t suffer; processor-intensive workloads, of course, suffered from such a small slice of CPU time. In more realistic consolidation tests, we found that a 7R2 configured with 16 LPARs had the best overall performance compared to the same configuration with either 1, 4, or 64 LPARs.

The Phoronix Test Suite demonstrates the YellowDog 7R2 Appliance’s capabilities using a variety of CPU and I/O intensive benchmarks. The 7R2 (configured with 64GB of memory and four 146GB SAS drives in a single Logical Volume) has an impressive crafty score (http://openbenchmarking.org/test/pts/crafty) of 111.78, runs the Parallel BZIP test (http://openbenchmarking.org/test/pts/compress-pbzip2) in only 3.18 seconds and also scores 47929 in the PHPBench test (http://openbenchmarking.org/test/pts/phpbench). Overall the results are fairly similar to the high end Intel Xeon offerings, with the 7R2 being faster in some tests and the Xeon being faster in others.

Over the coming months, we’ll be posting more about the 7R2 and how it integrates into different environments with Virtualization and BigData.

Fixstars M³ Platform meets ALTERA FPGA

Overview of Software Stack of this demo

SALT LAKE CITY, November 12, 2012 - Fixstars Solutions, Inc. is pleased to announce that its image processing demo running on ALTERA’s latest Stratix V FPGA board is being shown at Supercomputing 2012.

With a port of several image processing routines using ALTERA’s OpenCL tools, Fixstars demonstrates the power and flexibility of its flagship optimization platform, M³, modified to take full advantage of the new Stratix V FPGA board. Combined, Fixstars M³ and ALTERA’s OpenCL dramatically simplify development, maximizing parallelization and minimizing time to market in applications such as medical imaging, inspection devices, surveillance systems, and media encoder/decoder products.

The most exciting international conference for high performance computing, SC gathers the best and brightest minds in supercomputing. Please join Fixstars Solutions at the ALTERA booth, #430 near the main entrance of the Salt Palace Convention Center, to experience the iPad high performance computing demonstration with M³ and the Stratix V FPGA.

For more information, please visit our web site or contact us.

The OpenCL Programming Book released in Korea


The OpenCL Programming Book has been released in Korea!

The book is a result of the collaboration of our team of engineers in providing a practical and easy to use tool for OpenCL programming. It starts with the basics of parallelization, covering the main concepts, techniques, and setting up a development environment for OpenCL. It concludes with a clear and useful example of the FFT and Mersenne Twister algorithms written in OpenCL, walking you through the programming process and providing you with the source-code. It is the perfect resource for those wishing to get started on programming in OpenCL.

This book can be purchased in bookstores in Korea and online.

For more information, see the publisher’s website: http://www.hanb.co.kr/book/look.html?isbn=978-89-7914-946-3

Fixstars Achieves 50X Acceleration in Rendering Massive Particle CG Scene Using Violin Memory’s Violin 6616 Flash Memory Array

Solution Takes Advantage of Multi-Core Processing and Multi-Node Development Environments to Deliver Speed and Cost Improvements in Rendering Complex Images

SIGGRAPH 2012, Los Angeles, August 7, 2012 — Fixstars Solutions, the leader in multi-core software solutions, today announced it has achieved a 50X speed increase in rendering a massive particle computer graphics (CG) scene by utilizing Fixstars’ “lucille” global illumination renderer and the newest Violin 6616 Flash Memory Array from Violin Memory. The stunning results were accomplished through the combination of lucille’s superior parallel processing performance and the industry’s fastest I/O throughput provided by the Violin 6616.

Particle dynamics are an essential component of high quality CG production, enabling the execution of complex and realistically rendered natural phenomena, such as flames, explosions, smoke, splashes, mist, and hair. However, rendering particles typically requires large data sets and random storage access. In that case, the I/O speed of data storage represents a bottleneck for traditional storage solutions, which are typically composed of numerous hard disk drives (HDDs).

Violin 6000 series

Violin Memory’s Violin 6616 Flash Memory Array, which is a storage solution configured entirely from Toshiba NAND flash memory, enables faster data read-and-write times compared to traditional HDD storage systems, since it has no moving parts. Fixstars’ tests confirmed that the Violin 6616 yielded 50 times faster processing speeds than traditional HDD-based solutions for rendering large data sets. CG companies can replace traditional racks of multiple HDDs with Violin Flash Memory Arrays, which can yield significant savings in rendering times, computation costs, and storage footprint.

Fixstars’ lucille has enormous speed and compatibility advantages over other ray tracing renderers, since it is highly optimized to take advantage of the inherent advanced performance capabilities of multi-core processors and multi-node environments. The particle generation simulations made maximum use of 4 nodes with 32 cores (64 threads) to achieve this level of high-speed performance. Similarly, the Violin 6616 provides superior performance in multi-core environments, achieving low overhead and high data rates per CPU cycle. During the simulations, the Violin 6616 sustained lucille’s high computation speed, even when comparing single vs. 16 compute threads, confirming Fixstars’ view that the combination of lucille and Violin 6616 represents the perfect solution for rendering large data sets, such as dynamic particle simulations.

Don Basile, CEO of Violin Memory commented, “We are pleased to enable Fixstars set a new performance standard with their “lucille” global illumination renderer application by making the move to Violin Flash Memory Arrays that allow the full power of the application to be realized rather than being held back by legacy disk array technologies.”

“With Toshiba’s NAND flash memory chips powering all Violin Flash Memory Arrays, technology companies are able to take full advantage of the capabilities that solid state storage can deliver,” said George Bouchaya, vice president and chief technology officer at the Institute of Strategic Storage Planning and Investment of Toshiba America Electronic Components, Inc. “Twenty-five years after inventing NAND flash, Toshiba continues to drive flash innovation forward, and it’s exciting to see solutions like the Fixstars/Violin rendering system come to life.”

Scott Frankel, lucille Product Manager at Fixstars Corporation commented, “The Violin Memory’s Violin 6616 represents an enormous step forward in performance data storage. Working hand in hand with the parallel processing capabilities of Fixstars’ lucille global illumination renderer, these two innovative products open the door to new horizons in computer graphics rendering.”

Fixstars is demonstrating the high-speed particle rendering at SIGGRAPH 2012, August 7th to 9th in Los Angeles, with Violin Memory and Toshiba, who developed the NAND flash memory embedded in the Violin 6616. To see us at SIGGRAPH, please visit booth #761. If you have any inquiries or would like to make an appointment, please contact us.

Fixstars Releases M-Cubed® Platform for Accelerated Processing and Efficient Programming for Multi-Core, Multi-Node, Multi-Architecture Environments

UNNATURALLY FAST PROCESSING: OPTIMIZING APPLICATION PERFORMANCE

M³ logo

SUNNYVALE, CA. August 6, 2012 - Although parallel processing was viewed as the solution for high-volume computational applications, the volume of data and demand for real-time output are now posing new problems with complexity of code, portability, and energy consumption. Fixstars, a global leader in multi-core software development, offers a solution to this dilemma with the North American launch of M³ (M-Cubed). This new software development platform reduces development time and increases processing performance for multi-core, multi-node, multi-architecture environments.

Nagayoshi Kobayashi, ISV Enabling Manager, Intel Japan, commented on the release of M³: “The latest Intel processor has the capability to manage 20 threads in one chip. The M³ platform is a unique approach to improve the processing speed of software working on such many core devices. We are looking forward to future development of M³.”

While parallel processing is the answer to the needs of Big Data applications, parallel programming is significantly more difficult than sequential programming, and many applications fall short of taking full advantage of their hardware environments. Satoshi Miki, Chief Executive Officer of Fixstars, said, “With every release of new hardware to the market, the ability for software developers to quickly respond to and take advantage of hardware innovations has become a key competitive advantage.”

M³ Software Architecture

Since 2002, Fixstars has helped clients dramatically improve computing performance through software development and optimization for various hardware environments such as NVIDIA / ATI GPU, ARM SoC, and Intel / AMD x86. Fixstars has recognized the need for a platform to help developers focus on building software applications without needing to code for every specific hardware type. The result is M³, offering a highly efficient development process producing highly optimized and portable software applications.

The demand for efficient real-time data processing and Big Data management in fields such as medical image processing, bioinformatics, computer vision, financial model simulations, and computer generated (CG) rendering will continue to grow. Satoshi Miki remarked, “There have been advances in compiler technology, but none have ever been able to meet the needs of our clients. With the release of M-Cubed, I believe that the age of software performance being handcuffed by hardware will soon come to an end.”

Learn more about M³ at http://www.fixstars.com/en/m-cubed/

Relaunched Fixstars website

Today we launched a completely redesigned Fixstars website!

With the launch of this new website, information on new products and services has been added. We plan to announce all of them soon. Count on it!

The OpenCL Programming Book was revised to support OpenCL 1.2!

The revised edition of “The OpenCL Programming Book” is now available!

The revised edition includes a summary of changes made in OpenCL Specification 1.2, reference functions corresponding to 1.2, and updated execution environments.

Download a sample here!

Fixstars and Codeplay Team Up to Promote OpenCL Development in Japan

Software-acceleration and parallel processing leaders to deliver services supporting an open, portable multi-processor programming framework.

Tokyo, Japan, March 5, 2012 - Fixstars Corporation, the leading company in multi-core software solutions, and Codeplay Software Ltd., the expert in high-performance compilers and software optimization for graphics and multi-core processing, today announced a new business partnership to deliver Open Computing Language (OpenCL) software development and consulting services to Japanese firms and research laboratories.

OpenCL is a parallel computing framework for programming heterogeneous systems containing devices such as multi-core processors, Graphics Processing Units (GPUs), or Cell Broadband Engine (Cell/B.E™) processors. With its multi-dimensional computation domain, multi-kernel execution and standard language specification, OpenCL is attracting attention as an efficient and portable open technology for software development.

By partnering together, Fixstars and Codeplay will promote adoption of OpenCL among application developer communities as well as semiconductor designers. Using OpenCL-compatible software, middleware and compilers delivered through this partnership, customers will benefit from a high-performance, portable software foundation on which to create competitive new products.

“Fixstars has presence in the Japanese market, and great software expertise, that will help deliver our acceleration technologies to an even larger audience,” said Andrew Richards, CEO and founder of Codeplay. “OpenCL is set to become an extremely important framework enabling developers to achieve a significant performance edge in next-generation product designs.”

“OpenCL is a comprehensive, open framework for heterogeneous parallel processing, and is the result of deep cooperation between industry-leading companies in the OpenCL working group,” added Neil Trevett, vice president mobile content of NVIDIA and president of The Khronos Group, the industry consortium responsible for creating OpenCL and other open compute and graphics standards. “I am delighted to see Codeplay and Fixstars, who are both experienced working group members, combining their expertise to nurture the OpenCL developer community in Japan.”

“I am pleased to be partnering with a company of Codeplay’s standing to help Japan’s developers increase application performance and future-proof new designs using OpenCL.” said Satoshi Miki, CEO and founder of Fixstars. “As multi-core hardware continues to evolve quickly, challenging manufacturers to select a suitable hardware platform for the longer term, the portability of software will become increasingly important and valuable. Working together, I am sure that we will make OpenCL the chosen programming framework for Japanese companies seeking the best possible performance and flexibility.”


About Codeplay

Codeplay are global experts in advanced optimizing technologies, compilers and programmable graphics. The company has been providing acceleration solutions that optimize performance for graphics semiconductor designers and AAA game developers since 1999. In the semiconductor sector, Codeplay partners with leading chip manufacturers, such as AGEIA, Qualcomm and Movidius, helping them to exploit the full potential of their chipsets and accelerate time-to-market. Codeplay’s high performance C/C++ compiler technology enables GPU product managers to bring breakthrough new technologies to their graphics processors, significantly reducing development time, costs and risks. The company’s compiler testing technologies enable the rapid testing of new compilers and languages, such as OpenCL and shader languages. Codeplay’s expert compiler developers bring many years of experience to the toughest optimization and development projects. For more information, visit www.codeplay.com/.



©2012 Fixstars Corporation, All rights reserved.
