1.2 Parallel Computing (Hardware)

First of all, what exactly is "parallel computing"? Wikipedia defines it as "a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ('in parallel')" [1].

Many different hardware architectures exist today to perform a single task using multiple processors. Some examples, in order of decreasing scale, are:

Grid computing
a combination of computer resources from multiple administrative domains applied to a common task.

MPP (Massively Parallel Processor) systems
known as the supercomputer architecture.

Cluster server system
a network of general-purpose computers.

SMP (Symmetric Multiprocessing) system
identical processors (typically in powers of 2) connected to act as a single unit.

Multi-core processor
a single chip with numerous computing cores.

Flynn's Taxonomy

Flynn's Taxonomy is a classification of computer architectures proposed by Michael J. Flynn [2]. It is based on the concurrency of instruction and data streams available in the architecture. An instruction stream is the set of instructions that makes up a process, and a data stream is the set of data to be processed.

1. Single Instruction, Single Data stream (SISD)

An SISD system is a sequential system in which one instruction stream processes one data stream. Pre-2004 PCs were this type of system.

2. Single Instruction, Multiple Data streams (SIMD)

One instruction is broadcast across many compute units, each of which processes that instruction on different data. The vector processor, a type of supercomputer, is an example of this architecture. More recently, various microprocessors have come to include SIMD capabilities; for example, the SSE instructions on Intel CPUs and the SPEs on the Cell Broadband Engine execute SIMD instructions.
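
To make this concrete, the following is a minimal sketch of SIMD programming in C using Intel's SSE intrinsics (the array contents are arbitrary, chosen only for illustration). A single _mm_add_ps instruction adds four pairs of floats at once: one instruction, multiple data elements.

/* Minimal SIMD sketch using SSE intrinsics. */
#include <stdio.h>
#include <xmmintrin.h> /* SSE intrinsics */

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load 4 floats into a vector register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* 4 additions in a single instruction */
    _mm_storeu_ps(c, vc);

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}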

3. Multiple Instruction, Single Data stream (MISD)

Multiple instruction streams process a single data stream. Very few systems fit this category, one exception being fault-tolerant systems.

4. Multiple Instruction, Multiple Data streams (MIMD)

Multiple processing units each execute their own instruction stream on their own data stream.

Using this classification scheme, most parallel computing hardware architectures, such as the SMP and cluster systems, fall within the MIMD category. For this reason, the MIMD architecture is further categorized by memory types.

The two main memory types used in parallel computing systems are shared memory and distributed memory. In a shared memory system, every CPU that makes up the system can access the same memory space. In a distributed memory system, each CPU that makes up the system uses its own memory space.

The memory type determines how data is accessed. If each CPU is running a process, a shared memory system allows the two processes to communicate by reading and writing the shared memory space. A distributed memory system, on the other hand, requires the user to perform data transfers explicitly, since the two memory spaces are managed by two operating systems.
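
As an illustration of such explicit transfers, the sketch below uses MPI, the de facto standard message-passing library for distributed memory systems (assuming an MPI implementation such as mpicc/mpirun is available). Process 0 must explicitly send the value to process 1, which cannot simply read it, because each process owns a separate memory space.

/* Explicit data transfer on a distributed memory system.
 * Run with at least two processes, e.g.: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;  /* this data exists only in rank 0's memory space */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}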

The next sections explore these two types of parallel systems in detail.

Distributed Memory Type

Tasks that take too long on one computer can be broken up and performed in parallel using a network of processors. Such a network is known as a cluster server system, which is perhaps the most commonly seen distributed memory type system. This type of computing has been done for years in the HPC (High Performance Computing) field, which performs tasks such as large-scale simulation.

The MPP (Massively Parallel Processor) system is another commonly seen distributed memory type system. It connects numerous nodes, each made up of a CPU, memory, and a network port, via a specialized high-speed network. NEC's Earth Simulator and IBM's Blue Gene are well-known MPP systems.

The main difference between a cluster system and an MPP system is that a cluster does not use specialized hardware, giving it a much better cost-performance ratio than an MPP system. For this reason, many MPP systems, once the leading supercomputer type, have been replaced by cluster systems. According to the TOP500 Supercomputer Sites [3], of the top 500 supercomputers as of June 2009, 17.6% are MPP systems, while 82% are cluster systems.

One problem with cluster systems is the slow data transfer rate between processors, since these transfers occur over an external network. Recent interconnects such as Myrinet, InfiniBand, and 10 Gbit Ethernet are significantly faster than traditional Gigabit Ethernet, but even with these networks, transfer rates are still at least an order of magnitude slower than each processor's local memory access.

For the reason given above, cluster systems are suited for parallel algorithms in which the CPUs do not have to communicate with each other very often. These algorithms are said to be "coarse-grained parallel." They are often used in simulations that require many trials, where the trials have no dependency on one another. An example is the risk simulation used in derivative product development in the finance field.
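
The sketch below shows this coarse-grained pattern with a Monte Carlo estimate of pi, standing in for the risk simulations mentioned above (the trial count is arbitrary). Each MPI process runs its trials entirely independently and communicates only once, at the very end, which is exactly the communication profile that suits a cluster.

/* Coarse-grained parallelism: independent trials, one final reduction.
 * Run with e.g.: mpirun -np 4 ./a.out */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long trials = 1000000;  /* trials per process */
    long local_hits = 0, total_hits = 0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    srand(rank + 1);  /* each process works alone: no communication here */
    for (long i = 0; i < trials; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)
            local_hits++;
    }

    /* the only communication in the whole program */
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * total_hits / (trials * (double)size));

    MPI_Finalize();
    return 0;
}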

Shared Memory Type

In shared memory type systems, all processors share the same address space, allowing them to communicate with each other through reads and writes to shared memory. Since explicit data transfers and collections are unnecessary, the result is a much simpler system from the software perspective.
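
The sketch below shows this style of communication using POSIX threads in C (compile with -pthread; the variable and value are arbitrary). Because both threads see the same address space, the worker simply writes its result where the main thread can read it, with no explicit transfer.

/* Shared memory communication between threads. */
#include <stdio.h>
#include <pthread.h>

int shared_result;  /* one address space: visible to every thread */

void *worker(void *arg)
{
    shared_result = 42;  /* a plain write to shared memory */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(&t, NULL);  /* join also synchronizes the memory view */
    printf("main thread read %d\n", shared_result);
    return 0;
}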

An example of a shared memory type system is the Symmetric Multiprocessing (SMP) system (Figure 1.1, left). The Intel Multiprocessor Specification Version 1.0, released back in 1994, describes the method for using x86 processors in a multi-processor configuration, and 2-way workstations (workstations in which up to 2 CPUs can be installed) are commonly seen today [4]. However, increasing the number of processors naturally increases the number of accesses to memory, which makes the bandwidth between the processors and the shared memory a bottleneck. SMP systems are thus not scalable and are only effective up to a certain number of processors. While 2-way servers are inexpensive and common, 32-way or 64-way SMP servers require specialized hardware, which can become expensive.

Figure 1.1: SMP and NUMA

Another example of a shared memory type system is the Non-Uniform Memory Access (NUMA) system. The main difference from an SMP system is that the physical distance between the processor and the memory changes the access speed. By prioritizing the use of physically closer memory (local memory) over more distant memory (remote memory), the bottleneck seen in SMP systems can be reduced. To lower the cost of accessing remote memory, a processor cache and specialized hardware to keep the cached data coherent were added; this design is known as Cache Coherent NUMA (cc-NUMA).

Server CPUs such as the AMD Opteron and the Intel Xeon 5500 series contain a memory controller within the chip, so when these are used in a multi-processor configuration, the result is a NUMA system. The hardware that verifies cache coherency is embedded into the CPU. These designs also do away with the Front Side Bus (FSB), the bus that traditionally connected multiple CPUs and chipsets, replacing it with an interconnect port that uses a point-to-point protocol. These ports are called QuickPath Interconnect (QPI) by Intel and HyperTransport by AMD.
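
On Linux, memory placement on a NUMA system can also be controlled explicitly. The sketch below uses libnuma, assumed to be installed (compile with -lnuma); the buffer size and node number are arbitrary. It allocates memory on a specific node, so threads running on that node's CPUs get fast local access rather than slow remote access.

/* NUMA-aware allocation sketch using libnuma. */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    printf("highest NUMA node: %d\n", numa_max_node());

    /* allocate 1 MB placed on node 0 (local to node 0's CPUs) */
    size_t size = 1 << 20;
    void *buf = numa_alloc_onnode(size, 0);
    if (buf != NULL)
        numa_free(buf, size);
    return 0;
}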

Now that the basic concepts of SMP and NUMA have been covered, a look at typical x86 server products reveals an interesting fact. Dual-core and quad-core processors are themselves SMP systems, since all the processor cores access the same memory space. Connecting these multi-core processors together ends up creating a NUMA system. In other words, mainstream 2-way and larger x86 server products are "NUMA systems made by connecting SMP systems" (Figure 1.2) [5].

Figure 1.2: Typical 2-Way x86 Server

Accelerator

The parallel processing systems discussed in the previous sections are all made by connecting generic CPUs. Although this is an intuitive solution, another approach is to use different hardware, better suited to certain tasks, as a co-processor. The non-CPU hardware in this configuration is known as an accelerator.

Some popular accelerators include the Cell Broadband Engine (Cell/B.E.) and GPUs. Accelerators typically contain cores optimized for floating point arithmetic (fixed point arithmetic for some DSPs). Since these cores are relatively simple and thus take up little area on the chip, a large number of them can be placed on a single chip.

For example, the Cell/B.E. contains 1 PowerPC Processor Element (PPE), which is suited for processes requiring frequent thread switching, and 8 Synergistic Processor Elements (SPEs), which are cores optimized for floating point arithmetic. These 9 cores are connected by a high-speed bus called the Element Interconnect Bus (EIB) and placed on a single chip.

Another example is NVIDIA's GPU chip known as the Tesla T10, which contains 30 Streaming Multiprocessors (SMs), each made up of 8 Streaming Processor cores, for a total of 240 cores on one chip.

In recent years, these accelerators have attracted a lot of attention. This is mainly because the floating point arithmetic capability of generic CPUs has leveled off at around 10 GFLOPS, while the Cell/B.E. and GPUs deliver between 100 GFLOPS and 1 TFLOPS at a relatively low price. Accelerators are also "greener," making them a better option than cluster server systems for the many factories and research labs trying to cut back on power usage.

For example, the circuit board and semiconductor fields use automatic visual inspection. These checks grow in number and complexity every year, requiring faster image processing so that the rate of production is not compromised. Medical imaging devices such as ultrasound diagnostic devices and CT scanners take in higher and higher quality 2D images as input every year, and generic CPUs are not capable of processing these images in a practical amount of time. Using a cluster server for these tasks requires a vast amount of space as well as high power usage, so accelerators provide a portable and energy-efficient alternative to the cluster. These accelerators are typically used in conjunction with generic CPUs, creating what is known as a "hybrid system".

In summary, an accelerator allows for a low-cost, low-power, high-performance system. However, the transfer speed between the host CPU and the accelerator can become a bottleneck, making such a system unfit for applications requiring frequent I/O operations. Thus, the decision of whether to use a hybrid system, as well as what type of hybrid system, needs to be made wisely.

OpenCL, in brief, is a development framework for writing applications that run on these "hybrid systems".
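
As a small taste of what this looks like, the sketch below queries the devices of a hybrid system through the OpenCL API in C. It only lists the available devices and runs no kernel, and it assumes an OpenCL implementation is installed (compile with -lOpenCL).

/* Enumerate the OpenCL devices of a hybrid system. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint num_devices;
    char name[128];

    /* take the first available platform (e.g., a GPU vendor's driver) */
    clGetPlatformIDs(1, &platform, NULL);

    /* enumerate every device: host CPUs and accelerators alike */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

    for (cl_uint i = 0; i < num_devices; i++) {
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("device %u: %s\n", i, name);
    }
    return 0;
}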