Notes

Chapter 1

1: Wikipedia, http://en.wikipedia.org/wiki/Parallel_computing

2: Flynn, M. J., Some Computer Organizations and Their Effectiveness, IEEE Trans. Comput., Vol. C-21, No. 9, pp. 948-960, 1972.

3: TOP500 Supercomputer Sites, http://www.top500.org

4: Intel MultiProcessor Specification, http://www.intel.com/design/pentium/datashts/242016.htm

5: http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa/

6: Amdahl, G.M. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings vol. 30 (Atlantic City, N.J., Apr. 18-20). AFIPS Press, Reston, Va., 1967, pp. 483-485.

Chapter 2

7: NVIDIA uses the term "GPGPU computing" to refer to performing general-purpose computations through the graphics API, and the term "GPU computing" to refer to performing computations using CUDA or OpenCL. This text uses the term "GPGPU" for any general-purpose computation performed on the GPU.

8: http://www.nvidia.com/object/cuda_home/

9: A kernel can be launched without the <<<>>> execution-configuration syntax, but doing so requires calls to multiple CUDA-specific library functions, as the sketch below shows.
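As a rough illustration, the following sketch launches a kernel through the CUDA driver API as found in recent CUDA releases, rather than through <<<>>>. The file name "vecAdd.ptx", the kernel name "vecAdd", the assumed kernel signature (a float pointer and an int), and the launch dimensions are all hypothetical.

#include <cuda.h>

/* Rough equivalent of vecAdd<<<n/256, 256>>>(d_a, n) via the driver API */
int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction func;
    CUdeviceptr d_a;
    int n = 1024;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "vecAdd.ptx");          /* load compiled PTX */
    cuModuleGetFunction(&func, mod, "vecAdd"); /* look up the kernel */
    cuMemAlloc(&d_a, n * sizeof(float));

    void *args[] = { &d_a, &n };               /* kernel arguments */
    cuLaunchKernel(func,
                   n / 256, 1, 1,              /* grid dimensions */
                   256, 1, 1,                  /* block dimensions */
                   0, NULL,                    /* shared mem, stream */
                   args, NULL);
    cuCtxSynchronize();

    cuMemFree(d_a);
    cuCtxDestroy(ctx);
    return 0;
}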

10: http://www.ibm.com/developerworks/power/cell/index/

11: Which devices can be used depends on each company's OpenCL implementation. The implementation should therefore be chosen to match the target platform, as illustrated below.
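For example, the following sketch enumerates every installed OpenCL implementation (platform) and the devices each one exposes. The same machine can report different device lists depending on which vendor's implementation is queried.

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_device_id devices[8];
    cl_uint num_platforms, num_devices, i, j;
    char name[256];

    clGetPlatformIDs(8, platforms, &num_platforms);
    for (i = 0; i < num_platforms; i++) {
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform: %s\n", name);

        /* Which devices appear here depends on this vendor's implementation */
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL,
                       8, devices, &num_devices);
        for (j = 0; j < num_devices; j++) {
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME,
                            sizeof(name), name, NULL);
            printf("  Device: %s\n", name);
        }
    }
    return 0;
}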

12: Some reasons for this requirement are:

(1) Each processor on the device is often only capable of running small programs, due to the limited amount of accessible memory.

(2) Because of this limited memory, the device commonly does not run an OS. This is actually beneficial, as it allows the processors to concentrate solely on computation.

13: At present, it is not possible to combine an OpenCL compiler from one company with an OpenCL runtime library from another. This should be kept in mind when programming a platform containing devices from multiple chip vendors.

Chapter 4

14: The flag passed as the 2nd argument only specifies how the kernel side can access the memory space. If CL_MEM_READ_ONLY is specified, the kernel is only allowed to read from the specified address space; it does not imply that the buffer will be created in constant memory. Also, the host is only allowed access to either the global memory or the constant memory on the device.
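A minimal sketch of the distinction (assuming "context" is a valid cl_context; the kernel name "foo" is hypothetical): the host-side flag only restricts kernel access, while the kernel-side address-space qualifier is what decides between global and constant memory.

/* Host side: CL_MEM_READ_ONLY only restricts kernel-side access */
cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                            1024 * sizeof(float), NULL, &err);

/* Kernel side: the address-space qualifier, not the flag, decides
   whether an argument resides in global or constant memory */
__kernel void foo(__global const float *g,  /* global memory   */
                  __constant float *c)      /* constant memory */
{
    /* ... */
}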

15: Otherwise, each kernel would have to be compiled independently.

Chapter 5

16: The OpenCL specification does not state that local memory must correspond to the scratch-pad memory, but on devices that have one, this is most likely the case. Also, for those experienced in CUDA, note that OpenCL local memory corresponds to CUDA shared memory, not to CUDA local memory. A sketch follows.
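For CUDA programmers, a minimal sketch: the __local qualifier below plays the role of CUDA's __shared__ qualifier, and the buffer will typically be placed in on-chip scratch-pad memory where one exists. The kernel and argument names are hypothetical.

/* Per-work-group reduction staged through local memory */
__kernel void sum_reduce(__global const float *in,
                         __global float *out,
                         __local  float *tmp)   /* like CUDA __shared__ */
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    tmp[lid] = in[get_global_id(0)];  /* stage data in local memory */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0) {                   /* work-item 0 sums the group */
        float s = 0.0f;
        for (size_t i = 0; i < lsz; i++)
            s += tmp[i];
        out[get_group_id(0)] = s;
    }
}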

17: At the time of this writing (Dec 2009), the OpenCL implementation on Mac OS X returns CL_TRUE when queried for image support, but image objects are not actually supported.
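The query in question can be performed as below (a sketch; "device" is assumed to be a valid cl_device_id). A CL_TRUE result should normally mean that image objects are usable.

cl_bool has_images;
clGetDeviceInfo(device, CL_DEVICE_IMAGE_SUPPORT,
                sizeof(cl_bool), &has_images, NULL);
/* has_images == CL_TRUE should indicate image objects can be used */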

Chapter 6

18: Podlozhnyuk, V., Parallel Mersenne Twister, NVIDIA CUDA SDK white paper.

19: http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf

20: The "mts.h" file is not explained here; one only needs to know that it declares three variables, named "mts128", "mts256", and "mts512", each containing the corresponding number of DC parameters.