For the research in this assignment, a CUDA-enabled GPU will be used to implement use cases and to test the performance of CUDA against other parallel programming platforms.
Just as MPI can be deployed to manage a distributed memory system containing thousands of Central Processing Units, CUDA can be deployed in a distributed memory context to manage cloud installations with thousands of Graphics Processing Units. CUDA supports graphics processing for scientific applications, including visualization of galaxies in astronomy and molecular visualization for biology and medicine. CUDA-enabled devices range from embedded systems to cloud and data center hardware.
[https://developer.nvidia.com/cuda-toolkit CUDA Toolkit]
= History & Current State of Central & Graphics Processing Units =
== History of GPUs ==
“The graphics processing unit (GPU), first invented by NVIDIA in 1999, is the most pervasive parallel processor to date.” [https://www.sciencedirect.com/topics/engineering/graphics-processing-unit Source] The GeForce 256 was marketed as “the world's first GPU”. Designed on a single chip, this product was a champion in its time, delivering “480 million 8-sample fully filtered pixels per second.” [https://www.anandtech.com/show/429/3 Source] It featured hardware Transform & Lighting and Cube Environment Mapping, which were graphics rendering methods of the day, and supported Direct3D 7.0 and OpenGL 1.2, which run on Windows and Linux respectively.
== Current State of GPUs ==
Today, graphics processing units (GPUs) exist in many different forms:
• Dedicated or discrete graphics cards are the most powerful form and usually connect to a PCIe lane. Although these units have their own dedicated memory as well as memory shared with the CPU, they cannot replace the CPU, and a CPU paired with a GPU does not constitute a distributed memory system, which is a system of multiple CPUs each with their own memory. Dedicated graphics cards can be found in PCs that require extensive graphics rendering power, such as workstations for scientific visualization and gaming PCs.
• Integrated graphics processing units (iGPUs) share RAM with the CPU and can appear on the CPU die or near it depending on the architecture. Integrated GPUs usually ship with notebook-style laptops to handle everyday graphics rendering tasks. At 2.6 trillion floating-point operations per second, the integrated graphics of the recently released Apple M1 chip were the fastest in the world at the time of writing.
• Hybrid graphics processing units have earned their name by scoring in between dedicated and integrated GPUs in price and performance. An example is NVIDIA’s TurboCache, which shares main system memory.
• General Purpose Graphics Processing Units (GPGPUs) are modified vector processors (like those found on CPUs) running compute kernels designed for high throughput. These modifications allow them to support APIs such as OpenMP. Since CUDA was released on June 23, 2007, NVIDIA’s G8x series and onwards (all of their GPUs released after November 8, 2006) are GPGPU capable.
• External GPUs (eGPUs), as the name suggests, sit outside the computer's housing and have their own enclosure and power supply. They connect to the computer's PCIe lanes, typically through a Thunderbolt 3 or 4 port. So, if you have a compatible computer and are looking for more graphics processing power, eGPUs can be a cost-saving solution.
== History and Current State of CUDA ==
NVIDIA’s Fermi has been described as the first complete GPU compute architecture.
“NVIDIA’s Fermi GPU architecture consists of multiple streaming multiprocessors (SMs), each consisting of 32 cores, each of which can execute one floating-point or integer instruction per clock. The SMs are supported by a second-level cache, host interface, GigaThread scheduler, and multiple DRAM interfaces.”
[https://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf NVIDIA Fermi]
A CUDA core is approximately comparable to a CPU core, though it is far simpler and is designed to operate alongside thousands of others.
NVIDIA’s latest CUDA architecture is Ampere; its flagship implementation is the GA102 chip.
“Each SM in GA10x GPUs contain 128 CUDA Cores, four third-generation Tensor Cores, a 256 KB Register File, four Texture Units, one second-generation Ray Tracing Core, and 128 KB of L1/Shared Memory, which can be configured for differing capacities depending on the needs of the compute or graphics workloads. The memory subsystem of GA102 consists of twelve 32-bit memory controllers (384-bit total). 512 KB of L2 cache is paired with each 32-bit memory controller, for a total of 6144 KB on the full GA102 GPU.”
[https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf NVIDIA Ampere GA102]
As the processing power of GPUs grew beyond what was necessary for rendering pixels on the screen, a method was needed to utilize the otherwise wasted computing power. General Purpose Graphics Processing Units can perform both general-purpose computations and graphics processing computations. With the release of CUDA, all CUDA-enabled GPUs can perform general-purpose processing, with double-precision floating point supported since compute capability 1.3.
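To see which features a given card supports, the CUDA runtime API can report each device's compute capability. Below is a minimal sketch (not code from this assignment) using cudaGetDeviceProperties; the double-precision check reflects the compute capability 1.3 threshold mentioned above.
<syntaxhighlight lang="cuda">
// Minimal sketch: list CUDA devices and their compute capability,
// which determines feature support such as double-precision floats
// (available from compute capability 1.3 onward).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
        if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3))
            printf("  Double precision supported\n");
    }
    return 0;
}
</syntaxhighlight>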
== CPUs vs. GPUs ==
CPUs have faster clock speeds and can execute a wide array of general-purpose instructions used for controlling computers. At the time of their creation, GPUs could not do this. GPUs focus on having as many cores as possible with a limited instruction set, providing sheer computing power in specialized scenarios. This higher core count allows modern GPUs to achieve more floating-point operations per second than modern CPUs. It is highly debated whether CPUs will replace GPUs or the other way around; I believe neither will happen, given the distinct roles these units play in any given system. The CPU is the physical and logical component that any operating system uses to run programs and manage hardware, and if a computer has a screen, it needs a GPU. The depth and complexity of the graphics rendering required determines which GPU should be used.
== NVIDIA vs. AMD ==
The main architectural differences between AMD and NVIDIA GPUs lie in their execution units, cache hierarchies, and graphics pipelines. A graphics pipeline is the conceptual model a GPU uses to render 3D images on a 2D screen. Conceptually and theoretically these devices are quite different, but their practical uses, and the block diagrams of modern GPUs from both companies, are quite similar.
NVIDIA and AMD have different product lineups, audiences, and business visions. AMD offers both CPUs and GPUs, while NVIDIA focuses on GPUs. Despite being founded nearly twenty-five years before NVIDIA, AMD currently sits at 28% of its competitor's market capitalization. Intel's market cap is currently 43% of NVIDIA's, making NVIDIA the largest of the three big chip makers.
Traditionally in the world of PC building, NVIDIA and Intel release the more expensive but higher-performing hardware, while AMD provides the best price-to-performance value.
[https://www.hardwaretimes.com/amd-navi-vs-nvidia-turing-comparing-the-radeon-and-geforce-graphics-architectures/ Source] [https://ycharts.com/companies/NVDA/r_and_d_expense Source] [https://www.statista.com/statistics/267873/amds-expenditure-on-research-and-development-since-2001/ Source]
= Software Development Strategies =
oneAPI is branded by Intel as “A Unified X-Architecture Programming Model.” Its CPU applications were covered in this course, but oneAPI is designed for diverse architectures and can be used to simplify programming across GPUs as well. The Intel oneAPI DPC++ (Data Parallel C++) compiler includes CUDA backend support, which was initially released almost two years ago. Considering Intel oneAPI was officially released in December 2020, oneAPI seems to live up to its brand name.
[https://phoronix.com/scan.php?page=news_item&px=Intel-oneAPI-DPCpp-CUDA-Release Source] [https://codematters.online/intel-oneapi-faq-part-1-what-is-oneapi/ Source]
== Software Programming Frameworks ==
CUDA's special kernel launch syntax, <<<…>>>, specifies the execution configuration: the number of blocks and the number of threads per block. For example, <<<1, N>>> launches one block of N threads, as shown in the sketch below.
[https://www.kdnuggets.com/2016/11/parallelism-machine-learning-gpu-cuda-threading.html Source] [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Source]
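A minimal sketch of this launch syntax, using a hypothetical element-wise addition kernel named add launched with one block of N threads; the kernel name, the sizes, and the use of managed memory are illustrative assumptions.
<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

// Hypothetical kernel: each of the N threads adds one pair of elements.
__global__ void add(const float *a, const float *b, float *c) {
    int i = threadIdx.x;          // this thread's index within the block
    c[i] = a[i] + b[i];
}

int main() {
    const int N = 256;
    float *a, *b, *c;
    // Unified (managed) memory keeps the sketch short; cudaMalloc +
    // cudaMemcpy would work equally well.
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Execution configuration: 1 block of N threads, as described above.
    add<<<1, N>>>(a, b, c);
    cudaDeviceSynchronize();      // wait for the GPU to finish

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
</syntaxhighlight>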
Common libraries contained in the CUDA Framework, accessible through the CUDA API:
cuBLAS - “the CUDA Basic Linear Algebra Subroutine library.”
cuFFT - “the CUDA Fast Fourier Transform library.”
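To illustrate how these libraries are called, here is a minimal cuBLAS sketch computing SAXPY (y = αx + y), one of the Level-1 Basic Linear Algebra Subroutines; the array contents are illustrative only. Compile with nvcc and link against -lcublas.
<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4;
    const float alpha = 2.0f;
    float hx[n] = {1, 2, 3, 4};
    float hy[n] = {0, 0, 0, 0};

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Copy inputs to the device using cuBLAS helper routines.
    cublasSetVector(n, sizeof(float), hx, 1, dx, 1);
    cublasSetVector(n, sizeof(float), hy, 1, dy, 1);

    // y = alpha * x + y, computed on the GPU.
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);

    cublasGetVector(n, sizeof(float), dy, 1, hy, 1);
    printf("%f %f %f %f\n", hy[0], hy[1], hy[2], hy[3]); // expected: 2 4 6 8

    cublasDestroy(handle);
    cudaFree(dx); cudaFree(dy);
    return 0;
}
</syntaxhighlight>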
== Software Programming Toolkits ==
Where frameworks are structures of libraries containing code we access, toolkits are software programs that help developers build and maintain their own programs.
oneAPI HPC Toolkit
CUDA Toolkit – available for download from NVIDIA’s website. It includes IDE extensions for features such as syntax highlighting and for easily creating CUDA projects, similar to OpenMP, TBB & MPI. Also included in the CUDA Toolkit are the CUDA compiler, the framework libraries, the Nsight profiler for Visual Studio for monitoring application performance, and much more.
[https://www.infoworld.com/article/3299703/what-is-cuda-parallel-programming-for-gpus.html Source]
= Use Cases for Parallel Computing with GPUs =
== Video Processing ==
My favourite use case in this bucket: video games! Originally the exclusive use case for the Graphics Processing Unit was… graphics processing. Otherwise known as pushing pixels to the screen, this is the behaviour gaming graphics cards have been optimized for since their inception. Interestingly, one of the most important calculations in graphics is matrix multiplication. Matrices are generic operators for graphic transformations such as translation, rotation, scaling, reflection, and shearing. Of course, video rendering techniques have evolved greatly over the last couple of decades, from 2D to 3D rendering methods and additional layers of graphical computation, resulting in continuously increasing realism.
The average screen is 1920 by 1080 pixels and the average refresh rate is 60 frames per second, so updating every pixel on the screen each second requires 1920 × 1080 × 60 = 124,416,000 pixel updates. From a computational standpoint that is not a very big number, especially considering a 2 gigahertz CPU can perform two billion operations per second, which raises the question: why GPUs? The answer is that in intense graphics rendering scenarios, the number of calculations needed to update all pixels on the screen each second can vastly exceed two billion operations per second. However, they are simple operations like matrix multiplication. This makes the role of the GPU clear: to perform massive amounts of simple operations in parallel. Because a GPU performs so many calculations per clock cycle, its clock speed does not need to be as high.
[https://superuser.com/questions/461022/how-does-the-cpu-and-gpu-interact-in-displaying-computer-graphics Source] [https://stackoverflow.com/questions/40337748/graphics-matrix-by-matrix-multiplication-necessary-for-transformations Source] [https://www.includehelp.com/computer-graphics/types-of-transformations.aspx Source]
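To make the idea of "massive amounts of simple operations" concrete, here is a sketch of a CUDA kernel applying a single 2D rotation matrix to a large batch of points in parallel; the kernel name, point count, and angle are illustrative assumptions, not code from any real renderer.
<syntaxhighlight lang="cuda">
#include <cmath>
#include <cuda_runtime.h>

// Each thread rotates one 2D point (x, y) by the angle theta:
//   x' = x*cos(theta) - y*sin(theta)
//   y' = x*sin(theta) + y*cos(theta)
__global__ void rotate2D(float *x, float *y, int n, float c, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i], yi = y[i];
        x[i] = xi * c - yi * s;
        y[i] = xi * s + yi * c;
    }
}

int main() {
    const int n = 1 << 20;                // ~1 million vertices
    const float theta = 3.14159265f / 4;  // 45-degree rotation
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 0.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads; // enough blocks to cover all n points
    rotate2D<<<blocks, threads>>>(x, y, n, cosf(theta), sinf(theta));
    cudaDeviceSynchronize();

    cudaFree(x); cudaFree(y);
    return 0;
}
</syntaxhighlight>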
Modern-day GPUs have dedicated physical units for tasks such as 3D rendering, copying, video encoding & decoding, and more. As you may expect, 3D rendering is the process of converting 3D models into 2D images on a computer; video encoding occurs on outbound data and video decoding on inbound data. For small tasks such as video calls and watching YouTube videos, integrated graphics will suffice, and notebook-style laptops usually ship with them for this reason. For larger tasks such as 4K video editing or 4K gaming, dedicated hardware is required.
=== Real Time ===
3D rendering often occurs in real time, especially in video games and simulations such as those that occur in software development for robotic applications.
=== Non-Real Time or Pre-Rendering ===
Any time we watch or upload a video through a computer, whether it is stored on the device or being streamed over a network, data containing the video must be encoded before it is stored or transmitted and decoded before it can be displayed. This is where pre-rendering occurs.
== Machine Learning ==
Machine learning has gained extreme popularity in recent years due to a combination of factors including big data, computational advances, and cloud business models. Every industry can benefit from Machine Learning Artificial Intelligence technologies, whether through data analysis to improve operation quality or embedded robotic systems to automate physical tasks.
[https://www.zdnet.com/article/three-reasons-why-ai-is-taking-off-right-now-and-what-you-need-to-do-about-it/ Source] [https://www.techcenturion.com/tensor-cores/ Source]
== Science & Research ==
= Use Case Implementation =
== Machine Learning (TensorFlow) ==
“NVIDIA's GTX 1080 does around 8.9 teraflops.” [https://www.ign.com/articles/2019/03/19/what-are-teraflops-and-what-do-they-tell-us-about-googles-stadia-performance Source] The Apple M1 does around 2.6 teraflops. [https://www.tomshardware.com/news/Apple-M1-Chip-Everything-We-Know Source]
8.9 / 2.6 ≈ 3.42 (ratio of theoretical teraflops)
1.136666667 / 0.142666667 ≈ 7.97 (ratio of measured TensorFlow run times, GTX 1080 vs. M1)
Although the GTX 1080 does around 9 teraflops, about three and a half times the Apple M1, the M1 consistently performed about 8 times faster on the TensorFlow machine learning workloads tested. This is most likely because the GTX 1080 is a Graphics Processing Unit meant for gaming, while the Apple M1 chip has a dedicated 16-core Neural Engine for machine learning. The test results support Apple's claim that this dedicated neural component greatly accelerates machine learning. Unfortunately, an NVIDIA Volta or Turing GPU was not available for testing in this scenario. Volta and Turing are NVIDIA's machine learning architectures, offering a wide range of devices for machine learning uses, from embedded devices to data center and cloud. These architectures include Tensor Cores that are dedicated to machine learning and greatly increase efficiency.
[https://developer.nvidia.com/deep-learning Source] [https://www.howtogeek.com/701804/how-unified-memory-speeds-up-apples-m1-arm-macs/ Source]
= CUDA Performance Testing =
== Matrix Multiplication – CUDA vs. OpenMP==
For this test, the CUDA code was run on a CUDA-enabled GPU and the OpenMP code was run on the CPU; a truer test of CUDA vs. OpenMP would have the OpenMP code also run on the GPU.
Given that matrix multiplication is well suited to GPUs, and that CUDA is of course optimized for NVIDIA GPUs, this seems like an unfair test, with CUDA crushing OpenMP.
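The benchmark source is not reproduced here; the following is a sketch of what such a comparison could look like, pairing a naive CUDA matrix-multiply kernel with an equivalent OpenMP loop in one file (compile with nvcc -Xcompiler -fopenmp). The matrix size is an arbitrary assumption and timing code is omitted for brevity.
<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

// Naive CUDA matrix multiply: one thread computes one element of C.
__global__ void matmulCUDA(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Equivalent OpenMP version running on the CPU.
void matmulOpenMP(const float *A, const float *B, float *C, int n) {
    #pragma omp parallel for
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
}

int main() {
    const int n = 512;
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 threads(16, 16);
    dim3 blocks((n + 15) / 16, (n + 15) / 16);
    matmulCUDA<<<blocks, threads>>>(A, B, C, n);   // GPU version
    cudaDeviceSynchronize();

    matmulOpenMP(A, B, C, n);                      // CPU version

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
</syntaxhighlight>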
== Matrix Multiplication - CUDA vs OpenCL==
As covered earlier, CUDA is proprietary to NVIDIA and only works on CUDA-enabled NVIDIA GPUs. This is not the case for OpenCL, which is an open standard and runs on a wide variety of GPUs and CPUs. This should produce the expected result that CUDA will outperform OpenCL on any given CUDA-enabled device.
Interestingly, with a small array size, OpenCL seems to outperform CUDA; this is presumably due to parallel overhead.
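That overhead can be observed directly by timing a tiny kernel with CUDA events: for very small arrays, the fixed cost of launching and synchronizing dwarfs the arithmetic itself. A minimal sketch, with the kernel and array size chosen purely for illustration:
<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;   // trivial amount of work per element
}

int main() {
    const int n = 64;          // deliberately tiny workload
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<1, 64>>>(a, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // For n this small, the measured time is almost entirely launch
    // overhead rather than computation.
    printf("Kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    return 0;
}
</syntaxhighlight>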
= CUDA Limitations =
The main limitations of early CUDA versions were a lack of recursive functions (recursion has been supported since compute capability 2.0) and a minimum scheduling unit, the warp, of 32 threads.
[http://ixbtlabs.com/articles3/video/cuda-1-p3.html#:~:text=Main%20limitations%20of%20CUDA%3A,unit%20block%20of%2032%20threads Source]
Due to its proprietary nature, a limitation of CUDA is the limited range of devices and systems it supports. That same proprietary nature allows NVIDIA to completely maximize performance, resulting in CUDA outperforming OpenCL, at least on any given CUDA-enabled device.
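One note on the recursion limitation above: on devices of compute capability 2.0 (Fermi) or newer, device-side recursion compiles and runs. A minimal sketch, using an illustrative recursive factorial:
<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cuda_runtime.h>

// Recursive device function: requires compute capability 2.0 (Fermi)
// or newer; earlier devices rejected recursion at compile time.
__device__ unsigned long long factorial(unsigned int n) {
    return (n <= 1) ? 1ULL : n * factorial(n - 1);
}

__global__ void factorialKernel(unsigned long long *out, unsigned int n) {
    *out = factorial(n);
}

int main() {
    unsigned long long *out;
    cudaMallocManaged(&out, sizeof(unsigned long long));
    factorialKernel<<<1, 1>>>(out, 10);
    cudaDeviceSynchronize();
    printf("10! = %llu\n", *out);   // expected: 3628800
    cudaFree(out);
    return 0;
}
</syntaxhighlight>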