gpu: a technical orientation

December 17, 2023.

I've been on a bit of a low-level tangent recently, exploring the intricate workings behind our everyday ML code. We often overlook the complexities while routinely using commands like torch.compile, model.generate(), or a @ b. My curiosity has been piqued particularly by aspects of CUDA, compilers, and GPU architectures, which is why I want to dive into an accessible discussion about GPUs in the context of ML.

Why are GPUs suited for Machine Learning

Conventional wisdom tells us that GPUs excel at machine learning. Maybe you've even heard that the reason is their parallel execution. If you're like me, you've stopped there and accepted that fact for what it is. Today, I want to take the next step in understanding why. What about the GPU architecture makes it so good at the foundational operation of ML: matrix multiplication?

GPUs and CPUs are designed with different goals in mind. CPUs are optimized for low latency: they excel at small operations such as a scalar multiplication (a * b) and can fetch small pieces of data from memory faster than a GPU can. GPUs, by contrast, are optimized for bandwidth, which is what matters for a matrix multiplication (A @ B): they move large chunks of data at once, trading higher per-access latency for far greater throughput. The best GPUs can sustain memory bandwidths of around 750GB/s, far surpassing the roughly 50GB/s of top CPUs, which is why they win decisively on large operations.
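To make this concrete, here is a rough benchmark sketch in PyTorch, assuming a CUDA-capable GPU is available. The exact numbers will vary wildly by hardware, but the pattern should hold: the CPU wins on tiny operations, the GPU wins on large matrix multiplications.

```python
import time
import torch

def bench(fn, warmup=3, iters=10):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

a_cpu = torch.randn(4096, 4096)
b_cpu = torch.randn(4096, 4096)
a_gpu, b_gpu = a_cpu.cuda(), b_cpu.cuda()

print("CPU matmul :", bench(lambda: a_cpu @ b_cpu))   # large op: GPU should win easily
print("GPU matmul :", bench(lambda: a_gpu @ b_gpu))
print("CPU scalar :", bench(lambda: 3.0 * 7.0))       # tiny op: CPU wins; on the GPU,
print("GPU scalar :", bench(lambda: torch.tensor(3.0, device="cuda") * 7.0))  # launch/transfer overhead dwarfs the work
```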

GPUs hide their higher memory latency through thread parallelism. While the first chunk of memory is still in flight, thousands of other threads have already issued their own requests, so data keeps arriving in a steady stream. This latency hiding is what allows a GPU to deliver its full bandwidth without the delays ever becoming visible.
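A back-of-envelope way to see why so many threads are needed: the amount of data that must be in flight to hide latency is bandwidth times latency (Little's law). The numbers below are illustrative assumptions, not measurements of any particular GPU.

```python
# Back-of-envelope: how much data must be in flight to hide global-memory latency?
latency_s  = 600e-9   # assumed global-memory latency, ~600 ns
bandwidth  = 750e9    # assumed peak bandwidth, 750 GB/s
line_bytes = 128      # assumed size of one memory transaction

inflight_bytes = bandwidth * latency_s          # Little's law: concurrency = rate * latency
inflight_lines = inflight_bytes / line_bytes
print(f"~{inflight_bytes / 1e3:.0f} KB in flight -> ~{inflight_lines:.0f} concurrent transactions")
# No single thread can keep thousands of requests outstanding; tens of thousands of
# threads collectively can, which is exactly what the GPU's thread parallelism provides.
```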

Transferring memory from system RAM to the GPU's onboard video RAM is only part of the process. The GPU's on-chip memory hierarchy consists of L2 cache, L1 cache, and registers, each level smaller and faster than the last. All computation happens on values held in registers, which sit directly next to the execution units; that proximity matters, because the shorter the distance, the faster the access. With so many cores, a GPU can place a large aggregate amount of register memory right next to its execution units, which increases both total register capacity and register bandwidth.

NVIDIA has played a significant role in streamlining the process of moving memory between different caches and maximizing register utilization. Their tools help developers write CUDA code that takes full advantage of these architectural benefits.

In deep learning, where the tensors involved are large, slow memory is the main bottleneck. Well-written GPU kernels therefore keep as much data movement as possible in registers (on the order of 80TB/s of aggregate bandwidth) and touch main memory (about 0.75TB/s) as little as they can; even so, main memory is so much slower that the time spent waiting on it usually dominates. The key factors that make GPUs particularly suited for deep learning are therefore their high-bandwidth main memory, their ability to hide memory access latency through thread parallelism, and their large, fast registers and caches.
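A quick sanity check of that gap, using the same (assumed) bandwidth figures: moving a single FP16 weight matrix takes two orders of magnitude longer from global memory than it would from registers.

```python
# Time to move one 4096x4096 FP16 matrix (~32 MB) at the bandwidths quoted above.
# Both bandwidth figures are the assumed, order-of-magnitude numbers from the text.
bytes_moved = 4096 * 4096 * 2      # FP16 = 2 bytes per element
main_mem_bw = 0.75e12              # ~0.75 TB/s global memory
register_bw = 80e12                # ~80 TB/s aggregate register bandwidth

print(f"from global memory: {bytes_moved / main_mem_bw * 1e6:.1f} us")
print(f"from registers    : {bytes_moved / register_bw * 1e6:.2f} us")
# The ~100x gap is why well-tuned kernels touch global memory once per tile and then
# reuse that data from registers and shared memory as many times as possible.
```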

Understanding these aspects of GPU architecture helps to demystify their effectiveness in ML. It's the careful balance of memory management and processing power that underscores their suitability for this rapidly advancing technological domain.

Tensor Cores

In our quest to understand GPU specifications more intuitively, it's worth considering their components in order of significance. Ranked roughly by importance, the key elements are Tensor Cores, memory bandwidth, the cache hierarchy, and lastly raw FLOPS.

Introduced by NVIDIA in the Volta architecture, Tensor cores have become central to large-scale ML training. These cores are specifically optimized for matrix multiplication. To illustrate their performance benefits, consider this: a 32x32 matrix multiplication (A * B) takes about 500 cycles without a Tensor core, but only around 230 cycles with one.
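From high-level code you don't program Tensor Cores directly; you opt into the reduced-precision paths that map onto them. Below is a minimal PyTorch sketch, assuming a CUDA GPU with Tensor Cores (Volta or newer); whether a Tensor Core kernel is actually selected depends on the dtype, the matrix shapes, and the library's heuristics.

```python
import torch

# Tensor Cores are reached through reduced-precision paths: autocast (FP16/BF16) for
# mixed precision, or TF32 for plain FP32 matmuls on Ampere and newer.
torch.backends.cuda.matmul.allow_tf32 = True   # allow FP32 matmuls to run on TF32 Tensor Cores

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b          # dispatched to an FP16 kernel, which uses Tensor Cores on Volta+
print(c.dtype)         # torch.float16
```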

Memory Bandwidth

Tensor Cores are incredibly fast, so fast, in fact, that they spend much of their time idling. When GPT-3 was trained, Tensor Core TFLOPS utilization was about 45-65%, meaning that even for a model with billions of parameters (and remember, the larger the matrices, the better for Tensor Cores) the cores were twiddling their thumbs roughly half the time. What does this mean? That we are memory bound, and memory bandwidth therefore becomes the best single indicator of a GPU's performance.
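One way to see whether an operation is memory bound is to compare its arithmetic intensity (FLOPs per byte moved) against the GPU's balance point. The sketch below uses assumed, A100-like figures and an idealized "touch each matrix once" model, which overstates real reuse; in practice, imperfect reuse plus all the low-intensity work around the matmuls (elementwise ops, normalization) is what drags utilization down to that 45-65%.

```python
# Compare a matmul's arithmetic intensity (FLOPs per byte moved) with the GPU's
# balance point. Hardware figures are assumed, A100-like numbers; the byte count
# is an idealized "read A and B once, write C once" model.
peak_tensor_flops = 312e12    # ~312 TFLOPS FP16 Tensor Core throughput (assumed)
mem_bandwidth     = 1.55e12   # ~1.55 TB/s HBM bandwidth (assumed)
balance = peak_tensor_flops / mem_bandwidth   # ~200 FLOPs needed per byte loaded

def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

for size in (128, 1024, 8192):
    ai = arithmetic_intensity(size, size, size)
    verdict = "compute-bound" if ai > balance else "memory-bound"
    print(f"{size}x{size}x{size}: ~{ai:.0f} FLOPs/byte -> {verdict}")
```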

L2 Cache / Shared Memory / L1 Cache / Registers

Since memory transfers are the limiting factor in Tensor Core performance, we also need to examine the local memory on the chip. The GPU memory hierarchy, comprising L2 cache, shared memory, L1 cache, and registers, plays a crucial role here. Matrix multiplication on GPUs works its way through this hierarchy, from slow global memory up to the much faster registers, with each level faster but smaller than the one before. The trick is to divide the multiplication into smaller sub-matrices, or memory tiles, so that the actual computation happens in the closer, faster levels of memory.

The multiplication itself is performed out of shared memory, which is local to each streaming multiprocessor (SM), roughly the GPU's equivalent of a CPU core; it is accelerated further by loading small fragments of these tiles into the ultra-fast registers that feed the Tensor Cores.
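Here is a CPU-side sketch of that access pattern in PyTorch: compute C = A @ B one output tile at a time, reusing each staged tile of A and B many times before going back to slow memory for more. This is only an illustration of the tiling idea, not real kernel code; on the GPU the staged tiles would live in shared memory and the accumulator in registers.

```python
import torch

def tiled_matmul(A, B, tile=32):
    """Compute C = A @ B one tile at a time (dimensions assumed divisible by `tile`)."""
    M, K = A.shape
    _, N = B.shape
    C = torch.zeros(M, N)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = torch.zeros(tile, tile)           # accumulator ("registers")
            for k in range(0, K, tile):
                a_tile = A[i:i+tile, k:k+tile]      # staged tile of A ("shared memory")
                b_tile = B[k:k+tile, j:j+tile]      # staged tile of B ("shared memory")
                acc += a_tile @ b_tile              # each staged tile is reused heavily
            C[i:i+tile, j:j+tile] = acc
    return C

A, B = torch.randn(128, 128), torch.randn(128, 128)
print(torch.allclose(tiled_matmul(A, B), A @ B, atol=1e-4))   # True
```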

Optimization through quantization

A few years ago, one of the more prevalent issues in deep learning was the limitation of 16-bit floating-point numbers (FP16). FP16 can only represent a limited range of values, and once gradients or activations exceed that range they overflow, which can derail training entirely. Traditionally, this has been mitigated through loss scaling, a workaround that, while effective, is far from ideal. The introduction of BrainFloat 16 (BF16) offers a more robust solution: by expanding the representable range to match that of 32-bit floating-point numbers (FP32), BF16 largely removes the need for loss scaling and simplifies training. This isn't just about preventing errors; it makes the training of AI models more stable and efficient, and it does so without requiring significant changes to existing codebases.
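A quick PyTorch check of the range difference (this runs on CPU): values that overflow to infinity in FP16 are still finite in BF16, because BF16 keeps FP32's exponent width and gives up mantissa bits instead.

```python
import torch

x = torch.tensor([1e5, 1e10, 1e30])

print(x.to(torch.float16))   # tensor([inf, inf, inf]) -- all three overflow FP16
print(x.to(torch.bfloat16))  # all finite, just coarsely rounded

print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38, essentially FP32's range
print(torch.finfo(torch.float32).max)    # ~3.40e38
```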

The support for 8-bit Float (FP8) in cutting-edge GPUs like the RTX 40 series and H100 is another monumental step. This feature allows for quicker data loading for matrix multiplication, enhanced cache storage capacity, and a remarkable increase in computational power. To put this in perspective, the computational power of an RTX 4090 GPU with FP8 is on par with the world's fastest supercomputer from 2007. However, the use of 8-bit precision isn't without its challenges, particularly in the stability of transformers in Large Language Models (LLMs). My research suggests that maintaining some dimensions in high precision can counteract these instabilities, ensuring smoother training and inference processes.
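The "keep a few dimensions in high precision" idea can be sketched in a few lines: feature dimensions with outlier magnitudes bypass quantization, and everything else goes through an 8-bit path. This is a toy illustration of the general technique, shown here with simple int8-style symmetric quantization rather than the actual kernels used in practice; the threshold value and the per-tensor scaling are simplifying assumptions.

```python
import torch

def mixed_precision_matmul(X, W, outlier_threshold=6.0):
    # Feature dimensions of X with outlier magnitudes skip quantization entirely.
    outlier_cols = X.abs().max(dim=0).values > outlier_threshold

    # High-precision path (FP16 on a GPU; plain FP32 here to keep the sketch CPU-only).
    hi = X[:, outlier_cols] @ W[outlier_cols, :]

    # 8-bit path for everything else: simple symmetric per-tensor quantization.
    X_lo, W_lo = X[:, ~outlier_cols], W[~outlier_cols, :]
    sx = X_lo.abs().max() / 127
    sw = W_lo.abs().max() / 127
    Xq = (X_lo / sx).round().clamp(-127, 127)
    Wq = (W_lo / sw).round().clamp(-127, 127)
    lo = (Xq @ Wq) * sx * sw

    return hi + lo

X = torch.randn(4, 64)
X[:, 3] *= 20                           # plant an outlier feature dimension
W = torch.randn(64, 32)
err = (mixed_precision_matmul(X, W) - X @ W).abs().max()
print(err)   # far smaller than if the outlier dimension were forced through 8 bits too
```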

The research also highlights that 8-bit matrix multiplication can achieve comparable results to 16-bit baselines, a significant finding considering the efficiency benefits of 8-bit computation. Furthermore, the FP8 data type, introduced in the RTX 40 series, offers more stability and ease of use compared to the Int8 data type, especially in functions like layer normalization or nonlinear functions. This makes FP8 a strong candidate for widespread adoption in both training and inference in the near future.
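A small numerical illustration of why FP8 is easier to handle than Int8: compare the dynamic range (ratio of largest to smallest representable magnitude) of the two standard FP8 layouts, E4M3 and E5M2, with that of Int8. The limits below are the standard values for those formats.

```python
# Dynamic range of the two standard FP8 layouts versus Int8.
# (min, max) = smallest positive nonzero magnitude, largest finite magnitude.
formats = {
    "int8":     (1, 127),
    "fp8 e4m3": (2**-9, 448.0),
    "fp8 e5m2": (2**-16, 57344.0),
}
for name, (lo, hi) in formats.items():
    print(f"{name:9s} [{lo:.6g}, {hi:g}] -> dynamic range ~{hi / lo:,.0f}x")
# Int8 spans only ~127x between its smallest and largest magnitude, so the outputs of
# layer norm or nonlinear functions need careful rescaling; the FP8 formats cover a far
# wider range natively, which is what makes them easier to use in training and inference.
```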