Demystifying FLOPS: A Beginner's Guide to Computing Performance

Floating point operations per second, or FLOPS, have become a standard way to measure computing performance, especially for complex math-intensive tasks like scientific simulations, machine learning, and graphics rendering. But what exactly are FLOPS and how are they calculated?

In this comprehensive guide, we'll start from the ground up to demystify FLOPS and give you a deeper understanding of compute power benchmarks. Whether you're a coder looking to speed up your neural network or a gamer trying to interpret GPU specs, you'll learn all the ins and outs of this vital performance metric. Let's dive in!

A Quick Primer on FLOPS

First, a quick primer on what FLOPS measure. FLOPS stands for floating point operations per second. As the name suggests, it refers to the number of floating point math operations a computer can process each second.

Some key facts:

  • Floating point means calculations with fractional numbers that have a decimal point. Integers like 1, 2, 3 are not floating point values.

  • Basic floating point operations are addition, subtraction, multiplication, and division. More complex formulas count as multiple operations (FLOPs).

  • FLOPS measure the theoretical peak performance of hardware. Real-world workloads will achieve lower sustained FLOPS.

  • Consumer PCs and phones measure FLOPS in gigaFLOPS (10^9) or teraFLOPS (10^12). Supercomputers reach petaFLOPS (10^15) or exaFLOPS (10^18).

Now let's see how FLOPS are actually calculated, starting with CPU performance.

Calculating Theoretical CPU FLOPS

Modern CPU architectures rely on techniques like pipelining, superscalar execution, SIMD, and FMA to maximize FLOPS. Let's break down how each of these contributes:

Superscalar Execution

Superscalar CPUs can process multiple instructions in parallel each cycle. While early CPUs could only handle 1 instruction per cycle, most modern CPUs execute 2, 3 or even 4 instructions per cycle. This superscalar factor directly multiplies theoretical FLOPS.

SIMD: Single instruction, multiple data

SIMD allows a single instruction to perform the same operation across multiple data values simultaneously. For example, adding two 256-bit SIMD registers performs four 64-bit double-precision additions or eight 32-bit single-precision additions in parallel in a single cycle.

Wider SIMD registers allow more data parallelism. Modern CPUs support 128-bit, 256-bit, and 512-bit SIMD widths.

FMA: Fused multiply-add

FMA combines a multiplication and an addition into one instruction. Because a single FMA counts as two floating point operations, it doubles peak FLOPS compared to issuing the multiply and add separately. CPUs support 128-bit, 256-bit or wider FMA.

Putting It Together

Here is the overall formula for calculating maximum CPU FLOPS:

Peak FLOPS = Cores x Frequency x (SIMD Factor + FMA Factor) x Superscalar Factor

Where:

  • Cores = Number of CPU cores
  • Frequency = Clock speed in GHz
  • SIMD Factor = SIMD Width / Data Type Size
  • FMA Factor = FMA Width / Data Type Size
  • Superscalar Factor = SIMD/FMA instructions issued per cycle (typically 1 or 2)

Let's break this down for a real CPU:

AMD Ryzen 9 5950X:

  • 16 cores
  • Up to 4.9 GHz frequency
  • 256-bit SIMD
  • 256-bit FMA
  • 64-bit floating point data
  • 2 instructions per cycle

Plugging this in:

  • SIMD Factor: 256 / 64 = 4
  • FMA Factor: 256 / 64 = 4
  • Superscalar Factor: 2
  • Peak FLOPS = 16 cores x 4.9 GHz x (4 + 4) x 2 ≈ 1,254 gigaFLOPS (FP64), as the short sketch below confirms
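
To make the arithmetic easy to repeat for other chips, here is a minimal Python sketch of the same formula. The function and parameter names are just illustrative; substitute the specs of whatever CPU you want to evaluate.

```python
def cpu_peak_gflops(cores, ghz, simd_bits, fma_bits, data_bits, issue_per_cycle):
    """Theoretical peak GFLOPS using the formula above (an upper bound only)."""
    simd_factor = simd_bits // data_bits  # parallel lanes per SIMD instruction
    fma_factor = fma_bits // data_bits    # extra operations from fused multiply-add
    return cores * ghz * (simd_factor + fma_factor) * issue_per_cycle

# AMD Ryzen 9 5950X at its boost clock, 64-bit (double precision) data
print(cpu_peak_gflops(cores=16, ghz=4.9, simd_bits=256,
                      fma_bits=256, data_bits=64, issue_per_cycle=2))
# ~1254 GFLOPS
```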

As you can see, increasing cores, frequency, SIMD/FMA widths and superscalar performance all help scale peak computational throughput.

Real-World CPU FLOPS

The peak FLOPS formula gives the upper bound assuming perfect utilization. But real workloads will achieve lower sustained FLOPS, typically 50-70% of peak. There are several reasons for this:

Pipelining Efficiency

Pipelining lets CPUs work on multiple instructions in parallel across different stages – but pipeline stalls can occur due to branching or dependencies that reduce utilization. Modern CPUs predict branches and optimize pipelines to improve efficiency.

Memory Access Latency

Longer memory access latency due to cache misses can cause pipelines to stall as they wait for data. Computation and data movement need to be balanced to maximize FLOPS.

Instruction Level Parallelism

Superscalar logic needs to find enough independent instructions in typical code to fully leverage parallel execution capacity. Suboptimal code can limit ILP.

Precision and Data Movement

Higher precision formats like double precision halve the number of values that fit in each SIMD register, so fewer operations complete per cycle. Moving data on and off chip also incurs overhead. Single precision (FP32) or lower typically maximizes FLOPS.

By optimizing code to minimize stalls, maximize ILP, and leverage pipelines, it's possible to get over 70% utilization and much higher sustained FLOPS on real workloads.
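
One way to see the gap between peak and sustained performance yourself is to time a large matrix multiplication, which costs roughly 2·n³ floating point operations, and compare the achieved rate against your CPU's theoretical peak. A rough sketch using NumPy (which dispatches the multiply to an optimized BLAS library):

```python
import time
import numpy as np

n = 4096
a = np.random.rand(n, n)   # float64 by default
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b                  # dense matrix multiply: about 2 * n**3 FLOPs
elapsed = time.perf_counter() - start

achieved_gflops = 2 * n ** 3 / elapsed / 1e9
print(f"Achieved roughly {achieved_gflops:.1f} GFLOPS in {elapsed:.2f} s")
```

Dividing the achieved figure by the peak from the formula above gives a rough estimate of utilization on your machine.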

GPU FLOPS Calculations

Let's shift our focus to understanding peak FLOPS for GPUs. GPU architectures have thousands of smaller cores optimized for data parallelism and high math throughput. Here is the GPU FLOPS formula:

Peak FLOPS = Cores x Frequency x FLOPS per Cycle per Core

Where:

  • Cores = Total GPU cores
  • Frequency = GPU core clock speed
  • FLOPS per Cycle per Core = operations each core completes per clock (2 when a core retires one FMA per cycle)

For example, Nvidia's RTX 3090 GPU has:

  • 10,496 CUDA cores
  • 1.7 GHz clock speed
  • 1 FMA per CUDA core per cycle = 2 FLOPS per cycle per core (at FP32)

Plugging this in:

  • 10,496 cores x 1.7 GHz x 2 FLOPS per cycle per core ≈ 35.7 teraFLOPS peak FP32 (see the quick check below)
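
The same back-of-the-envelope calculation as a quick Python check (the helper name is just for illustration):

```python
def gpu_peak_tflops(cuda_cores, ghz, flops_per_cycle_per_core=2):
    """Peak TFLOPS assuming each core completes one FMA (2 FLOPs) per cycle."""
    return cuda_cores * ghz * flops_per_cycle_per_core / 1e3

print(gpu_peak_tflops(10_496, 1.7))   # ~35.7 TFLOPS at FP32
```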

As with CPUs, actual sustained FLOPS will be 50-70% of this peak. GPUs achieve much higher throughput due to having thousands of smaller, simpler cores.

Neural Network FLOPS

For training and inference of neural networks, we can estimate the operation count (FLOPs) based on the layer type and dimensions:

Convolutional Layers

FLOPs = 2 x Input Channels x Filter Count x Filter Height x Filter Width x Output Height x Output Width

For example, a layer with 128 5×5 convolutional filters on a single-channel 256×256 input (with padding so the output stays 256×256) would need 2 x 1 x 128 x 5 x 5 x 256 x 256 ≈ 419 million FLOPs.

Fully Connected Layers

FLOPs = 2 x Input Neurons x Output Neurons

So a layer with 2048 input neurons and 1024 output neurons requires 2 x 2048 x 1024 ≈ 4.2 million FLOPs.

Tracking total FLOPs by layer gives an estimate of a network's computational complexity, as in the small tally below.
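
Here is a small sketch that adds up these per-layer estimates. The helper functions are purely illustrative, not part of any framework's API:

```python
def conv2d_flops(in_channels, filters, kernel_h, kernel_w, out_h, out_w):
    """2 x multiply-accumulates for a standard convolutional layer."""
    return 2 * in_channels * filters * kernel_h * kernel_w * out_h * out_w

def dense_flops(in_features, out_features):
    """2 x inputs x outputs for a fully connected layer."""
    return 2 * in_features * out_features

# The two example layers above: a single-channel 256x256 input with padding
# that keeps the output at 256x256, then a 2048 -> 1024 fully connected layer.
total = conv2d_flops(1, 128, 5, 5, 256, 256) + dense_flops(2048, 1024)
print(f"{total / 1e6:.0f} million FLOPs per forward pass")   # ~424 million
```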

Petaflop & Exaflop Supercomputers

While desktops and smartphones operate in the gigaFLOPS and teraFLOPS range, the most powerful supercomputers compute at a massive petaFLOPS scale, with exaFLOPS machines on the way.

PetaFLOPS

1 petaFLOPS = 1,000 teraFLOPS = 10^15 FLOPS

To reach this scale, supercomputers interconnect thousands of high performance accelerator chips like GPUs and custom ASICs using high bandwidth networks.

The current world's fastest supercomputer, Fugaku, developed by RIKEN and Fujitsu, achieves roughly 442 petaFLOPS on the LINPACK benchmark. That's thousands of times faster than even a top-end consumer GPU today!

ExaFLOPS

1 exaFLOPS = 1,000 petaFLOPS = 10^18 FLOPS

Exascale supercomputers capable of exaFLOPS performance are expected in the next few years. Front-runners include:

  • Frontier Supercomputer – Targeting over 1.5 exaFLOPS using AMD EPYC CPUs and AMD Instinct GPUs

  • Aurora Supercomputer – Intel is targeting over 1 exaFLOPS using its Xeon Scalable processors and Xe GPUs

These massive systems push the limits on parallelism, power efficiency, and reliability at scale. They will enable breakthroughs in scientific research through massively increased simulation capability.

Optimizing and Reducing FLOPS

While FLOPS benchmarks make for nice marketing, you don't always need maximum FLOPS to get the job done. Here are some techniques to optimize or reduce FLOPS:

Precision

Lower precision doesn't change how many operations an algorithm needs, but it makes each one cheaper: twice as many FP16 values fit in a register as FP32, roughly doubling throughput, and INT8 or INT4 arithmetic can deliver a further 4-8x speedup on hardware that supports it.

Sparsity

Exploiting sparsity and skipping zero values avoids wasted FLOPS. This is common in neural networks.

Code Optimization

Efficient algorithms reduce unnecessary operations. Strength reduction transforms expensive operations like divide into simpler ones.

Quantization

Quantizing weights and activations down to low bitwidths like int8 approximates the original values while greatly reducing compute and memory cost; see the sketch below.
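
As a rough illustration, here is a minimal symmetric int8 quantization of a weight matrix using NumPy. Real frameworks add per-channel scales, calibration data, and fused int8 kernels, so treat this as a sketch of the idea rather than production code:

```python
import numpy as np

weights = np.random.randn(1024, 1024).astype(np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto the int8 range [-127, 127]
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to check how closely the 8-bit values approximate the originals
restored = q_weights.astype(np.float32) * scale
print("mean absolute error:", np.abs(weights - restored).mean())
```

Int8 storage is 4x smaller than FP32, and integer matrix math is far cheaper on hardware with dedicated int8 or tensor units.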

With these optimizations, you can get away with far fewer FLOPS while maintaining accuracy and performance for many applications. Clever coding trumps blindly chasing peak FLOPS!

The Road to ZettaFLOPS and Beyond

FLOPS have skyrocketed from early supercomputers that operated in the kiloFLOPS range to today's petaFLOPS and upcoming exaFLOPS machines. As the tapering of Moore's law slows raw hardware advances, innovations in new materials like graphene and in optical computing offer hope for continued exponential growth.

A zettaFLOPS machine performing 10^21 FLOPS may arrive by mid-century, enabling advanced AI and science fiction-like immersive worlds. There is still a long way to go in pushing the limits of computing performance across both hardware and software algorithms.

I hope this guide helped demystify the concept of FLOPS and how they relate to real-world computation. Let me know if you have any other questions! Whether you want to speed up your code or buy your next GPU, understanding these fundamentals of FLOPS will give you key insights into computer performance.
