Computers are crunching more numbers than ever to crack the most complex problems of our time: how to cure diseases like COVID and cancer, mitigate climate change and more.
These and other grand challenges ushered computing into today's exascale era, when high performance is most often measured in exaflops.
So, What’s an Exaflop?
An exaflop is a measure of performance for a supercomputer that can calculate at least 10^18, or one quintillion, floating point operations per second.
In exaflop, the exa- prefix means a quintillion, that's a billion billion, or a one followed by 18 zeros. Similarly, an exabyte is a memory subsystem packing a quintillion bytes of data.
The "flop" in exaflop is an abbreviation for floating point operation. The rate at which a system executes floating point operations each second is measured in exaflop/s.
Floating point refers to calculations where the numbers are expressed with decimal points.
1,000 Petaflops = an Exaflop
The prefix peta- means 10^15, or a one with 15 zeros behind it. So, an exaflop is a thousand petaflops.
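A quick sanity check of those prefixes in Python, using nothing beyond the definitions above:

```python
PETA = 10**15  # peta- prefix: a one with 15 zeros
EXA = 10**18   # exa- prefix: a one with 18 zeros

# An exaflop is a thousand petaflops
print(EXA // PETA)  # prints 1000
```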
To get a sense of what a heady calculation an exaflop is, imagine a billion people, each holding a billion calculators. (Clearly, they've got big hands!)
If they all hit the equal sign at the same time, they'd execute one exaflop.
Indiana University, home to Big Red 200 and several other supercomputers, puts it this way: to match what an exaflop computer can do in just one second, you'd have to perform one calculation every second for 31,688,765,000 years.
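That figure is easy to check yourself: divide one quintillion calculations by the number of seconds in a year (about 31.6 million here, assuming a tropical year of 31,556,926 seconds):

```python
CALCULATIONS = 10**18          # one exaflop's worth of work, done in one second
SECONDS_PER_YEAR = 31_556_926  # approximate length of a tropical year

years = CALCULATIONS / SECONDS_PER_YEAR
print(f"{years:,.0f} years")   # roughly 31.7 billion years
```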
A Brief History of the Exaflop
For most of supercomputing's history, a flop was a flop, a reality that's changing as workloads embrace AI.
People used numbers expressed in the most precise of several precision formats, called double precision, as defined by the IEEE Standard for Floating-Point Arithmetic. It's dubbed double precision, or FP64, because each number in a calculation requires 64 bits, data nuggets expressed as a zero or one. By contrast, single precision uses 32 bits.
Double precision uses those 64 bits to ensure each number is accurate to a tiny fraction. It's like saying 1.0001 + 1.0001 = 2.0002, instead of 1 + 1 = 2.
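The gap between the two formats is easy to see. A minimal sketch, assuming NumPy is available: machine epsilon, the spacing between 1.0 and the next representable number, is about a billion times smaller in FP64 than in FP32.

```python
import numpy as np

# Machine epsilon: the gap between 1.0 and the next representable number
print(np.finfo(np.float64).eps)  # 2**-52, about 2.2e-16
print(np.finfo(np.float32).eps)  # 2**-23, about 1.2e-7
```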
The format is a great fit for what made up the bulk of the workloads at the time: simulations of everything from atoms to airplanes that need to ensure their results come close to what they represent in the real world.
So it was natural that the LINPACK benchmark, aka HPL, which measures performance on FP64 math, became the default measurement in 1993, when the TOP500 list of the world's most powerful supercomputers debuted.
The Big Bang of AI
A decade ago, the computing industry heard what NVIDIA CEO Jensen Huang describes as the big bang of AI.
This powerful new form of computing started showing significant results on scientific and business applications. And it takes advantage of some very different mathematical methods.
Deep learning is not about simulating real-world objects; it's about sifting through mountains of data to find patterns that enable fresh insights.
Its math demands high throughput, so doing many, many calculations with simplified numbers (like 1.01 instead of 1.0001) is much better than doing fewer calculations with more complex ones.
That's why AI uses lower-precision formats like FP32, FP16 and FP8. Their 32-, 16- and 8-bit numbers let users do more calculations faster.
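The trade-off is easy to demonstrate: as the formats shrink, fine detail in the numbers gets rounded away. A small NumPy sketch (FP8 is not a native NumPy type, so only FP64, FP32 and FP16 are shown):

```python
import numpy as np

x = 1.0001
print(np.float64(x))  # 1.0001: the detail survives
print(np.float32(x))  # about 1.0001, slightly rounded
print(np.float16(x))  # 1.0: the detail is rounded away entirely
```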
Mixed Precision Evolves
For AI, using 64-bit numbers would be like taking your whole closet along for a weekend trip.
Finding the best lower-precision technique for AI is an active area of research.
For example, the first NVIDIA Tensor Core GPU, Volta, used mixed precision. It executed matrix multiplication in FP16, then accumulated the results in FP32 for higher accuracy.
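The idea can be sketched in a few lines. This is a toy model of the approach, not how Tensor Cores are actually implemented: multiply pairs of numbers in FP16, but keep the running sum in FP32 so small products don't get lost.

```python
import numpy as np

def mixed_precision_dot(a, b):
    """Dot product: FP16 multiplies, FP32 accumulation (toy illustration)."""
    total = np.float32(0.0)
    for x, y in zip(a, b):
        product = np.float16(x) * np.float16(y)  # multiply in FP16
        total += np.float32(product)             # accumulate in FP32
    return total

a = [0.1] * 1000
b = [0.1] * 1000
print(mixed_precision_dot(a, b))  # close to 10.0, despite 16-bit inputs
```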
Hopper Accelerates With FP8
More recently, the NVIDIA Hopper architecture debuted with a lower-precision method for training AI that's even faster. The Hopper Transformer Engine automatically analyzes a workload, adopts FP8 whenever possible and accumulates results in FP32.
When it comes to the less compute-intensive job of inference (running AI models in production), leading frameworks such as TensorFlow and PyTorch support 8-bit integer numbers for fast performance. That's because the calculations don't need decimal points to do their work.
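A rough sketch of why 8-bit integers can stand in for real-valued weights at inference time: a scale factor maps the floats onto the integer range, and maps them back when needed. This is a generic linear-quantization illustration, not the exact scheme any particular framework uses.

```python
import numpy as np

def quantize_int8(values):
    """Linear quantization of floats to int8 (generic illustration)."""
    scale = np.abs(values).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(values / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.array([0.52, -1.3, 0.07, 0.9], dtype=np.float32)
q, scale = quantize_int8(weights)
print(q)                     # small integers, no decimal points
print(dequantize(q, scale))  # approximately the original weights
```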
The good news is NVIDIA GPUs support all these precision formats, so users can accelerate every workload optimally.
Last year, the IEEE P3109 committee began work on an industry standard for precision formats used in machine learning. The effort may take another year or two.
Some Sims Shine at Lower Precision
While FP64 remains popular for simulations, many use lower-precision math when it delivers useful results faster.
For example, researchers run LS-Dyna from Ansys, a popular simulator for car crashes, in FP32. Genomics is another field that tends to favor lower-precision math.
In addition, many traditional simulations are starting to adopt AI for at least part of their workflows. As workloads shift toward AI, supercomputers need to support lower precision to run these emerging applications well.
Benchmarks Evolve With Workloads
Recognizing these changes, researchers including Jack Dongarra, the 2021 Turing Award winner and a contributor to HPL, debuted HPL-AI in 2019, a new benchmark better suited to measuring these emerging workloads.
"Mixed-precision techniques have become increasingly important to improve the computing efficiency of supercomputers, both for traditional simulations with iterative refinement techniques as well as for AI applications," Dongarra said in a 2019 blog. "Just as HPL allows benchmarking of double-precision capabilities, this new approach based on HPL allows benchmarking of mixed-precision capabilities of supercomputers at scale."
Thomas Lippert, director of the Jülich Supercomputing Centre, agreed.
"We're using the HPL-AI benchmark because it's a good measure of the mixed-precision work in a growing number of our AI and scientific workloads, and it reflects accurate 64-bit floating point results, too," he said in a blog posted last year.
Today's Exaflop Systems
In a June report, 20 supercomputer centers around the world reported their HPL-AI results, three of them delivering more than an exaflop.
One of those systems, a supercomputer at Oak Ridge National Laboratory, also exceeded an exaflop in FP64 performance on HPL.
Two years ago, a very unconventional system was the first to hit an exaflop. The crowd-sourced supercomputer assembled by the Folding@home consortium passed the milestone after it put out a call for help fighting the COVID-19 pandemic and was deluged with donated time on more than a million computers.
Exaflop in Theory and Practice
Since then, many organizations have installed supercomputers that deliver more than an exaflop of theoretical peak performance. It's worth noting that the TOP500 list reports both Rmax (actual) and Rpeak (theoretical) scores.
Rmax is simply the best performance a computer has actually demonstrated.
Rpeak is a system's top theoretical performance if everything could run at its highest possible level, something that almost never really happens. It's typically calculated by multiplying the number of processors in a system by their clock speed, then multiplying the result by the number of floating point operations the processors can perform in a single clock cycle.
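That calculation can be sketched as follows; the figures here are hypothetical, not any real machine's specs. Note the units: cores times clock speed gives cycles per second, and multiplying by flops per cycle yields flops per second.

```python
def rpeak_exaflops(num_processors, clock_hz, flops_per_cycle):
    """Theoretical peak: processors x cycles/second x flops/cycle, in exaflops."""
    return num_processors * clock_hz * flops_per_cycle / 1e18

# Hypothetical system: 8 million cores at 2 GHz, 32 flops per cycle each
print(rpeak_exaflops(8_000_000, 2.0e9, 32))  # 0.512 exaflops
```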
So, if someone says their system can do an exaflop, consider asking whether that's Rmax (actual) or Rpeak (theoretical).
Many Metrics in the Exaflop Age
It's another one of the many nuances in this new exascale era.
And it's worth noting that HPL and HPL-AI are synthetic benchmarks, meaning they measure performance on math routines rather than real-world applications. Other benchmarks, like MLPerf, are based on real-world workloads.
In the end, the best measure of a system's performance, of course, is how well it runs a user's applications. That's a measure based not on exaflops, but on ROI.