GTC Nvidia said the lines are blurring between standard C++ and Nv’s CUDA C++ library when it comes to parallel execution of code.
C++ itself is “starting to enable parallel algorithms and asynchronous execution as first-class components of the language,” said Stephen Jones, CUDA architect at Nvidia, during a break-out session on CUDA at Nv’s GPU Technology Conference (GTC) this week.
“I think, by far the most exciting move for standard C++ in that direction,” Jones added.
A C++ committee is developing an asynchronous programming abstraction layer involving senders and receivers, which can schedule work to run within generic execution contexts. A context might be a CPU thread doing mainly IO, or a CPU or GPU thread doing intensive computation. This management is not tied to specific hardware. “This is a framework for orchestrating parallel execution, writing your own portable parallel algorithms [with an] emphasis on portability,” Jones said.
A paper proposing the design said the language lacks the “standard vocabulary and framework for asynchrony and parallelism that C++ programmers desperately need.” The draft lists, among others, Michael Garland, senior director of programming systems and applications at Nvidia, as a proposer.
The paper noted that “[std::async/std::future/std::promise], C++11’s intended exposure for asynchrony, is inefficient, hard to use correctly, and severely lacking in genericity, making it unusable in many contexts. We introduced parallel algorithms to the C++ Standard Library in C++17, and while they are an excellent start, they are all inherently synchronous and not composable.”
Senders and receivers are a unifying point for running workloads across a range of targets and programming models, and are designed for heterogeneous systems, Jones said.
“The idea with senders and receivers is that you can express execution dependencies and compose together asynchronous task graphs in standard C++,” Jones said. “I can target CPUs or GPUs, single thread, multi thread, even multi GPU.”
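To give a flavor of what composing a task graph means, here is a deliberately tiny toy model in plain C++17. It borrows the names `just`, `then`, and `sync_wait` from the proposal's vocabulary, but everything below, including `demo_pipeline`, is a hypothetical simplification: it runs on the calling thread and has no schedulers, error handling, or cancellation, all of which the real design covers.

```cpp
#include <utility>

// Toy model only: these types imitate the shape of the idea, not the
// proposed standard interfaces.

// A sender describes work lazily. Nothing runs until start() is called
// with a receiver, which here is just a callable that accepts the value.
template <class T>
struct just_sender {
    T value;
    template <class Receiver>
    void start(Receiver&& r) const { r(value); }
};

// Wraps an upstream sender and applies fn to whatever it produces,
// passing the result on to the downstream receiver.
template <class Upstream, class Fn>
struct then_sender {
    Upstream upstream;
    Fn fn;
    template <class Receiver>
    void start(Receiver&& r) const {
        upstream.start([&](auto&& v) { r(fn(std::forward<decltype(v)>(v))); });
    }
};

template <class T>
just_sender<T> just(T v) { return {std::move(v)}; }

template <class S, class Fn>
then_sender<S, Fn> then(S s, Fn f) { return {std::move(s), std::move(f)}; }

// Starts the graph on the calling thread and hands back the final value.
template <class R, class S>
R sync_wait(const S& s) {
    R result{};
    s.start([&](auto&& v) { result = v; });
    return result;
}

// Build the graph first, run it later: (20 + 1) * 2.
inline int demo_pipeline() {
    auto graph = then(then(just(20), [](int x) { return x + 1; }),
                      [](int x) { return x * 2; });
    return sync_wait<int>(graph);
}
```

The point of the sketch is that `graph` is only a description of dependent steps until `sync_wait` starts it. In the proposed design, that separation is what lets the same description be handed to different execution contexts.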
This is all good news for Nvidia, for one, as it should make it easier for people to write software to run across its GPUs, DPUs, CPUs, and other chips. Nvidia’s CUDA C++ library, libcu++, which already provides a “heterogeneous implementation” of the standard C++ library, is available online for HPC and CUDA devs.
At GTC, Nvidia emitted more than 60 updates to its libraries, including frameworks for quantum computing, 6G networks, robotics, cybersecurity, and drug discovery.
“With each new SDK, new science, new applications and new industries can tap into the power of Nvidia computing. These SDKs tackle the immense complexity at the intersection of computing algorithms and science,” CEO Jensen Huang said during a keynote on Tuesday.
Nvidia also introduced the Hopper H100 GPU, which Jones said had features to speed up processing by minimizing data movement and keeping information local.
“There’s some profound new architectural features which change the way we program the GPU. It takes the asynchrony steps that we started making in the A100 and moves them forward,” Jones said.
One such improvement is 132 streaming-multiprocessor (SM) units in the H100, up from 15 in Kepler. “There’s this ability to scale across SMs that is at the core of the CUDA programming model,” Jones said.
There’s another feature called the thread block cluster, in which multiple thread blocks operate concurrently across multiple SMs, exchanging data in a synchronized way. Jones called it a “block of blocks” with 16,384 concurrent threads in a cluster.
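One way to arrive at the 16,384 figure is sketched below. The two limits used here are assumptions not stated in the talk as reported: a maximum of 16 thread blocks per cluster and 1,024 threads per block. The constant and function names are purely illustrative.

```cpp
// Assumed limits (not from the article): 16 blocks per cluster,
// 1,024 threads per block.
constexpr int kAssumedBlocksPerCluster = 16;
constexpr int kAssumedThreadsPerBlock = 1024;

// Total concurrent threads in one cluster, the "block of blocks".
constexpr int threads_per_cluster(int blocks, int threads_per_block) {
    return blocks * threads_per_block;
}

// Under the assumptions above, the product matches the quoted figure.
static_assert(threads_per_cluster(kAssumedBlocksPerCluster,
                                  kAssumedThreadsPerBlock) == 16384,
              "16 blocks x 1,024 threads should give 16,384 threads");
```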
“By adding a cluster to the execution hierarchy, we are allowing an application to take advantage of faster local synchronization, faster memory sharing, all sorts of other good things like that,” Jones said.
Another asynchronous execution feature is a new Tensor Memory Accelerator (TMA) unit, which the company says transfers large data blocks efficiently between global memory and shared memory, and asynchronously copies between thread blocks in a cluster.
Jones called TMA “a self-contained data movement engine” that is a separate hardware unit in the SM that runs independently of SM threads. “Instead of every thread in the block participating in the asynchronous memory copy, the TMA can take over and handle all the loops and address calculations for you,” Jones said.
Nvidia has also added an asynchronous transaction barrier in which waiting threads can sleep until all other threads arrive, for atomic data transfer and synchronization purposes.
“You just say ‘Wake me up when the data has arrived.’ I can have my thread waiting … expecting data from lots of different places and only wake up when it’s all arrived,” Jones said. “It’s seven times faster than normal communication. I don’t have all that back and forth. It’s just a single write operation.”
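The "wake me up when it's all arrived" idea can be modeled on the host in a few lines. This is a hypothetical, single-threaded toy, not a CUDA or libcu++ API: it assumes the barrier tracks both the threads still to arrive and the bytes still in flight, and it only reports when a waiter would be released.

```cpp
#include <cstddef>

// Hypothetical host-side model of a transaction barrier. It is not
// thread-safe and does not put anything to sleep; it only tracks state.
class ToyTransactionBarrier {
public:
    ToyTransactionBarrier(std::size_t expected_threads,
                          std::size_t expected_bytes)
        : pending_threads_(expected_threads), pending_bytes_(expected_bytes) {}

    // One participating thread signals that it has reached the barrier.
    void arrive() {
        if (pending_threads_ > 0) --pending_threads_;
    }

    // A data transfer reports that `bytes` have landed.
    void complete_bytes(std::size_t bytes) {
        pending_bytes_ = (bytes >= pending_bytes_) ? 0 : pending_bytes_ - bytes;
    }

    // A waiter would be woken only once nothing is outstanding.
    bool ready() const {
        return pending_threads_ == 0 && pending_bytes_ == 0;
    }

private:
    std::size_t pending_threads_;
    std::size_t pending_bytes_;
};
```

For instance, a barrier built as `ToyTransactionBarrier b(2, 256)` stays not-ready after both threads arrive if only 128 bytes have landed, and becomes ready once the remaining 128 bytes are reported.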
Nvidia also sped up runtime compilation, in which source code is handed to CUDA to be compiled while the application is running.
“We streamline the internals of both the CUDA C++ and PTX compilers,” Jones said, adding, “we’ve also made the runtime compiler multithreaded, which can halve the compilation time if you’re using more CPU threads.”
More news on the compiler front is support for C++20, which will come out in the upcoming CUDA 11.7 release.
“It’s not yet going to be available on Microsoft Visual Studio, that’s coming in the following release, but it means that you can use C++ 20 in both your host and your device code,” Jones said. ®