Hopper
GPU microarchitecture designed by Nvidia

| Launched | September 20, 2022 |
| --- | --- |
| Designed by | Nvidia |
| Manufactured by | TSMC |
| Fabrication process | TSMC 4N |
| L1 cache | 256 KB (per SM) |
| L2 cache | 50 MB |
| Memory support | HBM3 |
| PCIe support | PCI Express 5.0 |
| Encoder(s) supported | NVENC |
| Predecessor | Ampere |
| Variant | Ada Lovelace (consumer and professional) |
| Successor | Blackwell |
Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is used alongside the Ada Lovelace microarchitecture, its consumer and professional counterpart. It is the latest generation of the product line formerly branded as Nvidia Tesla.
Named for computer scientist and United States Navy rear admiral Grace Hopper, the Hopper architecture was leaked in November 2019 and officially revealed in March 2022. It improves upon its predecessors, the Turing and Ampere microarchitectures, featuring a new streaming multiprocessor and a faster memory subsystem.
Architecture
The Nvidia Hopper H100 GPU is implemented using the TSMC 4N process with 80 billion transistors. It consists of up to 144 streaming multiprocessors. In the SXM5 form factor, the Nvidia Hopper H100 offers higher performance than in the PCIe form factor.
Streaming multiprocessor
The streaming multiprocessors for Hopper improve upon the Turing and Ampere microarchitectures, although the maximum number of concurrent warps per streaming multiprocessor (SM) remains the same as in Ampere, at 64.
The Hopper architecture provides a Tensor Memory Accelerator (TMA), which supports bidirectional asynchronous memory transfer between shared memory and global memory. Under TMA, applications may transfer up to 5D tensors. When writing from shared memory to global memory, elementwise reduction and bitwise operators may be used, avoiding registers and SM instructions while enabling users to write warp-specialized code. TMA is exposed through cuda::memcpy_async.
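As a minimal sketch (the kernel name, tile size, and elementwise work are illustrative assumptions, not from the source), a thread block can stage a tile of global memory into shared memory asynchronously through cuda::memcpy_async:

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>

// Illustrative kernel: asynchronously stage a 256-element tile into shared
// memory, wait on a block-scoped barrier, then do elementwise work on it.
// Launch with 256 threads per block.
__global__ void tile_copy(const float* __restrict__ src,
                          float* __restrict__ dst) {
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());  // one arrival per thread in the block
    }
    block.sync();

    int base = blockIdx.x * 256;
    // Cooperative asynchronous copy from global to shared memory; on
    // Hopper, such bulk copies are the path through which TMA is exposed.
    cuda::memcpy_async(block, tile, src + base, sizeof(float) * 256, bar);
    bar.arrive_and_wait();  // block until the tile is resident

    int i = block.thread_rank();
    dst[base + i] = tile[i] * 2.0f;  // arbitrary elementwise work
}
```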
When parallelizing applications, developers can use thread block clusters. Thread blocks may perform atomic operations in the shared memory of other thread blocks within their cluster, a capability known as distributed shared memory. Distributed shared memory may be used by an SM simultaneously with the L2 cache; when used to communicate data between SMs, this can utilize the combined bandwidth of distributed shared memory and L2. The maximum portable cluster size is 8, although the Nvidia Hopper H100 can support a cluster size of 16 by setting the cudaFuncAttributeNonPortableClusterSizeAllowed attribute, potentially at the cost of a reduced number of active blocks.
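As a hedged sketch assuming CUDA 12's extended launch API (the kernel, grid dimensions, and data sizes are illustrative), a cluster can be requested at launch time, and blocks can then address each other's shared memory through cooperative groups:

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Illustrative kernel: each block exposes its shared tile to the other
// blocks in its cluster (distributed shared memory). Requires sm_90.
__global__ void cluster_kernel(float* out) {
    __shared__ float tile[128];
    cg::cluster_group cluster = cg::this_cluster();

    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    cluster.sync();  // make every block's shared memory visible cluster-wide

    // Read the corresponding element from block 0's shared memory.
    float* remote = cluster.map_shared_rank(tile, 0);
    out[blockIdx.x * blockDim.x + threadIdx.x] = remote[threadIdx.x];
    cluster.sync();  // keep remote shared memory alive until reads finish
}

int main() {
    float* d_out;
    cudaMalloc(&d_out, 16 * 128 * sizeof(float));

    cudaLaunchConfig_t config = {};
    config.gridDim = dim3(16);   // total blocks; divisible by cluster size
    config.blockDim = dim3(128);

    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim = {8, 1, 1};  // 8 is the portable maximum
    config.attrs = &attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, cluster_kernel, d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```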
With L2 multicasting and distributed shared memory, the required bandwidth for dynamic random-access memory reads and writes is reduced.[7]
Hopper features improved single-precision floating-point format (FP32) throughput, with twice as many FP32 operations per cycle per SM as its predecessor. Additionally, the Hopper architecture adds support for new instructions, including those used in implementing the Smith–Waterman algorithm.
Like Ampere, Hopper supports TensorFloat-32 (TF32) arithmetic. The mapping pattern for both architectures is identical.
Memory
The Nvidia Hopper H100 supports HBM3 and HBM2e memory up to 80 GB; the HBM3 memory system provides 3 TB/s of bandwidth, an increase of 50% over the Nvidia Ampere A100's 2 TB/s. Across the architecture, the L2 cache capacity and bandwidth were increased.
Hopper allows CUDA compute kernels to utilize automatic inline compression, including in individual memory allocations, which allows accessing memory at higher bandwidth. This feature does not increase the amount of memory available to the application, because the data (and thus its compressibility) may be changed at any time. The compressor will automatically choose between several compression algorithms.
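Compression is requested per allocation. The sketch below assumes the CUDA driver API's virtual memory management path; the helper name and sizes are illustrative, and error handling and context setup are omitted:

```cuda
#include <cuda.h>

// Hedged sketch: requesting a compressible device allocation through the
// CUDA driver API's virtual memory management functions.
int allocate_compressible(CUdeviceptr* ptr, size_t size, int device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    // Ask for generic compression; the hardware picks the scheme.
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t padded = ((size + granularity - 1) / granularity) * granularity;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, padded, &prop, 0);     // physical allocation
    cuMemAddressReserve(ptr, padded, 0, 0, 0);  // virtual address range
    cuMemMap(*ptr, padded, 0, handle, 0);       // map physical into virtual

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(*ptr, padded, &access, 1);   // enable read/write access
    return 0;
}
```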
The Nvidia Hopper H100 increases the capacity of the combined L1 cache, texture cache, and shared memory to 256 KB. Like its predecessors, it combines the L1 and texture caches into a unified cache designed to act as a coalescing buffer. The attribute cudaFuncAttributePreferredSharedMemoryCarveout may be used to define the shared memory carveout of this unified cache, as sketched below. Hopper also introduces enhancements to NVLink through a new generation with faster overall communication bandwidth.
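A minimal sketch of the carveout hint (the kernel is illustrative; the integer is a percentage preference, not a guarantee):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* illustrative body */ }

int main() {
    // Hint that roughly half of the unified L1/texture/shared storage
    // should be carved out as shared memory for this kernel.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);

    float* d_data;
    cudaMalloc(&d_data, 128 * sizeof(float));
    my_kernel<<<1, 128>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```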
Memory synchronization domains
Some CUDA applications may experience interference when performing fence or flush operations due to memory ordering. Because the GPU cannot know which writes are guaranteed to be visible and which are visible only by chance timing, it may wait on unnecessary memory operations, thus slowing down fence or flush operations. For example, when a kernel performs computations in GPU memory and a parallel kernel performs communication with a peer, the local kernel will flush its writes, resulting in slower NVLink or PCIe writes. In the Hopper architecture, memory synchronization domains allow the GPU to narrow the net cast by a fence operation, so it waits only on the memory traffic of its own domain.
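As a hedged sketch assuming CUDA 12's launch attribute interface (the kernel and sizes are illustrative), a communication-heavy launch can be placed in a separate domain so that its fences do not wait on unrelated traffic:

```cuda
#include <cuda_runtime.h>

__global__ void compute_kernel(float* data) { /* illustrative body */ }

int main() {
    float* d_data;
    cudaMalloc(&d_data, 4096 * sizeof(float));

    cudaLaunchConfig_t config = {};
    config.gridDim = dim3(32);
    config.blockDim = dim3(128);

    // Place this launch in the "remote" domain, keeping its fences from
    // ordering against unrelated writes in the default domain.
    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeMemSyncDomain;
    attr.val.memSyncDomain = cudaLaunchMemSyncDomainRemote;
    config.attrs = &attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, compute_kernel, d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```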
DPX instructions
The Hopper architecture math application programming interface (API) exposes functions in the SM such as __viaddmin_s16x2_relu, which performs the per-halfword operation max(min(a + b, c), 0). In the Smith–Waterman algorithm, __vimax3_s16x2_relu can be used, a three-way max followed by a clamp to zero.[12] Similarly, Hopper speeds up implementations of the Needleman–Wunsch algorithm.[13]
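A minimal sketch of one such DPX intrinsic (the kernel and data layout are illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each unsigned int packs two signed 16-bit lanes.
__global__ void dpx_demo(const unsigned int* a, const unsigned int* b,
                         const unsigned int* c, unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Per-halfword max(min(a + b, c), 0): a fused add, min, and ReLU,
    // matching the inner recurrence of Smith-Waterman-style dynamic
    // programming. On sm_90 this maps to a single DPX instruction.
    out[i] = __viaddmin_s16x2_relu(a[i], b[i], c[i]);
}
```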
Transformer engine
The Hopper architecture utilizes a transformer engine, which dynamically mixes lower-precision FP8 with FP16 calculations in transformer workloads to increase throughput while preserving accuracy.[14]
Power efficiency
The SXM5 form factor H100 has a thermal design power (TDP) of 700 watts. Owing to its greater asynchrony, the Hopper architecture may attain high degrees of utilization and thus may achieve better performance per watt.
Grace Hopper
Grace Hopper GH200

| Designed by | Nvidia |
| --- | --- |
| Manufactured by | TSMC |
| Fabrication process | TSMC 4N |
| Codename(s) | Grace Hopper |
| Compute | GPU: 132 Hopper SMs; CPU: 72 Neoverse V2 cores |
| Shader clock rate | 1980 MHz |
| Memory support | GPU: 96 GB HBM3 or 144 GB HBM3e; CPU: 480 GB LPDDR5X |
The GH200 combines a Hopper-based H200 GPU with a Grace-based 72-core CPU on a single module. The total power draw of the module is up to 1000 W. CPU and GPU are connected via NVLink, which provides memory coherence between CPU and GPU memory.
[16]
History
In November 2019, a well-known Twitter account posted a tweet revealing that the next architecture after Ampere would be called Hopper, named after computer scientist and United States Navy rear admiral Grace Hopper, one of the first programmers of the Harvard Mark I. The account stated that Hopper would be based on a multi-chip module design, which would result in a yield gain with lower wastage.[17]
During the 2022 Nvidia GTC, Nvidia officially announced Hopper.[18]
By 2023, during the AI boom, H100s were in great demand. Larry Ellison of Oracle Corporation said that year that at a dinner with Nvidia CEO Jensen Huang, he and Elon Musk of Tesla, Inc. and xAI "were begging" for H100s, "I guess is the best way to describe it. An hour of sushi and begging".[19]
References
Citations
- ^ Vishal Mehta (September 2022). CUDA Programming Model for Hopper Architecture. Santa Clara: Nvidia. Retrieved May 29, 2023.
- ^ Tirumala, Ajay; Eaton, Joe; Tyrlik, Matt (December 8, 2022). "Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions". Nvidia. Retrieved May 29, 2023.
- ^ Harris, Dion (March 22, 2022). "NVIDIA Hopper GPU Architecture Accelerates Dynamic Programming Up to 40x Using New DPX Instructions". Nvidia. Retrieved May 29, 2023.
- ^ Salvator, Dave (March 22, 2022). "H100 Transformer Engine Supercharges AI Training, Delivering Up to 6x Higher Performance Without Losing Accuracy". Nvidia. Retrieved May 29, 2023.
- ^ "NVIDIA: Grace Hopper Has Entered Full Production & Announcing DGX GH200 AI Supercomputer". AnandTech. May 29, 2023.
- ^ Pirzada, Usman (November 16, 2019). "NVIDIA Next Generation Hopper GPU Leaked – Based On MCM Design, Launching After Ampere". Wccftech. Retrieved May 29, 2023.
- ^ Vincent, James (March 22, 2022). "Nvidia reveals H100 GPU for AI and teases 'world's fastest AI supercomputer'". The Verge. Retrieved May 29, 2023.
- ^ Fitch, Asa (February 26, 2024). "Nvidia's Stunning Ascent Has Also Made It a Giant Target". The Wall Street Journal. Retrieved February 27, 2024.
Works cited

- Elster, Anne; Haugdahl, Tor (March 2022). "Nvidia Hopper GPU and Grace CPU Highlights". Computing in Science & Engineering. 24 (2): 95–100. Bibcode:2022CSE....24b..95E. doi:10.1109/MCSE.2022.3163817. hdl:11250/3051840. S2CID 249474974. Retrieved May 29, 2023.
- Fujita, Kohei; Yamaguchi, Takuma; Kikuchi, Yuma; Ichimura, Tsuyoshi; Hori, Muneo; Maddegedara, Lalith (April 2023). "Calculation of cross-correlation function accelerated by TensorFloat-32 Tensor Core operations on NVIDIA's Ampere and Hopper GPUs". Journal of Computational Science. 68. doi:10.1016/j.jocs.2023.101986.
- CUDA C++ Programming Guide (PDF). Nvidia. April 17, 2023.
- Hopper Tuning Guide (PDF). Nvidia. April 13, 2023.
- NVIDIA H100 Tensor Core GPU Architecture (PDF). Nvidia. 2022.