NVIDIA Hopper H100 GPU Is Even More Powerful In Latest Specs, Up To 67 TFLOPs Single-Precision Compute Listed

NVIDIA has published the official specs of its Hopper H100 GPU, which is more powerful than we had expected.

NVIDIA Hopper H100 GPU Specs Updated, Now Features Even Faster 67 TFLOPs FP32 Compute Horsepower

When NVIDIA announced its Hopper H100 GPU for AI datacenters earlier this year, the company published figures of up to 60 TFLOPs FP32 and 30 TFLOPs FP64. However, as the launch draws near, the company has updated the specs to reflect more realistic expectations and, as it turns out, the flagship and fastest chip for the AI segment is now even faster.

One reason the compute numbers have seen a boost is that once the chip goes through production, the GPU manufacturer can finalize the numbers based on actual clock speeds. It’s likely that NVIDIA used conservative clock figures for the preliminary performance figures and, as production hit full swing, the company saw that the chip can offer much better clocks.
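To see how a clock bump alone could account for the revised figure, here is a rough sketch that works backwards from peak FP32 throughput to the clock it implies. It assumes each FP32 CUDA core performs one fused multiply-add (2 floating-point operations) per cycle, which is a common rule of thumb and not something NVIDIA states here, and it uses the 16,896-core count listed for the SXM5 part. The function name is made up for illustration.

```python
# Back-of-the-envelope: which clock does a peak-FP32 figure imply?
# Assumption (not from the article): each FP32 CUDA core retires one fused
# multiply-add, i.e. 2 floating-point operations, per clock cycle.
FLOPS_PER_CORE_PER_CLOCK = 2

SXM5_FP32_CORES = 16896  # H100 SXM5 core count listed in the article


def implied_clock_mhz(tflops: float, cuda_cores: int) -> float:
    """Return the clock in MHz needed to reach `tflops` with `cuda_cores`."""
    flops = tflops * 1e12
    return flops / (cuda_cores * FLOPS_PER_CORE_PER_CLOCK) / 1e6


preliminary = implied_clock_mhz(60, SXM5_FP32_CORES)
updated = implied_clock_mhz(67, SXM5_FP32_CORES)
print(f"60 TFLOPs implies about {preliminary:.0f} MHz")
print(f"67 TFLOPs implies about {updated:.0f} MHz")
```

Under that assumption, the preliminary figure corresponds to a clock a little under 1.8 GHz and the updated one to a clock just under 2 GHz. NVIDIA has not published final boost clocks, so treat these as estimates only.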

Last month at GTC, NVIDIA confirmed that its Hopper H100 GPU was in full production and that partners would be rolling out the first wave of products in October this year. It was also confirmed that the global rollout for Hopper will include three phases. The first will be pre-orders for NVIDIA DGX H100 systems and free hands-on labs offered to customers directly from NVIDIA, with systems such as Dell’s PowerEdge servers, which are now available on NVIDIA LaunchPad.

NVIDIA Hopper H100 GPU Specs At A Glance

So coming to the specs, the NVIDIA Hopper GH100 GPU consists of a massive 144 SM (Streaming Multiprocessor) chip layout, spread across a total of 8 GPCs. Each GPC houses a total of 9 TPCs, which are further composed of 2 SM units each. This gives us 18 SMs per GPC and 144 on the complete 8-GPC configuration. Each SM consists of up to 128 FP32 units, which should give us a total of 18,432 CUDA cores.

Following are some of the configurations you can expect from the H100 chip:

The full implementation of the GH100 GPU includes the following units:

  • 8 GPCs, 72 TPCs (9 TPCs/GPC), 2 SMs/TPC, 144 SMs per full GPU
  • 128 FP32 CUDA Cores per SM, 18432 FP32 CUDA Cores per full GPU
  • 4 Fourth-Generation Tensor Cores per SM, 576 per full GPU
  • 6 HBM3 or HBM2e stacks, 12 512-bit Memory Controllers
  • 60 MB L2 Cache
  • Fourth-Generation NVLink and PCIe Gen 5

The NVIDIA H100 GPU with SXM5 board form-factor includes the following units:

  • 8 GPCs, 66 TPCs, 2 SMs/TPC, 132 SMs per GPU
  • 128 FP32 CUDA Cores per SM, 16896 FP32 CUDA Cores per GPU
  • 4 Fourth-generation Tensor Cores per SM, 528 per GPU
  • 80 GB HBM3, 5 HBM3 stacks, 10 512-bit Memory Controllers
  • 50 MB L2 Cache
  • Fourth-Generation NVLink and PCIe Gen 5
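The unit counts in both lists follow directly from the TPC count and the per-TPC and per-SM ratios. The short sketch below re-derives them as a sanity check; the `GpuConfig` class and its field names are made up for illustration and are not part of any NVIDIA tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical helper that derives unit totals from the TPC count."""

    name: str
    tpcs: int
    sms_per_tpc: int = 2          # 2 SMs per TPC, as listed above
    fp32_cores_per_sm: int = 128  # 128 FP32 CUDA cores per SM
    tensor_cores_per_sm: int = 4  # 4 fourth-generation Tensor Cores per SM

    @property
    def sms(self) -> int:
        return self.tpcs * self.sms_per_tpc

    @property
    def fp32_cores(self) -> int:
        return self.sms * self.fp32_cores_per_sm

    @property
    def tensor_cores(self) -> int:
        return self.sms * self.tensor_cores_per_sm


# Full GH100: 8 GPCs with 9 TPCs each. H100 SXM5: 66 TPCs enabled.
full_gh100 = GpuConfig("GH100 (full)", tpcs=8 * 9)
h100_sxm5 = GpuConfig("H100 SXM5", tpcs=66)

for cfg in (full_gh100, h100_sxm5):
    print(f"{cfg.name}: {cfg.sms} SMs, {cfg.fp32_cores} FP32 cores, "
          f"{cfg.tensor_cores} Tensor Cores")
```

The derived totals match the listed 144 SMs, 18,432 cores and 576 Tensor Cores for the full die, and 132 SMs, 16,896 cores and 528 Tensor Cores for the SXM5 part.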

The full GH100’s 18,432 FP32 CUDA cores are a 2.25x increase over the 8,192 cores of the full GA100 GPU configuration. NVIDIA is also leveraging more FP64, FP16 and Tensor cores within its Hopper GPU, which would drive up performance immensely. That is going to be a necessity to rival Intel’s Ponte Vecchio, which is also expected to feature 1:1 FP64. NVIDIA states that the 4th Gen Tensor Cores on Hopper deliver 2 times the performance at the same clock.

The following NVIDIA Hopper H100 performance breakdown shows that the extra SMs account for only a 20% performance increase. The main benefit comes from the 4th Gen Tensor Cores and the FP8 compute path. Higher frequency also adds a decent 30% uplift to the mix.
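As a rough illustration of how these contributions stack up, the sketch below multiplies the individual factors together. The SM and frequency figures come from the breakdown above and the Tensor Core figure is NVIDIA’s stated 2x at the same clock. The 2x FP8 factor is inferred from the article’s 4000 TFLOPs FP8 versus 2000 TFLOPs FP16 figures, and treating the gains as independent multipliers is an assumption made here, not something NVIDIA has confirmed.

```python
from math import prod

# Per-factor gains of H100 over A100, as described in the article.
# The FP8 entry is inferred from the 4000 TFLOPs FP8 vs 2000 TFLOPs FP16
# figures rather than quoted directly.
factors = {
    "more SMs": 1.2,
    "4th Gen Tensor Cores": 2.0,
    "FP8 compute path": 2.0,
    "higher frequency": 1.3,
}

# Assumption: the gains are independent, so they multiply.
combined = prod(factors.values())
print(f"combined speedup: about {combined:.2f}x")
```

Under those assumptions the factors compound to a little over 6x, with the two Tensor Core related factors contributing far more than the added SMs.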

An interesting comparison that highlights GPU scaling shows that a single GPC on a Hopper H100 GPU is equivalent to a Kepler GK110 GPU, a flagship HPC chip from 2012. The Kepler GK110 housed a total of 15 SMs while the Hopper H100 GPU packs 132 SMs, and even a single GPC on the Hopper GPU features 18 SMs, 20% more than the entirety of SMs on the Kepler flagship.

The cache is another area where NVIDIA has paid a lot of attention, upping it to 50 MB on the H100 (60 MB on the full GH100 GPU). This is a 25% increase over the 40 MB cache featured on the Ampere A100 GPU and roughly 3x the size of the cache on AMD’s flagship Aldebaran MCM GPU, the MI250X.

Rounding up the performance figures, NVIDIA’s GH100 Hopper GPU will offer 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32, 67 TFLOPs of FP32 and 34 TFLOPs of FP64 compute performance. These record-shattering figures decimate all other HPC accelerators that came before it. For comparison, this is 3.3x faster than NVIDIA’s own A100 GPU and 28% faster than AMD’s Instinct MI250X in FP64 compute. In FP16 compute, the H100 GPU is 3x faster than the A100 and 5.2x faster than the MI250X, which is truly bonkers.

The PCIe variant, which is a cut-down model, was recently listed in Japan for over $30,000 US, so one can imagine that the SXM variant with its beefier configuration will easily cost around $50,000.

NVIDIA HPC / AI GPUs

| NVIDIA Tesla Graphics Card | NVIDIA H100 (SXM5) | NVIDIA H100 (PCIe) | NVIDIA A100 (SXM4) | NVIDIA A100 (PCIe4) | Tesla V100S (PCIe) | Tesla V100 (SXM2) | Tesla P100 (SXM2) | Tesla P100 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla K40 (PCI-Express) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPU | GH100 (Hopper) | GH100 (Hopper) | GA100 (Ampere) | GA100 (Ampere) | GV100 (Volta) | GV100 (Volta) | GP100 (Pascal) | GP100 (Pascal) | GM200 (Maxwell) | GK110 (Kepler) |
| Process Node | 4nm | 4nm | 7nm | 7nm | 12nm | 12nm | 16nm | 16nm | 28nm | 28nm |
| Transistors | 80 Billion | 80 Billion | 54.2 Billion | 54.2 Billion | 21.1 Billion | 21.1 Billion | 15.3 Billion | 15.3 Billion | 8 Billion | 7.1 Billion |
| GPU Die Size | 814 mm² | 814 mm² | 826 mm² | 826 mm² | 815 mm² | 815 mm² | 610 mm² | 610 mm² | 601 mm² | 551 mm² |
| SMs | 132 | 114 | 108 | 108 | 80 | 80 | 56 | 56 | 24 | 15 |
| TPCs | 66 | 57 | 54 | 54 | 40 | 40 | 28 | 28 | 24 | 15 |
| FP32 CUDA Cores Per SM | 128 | 128 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 192 |
| FP64 CUDA Cores Per SM | 128 | 128 | 32 | 32 | 32 | 32 | 32 | 32 | 4 | 64 |
| FP32 CUDA Cores | 16896 | 14592 | 6912 | 6912 | 5120 | 5120 | 3584 | 3584 | 3072 | 2880 |
| FP64 CUDA Cores | 16896 | 14592 | 3456 | 3456 | 2560 | 2560 | 1792 | 1792 | 96 | 960 |
| Tensor Cores | 528 | 456 | 432 | 432 | 640 | 640 | N/A | N/A | N/A | N/A |
| Texture Units | 528 | 456 | 432 | 432 | 320 | 320 | 224 | 224 | 192 | 240 |
| Boost Clock | TBD | TBD | 1410 MHz | 1410 MHz | 1601 MHz | 1530 MHz | 1480 MHz | 1329 MHz | 1114 MHz | 875 MHz |
| TOPs (DNN/AI) | 2000 TOPs / 4000 TOPs | 1600 TOPs / 3200 TOPs | 1248 TOPs / 2496 TOPs with Sparsity | 1248 TOPs / 2496 TOPs with Sparsity | 130 TOPs | 125 TOPs | N/A | N/A | N/A | N/A |
| FP16 Compute | 2000 TFLOPs | 1600 TFLOPs | 312 TFLOPs / 624 TFLOPs with Sparsity | 312 TFLOPs / 624 TFLOPs with Sparsity | 32.8 TFLOPs | 30.4 TFLOPs | 21.2 TFLOPs | 18.7 TFLOPs | N/A | N/A |
| FP32 Compute | 1000 TFLOPs | 800 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) | 156 TFLOPs (19.5 TFLOPs standard) | 16.4 TFLOPs | 15.7 TFLOPs | 10.6 TFLOPs | 10.0 TFLOPs | 6.8 TFLOPs | 5.04 TFLOPs |
| FP64 Compute | 60 TFLOPs | 48 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 19.5 TFLOPs (9.7 TFLOPs standard) | 8.2 TFLOPs | 7.80 TFLOPs | 5.30 TFLOPs | 4.7 TFLOPs | 0.2 TFLOPs | 1.68 TFLOPs |
| Memory Interface | 5120-bit HBM3 | 5120-bit HBM2e | 6144-bit HBM2e | 6144-bit HBM2e | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 384-bit GDDR5 | 384-bit GDDR5 |
| Memory Size | Up To 80 GB HBM3 @ 3.0 Gbps | Up To 80 GB HBM2e @ 2.0 Gbps | Up To 40 GB HBM2 @ 1.6 TB/s / Up To 80 GB HBM2 @ 1.6 TB/s | Up To 40 GB HBM2 @ 1.6 TB/s / Up To 80 GB HBM2 @ 2.0 TB/s | 16 GB HBM2 @ 1134 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB GDDR5 @ 288 GB/s |
| L2 Cache Size | 51200 KB | 51200 KB | 40960 KB | 40960 KB | 6144 KB | 6144 KB | 4096 KB | 4096 KB | 3072 KB | 1536 KB |
| TDP | 700W | 350W | 400W | 250W | 250W | 300W | 300W | 250W | 250W | 235W |

News Source: Videocardz