AMD has provided some additional insight into its CDNA 2 "Aldebaran" GPU-powered Instinct MI200 series, the first GPUs to feature an MCM design. The Instinct MI200 GPUs were detailed by AMD architects Alan Smith and Norman James during Hot Chips 34.

AMD Provides First Look At Aldebaran “CDNA 2” Instinct MI200 Series GPU Block Diagram, First In HPC To Feature MCM Design

AMD is officially the first to bring MCM technology to this class of GPU, and it is doing so with a flagship product: the Instinct MI200, codenamed Aldebaran. The AMD Aldebaran GPU will come in various forms and sizes, but all are based on the brand-new CDNA 2 architecture, the most refined evolution of Vega yet. Before we go into detail, some of the main features are listed below:

  • AMD CDNA 2 architecture – 2nd Gen Matrix Cores accelerate FP64 and FP32 matrix operations, delivering up to 4X the peak theoretical FP64 performance vs. AMD previous-gen GPUs.
  • Leadership packaging technology – Industry-first multi-die GPU design with 2.5D Elevated Fanout Bridge (EFB) technology delivers 1.8X more cores and 2.7X higher memory bandwidth vs. AMD previous-gen GPUs, offering the industry's best aggregate peak theoretical memory bandwidth at 3.2 terabytes per second.
  • 3rd Gen AMD Infinity Fabric technology – Up to 8 Infinity Fabric links connect the AMD Instinct MI200 with 3rd Gen EPYC CPUs and other GPUs in the node to enable unified CPU/GPU memory coherency and maximize system throughput, allowing for an easier on-ramp for CPU code to tap the power of accelerators.

AMD Instinct MI200 GPU Die Shot:

Inside the AMD Instinct MI200 is an Aldebaran GPU featuring two dies, a primary and a secondary. Each die consists of 8 shader engines, for a total of 16 SEs. Each shader engine packs 14 CUs with full-rate FP64, packed FP32, and a 2nd Generation Matrix Engine for FP16 and BF16 operations. The entire GPU is fabricated on TSMC's 6nm process node and packs a total of 58 billion transistors.

AMD Instinct MI200 GPU Block Diagram:


Each die consists of 112 compute units, or 7,168 stream processors. That adds up to a total of 224 compute units, or 14,336 stream processors, for the entire chip. The Aldebaran GPU is also powered by a new XGMI interconnect. Each chiplet features a VCN 2.6 engine and the main IO controller, along with four 1024-bit memory controllers for the HBM2e memory.
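The shader arithmetic works out as follows; a quick sanity check in Python, assuming CDNA 2's 64 stream processors per compute unit:

```python
# CDNA 2 "Aldebaran" shader configuration, per the figures above.
SHADER_ENGINES_PER_DIE = 8
CUS_PER_SHADER_ENGINE = 14
SPS_PER_CU = 64          # stream processors per compute unit on CDNA 2
DIES = 2

cus_per_die = SHADER_ENGINES_PER_DIE * CUS_PER_SHADER_ENGINE  # 112 CUs
sps_per_die = cus_per_die * SPS_PER_CU                        # 7,168 SPs
total_cus = cus_per_die * DIES                                # 224 CUs
total_sps = sps_per_die * DIES                                # 14,336 SPs

print(cus_per_die, sps_per_die, total_cus, total_sps)
```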

As for cache, each GPU chiplet features a total of 8 MB of L2 capacity, physically partitioned into 32 slices. Each slice delivers 128B/clk with enhanced queuing and arbitration plus enhanced atomic operations. The per-GCD memory subsystem consists of 64 GB of HBM2e per chiplet with an aggregate 1.6 TB/s of bandwidth per GCD, partitioned into 32 channels at 64B/clk each for efficient operation. The in-package interconnect offers 400 GB/s of bisectional bandwidth across the two GCDs.
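The 1.6 TB/s per-GCD figure follows directly from the four 1024-bit HBM2e controllers; a minimal sketch, assuming the 3.2 Gbps pin speed quoted in the memory section of this article:

```python
# Per-GCD HBM2e bandwidth: four 1024-bit controllers at a 3.2 Gbps pin speed.
controllers_per_gcd = 4
bus_width_bits = controllers_per_gcd * 1024        # 4096-bit bus per GCD
pin_speed_hz = 3.2e9                               # 3.2 Gbps per pin

bandwidth_bps = bus_width_bits * pin_speed_hz / 8  # bits -> bytes per second
print(f"{bandwidth_bps / 1e12:.1f} TB/s per GCD")  # ~1.6 TB/s
```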





There are a total of 8 Infinity Fabric interconnects, of which one on each GPU can be used as a PCI-Express interconnect. The interconnect is rated at a coherent CPU-GPU transfer rate of 144 GB/s. You can scale up to 500 GB/s using the external Infinity Fabric links with a total of four MI200 series GPUs, or scale out using a PCIe Gen 4 ESM AIC for 100 GB/s of bandwidth.

AMD Instinct MI200 “Aldebaran GPU” Performance Metrics:


In terms of performance, AMD is touting various record wins in the HPC segment over NVIDIA's A100 solution, with up to 3x performance improvements in AMG.
















As for DRAM, AMD has gone with an 8-stack configuration of 1024-bit interfaces for an 8192-bit wide bus. Each interface supports 2 GB HBM2e DRAM modules, which gives up to 16 GB of HBM2e memory capacity per stack; with eight stacks in total, the total capacity comes to a whopping 128 GB. That's 48 GB more than the A100, which houses 80 GB of HBM2e memory. The memory clocks in at a blistering 3.2 Gbps for total bandwidth of 3.2 TB/s, a full 1.2 TB/s more than the A100 80 GB's 2 TB/s.
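The capacity and bandwidth figures above can be reproduced with simple arithmetic; a sketch of both calculations:

```python
# Total HBM2e capacity and bandwidth across the full 8-stack interface.
stacks = 8
gb_per_stack = 16                  # 8 x 2 GB HBM2e modules per 1024-bit stack
bus_width_bits = stacks * 1024     # 8192-bit aggregate bus
pin_speed_hz = 3.2e9               # 3.2 Gbps per pin

capacity_gb = stacks * gb_per_stack                      # 128 GB total
bandwidth_gbs = bus_width_bits * pin_speed_hz / 8 / 1e9  # GB per second
print(capacity_gb, "GB")                # 128 GB
print(f"{bandwidth_gbs:.1f} GB/s")      # 3276.8 GB/s, quoted as 3.2 TB/s
```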





The AMD Instinct MI200 CDNA 2 "Aldebaran" GPUs are already powering the world's fastest supercomputer, Frontier, which is also the world's first exascale machine, offering 1.1 ExaFLOPs of compute horsepower and currently sitting at the top of both the TOP500 and Green500 lists. AMD has also unveiled its future plans for the Instinct MI300 APU lineup, which will further leverage the chiplet architecture and take things to the next level.

AMD Radeon Instinct Accelerators

| Accelerator Name | AMD Instinct MI300 | AMD Instinct MI250X | AMD Instinct MI250 | AMD Instinct MI210 | AMD Instinct MI100 | AMD Radeon Instinct MI60 | AMD Radeon Instinct MI50 | AMD Radeon Instinct MI25 | AMD Radeon Instinct MI8 | AMD Radeon Instinct MI6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPU Architecture | Zen 4 (Exascale APU) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| GPU Architecture | TBA (CDNA 3) | Aldebaran (CDNA 2) | Aldebaran (CDNA 2) | Aldebaran (CDNA 2) | Arcturus (CDNA 1) | Vega 20 | Vega 20 | Vega 10 | Fiji XT | Polaris 10 |
| GPU Process Node | 5nm + 6nm | 6nm | 6nm | 6nm | 7nm FinFET | 7nm FinFET | 7nm FinFET | 14nm FinFET | 28nm | 14nm FinFET |
| GPU Chiplets | 4 (MCM / 3D Stacked), 1 per die | 2 (MCM), 1 per die | 2 (MCM), 1 per die | 2 (MCM), 1 per die | 1 (Monolithic) | 1 (Monolithic) | 1 (Monolithic) | 1 (Monolithic) | 1 (Monolithic) | 1 (Monolithic) |
| GPU Cores | 28,160? | 14,080 | 13,312 | 6,656 | 7,680 | 4,096 | 3,840 | 4,096 | 4,096 | 2,304 |
| GPU Clock Speed | TBA | 1700 MHz | 1700 MHz | 1700 MHz | 1500 MHz | 1800 MHz | 1725 MHz | 1500 MHz | 1000 MHz | 1237 MHz |
| FP16 Compute | TBA | 383 TOPs | 362 TOPs | 181 TOPs | 185 TFLOPs | 29.5 TFLOPs | 26.5 TFLOPs | 24.6 TFLOPs | 8.2 TFLOPs | 5.7 TFLOPs |
| FP32 Compute | TBA | 95.7 TFLOPs | 90.5 TFLOPs | 45.3 TFLOPs | 23.1 TFLOPs | 14.7 TFLOPs | 13.3 TFLOPs | 12.3 TFLOPs | 8.2 TFLOPs | 5.7 TFLOPs |
| FP64 Compute | TBA | 47.9 TFLOPs | 45.3 TFLOPs | 22.6 TFLOPs | 11.5 TFLOPs | 7.4 TFLOPs | 6.6 TFLOPs | 768 GFLOPs | 512 GFLOPs | 384 GFLOPs |
| VRAM | 192 GB HBM3? | 128 GB HBM2e | 128 GB HBM2e | 64 GB HBM2e | 32 GB HBM2 | 32 GB HBM2 | 16 GB HBM2 | 16 GB HBM2 | 4 GB HBM1 | 16 GB GDDR5 |
| Memory Clock | TBA | 3.2 Gbps | 3.2 Gbps | 3.2 Gbps | 1200 MHz | 1000 MHz | 1000 MHz | 945 MHz | 500 MHz | 1750 MHz |
| Memory Bus | 8192-bit | 8192-bit | 8192-bit | 4096-bit | 4096-bit | 4096-bit | 4096-bit | 2048-bit | 4096-bit | 256-bit |
| Memory Bandwidth | TBA | 3.2 TB/s | 3.2 TB/s | 1.6 TB/s | 1.23 TB/s | 1 TB/s | 1 TB/s | 484 GB/s | 512 GB/s | 224 GB/s |
| Form Factor | OAM | OAM | OAM | Dual Slot Card | Dual Slot, Full Length | Dual Slot, Full Length | Dual Slot, Full Length | Dual Slot, Full Length | Dual Slot, Half Length | Single Slot, Full Length |
| Cooling | Passive | Passive | Passive | Passive | Passive | Passive | Passive | Passive | Passive | Passive |
| TDP | ~600W | 560W | 500W | 300W | 300W | 300W | 300W | 300W | 175W | 150W |
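The peak-compute entries in the table follow from cores × clock × operations per clock. A sketch for the MI250X column, assuming full-rate FP64 FMA (2 FLOPs per clock per stream processor), packed FP32 at twice that rate, and matrix FP16 at four times the packed-FP32 rate (the ratios implied by the table):

```python
# Peak-throughput sanity check for the MI250X column above.
cores = 14_080      # stream processors (220 enabled CUs x 64 SPs)
clock_hz = 1.7e9    # 1700 MHz clock speed

fp64 = cores * 2 * clock_hz   # full-rate FP64 FMA: 2 FLOPs/clock/SP
fp32 = fp64 * 2               # packed FP32: two operations per lane
fp16 = fp32 * 4               # matrix FP16: 4x the packed-FP32 rate

print(f"FP64: {fp64 / 1e12:.1f} TFLOPs")  # ~47.9
print(f"FP32: {fp32 / 1e12:.1f} TFLOPs")  # ~95.7
print(f"FP16: {fp16 / 1e12:.0f} TOPs")    # ~383
```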