Earlier this month, we reported that Birentech, a company from China, was working on its fastest GPU to date, the Biren BR100. Based on what the company has publicly revealed, the Biren BR100 aims to be a general-purpose GPU that offers faster performance than NVIDIA’s A100 GPUs in AI processing. Now at Hot Chips 34, the company is presenting more details on the specifications and architecture of its Biren GPGPU lineup.

China’s Fastest General-Purpose MCM GPU, The Birentech Biren BR100, Architecture Detailed

The Birentech BR100 is the flagship general-purpose GPU that China has to offer, featuring an in-house GPU architecture that uses a 7nm process node and houses 77 billion transistors. The GPU is built with TSMC’s 2.5D CoWoS packaging and also comes packed with 300 MB of on-chip cache, 64 GB of HBM2e with a memory bandwidth of 2.3 TB/s, and support for PCIe Gen 5.0 (CXL interconnect protocol). The entire chip measures 1074mm2, which is beyond the reticle limit of the process node.


Some of the design goals behind the BR100 GPU included:

  • Breaking the reticle size limit to integrate more transistors on a chip
  • One tape-out to enable multiple SKUs
  • Smaller dies for better yield, hence lower cost
  • 896 GB/s high-speed die-to-die interconnect
  • 30% more performance and 20% better yield compared with a monolithic design
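The yield argument in the list above can be illustrated with a simple Poisson defect model, Y = exp(-A * D0). This is a textbook approximation rather than Birentech's own methodology: the defect density `D0_PER_CM2` is a made-up placeholder, and the even split of the 1074mm2 total across two chiplets is an assumption.

```python
import math

# Hypothetical defect density (defects per cm^2); not a figure from Birentech or TSMC.
D0_PER_CM2 = 0.1


def poisson_yield(area_mm2: float, d0_per_cm2: float = D0_PER_CM2) -> float:
    """Fraction of defect-free dies under the Poisson model Y = exp(-A * D0)."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * d0_per_cm2)


TOTAL_AREA_MM2 = 1074.0                 # total silicon area quoted in the article
CHIPLET_AREA_MM2 = TOTAL_AREA_MM2 / 2   # assumes two equally sized chiplets

monolithic_yield = poisson_yield(TOTAL_AREA_MM2)   # hypothetical single large die
chiplet_yield = poisson_yield(CHIPLET_AREA_MM2)    # one of the two smaller dies

print(f"hypothetical monolithic die yield: {monolithic_yield:.1%}")
print(f"per-chiplet yield:                 {chiplet_yield:.1%}")
```

Under this model a smaller die always yields better than a larger one at the same defect density. The exact percentages depend entirely on the placeholder value, so the output should not be read as a reproduction of the 20% figure quoted by Birentech.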




As for the architecture itself, the Biren BR100 is made up of two chiplets, each housing 16 SPCs, or Streaming Processing Clusters. Each SPC has 16 Execution Units (EUs), and four of these EUs form an internal Compute Unit (CU) that is attached to 64 KB of L1 cache (LSC), while the SPC features a shared 8 MB L2 cache across all of its Execution Units. That makes a total of 32 SPCs with 512 Execution Units, 256 MB of L2 cache, and 8 MB of L1 cache.
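The totals in the paragraph above follow directly from the per-unit figures. A minimal sketch of the arithmetic, assuming the default grouping of four EUs per CU (the constant names are ours, not Birentech's):

```python
# Per-unit figures as quoted in the article.
CHIPLETS = 2
SPCS_PER_CHIPLET = 16
EUS_PER_SPC = 16
EUS_PER_CU = 4        # default grouping described in the article
L1_KB_PER_CU = 64
L2_MB_PER_SPC = 8

total_spcs = CHIPLETS * SPCS_PER_CHIPLET
total_eus = total_spcs * EUS_PER_SPC
total_cus = total_eus // EUS_PER_CU
total_l1_mb = total_cus * L1_KB_PER_CU / 1024   # KB -> MB
total_l2_mb = total_spcs * L2_MB_PER_SPC

print(total_spcs, total_eus, total_cus, total_l1_mb, total_l2_mb)
```

The L1 and L2 totals add up to 264 MB, which is less than the 300 MB of on-chip cache quoted earlier; the article does not break down the remainder.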

A deeper look at the Execution Unit reveals 16 streaming processing cores (V-Cores) and a single Tensor Engine (T-Core). There is 40 KB of TLR (Thread Local Register), 4 SFUs, and a TDA (Tensor Data Accelerator). Interestingly, each CU can contain 4, 8, or up to 16 EUs. The V-Core itself is a general-purpose SIMT processor that features 16 cores and supports FP32, FP16, INT32, and INT16 along with SFU, Load/Store, and Data Processing, while handling deep learning operations such as Batch Norm, ReLU, etc. It also features an enhanced SIMT model that can run up to 128K threads on 32 SPCs in a super-scalar mode (static and dynamic). For the T-Cores, the tensor design is used to accelerate AI operations such as MMA, Convolution, etc.
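To put the 128K-thread figure in perspective, it can be divided across the hardware hierarchy. This is only a back-of-the-envelope sketch: it assumes "128K" means 128 * 1024 and that threads are spread evenly across SPCs and EUs, neither of which the article confirms.

```python
TOTAL_THREADS = 128 * 1024   # assumption: "128K" is 128 * 1024, not 128,000
TOTAL_SPCS = 32              # from the article
EUS_PER_SPC = 16             # from the article

# Assumes an even distribution of resident threads, which is not stated in the source.
threads_per_spc = TOTAL_THREADS // TOTAL_SPCS
threads_per_eu = threads_per_spc // EUS_PER_SPC

print(f"threads per SPC: {threads_per_spc}")
print(f"threads per EU:  {threads_per_eu}")
```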















Birentech disclosed various performance metrics for the chip. It offers up to 2048 TOPs (INT8), 1024 TFLOPs (BF16), 512 TFLOPs (TF32+), and 256 TFLOPs (FP32), and based on these figures, it looks like this chip is going to be faster than the NVIDIA Ampere A100, at least on paper. The GPU has been compared against the NVIDIA Ampere A100 in various HPC workloads, and it looks like it can offer a 2.6x average speedup and up to a 2.8x speedup over its main competitor.
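The peak figures above follow a simple pattern: each step down in precision doubles the quoted throughput. A short sketch that normalizes the numbers against FP32 (the dictionary name is ours; the values are the ones quoted above):

```python
# Peak throughput as quoted by Birentech (TFLOPs, or TOPs for INT8).
PEAK_THROUGHPUT = {"FP32": 256, "TF32+": 512, "BF16": 1024, "INT8": 2048}

# Express every format as a multiple of the FP32 peak.
speedup_vs_fp32 = {
    fmt: value / PEAK_THROUGHPUT["FP32"] for fmt, value in PEAK_THROUGHPUT.items()
}

for fmt, ratio in speedup_vs_fp32.items():
    print(f"{fmt:>5}: {ratio:.0f}x FP32 peak")
```

These are theoretical peaks, so the ratios say nothing about how real workloads scale with precision.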

Birentech Details China’s Most Powerful GPU, The Biren BR100: 1074mm2 on 7nm, 77 Billion Transistors, Up To 2.8x Faster Than NVIDIA Ampere at 550W

For comparison, NVIDIA’s Hopper H100 GPU offers nearly 2x to 2.5x the performance in the same metrics. The BR100 also supports 64-channel encoding and 512-channel decoding. As for the interconnects, the chip comes with an 8-port BLink solution which offers 2.3 TB/s of external I/O bandwidth.

What’s interesting is that the BR100 is not that far behind the NVIDIA H100 in terms of overall transistor count. The H100 features 80 billion transistors on the newer N4 process node, while the BR100 is just 3 billion transistors behind on the older 7nm process node. This results in a much larger die size.
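The link between process node and die size can be made concrete by comparing transistor density. In the sketch below, the BR100 figures come from this article; the H100 die area of 814mm2 is NVIDIA's published figure and is not taken from this article. Note that the BR100's 1074mm2 is spread across two chiplets.

```python
def density_mtr_per_mm2(transistors: float, area_mm2: float) -> float:
    """Average transistor density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


# BR100: both figures are quoted in the article (area covers two chiplets).
br100_density = density_mtr_per_mm2(77e9, 1074)

# H100: transistor count is from the article; the die area is an outside figure.
H100_AREA_MM2 = 814
h100_density = density_mtr_per_mm2(80e9, H100_AREA_MM2)

print(f"BR100: {br100_density:.1f} MTr/mm^2")
print(f"H100:  {h100_density:.1f} MTr/mm^2")
```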




Birentech Biren BR100 specifications
Process: 7nm
System interface, bandwidth, interconnect protocol: PCIe 5.0 x16, 128 GB/s, supports CXL
FP32 TFLOPS (peak): 256
TF32+ TFLOPS (peak): 512
BF16 TFLOPS (peak): 1,024
INT8 TOPS (peak): 2,048
Memory capacity, interface width, bandwidth: 64 GB HBM2E, 4,096-bit, 1.64 TB/s
Interconnect: 512 GB/s BLink™, supports 8 x8 ports
Secure virtual instances: up to 8
Video codec ([email protected]): 64-channel HEVC/H.264 encoding, 512-channel HEVC/H.264 decoding
TDP: 550W
Product form: OAM module
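Two of the bandwidth figures in the BR100 specifications can be sanity-checked with simple arithmetic. This sketch derives the per-pin HBM2E data rate implied by the quoted bus width and bandwidth, and the raw PCIe 5.0 x16 bandwidth. The 32 GT/s per-lane rate comes from the PCIe 5.0 specification rather than the article, and reading the 128 GB/s figure as a bidirectional total that ignores encoding overhead is our assumption.

```python
def implied_pin_rate_gbps(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Per-pin data rate (Gbit/s) implied by a total bandwidth and bus width."""
    return bandwidth_gb_s * 8 / bus_width_bits


# HBM2E: 1.64 TB/s over a 4,096-bit interface, as quoted in the specifications.
br100_pin_rate_gbps = implied_pin_rate_gbps(1640, 4096)

# PCIe 5.0: 32 GT/s per lane (from the PCIe spec, not the article), 16 lanes.
PCIE5_GT_PER_LANE = 32
LANES = 16
pcie_one_way_gb_s = PCIE5_GT_PER_LANE * LANES / 8   # raw, ignoring encoding overhead
pcie_both_ways_gb_s = pcie_one_way_gb_s * 2         # assumption: 128 GB/s is bidirectional

print(f"implied HBM2E pin rate: {br100_pin_rate_gbps:.2f} Gbit/s")
print(f"PCIe 5.0 x16 raw bandwidth: {pcie_both_ways_gb_s:.0f} GB/s (both directions)")
```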

The Biren BR100 is not the only chip that the China-based company has announced. There’s also the Biren BR104, which offers half the performance metrics of the BR100, though its full specifications haven’t been detailed yet. The only detail available on the chip’s design is that, unlike the Biren BR100, which uses a chiplet design, the BR104 is a monolithic die and comes in a standard PCIe form factor with a TDP of 300W.

Birentech Biren BR104 specifications
Process: 7nm
System interface, bandwidth, interconnect protocol: PCIe 5.0 x16, 128 GB/s, supports CXL
FP32 TFLOPS (peak): 128
TF32+ TFLOPS (peak): 256
BF16 TFLOPS (peak): 512
INT8 TOPS (peak): 1,024
Memory capacity, interface width, bandwidth: 32 GB HBM2E, 2,048-bit, 819 GB/s
Interconnect: 192 GB/s BLink™, supports 3 x8 ports
Secure virtual instances: up to 4
Video codec ([email protected]): 32-channel HEVC/H.264 encoding, 256-channel HEVC/H.264 decoding
TDP: 300W
Product form: Full-height, full-length, dual-slot PCIe card
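The "half the performance" relationship between the two chips can be checked against the two specification tables. The sketch below uses only numbers quoted in those tables; the dictionary keys are our own labels.

```python
# Figures copied from the two specification tables.
BR100 = {"fp32_tflops": 256, "int8_tops": 2048, "memory_gb": 64,
         "mem_bw_gb_s": 1640, "blink_gb_s": 512, "blink_ports": 8, "tdp_w": 550}
BR104 = {"fp32_tflops": 128, "int8_tops": 1024, "memory_gb": 32,
         "mem_bw_gb_s": 819, "blink_gb_s": 192, "blink_ports": 3, "tdp_w": 300}

# BR104 value as a fraction of the BR100 value for each metric.
ratios = {key: BR104[key] / BR100[key] for key in BR100}

# BLink bandwidth per port for each chip.
br100_per_port = BR100["blink_gb_s"] / BR100["blink_ports"]
br104_per_port = BR104["blink_gb_s"] / BR104["blink_ports"]

for key, ratio in ratios.items():
    print(f"{key:>12}: {ratio:.3f}")
print(f"BLink per port: BR100 {br100_per_port:.0f} GB/s, BR104 {br104_per_port:.0f} GB/s")
```

Compute and memory scale by almost exactly one half, while the BLink bandwidth drops further because the BR104 exposes three ports instead of eight at the same per-port rate. The TDP ratio is slightly above one half.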




The company states that a chip with 77 billion transistors can mimic the nerve cells of the human brain, and that the chip itself will be used for DNN and AI applications, so it is more or less intended to reduce China’s dependence on NVIDIA’s AI GPUs.