Tachyum Publishes Prodigy Universal Processor Whitepaper: Up To 6x Faster Than NVIDIA H100 & 30x Faster Than Intel Xeon 8380, Available In 2H 2023

Tachyum has officially published the whitepaper for its 5nm Prodigy Universal Processor, which was first unveiled back in 2018.

Tachyum Promises Big Numbers In 5nm Prodigy Universal Processor Whitepaper, Up To 9 Times Higher Performance Efficiency Than NVIDIA’s H100

The Tachyum Prodigy CPUs use a universal processor design, which means they can execute CPU, GPU, and TPU tasks on the same chip, saving costs over competing products while also offering very high performance.

The company aims to take on all three chip giants, AMD, Intel & NVIDIA, with its Prodigy lineup. In its presentations, Tachyum has estimated a 4x performance uplift over Intel’s Xeon CPUs, a 3x increase over NVIDIA’s H100 on the HPC front, and a 6x increase in raw performance in AI & inference workloads. The chips are also said to offer over 10x the performance of competing systems at the same power. Some of the main features of the CPUs include:

  • 128 high-performance unified 64-bit cores running at up to 5.7 GHz
  • 16 DDR5 memory controllers
  • 64 PCIe 5.0 lanes
  • Multiprocessor support for 4-socket and 2-socket platforms
  • Rack solutions for both air-cooled and liquid-cooled data centers
  • SPECrate 2017 Integer performance of around 4x Intel 8380 and around 3x AMD 7763
  • HPC Double-Precision Floating-Point performance is 3x NVIDIA H100
  • AI FP8 performance is 6x NVIDIA H100

Tachyum has now released the full whitepaper for its Prodigy Universal Processor, which details the CPU architecture, platform, and lineup. The lineup will scale from the low-power T832-LP 32-core CPU at 180W TDP all the way up to the flagship T16128-AIX, which features a total of 128 cores.

Tachyum Prodigy Universal CPU Architecture – Custom 64-bit Design

The Tachyum Prodigy uses an out-of-order (OoO) architecture that can decode and retire up to 8 instructions per clock and issue up to 11 instructions per clock, with an instruction queue that supports up to 48 instructions and a scheduler that supports 12 queues, each 15 entries deep. It comes with four ALUs, one load unit, one store unit, one load/store unit, one mask unit & two 1024-bit vector units. Each core also has an AI subsystem that includes a 4096-bit matrix unit. Each core is a single-threaded hardware design.

Coming to the cache configuration, each core packs a 64 KB I-Cache & a 64 KB D-Cache with SECDED ECC. Each core also has 1 MB of L2 with double-error-correct, triple-error-detect (DECTED) ECC. Active cores can also pool the L2 cache of idle CPU cores to act as a shared L3 cache.

Prodigy employs an innovative coherency protocol, T-MESI (Tachyum-MESI), that is based on MESI. T-MESI adds optimizations over standard MESI that improve latency and performance. In addition to on-chip cache coherency, Prodigy also supports hardware coherency between Prodigy devices, which allows both 2-socket and 4-socket platforms to be fully coherent. Prodigy’s hardware coherency uses eight full-duplex lanes of 112 gigabit/sec SERDES links between each set of coherent devices, providing an aggregate of 1.8 terabit/sec of bandwidth between coherent devices.
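
The 1.8 terabit/sec figure can be reproduced from the lane count and lane rate quoted above. The following is a minimal sketch, assuming the aggregate counts both directions of the full-duplex link and ignores encoding overhead; the constant and function names are illustrative and do not come from the whitepaper.

```python
# Figures quoted in the article; how they combine is an assumption.
LANES = 8             # full-duplex SERDES lanes per coherent link
GBIT_PER_LANE = 112   # gigabit/sec per lane, per direction
DIRECTIONS = 2        # assumption: aggregate counts transmit + receive


def aggregate_link_tbit(lanes: int = LANES,
                        gbit_per_lane: int = GBIT_PER_LANE,
                        directions: int = DIRECTIONS) -> float:
    """Raw aggregate bandwidth of one coherent link in terabit/sec."""
    return lanes * gbit_per_lane * directions / 1000


per_direction_gbit = LANES * GBIT_PER_LANE
total_tbit = aggregate_link_tbit()
print(f"{per_direction_gbit} Gbit/s per direction, "
      f"{total_tbit:.3f} Tbit/s aggregate")
```

Under these assumptions, each direction carries 896 gigabit/sec and the two directions together come to 1.792 terabit/sec, which rounds to the quoted 1.8.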

Prodigy’s TLB can hold large memory footprints for HPC, up to 128 TB. The MMU is hardware-managed for maximum performance and includes a sophisticated global purge mechanism.

Vector and Matrix Units

Prodigy’s 2×1024-bit vector subsystems are 2x the size of those in Intel’s top-end processors and 4x the size of those in AMD’s. Prodigy’s 4096-bit matrix unit supports 16 x 16, 8 x 8, and 4 x 4 operations. The vector and matrix subsystems support a wide range of data types, including FP64, FP32, TF32, BF16, Int8, FP8, as well as TAI, or Tachyum AI, a new data type that will be announced later this year and will deliver higher performance than FP8. Prodigy’s matrix operations support sparse data types for highest performance, including 4:2 sparsity, which is also supported by the NVIDIA H100, as well as Tachyum’s Super-Sparsity, which enables even higher performance with an 8:3 ratio.
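
To make the sparsity ratios concrete, here is a minimal sketch of structured pruning. It assumes that "4:2" means at most 2 non-zero values in every group of 4 and that "8:3" means at most 3 in every group of 8. The whitepaper does not describe how Prodigy selects or encodes the surviving values, so the magnitude-based selection and the helper names `prune_n_of_m` and `density` are illustrative only.

```python
def prune_n_of_m(values, keep, group):
    """Zero all but the `keep` largest-magnitude entries in each block of `group`."""
    if len(values) % group != 0:
        raise ValueError("length must be a multiple of the group size")
    pruned = []
    for start in range(0, len(values), group):
        block = values[start:start + group]
        # Positions of the `keep` largest magnitudes within this block.
        order = sorted(range(group), key=lambda i: abs(block[i]), reverse=True)
        kept = set(order[:keep])
        pruned.extend(v if i in kept else 0.0 for i, v in enumerate(block))
    return pruned


def density(values):
    """Fraction of entries that are non-zero."""
    return sum(1 for v in values if v != 0) / len(values)


weights = [0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.3, -0.6]
sparse_4_2 = prune_n_of_m(weights, keep=2, group=4)
sparse_8_3 = prune_n_of_m(weights, keep=3, group=8)
print(sparse_4_2, density(sparse_4_2))
print(sparse_8_3, density(sparse_8_3))
```

With this reading, the 4:2 pattern keeps half of the values and the 8:3 pattern keeps 37.5% of them, which is where the additional headroom over 4:2 would come from.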

Sparse data types maximize performance for training and inference with a very minor reduction in accuracy. Lower-precision data types and sparsity are discussed in more detail in the whitepaper section “Prodigy on the Leading Edge of AI Industry Trends”. Scatter/Gather operations provide fast, efficient loading and storing for vectors and matrices.

Memory and I/O Subsystems

Prodigy integrates an industry-leading sixteen DDR5 memory controllers that run at up to DDR5-7200, providing approximately 1 TB/sec of memory bandwidth and supporting 2 DIMMs per channel. Tachyum will be announcing a new feature later this year called “Bandwidth Amplification” that effectively doubles the memory bandwidth to a staggering 2 TB/sec. The PCIe subsystem includes 64 lanes of PCIe 5.0 with 32 PCIe controllers.
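
The "approximately 1 TB/sec" figure follows from the controller count and transfer rate. This is a rough sketch, assuming each controller drives a 64-bit (8-byte) data path, which the article does not state; the names are illustrative.

```python
CHANNELS = 16            # DDR5 memory controllers (from the article)
MEGATRANSFERS = 7200     # DDR5-7200: million transfers/sec per channel
BYTES_PER_TRANSFER = 8   # assumption: 64-bit data path per controller


def peak_bandwidth_gb(channels: int, mt_per_sec: int, bytes_per_transfer: int) -> float:
    """Theoretical peak memory bandwidth in GB/sec (decimal units)."""
    return channels * mt_per_sec * bytes_per_transfer / 1000  # MB/s -> GB/s


peak_gb = peak_bandwidth_gb(CHANNELS, MEGATRANSFERS, BYTES_PER_TRANSFER)
amplified_gb = 2 * peak_gb  # "Bandwidth Amplification" is said to double it
print(f"peak: {peak_gb:.1f} GB/s, amplified: {amplified_gb:.1f} GB/s")
```

Under that assumption the theoretical peak is 921.6 GB/sec, and doubling it gives about 1.84 TB/sec, consistent with the rounded 1 TB/sec and 2 TB/sec figures.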

The PCIe subsystem consists of four x16 PCIe functional blocks, and each of the x16 blocks includes 8 controllers that can bifurcate down to x2, offering maximum flexibility to support external devices ranging from high-performance NICs to large NVMe storage arrays.
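
The lane and controller counts fit together as follows. This sketch lists only uniform power-of-two splits of one x16 block; whether mixed widths within a block are allowed is not stated, so that restriction and the function name are assumptions.

```python
BLOCKS = 4                  # x16 functional blocks (from the article)
LANES_PER_BLOCK = 16
CONTROLLERS_PER_BLOCK = 8


def bifurcation_options(lanes_per_block: int, controllers_per_block: int):
    """Uniform link widths one block can be split into, widest first.

    Returns (width label, number of links) pairs. Assumes power-of-two,
    equal-width splits only.
    """
    min_width = lanes_per_block // controllers_per_block
    if min_width < 1:
        raise ValueError("more controllers than lanes")
    options = []
    width = lanes_per_block
    while width >= min_width:
        options.append((f"x{width}", lanes_per_block // width))
        width //= 2
    return options


total_lanes = BLOCKS * LANES_PER_BLOCK
total_controllers = BLOCKS * CONTROLLERS_PER_BLOCK
options = bifurcation_options(LANES_PER_BLOCK, CONTROLLERS_PER_BLOCK)
print(total_lanes, total_controllers, options)
```

Four blocks of 16 lanes give the 64 lanes quoted earlier, and eight controllers per block give 32 controllers, each of which would drive an x2 link at the narrowest split.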

Prodigy Runs Emulation for x86, Arm, RISC-V

Prodigy supports software dynamic binary translation for other instruction set architectures (ISAs), including x86, Arm, and RISC-V. x86 is the established data center processor, Arm is very prevalent in telco applications, and RISC-V is popular with academic institutions. The overhead for binary translation is approximately 30 – 40%, but Prodigy will be running at approximately twice the frequency of competitive processors, so the performance should be similar to running native. Binary translation is intended to enable fast, easy out-of-the-box evaluation and testing for customers and partners, with customers migrating to Prodigy’s native ISA for production deployments for maximum performance.
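
The trade-off between translation overhead and clock speed can be put into numbers. The sketch below assumes the overhead inflates the cycle count by 30 to 40% and that both chips do the same amount of work per clock; the second assumption is not a claim made in the whitepaper, and the function name is illustrative.

```python
def relative_speed(clock_ratio: float, overhead: float) -> float:
    """Speed of translated code relative to a competitor running natively.

    Assumes equal work per clock on both chips and that the translation
    overhead multiplies the cycle count by (1 + overhead).
    """
    return clock_ratio / (1.0 + overhead)


CLOCK_RATIO = 2.0  # "approximately twice the frequency" (from the article)
worst_case = relative_speed(CLOCK_RATIO, 0.40)
best_case = relative_speed(CLOCK_RATIO, 0.30)
print(f"{worst_case:.2f}x to {best_case:.2f}x")
```

Under this simple model the translated code would land at roughly 1.4x to 1.5x the competitor's native speed. If per-clock throughput is lower on Prodigy for translated code, the result moves closer to parity, which matches the more cautious "similar to running native" wording.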

All chips are fabricated on TSMC’s 5nm (N5P) process node, a slightly optimized variant of the standard 5nm (N5) node, and run native binaries as well as x86, Arm, and RISC-V binaries. As for HPC and AI-specific features, the Tachyum Prodigy lineup includes:

  • 2 x 1024-bit Vector Units Per Core
  • 4096-bit Matrix Processors Per Core
  • FP64, FP32, TF32, BF16, Int8, FP8, TAI Data Types
  • Sparse Data Types Optimize Efficiency
  • Quantization Support Using Low Precision Data Types
  • Scatter/Gather for efficiently storing and loading matrices
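
The appeal of low-precision data types shows up in how many elements fit into the vector units at once. This is a small sketch, assuming the two 1024-bit units are fully packed with elements of the standard bit widths; TF32 and TAI are left out because their packed widths are not given in the article.

```python
VECTOR_UNITS = 2      # per core (from the list above)
VECTOR_BITS = 1024    # width of each vector unit

# Standard storage widths in bits; packing behaviour is an assumption.
ELEMENT_BITS = {"FP64": 64, "FP32": 32, "BF16": 16, "Int8": 8, "FP8": 8}


def lanes_per_core(element_bits: int) -> int:
    """Number of elements one core's vector units can hold side by side."""
    return VECTOR_UNITS * VECTOR_BITS // element_bits


lanes = {name: lanes_per_core(bits) for name, bits in ELEMENT_BITS.items()}
for name, count in lanes.items():
    print(f"{name}: {count} elements per core")
```

Going from FP64 to FP8 raises the element count per core from 32 to 256, an 8x gain in elements processed per vector operation before sparsity is considered.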

Tachyum Prodigy Universal CPU Lineup/Platform – Scaling from 180W To 950W

All 128 cores on the flagship CPU are clocked at 5.7 GHz. In addition, AI customers will get up to 16 memory channels, supporting up to 32 TB (64 DIMMs) of DDR5-7200. The processor will also feature 64 PCIe Gen 5.0 lanes and will come in a 950W TDP package.

The rest of the CPUs that Tachyum will offer are listed in the specifications sheet below:

Model                Cores  Clock    Memory         PCIe      TDP   Market Segment
Prodigy T16128-AIX   128    5.7 GHz  16x DDR5-7200  Gen5 x64  950W  HPC, Big AI
Prodigy T16128-AIM   128    4.5 GHz  16x DDR5-7200  Gen5 x64  700W  HPC, Big AI
Prodigy T16128-AIE   128    4.0 GHz  16x DDR5-7200  Gen5 x64  600W  HPC, Big AI
Prodigy T16128-HT    128    4.5 GHz  16x DDR5-6400  Gen5 x64  300W  Analytics, Big Data
Prodigy T864-HS      64     5.7 GHz  8x DDR5-6400   Gen5 x32  300W  Cloud, Databases
Prodigy T864-HT      64     4.5 GHz  8x DDR5-6400   Gen5 x32  300W  Cloud, Databases
Prodigy T832-HS      32     5.7 GHz  8x DDR5-6400   Gen5 x32  300W  Scalar Workloads
Prodigy T832-LP      32     3.2 GHz  8x DDR5-4800   Gen5 x32  180W  Web Hosting, Storage, Edge

Now this is just one chip, and Tachyum will allow full hardware coherency that supports 2 and 4-socket systems. So that’s up to 512 cores and 3800W of power from four Prodigy T16128-AIX tier processors.
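
The multi-socket totals follow directly from the per-chip figures. Here is a minimal sketch that takes the 128-core, 950W entry for the T16128-AIX from the table at face value and assumes platform power is simply the sum of the chip TDPs; the `Sku` class and `platform_totals` helper are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sku:
    name: str
    cores: int
    tdp_watts: int


# Flagship figures as listed in the specifications table.
T16128_AIX = Sku("Prodigy T16128-AIX", cores=128, tdp_watts=950)


def platform_totals(sku: Sku, sockets: int) -> tuple:
    """Total (cores, CPU TDP in watts) for a 1-, 2- or 4-socket platform."""
    if sockets not in (1, 2, 4):
        raise ValueError("the article only describes 2- and 4-socket coherency")
    return sku.cores * sockets, sku.tdp_watts * sockets


two_socket = platform_totals(T16128_AIX, 2)
four_socket = platform_totals(T16128_AIX, 4)
print(two_socket, four_socket)
```

This counts CPU TDP only; memory, NICs, and cooling would add to the power drawn by a real server.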

The Prodigy Platform will come in various rack solutions, such as an air-cooled 2U server that can house up to 4 Tachyum Prodigy chips, sixty-four 16 GB DDR5 DIMMs, and 2×200 GbE RoCE NICs. There’s also a custom 48U rack reference design that comes in 2 versions, one liquid-cooled and one air-cooled. The air-cooled version supports 40 4-socket 2U servers for a total of 160 chips, while the liquid-cooled version supports 88 4-socket 1U servers for a total of 352 chips. Both racks have a modular design, and 2 racks can be combined into a 2-rack cabinet to optimize floor space. Each server comes with 4 cLGA sockets.
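
The chip counts and per-server memory quoted above can be checked with a few lines. This is a simple sketch using only the figures in the paragraph; the names are illustrative.

```python
SOCKETS_PER_SERVER = 4
DIMMS_PER_SERVER = 64
GB_PER_DIMM = 16


def chips_per_rack(servers: int, sockets: int = SOCKETS_PER_SERVER) -> int:
    """Total Prodigy chips for a given number of servers."""
    return servers * sockets


air_cooled_chips = chips_per_rack(40)      # 40 four-socket 2U servers
liquid_cooled_chips = chips_per_rack(88)   # 88 four-socket 1U servers
server_memory_gb = DIMMS_PER_SERVER * GB_PER_DIMM
print(air_cooled_chips, liquid_cooled_chips, server_memory_gb)
```

The liquid-cooled design packs 2.2x as many chips as the air-cooled one, and the 2U server configuration described works out to 1 TB of DDR5 per server.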

Tachyum Prodigy Universal CPU Lineup – Hitting NVIDIA, Intel & AMD All At Once

Tachyum also provides some preliminary performance estimates against Intel Ice Lake, NVIDIA Hopper / Grace HPC chips, and AMD Milan CPUs. The company claims up to a 4x SPECrate 2017 Integer and a 30x raw floating-point (FP64) performance increase versus the competition. NVIDIA’s Hopper H100 is the main chip that Tachyum seems to have its sights set on, as it’s used in several comparative tests.

Some of the performance figures mentioned include:

  • 3x vs NVIDIA H100 in Double-Precision Floating-Point Performance
  • 6x vs NVIDIA H100 in AI FP8 Performance
  • 9x vs NVIDIA H100 in Performance per Watt
  • 4x vs Intel Xeon Platinum 8380 in SPECrate 2017 INT Performance
  • 30x vs Intel Xeon Platinum 8380 in FP64 Performance

The Prodigy T16128-AIX offers around 90 TFLOPs of FP64 performance (with sparsity). At rack scale, the company estimates that an air-cooled Prodigy rack will deliver up to 6.2 PetaFLOPs of HPC FP64 performance, versus 960 TFLOPs for an NVIDIA H100 DGX POD rack. The liquid-cooled Prodigy rack, which can hold higher-end chips, should offer over double that at 12.9 PetaFLOPs.
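
The rack-level figures can be turned into ratios for easier comparison. The sketch below uses only the numbers quoted in this article; pairing the rack FP64 estimates with the 160-chip and 352-chip rack configurations to derive a per-chip figure is an assumption, since the article does not say which chips populate each rack.

```python
AIR_RACK_TFLOPS = 6200       # 6.2 PetaFLOPs FP64, air-cooled Prodigy rack
LIQUID_RACK_TFLOPS = 12900   # 12.9 PetaFLOPs FP64, liquid-cooled Prodigy rack
DGX_POD_TFLOPS = 960         # NVIDIA H100 DGX POD rack, as quoted

AIR_RACK_CHIPS = 160         # assumption: matches the air-cooled rack design
LIQUID_RACK_CHIPS = 352      # assumption: matches the liquid-cooled rack design

air_vs_dgx = AIR_RACK_TFLOPS / DGX_POD_TFLOPS
liquid_vs_air = LIQUID_RACK_TFLOPS / AIR_RACK_TFLOPS
air_per_chip = AIR_RACK_TFLOPS / AIR_RACK_CHIPS
liquid_per_chip = LIQUID_RACK_TFLOPS / LIQUID_RACK_CHIPS

print(f"air-cooled rack vs DGX POD rack: {air_vs_dgx:.2f}x")
print(f"liquid-cooled vs air-cooled rack: {liquid_vs_air:.2f}x")
print(f"implied FP64 per chip: {air_per_chip:.2f} / {liquid_per_chip:.2f} TFLOPs")
```

On these numbers the air-cooled rack is about 6.5x the DGX POD rack, and the implied per-chip figure is well below the 90 TFLOPs quoted with sparsity, which suggests the rack estimates are dense FP64 numbers.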

Tachyum expects the first Prodigy chips to start sampling later this year, with volume production expected in the second half of 2023. The next-gen upgrade, Prodigy 2, is also listed in Tachyum’s roadmap and will offer a new 3nm architecture with even more cores, higher memory bandwidth, PCIe 6.0 + CXL support, and enhanced connectivity. Sampling for that should begin by the second half of 2024.