SILICON TALES
A field guide to NVIDIA's 2026 platform

Vera Rubin NVL72
How a leather-jacketed CEO turned a graphics company into the power grid of artificial intelligence.

Every spring, in a packed arena that the press has taken to calling "AI Woodstock," a man in a black leather jacket walks on stage without notes and tells the technology industry what the next year will look like. He is almost always right, because by the time he says it, most of it has already been built. This is the story of the machine he unveiled for 2026, told the way a hardware enthusiast would tell it: every chip, every acronym, and the part nobody explains clearly, which is what you can actually buy and what "one unit" even means.

72 Rubin GPUs / rack
36 Vera CPUs / rack
3.6 EF NVFP4 inference / rack
20.7 TB HBM4 memory / rack
~1,296 total chips / rack

A note on trust: the numbers below are NVIDIA's preliminary specifications, the kind that get revised before a product reaches a loading dock. Mass production is targeted for the second half of 2026. Treat the figures as the company's intent, not a finished datasheet, and treat the marketing multipliers with the polite skepticism they have earned over the years.
01 / The house that Jensen built

A name for measuring invisible things

NVIDIA names its architectures after scientists, and the choices are rarely accidental. Pascal, Volta, Turing, Ampere, Hopper, Blackwell. This generation belongs to Vera Rubin, the American astronomer whose painstaking measurements of how galaxies rotate produced the most convincing evidence we have for dark matter, the unseen mass that holds the cosmos together. There is a quiet poetry in attaching her name to a machine built for inference, the act of drawing conclusions from what a model cannot directly see.

And then, as has become tradition, NVIDIA split the name across the two halves of the system, which tells you the entire design philosophy before you read a single specification.

The GPU
Rubin

The surname goes to the GPU, the engine that does the punishing AI mathematics. When people say "a Rubin," this is the silicon they picture.

The CPU
Vera

The first name goes to a brand new CPU, the data-traffic controller and coordinator. It is the successor to "Grace," and it marks NVIDIA's return to designing its own processor cores from scratch.

So "Vera Rubin" is not two products bolted together at the last minute. It is a CPU and a GPU drawn up on the same whiteboard, intended from day one to behave as a single organism.

From a Denny's booth to the most valuable company on Earth

To understand why a new NVIDIA rack now moves global stock markets, you have to understand how improbable the company's position is. NVIDIA was founded in 1993 in a roadside Denny's in San Jose by Jensen Huang and two engineers, with a plan to accelerate computer graphics. It nearly died in its first years when an early chip flopped, and survived on a single rescue product. For most of its life it was, to the wider world, the company that made your video games run smoothly.

The pivot that created today's giant was a bet almost nobody else was willing to make. Around 2006, NVIDIA began pouring money and engineers into CUDA, a way to use graphics chips for general-purpose computing. For years this looked like an expensive distraction; analysts asked why a gaming company was funding a research-computing platform with no obvious customer. Huang kept funding it anyway. When deep learning arrived a few years later and researchers discovered that GPUs were almost magically suited to training neural networks, NVIDIA was the only company with a decade of software already in the ground. That patience is the moat. Rivals can copy the silicon; they cannot copy twenty years of libraries and the millions of developers who learned on them.

The rest reads like a coronation. The 2020 acquisition of Mellanox turned NVIDIA from a chip vendor into a networking company. The arrival of ChatGPT turned its data-center GPUs into the most fought-over hardware on the planet. By the mid-2020s NVIDIA had passed Apple and Microsoft to become the most valuable public company in the world, selling the picks and shovels for the entire AI gold rush, with a share of the AI accelerator market so large that competitors are left to divide the remainder in single-digit slices.

Huang's recurring line at these launches, delivered with a showman's grin, is "the more you buy, the more you save." It is a joke. It is also, for the hyperscalers racing each other, a description of how they actually think.

The texture matters too, because Huang has made it part of the product. The black leather jacket he has worn for decades is now as recognizable as the logo. He runs the company famously flat, with dozens of executives reporting to him directly. In 2016 he personally carried the first DGX-1 AI supercomputer to a young lab called OpenAI and signed its chassis with a dedication to the future of computing and humanity, a piece of theater that has aged into legend. The annual GTC keynote, where Blackwell was held aloft in 2024 and the Rubin roadmap was laid out in 2025, is less a product briefing than a stadium event. None of this is incidental. It is how a components company convinced the world it is building the future, one rack at a time.

The cadence: a new architecture every year

Here is the lineage that explains why Rubin matters. NVIDIA has compressed what used to be a multi-year chip cycle into an annual drumbeat, and each beat has had a distinct personality.

Year | Architecture | Flagship | What it really meant
2022 to 2023 | Hopper | H100 / H200 | The chip the whole world fought over in the ChatGPT gold rush.
2024 | Blackwell | B200 / GB200 NVL72 | NVIDIA stops selling a chip and starts selling a whole rack as one computer.
2025 | Blackwell Ultra | GB300 NVL72 | Retuned for "reasoning" AI that thinks before it answers.
2026 | Rubin | VR200 / Vera Rubin NVL72 | Built to run AI cheaply, for billions of users and AI agents. (this guide)
2027 | Rubin Ultra | NVL576 | Doubles the rack to 144 packages (576 GPU dies).
~2028 | Feynman | on the map | Named, scheduled, and still under wraps.

What each leap actually meant

Strip away the codenames and there is a single arc running through all of it: the industry moved from the problem of building giant models to the far larger problem of running them affordably for everyone, forever. Read top to bottom, this is that story.

2022 to 2023 · Hopper
H100, the gold-rush chip · Training era

When ChatGPT detonated, every company on Earth suddenly needed to train AI, and the H100 was the one chip that could really do it. Demand went vertical, supply evaporated, and the phrase "GPU shortage" entered ordinary conversation. The whole modern boom was built on these. NVIDIA's valuation crossed a trillion dollars on the strength of them.

2024 · Blackwell
GB200, when they stopped selling chips and started selling computers · Training era

Frontier models grew too large to live on a single chip, so Huang changed what NVIDIA sells. Instead of a card you slot into a server, Blackwell's flagship is an entire rack: seventy-two GPUs wired so tightly they behave like one colossal brain. This is the moment "AI factory," his favorite phrase, stopped being a metaphor and became a product category.

2025 · Blackwell Ultra
GB300, tuned for AI that "thinks" · The shift begins

A new species of model appeared, the kind that reasons step by step before it answers, the source of that "thinking" pause you now see in chatbots. That style burns enormous compute at answer time, not just at training time. GB300 was a mid-cycle tune-up aimed squarely at this reasoning workload, a hint of where the center of gravity was about to move.

2026 · Rubin  ←  you are here
VR200, making AI cheap enough for everyone · Inference era

The hard problem flipped completely. The challenge is no longer training a model once. It is serving that model to billions of people and to armies of autonomous agents, affordably, every second of every day. Rubin is engineered for exactly this. NVIDIA is targeting roughly a tenth of the cost per answer compared with Blackwell, and it has added the new Vera CPU and the Groq LPU specifically to keep agentic workloads fast and cheap.

2027 · Rubin Ultra
NVL576, double the rack · Roadmap

The same idea, scaled. The rack roughly doubles to 144 packages (576 GPU dies), readying NVIDIA for the next jump in both model size and global user volume.

~2028 · Feynman
The next chapter · Roadmap

Named for the physicist Richard Feynman. It exists on Huang's roadmap slide, which by now is enough to make suppliers plan around it, but the details are still locked away.

The one takeaway: Hopper and Blackwell answered "how do we build these models?" Rubin answers a different and much bigger question: "how do we run them for everyone without going broke?" That pivot, from training to cheap inference and AI agents, is the entire reason Rubin exists.
Naming gotcha: From Rubin onward, NVIDIA counts GPU dies, not GPU packages, as "GPUs." Each Rubin package holds two compute dies. That is why the same rack appears as "NVL72" (72 packages) in the final product name and as "NVL144" (144 dies) in earlier slides. Same cabinet, two ways of counting, and a small masterclass in how a bigger number gets onto a keynote slide.
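The two counting conventions come down to one multiplication. Here is a minimal sketch using only figures quoted in this guide: two compute dies per Rubin package, and 144 packages with 576 dies for Rubin Ultra, which implies four dies per package. The function name is made up for illustration.

```python
def gpu_die_count(packages: int, dies_per_package: int) -> int:
    """Number of GPU compute dies in a rack with the given package count."""
    return packages * dies_per_package


# Vera Rubin rack: 72 packages, two compute dies each.
rubin_dies = gpu_die_count(packages=72, dies_per_package=2)

# Rubin Ultra: the 144 packages and 576 dies quoted above imply 4 dies each.
ultra_dies_per_package = 576 // 144
ultra_dies = gpu_die_count(packages=144, dies_per_package=ultra_dies_per_package)

print(f"Rubin: NVL72 counted by package, NVL{rubin_dies} counted by die")
print(f"Rubin Ultra: 144 packages, NVL{ultra_dies} counted by die")
```

Whenever a rack name and a GPU count disagree by a factor of two or four, this is usually the reason.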
02 / The big picture

It is not a server. It is a building that thinks.

The single hardest adjustment for anyone coming from PCs is this: a consumer GPU lives inside one machine, but NVIDIA's unit of measure is now an entire rack, a refrigerator-sized cabinet that behaves as one computer assembled from roughly 1,300 chips. Once you accept that the rack is the product, two interconnect concepts unlock everything else, and they exist because moving data is now harder than computing on it.

Inside the rack
Scale-up

Wiring every GPU within a single rack so they act like one enormous GPU. The hero here is NVLink, a private highway that runs at close to memory speed. This is the part nobody else can match at scale.

Between racks
Scale-out

Linking many racks into a full data center. The heroes are the networking chips: ConnectX, Spectrum-X, Quantum-X InfiniBand. Think of this as the public motorway between buildings, fast but a tier below NVLink.

Figure 01
Two highways, two completely different jobs
Scale-up keeps a rack's GPUs glued together at memory speed. Scale-out stitches racks into a data center. Confusing the two is the most common mistake newcomers make.
[Diagram. Scale-up, one rack: GPUs linked all-to-all over NVLink 6 at 3.6 TB/s per GPU, memory-class speed. Scale-out, many racks: racks linked through switches over Ethernet/InfiniBand using ConnectX-9, Spectrum-X and Quantum-X800.]

It helps to see the physical object. A Vera Rubin NVL72 is not abstract. It is a cabinet you could touch, if you were allowed near the cooling loop, weighing on the order of one and a half metric tons and drinking power at a rate that would trip the breakers of a small office block. Here is what is inside it, top to bottom.

Figure 02
Anatomy of one NVL72 rack
Eighteen compute trays do the thinking, nine switch trays in the middle keep them glued together, and the whole thing is fed by liquid, not air.
[Diagram of the rack: power shelf, liquid cooling (CDU), 18 compute trays (2 Vera + 4 Rubin each), 9 NVLink switch trays (260 TB/s all-to-all). Totals: 72 Rubin GPUs · 36 Vera CPUs · 20.7 TB HBM4 · 3.6 EF NVFP4 · ~1,296 chips · ~150 to 190 kW · ~1.5 tonnes · liquid-cooled only.]

That is the punchline of the whole platform in one image. The thinking happens in the green compute trays; the brighter switch trays in the middle are the connective tissue that lets all seventy-two GPUs pretend to be one; and the orange and teal bands at top and bottom are the unglamorous truth of modern AI, which is that the real engineering challenge is now feeding the thing power and carrying away its heat.

03 / CUDA, cores and number formats

"Does it have CUDA?" is almost a funny question

It is a fair thing to ask, and the honest answer is that asking whether a modern NVIDIA GPU "has CUDA" is a little like asking whether a Catholic cathedral has religion. CUDA is not a feature of the hardware. It is the gravity well that the entire AI industry now orbits, and it exists in two distinct forms that Rubin carries at once.

The software
CUDA, the platform

NVIDIA's programming model and its two-decade stack of libraries, CUDA-X. This is the real moat, the thing AMD and Intel keep failing to dislodge. Rubin runs CUDA 13, and the new Groq chips were deliberately wired in so that no CUDA code has to change.

The hardware
CUDA cores

The small general-purpose math units inside every NVIDIA GPU. They handle ordinary parallel arithmetic in FP32 and FP64 and integers, the foundation for scientific computing and all the logic that surrounds the AI itself.

Figure 03
Why the moat is the software, not the silicon
A competitor can fabricate a fast chip. What they cannot conjure overnight is twenty years of libraries and the millions of engineers who learned on them. Every layer above the silicon is locked to NVIDIA.
[Diagram of the stack, top to bottom: every AI model on Earth (GPT, Gemini, Llama, Claude-class systems, your app); frameworks (PyTorch · TensorFlow · JAX); CUDA-X libraries (cuDNN · cuBLAS · NCCL · TensorRT · twenty years of work); the CUDA programming model; the silicon (Rubin GPU · Vera CPU · BlueField DPU · ConnectX NIC). Callout: THE MOAT, rivals can match silicon, not this.]

Two kinds of cores: the generalists and the specialists

Inside each GPU, compute is grouped into SMs, or Streaming Multiprocessors, and the Rubin GPU has 224 of them. Picture an SM as a workshop containing two very different teams. There is no single "AI core" doing everything; the magic is in how these two teams divide the labor.

Core type | What it does | What it is for | Think of it as
CUDA core | General-purpose parallel math (FP32, FP64, integers) | HPC, physics, simulation, and the connective logic around the AI | A bench of versatile generalists
Tensor Core | Matrix-multiply-accumulate at low precision (FP4, FP8, FP16) | The roughly ninety percent of deep learning that is matrix math. This is where the headline PetaFLOPS come from. | Specialist robots that do one thing at terrifying speed

Rubin's 224 SMs carry the latest generation of Tensor Cores, which NVIDIA documents as fifth-generation and tunes heavily for NVFP4 and FP8, alongside the general-purpose CUDA cores. For data-center parts NVIDIA leads its marketing with Tensor throughput rather than a raw CUDA-core count, and that choice is itself the story: for AI, the matrix engines are what matter, and the specialists have quietly become the main event.

The Transformer Engine, the part that feels like cheating

Sitting on top of the Tensor Cores is a layer of hardware and software called the Transformer Engine, and it is one of NVIDIA's cleverest tricks. As a model runs, it watches the numbers flowing through each layer and decides, on the fly, which values can be safely crushed down to ultra-low precision and which need a little more room. The result is the holy grail of inference economics: something close to FP4 speed with close to FP8 accuracy. Rubin's third-generation engine adds an adaptive, two-level scaling scheme and retires the older "structured sparsity" trick from previous generations. This is how the same chip can quote 35 dense PetaFLOPS of NVFP4 yet reach an effective 50 PetaFLOPS for real inference.
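To make "two-level scaling" concrete, here is a toy sketch of block-scaled 4-bit quantization in plain Python. It is not NVIDIA's Transformer Engine algorithm. The block size of 16, the choice to derive scales from the largest magnitude, and keeping the scales as ordinary floats are all assumptions made for illustration, and every function name is hypothetical. The only fixed ingredient is the set of magnitudes a 4-bit E2M1 float can represent.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = FP4_GRID[-1]
BLOCK = 16  # assumed number of elements sharing one block scale


def nearest_fp4(x: float) -> float:
    """Round x to the nearest representable FP4 value, keeping its sign."""
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def quantize(values: list[float]) -> tuple[float, list[float], list[float]]:
    """Return (tensor_scale, block_scales, fp4_codes) for a flat list."""
    # Level 1: one scale for the whole tensor, from its largest magnitude.
    tensor_scale = max((abs(v) for v in values), default=0.0) / FP4_MAX or 1.0
    block_scales: list[float] = []
    codes: list[float] = []
    for start in range(0, len(values), BLOCK):
        block = [v / tensor_scale for v in values[start:start + BLOCK]]
        # Level 2: a local scale so each block uses the full FP4 range.
        scale = max(abs(v) for v in block) / FP4_MAX or 1.0
        block_scales.append(scale)
        codes.extend(nearest_fp4(v / scale) for v in block)
    return tensor_scale, block_scales, codes


def dequantize(tensor_scale: float, block_scales: list[float],
               codes: list[float]) -> list[float]:
    """Undo both scaling levels to recover approximate values."""
    return [c * block_scales[i // BLOCK] * tensor_scale
            for i, c in enumerate(codes)]


# Demo on made-up data: 32 evenly spaced values in two blocks of 16.
data = [0.01 * i for i in range(-16, 16)]
t_scale, b_scales, fp4 = quantize(data)
restored = dequantize(t_scale, b_scales, fp4)
worst = max(abs(a - b) for a, b in zip(data, restored))
print(f"blocks: {len(b_scales)}, worst-case error: {worst:.4f}")
```

The point of the second level is locality: a block of small values gets its own scale and so keeps most of the 4-bit resolution, instead of being flattened by one outlier elsewhere in the tensor. In this toy version the error within a block never exceeds one sixth of that block's largest magnitude, because the widest gap on the grid is the one between 4 and 6.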

Number formats, and the only rule you need

Precision simply means how many bits you spend to represent each number. Fewer bits run faster and use less memory but carry less accuracy, so the entire discipline of modern AI hardware is a hunt for the lowest precision you can get away with without the model falling apart.

// lower precision buys speed and memory; higher precision buys accuracy
Format | Bits | Primary use | Why you would reach for it
NVFP4 | 4 | Inference at massive scale, and increasingly the training of the very largest models | The cheapest possible cost per token. The Transformer Engine keeps the accuracy honest.
FP8 / NVFP8 | 8 | Training, and inference where quality cannot slip | More numerical range than FP4, the safe default when stability matters.
FP16 / BF16 | 16 | Sensitive training layers, gradients, mixed precision | Stability where FP8 would drift. BF16 trades precision for range.
TF32 | ~19 | A near drop-in accelerator for FP32 work | Faster than FP32 with almost no code change.
FP32 / FP64 | 32 / 64 | Scientific computing, simulation, HPC | Full precision where being wrong is not an option. This runs on the CUDA cores.
The only rule you need: FP4 runs the model cheaply. FP8 trains the model. FP16 and BF16 keep training from blowing up. FP32 and FP64 do science. Rubin is bent, hard, toward the first two, because that is where the money in the inference era actually is.
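The "fewer bits use less memory" half of that trade is plain arithmetic, and it is worth seeing once. The sketch below sizes the weights of a hypothetical 70-billion-parameter model at each bit width from the table. The model size is an invented example, and the calculation ignores the small overhead of scale factors and everything that is not weights, such as the KV-cache.

```python
# Bits per stored value, taken from the format table above.
BITS_PER_VALUE = {"NVFP4": 4, "FP8": 8, "BF16": 16, "FP32": 32, "FP64": 64}


def weight_gigabytes(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BITS_PER_VALUE[fmt] / 8 / 1e9


HYPOTHETICAL_PARAMS = 70e9  # an illustrative model size, not a real product

for fmt in BITS_PER_VALUE:
    size = weight_gigabytes(HYPOTHETICAL_PARAMS, fmt)
    print(f"{fmt:>6}: {size:7.1f} GB")
```

Each halving of precision halves the footprint, so a model that needs 140 GB at BF16 needs 35 GB at NVFP4. Set against 288 GB of HBM4 per GPU, that gap decides how many models, or how much context, fits on one chip.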
04 / The seven chips, one by one

Seven new chips, one supercomputer

"Seven new chips, one AI supercomputer" is how Huang framed it on stage, and the framing is doing real work. The competition tends to ship a faster GPU. NVIDIA ships an entire fleet of co-designed silicon, each piece solving a different bottleneck, and the bottlenecks are no longer mostly about compute. Here is the cast, in order.

01

Rubin GPU

// the compute engine

This is the star, the silicon that does the overwhelming share of the AI mathematics, and physically it represents a clean break from the past. Old GPUs were a single monolithic slab. Rubin is a chiplet design, an assembly of separate dies bonded onto one carrier, because chips have hit the physical ceiling of how large a single piece of silicon can be manufactured. NVIDIA's answer is to stop fighting that ceiling and start sewing chips together.

Figure 04
Inside a single Rubin package
Two reticle-sized compute dies and two I/O dies, ringed by eight stacks of HBM4 memory, all stitched onto one TSMC CoWoS-L carrier. "Reticle-sized" means as physically large as a lithography machine can print in a single exposure.
[Diagram: one CoWoS-L interposer carrying compute die 1, compute die 2 and two I/O dies, ringed by eight HBM4 stacks.]
TSMC N3P (3nm class) · 336B transistors · 224 SMs · 288 GB HBM4 · 22 TB/s bandwidth · 50 PF NVFP4 inference · ~1.8 to 2.3 kW TDP

The two upgrades that actually matter

The first is memory, and it is the more important of the two. As models swell, the limiting factor stopped being how fast a chip can compute and became how fast you can feed it. HBM4 doubles the interface width over the previous HBM3e, delivering 288 GB across eight stacks at up to 22 TB/s, roughly two and three-quarter times Blackwell's 8 TB/s. The second is the third-generation Transformer Engine described above. Together they target the exact wall that the inference era runs into.
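A back-of-envelope sketch shows why bandwidth, not compute, sets the pace. It uses the 22 TB/s, 8 TB/s and 288 GB figures quoted above. The decode ceiling is a deliberately crude model that assumes every generated token re-reads a fixed number of bytes from HBM and ignores batching, caching and the KV-cache. The 35 GB working set is a hypothetical example.

```python
HBM4_BANDWIDTH_TB_S = 22.0       # per Rubin GPU, preliminary figure
BLACKWELL_BANDWIDTH_TB_S = 8.0   # per Blackwell GPU, as quoted above
HBM4_CAPACITY_GB = 288.0

# Generation-over-generation bandwidth ratio.
ratio = HBM4_BANDWIDTH_TB_S / BLACKWELL_BANDWIDTH_TB_S

# Time to stream the full 288 GB once at peak rate (1 TB = 1000 GB).
sweep_ms = HBM4_CAPACITY_GB / (HBM4_BANDWIDTH_TB_S * 1000) * 1000


def decode_ceiling_tokens_per_s(bytes_read_per_token: float) -> float:
    """Upper bound on tokens/s for one sequence if each token re-reads
    this many bytes from HBM at peak bandwidth (a crude assumption)."""
    return HBM4_BANDWIDTH_TB_S * 1e12 / bytes_read_per_token


print(f"bandwidth ratio vs Blackwell: {ratio:.2f}x")
print(f"one full sweep of HBM4: {sweep_ms:.1f} ms")
print(f"ceiling with a 35 GB working set: "
      f"{decode_ceiling_tokens_per_s(35e9):.0f} tokens/s")
```

Even at peak rate, touching all of memory once takes about thirteen milliseconds, which is why a single stream of tokens cannot be made arbitrarily fast by adding FLOPS.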

The uncomfortable truth: heat. At eighteen hundred to twenty-three hundred watts per GPU, there is no such thing as an air-cooled Rubin. Every configuration is liquid-cooled, full stop. That is roughly double Blackwell's thousand watts, and it is why deploying Rubin is as much a plumbing project as a computing one. The chip is so capable that the hard part has moved off the die entirely and into the data center's cooling loop.
02

Vera CPU

// the brand new brain

This is the genuinely new piece, and the one most people miss. Blackwell's companion CPU was called Grace, and it leaned on off-the-shelf Arm Neoverse cores. Vera is its successor, and the headline is that NVIDIA designed the cores itself, a custom design code-named "Olympus." After years of buying its CPU cores off the shelf, NVIDIA has gone back to drawing its own, which is a statement of intent about how tightly it wants the CPU and GPU to fit.

88 custom Olympus cores · 176 threads · 227B transistors · 1.5 TB LPDDR5X · 1.8 TB/s NVLink-C2C to GPU · first CPU with native FP8

The job here is the work a GPU is genuinely bad at. GPUs are gloriously fast at doing the same operation across thousands of data points, but they stumble on branchy, sequential, decision-heavy logic. Vera handles exactly that: staging data, deciding which GPU should get which piece at which moment, and orchestrating the long, multi-step loops of agentic reasoning. NVIDIA keeps using the word "deterministic," meaning predictable, jitter-free timing, and for AI agents that fire thousands of small dependent steps, that predictability is worth more than raw speed.

A unit of vocabulary: One Vera CPU plus two Rubin GPUs on a single board is called a Vera Rubin Superchip. An NVL72 rack contains thirty-six of them, which is where its 72 GPUs and 36 CPUs come from. The CPU and GPU talk over NVLink-C2C at 1.8 TB/s, twice the bandwidth Grace had, so the line between "CPU memory" and "GPU memory" gets blurry on purpose.
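The headline rack numbers at the top of this guide fall straight out of the superchip count. A small sketch, using only the preliminary per-GPU figures quoted earlier (288 GB of HBM4 and 50 PF of NVFP4 inference per Rubin GPU); the constant names are my own.

```python
SUPERCHIPS_PER_RACK = 36
CPUS_PER_SUPERCHIP = 1
GPUS_PER_SUPERCHIP = 2
HBM4_GB_PER_GPU = 288     # preliminary per-GPU capacity
NVFP4_PF_PER_GPU = 50     # preliminary per-GPU inference throughput

gpus = SUPERCHIPS_PER_RACK * GPUS_PER_SUPERCHIP
cpus = SUPERCHIPS_PER_RACK * CPUS_PER_SUPERCHIP
hbm_tb = gpus * HBM4_GB_PER_GPU / 1000     # GB -> TB
nvfp4_ef = gpus * NVFP4_PF_PER_GPU / 1000  # PF -> EF

print(f"{gpus} GPUs, {cpus} CPUs per rack")
print(f"{hbm_tb:.1f} TB of HBM4, {nvfp4_ef:.1f} EF of NVFP4 inference")
```

Nothing in the rack-level specification is independent of the chip-level one: 20.7 TB and 3.6 EF are simply seventy-two copies of a single Rubin GPU.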
03

NVLink 6 Switch

// the in-rack highway, scale-up

A frontier model does not fit on one GPU; it is sliced across dozens, and those slices have to talk to each other constantly and instantly, or the whole rack stalls waiting on itself. NVLink is the private highway that carries that traffic, and the NVLink Switch is the interchange that lets every GPU reach every other one at once. This is the single piece of the puzzle that competitors have struggled most to replicate at scale.

3.6 TB/s per GPU, all-to-all · 260 TB/s per rack · roughly 2x Blackwell

The payoff is the illusion at the heart of the NVL72: with NVLink 6, all seventy-two GPUs can behave as one enormous GPU with a shared pool of memory. Try to move model shards across ordinary networking instead and the latency would strangle performance. NVLink keeps that traffic running at something close to memory speed, which is the difference between a rack and a mere pile of servers.

04

ConnectX-9 SuperNIC

// the between-racks highway, scale-out

One rack is never enough at the frontier, so you link hundreds of them into a single training cluster, and the SuperNIC is what does the linking. This is the lineage that traces straight back to the Mellanox acquisition; networking is no longer an afterthought bolted onto the side of an AI system, it is a first-class citizen of the design.

1.6 Tb/s per GPU · programmable RDMA · GPU-direct

Its key trick is programmable RDMA, which lets a GPU reach directly into another server's memory without bothering either machine's CPU, at very low latency. The clean mental split is this: NVLink makes things fast inside the rack, and ConnectX-9 makes things fast between racks. Both have to be excellent or the cluster runs at the speed of its weakest link.
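The two highways are quoted in different units, which hides how far apart they are: NVLink in terabytes per second, ConnectX-9 in terabits per second. A short sketch that puts them on the same footing, using the per-GPU figures given above; the variable names are mine.

```python
GPUS_PER_RACK = 72
NVLINK_TBYTES_PER_S = 3.6    # scale-up, per GPU, terabytes per second
CONNECTX_TBITS_PER_S = 1.6   # scale-out, per GPU, terabits per second

# Aggregate NVLink bandwidth across the rack.
rack_nvlink_tbytes = GPUS_PER_RACK * NVLINK_TBYTES_PER_S

# Convert the scale-out figure to bytes: 8 bits per byte.
connectx_tbytes = CONNECTX_TBITS_PER_S / 8

# How much faster the in-rack link is than the between-rack link, per GPU.
scale_up_advantage = NVLINK_TBYTES_PER_S / connectx_tbytes

print(f"NVLink per rack: {rack_nvlink_tbytes:.1f} TB/s")
print(f"ConnectX-9 per GPU: {connectx_tbytes:.1f} TB/s")
print(f"scale-up is about {scale_up_advantage:.0f}x scale-out per GPU")
```

The aggregate of roughly 259 TB/s is what gets rounded to the 260 TB/s on the spec sheet. On these numbers the per-GPU gap between the tiers is about eighteen to one, which is why software tries hard to keep the chattiest traffic inside a single rack.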

05

BlueField-4 DPU

// the infrastructure workhorse

A DPU, or Data Processing Unit, is the data center's invisible custodian. Every AI cluster has a mountain of unglamorous chores: shuffling storage, managing the network, enforcing security boundaries between tenants, encrypting traffic. Run those on the expensive GPUs and CPUs and you are burning gold to do janitorial work. BlueField-4 takes all of it off their plate.

storage · networking · security and isolation · elastic scaling · integrated SSD for KV-cache

The new wrinkle for Rubin is that BlueField-4 can park the KV-cache on an integrated SSD. As models stretch to million-token contexts, they need a fast tier to hold their working memory, and putting it close to the network fabric rather than hogging precious HBM is one of those quiet architectural decisions that pays off enormously at scale.

06

Spectrum-X Ethernet Co-Packaged Optics

// networking, rebuilt around light

This is the most science-fiction piece in the set. Traditional network switches turn electricity into light using little pluggable optical modules, and at data-center scale those modules are a genuine plague: they burn power, run hot, and fail often enough to be a real reliability problem. Co-packaged optics takes the radical step of building the silicon photonics right next to the switch chip, generating the light at the source rather than at the end of a copper run.

5x power efficiency · 10x network resiliency · up to 5x more uptime

At the scale of a gigawatt AI factory, the power drawn by networking and the failures of optical links are real line items on the budget and the maintenance schedule. Co-packaged optics attacks both at once, moving data at the speed of light while spending dramatically less energy to do it. It is the kind of unglamorous infrastructure win that does not make headlines but quietly decides whether a build is economical.

07

Groq 3 LPU

// the wildcard, and a twenty-billion-dollar story

If one card on NVIDIA's own slide makes people do a double-take, it is this one, and the confusion is completely understandable, because Groq spent years as a rival. The explanation is one of the more dramatic corporate moves of the decade. On the 24th of December 2025, NVIDIA signed a deal with Groq worth roughly twenty billion dollars. Structurally it is a non-exclusive license combined with a team transfer, an "acqui-hire," which is precisely the structure that let it close in under four months without triggering a full merger review. Groq's founder Jonathan Ross, who originally designed Google's TPU, and its president Sunny Madra moved over to NVIDIA.

Huang reached for a familiar comparison on stage. Mellanox, he reminded everyone, turned NVIDIA from a chip vendor into a networking company. Groq, in the same telling, turns it from a training-first GPU vendor into a full-stack platform optimized for inference. It is a tidy narrative, and like most things Huang says from a stage, it doubles as a roadmap.

The asterisk worth keeping: This deal is contested, and it would be dishonest to present it as settled. A United States Senate inquiry opened in March 2026, led by Senators Warren and Blumenthal, alongside FTC interest, is examining whether the "reverse acqui-hire" structure was engineered specifically to dodge antitrust scrutiny. As of the middle of 2026 the deal stands and the chips are real, but the investigation is open, and the question of whether a company this dominant should be allowed to absorb its challengers this way is very much live.

What an LPU actually is, and why it had to exist

An LPU, or Language Processing Unit, closes the one gap a GPU cannot close on its own. The problem is specific and brutal. At very high token-generation speeds, north of a thousand tokens per second, even NVLink-connected GPU systems choke, not on compute but on memory bandwidth, because the data has to keep making the round trip to the HBM stacks at the edge of the package. The LPU's solution is radical to the point of seeming reckless: throw out the expensive HBM entirely and pour a vast pool of SRAM directly onto the die, right alongside the compute, so the data barely has to travel at all.

Figure 05
Why the LPU moves memory onto the chip
On a GPU, data sprints to HBM at the edge of the package and back. On an LPU, memory and compute are interleaved on the same die, so the round trip nearly vanishes. That is the whole idea.
[Diagram. GPU, memory lives off-chip: tensor-core compute flanked by HBM stacks at ~22 TB/s, superb until 1,000+ tok/s, then it jams. LPU, memory lives on the die: SRAM and compute interleaved at ~150 TB/s per chip, no off-chip trip, deterministic latency.]
per LPU: ~500 MB SRAM · per LPU: ~150 TB/s · LPX rack: 256 LPUs · LPX rack: 128 GB SRAM · LPX rack: 40 PB/s · Samsung 4nm
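The rack-level LPX figures are the per-chip figures multiplied by 256, and doing the multiplication exposes the trade the LPU makes. A sketch using only the approximate numbers listed above, plus the 288 GB per Rubin GPU quoted earlier for comparison.

```python
LPUS_PER_RACK = 256
SRAM_MB_PER_LPU = 500       # approximate, per chip
SRAM_TB_S_PER_LPU = 150     # approximate, per chip
HBM4_GB_PER_RUBIN_GPU = 288

rack_sram_gb = LPUS_PER_RACK * SRAM_MB_PER_LPU / 1000      # MB -> GB
rack_sram_pb_s = LPUS_PER_RACK * SRAM_TB_S_PER_LPU / 1000  # TB/s -> PB/s

print(f"LPX rack SRAM: {rack_sram_gb:.0f} GB")
print(f"LPX rack SRAM bandwidth: {rack_sram_pb_s:.1f} PB/s")
print(f"one Rubin GPU's HBM4 is {HBM4_GB_PER_RUBIN_GPU / rack_sram_gb:.2f}x "
      f"the SRAM of a whole LPX rack")
```

The bandwidth product comes to about 38 PB/s, which the spec list rounds to 40. The capacity line is the revealing one: an entire LPX rack holds less memory than a single Rubin GPU. The LPU buys its speed by giving up capacity, which is why it works alongside the GPU rather than replacing it.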

How the GPU and LPU split the work

The two chips do not compete inside the system; they divide the labor of generating an answer. The Rubin GPU handles the prefill, the KV-cache, and the heavy attention math, while the LPU takes the latency-sensitive feed-forward networks, the mixture-of-expert execution, and the pointwise operations. NVIDIA's claim for the pairing is striking: an LPX rack working alongside a Vera Rubin NVL72 delivers thirty-five times the inference throughput per megawatt of Blackwell. The LPX racks physically sit beside the Rubin racks, connected over Spectrum-X, and, in the line that is clearly aimed at reassuring nervous customers, none of it requires a single change to CUDA code.

05 / The eighth chip

Rubin CPX, and the art of not wasting your best machine

Beyond the seven, NVIDIA lists one more part as a GPU in its own right: Rubin CPX, the Context Phase aXcelerator. To understand why it exists, you have to know that answering a prompt is really two jobs, with completely different appetites, and using one expensive chip for both is like running a fine-dining kitchen where the same chef chops the onions and plates the dessert.

Phase 1
Prefill, the reading

Digesting the prompt, which today might be an entire codebase or a feature-length video, millions of tokens at once. This phase is compute-hungry but not especially latency-sensitive. The right tool is Rubin CPX.

Phase 2
Decode, the writing

Producing the answer one token at a time. This phase is bound by memory and latency, not raw compute. The right tools are the Rubin GPU and, at extreme speed, the LPU.

Figure 06
One answer, three specialists, in sequence
This is "disaggregated inference." Instead of one chip doing everything, the request flows down an assembly line where each stage runs on the silicon built for it.
[Diagram: your prompt (up to 1M tokens) → Rubin CPX, prefill: read and understand (128 GB GDDR7, 30 PF NVFP4) → Rubin GPU, attention + KV: the heavy math (288 GB HBM4, 22 TB/s) → Groq LPU, decode: generate fast (on-die SRAM, ~150 TB/s) → tokens out. Each stage runs on the chip designed for it, instead of one chip paying the price for all three.]

CPX delivers thirty PetaFLOPS of NVFP4, but its defining choice is to swap the expensive HBM4 for 128 GB of much cheaper GDDR7, the kind of memory you would find on a gaming card. For the prefill job, ingesting long sequences cheaply and without melting, that trade is exactly right, and CPX pairs it with hardware video decode and roughly three times the attention performance of GB300.
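The assembly line can be written down as a tiny routing table. This is a toy sketch of the idea, not how NVIDIA's serving software actually schedules work. The function name, the data layout and the use of 1,000 tokens per second as a hard cut-off are all assumptions, the last one borrowed from the "north of a thousand tokens per second" remark in the LPU section.

```python
# Rough speed above which this guide says GPU decode starts to choke.
LPU_THRESHOLD_TOKENS_PER_S = 1000.0


def plan_request(target_tokens_per_s: float) -> list[tuple[str, str]]:
    """Return (stage, device) pairs, in order, for one inference request."""
    # Decode stays on the Rubin GPU unless the caller needs extreme speed.
    if target_tokens_per_s > LPU_THRESHOLD_TOKENS_PER_S:
        decode_device = "Groq LPU"
    else:
        decode_device = "Rubin GPU"
    return [
        ("prefill", "Rubin CPX"),             # compute-hungry reading
        ("attention + KV-cache", "Rubin GPU"),  # bandwidth-heavy math
        ("decode", decode_device),            # latency-bound writing
    ]


for speed in (50.0, 2000.0):
    steps = " -> ".join(f"{stage} on {device}"
                        for stage, device in plan_request(speed))
    print(f"{speed:.0f} tok/s: {steps}")
```

The design choice worth noticing is that the request, not the chip, is the unit being optimized: each stage is handed to whichever device is cheapest for that stage's bottleneck.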

The idea in one sentence: Rubin is not a single do-everything chip. It is a toolbox: the Rubin GPU for general work, CPX for cheap long-context reading, the LPU for blistering token output. Pull these together and you get the Vera Rubin NVL144 CPX rack, which combines 144 Rubin GPUs (counted as dies here, so 72 packages), 144 CPX, and 36 Vera CPUs to reach roughly 8 ExaFLOPS of NVFP4, 100 TB of fast memory, and 1.7 PB/s of bandwidth. NVIDIA frames that as about seven and a half times a GB300 rack, which is the kind of number that gets a keynote audience to its feet.
06 / Form factors

Decoding the alphabet soup: MGX, NVL72, HGX, DGX, SuperPOD

These names trip up almost everyone, and they shouldn't, because they are simply different ways of putting the same chips into different boxes at different scales. Here is the clean separation, and then a picture that makes the hierarchy obvious.

Name | What it actually is | Scale | Who it is for
MGX | A modular reference design, a blueprint for building the rack with cable-free trays. Not a product you buy, a template partners build from. | A specification | The 80-plus OEM and system builders
NVL72 | The rack-scale system: 72 Rubin GPUs and 36 Vera CPUs, fully liquid-cooled, all-to-all NVLink. The scale-up domain made physical. | One rack | The all-in NVIDIA stack, with an Arm Vera host
HGX Rubin NVL8 | An eight-GPU baseboard for conventional x86 servers. The "keep one foot in the familiar world" option. | One server, 8 GPUs | Enterprises that still want x86 servers
DGX | NVIDIA's own branded, turnkey build of the above, pre-integrated, supported by NVIDIA itself, shipped with the full software stack. | One rack and up | Buyers who want it finished and warranted by NVIDIA
DGX SuperPOD | A pre-engineered cluster of many NVL72 racks plus networking, storage and software. As close as it gets to ordering a turnkey AI data center. | 8-plus racks | Gigascale and frontier AI labs
Figure 07
The same silicon, nested at six scales
Read it left to right. Each step is just a bundle of the step before it. Once you see this, the whole catalog stops being confusing.
[Diagram: chip (1 GPU) → superchip (1 Vera + 2 Rubin) → tray (2 superchips, MGX) → NVL72 rack (18 trays, 72 GPU · 36 CPU, = DGX if NVIDIA-built) → SuperPOD (8 racks, 576 GPU, + network, + storage) → AI factory (many SuperPODs, hyperscale). Multipliers between steps: x2, x2, x18, x8, xN.]
Mental model: The whole hierarchy is just nesting: chip into superchip into tray into NVL72 rack into SuperPOD into AI factory. "MGX" is the recipe everyone cooks from; "DGX" is NVIDIA cooking the meal for you and standing behind it; everything else is portion size.
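The nesting is easy to check with a few lines of arithmetic. The sketch below is illustrative only: the multipliers come from Figure 07, while the `TIERS` list and the `gpus_per_tier` function are hypothetical names introduced here, not NVIDIA terminology.

```python
# Each tier bundles a fixed number of the tier below it (multipliers from Figure 07).
# TIERS and gpus_per_tier are illustrative names, not NVIDIA terminology.
TIERS = [
    ("chip", 1),         # one Rubin GPU
    ("superchip", 2),    # 1 Vera CPU + 2 Rubin GPUs
    ("tray", 2),         # 2 superchips
    ("NVL72 rack", 18),  # 18 compute trays
    ("SuperPOD", 8),     # the common 8-rack configuration
]


def gpus_per_tier(tiers=TIERS):
    """Return {tier name: GPU count} by multiplying down the hierarchy."""
    counts = {}
    total = 1
    for name, multiplier in tiers:
        total *= multiplier
        counts[name] = total
    return counts


if __name__ == "__main__":
    for name, gpus in gpus_per_tier().items():
        print(f"{name:>11}: {gpus} GPUs")
```

Multiplying through gives 72 GPUs at the rack and 576 at an eight-rack SuperPOD, the same figures quoted throughout this guide. The AI-factory tier is left out because its multiplier is simply "however many SuperPODs you ordered."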
07 / What you actually buy ★

"I want to buy one Vera Rubin." One what, exactly?

This is the question that quietly trips up everyone arriving from the world of consumer hardware, and it is worth slowing down for. When you buy an RTX 4090, one unit means one GPU, a card you slide into a PC. AI-factory hardware refuses to play by that rule. The product is the system, or the rack, almost never the bare chip. Here is the purchasing ladder, from the rung you cannot actually stand on up to hyperscale.

A single Rubin GPU die (not sold on its own)

Unlike an RTX card, you cannot walk in and buy one Rubin GPU. NVIDIA sells chips to system builders, not to end users. As a rough sense of scale only, the implied value of one GPU inside a rack lands somewhere around fifty to seventy thousand dollars, but you will never see it on its own price tag.

A Vera Rubin Superchip (a component, not a finished product)

One Vera CPU plus two Rubin GPUs on a board. It is a building block sold to integrators. You do not buy "one superchip" as a thing that arrives in a box; you buy a system that happens to contain them.

1. An HGX Rubin NVL8 server (the smallest real system you can buy)

A server from Dell, Supermicro, HPE and the like, carrying an eight-GPU HGX baseboard on an x86 host. This is the closest thing to "a box of GPUs," and it is where most enterprises actually start.

2. A Vera Rubin NVL72 rack ★ (the default meaning of "one unit")

When someone in the data-center world says "a Vera Rubin," nine times out of ten they mean one NVL72 rack: 72 GPUs and 36 CPUs in a single liquid-cooled cabinet. You can buy a functionally equivalent rack from an OEM building on the MGX blueprint.

3. A DGX Vera Rubin NVL72 (the same rack, NVIDIA-branded and turnkey)

Or you buy the DGX version straight from NVIDIA: the same hardware, but integrated, supported and warranted by NVIDIA, arriving with the full software stack of DGX OS, Base Command and Mission Control.

4. A DGX SuperPOD (a turnkey AI supercomputer)

A pre-engineered cluster. A common configuration is eight NVL72 racks: 576 GPUs, 288 CPUs, roughly 600 TB of memory and about 28.8 ExaFLOPS of NVFP4, plus switches, storage, DPUs, cabling and software, ordered and delivered as a single unit.

5. Many SuperPODs, an AI factory (hyperscale)

Stack SuperPODs and you arrive at the hyperscale AI factory, the level at which Microsoft, Google, xAI, Meta and OpenAI actually operate. This is what Huang means when he says the unit of computing is now the data center.

The direct answer: "One Vera Rubin" is ambiguous, so the right reflex is always to ask "which tier?" But the default unit is one NVL72 rack. You generally cannot buy a single GPU die, the smallest practical purchase is an HGX server with eight GPUs, and there is no Rubin desktop card waiting for you. Rubin is rack-scale by design, because the problem it was built to solve does not fit in a tower under a desk.
08 / Example bills of quantities

What is actually inside "one unit"

Because the instinct from PC building is "one RTX 4090 equals one GPU," the most useful thing I can show you is what a "one-unit" purchase really contains at each tier. These are illustrative reference BOQs meant to build intuition, not quotes; exact quantities flex with configuration, and the prices are deliberately rough.

BOQ A, one HGX Rubin NVL8 server (the entry "box of GPUs")

# | Line item | Qty | Notes
1 | Liquid-cooled server chassis (OEM) | 1 | Dell, Supermicro, HPE and similar
2 | HGX Rubin NVL8 baseboard | 1 | carries the eight GPUs and on-board NVLink
3 | Rubin GPU | 8 | eight times 288 GB HBM4 is 2,304 GB of GPU memory
4 | x86 host CPU | 2 | dual-socket host
5 | System DRAM (DDR5) | ~2 to 4 TB | configuration-dependent
6 | ConnectX-9 SuperNIC | 8 to 9 | scale-out, roughly one per GPU
7 | BlueField-4 DPU | 1 to 2 | storage and infrastructure offload
8 | NVMe storage | multiple | local scratch and KV-cache tier
9 | Liquid-cooling manifold and PSUs | 1 set | direct-to-chip cooling

Use case: enterprise inference and mid-size training, mixed workloads, x86 compatibility. Indicative cost: several hundred thousand dollars. [2]

BOQ B, one Vera Rubin NVL72 rack (the standard rack-scale unit)

# | Line item | Qty | Notes
1 | Compute trays | 18 | each tray is 2 Vera CPU plus 4 Rubin GPU, that is 2 superchips
2 | Rubin GPU (total) | 72 | 72 times 288 GB is 20.7 TB HBM4, about 1,580 TB/s aggregate
3 | Vera CPU (total) | 36 | 3,168 Olympus cores, 54 TB LPDDR5X
4 | NVLink 6 switch trays | 9 | provides 260 TB/s all-to-all, the scale-up fabric
5 | NVLink spine and backplane | 1 | cable-free
6 | ConnectX-9 SuperNIC | up to 72 | scale-out fabric, configuration-dependent
7 | BlueField-4 DPU | ~18 | storage, security, KV-cache
8 | Power shelves and busbar | 1 set | roughly 150 to 190 kW class [3]
9 | CDU and liquid-cooling loop | 1 set | fanless trays, about twice Blackwell's flow
10 | Rack enclosure | 1 | around 1.5 tonnes fully populated

Use case: the "one unit" of frontier training and inference. Indicative cost: roughly four to five million dollars per rack; for reference, a GB200 NVL72 was widely reported at around three million. [2]
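The rack totals in BOQ A and BOQ B are just per-unit figures multiplied out, and it is worth seeing that they reconcile. The sketch below uses the per-GPU numbers quoted in this guide; the per-CPU values (88 cores and 1.5 TB per Vera) are my own back-calculation from the rack totals divided by 36, not figures stated above, and `system_totals` is a hypothetical helper name.

```python
# Per-GPU figures quoted in this guide (NVIDIA preliminary specifications).
HBM4_GB_PER_GPU = 288
HBM4_TBPS_PER_GPU = 22
NVLINK_TBPS_PER_GPU = 3.6
# Assumption: derived here by dividing the rack totals (3,168 cores, 54 TB) by 36 CPUs.
CORES_PER_VERA = 88
LPDDR5X_TB_PER_VERA = 1.5


def system_totals(gpus, veras=0):
    """Aggregate memory, bandwidth and core counts for a system of the given size."""
    return {
        "hbm4_gb": gpus * HBM4_GB_PER_GPU,
        "hbm4_tbps": gpus * HBM4_TBPS_PER_GPU,
        "nvlink_tbps": round(gpus * NVLINK_TBPS_PER_GPU, 1),
        "vera_cores": veras * CORES_PER_VERA,
        "lpddr5x_tb": veras * LPDDR5X_TB_PER_VERA,
    }


if __name__ == "__main__":
    print("HGX NVL8  :", system_totals(gpus=8))
    print("NVL72 rack:", system_totals(gpus=72, veras=36))
```

For the rack this yields 20,736 GB of HBM4, 1,584 TB/s of aggregate memory bandwidth and 259.2 TB/s of NVLink, which round to the 20.7 TB, roughly 1,580 TB/s and 260 TB/s shown in the table.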

BOQ C, one DGX SuperPOD (Vera Rubin) (turnkey AI supercomputer)

# | Line item | Qty | Notes
1 | DGX Vera Rubin NVL72 racks | 8 | 576 GPUs, 288 CPUs, ~600 TB memory, 28.8 EF NVFP4
2 | Quantum-X800 InfiniBand or Spectrum-X Ethernet switches | set | the scale-out fabric between racks
3 | Spectrum-X co-packaged-optics switches | set | silicon-photonics option
4 | Storage nodes (context and KV-cache tier) | multiple | high-speed model and data storage
5 | Management and head nodes | multiple | orchestration
6 | Software stack | included | Base Command, Mission Control, AI Enterprise, Run:ai
7 | Integration, cabling, CDUs, support | included | turnkey delivery
8 | Optional Groq 3 LPX inference racks | add-on | 256 LPUs each, sit beside the racks via Spectrum-X

Use case: gigascale training of frontier models, ordered as a single AI supercomputer. Indicative cost: tens of millions of dollars. [2]
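The SuperPOD line and the rough prices quoted in these BOQs hang together arithmetically, which a short sketch makes visible. Everything here is multiplication of figures already given in this guide; the four-to-five-million-dollar rack price is the same rough, non-official range from BOQ B, and the function names are hypothetical.

```python
RACKS_PER_SUPERPOD = 8
GPUS_PER_RACK = 72
CPUS_PER_RACK = 36
NVFP4_EF_PER_RACK = 3.6
HBM4_TB_PER_RACK = 20.7
LPDDR5X_TB_PER_RACK = 54
# Rough, non-official rack price range from BOQ B, in US dollars.
RACK_PRICE_USD = (4_000_000, 5_000_000)


def superpod_totals(racks=RACKS_PER_SUPERPOD):
    """Compute totals for a SuperPOD built from identical NVL72 racks."""
    return {
        "gpus": racks * GPUS_PER_RACK,
        "cpus": racks * CPUS_PER_RACK,
        "nvfp4_ef": round(racks * NVFP4_EF_PER_RACK, 1),
        "memory_tb": round(racks * (HBM4_TB_PER_RACK + LPDDR5X_TB_PER_RACK), 1),
    }


def implied_price_per_gpu(rack_price=RACK_PRICE_USD):
    """Rack price spread evenly over its GPUs: a sense of scale, not a real price."""
    low, high = rack_price
    return (round(low / GPUS_PER_RACK), round(high / GPUS_PER_RACK))


def rack_hardware_cost(racks=RACKS_PER_SUPERPOD, rack_price=RACK_PRICE_USD):
    """Racks only: excludes switches, storage, software and integration."""
    low, high = rack_price
    return (racks * low, racks * high)


if __name__ == "__main__":
    print(superpod_totals())
    print(implied_price_per_gpu())
    print(rack_hardware_cost())
```

Eight racks come to 576 GPUs, 288 CPUs, 28.8 EF and just under 600 TB of combined HBM4 and LPDDR5X. Dividing the rack price by 72 lands in the fifty-to-seventy-thousand-dollar range mentioned earlier for one GPU, and eight racks alone already reach the "tens of millions" tier before any networking or storage is added.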

[1] Early HBM4 units may ship below the 22 TB/s target before the supply chain ramps.
[2] Prices are rough, non-official, order-of-magnitude figures for intuition only; real pricing swings enormously with configuration and contract.
[3] Rubin pushes rack power density past Blackwell; exact figures are not yet final.
09 / Performance, with the marketing removed

GB200 vs GB300 vs Rubin, honestly

NVIDIA's slides love a big multiplier, so this section sticks to numbers you can defend and flags the ones that mix conventions. Two charts first, because two of the generational jumps are clean and unambiguous, and then the full tables.

Figure 08
Memory bandwidth per GPU, the wall everyone hits
This is the number that matters most in the inference era, and it is the one Rubin moves the furthest. HBM4 is the headline upgrade.
H100 (Hopper): 3.35 TB/s
B200 / B300 (Blackwell): 8 TB/s
Rubin R200 (2026): 22 TB/s
Figure 09
Dense FP4 compute per GPU
Apples to apples, all dense figures, no sparsity tricks. The generational climb from Blackwell to Blackwell Ultra to Rubin is real and steep.
B200: 9 PF
B300: 15 PF
Rubin R200: 35 PF dense

Per-GPU comparison

// Blackwell B200, then Blackwell Ultra B300, then Rubin R200. Dense figures unless noted.
Spec | B200 (in GB200) | B300 (in GB300) | Rubin R200
Process | TSMC 4NP | TSMC 4NP | TSMC N3P (3nm)
Transistors | 208 B | 208 B | 336 B
Die structure | 2 dies | 2 dies | 2 compute + 2 I/O
Memory | ~180 to 192 GB HBM3e | 288 GB HBM3e | 288 GB HBM4
Bandwidth | 8 TB/s | 8 TB/s | 22 TB/s
Dense FP4 | 9 PF | 15 PF | 35 PF (50 PF inference with Transformer Engine)
Dense FP8 | 4.5 PF | 5 PF | 17.5 PF
NVLink per GPU | 1.8 TB/s (v5) | 1.8 TB/s (v5) | 3.6 TB/s (v6)
TDP | ~1,000 W | ~1,400 W | ~1,800 to 2,300 W
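To see how far each generation moves, it helps to turn the per-GPU table into ratios. The sketch below uses only the bandwidth and dense-FP4 rows; the `bytes_per_flop` measure (memory bandwidth divided by dense FP4 throughput) is a simple balance metric I am adding for illustration, not something NVIDIA publishes, and the names are hypothetical.

```python
# Dense per-GPU figures from the per-GPU comparison table (TB/s and PetaFLOPS).
GPUS = {
    "B200": {"bandwidth_tbps": 8, "fp4_pf": 9},
    "B300": {"bandwidth_tbps": 8, "fp4_pf": 15},
    "Rubin R200": {"bandwidth_tbps": 22, "fp4_pf": 35},
}


def ratio(new, old, key):
    """How many times larger `new` is than `old` on the given spec."""
    return GPUS[new][key] / GPUS[old][key]


def bytes_per_flop(name):
    """Memory bytes deliverable per dense FP4 operation (TB/s over PFLOP/s)."""
    gpu = GPUS[name]
    return (gpu["bandwidth_tbps"] * 1e12) / (gpu["fp4_pf"] * 1e15)


if __name__ == "__main__":
    for old in ("B200", "B300"):
        bw = ratio("Rubin R200", old, "bandwidth_tbps")
        fp4 = ratio("Rubin R200", old, "fp4_pf")
        print(f"Rubin vs {old}: {bw:.2f}x bandwidth, {fp4:.2f}x dense FP4")
    for name in GPUS:
        print(f"{name}: {bytes_per_flop(name):.6f} bytes per FP4 op")
```

On these figures Rubin delivers 2.75 times the bandwidth of either Blackwell part, against about 2.33 times the dense FP4 of B300 and about 3.89 times that of B200. Bandwidth therefore grows faster than compute relative to B300, which fits the guide's emphasis on HBM4 as the headline upgrade.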

Per-rack comparison

// rack-scale systems, with the dense-versus-sparse caveat noted below
Spec | GB200 NVL72 | GB300 NVL72 | Vera Rubin NVL72 | VR NVL144 CPX
GPUs | 72 B200 | 72 B300 | 72 Rubin | 144 Rubin + 144 CPX
CPUs | 36 Grace | 36 Grace | 36 Vera | 36 Vera
FP4 inference | ~1.4 EF | ~1.1 EF dense | 3.6 EF | 8 EF
HBM capacity | ~13.4 TB HBM3e | 20.7 TB HBM3e | 20.7 TB HBM4 | 100 TB fast mem
HBM bandwidth | ~576 TB/s | ~576 TB/s | 1,580 TB/s | 1.7 PB/s
NVLink per rack | ~130 TB/s | ~130 TB/s | 260 TB/s | 260 TB/s
vs prior gen | baseline | 1.5x GB200 | ~3.3x GB300 | ~7.5x GB300

The two economic claims NVIDIA actually leads with

For all the spec tables, NVIDIA knows the buyers care about two numbers above all, because those are the ones that show up in a profit-and-loss statement. Both are big, and both deserve a raised eyebrow until independent benchmarks land.

Training
A quarter of the GPUs

NVIDIA says Rubin can train large mixture-of-experts models with roughly one-quarter the number of GPUs that Blackwell needed. If it holds, that is a direct cut to both the capital bill and the power bill.

Inference
A tenth of the cost per token

For deep-reasoning agentic workloads, Rubin targets about one-tenth the cost per million tokens versus Blackwell. Token economics, not raw FLOPS, is the real battlefield of the inference era, and this is the number aimed at it.

On dense versus sparse: GB200's ~1.4 EF FP4 is a sparse figure, while GB300's ~1.1 EF is dense, which is why a naive reading makes the newer rack look slower. Rubin's 3.6 EF is the Transformer-Engine-boosted inference number, and NVIDIA's multipliers (3.3x, 7.5x) compare that against GB300. Treat all cross-generation FP4 comparisons as directional rather than exact, and trust the bandwidth and dense-FP4 charts above more than any headline multiplier.
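One way to make the comparison honest is to rebuild every rack figure from the dense per-GPU numbers and see which multipliers survive. The sketch below does that with the per-GPU figures from this guide (9, 15 and 35 PF dense, 50 PF for Rubin with Transformer Engine); the variable and function names are hypothetical.

```python
GPUS_PER_RACK = 72
# Dense FP4 per GPU in PetaFLOPS, from the per-GPU comparison table.
DENSE_FP4_PF = {"GB200 NVL72": 9, "GB300 NVL72": 15, "Vera Rubin NVL72": 35}
# Rubin's Transformer-Engine-boosted inference figure per GPU.
RUBIN_TE_INFERENCE_PF = 50


def rack_ef(pf_per_gpu, gpus=GPUS_PER_RACK):
    """Convert a per-GPU PetaFLOPS figure into rack-level ExaFLOPS."""
    return pf_per_gpu * gpus / 1000


dense = {name: rack_ef(pf) for name, pf in DENSE_FP4_PF.items()}
rubin_headline_ef = rack_ef(RUBIN_TE_INFERENCE_PF)

# NVIDIA-style multiplier: boosted Rubin figure against dense GB300.
headline_multiplier = rubin_headline_ef / dense["GB300 NVL72"]
# Like-for-like multiplier: dense against dense.
dense_multiplier = dense["Vera Rubin NVL72"] / dense["GB300 NVL72"]

if __name__ == "__main__":
    for name, ef in dense.items():
        print(f"{name}: {ef:.2f} EF dense")
    print(f"Rubin headline: {rubin_headline_ef:.1f} EF")
    print(f"headline {headline_multiplier:.2f}x vs dense-only {dense_multiplier:.2f}x")
```

Rebuilt this way, GB300 comes out at about 1.08 EF dense (the table's ~1.1 EF) and Rubin at 3.6 EF boosted, which reproduces the ~3.3x headline. Dense against dense, the same step is closer to 2.3x, and GB200 sits near 0.65 EF dense, a little under half of its quoted sparse figure.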
10 / Terminology

Glossary

Every key term in one place.

CUDA (platform)
NVIDIA's parallel-computing software model and its CUDA-X library stack. The dominant AI software ecosystem and the company's deepest moat. Rubin runs CUDA 13.
CUDA core
A general-purpose math unit in each SM, handling FP32, FP64 and integers. The basis for HPC, simulation and all the logic that surrounds the AI.
Tensor Core
A specialized core that performs matrix-multiply-accumulate at low precision. The source of the big PetaFLOPS numbers and the engine of roughly ninety percent of deep learning.
SM (Streaming Multiprocessor)
The basic building block of an NVIDIA GPU, bundling CUDA cores, Tensor Cores, schedulers and caches. The Rubin GPU has 224.
Transformer Engine
A hardware-and-software layer that automatically manages precision per layer to reach FP4 speed at close to FP8 accuracy. Rubin uses the third generation, with adaptive compression.
NVFP4
NVIDIA's 4-bit floating-point format with micro-block scaling. Best for low-cost, high-throughput inference and the training of the largest models.
FP8 / NVFP8
8-bit floating point. The safe default for training and for inference where quality cannot slip, thanks to more range than FP4.
FP16 / BF16 / FP32 / FP64
Higher-precision formats: FP16 and BF16 for mixed-precision training, FP32 where numerical stability matters, and FP64 for scientific and HPC work.
HBM4
High Bandwidth Memory, fourth generation, stacked DRAM beside the GPU die. 288 GB at 22 TB/s on Rubin, doubling the interface width of HBM3e.
GDDR7
Cheaper, cooler graphics memory with no advanced packaging. Used on Rubin CPX (128 GB) for cost-efficient long-context work.
SRAM
Ultra-fast on-die memory. The LPU's defining bet: a huge SRAM pool on the chip itself instead of off-chip HBM, for extreme bandwidth at low latency.
NVLink / NVLink Switch
The in-rack, all-to-all GPU interconnect, the scale-up fabric. NVLink 6 runs at 3.6 TB/s per GPU and 260 TB/s per rack.
NVLink-C2C
The chip-to-chip coherent link between CPU and GPU inside a superchip, 1.8 TB/s on Vera Rubin.
RDMA
Remote Direct Memory Access, reading or writing another machine's memory without involving its CPU. Core to SuperNIC scale-out.
SuperNIC
A smart network card (ConnectX-9) tuned for GPU-to-GPU traffic between racks, the scale-out fabric, at 1.6 Tb/s per GPU.
DPU
Data Processing Unit (BlueField-4), offloading storage, networking, security and KV-cache from the CPU and GPU.
CPO / silicon photonics
Co-Packaged Optics, light generated right next to the switch chip. 5x power efficiency and 10x resiliency versus pluggable optics.
KV-cache
Key-Value cache, the model's working memory of the conversation so far. Central to long-context inference and able to live on a BlueField-4 SSD.
MoE
Mixture-of-Experts, a model that routes each token to a subset of expert sub-networks. Rubin and the LPU accelerate expert execution.
Prefill / decode
The two phases of inference. Prefill reads the prompt and is compute-heavy (CPX); decode generates tokens and is memory-bound (GPU and LPU).
Scale-up / scale-out
Scale-up links GPUs within a rack (NVLink). Scale-out links racks across a data center (Ethernet or InfiniBand).
Chiplet / CoWoS-L
A chiplet is one of several smaller dies combined in a single package. CoWoS-L is TSMC's advanced 2.5D packaging that stitches those dies and the HBM onto one carrier.
MGX
NVIDIA's modular rack reference design, the blueprint. Third-generation MGX underpins Vera Rubin NVL72, with cable-free trays and 80-plus partners.
NVL72
The rack-scale system, 72 Rubin GPUs and 36 Vera CPUs, liquid-cooled with all-to-all NVLink. The default "one unit" of rack-scale AI.
HGX
An eight-GPU baseboard for x86 servers (HGX Rubin NVL8). The traditional-server entry point.
DGX
NVIDIA's own branded, turnkey, fully supported system line (DGX Vera Rubin NVL72, DGX SuperPOD).
SuperPOD
A pre-engineered cluster of many NVL72 racks plus networking, storage and software. Eight racks make 576 GPUs and about 28.8 ExaFLOPS.
LPU
Language Processing Unit (Groq 3), an SRAM-based inference accelerator for ultra-fast token decode, from NVIDIA's twenty-billion-dollar Groq deal.
Superchip
One Vera CPU plus two Rubin GPUs on one board. The building block, with thirty-six per NVL72 rack.
AFD (Attention-FFN Disaggregation)
Splitting inference so the Rubin GPUs handle attention and prefill while LPUs handle feed-forward and MoE, the Rubin-plus-LPU teamwork model.
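The NVFP4 entry above mentions micro-block scaling without showing what that means in practice. The sketch below illustrates the general idea of block-scaled 4-bit quantization in plain Python. It is not NVIDIA's implementation: the block length of 16, the E2M1 value grid and the unrounded floating-point scale are assumptions made for illustration, and every function name is hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (assumed grid for this sketch).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed micro-block length


def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, then snap to the grid."""
    peak = max(abs(x) for x in block)
    scale = peak / FP4_GRID[-1] if peak > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(FP4_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values by multiplying each code by the block scale."""
    return [scale * c for c in codes]


def quantize(values, block_size=BLOCK_SIZE):
    """Quantize a flat list block by block, storing one scale per block."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


if __name__ == "__main__":
    scale, codes = quantize_block([0.0, 3.0, -6.0, 12.0])
    print(scale, codes, dequantize_block(scale, codes))
```

The point of the per-block scale is that a single large value only costs precision within its own small block rather than across the whole tensor, which is what lets a 4-bit format stay usable for inference.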