NVIDIA Vera Rubin NVL72: The Field Guide

01 / The house that Jensen built

A name for measuring invisible things

NVIDIA names its architectures after scientists, and the choices are rarely accidental. Pascal, Volta, Turing, Ampere, Hopper, Blackwell. This generation belongs to Vera Rubin, the American astronomer whose painstaking measurements of how galaxies rotate produced the most convincing evidence we have for dark matter, the unseen mass that holds the cosmos together. There is a quiet poetry in attaching her name to a machine built for inference, the act of drawing conclusions from what a model cannot directly see.

And then, as has become tradition, NVIDIA split the name across the two halves of the system, which tells you the entire design philosophy before you read a single specification.

The GPU

Rubin

The surname goes to the GPU, the engine that does the punishing AI mathematics. When people say "a Rubin," this is the silicon they picture.

The CPU

Vera

The first name goes to a brand new CPU, the data-traffic controller and coordinator. It is the successor to "Grace," and it marks NVIDIA's return to designing its own processor cores from scratch.

So "Vera Rubin" is not two products bolted together at the last minute. It is a CPU and a GPU drawn up on the same whiteboard, intended from day one to behave as a single organism.

From a Denny's booth to the most valuable company on Earth

To understand why a new NVIDIA rack now moves global stock markets, you have to understand how improbable the company's position is. NVIDIA was founded in 1993 in a roadside Denny's in San Jose by Jensen Huang and two engineers, with a plan to accelerate computer graphics. It nearly died in its first years when an early chip flopped, and survived on a single rescue product. For most of its life it was, to the wider world, the company that made your video games run smoothly.

The pivot that created today's giant was a bet almost nobody else was willing to make. Around 2006, NVIDIA began pouring money and engineers into CUDA, a way to use graphics chips for general-purpose computing. For years this looked like an expensive distraction; analysts asked why a gaming company was funding a research-computing platform with no obvious customer. Huang kept funding it anyway. When deep learning arrived a few years later and researchers discovered that GPUs were almost magically suited to training neural networks, NVIDIA was the only company with years of software already in the ground. That patience is the moat. Rivals can copy the silicon; they cannot copy twenty years of libraries and the millions of developers who learned on them.

The rest reads like a coronation. The 2020 acquisition of Mellanox turned NVIDIA from a chip vendor into a networking company. The arrival of ChatGPT turned its data-center GPUs into the most fought-over hardware on the planet. By the mid-2020s NVIDIA had passed Apple and Microsoft to become the most valuable public company in the world, selling the picks and shovels for the entire AI gold rush and holding so much of the AI-accelerator market that competitors still measure their own shares in single digits.

Huang's recurring line at these launches, delivered with a showman's grin, is "the more you buy, the more you save." It is a joke. It is also, for the hyperscalers racing each other, a description of how they actually think.

The texture matters too, because Huang has made it part of the product. The black leather jacket he has worn for decades is now as recognizable as the logo. He runs the company famously flat, with dozens of executives reporting to him directly. In 2016 he personally carried the first DGX-1 AI supercomputer to a young lab called OpenAI and signed its chassis with a dedication to the future of computing and humanity, a piece of theater that has aged into legend. The annual GTC keynote, where Blackwell was held aloft in 2024 and the Rubin roadmap was laid out in 2025, is less a product briefing than a stadium event. None of this is incidental. It is how a components company convinced the world it is building the future, one rack at a time.

The cadence: a new architecture every year

Here is the lineage that explains why Rubin matters. NVIDIA has compressed what used to be a multi-year chip cycle into an annual drumbeat, and each beat has had a distinct personality.

Year	Architecture	Flagship	What it really meant
2022 to 2023	Hopper	H100 / H200	The chip the whole world fought over in the ChatGPT gold rush.
2024	Blackwell	B200 / GB200 NVL72	NVIDIA stops selling a chip and starts selling a whole rack as one computer.
2025	Blackwell Ultra	GB300 NVL72	Retuned for "reasoning" AI that thinks before it answers.
2026	Rubin	VR200 / Vera Rubin NVL72	Built to run AI cheaply, for billions of users and AI agents. (this guide)
2027	Rubin Ultra	NVL576	Doubles the rack to 144 packages (576 GPU dies).
~2028	Feynman	on the map	Named, scheduled, and still under wraps.

What each leap actually meant

Strip away the codenames and there is a single arc running through all of it: the industry moved from the problem of building giant models to the far larger problem of running them affordably for everyone, forever. Read top to bottom, this is that story.

2022 to 2023 · Hopper

H100, the gold-rush chip · Training era

When ChatGPT detonated, every company on Earth suddenly needed to train AI, and the H100 was the one chip that could really do it. Demand went vertical, supply evaporated, and the phrase "GPU shortage" entered ordinary conversation. The whole modern boom was built on these. NVIDIA's valuation crossed a trillion dollars on the strength of them.

2024 · Blackwell

GB200, when they stopped selling chips and started selling computers · Training era

Frontier models grew too large to live on a single chip, so Huang changed what NVIDIA sells. Instead of a card you slot into a server, Blackwell's flagship is an entire rack: seventy-two GPUs wired so tightly they behave like one colossal brain. This is the moment "AI factory," his favorite phrase, stopped being a metaphor and became a product category.

2025 · Blackwell Ultra

GB300, tuned for AI that "thinks" · The shift begins

A new species of model appeared, the kind that reasons step by step before it answers, the source of that "thinking" pause you now see in chatbots. That style burns enormous compute at answer time, not just at training time. GB300 was a mid-cycle tune-up aimed squarely at this reasoning workload, a hint of where the center of gravity was about to move.

2026 · Rubin ← you are here

VR200, making AI cheap enough for everyone · Inference era

The hard problem flipped completely. The challenge is no longer training a model once. It is serving that model to billions of people and to armies of autonomous agents, affordably, every second of every day. Rubin is engineered for exactly this. NVIDIA is targeting roughly a tenth of the cost per answer compared with Blackwell, and it has added the new Vera CPU and the Groq LPU specifically to keep agentic workloads fast and cheap.

2027 · Rubin Ultra

NVL576, double the rack · Roadmap

The same idea, scaled. The rack roughly doubles to 144 packages (576 GPU dies), readying NVIDIA for the next jump in both model size and global user volume.

~2028 · Feynman

The next chapter · Roadmap

Named for the physicist Richard Feynman. It exists on Huang's roadmap slide, which by now is enough to make suppliers plan around it, but the details are still locked away.

The one takeaway

Hopper and Blackwell answered "how do we build these models?" Rubin answers a different and much bigger question: "how do we run them for everyone without going broke?" That pivot, from training to cheap inference and AI agents, is the entire reason Rubin exists.

Naming gotcha

From Rubin onward, NVIDIA counts GPU dies, not GPU packages, as "GPUs." Each Rubin package holds two compute dies. That is why the same rack appears as "NVL72" (72 packages) in the final product name and as "NVL144" (144 dies) in earlier slides. Same cabinet, two ways of counting, and a small masterclass in how a bigger number gets onto a keynote slide.
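
The two ways of counting can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in this guide; the function name is illustrative, and the four dies per Rubin Ultra package is inferred from the roadmap's own numbers (576 dies across 144 packages) rather than stated outright.

```python
def die_count(packages: int, dies_per_package: int) -> int:
    """Number of GPU dies in a rack with the given package count."""
    return packages * dies_per_package


# Rubin: 72 packages with two compute dies each.
rubin_dies = die_count(packages=72, dies_per_package=2)

# Rubin Ultra: 144 packages; four dies per package is inferred from 576 / 144.
rubin_ultra_dies = die_count(packages=144, dies_per_package=576 // 144)

print(f"NVL72 rack: 72 packages, {rubin_dies} dies")
print(f"Rubin Ultra rack: 144 packages, {rubin_ultra_dies} dies")
```

Whichever number appears in a product name, the package count is the one that maps to physical parts in the rack.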

02 / The big picture

It is not a server. It is a building that thinks.

The single hardest adjustment for anyone coming from PCs is this: a consumer GPU lives inside one machine, but NVIDIA's unit of measure is now an entire rack, a refrigerator-sized cabinet that behaves as one computer assembled from roughly 1,300 chips. Once you accept that the rack is the product, two interconnect concepts unlock everything else, and they exist because moving data is now harder than computing on it.

Inside the rack

Scale-up

Wiring every GPU within a single rack so they act like one enormous GPU. The hero here is NVLink, a private highway that runs at close to memory speed. This is the part nobody else can match at scale.

Between racks

Scale-out

Linking many racks into a full data center. The heroes are the networking chips: ConnectX, Spectrum-X, Quantum-X InfiniBand. Think of this as the public motorway between buildings, fast but a tier below NVLink.

Figure 01

Two highways, two completely different jobs

Scale-up keeps a rack's GPUs glued together at memory speed. Scale-out stitches racks into a data center. Confusing the two is the most common mistake newcomers make.

It helps to see the physical object. A Vera Rubin NVL72 is not abstract. It is a cabinet you could touch, if you were allowed near the cooling loop, weighing on the order of one and a half metric tons and drinking power at a rate that would trip the breakers of a small office block. Here is what is inside it, top to bottom.

Figure 02

Anatomy of one NVL72 rack

Eighteen compute trays do the thinking, nine switch trays in the middle keep them glued together, and the whole thing is fed by liquid, not air.

That is the punchline of the whole platform in one image. The thinking happens in the green compute trays; the brighter switch trays in the middle are the connective tissue that lets all seventy-two GPUs pretend to be one; and the orange and teal bands at top and bottom are the unglamorous truth of modern AI, which is that the real engineering challenge is now feeding the thing power and carrying away its heat.

03 / CUDA, cores and number formats

"Does it have CUDA?" is almost a funny question

It is a fair thing to ask, and the honest answer is that asking whether a modern NVIDIA GPU "has CUDA" is a little like asking whether a Catholic cathedral has religion. CUDA is not a feature of the hardware. It is the gravity well that the entire AI industry now orbits, and it exists in two distinct forms that Rubin carries at once.

The software

CUDA, the platform

NVIDIA's programming model and its two-decade stack of libraries, CUDA-X. This is the real moat, the thing AMD and Intel keep failing to dislodge. Rubin runs CUDA 13, and the new Groq chips were deliberately wired in so that no CUDA code has to change.

The hardware

CUDA cores

The small general-purpose math units inside every NVIDIA GPU. They handle ordinary parallel arithmetic in FP32, FP64 and integer formats, the foundation for scientific computing and all the logic that surrounds the AI itself.

Figure 03

Why the moat is the software, not the silicon

A competitor can fabricate a fast chip. What they cannot conjure overnight is twenty years of libraries and the millions of engineers who learned on them. Every layer above the silicon is locked to NVIDIA.

Two kinds of cores: the generalists and the specialists

Inside each GPU, compute is grouped into SMs, or Streaming Multiprocessors, and the Rubin GPU has 224 of them. Picture an SM as a workshop containing two very different teams. There is no single "AI core" doing everything; the magic is in how these two teams divide the labor.

Core type	What it does	What it is for	Think of it as
CUDA core	General-purpose parallel math (FP32, FP64, integers)	HPC, physics, simulation, and the connective logic around the AI	A bench of versatile generalists
Tensor Core	Matrix-multiply-accumulate at low precision (FP4, FP8, FP16)	The roughly ninety percent of deep learning that is matrix math. This is where the headline PetaFLOPS come from.	Specialist robots that do one thing at terrifying speed

Rubin's 224 SMs carry the latest generation of Tensor Cores, which NVIDIA documents as fifth-generation and tunes heavily for NVFP4 and FP8, alongside the general-purpose CUDA cores. For data-center parts NVIDIA leads its marketing with Tensor throughput rather than a raw CUDA-core count, and that choice is itself the story: for AI, the matrix engines are what matter, and the specialists have quietly become the main event.
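
For readers who have not met the term, "matrix-multiply-accumulate" is one operation, D = A × B + C, applied to small tiles of a much larger matrix. The sketch below shows that operation in plain Python on a 2×2 tile. It illustrates the arithmetic only; it is not how a Tensor Core is programmed, and the function name is made up for this example.

```python
def mma(a, b, c):
    """Return a x b + c for small row-major matrices given as lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, 0.0], [0.0, 0.5]]
c = [[1.0, 1.0], [1.0, 1.0]]  # the running accumulator
d = mma(a, b, c)
print(d)
```

A Tensor Core performs this fused step on whole tiles at once, typically with low-precision inputs, which is why it outruns general-purpose cores on the same work.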

The Transformer Engine, the part that feels like cheating

Sitting on top of the Tensor Cores is a layer of hardware and software called the Transformer Engine, and it is one of NVIDIA's cleverest tricks. As a model runs, it watches the numbers flowing through each layer and decides, on the fly, which values can be safely crushed down to ultra-low precision and which need a little more room. The result is the holy grail of inference economics: something close to FP4 speed with close to FP8 accuracy. Rubin's third-generation engine adds an adaptive, two-level scaling scheme and retires the older "structured sparsity" trick from previous generations. This is how the same chip can quote 35 dense PetaFLOPS of NVFP4 yet reach an effective 50 PetaFLOPS for real inference.
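
To make "two-level scaling" less abstract, here is a toy sketch of the general idea: one scale for the whole tensor, then one scale per small block, with each value snapped to a 4-bit grid. Everything in it is an assumption made for illustration, including the block size of 16, the function names, and the use of the eight E2M1 magnitudes as the grid. It is not NVIDIA's implementation, which runs in hardware and stores its scales in compact formats.

```python
# Magnitudes representable by a 4-bit E2M1 float; the sign is a separate bit.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = FP4_GRID[-1]


def snap_to_fp4(x: float) -> float:
    """Round x to the nearest grid magnitude, keeping its sign."""
    nearest = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return nearest if x >= 0 else -nearest


def quantize(tensor, block_size=16):
    """Level 1: one scale for the tensor. Level 2: one scale per block."""
    tensor_scale = max(abs(v) for v in tensor) or 1.0
    blocks = []
    for start in range(0, len(tensor), block_size):
        block = [v / tensor_scale for v in tensor[start:start + block_size]]
        # Map the largest value in the block onto the top of the FP4 grid.
        block_scale = (max(abs(v) for v in block) / FP4_MAX) or 1.0
        codes = [snap_to_fp4(v / block_scale) for v in block]
        blocks.append((block_scale, codes))
    return tensor_scale, blocks


def dequantize(tensor_scale, blocks):
    """Undo both scales to recover approximate original values."""
    return [code * block_scale * tensor_scale
            for block_scale, codes in blocks for code in codes]


# One block of small values next to a block containing a large outlier.
values = [0.01 * i for i in range(16)] + [100.0] + [0.0] * 15
restored = dequantize(*quantize(values))
worst_error = max(abs(a - b) for a, b in zip(values, restored))
print(f"worst absolute error: {worst_error:.4f}")
```

With a single scale for the whole tensor, the outlier of 100 would force every small value in the first block to round to zero. The per-block scale is what lets the small values survive.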

Number formats, and the only rule you need

Precision simply means how many bits you spend to represent each number. Fewer bits run faster and use less memory but carry less accuracy, so the entire discipline of modern AI hardware is a hunt for the lowest precision you can get away with without the model falling apart.

// lower precision buys speed and memory; higher precision buys accuracy
Format	Bits	Primary use	Why you would reach for it
NVFP4	4	Inference at massive scale, and increasingly the training of the very largest models	The cheapest possible cost per token. The Transformer Engine keeps the accuracy honest.
FP8 / NVFP8	8	Training, and inference where quality cannot slip	More numerical range than FP4, the safe default when stability matters.
FP16 / BF16	16	Sensitive training layers, gradients, mixed precision	Stability where FP8 would drift. BF16 trades precision for range.
TF32	~19	A near drop-in accelerator for FP32 work	Faster than FP32 with almost no code change.
FP32 / FP64	32 / 64	Scientific computing, simulation, HPC	Full precision where being wrong is not an option. This runs on the CUDA cores.

The only rule you need

FP4 runs the model cheaply. FP8 trains the model. FP16 and BF16 keep training from blowing up. FP32 and FP64 do science. Rubin is bent, hard, toward the first two, because that is where the money in the inference era actually is.
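
The memory side of that trade is easy to put numbers on. The sketch below computes how much storage a model's weights need at each bit width. The 70-billion-parameter model is a hypothetical chosen for round numbers, and the calculation ignores the small overhead of scale factors that low-precision formats carry.

```python
# Bits per stored value for a selection of the formats in the table above.
BITS_PER_VALUE = {"NVFP4": 4, "FP8": 8, "BF16": 16, "FP32": 32}


def weight_footprint_gb(parameters: float, fmt: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * BITS_PER_VALUE[fmt] / 8 / 1e9


PARAMETERS = 70e9  # hypothetical model size, for illustration only

footprints = {fmt: weight_footprint_gb(PARAMETERS, fmt) for fmt in BITS_PER_VALUE}
for fmt, gigabytes in footprints.items():
    print(f"{fmt:>5}: {gigabytes:6.1f} GB")
```

Each halving of precision halves the memory the weights occupy and the bytes that must be moved for every token generated.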

04 / The seven chips, one by one

Seven new chips, one supercomputer

"Seven new chips, one AI supercomputer" is how Huang framed it on stage, and the framing is doing real work. The competition tends to ship a faster GPU. NVIDIA ships an entire fleet of co-designed silicon, each piece solving a different bottleneck, and the bottlenecks are no longer mostly about compute. Here is the cast, in order.

Rubin GPU

// the compute engine

This is the star, the silicon that does the overwhelming share of the AI mathematics, and physically it represents a clean break from the past. Old GPUs were a single monolithic slab. Rubin is a chiplet design, an assembly of separate dies bonded onto one carrier, because chips have hit the physical ceiling of how large a single piece of silicon can be manufactured. NVIDIA's answer is to stop fighting that ceiling and start sewing chips together.

Figure 04

Inside a single Rubin package

Two reticle-sized compute dies and two I/O dies, ringed by eight stacks of HBM4 memory, all stitched onto one TSMC CoWoS-L carrier. "Reticle-sized" means as physically large as a lithography machine can print in a single exposure.

TSMC N3P 3nm class · 336B transistors · 224 SMs · 288 GB HBM4 · 22 TB/s bandwidth¹ · 50 PF NVFP4 inference · ~1.8 to 2.3 kW TDP

The two upgrades that actually matter

The first is memory, and it is the more important of the two. As models swell, the limiting factor stopped being how fast a chip can compute and became how fast you can feed it. HBM4 doubles the interface width over the previous HBM3e, delivering 288 GB across eight stacks at up to 22 TB/s, roughly two and three-quarter times Blackwell's 8 TB/s. The second is the third-generation Transformer Engine described above. Together they target the exact wall that the inference era runs into.
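
A short calculation shows why bandwidth, not capacity, is the number to watch. The sketch below uses the figures quoted above and asks how long it takes to read all 288 GB once at peak rate. Treating a full sweep of memory as a stand-in for one decode step is a deliberately pessimistic assumption made for illustration; real workloads read far less per token.

```python
HBM4_CAPACITY_GB = 288
RUBIN_BANDWIDTH_TBPS = 22
BLACKWELL_BANDWIDTH_TBPS = 8

# How much faster Rubin can feed its compute dies than Blackwell could.
bandwidth_ratio = RUBIN_BANDWIDTH_TBPS / BLACKWELL_BANDWIDTH_TBPS


def full_sweep_ms(capacity_gb: float, bandwidth_tbps: float) -> float:
    """Milliseconds to read every byte of memory once at peak bandwidth."""
    seconds = capacity_gb / (bandwidth_tbps * 1000)  # 1 TB/s = 1000 GB/s
    return seconds * 1000


rubin_sweep_ms = full_sweep_ms(HBM4_CAPACITY_GB, RUBIN_BANDWIDTH_TBPS)
same_data_at_blackwell_rate_ms = full_sweep_ms(HBM4_CAPACITY_GB, BLACKWELL_BANDWIDTH_TBPS)

print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"full sweep at 22 TB/s: {rubin_sweep_ms:.1f} ms")
print(f"same data at 8 TB/s:   {same_data_at_blackwell_rate_ms:.1f} ms")
```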

The uncomfortable truth: heat

At eighteen hundred to twenty-three hundred watts per GPU, there is no such thing as an air-cooled Rubin. Every configuration is liquid-cooled, full stop. That power draw is about double Blackwell's thousand watts, and more at the top of the range, which is why deploying Rubin is as much a plumbing project as a computing one. The chip is so capable that the hard part has moved off the die entirely and into the data center's cooling loop.

Vera CPU

// the brand new brain

This is the genuinely new piece, and the one most people miss. Blackwell's companion CPU was called Grace, and it leaned on off-the-shelf Arm Neoverse cores. Vera is its successor, and the headline is that NVIDIA designed the cores itself, a custom design code-named "Olympus." After years of buying its CPU cores off the shelf, NVIDIA has gone back to drawing its own, which is a statement of intent about how tightly it wants the CPU and GPU to fit.

88 custom Olympus cores · 176 threads · 227B transistors · 1.5 TB LPDDR5X · 1.8 TB/s NVLink-C2C to GPU · first CPU with native FP8

The job here is the work a GPU is genuinely bad at. GPUs are gloriously fast at doing the same operation across thousands of data points, but they stumble on branchy, sequential, decision-heavy logic. Vera handles exactly that: staging data, deciding which GPU should get which piece at which moment, and orchestrating the long, multi-step loops of agentic reasoning. NVIDIA keeps using the word "deterministic," meaning predictable, jitter-free timing, and for AI agents that fire thousands of small dependent steps, that predictability is worth more than raw speed.

A unit of vocabulary

One Vera CPU plus two Rubin GPUs on a single board is called a Vera Rubin Superchip. An NVL72 rack contains thirty-six of them, which is where its 72 GPUs and 36 CPUs come from. The CPU and GPU talk over NVLink-C2C at 1.8 TB/s, twice the bandwidth Grace had, so the line between "CPU memory" and "GPU memory" gets blurry on purpose.
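
The rack-level totals follow directly from the superchip definition. Here is a minimal sketch of that arithmetic using the per-chip figures quoted in this guide; the constant names are illustrative.

```python
SUPERCHIPS_PER_RACK = 36
GPUS_PER_SUPERCHIP = 2
CPUS_PER_SUPERCHIP = 1
CORES_PER_CPU = 88
HBM4_GB_PER_GPU = 288
LPDDR5X_TB_PER_CPU = 1.5

gpus = SUPERCHIPS_PER_RACK * GPUS_PER_SUPERCHIP
cpus = SUPERCHIPS_PER_RACK * CPUS_PER_SUPERCHIP
cpu_cores = cpus * CORES_PER_CPU
hbm4_tb = gpus * HBM4_GB_PER_GPU / 1000
lpddr5x_tb = cpus * LPDDR5X_TB_PER_CPU

print(f"{gpus} GPUs, {cpus} CPUs, {cpu_cores} CPU cores")
print(f"{hbm4_tb:.1f} TB HBM4, {lpddr5x_tb:.0f} TB LPDDR5X")
```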

NVLink 6 Switch

// the in-rack highway, scale-up

A frontier model does not fit on one GPU; it is sliced across dozens, and those slices have to talk to each other constantly and instantly, or the whole rack stalls waiting on itself. NVLink is the private highway that carries that traffic, and the NVLink Switch is the interchange that lets every GPU reach every other one at once. This is the single piece of the puzzle that competitors have struggled most to replicate at scale.

3.6 TB/s per GPU, all-to-all · 260 TB/s per rack · roughly 2x Blackwell

The payoff is the illusion at the heart of the NVL72: with NVLink 6, all seventy-two GPUs can behave as one enormous GPU with a shared pool of memory. Try to move model shards across ordinary networking instead and the latency would strangle performance. NVLink keeps that traffic running at something close to memory speed, which is the difference between a rack and a mere pile of servers.

ConnectX-9 SuperNIC

// the between-racks highway, scale-out

One rack is never enough at the frontier, so you link hundreds of them into a single training cluster, and the SuperNIC is what does the linking. This is the lineage that traces straight back to the Mellanox acquisition; networking is no longer an afterthought bolted onto the side of an AI system, it is a first-class citizen of the design.

1.6 Tb/s per GPU · programmable RDMA · GPU-direct

Its key trick is programmable RDMA, which lets a GPU reach directly into another server's memory without bothering either machine's CPU, at very low latency. The clean mental split is this: NVLink makes things fast inside the rack, and ConnectX-9 makes things fast between racks. Both have to be excellent or the cluster runs at the speed of its weakest link.
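
The gap between the two highways is easier to feel with numbers, and there is a unit trap on the way: NVLink is quoted in terabytes per second while ConnectX-9 is quoted in terabits per second. The sketch below converts both to the same unit using the per-GPU figures in this guide. The 10 GB shard is a hypothetical payload chosen for illustration, and the timings ignore latency and protocol overhead.

```python
NVLINK_TBYTES_PER_S = 3.6    # per GPU, terabytes per second
CONNECTX_TBITS_PER_S = 1.6   # per GPU, terabits per second
GPUS_PER_RACK = 72

connectx_tbytes_per_s = CONNECTX_TBITS_PER_S / 8           # 8 bits per byte
link_ratio = NVLINK_TBYTES_PER_S / connectx_tbytes_per_s   # scale-up vs scale-out
rack_nvlink_tbytes_per_s = GPUS_PER_RACK * NVLINK_TBYTES_PER_S


def transfer_ms(gigabytes: float, tbytes_per_s: float) -> float:
    """Milliseconds to move a payload at peak rate, ignoring latency."""
    return gigabytes / (tbytes_per_s * 1000) * 1000


SHARD_GB = 10  # hypothetical payload size

print(f"NVLink is about {link_ratio:.0f}x ConnectX-9 per GPU")
print(f"rack NVLink total: {rack_nvlink_tbytes_per_s:.1f} TB/s")
print(f"{SHARD_GB} GB over NVLink:     {transfer_ms(SHARD_GB, NVLINK_TBYTES_PER_S):.2f} ms")
print(f"{SHARD_GB} GB over ConnectX-9: {transfer_ms(SHARD_GB, connectx_tbytes_per_s):.2f} ms")
```

The per-rack total of 259.2 TB/s is where the rounded 260 TB/s figure comes from.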

BlueField-4 DPU

// the infrastructure workhorse

A DPU, or Data Processing Unit, is the data center's invisible custodian. Every AI cluster has a mountain of unglamorous chores: shuffling storage, managing the network, enforcing security boundaries between tenants, encrypting traffic. Run those on the expensive GPUs and CPUs and you are burning gold to do janitorial work. BlueField-4 takes all of it off their plate.

storage · networking · security and isolation · elastic scaling · integrated SSD for KV-cache

The new wrinkle for Rubin is that BlueField-4 can park the KV-cache on an integrated SSD. As models stretch to million-token contexts, they need a fast tier to hold their working memory, and putting it close to the network fabric rather than hogging precious HBM is one of those quiet architectural decisions that pays off enormously at scale.
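
A rough size estimate shows why a million-token context strains HBM. The sketch below uses the standard accounting for a transformer's KV-cache: one key vector and one value vector per token, per layer, per KV head. The model dimensions are hypothetical, picked only to give a plausible order of magnitude, and do not describe any particular model.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int) -> float:
    """KV-cache size in gigabytes (1 GB = 1e9 bytes) for one sequence."""
    # The factor of 2 covers storing both a key and a value.
    total_bytes = 2 * tokens * layers * kv_heads * head_dim * bytes_per_value
    return total_bytes / 1e9


# Hypothetical model shape, with the cache stored in FP8 (one byte per value).
cache_gb = kv_cache_gb(tokens=1_000_000, layers=80, kv_heads=8,
                       head_dim=128, bytes_per_value=1)

HBM4_GB_PER_GPU = 288
share_of_one_gpu = cache_gb / HBM4_GB_PER_GPU

print(f"KV-cache for one sequence: {cache_gb:.1f} GB")
print(f"share of one GPU's HBM4:   {share_of_one_gpu:.0%}")
```

Under these assumptions a single long conversation would claim more than half of one GPU's HBM4, which is the pressure a cheaper tier for idle cache is meant to relieve.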

Spectrum-X Ethernet Co-Packaged Optics

// networking, rebuilt around light

This is the most science-fiction piece in the set. Traditional network switches turn electricity into light using little pluggable optical modules, and at data-center scale those modules are a genuine plague: they burn power, run hot, and fail often enough to be a real reliability problem. Co-packaged optics takes the radical step of building the silicon photonics right next to the switch chip, generating the light at the source rather than at the end of a copper run.

5x power efficiency · 10x network resiliency · up to 5x more uptime

At the scale of a gigawatt AI factory, the power drawn by networking and the failures of optical links are real line items on the budget and the maintenance schedule. Co-packaged optics attacks both at once, moving data at the speed of light while spending dramatically less energy to do it. It is the kind of unglamorous infrastructure win that does not make headlines but quietly decides whether a build is economical.

Groq 3 LPU

// the wildcard, and a twenty-billion-dollar story

If one card on NVIDIA's own slide makes people do a double-take, it is this one, and the confusion is completely understandable, because Groq spent years as a rival. The explanation is one of the more dramatic corporate moves of the decade. On the 24th of December 2025, NVIDIA signed a deal with Groq worth roughly twenty billion dollars. Structurally it is a non-exclusive license combined with a team transfer, an "acqui-hire," which is precisely the structure that let it close in under four months without triggering a full merger review. Groq's founder Jonathan Ross, who originally designed Google's TPU, and its president Sunny Madra moved over to NVIDIA.

Huang reached for a familiar comparison on stage. Mellanox, he reminded everyone, turned NVIDIA from a chip vendor into a networking company. Groq, in the same telling, turns it from a training-first GPU vendor into a full-stack platform optimized for inference. It is a tidy narrative, and like most things Huang says from a stage, it doubles as a roadmap.

The asterisk worth keeping

This deal is contested, and it would be dishonest to present it as settled. A United States Senate inquiry opened in March 2026, led by Senators Warren and Blumenthal, alongside FTC interest, is examining whether the "reverse acqui-hire" structure was engineered specifically to dodge antitrust scrutiny. As of the middle of 2026 the deal stands and the chips are real, but the investigation is open, and the question of whether a company this dominant should be allowed to absorb its challengers this way is very much live.

What an LPU actually is, and why it had to exist

An LPU, or Language Processing Unit, closes the one gap a GPU cannot close on its own. The problem is specific and brutal. At very high token-generation speeds, north of a thousand tokens per second, even NVLink-connected GPU systems choke, not on compute but on memory bandwidth, because the data has to keep making the round trip to the HBM stacks at the edge of the package. The LPU's solution is radical to the point of seeming reckless: throw out the expensive HBM entirely and pour a vast pool of SRAM directly onto the die, right alongside the compute, so the data barely has to travel at all.

Figure 05

Why the LPU moves memory onto the chip

On a GPU, data sprints to HBM at the edge of the package and back. On an LPU, memory and compute are interleaved on the same die, so the round trip nearly vanishes. That is the whole idea.

per LPU: ~500 MB SRAM · per LPU: ~150 TB/s · LPX rack: 256 LPUs · LPX rack: 128 GB SRAM · LPX rack: 40 PB/s · Samsung 4nm
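
The rack figures are the per-chip figures multiplied by 256, which a few lines confirm. This sketch uses the approximate per-LPU numbers quoted above; the constant names are illustrative.

```python
LPUS_PER_RACK = 256
SRAM_MB_PER_LPU = 500          # approximate, as quoted
SRAM_TBYTES_PER_S_PER_LPU = 150  # approximate, as quoted

rack_sram_gb = LPUS_PER_RACK * SRAM_MB_PER_LPU / 1000
rack_bandwidth_pbytes_per_s = LPUS_PER_RACK * SRAM_TBYTES_PER_S_PER_LPU / 1000

print(f"LPX rack SRAM:      {rack_sram_gb:.0f} GB")
print(f"LPX rack bandwidth: {rack_bandwidth_pbytes_per_s:.1f} PB/s")
```

The bandwidth works out to 38.4 PB/s, which the spec line rounds to 40. The capacity is tiny next to a GPU's HBM, which is the trade the design makes: far less memory, reached far faster.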

How the GPU and LPU split the work

The two chips do not compete inside the system; they divide the labor of generating an answer. The Rubin GPU handles the prefill, the KV-cache, and the heavy attention math, while the LPU takes the latency-sensitive feed-forward networks, the mixture-of-expert execution, and the pointwise operations. NVIDIA's claim for the pairing is striking: an LPX rack working alongside a Vera Rubin NVL72 delivers thirty-five times the inference throughput per megawatt of Blackwell. The LPX racks physically sit beside the Rubin racks, connected over Spectrum-X, and, in the line that is clearly aimed at reassuring nervous customers, none of it requires a single change to CUDA code.

05 / The eighth chip

Rubin CPX, and the art of not wasting your best machine

Beyond the seven, NVIDIA lists one more part as a GPU in its own right: Rubin CPX, the Context Phase aXcelerator. To understand why it exists, you have to know that answering a prompt is really two jobs, with completely different appetites, and using one expensive chip for both is like running a fine-dining kitchen where the same chef chops the onions and plates the dessert.

Phase 1

Prefill, the reading

Digesting the prompt, which today might be an entire codebase or a feature-length video, millions of tokens at once. This phase is compute-hungry but not especially latency-sensitive. The right tool is Rubin CPX.

Phase 2

Decode, the writing

Producing the answer one token at a time. This phase is bound by memory and latency, not raw compute. The right tools are the Rubin GPU and, at extreme speed, the LPU.

Figure 06

One answer, three specialists, in sequence

This is "disaggregated inference." Instead of one chip doing everything, the request flows down an assembly line where each stage runs on the silicon built for it.
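
The assembly line can be pictured as a small routing rule. The sketch below is a toy dispatcher, not a real scheduler or any NVIDIA API: the class, the function, and the threshold parameter are all invented for illustration. It also simplifies the decode stage to a single choice, whereas the guide describes the GPU and LPU sharing decode work between them.

```python
from dataclasses import dataclass


@dataclass
class Request:
    phase: str                        # "prefill" or "decode"
    target_tokens_per_s: float = 0.0  # only meaningful for decode


def pick_silicon(request: Request, lpu_threshold: float = 1000.0) -> str:
    """Choose the chip this guide associates with each inference phase."""
    if request.phase == "prefill":
        # Compute-hungry, not latency-sensitive.
        return "Rubin CPX"
    if request.phase == "decode":
        # Memory- and latency-bound; the LPU takes over at extreme speeds.
        if request.target_tokens_per_s > lpu_threshold:
            return "Groq 3 LPU"
        return "Rubin GPU"
    raise ValueError(f"unknown phase: {request.phase!r}")


for req in (Request("prefill"),
            Request("decode", target_tokens_per_s=200),
            Request("decode", target_tokens_per_s=1500)):
    print(f"{req.phase:>7} @ {req.target_tokens_per_s:>6.0f} tok/s -> {pick_silicon(req)}")
```

The threshold of a thousand tokens per second mirrors the point at which, per the LPU section, GPU systems begin to choke on memory bandwidth.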

CPX delivers thirty PetaFLOPS of NVFP4, but its defining choice is to swap the expensive HBM4 for 128 GB of much cheaper GDDR7, the kind of memory you would find on a gaming card. For the prefill job, ingesting long sequences cheaply and without melting, that trade is exactly right, and CPX pairs it with hardware video decode and roughly three times the attention performance of GB300.

The idea in one sentence

Rubin is not a single do-everything chip. It is a toolbox: the Rubin GPU for general work, CPX for cheap long-context reading, the LPU for blistering token output. Pull these together and you get the Vera Rubin NVL144 CPX rack, which combines 144 Rubin GPUs (counted as dies, so 72 packages), 144 CPX, and 36 Vera CPUs to reach roughly 8 ExaFLOPS of NVFP4, 100 TB of fast memory, and 1.7 PB/s of bandwidth. NVIDIA frames that as about seven and a half times a GB300 rack, which is the kind of number that gets a keynote audience to its feet.

06 / Form factors

Decoding the alphabet soup: MGX, NVL72, HGX, DGX, SuperPOD

These names trip up almost everyone, and they shouldn't, because they are simply different ways of putting the same chips into different boxes at different scales. Here is the clean separation, and then a picture that makes the hierarchy obvious.

Name	What it actually is	Scale	Who it is for
MGX	A modular reference design, a blueprint for building the rack with cable-free trays. Not a product you buy, a template partners build from.	A specification	The 80-plus OEM and system builders
NVL72	The rack-scale system: 72 Rubin GPUs and 36 Vera CPUs, fully liquid-cooled, all-to-all NVLink. The scale-up domain made physical.	One rack	The all-in NVIDIA stack, with an Arm Vera host
HGX Rubin NVL8	An eight-GPU baseboard for conventional x86 servers. The "keep one foot in the familiar world" option.	One server, 8 GPUs	Enterprises that still want x86 servers
DGX	NVIDIA's own branded, turnkey build of the above, pre-integrated, supported by NVIDIA itself, shipped with the full software stack.	One rack and up	Buyers who want it finished and warranted by NVIDIA
DGX SuperPOD	A pre-engineered cluster of many NVL72 racks plus networking, storage and software. As close as it gets to ordering a turnkey AI data center.	8-plus racks	Gigascale and frontier AI labs

Figure 07

The same silicon, nested at six scales

Read it left to right. Each step is just a bundle of the step before it. Once you see this, the whole catalog stops being confusing.

Mental model

The whole hierarchy is just nesting: chip into superchip into tray into NVL72 rack into SuperPOD into AI factory. "MGX" is the recipe everyone cooks from; "DGX" is NVIDIA cooking the meal for you and standing behind it; everything else is portion size.

07 / What you actually buy ★

"I want to buy one Vera Rubin." One what, exactly?

This is the question that quietly trips up everyone arriving from the world of consumer hardware, and it is worth slowing down for. When you buy an RTX 4090, one unit means one GPU, a card you slide into a PC. AI-factory hardware refuses to play by that rule. The product is the system, or the rack, almost never the bare chip. Here is the purchasing ladder, from the rung you cannot actually stand on up to hyperscale.

A single Rubin GPU die · not sold on its own

Unlike an RTX card, you cannot walk in and buy one Rubin GPU. NVIDIA sells chips to system builders, not to end users. As a rough sense of scale only, the implied value of one GPU inside a rack lands somewhere around fifty to seventy thousand dollars, but you will never see it on its own price tag.

A Vera Rubin Superchip · a component, not a finished product

One Vera CPU plus two Rubin GPUs on a board. It is a building block sold to integrators. You do not buy "one superchip" as a thing that arrives in a box; you buy a system that happens to contain them.

An HGX Rubin NVL8 server · the smallest real system you can buy

A server from Dell, Supermicro, HPE and the like, carrying an eight-GPU HGX baseboard on an x86 host. This is the closest thing to "a box of GPUs," and it is where most enterprises actually start.

A Vera Rubin NVL72 rack ★ · the default meaning of "one unit"

When someone in the data-center world says "a Vera Rubin," nine times out of ten they mean one NVL72 rack: 72 GPUs and 36 CPUs in a single liquid-cooled cabinet. You can buy a functionally equivalent rack from an OEM building on the MGX blueprint.

A DGX Vera Rubin NVL72 · the same rack, NVIDIA-branded and turnkey

Or you buy the DGX version straight from NVIDIA: the same hardware, but integrated, supported and warranted by NVIDIA, arriving with the full software stack of DGX OS, Base Command and Mission Control.

A DGX SuperPOD · a turnkey AI supercomputer

A pre-engineered cluster. A common configuration is eight NVL72 racks: 576 GPUs, 288 CPUs, roughly 600 TB of memory and about 28.8 ExaFLOPS of NVFP4, plus switches, storage, DPUs, cabling and software, ordered and delivered as a single unit.
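
The SuperPOD totals can be reproduced from the per-chip figures elsewhere in this guide. The sketch below is one way to arrive at them. It assumes the "roughly 600 TB of memory" figure counts GPU HBM4 and CPU LPDDR5X together, and that the ExaFLOPS figure uses 50 PF of NVFP4 per GPU; both are inferences from the numbers, not statements from NVIDIA.

```python
RACKS = 8
GPUS_PER_RACK = 72
CPUS_PER_RACK = 36
HBM4_GB_PER_GPU = 288
LPDDR5X_TB_PER_CPU = 1.5
NVFP4_PF_PER_GPU = 50

gpus = RACKS * GPUS_PER_RACK
cpus = RACKS * CPUS_PER_RACK

# Assumption: the headline memory figure is HBM4 plus LPDDR5X.
memory_tb = gpus * HBM4_GB_PER_GPU / 1000 + cpus * LPDDR5X_TB_PER_CPU
nvfp4_exaflops = gpus * NVFP4_PF_PER_GPU / 1000

print(f"{gpus} GPUs, {cpus} CPUs")
print(f"{memory_tb:.1f} TB memory, {nvfp4_exaflops:.1f} ExaFLOPS NVFP4")
```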

Many SuperPODs, an AI factory · hyperscale

Stack SuperPODs and you arrive at the hyperscale AI factory, the level at which Microsoft, Google, xAI, Meta and OpenAI actually operate. This is what Huang means when he says the unit of computing is now the data center.

The direct answer: "One Vera Rubin" is ambiguous, so the right reflex is always to ask "which tier?" But the default unit is one NVL72 rack. You generally cannot buy a single GPU die, the smallest practical purchase is an HGX server with eight GPUs, and there is no Rubin desktop card waiting for you. Rubin is rack-scale by design, because the problem it was built to solve does not fit in a tower under a desk.

08 / Example bills of quantities

What is actually inside "one unit"

Because the instinct from PC building is "one RTX 4090 equals one GPU," the most useful thing I can show you is what a "one-unit" purchase really contains at each tier. These are illustrative reference BOQs meant to build intuition, not quotes; exact quantities flex with configuration, and the prices are deliberately rough.

BOQ A, one HGX Rubin NVL8 server: the entry "box of GPUs"

#	Line item	Qty	Notes
1	Liquid-cooled server chassis (OEM)	1	Dell, Supermicro, HPE and similar
2	HGX Rubin NVL8 baseboard	1	carries the eight GPUs and on-board NVLink
3	Rubin GPU	8	eight times 288 GB HBM4 is 2,304 GB of GPU memory
4	x86 host CPU	2	dual-socket host
5	System DRAM (DDR5)	~2 to 4 TB	configuration-dependent
6	ConnectX-9 SuperNIC	8 to 9	scale-out, roughly one per GPU
7	BlueField-4 DPU	1 to 2	storage and infrastructure offload
8	NVMe storage	multiple	local scratch and KV-cache tier
9	Liquid-cooling manifold and PSUs	1 set	direct-to-chip cooling

Use case: enterprise inference and mid-size training, mixed workloads, x86 compatibility. Indicative cost: several hundred thousand dollars.²

BOQ B, one Vera Rubin NVL72 rack: the standard rack-scale unit

#	Line item	Qty	Notes
1	Compute trays	18	each tray is 2 Vera CPU plus 4 Rubin GPU, that is 2 superchips
2	Rubin GPU (total)	72	72 times 288 GB is 20.7 TB HBM4, about 1,580 TB/s aggregate
3	Vera CPU (total)	36	3,168 Olympus cores, 54 TB LPDDR5X
4	NVLink 6 switch trays	9	provides 260 TB/s all-to-all, the scale-up fabric
5	NVLink spine and backplane	1	cable-free
6	ConnectX-9 SuperNIC	up to 72	scale-out fabric, configuration-dependent
7	BlueField-4 DPU	~18	storage, security, KV-cache
8	Power shelves and busbar	1 set	roughly 150 to 190 kW class³
9	CDU and liquid-cooling loop	1 set	fanless trays, about twice Blackwell's flow
10	Rack enclosure	1	around 1.5 tonnes fully populated

Use case: the "one unit" of frontier training and inference. Indicative cost: roughly four to five million dollars per rack; for reference, a GB200 NVL72 was widely reported at around three million.²
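The rack totals in BOQ B all fall out of a handful of per-tray and per-GPU figures. A small sketch that rebuilds them, using only numbers quoted in this guide (the variable names are mine, and the per-CPU figures are derived by division rather than taken from a spec sheet):

```python
# Per-tray and per-GPU figures quoted in this guide.
TRAYS = 18
GPUS_PER_TRAY = 4
CPUS_PER_TRAY = 2
HBM_GB_PER_GPU = 288
HBM_TBPS_PER_GPU = 22
NVLINK_TBPS_PER_GPU = 3.6

# Rack-level CPU totals quoted in BOQ B.
TOTAL_CPU_CORES = 3168
TOTAL_LPDDR_TB = 54

gpus = TRAYS * GPUS_PER_TRAY                 # GPUs per rack
cpus = TRAYS * CPUS_PER_TRAY                 # CPUs per rack
hbm_tb = gpus * HBM_GB_PER_GPU / 1000        # aggregate HBM4 capacity, TB
hbm_tbps = gpus * HBM_TBPS_PER_GPU           # aggregate HBM4 bandwidth, TB/s
nvlink_tbps = gpus * NVLINK_TBPS_PER_GPU     # aggregate NVLink bandwidth, TB/s

# Derived, not quoted: what each Vera CPU contributes.
cores_per_cpu = TOTAL_CPU_CORES // cpus
lpddr_tb_per_cpu = TOTAL_LPDDR_TB / cpus

print(f"{gpus} GPUs, {cpus} CPUs")
print(f"HBM4: {hbm_tb:.1f} TB at {hbm_tbps} TB/s aggregate")
print(f"NVLink: {nvlink_tbps:.0f} TB/s all-to-all")
print(f"Per CPU: {cores_per_cpu} cores, {lpddr_tb_per_cpu} TB LPDDR5X")
```

The quoted "about 1,580 TB/s" and "260 TB/s" are these products rounded.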

BOQ C, one DGX SuperPOD (Vera Rubin): turnkey AI supercomputer

#	Line item	Qty	Notes
1	DGX Vera Rubin NVL72 racks	8	= 576 GPUs, 288 CPUs, ~600 TB memory, 28.8 EF NVFP4
2	Quantum-X800 InfiniBand or Spectrum-X Ethernet switches	set	the scale-out fabric between racks
3	Spectrum-X co-packaged-optics switches	set	silicon-photonics option
4	Storage nodes (context and KV-cache tier)	multiple	high-speed model and data storage
5	Management and head nodes	multiple	orchestration
6	Software stack	included	Base Command, Mission Control, AI Enterprise, Run:ai
7	Integration, cabling, CDUs, support	included	turnkey delivery
8	Optional Groq 3 LPX inference racks	add-on	256 LPUs each, sit beside the racks via Spectrum-X

Use case: gigascale training of frontier models, ordered as a single AI supercomputer. Indicative cost: tens of millions of dollars.²
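The SuperPOD headline numbers are the rack numbers multiplied by eight. A quick sketch of that scaling follows; treating the "~600 TB memory" as HBM4 plus LPDDR5X is my reading of the figure, not something the guide states outright:

```python
RACKS = 8  # the common SuperPOD configuration described above

# Per-rack figures quoted in this guide.
GPUS_PER_RACK = 72
CPUS_PER_RACK = 36
HBM_TB_PER_RACK = 20.7
LPDDR_TB_PER_RACK = 54
NVFP4_EF_PER_RACK = 3.6

pod = {
    "gpus": RACKS * GPUS_PER_RACK,
    "cpus": RACKS * CPUS_PER_RACK,
    # Assumption: "memory" here means GPU HBM4 plus CPU LPDDR5X combined.
    "memory_tb": RACKS * (HBM_TB_PER_RACK + LPDDR_TB_PER_RACK),
    "nvfp4_ef": RACKS * NVFP4_EF_PER_RACK,
}

for name, value in pod.items():
    print(f"{name}: {value:,.1f}")
```

Under that assumption the memory total comes to just under 600 TB, which matches the rounded figure in the table.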

¹ Early HBM4 units may ship below the 22 TB/s target before the supply chain ramps.
² Prices are rough, non-official, order-of-magnitude figures for intuition only; real pricing swings enormously with configuration and contract.
³ Rubin pushes rack power density past Blackwell; exact figures are not yet final.

09 / Performance, with the marketing removed

GB200 vs GB300 vs Rubin, honestly

NVIDIA's slides love a big multiplier, so this section sticks to numbers you can defend and flags the ones that mix conventions. Two charts first, because two of the generational jumps are clean and unambiguous, and then the full tables.

Figure 08

Memory bandwidth per GPU, the wall everyone hits

This is the number that matters most in the inference era, and it is the one Rubin moves the furthest. HBM4 is the headline upgrade.

H100 (Hopper): 3.35 TB/s
B200 / B300 (Blackwell): 8 TB/s
Rubin R200 (2026): 22 TB/s
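Why bandwidth is "the wall": during token generation a GPU has to stream the model's weights from memory for every token, so bandwidth divided by model size gives a hard ceiling on single-stream speed. A rough sketch under loud assumptions: the 70-billion-parameter dense model and the one byte per parameter (FP8) are hypothetical choices of mine, and the estimate ignores KV-cache reads, batching and multi-GPU sharding.

```python
# Per-GPU memory bandwidth from Figure 08, in TB/s.
BANDWIDTH_TBPS = {"H100": 3.35, "B200 / B300": 8.0, "Rubin R200": 22.0}


def decode_ceiling_tokens_per_s(
    bandwidth_tbps: float, params_billion: float, bytes_per_param: float
) -> float:
    """Upper bound on single-stream decode speed if every token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tbps * 1e12 / weight_bytes


# Hypothetical model: 70B dense parameters stored at 1 byte each (FP8).
ceilings = {
    gpu: decode_ceiling_tokens_per_s(bw, params_billion=70, bytes_per_param=1.0)
    for gpu, bw in BANDWIDTH_TBPS.items()
}

for gpu, tokens in ceilings.items():
    print(f"{gpu}: at most ~{tokens:.0f} tokens/s per stream")
```

Whatever model you plug in, the ceiling scales linearly with bandwidth, so the ratio between generations is the same as in the chart.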

Figure 09

Dense FP4 compute per GPU

Apples to apples, all dense figures, no sparsity tricks. The generational climb from Blackwell to Blackwell Ultra to Rubin is real and steep.

B200: 9 PF dense
B300: 15 PF dense
Rubin R200: 35 PF dense
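Putting the two charts together gives a single useful ratio: how many FP4 operations a GPU can perform per byte it pulls from memory (the "ridge point" in roofline terms). Work that does fewer operations per byte than this is limited by memory, not by compute. A minimal sketch using only the chart values; the roofline framing is my addition, not NVIDIA's:

```python
# (dense FP4 in PFLOPS, memory bandwidth in TB/s) from Figures 08 and 09.
GPUS = {
    "B200": (9, 8),
    "B300": (15, 8),
    "Rubin R200": (35, 22),
}


def ridge_flops_per_byte(dense_pf: float, bandwidth_tbps: float) -> float:
    """FP4 operations available per byte of memory traffic at peak rates."""
    return (dense_pf * 1e15) / (bandwidth_tbps * 1e12)


ridge = {gpu: ridge_flops_per_byte(pf, bw) for gpu, (pf, bw) in GPUS.items()}

for gpu, value in ridge.items():
    print(f"{gpu}: ~{value:.0f} FP4 ops per byte")
```

The ratio rises from B200 to B300, because compute grew while bandwidth stood still, and then falls for Rubin, because bandwidth grew by 2.75x while dense FP4 grew by about 2.3x. That is the numerical sense in which bandwidth is the number Rubin moves the furthest.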

Per-GPU comparison

// Blackwell B200, then Blackwell Ultra B300, then Rubin R200. Dense figures unless noted.
Spec	B200 (in GB200)	B300 (in GB300)	Rubin R200
Process	TSMC 4NP	TSMC 4NP	TSMC N3P (3nm)
Transistors	208 B	208 B	336 B
Die structure	2 dies	2 dies	2 compute + 2 I/O
Memory	~180 to 192 GB HBM3e	288 GB HBM3e	288 GB HBM4
Bandwidth	8 TB/s	8 TB/s	22 TB/s
Dense FP4	9 PF	15 PF	35 PF (50 PF inference with Transformer Engine)
Dense FP8	4.5 PF	5 PF	17.5 PF
NVLink per GPU	1.8 TB/s (v5)	1.8 TB/s (v5)	3.6 TB/s (v6)
TDP	~1,000 W	~1,400 W	~1,800 to 2,300 W

Per-rack comparison

// rack-scale systems, with the dense-versus-sparse caveat noted below
Spec	GB200 NVL72	GB300 NVL72	Vera Rubin NVL72	VR NVL144 CPX
GPUs	72 B200	72 B300	72 Rubin	144 Rubin + 144 CPX
CPUs	36 Grace	36 Grace	36 Vera	36 Vera
FP4 inference	~1.4 EF sparse	~1.1 EF dense	3.6 EF (with Transformer Engine)	8 EF
HBM capacity	~13.4 TB HBM3e	20.7 TB HBM3e	20.7 TB HBM4	100 TB fast mem
HBM bandwidth	~576 TB/s	~576 TB/s	1,580 TB/s	1.7 PB/s
NVLink per rack	~130 TB/s	~130 TB/s	260 TB/s	260 TB/s
vs prior gen	baseline	1.5x GB200	~3.3x GB300	~7.5x GB300

The two economic claims NVIDIA actually leads with

For all the spec tables, NVIDIA knows the buyers care about two numbers above all, because those are the ones that show up in a profit-and-loss statement. Both are big, and both deserve a raised eyebrow until independent benchmarks land.

Training

A quarter of the GPUs

NVIDIA says Rubin can train large mixture-of-experts models with roughly one-quarter the number of GPUs that Blackwell needed. If it holds, that is a direct cut to both the capital bill and the power bill.

Inference

A tenth of the cost per token

For deep-reasoning agentic workloads, Rubin targets about one-tenth the cost per million tokens versus Blackwell. Token economics, not raw FLOPS, is the real battlefield of the inference era, and this is the number aimed at it.
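To see what those two claims would mean on a budget sheet, here is a sketch that takes them at face value. Every input is hypothetical: the 4,608-GPU Blackwell job, the two-dollar baseline cost per million tokens and the monthly token volume are illustrative numbers of mine, and the one-quarter and one-tenth factors are NVIDIA's unverified targets.

```python
import math

GPUS_PER_RACK = 72


def racks_needed(gpus: int) -> int:
    """Whole NVL72 racks required to house a given GPU count."""
    return math.ceil(gpus / GPUS_PER_RACK)


# Training claim: roughly one-quarter the GPUs (vendor claim, unverified).
blackwell_gpus = 4608                        # hypothetical training job
rubin_gpus = math.ceil(blackwell_gpus / 4)
blackwell_racks = racks_needed(blackwell_gpus)
rubin_racks = racks_needed(rubin_gpus)

# Inference claim: about one-tenth the cost per million tokens (vendor target).
blackwell_cost_per_mtok = 2.00               # hypothetical baseline, USD
rubin_cost_per_mtok = blackwell_cost_per_mtok / 10
monthly_mtok = 50_000                        # hypothetical: 50 billion tokens a month

blackwell_monthly = blackwell_cost_per_mtok * monthly_mtok
rubin_monthly = rubin_cost_per_mtok * monthly_mtok

print(f"Training: {blackwell_racks} racks -> {rubin_racks} racks")
print(f"Inference: ${blackwell_monthly:,.0f} -> ${rubin_monthly:,.0f} per month")
```

The point of the exercise is the shape of the saving, not the dollar values: one claim shrinks the rack count, the other shrinks the running bill.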

On dense versus sparse: GB200's ~1.4 EF FP4 is a sparse figure, while GB300's ~1.1 EF is dense, which is why a naive reading makes the newer rack look slower. Rubin's 3.6 EF is the Transformer-Engine-boosted inference number, and NVIDIA's multipliers (3.3x, 7.5x) compare that against GB300. Treat all cross-generation FP4 comparisons as directional rather than exact, and trust the bandwidth and dense-FP4 charts above more than any headline multiplier.
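The mixed conventions are easiest to see by rebuilding the rack figures from the per-GPU table. A short sketch using only numbers from this section; the dictionary names are mine:

```python
GPUS_PER_RACK = 72

# Dense FP4 per GPU in PFLOPS, from the per-GPU table.
DENSE_PF = {"GB200 NVL72": 9, "GB300 NVL72": 15, "Vera Rubin NVL72": 35}
RUBIN_TE_PF = 50            # Rubin's Transformer-Engine-boosted inference figure

# Rack figures as quoted in the per-rack table, in ExaFLOPS.
QUOTED_EF = {"GB200 NVL72": 1.4, "GB300 NVL72": 1.1, "Vera Rubin NVL72": 3.6}

# Dense rack throughput: GPUs times per-GPU dense PFLOPS, converted to EF.
rack_dense_ef = {rack: GPUS_PER_RACK * pf / 1000 for rack, pf in DENSE_PF.items()}
rubin_te_ef = GPUS_PER_RACK * RUBIN_TE_PF / 1000

headline_multiplier = QUOTED_EF["Vera Rubin NVL72"] / QUOTED_EF["GB300 NVL72"]
dense_multiplier = rack_dense_ef["Vera Rubin NVL72"] / rack_dense_ef["GB300 NVL72"]

for rack, dense in rack_dense_ef.items():
    print(f"{rack}: {dense:.2f} EF dense, quoted {QUOTED_EF[rack]} EF")
print(f"Rubin with Transformer Engine: {rubin_te_ef:.1f} EF")
print(f"Headline multiplier: {headline_multiplier:.2f}x")
print(f"Dense-to-dense multiplier: {dense_multiplier:.2f}x")
```

Only GB300's quoted figure matches its dense product. GB200's quoted number is roughly double its dense product, and Rubin's matches the 50 PF boosted figure rather than the 35 PF dense one, so the like-for-like Rubin gain over GB300 is closer to 2.3x than 3.3x.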

10 / Terminology

Glossary

Every key term in one place.

CUDA (platform): NVIDIA's parallel-computing software model and its CUDA-X library stack. The dominant AI software ecosystem and the company's deepest moat. Rubin runs CUDA 13.
CUDA core: A general-purpose math unit in each SM, handling FP32, FP64 and integers. The basis for HPC, simulation and all the logic that surrounds the AI.
Tensor Core: A specialized core that performs matrix-multiply-accumulate at low precision. The source of the big PetaFLOPS numbers and the engine of roughly ninety percent of deep learning.
SM (Streaming Multiprocessor): The basic building block of an NVIDIA GPU, bundling CUDA cores, Tensor Cores, schedulers and caches. The Rubin GPU has 224.
Transformer Engine: A hardware-and-software layer that automatically manages precision per layer to reach FP4 speed at close to FP8 accuracy. Rubin uses the third generation, with adaptive compression.
NVFP4: NVIDIA's 4-bit floating-point format with micro-block scaling. Best for low-cost, high-throughput inference and the training of the largest models.
FP8 / NVFP8: 8-bit floating point. The safe default for training and for inference where quality cannot slip, thanks to more range than FP4.
FP16 / BF16 / FP32 / FP64: Higher-precision formats. FP16 and BF16 serve mixed-precision training, FP32 is the fallback for numerical stability, and FP64 is for scientific and HPC work.
HBM4: High Bandwidth Memory, fourth generation, stacked DRAM beside the GPU die. 288 GB at 22 TB/s on Rubin, doubling the interface width of HBM3e.
GDDR7: Cheaper, cooler graphics memory with no advanced packaging. Used on Rubin CPX (128 GB) for cost-efficient long-context work.
SRAM: Ultra-fast on-die memory. The LPU's defining bet: a huge SRAM pool on the chip itself instead of off-chip HBM, for extreme bandwidth at low latency.
NVLink / NVLink Switch: The in-rack, all-to-all GPU interconnect, the scale-up fabric. NVLink 6 runs at 3.6 TB/s per GPU and 260 TB/s per rack.
NVLink-C2C: The chip-to-chip coherent link between CPU and GPU inside a superchip, 1.8 TB/s on Vera Rubin.
RDMA: Remote Direct Memory Access, reading or writing another machine's memory without involving its CPU. Core to SuperNIC scale-out.
SuperNIC: A smart network card (ConnectX-9) tuned for GPU-to-GPU traffic between racks, the scale-out fabric, at 1.6 Tb/s per GPU.
DPU: Data Processing Unit (BlueField-4), offloading storage, networking, security and KV-cache from the CPU and GPU.
CPO / silicon photonics: Co-Packaged Optics, light generated right next to the switch chip. 5x power efficiency and 10x resiliency versus pluggable optics.
KV-cache: Key-Value cache, the model's working memory of the conversation so far. Central to long-context inference and able to live on a BlueField-4 SSD.
MoE: Mixture-of-Experts, a model that routes each token to a subset of expert sub-networks. Rubin and the LPU accelerate expert execution.
Prefill / decode: The two phases of inference. Prefill reads the prompt and is compute-heavy (CPX); decode generates tokens and is memory-bound (GPU and LPU).
Scale-up / scale-out: Scale-up links GPUs within a rack (NVLink). Scale-out links racks across a data center (Ethernet or InfiniBand).
Chiplet / CoWoS-L: A chiplet is one of several smaller dies that are combined into a single package. CoWoS-L is TSMC's advanced 2.5D packaging that stitches those dies and the HBM onto one carrier.
MGX: NVIDIA's modular rack reference design, the blueprint. Third-generation MGX underpins Vera Rubin NVL72, with cable-free trays and 80-plus partners.
NVL72: The rack-scale system, 72 Rubin GPUs and 36 Vera CPUs, liquid-cooled with all-to-all NVLink. The default "one unit" of rack-scale AI.
HGX: An eight-GPU baseboard for x86 servers (HGX Rubin NVL8). The traditional-server entry point.
DGX: NVIDIA's own branded, turnkey, fully supported system line (DGX Vera Rubin NVL72, DGX SuperPOD).
SuperPOD: A pre-engineered cluster of many NVL72 racks plus networking, storage and software. Eight racks make 576 GPUs and about 28.8 ExaFLOPS.
LPU: Language Processing Unit (Groq 3), an SRAM-based inference accelerator for ultra-fast token decode, from NVIDIA's twenty-billion-dollar Groq deal.
Superchip: One Vera CPU plus two Rubin GPUs on one board. The building block, with thirty-six per NVL72 rack.
AFD (Attention-FFN Disaggregation): Splitting inference so the Rubin GPUs handle attention and prefill while LPUs handle feed-forward and MoE, the Rubin-plus-LPU teamwork model.
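The NVFP4 entry above mentions "micro-block scaling," which is easier to grasp in code than in words: values are grouped into small blocks, each block shares one scale factor, and every value is then snapped to one of a handful of 4-bit levels. The sketch below is a toy illustration, not NVIDIA's implementation. The block size of 16 and the eight magnitude levels reflect my understanding of the format, and the scale is kept as an ordinary Python float rather than the compact format real hardware would use.

```python
import math

# Magnitudes representable by a 4-bit float with 2 exponent bits and 1 mantissa bit.
# Assumption: this is the value grid NVFP4 uses; the sign is stored separately.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed micro-block size


def quantize_block(values):
    """Return (scale, codes): one shared scale plus a 4-bit-grid value per element."""
    if len(values) != BLOCK:
        raise ValueError(f"expected {BLOCK} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid level.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(nearest, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and the grid codes."""
    return [scale * c for c in codes]


values = [0.1 * i for i in range(-8, 8)]     # one illustrative block of 16 numbers
scale, codes = quantize_block(values)
restored = dequantize_block(scale, codes)
worst_error = max(abs(a - b) for a, b in zip(values, restored))

print(f"scale = {scale:.4f}, worst absolute error = {worst_error:.4f}")
```

Because the scale is chosen per block rather than per tensor, one outlier only coarsens the 16 values around it, which is the intuition behind getting usable accuracy out of four bits.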
