Every spring, in a packed arena that the press has taken to calling "AI Woodstock," a man in a black leather jacket walks on stage without notes and tells the technology industry what the next year will look like. He is almost always right, because by the time he says it, most of it has already been built. This is the story of the machine he unveiled for 2026, told the way a hardware enthusiast would tell it: every chip, every acronym, and the part nobody explains clearly, which is what you can actually buy and what "one unit" even means.
NVIDIA names its architectures after scientists, and the choices are rarely accidental. Pascal, Volta, Turing, Ampere, Hopper, Blackwell. This generation belongs to Vera Rubin, the American astronomer whose painstaking measurements of how galaxies rotate produced the most convincing evidence we have for dark matter, the unseen mass that holds the cosmos together. There is a quiet poetry in attaching her name to a machine built for inference, the act of drawing conclusions from what a model cannot directly see.
And then, as has become tradition, NVIDIA split the name across the two halves of the system, which tells you the entire design philosophy before you read a single specification.
The surname goes to the GPU, the engine that does the punishing AI mathematics. When people say "a Rubin," this is the silicon they picture.
The first name goes to a brand new CPU, the data-traffic controller and coordinator. It is the successor to "Grace," and it marks NVIDIA's return to designing its own processor cores from scratch.
So "Vera Rubin" is not two products bolted together at the last minute. It is a CPU and a GPU drawn up on the same whiteboard, intended from day one to behave as a single organism.
To understand why a new NVIDIA rack now moves global stock markets, you have to understand how improbable the company's position is. NVIDIA was founded in 1993 in a roadside Denny's in San Jose by Jensen Huang and two engineers, with a plan to accelerate computer graphics. It nearly died in its first years when an early chip flopped, and survived on a single rescue product. For most of its life it was, to the wider world, the company that made your video games run smoothly.
The pivot that created today's giant was a bet almost nobody else was willing to make. Around 2006, NVIDIA began pouring money and engineers into CUDA, a way to use graphics chips for general-purpose computing. For years this looked like an expensive distraction; analysts asked why a gaming company was funding a research-computing platform with no obvious customer. Huang kept funding it anyway. When deep learning arrived a few years later and researchers discovered that GPUs were almost magically suited to training neural networks, NVIDIA was the only company with years of software already in the ground. That patience is the moat. Rivals can copy the silicon; they cannot copy twenty years of libraries and the millions of developers who learned on them.
The rest reads like a coronation. The 2020 acquisition of Mellanox turned NVIDIA from a chip vendor into a networking company. The arrival of ChatGPT turned its data-center GPUs into the most fought-over hardware on the planet. By the mid-2020s NVIDIA had passed Apple and Microsoft to become the most valuable public company in the world, selling the picks and shovels for the entire AI gold rush, with a share of the AI accelerator market so dominant that its competitors are left dividing the remainder among themselves.
Huang's recurring line at these launches, delivered with a showman's grin, is "the more you buy, the more you save." It is a joke. It is also, for the hyperscalers racing each other, a description of how they actually think.
The texture matters too, because Huang has made it part of the product. The black leather jacket he has worn for decades is now as recognizable as the logo. He runs the company famously flat, with dozens of executives reporting to him directly. In 2016 he personally carried the first DGX-1 AI supercomputer to a young lab called OpenAI and signed its chassis with a dedication to the future of computing and humanity, a piece of theater that has aged into legend. The annual GTC keynote, where Blackwell was held aloft in 2024 and the Rubin roadmap was laid out in 2025, is less a product briefing than a stadium event. None of this is incidental. It is how a components company convinced the world it is building the future, one rack at a time.
Here is the lineage that explains why Rubin matters. NVIDIA has compressed what used to be a multi-year chip cycle into an annual drumbeat, and each beat has had a distinct personality.
| Year | Architecture | Flagship | What it really meant |
|---|---|---|---|
| 2022 to 2023 | Hopper | H100 / H200 | The chip the whole world fought over in the ChatGPT gold rush. |
| 2024 | Blackwell | B200 / GB200 NVL72 | NVIDIA stops selling a chip and starts selling a whole rack as one computer. |
| 2025 | Blackwell Ultra | GB300 NVL72 | Retuned for "reasoning" AI that thinks before it answers. |
| 2026 | Rubin | VR200 / Vera Rubin NVL72 | Built to run AI cheaply, for billions of users and AI agents. (this guide) |
| 2027 | Rubin Ultra | NVL576 | Doubles the rack to 144 packages (576 GPU dies). |
| ~2028 | Feynman | on the map | Named, scheduled, and still under wraps. |
Strip away the codenames and there is a single arc running through all of it: the industry moved from the problem of building giant models to the far larger problem of running them affordably for everyone, forever. Read top to bottom, this is that story.
When ChatGPT detonated, every company on Earth suddenly needed to train AI, and the H100 was the one chip that could really do it. Demand went vertical, supply evaporated, and the phrase "GPU shortage" entered ordinary conversation. The whole modern boom was built on these. NVIDIA's valuation crossed a trillion dollars on the strength of them.
Frontier models grew too large to live on a single chip, so Huang changed what NVIDIA sells. Instead of a card you slot into a server, Blackwell's flagship is an entire rack: seventy-two GPUs wired so tightly they behave like one colossal brain. This is the moment "AI factory," his favorite phrase, stopped being a metaphor and became a product category.
A new species of model appeared, the kind that reasons step by step before it answers, the source of that "thinking" pause you now see in chatbots. That style burns enormous compute at answer time, not just at training time. GB300 was a mid-cycle tune-up aimed squarely at this reasoning workload, a hint of where the center of gravity was about to move.
The hard problem flipped completely. The challenge is no longer training a model once. It is serving that model to billions of people and to armies of autonomous agents, affordably, every second of every day. Rubin is engineered for exactly this. NVIDIA is targeting roughly a tenth of the cost per answer compared with Blackwell, and it has added the new Vera CPU and the Groq LPU specifically to keep agentic workloads fast and cheap.
Rubin Ultra is the same idea, scaled. The rack roughly doubles to 144 packages (576 GPU dies), readying NVIDIA for the next jump in both model size and global user volume.
Feynman is named for the physicist Richard Feynman. It exists on Huang's roadmap slide, which by now is enough to make suppliers plan around it, but the details are still locked away.
The single hardest adjustment for anyone coming from PCs is this: a consumer GPU lives inside one machine, but NVIDIA's unit of measure is now an entire rack, a refrigerator-sized cabinet that behaves as one computer assembled from roughly 1,300 chips. Once you accept that the rack is the product, two interconnect concepts unlock everything else, and they exist because moving data is now harder than computing on it.
Scale-up means wiring every GPU within a single rack so they act like one enormous GPU. The hero here is NVLink, a private highway that runs at close to memory speed. This is the part nobody else can match at scale.
Scale-out means linking many racks into a full data center. The heroes are the networking chips: ConnectX, Spectrum-X, Quantum-X InfiniBand. Think of this as the public motorway between buildings, fast but a tier below NVLink.
It helps to see the physical object. A Vera Rubin NVL72 is not abstract. It is a cabinet you could touch, if you were allowed near the cooling loop, weighing on the order of one and a half metric tons and drinking power at a rate that would trip the breakers of a small office block. Here is what is inside it, top to bottom.
That is the punchline of the whole platform in one image. The thinking happens in the green compute trays; the brighter switch trays in the middle are the connective tissue that lets all seventy-two GPUs pretend to be one; and the orange and teal bands at top and bottom are the unglamorous truth of modern AI, which is that the real engineering challenge is now feeding the thing power and carrying away its heat.
It is a fair thing to ask, and the honest answer is that asking whether a modern NVIDIA GPU "has CUDA" is a little like asking whether a Catholic cathedral has religion. CUDA is not a feature of the hardware. It is the gravity well that the entire AI industry now orbits, and it exists in two distinct forms that Rubin carries at once.
The first form is CUDA the platform: NVIDIA's programming model and its two-decade stack of libraries, CUDA-X. This is the real moat, the thing AMD and Intel keep failing to dislodge. Rubin runs CUDA 13, and the new Groq chips were deliberately wired in so that no CUDA code has to change.
The second form is the CUDA core: the small general-purpose math units inside every NVIDIA GPU. They handle ordinary parallel arithmetic in FP32, FP64 and integers, the foundation for scientific computing and all the logic that surrounds the AI itself.
Inside each GPU, compute is grouped into SMs, or Streaming Multiprocessors, and the Rubin GPU has 224 of them. Picture an SM as a workshop containing two very different teams. There is no single "AI core" doing everything; the magic is in how these two teams divide the labor.
| Core type | What it does | What it is for | Think of it as |
|---|---|---|---|
| CUDA core | General-purpose parallel math (FP32, FP64, integers) | HPC, physics, simulation, and the connective logic around the AI | A bench of versatile generalists |
| Tensor Core | Matrix-multiply-accumulate at low precision (FP4, FP8, FP16) | The roughly ninety percent of deep learning that is matrix math. This is where the headline PetaFLOPS come from. | Specialist robots that do one thing at terrifying speed |
Rubin's 224 SMs carry the latest generation of Tensor Cores, which NVIDIA documents as fifth-generation and tunes heavily for NVFP4 and FP8, alongside the general-purpose CUDA cores. For data-center parts NVIDIA leads its marketing with Tensor throughput rather than a raw CUDA-core count, and that choice is itself the story: for AI, the matrix engines are what matter, and the specialists have quietly become the main event.
Sitting on top of the Tensor Cores is a layer of hardware and software called the Transformer Engine, and it is one of NVIDIA's cleverest tricks. As a model runs, it watches the numbers flowing through each layer and decides, on the fly, which values can be safely crushed down to ultra-low precision and which need a little more room. The result is the holy grail of inference economics: something close to FP4 speed with close to FP8 accuracy. Rubin's third-generation engine adds an adaptive, two-level scaling scheme and retires the older "structured sparsity" trick from previous generations. This is how the same chip can quote 35 dense PetaFLOPS of NVFP4 yet reach an effective 50 PetaFLOPS for real inference.
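To make the idea of two-level scaling concrete, here is a minimal sketch in Python. It is a generic illustration of block-scaled quantization onto a 4-bit grid, not NVIDIA's Transformer Engine: the value grid, the block size of 16, and every function name are assumptions chosen for illustration.

```python
# Magnitudes representable by a 4-bit float with 2 exponent bits and
# 1 mantissa bit. Assumed here for illustration only.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_GRID[-1]


def round_to_grid(x):
    """Round x to the nearest representable 4-bit value, keeping its sign."""
    magnitude = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return -magnitude if x < 0 else magnitude


def quantize_two_level(values, block_size=16):
    """Quantize with one scale for the whole tensor and one scale per block.

    Returns (tensor_scale, [(block_scale, codes), ...]).
    """
    tensor_max = max((abs(v) for v in values), default=0.0)
    tensor_scale = tensor_max / FP4_MAX if tensor_max > 0 else 1.0
    blocks = []
    for start in range(0, len(values), block_size):
        block = [v / tensor_scale for v in values[start:start + block_size]]
        block_max = max(abs(v) for v in block)
        block_scale = block_max / FP4_MAX if block_max > 0 else 1.0
        codes = [round_to_grid(v / block_scale) for v in block]
        blocks.append((block_scale, codes))
    return tensor_scale, blocks


def dequantize(tensor_scale, blocks):
    """Rebuild approximate values from the 4-bit codes and both scales."""
    return [c * s * tensor_scale for s, codes in blocks for c in codes]
```

The point of the second level is visible with a tensor that mixes tiny and large values: with a single scale the tiny block would round to zero, while a per-block scale lets it use the full 4-bit grid.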
Precision simply means how many bits you spend to represent each number. Fewer bits run faster and use less memory but carry less accuracy, so the entire discipline of modern AI hardware is a hunt for the lowest precision you can get away with without the model falling apart.
| Format | Bits | Primary use | Why you would reach for it |
|---|---|---|---|
| NVFP4 | 4 | Inference at massive scale, and increasingly the training of the very largest models | The cheapest possible cost per token. The Transformer Engine keeps the accuracy honest. |
| FP8 / NVFP8 | 8 | Training, and inference where quality cannot slip | More numerical range than FP4, the safe default when stability matters. |
| FP16 / BF16 | 16 | Sensitive training layers, gradients, mixed precision | Stability where FP8 would drift. BF16 trades precision for range. |
| TF32 | ~19 | A near drop-in accelerator for FP32 work | Faster than FP32 with almost no code change. |
| FP32 / FP64 | 32 / 64 | Scientific computing, simulation, HPC | Full precision where being wrong is not an option. This runs on the CUDA cores. |
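The trade in the table is easy to put into numbers. The sketch below computes how much memory a model's weights occupy at each precision; the 70-billion-parameter model is hypothetical, and the calculation ignores the small overhead of scale factors that low-precision formats carry.

```python
# Bits per stored value for the formats in the table. TF32 is left out
# because it is held in a 32-bit container in memory.
BITS_PER_VALUE = {"NVFP4": 4, "FP8": 8, "BF16": 16, "FP32": 32, "FP64": 64}


def weight_footprint_gb(num_parameters, fmt):
    """Weight storage in decimal gigabytes, ignoring scale-factor overhead."""
    return num_parameters * BITS_PER_VALUE[fmt] / 8 / 1e9


params = 70e9  # hypothetical 70-billion-parameter model
footprints = {fmt: weight_footprint_gb(params, fmt) for fmt in BITS_PER_VALUE}

for fmt, gigabytes in footprints.items():
    print(f"{fmt:>5}: {gigabytes:7.1f} GB")
```

For that hypothetical model the weights shrink from 280 GB in FP32 to 35 GB in NVFP4, which is the whole argument for low precision in one line.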
"Seven new chips, one AI supercomputer" is how Huang framed it on stage, and the framing is doing real work. The competition tends to ship a faster GPU. NVIDIA ships an entire fleet of co-designed silicon, each piece solving a different bottleneck, and the bottlenecks are no longer mostly about compute. Here is the cast, in order.
The Rubin GPU is the star, the silicon that does the overwhelming share of the AI mathematics, and physically it represents a clean break from the past. Old GPUs were a single monolithic slab. Rubin is a chiplet design, an assembly of separate dies bonded onto one carrier, because chips have hit the physical ceiling of how large a single piece of silicon can be manufactured. NVIDIA's answer is to stop fighting that ceiling and start sewing chips together.
Two upgrades stand out. The first is memory, and it is the more important of the two. As models swell, the limiting factor stopped being how fast a chip can compute and became how fast you can feed it. HBM4 doubles the interface width over the previous HBM3e, delivering 288 GB across eight stacks at up to 22 TB/s, roughly two and three-quarter times Blackwell's 8 TB/s. The second is the third-generation Transformer Engine described above. Together they target the exact wall that the inference era runs into.
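The bandwidth figures above can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted in the text; the "sweep time" (how long it takes to read every byte of memory once) is an illustrative metric of my own choosing, not an NVIDIA specification.

```python
HBM4_CAPACITY_GB = 288           # per Rubin GPU, from the text
HBM4_BANDWIDTH_TBPS = 22.0       # per Rubin GPU, from the text
BLACKWELL_BANDWIDTH_TBPS = 8.0   # from the text


def sweep_time_ms(capacity_gb, bandwidth_tbps):
    """Milliseconds needed to read the whole memory once at peak bandwidth."""
    seconds = capacity_gb / (bandwidth_tbps * 1000)  # TB/s -> GB/s
    return seconds * 1000


bandwidth_ratio = HBM4_BANDWIDTH_TBPS / BLACKWELL_BANDWIDTH_TBPS
rubin_sweep_ms = sweep_time_ms(HBM4_CAPACITY_GB, HBM4_BANDWIDTH_TBPS)
blackwell_sweep_ms = sweep_time_ms(HBM4_CAPACITY_GB, BLACKWELL_BANDWIDTH_TBPS)

print(f"ratio {bandwidth_ratio:.2f}x, "
      f"sweep {rubin_sweep_ms:.1f} ms vs {blackwell_sweep_ms:.1f} ms")
```

At peak rates, reading all 288 GB takes about 13 ms at 22 TB/s against 36 ms at 8 TB/s, which is why the memory upgrade matters more than any change in raw compute.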
The Vera CPU is the genuinely new piece, and the one most people miss. Blackwell's companion CPU was called Grace, and it leaned on off-the-shelf Arm Neoverse cores. Vera is its successor, and the headline is that NVIDIA designed the cores itself, a custom design code-named "Olympus." After years of buying its CPU cores off the shelf, NVIDIA has gone back to drawing its own, which is a statement of intent about how tightly it wants the CPU and GPU to fit.
The job here is the work a GPU is genuinely bad at. GPUs are gloriously fast at doing the same operation across thousands of data points, but they stumble on branchy, sequential, decision-heavy logic. Vera handles exactly that: staging data, deciding which GPU should get which piece at which moment, and orchestrating the long, multi-step loops of agentic reasoning. NVIDIA keeps using the word "deterministic," meaning predictable, jitter-free timing, and for AI agents that fire thousands of small dependent steps, that predictability is worth more than raw speed.
A frontier model does not fit on one GPU; it is sliced across dozens, and those slices have to talk to each other constantly and instantly, or the whole rack stalls waiting on itself. NVLink is the private highway that carries that traffic, and the NVLink Switch is the interchange that lets every GPU reach every other one at once. This is the single piece of the puzzle that competitors have struggled most to replicate at scale.
The payoff is the illusion at the heart of the NVL72: with NVLink 6, all seventy-two GPUs can behave as one enormous GPU with a shared pool of memory. Try to move model shards across ordinary networking instead and the latency would strangle performance. NVLink keeps that traffic running at something close to memory speed, which is the difference between a rack and a mere pile of servers.
One rack is never enough at the frontier, so you link hundreds of them into a single training cluster, and the SuperNIC is what does the linking. This is the lineage that traces straight back to the Mellanox acquisition; networking is no longer an afterthought bolted onto the side of an AI system, it is a first-class citizen of the design.
Its key trick is programmable RDMA, which lets a GPU reach directly into another server's memory without bothering either machine's CPU, at very low latency. The clean mental split is this: NVLink makes things fast inside the rack, and ConnectX-9 makes things fast between racks. Both have to be excellent or the cluster runs at the speed of its weakest link.
A DPU, or Data Processing Unit, is the data center's invisible custodian. Every AI cluster has a mountain of unglamorous chores: shuffling storage, managing the network, enforcing security boundaries between tenants, encrypting traffic. Run those on the expensive GPUs and CPUs and you are burning gold to do janitorial work. BlueField-4 takes all of it off their plate.
The new wrinkle for Rubin is that BlueField-4 can park the KV-cache on an integrated SSD. As models stretch to million-token contexts, they need a fast tier to hold their working memory, and putting it close to the network fabric rather than hogging precious HBM is one of those quiet architectural decisions that pays off enormously at scale.
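A back-of-envelope calculation shows why a million-token context needs its own memory tier. The sketch below uses the standard sizing rule for a transformer's KV-cache (one key and one value vector per layer, per KV head, per token); the model shape is hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Bytes needed to hold the keys and values for `tokens` positions."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return tokens * per_token


# Hypothetical model shape with 16-bit cache entries, not a real product.
cache_bytes = kv_cache_bytes(
    tokens=1_000_000, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2
)
cache_gb = cache_bytes / 1e9
HBM4_PER_GPU_GB = 288  # from the text

print(f"KV-cache: {cache_gb:.1f} GB, fits in one GPU's HBM: "
      f"{cache_gb <= HBM4_PER_GPU_GB}")
```

Under those assumptions a single million-token conversation needs about 328 GB of cache, more than the 288 GB of HBM4 on one Rubin GPU, before a single model weight has been loaded.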
The Spectrum-X co-packaged-optics switch is the most science-fiction piece in the set. Traditional network switches turn electricity into light using little pluggable optical modules, and at data-center scale those modules are a genuine plague: they burn power, run hot, and fail often enough to be a real reliability problem. Co-packaged optics takes the radical step of building the silicon photonics right next to the switch chip, generating the light at the source rather than at the end of a copper run.
At the scale of a gigawatt AI factory, the power drawn by networking and the failures of optical links are real line items on the budget and the maintenance schedule. Co-packaged optics attacks both at once, moving data at the speed of light while spending dramatically less energy to do it. It is the kind of unglamorous infrastructure win that does not make headlines but quietly decides whether a build is economical.
If one card on NVIDIA's own slide makes people do a double-take, it is the Groq LPU, and the confusion is completely understandable, because Groq spent years as a rival. The explanation is one of the more dramatic corporate moves of the decade. On the 24th of December 2025, NVIDIA signed a deal with Groq worth roughly twenty billion dollars. Structurally it is a non-exclusive license combined with a team transfer, an "acqui-hire," which is precisely the structure that let it close in under four months without triggering a full merger review. Groq's founder Jonathan Ross, who originally designed Google's TPU, and its president Sunny Madra moved over to NVIDIA.
Huang reached for a familiar comparison on stage. Mellanox, he reminded everyone, turned NVIDIA from a chip vendor into a networking company. Groq, in the same telling, turns it from a training-first GPU vendor into a full-stack platform optimized for inference. It is a tidy narrative, and like most things Huang says from a stage, it doubles as a roadmap.
An LPU, or Language Processing Unit, closes the one gap a GPU cannot close on its own. The problem is specific and brutal. At very high token-generation speeds, north of a thousand tokens per second, even NVLink-connected GPU systems choke, not on compute but on memory bandwidth, because the data has to keep making the round trip to the HBM stacks at the edge of the package. The LPU's solution is radical to the point of seeming reckless: throw out the expensive HBM entirely and pour a vast pool of SRAM directly onto the die, right alongside the compute, so the data barely has to travel at all.
The two chips do not compete inside the system; they divide the labor of generating an answer. The Rubin GPU handles the prefill, the KV-cache, and the heavy attention math, while the LPU takes the latency-sensitive feed-forward networks, the mixture-of-expert execution, and the pointwise operations. NVIDIA's claim for the pairing is striking: an LPX rack working alongside a Vera Rubin NVL72 delivers thirty-five times the inference throughput per megawatt of Blackwell. The LPX racks physically sit beside the Rubin racks, connected over Spectrum-X, and, in the line that is clearly aimed at reassuring nervous customers, none of it requires a single change to CUDA code.
Beyond the seven, NVIDIA lists one more part as a GPU in its own right: Rubin CPX, the Context Phase aXcelerator. To understand why it exists, you have to know that answering a prompt is really two jobs, with completely different appetites, and using one expensive chip for both is like running a fine-dining kitchen where the same chef chops the onions and plates the dessert.
The first job is prefill: digesting the prompt, which today might be an entire codebase or a feature-length video, millions of tokens at once. This phase is compute-hungry but not especially latency-sensitive. The right tool is Rubin CPX.
The second job is generation: producing the answer one token at a time. This phase is bound by memory and latency, not raw compute. The right tools are the Rubin GPU and, at extreme speed, the LPU.
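The different appetites of the two jobs can be shown with a simple roofline-style estimate. The sketch below is a rough model under stated assumptions: a dense model costing about 2 FLOPs per parameter per token, weights read once per pass, 4-bit weights, and no attention or KV-cache traffic. The prompt length of 8,192 tokens is hypothetical; the peak figures are the per-GPU numbers quoted in this guide.

```python
PEAK_FLOPS = 35e15        # dense NVFP4 per Rubin GPU, from the text
HBM_BYTES_PER_S = 22e12   # HBM4 bandwidth per Rubin GPU, from the text


def arithmetic_intensity(tokens_per_pass, bytes_per_param):
    """FLOPs performed per byte of weights read, for a dense model.

    Assumes ~2 FLOPs per parameter per token and one read of the
    weights per pass; attention and KV-cache traffic are ignored.
    """
    return 2 * tokens_per_pass / bytes_per_param


# FLOPs the chip can perform in the time it takes to read one byte.
machine_balance = PEAK_FLOPS / HBM_BYTES_PER_S

prefill = arithmetic_intensity(tokens_per_pass=8192, bytes_per_param=0.5)
generation = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=0.5)

prefill_is_compute_bound = prefill > machine_balance
generation_is_memory_bound = generation < machine_balance
```

With these assumptions the chip can do roughly 1,600 FLOPs per byte it reads; prefill offers far more work than that per byte and saturates the math units, while single-stream generation offers about 4 and leaves them waiting on memory.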
CPX delivers thirty PetaFLOPS of NVFP4, but its defining choice is to swap the expensive HBM4 for 128 GB of much cheaper GDDR7, the kind of memory you would find on a gaming card. For the prefill job, ingesting long sequences cheaply and without melting, that trade is exactly right, and CPX pairs it with hardware video decode and roughly three times the attention performance of GB300.
These names trip up almost everyone, and they shouldn't, because they are simply different ways of putting the same chips into different boxes at different scales. Here is the clean separation, and then a picture that makes the hierarchy obvious.
| Name | What it actually is | Scale | Who it is for |
|---|---|---|---|
| MGX | A modular reference design, a blueprint for building the rack with cable-free trays. Not a product you buy, a template partners build from. | A specification | The 80-plus OEM and system builders |
| NVL72 | The rack-scale system: 72 Rubin GPUs and 36 Vera CPUs, fully liquid-cooled, all-to-all NVLink. The scale-up domain made physical. | One rack | The all-in NVIDIA stack, with an Arm Vera host |
| HGX Rubin NVL8 | An eight-GPU baseboard for conventional x86 servers. The "keep one foot in the familiar world" option. | One server, 8 GPUs | Enterprises that still want x86 servers |
| DGX | NVIDIA's own branded, turnkey build of the above, pre-integrated, supported by NVIDIA itself, shipped with the full software stack. | One rack and up | Buyers who want it finished and warranted by NVIDIA |
| DGX SuperPOD | A pre-engineered cluster of many NVL72 racks plus networking, storage and software. As close as it gets to ordering a turnkey AI data center. | 8-plus racks | Gigascale and frontier AI labs |
This is the question that quietly trips up everyone arriving from the world of consumer hardware, and it is worth slowing down for. When you buy an RTX 4090, one unit means one GPU, a card you slide into a PC. AI-factory hardware refuses to play by that rule. The product is the system, or the rack, almost never the bare chip. Here is the purchasing ladder, from the rung you cannot actually stand on up to hyperscale.
Unlike an RTX card, you cannot walk in and buy one Rubin GPU. NVIDIA sells chips to system builders, not to end users. As a rough sense of scale only, the implied value of one GPU inside a rack lands somewhere around fifty to seventy thousand dollars, but you will never see it on its own price tag.
The superchip is one Vera CPU plus two Rubin GPUs on a board. It is a building block sold to integrators. You do not buy "one superchip" as a thing that arrives in a box; you buy a system that happens to contain them.
The first rung you can actually stand on is a server from Dell, Supermicro, HPE and the like, carrying an eight-GPU HGX baseboard on an x86 host. This is the closest thing to "a box of GPUs," and it is where most enterprises actually start.
When someone in the data-center world says "a Vera Rubin," nine times out of ten they mean one NVL72 rack: 72 GPUs and 36 CPUs in a single liquid-cooled cabinet. You can buy a functionally equivalent rack from an OEM building on the MGX blueprint.
Or you buy the DGX version straight from NVIDIA: the same hardware, but integrated, supported and warranted by NVIDIA, arriving with the full software stack of DGX OS, Base Command and Mission Control.
A DGX SuperPOD is a pre-engineered cluster. A common configuration is eight NVL72 racks: 576 GPUs, 288 CPUs, roughly 600 TB of memory and about 28.8 ExaFLOPS of NVFP4, plus switches, storage, DPUs, cabling and software, ordered and delivered as a single unit.
Stack SuperPODs and you arrive at the hyperscale AI factory, the level at which Microsoft, Google, xAI, Meta and OpenAI actually operate. This is what Huang means when he says the unit of computing is now the data center.
Because the instinct from PC building is "one RTX 4090 equals one GPU," the most useful thing I can show you is what a "one-unit" purchase really contains at each tier. These are illustrative reference bills of quantities (BOQs) meant to build intuition, not quotes; exact quantities flex with configuration, and the prices are deliberately rough.
| # | Line item | Qty | Notes |
|---|---|---|---|
| 1 | Liquid-cooled server chassis (OEM) | 1 | Dell, Supermicro, HPE and similar |
| 2 | HGX Rubin NVL8 baseboard | 1 | carries the eight GPUs and on-board NVLink |
| 3 | Rubin GPU | 8 | eight times 288 GB HBM4 is 2,304 GB of GPU memory |
| 4 | x86 host CPU | 2 | dual-socket host |
| 5 | System DRAM (DDR5) | ~2 to 4 TB | configuration-dependent |
| 6 | ConnectX-9 SuperNIC | 8 to 9 | scale-out, roughly one per GPU |
| 7 | BlueField-4 DPU | 1 to 2 | storage and infrastructure offload |
| 8 | NVMe storage | multiple | local scratch and KV-cache tier |
| 9 | Liquid-cooling manifold and PSUs | 1 set | direct-to-chip cooling |
Use case: enterprise inference and mid-size training, mixed workloads, x86 compatibility. Indicative cost: several hundred thousand dollars.
| # | Line item | Qty | Notes |
|---|---|---|---|
| 1 | Compute trays | 18 | each tray is 2 Vera CPU plus 4 Rubin GPU, that is 2 superchips |
| 2 | Rubin GPU (total) | 72 | 72 times 288 GB is 20.7 TB HBM4, about 1,580 TB/s aggregate |
| 3 | Vera CPU (total) | 36 | 3,168 Olympus cores, 54 TB LPDDR5X |
| 4 | NVLink 6 switch trays | 9 | provides 260 TB/s all-to-all, the scale-up fabric |
| 5 | NVLink spine and backplane | 1 | cable-free |
| 6 | ConnectX-9 SuperNIC | up to 72 | scale-out fabric, configuration-dependent |
| 7 | BlueField-4 DPU | ~18 | storage, security, KV-cache |
| 8 | Power shelves and busbar | 1 set | roughly 150 to 190 kW class |
| 9 | CDU and liquid-cooling loop | 1 set | fanless trays, about twice Blackwell's flow |
| 10 | Rack enclosure | 1 | around 1.5 tonnes fully populated |
Use case: the "one unit" of frontier training and inference. Indicative cost: roughly four to five million dollars per rack; for reference, a GB200 NVL72 was widely reported at around three million.
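The rack totals in the table follow directly from the per-tray and per-GPU figures, and it is worth seeing the arithmetic once. This sketch uses only numbers quoted in this guide; the "implied value per GPU" simply divides the rough rack price by 72, which overstates the GPU's share because the rack also contains CPUs, switches, power and cooling.

```python
TRAYS = 18
GPUS_PER_TRAY, CPUS_PER_TRAY = 4, 2
HBM_GB_PER_GPU = 288
HBM_TBPS_PER_GPU = 22
NVLINK_TBPS_PER_GPU = 3.6
TOTAL_CPU_CORES = 3168

gpus = TRAYS * GPUS_PER_TRAY                  # 72
cpus = TRAYS * CPUS_PER_TRAY                  # 36
hbm_tb = gpus * HBM_GB_PER_GPU / 1000         # total HBM4 in TB
hbm_tbps = gpus * HBM_TBPS_PER_GPU            # aggregate HBM bandwidth
nvlink_tbps = gpus * NVLINK_TBPS_PER_GPU      # aggregate NVLink bandwidth
cores_per_cpu = TOTAL_CPU_CORES // cpus       # cores on each Vera CPU

# Rough rack price range from the text, spread evenly over the GPUs.
implied_low_usd = 4_000_000 / gpus
implied_high_usd = 5_000_000 / gpus

print(f"{gpus} GPUs, {cpus} CPUs, {hbm_tb:.1f} TB HBM4, "
      f"{hbm_tbps} TB/s HBM, {nvlink_tbps:.0f} TB/s NVLink")
print(f"implied per GPU: ${implied_low_usd:,.0f} to ${implied_high_usd:,.0f}")
```

The division lands at roughly $56,000 to $69,000 per GPU, which is where the fifty-to-seventy-thousand-dollar figure quoted earlier comes from.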
| # | Line item | Qty | Notes |
|---|---|---|---|
| 1 | DGX Vera Rubin NVL72 racks | 8 | = 576 GPUs, 288 CPUs, ~600 TB memory, 28.8 EF NVFP4 |
| 2 | Quantum-X800 InfiniBand or Spectrum-X Ethernet switches | set | the scale-out fabric between racks |
| 3 | Spectrum-X co-packaged-optics switches | set | silicon-photonics option |
| 4 | Storage nodes (context and KV-cache tier) | multiple | high-speed model and data storage |
| 5 | Management and head nodes | multiple | orchestration |
| 6 | Software stack | included | Base Command, Mission Control, AI Enterprise, Run:ai |
| 7 | Integration, cabling, CDUs, support | included | turnkey delivery |
| 8 | Optional Groq 3 LPX inference racks | add-on | 256 LPUs each, sit beside the racks via Spectrum-X |
Use case: gigascale training of frontier models, ordered as a single AI supercomputer. Indicative cost: tens of millions of dollars.
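The SuperPOD headline numbers are just the rack multiplied by eight. The sketch below reproduces them from the per-rack figures in the NVL72 table; reading "roughly 600 TB of memory" as HBM4 plus the CPUs' LPDDR5X is my assumption, since it is the combination that matches the quoted total.

```python
RACKS = 8
PER_RACK = {
    "gpus": 72,
    "cpus": 36,
    "hbm4_tb": 20.7,
    "lpddr5x_tb": 54,
    "nvfp4_exaflops": 3.6,
}

# Scale every per-rack figure by the number of racks.
superpod = {name: RACKS * value for name, value in PER_RACK.items()}

# Assumption: the quoted memory total counts GPU and CPU memory together.
total_memory_tb = superpod["hbm4_tb"] + superpod["lpddr5x_tb"]

for name, value in superpod.items():
    print(f"{name:>15}: {value:g}")
print(f"{'total_memory_tb':>15}: {total_memory_tb:g}")
```

That gives 576 GPUs, 288 CPUs, 28.8 ExaFLOPS and just under 600 TB of combined memory, matching the first line of the table.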
NVIDIA's slides love a big multiplier, so this section sticks to numbers you can defend and flags the ones that mix conventions. Two charts first, because two of the generational jumps are clean and unambiguous, and then the full tables.
| Spec | B200 (in GB200) | B300 (in GB300) | Rubin R200 |
|---|---|---|---|
| Process | TSMC 4NP | TSMC 4NP | TSMC N3P (3nm) |
| Transistors | 208 B | 208 B | 336 B |
| Die structure | 2 dies | 2 dies | 2 compute + 2 I/O |
| Memory | ~180 to 192 GB HBM3e | 288 GB HBM3e | 288 GB HBM4 |
| Bandwidth | 8 TB/s | 8 TB/s | 22 TB/s |
| Dense FP4 | 9 PF | 15 PF | 35 PF (50 PF inf with TE) |
| Dense FP8 | 4.5 PF | 5 PF | 17.5 PF |
| NVLink per GPU | 1.8 TB/s (v5) | 1.8 TB/s (v5) | 3.6 TB/s (v6) |
| TDP | ~1,000 W | ~1,400 W | ~1,800 to 2,300 W |
| Spec | GB200 NVL72 | GB300 NVL72 | Vera Rubin NVL72 | VR NVL144 CPX |
|---|---|---|---|---|
| GPUs | 72 B200 | 72 B300 | 72 Rubin | 144 Rubin + 144 CPX |
| CPUs | 36 Grace | 36 Grace | 36 Vera | 36 Vera |
| FP4 inference | ~1.4 EF | ~1.1 EF dense | 3.6 EF | 8 EF |
| HBM capacity | ~13.4 TB HBM3e | 20.7 TB HBM3e | 20.7 TB HBM4 | 100 TB fast mem |
| HBM bandwidth | ~576 TB/s | ~576 TB/s | 1,580 TB/s | 1.7 PB/s |
| NVLink per rack | ~130 TB/s | ~130 TB/s | 260 TB/s | 260 TB/s |
| vs prior gen | baseline | 1.5x GB200 | ~3.3x GB300 | ~7.5x GB300 |
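The multipliers in the last row can be recomputed from the FP4 row, which is a useful habit with any vendor table. This sketch uses only the rounded values printed above; the helper name is my own.

```python
# FP4 inference per rack in exaFLOPS, as listed in the table.
FP4_EXAFLOPS = {
    "GB300 NVL72": 1.1,
    "Vera Rubin NVL72": 3.6,
    "VR NVL144 CPX": 8.0,
}


def speedup(new, old, table=FP4_EXAFLOPS):
    """Ratio of two table entries, rounded to two decimals."""
    return round(table[new] / table[old], 2)


rubin_vs_gb300 = speedup("Vera Rubin NVL72", "GB300 NVL72")
cpx_vs_gb300 = speedup("VR NVL144 CPX", "GB300 NVL72")

print(f"Vera Rubin NVL72 vs GB300: {rubin_vs_gb300}x")
print(f"VR NVL144 CPX vs GB300:    {cpx_vs_gb300}x")
```

The rounded inputs give about 3.27x and 7.27x, close to the table's ~3.3x and ~7.5x; the small gap on the second figure may simply reflect rounding in the published numbers. Note also that the GB300 entry is labelled dense while the Rubin entries are not, so these ratios are indicative rather than like-for-like.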
For all the spec tables, NVIDIA knows the buyers care about two numbers above all, because those are the ones that show up in a profit-and-loss statement. Both are big, and both deserve a raised eyebrow until independent benchmarks land.
NVIDIA says Rubin can train large mixture-of-expert models with roughly one-quarter the number of GPUs that Blackwell needed. If it holds, that is a direct cut to both the capital bill and the power bill.
For deep-reasoning agentic workloads, Rubin targets about one-tenth the cost per million tokens versus Blackwell. Token economics, not raw FLOPS, is the real battlefield of the inference era, and this is the number aimed at it.
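Cost per million tokens is a simple ratio, and seeing it written down makes clear which levers move it. The sketch below is a toy model: the hourly cost and throughput figures are hypothetical placeholders, not NVIDIA or customer data.

```python
def cost_per_million_tokens(hourly_cost_usd, tokens_per_second):
    """Dollars spent to generate one million tokens at a steady rate."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: the same hourly cost, ten times the throughput.
baseline = cost_per_million_tokens(hourly_cost_usd=100.0, tokens_per_second=10_000)
improved = cost_per_million_tokens(hourly_cost_usd=100.0, tokens_per_second=100_000)

print(f"baseline ${baseline:.2f}, improved ${improved:.2f} per million tokens")
```

A tenfold cut in cost per token can come from ten times the throughput at the same hourly cost, from a cheaper hour, or from any mix of the two, which is why the claim needs independent benchmarks that report both sides of the ratio.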
Every key term in one place.