H100, H200, B200, B300, GB200, GB300, and Vera Rubin. What they are and when each one is the right choice. By XIRR Advisors, brokers of reserved GPU and colocation capacity.
When you strip away the marketing and the naming conventions, choosing an NVIDIA data center GPU comes down to three practical questions. How much memory sits on the chip? How fast can that memory move data? How much power does the chip draw under load? The right chip for you is the one that matches your workload, not the one with the loudest headlines.
For most organizations in 2026, no single chip is the answer. A practical setup blends a few of them based on what each is built for.
H200 is the workhorse for serving large language models in production. It still delivers the best dollar-per-token at scale.
B300 is the chip you reach for once your serving workload includes reasoning models, the kind that walk through a problem step by step before answering. Those generate far longer outputs than a standard chatbot reply, and they need the extra memory and throughput B300 provides.
B200 is the chip for serious training, including fine-tuning your own large models or running production training pipelines.
GB200 and GB300 NVL72 racks are full rack-scale systems (72 GPUs wired together to behave like one very large accelerator) for organizations training very large models from scratch. These are sometimes called "trillion-parameter" models. Frontier AI labs and hyperscalers operate at this scale. Most companies do not.
Vera Rubin is NVIDIA's next-generation platform, shipping in 2026 and 2027. It will deliver another leap in memory, bandwidth, and performance per watt. If your roadmap extends past 2026, it is worth reserving capacity early, because next-generation supply is allocated months in advance and tight allocations have already become the norm.
The harder problem is rarely the chip. It is finding the right capacity to rent on the right terms: the right data center, the right cooling envelope, the right contract length, and the right supplier. That is what XIRR Advisors does. This guide covers the chips. We help you find the capacity.
Before we go further, here are the acronyms used throughout this guide.
HBM3, HBM3e, HBM4, HBM4E are generations of High Bandwidth Memory, the very fast memory stacked next to the GPU die.
FP4, FP8, FP16 are floating-point number precisions used for AI math. FP4 means 4-bit. Lower precision means more operations per second, with some loss of accuracy. Modern AI models tolerate FP8 and FP4 well.
TFLOPS, PFLOPS, ExaFLOPS measure compute throughput. TFLOPS means a trillion floating-point operations per second. PFLOPS is a thousand TFLOPS. ExaFLOPS is a thousand PFLOPS. A short worked example after this glossary shows how these units roll up at the chip level.
TDP stands for Thermal Design Power, measured in watts. It is roughly how much heat the chip produces under load and therefore how much cooling it needs.
NVLink is NVIDIA's high-speed interconnect between GPUs. NVLink-C2C is the chip-to-chip variant that links a Grace CPU to its GPUs.
NVL72, NVL144, NVL576 are NVLink-connected rack systems containing 72, 144, or 576 GPUs that behave like one large accelerator.
HGX is NVIDIA's reference server platform that hardware partners (Supermicro, Dell, HPE, and others) build production servers around.
LLM means Large Language Model. RAG means Retrieval-Augmented Generation. SLA means Service Level Agreement.
GPUaaS means GPU-as-a-Service, an on-demand rental model. SMB means small or mid-sized business.
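To make the throughput units concrete, here is a minimal back-of-the-envelope sketch in Python. It uses the rounded H100 figure quoted later in this guide (~2 PFLOPS of dense FP8) and assumes peak datasheet throughput, which real training and serving runs never sustain.

```python
# Throughput units, rolled up using the rounded H100 figure from this guide.
# Peak datasheet numbers only; sustained utilization is materially lower.
TFLOPS = 1e12    # a trillion floating-point operations per second
PFLOPS = 1e15    # a thousand TFLOPS
EXAFLOPS = 1e18  # a thousand PFLOPS

h100_fp8_peak = 2 * PFLOPS  # ~2,000 TFLOPS of dense FP8 per H100
gpus_per_exaflops = EXAFLOPS / h100_fp8_peak

print(f"~{gpus_per_exaflops:.0f} H100s at peak FP8 add up to one ExaFLOPS")
# -> ~500 GPUs, before any utilization or networking losses
```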
NVIDIA names each generation of data center GPU after a scientist. Inside each generation, there is usually a base chip and a mid-cycle refresh.
| Generation | Years | Base Chip | Refresh | Rack Form |
|---|---|---|---|---|
| Hopper | 2022 to 2024 | H100 | H200 | HGX H100 / H200 |
| Blackwell | 2024 to 2026 | B200 | B300 (Blackwell Ultra) | HGX B200 / B300 |
| Grace Blackwell | 2025 to 2026 | GB200 | GB300 | NVL72 (72 GPUs per rack) |
| Vera Rubin | 2026 to 2027 | Rubin | Rubin Ultra | NVL144 / NVL576 |
Four generations of NVIDIA data center GPUs, from Hopper to Vera Rubin.
The pattern is consistent. Each step up brings more memory, more bandwidth, more low-precision math, and more power per GPU. The strategic question is whether your workload genuinely needs the headroom, or whether previous-generation silicon at a discount delivers a better return on capital.
Each chip below follows the same four-part structure. What it is, what is inside, what it is best for, and when you would actually buy or lease one.
What it is. NVIDIA's 2022 flagship and the first NVIDIA GPU with a dedicated Transformer Engine for transformer training. Built on the Hopper architecture.
What is inside. 80 GB of HBM3 memory at 3.35 TB/s of bandwidth. Roughly 2 PFLOPS of FP8 compute. 700 W TDP[1].
Best for. Mainstream training, fine-tuning, computer vision, recommendation systems, and any production workload where the lowest cost per GPU hour matters more than raw peak performance.
When you would reach for H100. When you are running steady-state production AI workloads on models below 70 billion parameters. When you want to take advantage of the deep secondary market and discounted on-demand cloud capacity. When models you train or fine-tune already fit comfortably inside 80 GB. When budget discipline matters more than the latest specs.
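A quick way to sanity-check the 80 GB ceiling is the standard weights-only sizing rule of thumb: parameters times bytes per parameter. The sketch below is an approximation only; KV cache, activations, and framework overhead add on top.

```python
# Weights-only memory footprint by precision. Treat the result as a floor:
# KV cache, activations, and runtime overhead are not included.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB for a model of the given size."""
    return params_billion * BYTES_PER_PARAM[precision]

print(weight_memory_gb(70, "fp8"))   # ~70 GB -> tight but workable on an 80 GB H100
print(weight_memory_gb(70, "fp16"))  # ~140 GB -> needs an H200 or multi-GPU sharding
```

This is why roughly 70 billion parameters is the comfortable ceiling for a single 80 GB H100 once serving overhead is counted.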
What it is. The 2024 mid-cycle refresh of H100. Same Hopper compute engine, dramatically expanded memory.
What is inside. 141 GB of HBM3e memory at 4.8 TB/s of bandwidth. Same roughly 2 PFLOPS FP8 compute as H100. 700 W TDP[2].
Best for. Inference at scale, particularly serving LLMs in production. The extra memory lets one GPU host larger models and serve more concurrent users.
When you would reach for H200. When you serve LLMs to end users and dollar-per-token economics drive your business. When you have outgrown H100 for inference because models no longer fit on a single 80 GB GPU. When you want the best return on capital available in the secondary market today. When you run long-context inference and need the memory headroom but not yet the compute jump that Blackwell delivers.
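Why does long-context serving benefit so much from the extra memory? The KV cache, the per-conversation state the GPU keeps for attention, grows linearly with context length. The sketch below uses an illustrative model shape (80 layers, 8 KV heads, head dimension 128, FP16 cache), roughly in the spirit of a 70B-class open-weight model; the exact numbers are assumptions, not product specs.

```python
# KV cache footprint grows linearly with context length.
# The model shape below is an illustrative assumption, not a specific product.
def kv_cache_gb(context_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return context_tokens * per_token_bytes / 1e9

print(kv_cache_gb(8_000))    # ~2.6 GB: fits easily alongside the weights
print(kv_cache_gb(128_000))  # ~42 GB: a large slice of an 80 GB H100,
                             #         far more comfortable on a 141 GB H200
```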
What it is. NVIDIA's 2025 flagship and the workhorse of the current generation. A two-die package fused into a single GPU.
What is inside. 192 GB of HBM3e memory at 8 TB/s of bandwidth. Native FP4 support. Roughly 2.5x the training throughput and over 10x the inference throughput of H100 on LLM workloads. 1,000 W TDP[3].
Best for. Frontier training at scale, plus modern inference workloads where you want to consolidate larger H100 fleets onto fewer, denser nodes.
When you would reach for B200. When you are running serious training programs on models above 100 billion parameters. When power and rack-space efficiency matter, since one B200 replaces several H100s. When you are deploying new AI infrastructure today and want a useful runway before the next refresh. When you are launching multimodal or long-context inference workloads where H200 latency starts to bottleneck.
What it is. The mid-cycle refresh of Blackwell, shipping from late 2025 into 2026. Designed by NVIDIA explicitly for reasoning and agentic AI.
What is inside. 288 GB of HBM3e memory at 8 TB/s of bandwidth. Roughly 15 PFLOPS dense FP4 compute, a 1.5x lift over B200. 1,400 W TDP[4].
Best for. Reasoning model inference, agentic AI deployments, and the leading edge of foundation-model training. The chip of record for models that generate thousands of tokens before producing an answer.
When you would reach for B300. When you launch reasoning or agentic AI products with long, multi-step outputs. When inference contexts exceed 128,000 tokens. When real-time agent latency defines product quality. When you build sovereign or national AI infrastructure today.
What it is. A "superchip" pairing one Grace ARM CPU with two B200 GPUs over a 900 GB/s NVLink-C2C interconnect. Sold primarily as part of a rack-scale system.
What is inside. Each GB200 superchip carries 2 x 192 GB of HBM3e memory and 2 x 8 TB/s of bandwidth. The standard configuration is the GB200 NVL72 rack, which packs 72 B200 GPUs and 36 Grace CPUs into one liquid-cooled rack that behaves, from the software's point of view, like one extremely large accelerator[6].
Best for. Trillion-parameter model training, hyperscaler deployments, and customers operating at the foundation-model frontier where the workload is large enough to keep an entire rack saturated.
When you would reach for GB200. When you train foundation models from scratch and training cycles consume thousands of continuous GPU-hours. When workloads saturate an entire rack for weeks or months at a time. When the scale-up domain matters more than per-GPU specs - jobs that would otherwise span multiple HGX nodes over slower fabric. When you are placing a rack-scale order today and reasoning workloads do not yet justify the GB300 premium.
What it is. The same GB200 superchip design, upgraded with B300 GPUs. The current rack-scale standard for new builds.
What is inside. Each GB300 superchip carries 2 x 288 GB of HBM3e. The GB300 NVL72 rack delivers approximately 1.1 ExaFLOPS of FP4 compute, roughly 1.5x a GB200 NVL72[4].
Best for. Reasoning-model training and inference at rack-scale, where GB200's memory and throughput leave tokens-per-dollar on the table.
When you would reach for GB300. When you train frontier reasoning or agentic models from scratch and the longer outputs and larger KV caches break GB200 economics. When inference fleets serve reasoning models at rack-scale and the 50% memory uplift over GB200 translates directly into tokens-per-dollar. When training and inference share one fleet at sustained high utilization. When you are placing a new rack-scale order today and there is no reason to buy the prior generation.
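The rack-scale headline figures are essentially the per-GPU specs multiplied out. Here is a minimal sketch using the rounded B300 numbers from this guide.

```python
# Rolling the rounded B300 specs up to one NVL72 rack.
GPUS_PER_NVL72 = 72
B300_HBM_GB = 288       # HBM3e per B300
B300_FP4_PFLOPS = 15    # ~15 PFLOPS dense FP4 per B300

rack_hbm_tb = GPUS_PER_NVL72 * B300_HBM_GB / 1000            # ~20.7 TB of HBM
rack_fp4_exaflops = GPUS_PER_NVL72 * B300_FP4_PFLOPS / 1000  # ~1.08 ExaFLOPS

print(rack_hbm_tb, rack_fp4_exaflops)
# -> roughly 20.7 TB of pooled HBM and ~1.1 ExaFLOPS of dense FP4 per rack,
#    matching the GB300 NVL72 figure above
```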
What it is. NVIDIA's next platform, named after astronomer Vera Rubin. The Rubin GPU pairs with a new ARM-based CPU called Vera.
What is inside. 288 GB of HBM4 at roughly 13 TB/s of bandwidth and roughly 50 PFLOPS of dense FP4, several times Blackwell's per-GPU throughput (see the comparison table below). Rubin GPUs are slated for Q3 2026; Rubin Ultra follows in H2 2027 with ~384 GB of HBM4E and ~100 PFLOPS of dense FP4.
Best for. Workloads where HBM4 bandwidth and a larger NVLink scale-up domain unlock what Blackwell Ultra cannot - frontier training and next-generation reasoning inference.
When you would reach for Vera Rubin. When HBM4 bandwidth, not FP4 compute, is the bottleneck on your serving stack. When the next generation of reasoning models pushes context and output lengths past what 288 GB per GPU can hold efficiently. When you train at a scale where Rubin Ultra's NVL576 domain replaces what would otherwise be multi-rack GB300 deployments stitched together over slower fabric. When performance per watt is the binding constraint on your buildout, not raw FLOPS.
Figures are rounded to communicate orders of magnitude. Specifications for unreleased silicon (Rubin and Rubin Ultra) reflect the public roadmap as of Q1 2026 and remain subject to revision.
| Chip | Architecture / Year | Memory | Bandwidth | FP4 (dense) | TDP | Best For |
|---|---|---|---|---|---|---|
| H100 | Hopper, 2022 | 80 GB HBM3 | 3.35 TB/s | n/a (FP8: ~2 PF) | 700 W | Proven training and inference, lowest-cost path |
| H200 | Hopper, 2024 | 141 GB HBM3e | 4.8 TB/s | n/a (FP8: ~2 PF) | 700 W | LLM inference at scale |
| B200 | Blackwell, 2025 | 192 GB HBM3e | 8 TB/s | ~10 PF | 1,000 W | Frontier training, modern inference |
| B300 | Blackwell Ultra, 2025 to 2026 | 288 GB HBM3e | 8 TB/s | ~15 PF | 1,400 W | Long-context, reasoning, agentic AI |
| GB200 | Grace and Blackwell, 2025 | 2 x 192 GB | 2 x 8 TB/s | ~20 PF / superchip | ~2,700 W | Rack-scale training (NVL72) |
| GB300 | Grace and Blackwell Ultra, 2026 | 2 x 288 GB | 2 x 8 TB/s | ~30 PF / superchip | ~2,800 to 3,300 W | Trillion-parameter training, sovereign clusters |
| Rubin | Rubin, 2026 | 288 GB HBM4 | ~13 TB/s | ~50 PF | ~1,800 W (est.) | Forward planning, reservation-stage |
| Rubin Ultra | Rubin, 2027 | 384 GB HBM4E | ~32 TB/s | ~100 PF | ~3,600 W (est.) | Next-gen agentic AI, frontier training |
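To connect the TDP column to facility planning, here is a rough roll-up of the GB300 superchip range to a full NVL72 rack. It counts the compute trays only; networking, storage, and cooling overhead come on top, which is why these racks are liquid-cooled and why the data center's power and cooling envelope matters as much as the chip.

```python
# Rack power envelope for a GB300 NVL72, from the TDP range in the table above.
# Compute trays only; networking, storage, and cooling overhead add on top.
SUPERCHIPS_PER_NVL72 = 36
GB300_SUPERCHIP_TDP_W = (2_800, 3_300)  # range quoted in the comparison table

low_kw, high_kw = (w * SUPERCHIPS_PER_NVL72 / 1000 for w in GB300_SUPERCHIP_TDP_W)
print(f"~{low_kw:.0f} to ~{high_kw:.0f} kW per rack for the superchips alone")
# -> roughly 100 to 120 kW per rack
```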
Each buyer profile below uses the same structure. The default recommendation, the upgrade trigger, and the conditions under which you would reach for top-tier rack-scale or next-generation silicon.
Default: B200 for training, H200 for production inference.
Upgrade to B300 when you launch reasoning or agentic models, or when serving workloads exceed 128,000 token contexts.
Move to GB200 or GB300 NVL72 when you train foundation models from scratch and your training cycles consume thousands of continuous GPU-hours.
Reserve Rubin or Rubin Ultra when HBM4 bandwidth and the larger NVLink scale-up domain are what your next training run or serving stack actually needs, not when the calendar turns.
Default: H100 or H200 in carrier-neutral colocation. Sufficient for RAG, internal copilots, computer vision, fraud detection, demand forecasting, and recommendation systems.
Upgrade to B200 when memory pressure forces multi-GPU sharding for what should be single-GPU inference, or when consolidation saves meaningful power and rack space.
Move to B300 when reasoning latency starts to impact SLA commitments, or when you deploy agentic AI applications that generate long outputs.
Buy GB200 or GB300 racks when you train your own foundation models in-house and training cycles run continuously at rack-scale - the threshold most enterprises never cross.
Default: Rent on-demand H100 or H200 capacity from a GPUaaS provider. Yesterday's frontier hardware at sharp discounts, with no infrastructure overhead.
Upgrade to B200 when your product genuinely requires it. Examples include fine-tuning open-weight models above 70 billion parameters, real-time multimodal applications, or workloads where H200 latency is a documented bottleneck.
Reach for B300 when you launch a reasoning or agentic feature where end-user latency drives retention, and H200 or B200 cannot keep up.
Use GB200, GB300, or Rubin when a single training run genuinely needs an NVLink-connected rack as one accelerator.
Default today: GB300 NVL72 racks. The standard for national AI infrastructure, whether built by the U.S. government, allied nations, or sovereign programs in emerging markets.
Add B300 standalone systems when the use case is targeted - a national language model, a single agency or ministry's reasoning workload, a classified inference pilot - and you do not yet need rack-scale density.
Step up to GB200, GB300, Rubin, or Rubin Ultra NVL racks when sovereign workloads - national reasoning models, multilingual foundation models, classified inference, defense and intelligence applications - require rack-scale density. Match the generation to the workload: GB200 for general-purpose training, GB300 for reasoning at scale, Rubin and Rubin Ultra for the next horizon.
Expect to operate a mixed fleet: B300 standalone for targeted workloads, GB200 or GB300 NVL72 as the rack-scale default depending on whether reasoning is the primary workload, and Rubin or Rubin Ultra reserved for the next horizon of bandwidth and scale-up domain size.
Most organizations do not have a chip problem. They have a capacity problem. The right answer is rarely a single chip; a practical setup is a portfolio of the chips above, matched to the workloads they serve.
Once you know which chip fits your workload, the next problem is finding the actual capacity to rent. The right data center, the right contract length, the right cooling envelope, and the right supplier on the right terms. That is what XIRR Advisors does. We broker reserved GPU and colocation capacity for customers who would rather build their products than spend months negotiating with providers.
XIRR Advisors is a GPU and colocation capacity broker. We work with a network of established providers across the United States, Europe, and emerging markets to source reserved GPU and colocation capacity at competitive terms. If you are sizing a new deployment, comparing reserved-capacity offers, or planning a sovereign or enterprise AI buildout, we will save you weeks.
contact@xirradvisors.com
Stay ahead of the GPU and data center capacity markets. Every Monday morning, XIRR Advisors publishes a curated brief on supply, pricing, operator news, and the deals shaping AI infrastructure.
Long: $NVDA, $IREN, $CRWV, $NBIS, $CIFR, $NUAI, $APLD, $HUT, $MSFT, $AMZN. Not investment advice.