Why AMD's MI355X is outpacing NVIDIA's B200 in real world multi GPU workloads

ighoshsubho · October 25, 2025, 11:39am

Just came across some fascinating FluidX3D CFD benchmark results that highlight an unexpected advantage AMD has carved out in the datacenter GPU space. In single-GPU tests, the AMD Instinct MI355X and NVIDIA B200 perform nearly identically when bandwidth-bound, which makes sense given both are powerhouse accelerators.

But here’s where it gets interesting: scale up to 8 GPUs and AMD suddenly pulls ahead by 65%, hitting 362k MLUPs/s versus NVIDIA’s 219k MLUPs/s. The culprit might be a combination of PCIe bandwidth limitations and software exposure issues.
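For anyone who wants to sanity-check the 65% figure, here is a minimal sketch of the arithmetic, using only the two throughput numbers quoted above. The variable and function names are mine, not from FluidX3D.

```python
# Multi-GPU FluidX3D throughput quoted in the post, in MLUPs/s.
mi355x_8gpu = 362_000
b200_8gpu = 219_000


def relative_gain(a: float, b: float) -> float:
    """Return how much larger a is than b, as a fraction of b."""
    return a / b - 1.0


gain = relative_gain(mi355x_8gpu, b200_8gpu)
print(f"MI355X lead over B200 at 8 GPUs: {gain:.1%}")
```

The result rounds to the 65% quoted in the post.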

Here is what I observe from these results:

- The MI355X achieves 55 GB/s PCIe throughput, while the B200 struggles at just 14 GB/s in these tests
- AMD provides 288 GB VRAM per GPU vs NVIDIA’s 180 GB
- NVIDIA doesn’t expose NVLink peer-to-peer copy functionality to OpenCL, forcing workloads over slower PCIe paths
- AMD’s InfinityFabric OpenCL extensions exist but reportedly have stability issues
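To put those numbers side by side, here is a rough back-of-the-envelope sketch. The bandwidth and VRAM figures are the ones listed above; the 1 GB buffer size is a made-up value purely for illustration, and the estimate ignores latency and any overlap of transfers with compute, so treat it as an upper-level intuition rather than a model of what FluidX3D actually does.

```python
# Figures quoted in the post.
pcie_gbps = {"MI355X": 55.0, "B200": 14.0}  # measured PCIe throughput, GB/s
vram_gb = {"MI355X": 288, "B200": 180}      # VRAM per GPU, GB

# ASSUMPTION: hypothetical amount of boundary data moved per exchange, in GB.
# This value is not from the benchmark; it only makes the ratio concrete.
buffer_gb = 1.0


def transfer_ms(size_gb: float, bandwidth_gbps: float) -> float:
    """Idealized transfer time in milliseconds (no latency, no overlap)."""
    return size_gb / bandwidth_gbps * 1000.0


pcie_ratio = pcie_gbps["MI355X"] / pcie_gbps["B200"]
vram_ratio = vram_gb["MI355X"] / vram_gb["B200"]

print(f"PCIe throughput ratio (MI355X / B200): {pcie_ratio:.2f}x")
print(f"VRAM ratio (MI355X / B200): {vram_ratio:.2f}x")
for name, bw in pcie_gbps.items():
    print(f"{name}: {transfer_ms(buffer_gb, bw):.1f} ms per {buffer_gb:g} GB transfer")
```

If inter-GPU traffic really does go over PCIe in these runs, a roughly 4x gap in transfer time per exchange would be consistent with the B200 falling behind as GPU count grows, though that is an inference from the numbers and not something the benchmark proves on its own.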

Here are the single-GPU and multi-GPU results.

So, are we seeing AMD gain ground because they’re actually treating OpenCL as a first-class citizen, or is NVIDIA’s CUDA-centric strategy leaving performance on the table for heterogeneous compute environments? And given that many industrial CFD applications still rely on OpenCL rather than proprietary APIs, could this shift market dynamics in the HPC space?

Curious to hear thoughts on whether this represents a genuine strategic opening for AMD in compute!

AMD Instinct MI355X


|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | AMD Instinct MI355X                                        |
| Device Vendor  | Advanced Micro Devices, Inc.                               |
| Device Driver  | 3662.0 (HSA1.1,LC) (Linux)                                 |
| OpenCL Version | OpenCL C 2.0                                               |
| Compute Units  | 256 at 2400 MHz (16384 cores, 78.643 TFLOPs/s)             |
| Memory, Cache  | 294896 MB VRAM, 32 KB global / 160 KB local                |
| Buffer Limits  | 294896 MB global, 301973504 KB constant                    |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                        62.858 TFLOPs/s (2/3 ) |
| FP32  compute                                       138.172 TFLOPs/s ( 2x ) |
| FP16  compute                                       143.453 TFLOPs/s ( 2x ) |
| INT64 compute                                         7.078  TIOPs/s (1/12) |
| INT32 compute                                        38.309  TIOPs/s (1/2 ) |
| INT16 compute                                        89.761  TIOPs/s ( 1x ) |
| INT8  compute                                       129.780  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                       4903.01 GB/s |
| Memory Bandwidth ( coalesced      write)                       5438.98 GB/s |
| Memory Bandwidth (misaligned read      )                       5473.35 GB/s |
| Memory Bandwidth (misaligned      write)                       3449.07 GB/s |
| PCIe   Bandwidth (send                 )                         55.16 GB/s |
| PCIe   Bandwidth (   receive           )                         54.76 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   55.00 GB/s |
|-----------------------------------------------------------------------------|

NVIDIA B200

|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | NVIDIA B200                                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 570.133.20 (Linux)                                         |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 148 at 1965 MHz (18944 cores, 74.450 TFLOPs/s)             |
| Memory, Cache  | 182642 MB VRAM, 4736 KB global / 48 KB local               |
| Buffer Limits  | 45660 MB global, 64 KB constant                            |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                        34.292 TFLOPs/s (1/2 ) |
| FP32  compute                                        69.464 TFLOPs/s ( 1x ) |
| FP16  compute                                        72.909 TFLOPs/s ( 1x ) |
| INT64 compute                                         3.704  TIOPs/s (1/24) |
| INT32 compute                                        36.508  TIOPs/s (1/2 ) |
| INT16 compute                                        33.597  TIOPs/s (1/2 ) |
| INT8  compute                                       117.962  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                       6668.71 GB/s |
| Memory Bandwidth ( coalesced      write)                       6502.72 GB/s |
| Memory Bandwidth (misaligned read      )                       2280.05 GB/s |
| Memory Bandwidth (misaligned      write)                        937.78 GB/s |
| PCIe   Bandwidth (send                 )                         14.08 GB/s |
| PCIe   Bandwidth (   receive           )                         13.82 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   11.39 GB/s |
|-----------------------------------------------------------------------------|
