Why AMD's MI355X is outpacing NVIDIA's B200 in real-world multi-GPU workloads

Just came across some fascinating FluidX3D CFD benchmark results that highlight an unexpected advantage AMD has carved out in the datacenter GPU space. In single-GPU tests, the AMD Instinct MI355X and NVIDIA B200 perform nearly identically when bandwidth-bound, which makes sense given both are powerhouse accelerators.

But here’s where it gets interesting: scale up to 8 GPUs and AMD suddenly pulls ahead by 65%, hitting 362k MLUPs/s versus NVIDIA’s 219k MLUPs/s. The culprit might be a combination of PCIe bandwidth limitations and how each vendor exposes its GPU-to-GPU interconnect to OpenCL.
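
For anyone who wants to sanity-check the 65% figure, it follows directly from the two 8-GPU throughput numbers quoted above. A quick sketch in Python (the variable names are mine, not from FluidX3D):

```python
# 8-GPU FluidX3D throughput as quoted in the post, in MLUPs/s
amd_mlups = 362_000
nvidia_mlups = 219_000

# Relative advantage of the MI355X system over the B200 system
speedup = amd_mlups / nvidia_mlups
advantage_pct = (speedup - 1) * 100

print(f"speedup: {speedup:.2f}x, advantage: {advantage_pct:.0f}%")
# prints "speedup: 1.65x, advantage: 65%"
```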

Here’s what I take from the results:

  • MI355X achieves 55 GB/s PCIe throughput while B200 struggles at just 14 GB/s in these tests
  • AMD provides 288 GB VRAM per GPU vs NVIDIA’s 180 GB
  • NVIDIA doesn’t expose NVLink peer-to-peer copy functionality to OpenCL, forcing workloads over slower PCIe paths
  • AMD’s Infinity Fabric OpenCL extensions exist but reportedly have stability issues
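
To get a rough feel for what the PCIe gap in the first bullet means in practice, here is a back-of-the-envelope sketch using the send bandwidths from the benchmark output further down. The 1 GB transfer size is a made-up illustrative number, not something taken from the FluidX3D runs, and the estimate ignores latency and any overlap of communication with compute:

```python
# Measured PCIe send bandwidth from the OpenCL benchmark output, in GB/s
pcie_gbps = {"MI355X": 55.16, "B200": 14.08}

# Hypothetical amount of data moved per GPU (assumption, for illustration only)
transfer_gb = 1.0


def transfer_ms(size_gb, bandwidth_gbps):
    """Idealized transfer time in milliseconds, ignoring latency and overlap."""
    return size_gb / bandwidth_gbps * 1000


times = {name: transfer_ms(transfer_gb, bw) for name, bw in pcie_gbps.items()}
ratio = pcie_gbps["MI355X"] / pcie_gbps["B200"]

for name, ms in times.items():
    print(f"{name}: {ms:.1f} ms per {transfer_gb:.0f} GB")
print(f"PCIe bandwidth ratio: {ratio:.1f}x")
```

Under these assumptions the B200 spends roughly four times as long on each host-routed transfer, which is the kind of gap that would only show up once multiple GPUs have to exchange data every step.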

Here are the single-GPU and multi-GPU results.

So, are we seeing AMD gain ground because they’re actually treating OpenCL as a first-class citizen, or is NVIDIA’s CUDA-centric strategy leaving performance on the table for heterogeneous compute environments? And given that many industrial CFD applications still rely on OpenCL rather than proprietary APIs, could this shift market dynamics in the HPC space?

Curious to hear thoughts on whether this represents a genuine strategic opening for AMD in compute!

AMD Instinct MI355X


|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | AMD Instinct MI355X                                        |
| Device Vendor  | Advanced Micro Devices, Inc.                               |
| Device Driver  | 3662.0 (HSA1.1,LC) (Linux)                                 |
| OpenCL Version | OpenCL C 2.0                                               |
| Compute Units  | 256 at 2400 MHz (16384 cores, 78.643 TFLOPs/s)             |
| Memory, Cache  | 294896 MB VRAM, 32 KB global / 160 KB local                |
| Buffer Limits  | 294896 MB global, 301973504 KB constant                    |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                        62.858 TFLOPs/s (2/3 ) |
| FP32  compute                                       138.172 TFLOPs/s ( 2x ) |
| FP16  compute                                       143.453 TFLOPs/s ( 2x ) |
| INT64 compute                                         7.078  TIOPs/s (1/12) |
| INT32 compute                                        38.309  TIOPs/s (1/2 ) |
| INT16 compute                                        89.761  TIOPs/s ( 1x ) |
| INT8  compute                                       129.780  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                       4903.01 GB/s |
| Memory Bandwidth ( coalesced      write)                       5438.98 GB/s |
| Memory Bandwidth (misaligned read      )                       5473.35 GB/s |
| Memory Bandwidth (misaligned      write)                       3449.07 GB/s |
| PCIe   Bandwidth (send                 )                         55.16 GB/s |
| PCIe   Bandwidth (   receive           )                         54.76 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   55.00 GB/s |
|-----------------------------------------------------------------------------|

NVIDIA B200

|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | NVIDIA B200                                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 570.133.20 (Linux)                                         |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 148 at 1965 MHz (18944 cores, 74.450 TFLOPs/s)             |
| Memory, Cache  | 182642 MB VRAM, 4736 KB global / 48 KB local               |
| Buffer Limits  | 45660 MB global, 64 KB constant                            |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                        34.292 TFLOPs/s (1/2 ) |
| FP32  compute                                        69.464 TFLOPs/s ( 1x ) |
| FP16  compute                                        72.909 TFLOPs/s ( 1x ) |
| INT64 compute                                         3.704  TIOPs/s (1/24) |
| INT32 compute                                        36.508  TIOPs/s (1/2 ) |
| INT16 compute                                        33.597  TIOPs/s (1/2 ) |
| INT8  compute                                       117.962  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                       6668.71 GB/s |
| Memory Bandwidth ( coalesced      write)                       6502.72 GB/s |
| Memory Bandwidth (misaligned read      )                       2280.05 GB/s |
| Memory Bandwidth (misaligned      write)                        937.78 GB/s |
| PCIe   Bandwidth (send                 )                         14.08 GB/s |
| PCIe   Bandwidth (   receive           )                         13.82 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   11.39 GB/s |
|-----------------------------------------------------------------------------|