Hacker News
Next-Gen Broadcom PCIe Switches to Support AMD Infinity Fabric XGMI (servethehome.com)
12 points by jauntywundrkind on Dec 11, 2023 | hide | past | favorite | 3 comments


Super interesting possibilities here.

I'm curious how much of a role Infinity Fabric has to play though, as it seems like CXL has a number of the same upsides. For example:

> We also heard another nugget at the show. Imagine not just the CPU and GPU communication happening over XGMI/ Infinity Fabric. Instead, imagine a world with XGMI-connected NICs where the NIC was on the same coherent fabric. If AMD intends to extend its AI training clusters via Ultra Ethernet, then instead of having to go CPU/ GPU to PCIe with RDMA transfers over the PCIe Ethernet NIC, imagine having the XGMI NIC sitting on the Broadcom XGMI switch with the CPU and multiple GPUs. That is a fascinating concept.

CXL 3.1 was released two weeks ago, and it adds support for p2p DMA, where one device transfers directly into another without the host in the data path. That would enable the same host-less transfer from NIC to GPU hypothesized here. https://news.ycombinator.com/item?id=38497690 https://www.servethehome.com/cxl-3-1-specification-aims-for-...

CXL 3.1 also specifies host-to-host connections. I don't know if it has anywhere near the coherency semantics Infinity Fabric has, but this is another example of the open standard having lots of overlapping capabilities with Infinity Fabric, and it shows a general convergence in what these fabrics can do.


I've been waiting to see a response from AMD to Nvidia's NVLink switching, something like this, as NVLink is what makes Nvidia viable across large clusters.

I've still not seen what a physical NVLink NIC or switch actually consists of in production, beyond flashy marketing touting a 400/800G PHY. I assume it's just a proprietary, optimized InfiniBand, but it would be nice to get more info from industry insiders building these "supercomputer" clusters, as it'll be a while before this gear starts hitting eBay off-lease, I think.


The older generations had substantial docs; IIRC they were basically MPLS at their L2, with a bunch of remote atomic verbs that even worked through to POWER9 hosts. That's great for throughput on fetch_and_add-heavy, (almost) wait-free data structures, such as simply bump-allocating yourself a remote destination buffer to deliver your message into. In an LLM context, this could enable single-token-granular output delivery from an MoE router to whichever GPU hosts each expert: the router spits out a ragged tensor, and those parts can be delivered efficiently.



