In the macOS 26.2 (Tahoe) beta, Apple introduced a low-latency RDMA driver for Thunderbolt 5, enabling up to 80 Gb/s bidirectional bandwidth for Mac clustering, which makes it a good fit for distributed ML on Apple Silicon. The driver is optimized for low latency and delivers ~14 Gb/s throughput at a 4K MTU.
My tests (M4 Pro to M3 Ultra): the stock ibv_uc_pingpong achieved ~14 µs round-trip for 4K packets (this requires setting the GID index). A custom C++ variant hit 6-13 µs/iter: https://x.com/anemll/status/1993192776897642942
Code and details:
https://github.com/Anemll/mlx-rdma/blob/anemll-rdma/ibv_roun... (includes steps to enable RDMA from the macOS Recovery OS terminal)
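As a rough sanity check on the ping-pong numbers above, per-iteration latency can be converted into the bandwidth that a single in-flight message implies. This is only a sketch, not code from the linked repo: the function name is made up for illustration, and it assumes each iteration moves one message in each direction, which as far as I can tell is how the stock pingpong examples count bytes.

```python
def pingpong_throughput_gbps(msg_bytes, usec_per_iter, both_directions=True):
    """Bandwidth implied by a ping-pong run with one message in flight.

    Hypothetical helper for illustration only.
    msg_bytes:       payload size per message (4096 for the 4K tests)
    usec_per_iter:   measured round-trip time per iteration, in microseconds
    both_directions: count the ping and the pong (assumed to match the
                     byte accounting of the stock pingpong examples)
    """
    if usec_per_iter <= 0:
        raise ValueError("usec_per_iter must be positive")
    directions = 2 if both_directions else 1
    bits_per_iter = msg_bytes * 8 * directions
    seconds_per_iter = usec_per_iter * 1e-6
    return bits_per_iter / seconds_per_iter / 1e9


if __name__ == "__main__":
    # Latencies reported in the post: ~14 us stock, 6-13 us custom variant.
    for usec in (14, 13, 6):
        gbps = pingpong_throughput_gbps(4096, usec)
        print(f"{usec} us/iter -> {gbps:.2f} Gb/s with one 4K message in flight")
```

With only one message in flight, the implied figure stays well under the link rate, so a ping-pong run is best read as a latency measurement rather than a peak-throughput one.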
In theory, this should speed up pipeline parallelism (faster layer handoffs between machines) and tensor parallelism (lower-overhead sharding) on the GPUs, and could potentially extend to the ANE for real-time AI workflows.
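To get a feel for what "faster layer handoffs" could mean, here is a back-of-the-envelope estimate using a simple latency-plus-bandwidth model (time = latency + bytes / bandwidth). Everything in it is an assumption for illustration: the function name, the hidden size of 8192, fp16 activations, and the one-way latency taken as half of the ~14 µs round trip. Only the ~14 Gb/s throughput figure comes from the numbers above.

```python
def handoff_seconds(payload_bytes, one_way_latency_s, bandwidth_bits_per_s):
    """Estimated time to ship one activation tensor to the next pipeline stage.

    Hypothetical helper: a plain latency + size/bandwidth model that ignores
    protocol overhead, contention, and any overlap with compute.
    """
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return one_way_latency_s + (payload_bytes * 8) / bandwidth_bits_per_s


if __name__ == "__main__":
    hidden_size = 8192           # assumed model width, not from the post
    bytes_per_value = 2          # assumed fp16 activations
    payload = hidden_size * bytes_per_value   # one token's activations

    one_way_latency = 7e-6       # assumption: half of the ~14 us round trip
    bandwidth = 14e9             # ~14 Gb/s throughput reported above

    t = handoff_seconds(payload, one_way_latency, bandwidth)
    print(f"{payload} bytes per token -> ~{t * 1e6:.1f} us per handoff")
```

Under these assumptions the fixed latency and the transfer time are of similar size for a single token, which is why lower per-message latency matters for token-by-token decoding across machines.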