Paul Chan

is a writer, engineer, and researcher.

Hi, I'm Paul. I like high-performance computing and machine learning, mostly the interplay between how machine learning and hardware architectures influence each other.

Outperforming cuBLAS on B200

Leveraging Blackwell Features for SOTA General Matrix Multiplication Performance

NVIDIA's True Moat in the Compute Wars: TSMC

How NVIDIA Leverages TSMC allocation to Choke Out Rivals

The New Frontier of GPU Performance: From Memory Bound to Communication Bound

For decades, GPU performance optimization has been dominated by the memory wall problem. As we scale to multi-GPU and multi-node systems, a fundamental shift is occurring: the bottleneck is moving from memory bandwidth to inter-GPU communication.

The Demise of CUDA has been Greatly Exaggerated

Notes on the ongoing innovation within NVIDIA’s ecosystem.