Performance visualization

Paul Chan is a software engineer, researcher, and writer.

Hi, I'm Paul. I like high-performance computing and machine learning, especially the ways machine learning and hardware architectures influence each other.

Outperforming cuBLAS on Blackwell

NVIDIA's cuBLAS library has long been considered the gold standard for GPU matrix operations, representing decades of optimization work. With the release of the Blackwell architecture, new opportunities have emerged to push beyond cuBLAS performance through careful exploitation of architectural features and novel algorithmic approaches.

The Demise of CUDA has been Greatly Exaggerated

Twitter threads, articles, and podcasts frequently declare the end of CUDA and NVIDIA’s dominance. The arguments typically hinge on three main claims: that the rise of ASICs will render GPUs obsolete, that a new software ecosystem will erode the CUDA moat, and that LLM-based agents will make knowledge of CUDA and low-level implementations irrelevant. Yet closer examination reveals that these predictions fail to capture the nuance and ongoing innovation within NVIDIA’s ecosystem.