Learn CUDA programming for NVIDIA Hopper GPUs.
We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to build efficient WGMMA pipelines and use CUTLASS optimizations to perform the massive matrix multiplications that power modern AI.
In addition to single-GPU performance, the curriculum covers multi-GPU scaling and the NCCL primitives needed to train trillion-parameter models. To get the most out of these lessons, you should have a basic grasp of C++ syntax and of linear algebra, especially how matrices are tiled and multiplied.
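If the tiling prerequisite is rusty, here is a minimal refresher sketch of the classic shared-memory tiled matrix multiply the course builds on (the kernel name and the 32×32 tile size are our own illustrative choices, not taken from the course):

```cpp
#include <cuda_runtime.h>

constexpr int TILE = 32;  // illustrative tile edge; launch with dim3(TILE, TILE) threads

// C = A * B for row-major N x N matrices; each block computes one TILE x TILE tile of C.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // Stage one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] =
            (row < N && k0 + threadIdx.x < N) ? A[row * N + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < N && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Multiply the two tiles, then slide along the K dimension.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```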
Introduction to the course
Table of Contents and Course Overview
Lesson 1 — H100 Hopper GPU Architecture
H100 Specifications: HBM3, Bandwidth and Power
Overview of tensor cores
Tensor Memory Accelerator (TMA)
Transformer Engine
L2 cache architecture
GPCs, TPCs and SM layouts
Thread block clusters
Distributed shared memory
SM Subpartitions (SMSPs)
Warp Schedulers and Dispatch Units
Shared memory and data movement
Occupancy
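As a taste of the occupancy topic that closes Lesson 1, here is a small host-side sketch (the kernel and the 256-thread block size are illustrative choices of ours) that queries the device and asks the runtime how many blocks can be resident per SM:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() {}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, shared memory per SM: %zu bytes\n",
           prop.multiProcessorCount, prop.sharedMemPerMultiprocessor);

    int blocks_per_sm = 0;
    // 256 threads per block, no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, dummy_kernel, 256, 0);
    printf("Resident blocks per SM at 256 threads: %d\n", blocks_per_sm);
    return 0;
}
```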
Lesson 2 — Clusters, Data Types, Inline PTX and Pointers
Thread block cluster programming
Setting cluster dimensions
Inline PTX assembly
State spaces
Data types in PTX
Generic pointers
Address space conversion
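To preview Lesson 2's inline PTX and address-space topics, here is a hedged sketch (both helper names are ours) of a PTX special-register read and a generic-to-shared pointer conversion, the 32-bit form that shared-memory PTX instructions expect:

```cpp
#include <cstdint>

// Inline PTX: read the special register identifying the SM this thread runs on.
__device__ uint32_t my_sm_id() {
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Convert a generic pointer into the 32-bit shared-memory address that
// instructions such as mbarrier and cp.async take as an operand.
__device__ uint32_t smem_addr_of(const void* generic_ptr) {
    return static_cast<uint32_t>(__cvta_generic_to_shared(generic_ptr));
}
```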
Lesson 3 — Asynchrony and Barriers
Introduction to Async Operations
Proxies
Fences and memory ordering
Fence layout and visibility
Scope of the fence
Acquire and release fences
Expected arrival count and thread arrivals
mbarrier Arrival Operations
mbarrier PTX Instructions
mbarrier Wait Operations
Phase and parity
Commit operations
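The mbarrier machinery in Lesson 3 is what libcu++ wraps as cuda::barrier, so as a gentle illustration of the arrive/wait split, here is a minimal sketch (kernel name ours) using the standard wrapper rather than raw PTX:

```cpp
#include <cuda/barrier>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void barrier_demo() {
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cg::this_thread_block();

    if (block.thread_rank() == 0)
        init(&bar, block.size());  // expected arrival count: every thread in the block
    block.sync();

    // ... phase 1 work ...
    auto token = bar.arrive();     // signal arrival without blocking
    // ... independent work can overlap here ...
    bar.wait(std::move(token));    // block until all arrivals for this phase land
}
```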
Lesson 4 — CuTensorMap Descriptors
Tensor shape, stride and data type
Element stride and size
Box Dimensions (Tile Size)
Bank conflicts
Swizzling
Swizzle Formula Deep Dive
Interleave layout
Out of bounds (OOB) fill
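To make the descriptor fields concrete, here is a hedged host-side sketch (matrix layout, tile shape, and swizzle mode are illustrative choices of ours) that encodes a CUtensorMap for a row-major fp32 matrix; with 128-byte swizzling the inner box dimension is kept at 32 floats, i.e. exactly 128 bytes:

```cpp
#include <cuda.h>     // driver API: CUtensorMap, cuTensorMapEncodeTiled
#include <cstdint>

// Assumes a live CUDA context and N divisible by 4 so the row stride
// is a multiple of 16 bytes, as the encoder requires.
CUtensorMap make_tensor_map(void* gmem, uint64_t M, uint64_t N) {
    CUtensorMap tmap;
    uint64_t global_dim[2]     = {N, M};               // fastest-moving dimension first
    uint64_t global_strides[1] = {N * sizeof(float)};  // byte stride between rows (rank-1 entries)
    uint32_t box_dim[2]        = {32, 64};             // the tile ("box") each TMA request copies
    uint32_t elem_strides[2]   = {1, 1};               // dense, no element striding

    cuTensorMapEncodeTiled(&tmap, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*rank=*/2, gmem, global_dim, global_strides,
                           box_dim, elem_strides,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_128B,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tmap;
}
```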
Lesson 5 — cp.async.bulk (Async Bulk Copies via TMA)
Bulk Tensor Operations (1D–5D)
Multicast operations
Prefetching
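For a flavor of Lesson 5, here is a sketch of a 1-D cp.async.bulk global-to-shared copy tracked by an mbarrier (the wrapper name is ours; the matching mbarrier.arrive.expect_tx bookkeeping is not shown, and the exact PTX should be checked against the PTX ISA manual):

```cpp
#include <cstdint>

// Copy `bytes` from global to shared memory asynchronously via the TMA path.
// Both addresses must be 16-byte aligned and `bytes` a multiple of 16; the
// mbarrier at `smem_mbar` is signaled when the transaction bytes arrive.
__device__ void bulk_load(void* smem_dst, const void* gmem_src,
                          uint32_t bytes, uint64_t* smem_mbar) {
    uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
    uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(smem_mbar));
    asm volatile(
        "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
        " [%0], [%1], %2, [%3];"
        :: "r"(dst), "l"(gmem_src), "r"(bytes), "r"(mbar)
        : "memory");
}
```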
Lesson 6 — WGMMA Part 1 (Warp Group Matrix Multiply-Accumulate)
Warp groups and matrix multiplication
WGMMA descriptors
Accumulator and register reuse
Scale Factors (Scale D, Scale A, Scale B)
Core matrix and 16×16 tiles
Lesson 7 — WGMMA Part 2
Commit groups and wait groups
WGMMA with FP8 data types
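Lesson 7's commit and wait groups correspond to two PTX instructions; a minimal sketch (helper names ours) of closing off issued WGMMAs and draining them looks like this, where the immediate 0 means "wait until no groups remain pending":

```cpp
// Close the current group of issued wgmma.mma_async operations.
__device__ void wgmma_commit() {
    asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");
}

// Block until at most 0 committed groups are still in flight, i.e. all done.
__device__ void wgmma_wait_all() {
    asm volatile("wgmma.wait_group.sync.aligned 0;" ::: "memory");
}
```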
Lesson 8 — Kernel Design
Compute Bound vs. Memory Bound Kernels
Warp Specialization
Cooperative vs. Ping-Pong Pipelines
Fundamentals of Pipelining
Circular buffering
Ping-Pong Pipeline Deep Dive
Epilogue handling in pipelines
Persistent scheduling
Split-K and Stream-K strategies
Data Parallel Tile Scheduling
Epilogue Fusion (Bias, Activation, Scaling)
Overview of epilogue operations
CUTLASS Source Code Walkthrough
Main loop and scheduling policies
Dispatch Policy
SM90 Tile Scheduler
SM90 Epilogue (TMA Warp Specialized)
SM90 Builder
Collective Builder
FAST.CU Kernel Walkthrough
Implementation of the main loop
Producer Warp Group (Dependency Wall)
Consumer Warp Group
Prologue
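Much of Lesson 8 is about keeping copies and compute overlapped through a circular buffer of shared-memory stages. Here is a hedged sketch using libcu++'s cuda::pipeline rather than the raw producer/consumer warp groups the lesson builds; the stage count, tile size, and toy reduction are our own:

```cpp
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;
constexpr int STAGES = 2;    // circular-buffer depth (double buffering)
constexpr int TILE   = 256;  // floats per stage

// Every thread both produces (issues copies) and consumes (computes).
__global__ void pipelined_sum(const float* in, float* out, int n_tiles) {
    __shared__ float buf[STAGES][TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;
    auto block = cg::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &state);

    // Prime the pipeline: issue copies for the first STAGES tiles.
    for (int t = 0; t < STAGES && t < n_tiles; ++t) {
        pipe.producer_acquire();
        cuda::memcpy_async(block, buf[t % STAGES], in + t * TILE,
                           sizeof(float) * TILE, pipe);
        pipe.producer_commit();
    }

    float acc = 0.0f;
    for (int t = 0; t < n_tiles; ++t) {
        pipe.consumer_wait();    // the oldest in-flight copy has landed
        acc += buf[t % STAGES][block.thread_rank() % TILE];
        pipe.consumer_release(); // recycle this stage of the circular buffer
        int next = t + STAGES;
        if (next < n_tiles) {    // keep the buffer full
            pipe.producer_acquire();
            cuda::memcpy_async(block, buf[next % STAGES], in + next * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }
    }
    out[block.thread_rank()] = acc;  // assumes `out` holds one float per thread
}
```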
Multi-GPU Programming — Part 1
NVSwitch
Topology and System Architecture
NVSwitch, BlueField DPUs and Storage Fabrics
CUDA peer-to-peer communication
MPI (Message Passing Interface)
P2P limitations and trade-offs
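As a preview of the peer-to-peer material, here is a small host-side sketch (function name ours, error checking omitted) that enables direct GPU-to-GPU access where the topology allows it and then copies a buffer between devices:

```cpp
#include <cuda_runtime.h>

// Copy `bytes` from a buffer on GPU 0 to a buffer on GPU 1. With peer access
// enabled the copy goes directly over NVLink/NVSwitch or PCIe; otherwise the
// driver transparently stages it through host memory.
void p2p_copy(float* dst_on_gpu1, const float* src_on_gpu0, size_t bytes) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/1, /*peerDevice=*/0);
    if (can_access) {
        cudaSetDevice(1);                  // enable from the accessing device
        cudaDeviceEnablePeerAccess(0, 0);  // one-time call per direction
    }
    cudaMemcpyPeer(dst_on_gpu1, /*dstDevice=*/1,
                   src_on_gpu0, /*srcDevice=*/0, bytes);
}
```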
Multi-GPU Programming — Part 2
SLURM resource allocation
PMIx Process Management
NCCL (NVIDIA Collective Communications Library)
NCCL internals and ring algorithms
AllReduce operations
NCCL Collectives: Broadcast, AllGather, ReduceScatter
Parallel Strategies: Data, Tensor, Pipeline and Expert Parallelism
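Finally, as a taste of the NCCL lessons, here is a hedged single-process sketch (buffer size illustrative, error checks omitted, at most 8 GPUs assumed) that sum-AllReduces a buffer across every visible GPU:

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);

    ncclComm_t comms[8];
    int devs[8];
    for (int i = 0; i < n_gpus; ++i) devs[i] = i;
    ncclCommInitAll(comms, n_gpus, devs);  // one communicator per local GPU

    const size_t count = 1 << 20;          // 1M floats per rank
    float* buf[8];
    cudaStream_t streams[8];
    for (int i = 0; i < n_gpus; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // In-place sum across GPUs; every rank ends with the same reduced buffer.
    ncclGroupStart();
    for (int i = 0; i < n_gpus; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < n_gpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```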
End of Course and Next Steps