Learn CUDA programming for NVIDIA Hopper GPUs.
We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to build efficient WGMMA pipelines and use CUTLASS optimizations to perform the massive matrix multiplications that power modern AI.
In addition to single-GPU performance, the curriculum covers multi-GPU scaling and the NCCL primitives needed to train trillion-parameter models. To get the most out of these lessons, you should have a basic grasp of C++ syntax and of linear algebra, especially how matrices are tiled and multiplied.
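If the tiling prerequisite is rusty, here is a minimal refresher sketch of the classic shared-memory tiled matrix multiply the course builds on (the kernel name and the 32×32 tile size are our own illustrative choices, not taken from the course):

```cpp
#include <cuda_runtime.h>

constexpr int TILE = 32;  // illustrative tile edge; launch with dim3(TILE, TILE) threads

// C = A * B for row-major N x N matrices; each block computes one TILE x TILE tile of C.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // Stage one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] =
            (row < N && k0 + threadIdx.x < N) ? A[row * N + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < N && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Multiply the two tiles, then slide along the K dimension.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```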
Introduction to the course
Table of Contents and Course Overview
Lesson 1 — H100 Hopper GPU Architecture
H100 Specifications: HBM3, Bandwidth and Power
Overview of tensor cores
Tensor Memory Accelerator (TMA)
Transformer Engine
L2 cache architecture
GPCs, TPCs and SM layouts
Thread block clusters
Distributed shared memory
SM Subpartitions (SMSPs)
Warp Schedulers and Dispatch Units
Shared memory and data movement
Occupancy
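As a taste of the occupancy topic that closes Lesson 1, here is a small host-side sketch (the kernel and the 256-thread block size are illustrative choices of ours) that queries the device and asks the runtime how many blocks can be resident per SM:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() {}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, shared memory per SM: %zu bytes\n",
           prop.multiProcessorCount, prop.sharedMemPerMultiprocessor);

    int blocks_per_sm = 0;
    // 256 threads per block, no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, dummy_kernel, 256, 0);
    printf("Resident blocks per SM at 256 threads: %d\n", blocks_per_sm);
    return 0;
}
```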
Lesson 2 — Clusters, Data Types, Inline PTX and Pointers
Thread block cluster programming
Setting cluster dimensions
Inline PTX assembly
State spaces
Data types in PTX
Generic pointers
Address space conversion
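To preview Lesson 2's inline PTX and address-space topics, here is a hedged sketch (both helper names are ours) of a PTX special-register read and a generic-to-shared pointer conversion, the 32-bit form that shared-memory PTX instructions expect:

```cpp
#include <cstdint>

// Inline PTX: read the special register identifying the SM this thread runs on.
__device__ uint32_t my_sm_id() {
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Convert a generic pointer into the 32-bit shared-memory address that
// instructions such as mbarrier and cp.async take as an operand.
__device__ uint32_t smem_addr_of(const void* generic_ptr) {
    return static_cast<uint32_t>(__cvta_generic_to_shared(generic_ptr));
}
```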
Lesson 3 — Asynchrony and Barriers
Introduction to Async Operations
Proxies
Fences and memory ordering
Fence layout and visibility
Scope of the fence
Acquire and release fences
Expected arrival count and thread arrivals
mbarrier Arrival Operations
mbarrier PTX Instructions
mbarrier Wait Operations
Phase and parity
Commit operations
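The mbarrier machinery in Lesson 3 is what libcu++ wraps as cuda::barrier, so as a gentle illustration of the arrive/wait split, here is a minimal sketch (kernel name ours) using the standard wrapper rather than raw PTX:

```cpp
#include <cuda/barrier>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void barrier_demo() {
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cg::this_thread_block();

    if (block.thread_rank() == 0)
        init(&bar, block.size());  // expected arrival count: every thread in the block
    block.sync();

    // ... phase 1 work ...
    auto token = bar.arrive();     // signal arrival without blocking
    // ... independent work can overlap here ...
    bar.wait(std::move(token));    // block until all arrivals for this phase land
}
```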
Lesson 4 — CuTensorMap Descriptors
Tensor shape, stride and data type
Element stride and size
Box Dimensions (Tile Size)
Bank conflicts
Swizzling
Swizzle Formula Deep Dive
Interleave layout
Out of bounds (OOB) fill
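To make the descriptor fields concrete, here is a hedged host-side sketch (matrix layout, tile shape, and swizzle mode are illustrative choices of ours) that encodes a CUtensorMap for a row-major fp32 matrix; with 128-byte swizzling the inner box dimension is kept at 32 floats, i.e. exactly 128 bytes:

```cpp
#include <cuda.h>     // driver API: CUtensorMap, cuTensorMapEncodeTiled
#include <cstdint>

// Assumes a live CUDA context and N divisible by 4 so the row stride
// is a multiple of 16 bytes, as the encoder requires.
CUtensorMap make_tensor_map(void* gmem, uint64_t M, uint64_t N) {
    CUtensorMap tmap;
    uint64_t global_dim[2]     = {N, M};               // fastest-moving dimension first
    uint64_t global_strides[1] = {N * sizeof(float)};  // byte stride between rows (rank-1 entries)
    uint32_t box_dim[2]        = {32, 64};             // the tile ("box") each TMA request copies
    uint32_t elem_strides[2]   = {1, 1};               // dense, no element striding

    cuTensorMapEncodeTiled(&tmap, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*rank=*/2, gmem, global_dim, global_strides,
                           box_dim, elem_strides,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_128B,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tmap;
}
```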
Lesson 5 — cp.async.bulk (Async Bulk Copies via TMA)
Bulk Tensor Operations (1D–5D)
Multicast operations
Prefetching
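For a flavor of Lesson 5, here is a sketch of a 1-D cp.async.bulk global-to-shared copy tracked by an mbarrier (the wrapper name is ours; the matching mbarrier.arrive.expect_tx bookkeeping is not shown, and the exact PTX should be checked against the PTX ISA manual):

```cpp
#include <cstdint>

// Copy `bytes` from global to shared memory asynchronously via the TMA path.
// Both addresses must be 16-byte aligned and `bytes` a multiple of 16; the
// mbarrier at `smem_mbar` is signaled when the transaction bytes arrive.
__device__ void bulk_load(void* smem_dst, const void* gmem_src,
                          uint32_t bytes, uint64_t* smem_mbar) {
    uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
    uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(smem_mbar));
    asm volatile(
        "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
        " [%0], [%1], %2, [%3];"
        :: "r"(dst), "l"(gmem_src), "r"(bytes), "r"(mbar)
        : "memory");
}
```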
Lesson 6 — WGMMA Part 1 (Warp Group Matrix Multiply-Accumulate)
Warp groups and matrix multiplication
WGMMA descriptors
Accumulator and register reuse
Scale Factors (Scale D, Scale A, Scale B)
Core matrix and 16×16 tiles
Lesson 7 — WGMMA Part 2
Commit groups and wait groups
WGMMA with FP8 data types
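Lesson 7's commit and wait groups correspond to two PTX instructions; a minimal sketch (helper names ours) of closing off issued WGMMAs and draining them looks like this, where the immediate 0 means "wait until no groups remain pending":

```cpp
// Close the current group of issued wgmma.mma_async operations.
__device__ void wgmma_commit() {
    asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");
}

// Block until at most 0 committed groups are still in flight, i.e. all done.
__device__ void wgmma_wait_all() {
    asm volatile("wgmma.wait_group.sync.aligned 0;" ::: "memory");
}
```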
Lesson 8 — Kernel Design
Compute Bound vs. Memory Bound Kernels
Warp Specialization
Cooperative vs. Ping-Pong Pipelines
Fundamentals of Pipelining
Circular buffering
Ping-Pong Pipeline Deep Dive
Epilogue handling in pipelines
Persistent scheduling
Split-K and Stream-K strategies
Data Parallel Tile Scheduling
Epilogue Fusion (Bias, Activation, Scaling)
Overview of epilogue operations
CUTLASS Source Code Walkthrough
Main loop and scheduling policies
Dispatch Policy
SM90 Tile Scheduler
SM90 Epilogue (TMA Warp Specialized)
SM90 Builder
Collective Builder
FAST.CU Kernel Walkthrough
Implementation of the main loop
Producer Warp Group (Dependency Wall)
Consumer Warp Group
Prologue
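Much of Lesson 8 is about keeping copies and compute overlapped through a circular buffer of shared-memory stages. Here is a hedged sketch using libcu++'s cuda::pipeline rather than the raw producer/consumer warp groups the lesson builds; the stage count, tile size, and toy reduction are our own:

```cpp
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;
constexpr int STAGES = 2;    // circular-buffer depth (double buffering)
constexpr int TILE   = 256;  // floats per stage

// Every thread both produces (issues copies) and consumes (computes).
__global__ void pipelined_sum(const float* in, float* out, int n_tiles) {
    __shared__ float buf[STAGES][TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;
    auto block = cg::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &state);

    // Prime the pipeline: issue copies for the first STAGES tiles.
    for (int t = 0; t < STAGES && t < n_tiles; ++t) {
        pipe.producer_acquire();
        cuda::memcpy_async(block, buf[t % STAGES], in + t * TILE,
                           sizeof(float) * TILE, pipe);
        pipe.producer_commit();
    }

    float acc = 0.0f;
    for (int t = 0; t < n_tiles; ++t) {
        pipe.consumer_wait();    // the oldest in-flight copy has landed
        acc += buf[t % STAGES][block.thread_rank() % TILE];
        pipe.consumer_release(); // recycle this stage of the circular buffer
        int next = t + STAGES;
        if (next < n_tiles) {    // keep the buffer full
            pipe.producer_acquire();
            cuda::memcpy_async(block, buf[next % STAGES], in + next * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }
    }
    out[block.thread_rank()] = acc;  // assumes `out` holds one float per thread
}
```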
Multi-GPU Programming — Part 1
NVSwitch
Topology and System Architecture
NVSwitch, BlueField DPUs and Storage Fabrics
CUDA peer-to-peer communication
MPI (Message Passing Interface)
P2P limitations and trade-offs
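As a preview of the peer-to-peer material, here is a small host-side sketch (function name ours, error checking omitted) that enables direct GPU-to-GPU access where the topology allows it and then copies a buffer between devices:

```cpp
#include <cuda_runtime.h>

// Copy `bytes` from a buffer on GPU 0 to a buffer on GPU 1. With peer access
// enabled the copy goes directly over NVLink/NVSwitch or PCIe; otherwise the
// driver transparently stages it through host memory.
void p2p_copy(float* dst_on_gpu1, const float* src_on_gpu0, size_t bytes) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/1, /*peerDevice=*/0);
    if (can_access) {
        cudaSetDevice(1);                  // enable from the accessing device
        cudaDeviceEnablePeerAccess(0, 0);  // one-time call per direction
    }
    cudaMemcpyPeer(dst_on_gpu1, /*dstDevice=*/1,
                   src_on_gpu0, /*srcDevice=*/0, bytes);
}
```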
Multi-GPU Programming — Part 2
SLURM resource allocation
PMIx Process Management
NCCL (NVIDIA Collective Communications Library)
NCCL internals and ring algorithms
AllReduce operations
NCCL Collectives: Broadcast, AllGather, ReduceScatter
Parallel Strategies: Data, Tensor, Pipeline and Expert Parallelism
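Finally, as a taste of the NCCL lessons, here is a hedged single-process sketch (buffer size illustrative, error checks omitted, at most 8 GPUs assumed) that sum-AllReduces a buffer across every visible GPU:

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);

    ncclComm_t comms[8];
    int devs[8];
    for (int i = 0; i < n_gpus; ++i) devs[i] = i;
    ncclCommInitAll(comms, n_gpus, devs);  // one communicator per local GPU

    const size_t count = 1 << 20;          // 1M floats per rank
    float* buf[8];
    cudaStream_t streams[8];
    for (int i = 0; i < n_gpus; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // In-place sum across GPUs; every rank ends with the same reduced buffer.
    ncclGroupStart();
    for (int i = 0; i < n_gpus; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < n_gpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```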
End of Course and Next Steps