CUDA Programming for NVIDIA H100s

by SkillAiNest

Learn CUDA programming for NVIDIA Hopper GPUs.

We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to build efficient WGMMA pipelines and leverage CUTLASS optimizations to perform the massive matrix multiplications that power modern AI.

In addition to single-GPU performance, the curriculum covers multi-GPU scaling and the NCCL primitives needed to train trillion-parameter models. To get the most out of these lessons, you should have a basic grasp of C++ syntax and linear algebra, especially how matrices are tiled and multiplied.

  • Introduction to the course

  • Table of Contents and Course Overview

  • Lesson 1 — H100 Hopper GPU Architecture

  • H100 Specifications: HBM3, Bandwidth and Power

  • Overview of tensor cores

  • Tensor Memory Accelerator (TMA)

  • Transformer engine

  • L2 cache architecture

  • GPCs, TPCs and SM layouts

  • Thread block clusters

  • Distributed shared memory

  • SM Subpartitions (SMSPs)

  • Warp Schedulers and Dispatch Units

  • Shared memory and data movement

  • Occupancy

  • Lesson 2 — Clusters, Data Types, Inline PTX and Pointers

  • Thread block clusters programming

  • Setting cluster dimensions

  • Inline PTX assembly

  • State spaces

  • Data types in PTX

  • Generic pointers

  • Address space conversion

  • Lesson 3 — Asynchrony and Constraints

  • Introduction to Async Operations

  • Proxies

  • Fences and memory ordering

  • Fence layout and visibility

  • Scope of the fence

  • Fence acquire and release

  • Expected count and thread arrival

  • mbarrier Arrive Operations

  • mbarrier PTX Instructions

  • mbarrier Wait Operations

  • Phase and Parity

  • Commit operations

  • Lesson 4 — CuTensorMap Descriptors

  • Tensor shape, stride and data type

  • Element strides and sizes

  • Box Dimensions (Tile Size)

  • Bank conflicts

  • Swizzling

  • Swizzle Formula Deep Dive

  • Interleave layout

  • Out of bounds (OOB) fill

  • Lesson 5 — cp.async.bulk (Async Bulk Copies via TMA)

  • Bulk Tensor Operations (1D–5D)

  • Multicast operations

  • Prefetching

  • Lesson 6 — WGMMA Part 1 (Warp Group Matrix Multiply-Accumulate)

  • Warp groups and matrix multiplication

  • WGMMA descriptors

  • Accumulator and register reuse

  • Scale Factors (Scale D, Scale A, Scale B)

  • Core matrix and 16×16 tiles

  • Lesson 7 — WGMMA Part 2

  • Commit groups and wait groups

  • WGMMA with FP8 data types

  • Lesson 8 — Kernel Design

  • Compute Bound vs. Memory Bound Kernels

  • Warp Specialization

  • Cooperative vs. Ping-Pong Pipelines

  • Fundamentals of Pipelining

  • Circular buffering

  • Ping Pong Pipeline Deep Dive

  • Epilogue handling in pipelines

  • Persistent scheduling

  • Split-K and Stream-K strategies

  • Data Parallel Tile Scheduling

  • Epilogue Fusion (Bias, Activation, Scaling)

  • Overview of epilogue operations

  • Cutlass Source Code Walkthrough

  • Main loop and scheduling policies

  • Dispatch Policy

  • SM90 Tile Scheduler

  • SM90 Epilogue (TMA Warp Specialized)

  • SM90 Builder

  • Collective Builder

  • FAST.CU Kernel Walkthrough

  • Implementation of the main loop

  • Producer Warp Group (Dependency Wall)

  • Consumer Warp Group

  • Prologue

  • Multi-GPU Programming — Part 1

  • NVSwitch

  • Topology and System Architecture

  • NVSwitch, BlueField DPUs and Storage Fabrics

  • CUDA peer-to-peer communication

  • MPI (Message Passing Interface)

  • P2P limitations and trade-offs

  • Multi-GPU Programming — Part 2

  • SLURM resource allocation

  • PMIx Process Management

  • NCCL (NVIDIA Collective Communications Library)

  • NCCL internals and ring algorithms

  • AllReduce operations

  • NCCL Collectives: Broadcast, AllGather, ReduceScatter

  • Parallel Strategies: Data, Tensor, Pipeline and Expert Parallelism

  • End of Course and Next Steps
