

Photo by Author | Ideogram
GPUs are great for tasks where you need to perform the same operation on different pieces of data. This is known as the Single Instruction, Multiple Data (SIMD) paradigm. Unlike a CPU, which has only a few powerful cores, a GPU has thousands of smaller cores that can work together. You see this pattern a lot in machine learning, for example when adding or multiplying large vectors, because each element-wise calculation is independent of the others. This is the ideal scenario for using GPUs to accelerate parallel tasks.
NVIDIA created CUDA as a way for developers to write programs that run on the GPU instead of the CPU. It is based on C and lets you write special functions, called kernels, that perform many operations in parallel. The problem is that writing CUDA in C or C++ is not exactly beginner-friendly. You have to allocate memory manually, coordinate threads, and understand how the GPU works at a low level. This can be overwhelming, especially if you are used to writing code in Python.
That is where Numba can help. Numba uses the LLVM (Low Level Virtual Machine) compiler infrastructure, so your Python code can be compiled directly into CUDA kernels. With just-in-time (JIT) compilation, you simply decorate your functions, and Numba handles everything else for you.
In this article, we will use the common example of vector addition and convert simple CPU code into a CUDA kernel with Numba. Vector addition is an ideal example of parallelism, as the addition at each index is independent of every other index. This is the perfect SIMD scenario, so all indices can be added simultaneously and the whole vector addition completes in one parallel operation.
Note that you will need a CUDA-capable GPU to follow along with this article. You can use Google Colab's free T4 GPU, or a local GPU with the NVIDIA CUDA Toolkit and nvcc installed.
1. Setting Up the Environment and Installing Numba
Numba is available as a pip package, and we will also use NumPy for vector operations. Set up the environment using the following commands:
python3 -m venv venv
source venv/bin/activate
pip install numba-cuda numpy
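Before moving on, it can help to confirm that Numba actually sees your GPU. This quick sanity check is my addition, not part of the original setup, and uses Numba's standard detection helpers:
from numba import cuda

# Check that a CUDA-capable GPU and driver are visible to Numba
print(cuda.is_available())  # should print True
cuda.detect()               # prints details of the detected device(s)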
2. Vector Addition on the CPU
Let’s start with the simple example of vector addition. For two given vectors, we add the values at each corresponding index to get the final result. We will use NumPy to generate random float32 vectors and compute the output using a for loop.
import numpy as np

N = 10_000_000  # 10 million elements

a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)
c = np.zeros_like(a)  # Output array

def vector_add_cpu(a, b, c):
  """Add two vectors on CPU"""
  for i in range(len(a)):
    c[i] = a[i] + b[i]
A breakdown of the code:
- We initialize two vectors with 10 million random floating-point numbers each.
- We also create an empty vector c to store the result.
- The vector_add_cpu function simply loops through each index, adds the elements of a and b, and stores the result in c.
It’s a serial operation; each addition happens one after another. Although it works, it is not the most efficient approach, especially for large datasets. Since each addition is independent of the others, it is a great candidate for parallel execution on the GPU.
In the next section, you will see how to convert the same operation to run on the GPU using Numba. By assigning a GPU thread to each element, we can significantly speed up this work.
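One housekeeping detail: the benchmarking code at the end of this article calls a time_cpu() helper that wraps the function above. Here is a minimal sketch of that wrapper, assuming it simply allocates a fresh output array and runs vector_add_cpu:
def time_cpu():
  # Run the CPU implementation on a fresh output array
  c_cpu = np.zeros_like(a)
  vector_add_cpu(a, b, c_cpu)
  return c_cpu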
3. Vector Addition on the GPU with Numba
Now we will use Numba to define a kernel function that is compiled for CUDA and executed on the GPU. We perform the same vector addition, but now it runs in parallel across every index of the NumPy arrays, which makes execution much faster.
Here is the code for the kernel:
from numba import config

# Required for newer CUDA versions to enable linking tools.
# Prevents CUDA toolkit and NVCC version mismatches.
config.CUDA_ENABLE_PYNVJITLINK = 1

from numba import cuda, float32

@cuda.jit
def vector_add_gpu(a, b, c):
    """Add two vectors using CUDA kernel"""
    # Thread ID in the current block
    tx = cuda.threadIdx.x
    # Block ID in the grid
    bx = cuda.blockIdx.x
    # Block width (number of threads per block)
    bw = cuda.blockDim.x
    # Calculate the unique thread position
    position = tx + bx * bw
    # Make sure we don't go out of bounds
    if position < len(a):
        c[position] = a[position] + b[position]

def gpu_add(a, b, c):
    # Define the grid and block dimensions
    threads_per_block = 256
    blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

    # Copy data to the device
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c)

    # Launch the kernel
    vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

    # Copy the result back to the host
    d_c.copy_to_host(c)

def time_gpu():
    c_gpu = np.zeros_like(a)
    gpu_add(a, b, c_gpu)
    return c_gpu
Let’s break down what is happening.
Understanding the GPU Function
The @cuda.jit decorator tells Numba to treat the following function as a CUDA kernel: a special function that will run in parallel across many GPU threads. At run time, Numba compiles this function to CUDA code and handles the C API calls for you.
@cuda.jit
def vector_add_gpu(a, b, c):
    ...
This function will run on thousands of threads at the same time, so we need a way to determine which piece of data each thread should work on. That is what the next few lines do:
- tx is the ID of the thread within its block
- bx is the ID of the block within the grid
- bw is the block width, i.e. the number of threads per block
We combine these to calculate a unique position, which tells each thread which element it should process. Note that the launch configuration does not always map exactly onto valid indices: threads are launched in whole blocks, so the total number of threads can exceed the length of the vector when that length is not a multiple of the block size. That is why we add a guard condition before performing the addition; it prevents out-of-bounds accesses on the array at run time.
Once we know the unique position, we can add the values just as we did in the CPU implementation. The following line mirrors the CPU version:
c[position] = a[position] + b[position]
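As a side note (not part of the original walkthrough), Numba also provides cuda.grid(1), which computes this absolute thread position in one call. An equivalent kernel sketch, using an illustrative name vector_add_gpu_grid:
@cuda.jit
def vector_add_gpu_grid(a, b, c):
    # cuda.grid(1) is shorthand for threadIdx.x + blockIdx.x * blockDim.x
    position = cuda.grid(1)
    if position < len(a):
        c[position] = a[position] + b[position]
It is launched in exactly the same way as vector_add_gpu.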
Understanding the Kernel Launch
The gpu_add function sets things up:
- It defines how many threads and blocks to use (see the worked example after this list). You can experiment with different block and thread sizes, and print the relevant values from inside the GPU kernel; this can help you understand how GPU indexing works.
- It copies the input arrays (a, b, and c) from CPU memory to GPU memory, so the vectors are accessible in GPU RAM.
- It launches the GPU kernel with vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c).
- Finally, it copies the result back from the GPU into c, so we can access the values on the CPU.
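To make the launch configuration concrete, here is a small worked example for our 10-million-element array; the numbers below are just the arithmetic from gpu_add spelled out:
threads_per_block = 256
blocks_per_grid = (10_000_000 + threads_per_block - 1) // threads_per_block  # 39_063 blocks
total_threads = blocks_per_grid * threads_per_block                          # 10_000_128 threads
# 128 more threads are launched than there are elements,
# which is exactly what the bounds check in the kernel guards against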
4. Comparing the Implementations and Potential Speedup
Now that we have both the CPU and GPU versions of vector addition, it is time to compare how they perform. To verify the results and benchmark both implementations, we use the following code:
import timeit
c_cpu = time_cpu()
c_gpu = time_gpu()
print("Results match:", np.allclose(c_cpu, c_gpu))
cpu_time = timeit.timeit("time_cpu()", globals=globals(), number=3) / 3
print(f"CPU implementation: {cpu_time:.6f} seconds")
gpu_time = timeit.timeit("time_gpu()", globals=globals(), number=3) / 3
print(f"GPU implementation: {gpu_time:.6f} seconds")
speedup = cpu_time / gpu_time
print(f"GPU speedup: {speedup:.2f}x")
First, we run both implementations and check that their results match. This is important to confirm that our GPU code works correctly and produces the same output as the CPU.
Next, we use Python's built-in timeit module to measure how long each version takes. We run each function a few times and take the average to get reliable timings. Finally, we calculate how much faster the GPU version is compared to the CPU. You should see a large difference, because the GPU performs many additions at the same time, while the CPU handles them one at a time in a loop.
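One practical caveat: the very first call to a @cuda.jit function also pays the JIT compilation cost. The verification call above (c_gpu = time_gpu()) already acts as a warm-up before timing, but you can make this explicit; a small sketch using the names defined earlier and Numba's cuda.synchronize():
# Warm up the kernel so compilation time is not included in the benchmark
_ = time_gpu()
# Block until all queued GPU work has finished before starting the timer
cuda.synchronize()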
Here is the expected output on an NVIDIA T4 GPU on Colab. Note that exact speedups vary depending on the CUDA version and the underlying hardware.
Results match: True
CPU implementation: 4.033822 seconds
GPU implementation: 0.047736 seconds
GPU speedup: 84.50x
This simple test demonstrates the power of GPU acceleration and why it is so useful for large amounts of data-parallel work.
5. Wrapping Up
And that’s it. You have now written your first CUDA kernel with Numba, without actually writing any C or CUDA code. Numba provides a simple interface to the GPU from Python, and it makes it very easy for engineers to get started with CUDA programming.
You can now use the same template to write more advanced CUDA algorithms, which are common in machine learning and deep learning. Whenever a problem follows the SIMD paradigm, it is a good idea to consider GPUs to improve execution speed.
The full code is available in a Colab notebook you can access here. Feel free to make simple changes to it to get a better understanding of how CUDA indexing and execution work internally.
Kanwal Mehreen is a machine learning engineer and a technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.