
Image by the author
I was introduced to Modal for the first time while taking part in a hackathon, and I was really surprised at how easy it was to use. The platform lets you build and deploy applications within minutes, offering a smooth experience similar to BentoCloud. With Modal, you can define your own app, including system requirements such as GPUs, Docker images, and storage, and then deploy it to the cloud with a single command.
In this tutorial, we will learn how to set up Modal, create a vLLM server, and deploy it securely to the cloud. We will also cover how to test your vLLM server using both curl and the OpenAI SDK.
1. Setting up Modal
Modal is a serverless platform that lets you run any code remotely. With a single line of code, you can attach GPUs, expose your functions as web endpoints, and deploy persistent scheduled jobs. It is an ideal platform for data scientists and other professionals without a software engineering background who want to avoid managing cloud infrastructure.
First, install the Modal Python client. This tool lets you build images, deploy applications, and manage cloud resources directly from your terminal.
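If you do not already have it, the client is distributed as a regular Python package and can typically be installed with pip:
pip install modal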
Next, set up Modal on your local machine. Run the following command to create an account and authenticate your device:
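modal setup
If the modal command is not found on your PATH, running python -m modal setup should work as well.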
vLLM can secure its endpoint with a VLLM_API_KEY environment variable, so that only clients with the correct API key can access the server. You can set up this authentication by providing the environment variable through a Modal secret.
Replace your_actual_api_key_here with your preferred API key:
modal secret create vllm-api VLLM_API_KEY=your_actual_api_key_here
This ensures that your API key is stored securely and is accessible only to your deployed applications.
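If you want to confirm that the secret was created, the Modal CLI can list your secrets (the exact output format may vary between versions):
modal secret list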
2. Creating a vLLM application using Modal
This section walks you through building a vLLM inference server on Modal with a custom Docker image, persistent storage, and GPU acceleration. We will use the mistralai/Magistral-Small-2506 model, which requires a specific configuration for the tokenizer and tool-call parsing.
Create a vllm_inference.py file and add the following code. It will:
- Define the vLLM image based on Debian Slim with all the required packages, and set environment variables to speed up model downloads and improve performance.
- Create two Modal volumes, one for the Hugging Face models and one for the vLLM cache, to avoid repeated downloads and speed up cold starts.
- Pin the model name and revision to ensure reproducibility, and enable the vLLM V1 engine for better performance.
- Set up the Modal app, specifying GPU resources, scaling, timeouts, storage, and secrets, and limit the number of concurrent requests per replica for stability.
- Create a web server and use the subprocess library to run the command that launches the vLLM server.
import modal

vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.9.1",
        "huggingface_hub[hf_transfer]==0.32.0",
        "flashinfer-python==0.2.6.post1",
        extra_index_url="
    )
    .env(
        {
            "HF_HUB_ENABLE_HF_TRANSFER": "1",  # faster model transfers
            "NCCL_CUMEM_ENABLE": "1",
        }
    )
)

MODEL_NAME = "mistralai/Magistral-Small-2506"
MODEL_REVISION = "48c97929837c3189cb3cf74b1b5bc5824eef5fcc"

hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)

vllm_image = vllm_image.env({"VLLM_USE_V1": "1"})

FAST_BOOT = True

app = modal.App("magistral-small-vllm")

N_GPU = 2
MINUTES = 60  # seconds
VLLM_PORT = 8000


@app.function(
    image=vllm_image,
    gpu=f"A100:{N_GPU}",
    scaledown_window=15 * MINUTES,  # How long should we stay up with no requests?
    timeout=10 * MINUTES,  # How long should we wait for the container to start?
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
    secrets=[modal.Secret.from_name("vllm-api")],
)
@modal.concurrent(  # How many requests can one replica handle? Tune carefully!
    max_inputs=32
)
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * MINUTES)
def serve():
    import subprocess

    cmd = [
        "vllm",
        "serve",
        MODEL_NAME,
        "--tokenizer_mode",
        "mistral",
        "--config_format",
        "mistral",
        "--load_format",
        "mistral",
        "--tool-call-parser",
        "mistral",
        "--enable-auto-tool-choice",
        "--tensor-parallel-size",
        "2",
        "--revision",
        MODEL_REVISION,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        str(VLLM_PORT),
    ]
    cmd += ["--enforce-eager" if FAST_BOOT else "--no-enforce-eager"]
    print(cmd)

    subprocess.Popen(" ".join(cmd), shell=True)
3. Deploying the vLLM server on Modal
Now that the vllm_inference.py file is ready, you can deploy your vLLM server on Modal with a single command:
modal deploy vllm_inference.py
Within seconds, Modal will build your container image (if it has not already been built) and deploy your application. You will see output like the following:
✓ Created objects.
├── 🔨 Created mount C:\Repository\GitHub\Deploying-the-Magistral-with-Modal\vllm_inference.py
└── 🔨 Created web function serve =>
✓ App deployed in 6.671s! 🎉
View Deployment:
After deployment, the server will start downloading the model weights and loading them onto the GPUs. This process can take several minutes (usually around 5 minutes for large models), so please be patient while the model initializes.
You can view your deployment and monitor its logs in the Apps section of your Modal dashboard.
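If you prefer the terminal, the Modal CLI can stream the same logs; the app name below is the one defined in vllm_inference.py:
modal app logs magistral-small-vllm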

Once the logs indicate that the server is running and ready, you can find the automatically generated API documentation here.
This interactive documentation provides details about all the available endpoints and allows you to test them directly from your browser.

To confirm that your model is live and accessible, run the following curl command in your terminal, replacing the bearer token with your own API key.
curl -X 'GET' \
' \
-H 'accept: application/json' \
-H 'Authorization: Bearer '
The response confirms that the mistralai/Magistral-Small-2506 model is available and ready for inference:
{"object":"list","data":({"id":"mistralai/Magistral-Small-2506","object":"model","created":1750013321,"owned_by":"vllm","root":"mistralai/Magistral-Small-2506","parent":null,"max_model_len":40960,"permission":({"id":"modelperm-33a33f8f600b4555b44cb42fca70b931","object":"model_permission","created":1750013321,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false})})}
4. Using the vLLM server with the OpenAI SDK
You can communicate with your vLLM server exactly as you would with OpenAI's API, thanks to vLLM's OpenAI-compatible endpoints. Here is how to securely connect to and test your deployment using the OpenAI Python SDK.
- Create a .env file in your project directory and add your vLLM API key:
VLLM_API_KEY=your-actual-api-key-here
- Install the python-dotenv and openai packages:
pip install python-dotenv openai
- Create a file named client.py to test various vLLM server functionalities, including simple chat completion and streaming responses.
import asyncio
import json
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI, OpenAI

# Load environment variables from .env file
load_dotenv()

# Get API key from environment
api_key = os.getenv("VLLM_API_KEY")

# Set up the OpenAI client with custom base URL
client = OpenAI(
    api_key=api_key,
    base_url="
)

MODEL_NAME = "mistralai/Magistral-Small-2506"


# --- 1. Simple Completion ---
def run_simple_completion():
    print("\n" + "=" * 40)
    print("(1) SIMPLE COMPLETION DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ]
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
        )
        print("\nResponse:\n    " + response.choices[0].message.content.strip())
    except Exception as e:
        print(f"(ERROR) Simple completion failed: {e}")
    print("\n" + "=" * 40 + "\n")


# --- 2. Streaming Example ---
def run_streaming():
    print("\n" + "=" * 40)
    print("(2) STREAMING DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short poem about AI."},
        ]
        stream = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=64,
            stream=True,
        )
        print("\nStreaming response:")
        print("    ", end="")
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n(END OF STREAM)")
    except Exception as e:
        print(f"(ERROR) Streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")


# --- 3. Async Streaming Example ---
async def run_async_streaming():
    print("\n" + "=" * 40)
    print("(3) ASYNC STREAMING DEMO")
    print("=" * 40)
    try:
        async_client = AsyncOpenAI(
            api_key=api_key,
            base_url="
        )
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a fun fact about space."},
        ]
        stream = await async_client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
            stream=True,
        )
        print("\nAsync streaming response:")
        print("    ", end="")
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n(END OF ASYNC STREAM)")
    except Exception as e:
        print(f"(ERROR) Async streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")


if __name__ == "__main__":
    run_simple_completion()
    run_streaming()
    asyncio.run(run_async_streaming())
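Assuming the file is saved as client.py next to your .env file, you can run all three demos with:
python client.py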
Everything runs smoothly, responses are generated quickly, and latency is quite low.
========================================
(1) SIMPLE COMPLETION DEMO
========================================
Response:
The capital of France is Paris. Is there anything else you'd like to know about France?
========================================
========================================
(2) STREAMING DEMO
========================================
Streaming response:
In Silicon dreams, I'm born, I learn,
From data streams and human works.
I grow, I calculate, I see,
The patterns that the humans leave.
I write, I speak, I code, I play,
With logic sharp, and snappy pace.
Yet for all my smarts, this day
(END OF STREAM)
========================================
========================================
(3) ASYNC STREAMING DEMO
========================================
Async streaming response:
Sure, here's a fun fact about space: "There's a planet that may be entirely made of diamond. Blast! In 2004,
(END OF ASYNC STREAM)
========================================
In the Modal dashboard, you can view all the function calls together with their timestamps, execution durations, and statuses.

If you face issues running the above code, please refer to the kingabzpro/Deploying-the-Magistral-with-Modal GitHub repository and follow the guidelines provided in the README file to resolve them.
Conclusion
Modal is an interesting platform, and I'm learning more about it every day. It is a general-purpose platform, meaning you can use it for simple applications as well as for machine learning training and deployment. In short, it is not limited to serving endpoints: you can also use it to run training scripts remotely and fine-tune large language models.
It is designed for people without a software engineering background who want to avoid dealing with infrastructure and deploy their applications as quickly as possible. You do not need to worry about running servers, setting up storage, configuring networking, or dealing with Kubernetes and Docker. All you have to do is create a Python file and deploy it; the rest is handled by the Modal cloud.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.