How to Run Multiple LLMs Locally on One Server Using llama-swap

by SkillAiNest

Photo by Author | Ideogram

Running multiple large language models can be useful, whether you are comparing model outputs, falling back to another model if one fails, or customizing behavior (for example, using one model for coding and another for technical writing). That is often how we use LLMs: apps like poe.com offer exactly this kind of setup, a single platform where you can run several LLMs. But what if you want to do all of this locally, save on API costs, and keep your data private?

That is where the real problem appears. Setting this up locally usually means running each model on a different port, managing each one separately, and switching between them manually. Not ideal.

This is exactly the pain point llama-swap solves. It is an open-source proxy server that is extremely lightweight (just a single binary) and lets you switch easily between multiple local LLMs. In simple terms, it listens on your machine for OpenAI-style API calls and automatically starts or stops the correct model server based on the model you request. Let's break down how it works and get it running on your local machine with a step-by-step setup.

How llama-swap Works

Think of llama-swap as a smart router sitting in front of your LLM servers. When an API request comes in (say, a POST /v1/chat/completions call), it looks at the "model" field in the JSON payload. It then loads the appropriate server process for that model, shutting down another model if needed. For example, if you first request model "A" and then request model "B", llama-swap automatically stops the server for "A" and starts the server for "B", so each request is served by the correct model. This dynamic swapping is transparent: clients see the expected response without having to worry about the underlying processes.
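
For instance (a minimal sketch that assumes the proxy from step 4 is listening on localhost:8080 and the model IDs defined in step 3), two consecutive requests that differ only in the "model" field are enough to trigger a swap:

# First request: llama-swap starts the backend for "smollm2"
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "smollm2", "messages": [{"role": "user", "content": "Hi"}]}'

# Second request: llama-swap stops "smollm2" and starts "qwen2.5" before answering
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5", "messages": [{"role": "user", "content": "Hi"}]}'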

By default, llama-swap allows only one model to run at a time (it unloads the others when switching). However, its groups feature lets you change this behavior. A group can list several models and control how they are swapped. For example, setting swap: false in a group means that all members of the group can run together without evicting one another. In practice, you might keep heavyweight models in the default behavior (only one active at a time) and put the lightweight models you want running simultaneously into a separate "parallel" group. This gives you full control over resource usage and concurrency on a single server.
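
As a rough sketch of what that looks like in the configuration (the groups, swap, exclusive, and members keys below follow llama-swap's group options as I understand them; verify the exact syntax against the project's full configuration example):

# Hypothetical excerpt from config.yaml: two small models allowed to run side by side
groups:
  "parallel-small":
    swap: false        # do not evict group members when switching between them
    exclusive: false   # do not force models outside this group to unload
    members:
      - "smollm2"
      - "qwen2.5"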

Prerequisites

Before starting, make sure you have the following in your system:

  • Python 3 (>= 3.8): Required for basic scripting and tooling.
  • Homebrew (on macOS): Makes it easy to install the LLM runtime. For example, you can install llama.cpp, including its server, with brew install llama.cpp. This provides the llama-server binary to host models locally.

  • llama.cpp (llama-server): An OpenAI-compatible server binary (installed via Homebrew above, or built from source) that actually runs the LLM models.
  • Hugging Face CLI: To download models directly to your local machine without logging into the site or manually navigating the model pages. Install it with:
pip install -U "huggingface_hub[cli]"
  • Hardware: Any modern CPU will work. For faster inference, a GPU helps. (On Apple Silicon Macs, you can run on the CPU or try PyTorch MPS for supported models. On Linux/Windows with NVIDIA GPUs, you can use a Docker/CUDA container for acceleration.)
  • Docker (optional): For running the pre-built Docker images. However, I chose not to use it for this guide because those images are primarily built for x86 (Intel/AMD) systems and do not run reliably on Apple Silicon (M1/M2) Macs. Instead, I used the bare-metal installation method, which runs directly on macOS without container overhead.

To summarize, you will need a working Python environment and a local LLM server (such as the `llama.cpp` server). We will use them to host two example models on one machine.
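
Before moving on, a quick sanity check (a minimal sketch; adjust the commands to your setup) confirms the main tools are on your PATH:

# Verify Python, the llama.cpp server binary, and the Hugging Face CLI are installed
python3 --version
which llama-server
huggingface-cli --help | head -n 1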

Step-by-Step Instructions

1. Download llama-swap

Download the latest llama-swap release for your OS from the GitHub releases page. For example, at the time of writing, v126 was the latest release. Run the following commands:

# Step 1: Download the correct file
# (replace <release-asset-URL> with the asset link for your OS/architecture from the GitHub releases page)
curl -L -o llama-swap.tar.gz \
  <release-asset-URL>
Output:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 3445k  100 3445k    0     0  1283k      0  0:00:02  0:00:02 --:--:-- 5417k

Now, extract the file, make it executable, and test it by checking the version:

# Step 2: Extract it
tar -xzf llama-swap.tar.gz

# Step 3: Make it executable
chmod +x llama-swap

# Step 4: Test it
./llama-swap --version
Output:
version: 126 (591a9cdf4d3314fe4b3906e939a17e76402e1655), built at 2025-06-16T23:53:50Z

2. Download and Prepare Two (or More) LLMs

Select two example models to run. We will use Qwen2.5-0.5B and SmolLM2-135M (two small models) from Hugging Face. You need the model files (in GGUF or a similar format) on your machine. For example, download them with the Hugging Face CLI:

mkdir -p ~/llm-models

huggingface-cli download bartowski/SmolLM2-135M-Instruct-GGUF \
  --include "SmolLM2-135M-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models

huggingface-cli download bartowski/Qwen2.5-0.5B-Instruct-GGUF \
  --include "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models

This will:

  • Create a directory llm-models in your user's home folder
  • Download the GGUF model files into that folder. After the download, you can confirm they are there:

ls ~/llm-models

Output:

SmolLM2-135M-Instruct-Q4_K_M.gguf
Qwen2.5-0.5B-Instruct-Q4_K_M.gguf

3. Configure llama-swap

llama-swap uses a single YAML file to describe the models and their server commands. Create a config.yaml file with contents like this:

models:
  "smollm2":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/SmolLM2-135M-Instruct-Q4_K_M.gguf
      --port ${PORT}

  "qwen2.5":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
      --port ${PORT}

Replace /path/to/models/ with your actual local path. Each entry under models: gets an ID (such as "qwen2.5") and a shell cmd: to run its server. We use llama-server (from llama.cpp) with --model to point to the GGUF file and --port ${PORT}. The ${PORT} macro lets llama-swap automatically assign a free port to each model. The groups section is optional; I have left it out for this example, so by default llama-swap will run only one model at a time. You can also customize many per-model options (aliases, timeouts, etc.) in this config. For more details on the available options, see the full configuration example file.
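
For example, per-model options such as aliases and an idle timeout look roughly like this (the aliases and ttl keys follow llama-swap's configuration reference as I understand it; treat this as an illustrative sketch and confirm against the full configuration example file):

# Hypothetical excerpt: extending the "qwen2.5" entry from config.yaml
models:
  "qwen2.5":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
      --port ${PORT}
    aliases:
      - "my-default-model"   # requests using this alias are routed to qwen2.5
    ttl: 300                 # unload the model after 300 seconds of inactivity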

4. Start llama-swap

With the binary and config.yaml ready, start llama-swap and point it at your configuration:

./llama-swap --config config.yaml --listen 127.0.0.1:8080

This launches the proxy server on localhost:8080. It will read config.yaml and will not load any model until the first request arrives. llama-swap will now handle API requests on port 8080, sending each one to the appropriate underlying llama-server process based on the "model" parameter.
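
Because llama-swap exposes an OpenAI-compatible API, a quick way to confirm the proxy is up is to ask it which model IDs it knows about (assuming the standard /v1/models listing endpoint is available in your build):

# List the models defined in config.yaml via the proxy
curl -s http://localhost:8080/v1/models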

5. Talk to Your Models

Now you can make OpenAI-style API calls to test each model. Install jq (for example, with brew install jq on macOS) if you don't already have it before running the commands below.

Using Qwen2.5

curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
        "model": "qwen2.5",
        "prompt": "User: What is Python?\nAssistant:",
        "max_tokens": 100
      }' | jq '.choices[0].text'
Output:
"Python is a popular general-purpose programming language. It is easy to learn, has a large standard library, and is compatible with many operating systems. Python is used for web development, data analysis, scientific computing, and machine learning.\nPython is a language that is popular for web development due to its simplicity, versatility and its use of modern features. It is used in a wide range of applications including web development, data analysis, scientific computing, machine learning and more. Python is a popular language in the"

Using SmolLM2

curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
        "model": "smollm2",
        "prompt": "User: What is Python?\nAssistant:",
        "max_tokens": 100
      }' | jq '.choices[0].text'
Output:
"Python is a high-level programming language designed for simplicity and efficiency. It's known for its readability, syntax, and versatility, making it a popular choice for beginners and developers alike.\n\nWhat is Python?"

Each model responds according to its training. The beauty of llama-swap is that you don't have to restart anything manually; just change the "model" field, and it handles the rest. As shown in the examples above, you will see:

  • qwen2.5: A more verbose, technical response
  • smollm2: A simpler, more concise answer

This confirms that llama-swap is routing requests to the correct model!
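
If you want to compare the two models side by side without copy-pasting requests, a small shell loop over the model IDs works well (a minimal sketch reusing the same endpoint and payload as above):

# Send the same prompt to each configured model and print its answer
for m in qwen2.5 smollm2; do
  echo "=== $m ==="
  curl -s http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d "{\"model\": \"$m\", \"prompt\": \"User: What is Python?\\nAssistant:\", \"max_tokens\": 60}" \
    | jq -r '.choices[0].text'
done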

Conclusion

Congratulations! You have set up llama-swap to run two LLMs on one machine, and you can now switch between them on the fly through API calls. We installed the proxy, built a configuration with two models, and saw how llama-swap routes each request to the right backend.

Next steps: You can extend this setup with:

  • Larger models (such as TinyLlama, Phi-2, or Mistral)
  • Groups to serve models in parallel
  • Integration with LangChain or other frontends

Enjoy exploring different models and configurations!

Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
