Hosting language models on a budget

by SkillAiNest


# Introduction

ChatGPT, Claude, Gemini. You know the names. But here’s a question: what if you ran your own model instead? That sounds ambitious. It isn’t. You can deploy a working large language model (LLM) in less than 10 minutes without spending a single dollar.

This article breaks it down. First, we’ll figure out what you actually need. Then we’ll look at the actual costs. Finally, we’ll deploy TinyLlama for free on Hugging Face.

Before launching your model, you probably have a lot of questions. For example: what do I expect my model to do?

Let’s try to answer this question. If you need a bot for 50 users, you don’t need GPT-5. If you’re doing sentiment analysis on 1,200+ tweets a day, you don’t need a model with 50 billion parameters.

Let’s first look at some popular use cases and the models that can perform these tasks.

[Table: popular use cases matched to suitable models]

As you can see, we matched the model to the task. This is what you should do before you begin.

# Breaking down the actual costs of hosting an LLM

Now that you know what you need, let me show you how much it costs. Hosting a model isn’t just about the model. It’s also where the model runs, how often it runs, and how many people interact with it. Let’s decode the actual costs.

// Compute: The biggest cost you’ll face

If you run a central processing unit (CPU) instance 24/7 on Amazon Web Services (AWS) EC2, it will cost roughly $36 per month. A graphics processing unit (GPU) instance, however, costs many times more. So be careful when estimating the cost of your large language model, as compute is the primary cost.

(Calculations are approximate. To see the actual price, please check here: AWS EC2 Pricing)

// Storage: Small cost unless your model is massive

Let’s estimate the disk space. A 7B (7 billion parameter) model takes about 14 gigabytes (GB). Cloud storage costs about $0.023 per GB per month, so the difference between a 1 GB model and a 14 GB model is only about $0.30 per month. Storage costs are negligible unless you plan to host a 300B parameter model.

// Bandwidth: Inexpensive unless you scale

Bandwidth matters when your data moves, and your data moves when others use your model. AWS charges $0.09 per GB of outbound transfer after the first free GB, so for a small demo you’re looking at pennies. But if you scale to millions of requests, you should factor bandwidth into your estimate.

(Calculations are approximate. To see the actual price, please check here: AWS Data Transfer Pricing)
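To put these numbers together, here is a minimal back-of-the-envelope estimator in Python, using the approximate figures quoted above (an always-on CPU instance at about $36/month, storage at $0.023 per GB-month, outbound transfer at $0.09 per GB). The function name and example workload are illustrative, and real prices vary by region and instance type.

# Rough monthly cost estimator for self-hosting an LLM.
# Prices are the approximate figures from this article; check the AWS
# pricing pages for current numbers in your region.

CPU_INSTANCE_PER_MONTH = 36.00   # always-on CPU instance, ~USD/month
STORAGE_PER_GB_MONTH = 0.023     # storage, USD per GB-month
BANDWIDTH_PER_GB = 0.09          # outbound data transfer, USD per GB

def estimate_monthly_cost(model_size_gb: float,
                          requests_per_day: int,
                          avg_response_kb: float) -> float:
    """Return an approximate monthly hosting cost in USD."""
    compute = CPU_INSTANCE_PER_MONTH
    storage = model_size_gb * STORAGE_PER_GB_MONTH
    outbound_gb = requests_per_day * 30 * avg_response_kb / 1_000_000
    bandwidth = outbound_gb * BANDWIDTH_PER_GB
    return compute + storage + bandwidth

# Example: a 14 GB model, 1,000 requests/day, ~5 KB per response
print(f"~${estimate_monthly_cost(14, 1_000, 5):.2f} per month")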

// Free Hosting Options You Can Use Today

Hugging Face Spaces lets you host a small model on a CPU for free. Render and Railway offer free tiers that work for low-traffic demos. If you’re experimenting or building a proof of concept, you can get far without spending a cent.

# Choose a model you can actually run

Now we know the costs, but which model should you run? Each model has its advantages and disadvantages. For example, if you download a 100 billion parameter model to your laptop, I guarantee it won’t run unless you have a top-notch, purpose-built workstation.

Let’s look at some models available on Hugging Face that you can try for free, as we’re going to do in the next section.

TinyLlama: This model requires no setup and runs on Hugging Face’s free CPU tier. It is designed for simple conversational tasks, answering simple questions, and text generation.

It can be used to quickly build and test chatbots, run automation experiments, or build internal Q&A systems for testing before expanding infrastructure investments.

DistilGPT-2: Also fast and lightweight, which makes it a good fit for Hugging Face Spaces. It’s fine for text completion, very simple classification tasks, or short responses, and it is useful for understanding how LLMs work without hitting resource constraints.

Phi-2: A small model developed by Microsoft that punches above its weight. It still runs on Hugging Face’s free tier but offers better reasoning and code generation. Use it for natural-language-to-SQL query generation, simple Python code generation, or sentiment analysis of customer reviews.

flan-t5-small: This is Google’s instruction-tuned model, designed to follow instructions and produce responses. It’s useful when you want constrained output on free hosting, such as summarizing, translating, or answering questions.

[Table: summary of free-tier-friendly models and their typical tasks]
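If you want to sanity-check a candidate before deploying it, a few lines of transformers code are enough. The sketch below uses the text-generation pipeline with distilgpt2 as an example prompt and checkpoint; the same pattern works for the other causal models above, while instruction-tuned models such as flan-t5-small use the text2text-generation task instead.

# Minimal sketch: try a candidate model locally before deploying it.
# Swap model_name for any Hugging Face checkpoint you are considering.
from transformers import pipeline

model_name = "distilgpt2"  # e.g. "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
generator = pipeline("text-generation", model=model_name)

result = generator(
    "The easiest way to host a small language model is",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])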

# Deploy TinyLlama in 5 minutes

Let’s build and deploy TinyLlama using free Hugging Face Spaces. No credit card, no AWS account, no Docker headaches. Just a working chatbot you can share via a link.

// Step 1: Go to Hugging Face Spaces

Head to huggingface.co/spaces and click “New Space”, as in the screenshot below.

Name the Space whatever you want and add a short description.

You can leave the other settings as they are.


Click “Create Space”.

// Step 2: Write app.py

Now, click “Create app.py” on the next screen.


Paste the code below into this file.

This code loads TinyLlama (from the weights hosted on Hugging Face), wraps it in a chat function, and uses Gradio to create a web interface. The chat() function formats your message into the model’s chat template, generates a response (up to 100 new tokens), and returns only the model’s answer, not an echo of the prompt you asked.

Every model’s page on Hugging Face includes example code, so you can adapt this approach for other models.

Let’s look at the code.

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def chat(message, history):
    # Prepare the prompt in Chat format
    prompt = f"<|user|>\n{message}\n<|assistant|>\n"
    
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs, 
        max_new_tokens=100,  
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens (skip the prompt portion)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response

demo = gr.ChatInterface(chat)
demo.launch()

After pasting the code, click “Commit new file to main”. See the screenshot below for an example.


Hugging Face will automatically detect it, install dependencies, and deploy your app.


In the meantime, create a requirements.txt file, or you will get an error like this.


// Step 3: Create requirements.txt

Click “Files” in the upper right corner of the screen.


Here, click on “Create a new file” as in the screenshot below.


Name the file “requirements.txt” and add the three Python libraries shown in the following screenshot: transformers, torch, and gradio.

transformers loads the model and handles tokenization. torch runs the model by providing the neural network engine. gradio creates the simple web interface so users can chat with the model.
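For reference, the file body is nothing more than the three package names, one per line (you can also pin versions if you want reproducible builds):

transformers
torch
gradio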


// Step 4: Run and test your deployed model

When you see the green “Running” status, you’re done.


Now let’s test it.

You can test it by opening the “App” tab, as shown below.


Let’s use it to write a Python script that detects outliers in a comma-separated values (CSV) file using the Z-score and interquartile range (IQR) methods.

Here are the test results.

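For comparison, and so you can judge the model’s answer, here is a hand-written minimal version of the script the prompt asks for. It is not the model’s output, and the file name data.csv and column name value are placeholders.

# Minimal reference: flag outliers in a numeric CSV column using
# Z-score and IQR. "data.csv" and "value" are placeholder names.
import pandas as pd

df = pd.read_csv("data.csv")
col = df["value"]

# Z-score: points more than 3 standard deviations from the mean
z_scores = (col - col.mean()) / col.std()
z_outliers = df[z_scores.abs() > 3]

# IQR: points more than 1.5 * IQR beyond the quartiles
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]

print("Z-score outliers:", len(z_outliers))
print("IQR outliers:", len(iqr_outliers))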

// Understanding the deployment you just built

The result: you can now spin up a 1B+ parameter language model without ever touching a terminal, setting up a server, or spending a dollar. Hugging Face takes care of hosting, compute, and (to a degree) scaling. A paid tier is available for more traffic, but for experiments, the free tier is ideal.

The best way to learn? Deploy first, optimize later.

# Where to go next: Improving and expanding your model

You now have a working chatbot. But TinyLlama is just the beginning. If you need better responses, try upgrading to Phi-2 or Falcon 7B using the same process: just change the model name in app.py and add a bit more compute power.

For faster responses, consider quantization. You can also connect your model to a database, add conversation memory, or fine-tune it on your data. The only limit is your imagination.
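As one example of quantization, the transformers library can load a model in 8-bit through bitsandbytes. The sketch below assumes you have moved to a GPU instance with the bitsandbytes package installed, so it will not run on the free CPU tier, and the checkpoint name is just an example.

# Sketch: load a model in 8-bit to cut memory use (GPU + bitsandbytes required).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "microsoft/phi-2"  # or any other checkpoint you upgrade to
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)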

Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
