How to Compress Your Prompts and Reduce LLM Costs

by SkillAiNest

Every developer working with large language models eventually faces the same challenge: prompts keep getting longer, models keep getting slower, and API bills keep getting higher.

Whether you’re building a retrieval-augmented generation (RAG) system or a chatbot that remembers past conversations, each additional token adds cost and latency.

Microsoft quietly introduced a fix that few people outside of research circles noticed, in a project called LLMLingua. It compresses the prompt before sending it to the model, keeping only the most important information. The result is faster responses, smaller bills, and an easier path to scaling LLMs.

In this tutorial, we’ll see how to use LLMLingua to compress your prompts and make them more effective while saving money.

The problem hiding in plain sight

When an LLM processes a prompt, every token counts toward your cost and competes for the model’s attention.

For context-heavy requests, it’s common to hit the maximum token window before you get to the useful part of your data.

Adding more context can improve the model’s reasoning, but it also slows inference. Long prompts not only take longer to generate a response, but also eat into your budget when using APIs like GPT-4 or Claude.
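To get a feel for how close a prompt is to the context window, a rough character-based estimate is often enough. This sketch uses the common rule of thumb of about four characters per English token; the ratio and the window size are assumptions, so use your provider’s real tokenizer (e.g. tiktoken) for exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text.
    For exact counts, use your provider's tokenizer (e.g. tiktoken)."""
    return max(1, len(text) // 4)

# Stand-in for a long retrieved context:
context = "A long retrieved document... " * 200

# Check against an assumed 8,192-token context window:
CONTEXT_WINDOW = 8_192
print(estimate_tokens(context), "tokens (estimated)")
print("fits in window:", estimate_tokens(context) < CONTEXT_WINDOW)
```

Even this crude estimate makes the trade-off concrete: every extra document you stuff into the context pushes you toward the window limit and the price ceiling.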

LLMLingua targets this problem directly with intelligent prompt compression, without training or modifying the underlying model.

What makes LLMLingua different?

LLMLingua uses a small, compact language model, such as GPT-2 small or LLaMA-7B, to identify and remove unnecessary tokens from a given prompt.

Instead of feeding thousands of tokens into your central model, you send a compact version that retains meaning.

This approach achieves up to 20x compression with negligible loss in accuracy. Simply put, LLMLingua lets your LLM read the same content in fewer words.

Working with LLMLingua

Getting started is easy. The library is available on PyPI and works out of the box.

pip install llmlingua

Once installed, you can import it in Python and start compressing prompts.

Here’s how you can compress a long text prompt using LLMLingua:

from llmlingua import PromptCompressor

# Uses the default small compression model
llm_lingua = PromptCompressor()

prompt = "Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box..."

# Compress the prompt down to roughly 200 tokens
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)

print(compressed_prompt)

When you run it, you’ll get a dictionary like this:

{
  'compressed_prompt': 'Question: Sam bought a dozen boxes each with 30 highlighter pens...',
  'origin_tokens': 2365,
  'compressed_tokens': 211,
  'ratio': '11.2x',
  'saving': 'Saving $0.1 in GPT-4.'
}
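The compressed text lives under the 'compressed_prompt' key, and the token counts let you verify the reported ratio yourself. A small sketch using the example result above (the dict is hard-coded here so the snippet runs without the library; in practice it comes from compress_prompt()):

```python
# Hard-coded copy of the example result shown above, so this runs standalone.
result = {
    "compressed_prompt": "Question: Sam bought a dozen boxes each with 30 highlighter pens...",
    "origin_tokens": 2365,
    "compressed_tokens": 211,
    "ratio": "11.2x",
}

# The text you actually send to your main model:
final_prompt = result["compressed_prompt"]

# Recompute the compression ratio from the token counts:
ratio = result["origin_tokens"] / result["compressed_tokens"]
print(f"{ratio:.1f}x compression")  # matches the reported '11.2x'
```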

You can also load different models depending on your resources.

# A lighter model for constrained environments
llm_lingua = PromptCompressor("microsoft/phi-2")

# A quantized Llama-2 model for GPU-limited setups
llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})

If you’re processing long documents or chat histories, this simple setup can save you hundreds of dollars in production.
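To put those savings in perspective, here’s a back-of-the-envelope estimate using the token counts from the example output above. The per-token price and request volume are illustrative assumptions, not real API rates:

```python
# Estimated monthly savings from prompt compression at production volume.
# The price and request volume below are illustrative assumptions,
# not current API rates -- check your provider's pricing page.
PRICE_PER_1K_TOKENS = 0.03   # assumed input price in USD
REQUESTS_PER_DAY = 100

original_tokens = 2_365      # from the example output above
compressed_tokens = 211

def monthly_cost(tokens_per_request: int) -> float:
    """Input-token cost over a 30-day month at the assumed volume."""
    return tokens_per_request / 1000 * PRICE_PER_1K_TOKENS * REQUESTS_PER_DAY * 30

saved = monthly_cost(original_tokens) - monthly_cost(compressed_tokens)
print(f"monthly savings: ${saved:,.2f}")
```

Even at a modest 100 requests per day, an 11x compression translates into a few hundred dollars a month under these assumptions; the savings scale linearly with volume.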

Handling Long Contexts with LongLLMLingua

LongLLMLingua extends this concept to large-scale inputs such as PDFs, transcripts, or multi-document retrieval. It dynamically ranks and filters the context so that the model sees only the most relevant parts.

Here’s how you can use it:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

# prompt_list holds the retrieved document chunks (a list of strings)
compressed_prompt = llm_lingua.compress_prompt(
    prompt_list,
    question="What are the main regulatory changes in the last quarter?",
    rate=0.55,
    condition_in_question="after_condition",
    reorder_context="sort",
    dynamic_context_compression_ratio=0.3,
    condition_compare=True,
    context_budget="+100",
    rank_method="longllmlingua",
)

This works particularly well in RAG systems where documents vary in length and relevance. By combining retrieval with compression, you can fit more context into your LLM without hitting token limits.

LLMLingua-2: Faster and Smarter

The Microsoft team didn’t stop there. They introduced LLMLingua-2, which is faster and more general-purpose.

It uses data distillation from GPT-4 and BERT-level encoders to improve compression fidelity.

This version handles out-of-domain data better and performs 3-6 times faster than the original.

from llmlingua import PromptCompressor

# LLMLingua-2 with an XLM-RoBERTa encoder
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

# Keep newlines and question marks intact during compression
compressed_prompt = llm_lingua.compress_prompt(prompt, rate=0.33, force_tokens=['\n', '?'])

# A smaller multilingual variant
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

For multilingual and enterprise scenarios, LLMLingua-2 offers the right balance of cost, accuracy, and speed.

Structured prompt compression

Sometimes, you want to control which parts of a prompt get compressed.

LLMLingua supports structured compression using special tags. You can mark sections of text to compress at different rates or skip them entirely.

structured_prompt = """Speaker 4:
 Thank you. And can we do the functions for content? Items I believe are 11, three, 14, 16 and 28, I believe.
Speaker 0:
 Item 11 is a communication from Council on Price recommendation..."""

compressed_prompt = llm_lingua.structured_compress_prompt(
    structured_prompt,
    instruction="",
    question="Summarize the meeting notes",
    rate=0.5,
)
print(compressed_prompt['compressed_prompt'])

This feature is particularly useful in summarization or note-taking systems where speaker tags or section headers must remain intact.
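The snippet above doesn’t show the tags themselves. The LLMLingua repository documents a `<llmlingua>` tag syntax for marking sections; the sketch below builds such a tagged string (treat the exact attribute spellings as an assumption and verify against the project README before relying on them):

```python
# Sketch of LLMLingua's structured-compression markup: sections wrapped in
# <llmlingua> tags carry their own compression settings. The attribute names
# (compress=False, rate=...) follow the project README at the time of writing;
# verify against the current docs before relying on them.
structured_prompt = (
    # Keep the speaker label verbatim:
    "<llmlingua, compress=False>Speaker 4:</llmlingua>"
    # Compress the utterance aggressively:
    "<llmlingua, rate=0.4> Thank you. And can we do the functions for content? "
    "Items I believe are 11, three, 14, 16 and 28, I believe.</llmlingua>"
)

# This tagged string is what you would pass to structured_compress_prompt().
print(structured_prompt)
```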

SecurityLingua: Compression as a defense

A newer addition, SecurityLingua, uses security-aware compression to detect jailbreak attempts.

It detects malicious intent hidden in complex prompts and defends against attacks at up to 100x less token cost than traditional defenses.

from llmlingua import PromptCompressor

securitylingua = PromptCompressor(
    model_name="SecurityLingua/securitylingua-xlm-s2s",
    use_slingua=True
)
# malicious_prompt is the untrusted user input being inspected
intention = securitylingua.compress_prompt(malicious_prompt)

This model takes a distinctive approach: instead of filtering outputs after generation, it prevents malicious instructions from reaching the model in the first place.

Integration with ecosystems

One of the reasons LLMLingua stands out is that it fits seamlessly into the modern AI ecosystem.

Rather than being a standalone research prototype, it is integrated into popular frameworks such as LangChain, LlamaIndex, and Microsoft Prompt Flow.

This means you can plug it directly into your existing RAG or document processing pipelines without rewriting code or changing your models.

For example, in LangChain, LLMLingua acts as a smart middle layer that compresses retrieved context before it reaches the LLM.

Imagine you are using a retriever to pull documents from a knowledge base. Instead of sending those long texts straight to your model, LLMLingua filters out unnecessary tokens so your prompt stays short and efficient.

Here’s how you can integrate it:

from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_community.document_compressors import LLMLinguaCompressor
from langchain_openai import ChatOpenAI


# The downstream chat model
llm = ChatOpenAI(temperature=0)

# LLMLingua compressor running GPT-2 on the CPU
compressor = LLMLinguaCompressor(model_name="openai-community/gpt2", device_map="cpu")

# Wrap an existing retriever (defined elsewhere) with compression
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Brown Jackson"
)
pretty_print_docs(compressed_docs)

In this setup, the retriever first collects the relevant documents, and LLMLingua compresses them before passing them to the LLM. The model receives a condensed but information-rich prompt, which keeps token usage low while maintaining accuracy.

This integration works out of the box with any supported model in LangChain. You can customize it to use your preferred compression rate or model variant (such as LLMLingua-2).

The result is a more efficient pipeline: your LLM reads less but understands more.

Why LLMLingua matters

LLMLingua may not make headlines like GPT-5 or Gemini, but its impact is fundamental. It targets the most expensive part of LLM workflows: context handling.

By removing redundant tokens and preserving intent, it changes how developers build scalable AI applications.

Whether you’re summarizing regulatory data, processing lengthy legal documents, or powering multilingual chatbots, LLMLingua gives you a new lever for optimization.

The takeaway is simple: the future of AI performance will come not just from bigger models, but from smarter prompts.

Conclusion

Microsoft’s LLMLingua is more than a research project. It is a quiet shift in how we feed knowledge to LLMs. It lets developers stretch context limits, reduce costs, and increase speed, without having to train a single model.

By learning to compress prompts intelligently, LLMLingua helps you communicate with machines more effectively. And in the world of large language models, saying more with less is exactly the kind of progress that matters most.

Hope you enjoyed this article. Sign up for my free newsletter at turingtalks.ai for more tutorials on AI. You can also visit my website.
