How does bag of words work – the foundation of language models

by SkillAiNest

When people talk about modern AI, they usually mean large language models like ChatGPT.

These models seem smart because they can write, respond, and describe things in natural language.

But the roots of this technology go back to something much simpler: the bag of words model. This method, introduced decades ago, was one of the earliest ways of converting text into numbers. Without it, the progress we see in natural language processing today would not have been possible.

In this article, you will learn how the bag of words algorithm works and write your own code to build a bag of words representation.

What is a bag of words?

Bag of words, often abbreviated BoW, is a way to represent text. It takes a sentence, a paragraph, or a document and treats it as a “bag” of words.

Word order, grammar, and sentence structure are ignored. Only the presence or frequency of each word matters.

Take the sentence:

The cat sat on the mat.

In bag of words form, it becomes a set of word counts:

the:2, cat:1, sat:1, on:1, mat:1.

A reordered sentence like this:

The mat sat on the cat

looks exactly the same, even though the meaning is different.

This is both the power and the weakness of BoW. It simplifies text for the computer but strips away context.
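To make this concrete, here is a minimal sketch of a bag of words counter in plain Python. It uses only the standard library; the helper name build_bow is my own choice for illustration.

from collections import Counter

def build_bow(text):
    # Lowercase, strip the period, split on whitespace, and count each word
    words = text.lower().replace(".", "").split()
    return Counter(words)

print(build_bow("The cat sat on the mat."))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})

print(build_bow("The mat sat on the cat."))
# Counter({'the': 2, 'mat': 1, 'sat': 1, 'on': 1, 'cat': 1})

Both sentences produce the same set of counts, which is exactly the loss of word order described above.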

Why bag of words was needed

Before bag of words, computers had no easy way to handle human language. Words are not numbers, and algorithms need numbers to work.

BoW gave researchers a way to convert raw text into count vectors. Once in numerical form, text could be used for classification, clustering, and other machine learning tasks.

Early applications included spam filters, where words such as “free” or “win” signal unwanted email. Search engines also used it to match queries with documents. For the first time, text could be processed at scale.
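As a rough sketch of how such a spam filter can work, the toy example below pairs bag of words counts with a Naive Bayes classifier from scikit-learn. The four example emails and their labels are invented purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set: 1 = spam, 0 = not spam
emails = [
    "win a free prize now",
    "free money click here",
    "meeting agenda for monday",
    "lunch with the team tomorrow",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)  # bag of words counts
clf = MultinomialNB().fit(X, labels)  # learns which words co-occur with spam

test = vectorizer.transform(["free prize waiting for you"])
print(clf.predict(test))  # expected: [1], i.e. flagged as spam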

A simple bag of words in Python

Here is a brief example of bag of words in action. We will take a few sentences and turn them into word-count vectors.

from sklearn.feature_extraction.text import CountVectorizer

# Three short documents to turn into word-count vectors
docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the mat was sat on by the cat",
]

# CountVectorizer builds the vocabulary and counts each word per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Document-Term Matrix:\n", X.toarray())

This will give you the following result:

Vocabulary: ['at' 'barked' 'by' 'cat' 'dog' 'mat' 'on' 'sat' 'the' 'was']
Document-Term Matrix:
 [[0 0 0 1 0 1 1 1 2 0]
 [1 1 0 1 1 0 0 0 2 0]
 [0 0 1 1 0 1 1 1 2 1]]

What you are seeing in the output is how the bag of words model transforms your sentences. The first line shows the vocabulary, which is the set of every unique word that appears across the three input sentences.

“at”, “barked”, “by”, “cat”, “dog”, “mat”, “on”, “sat”, “the”, and “was” make up this vocabulary. The words have a fixed order, and each position maps a word to a column in the document-term matrix.

The second part of the output is the document-term matrix. In this matrix, each row represents a document, and each number in the row tells you how many times the corresponding word appears in that document.

For example, the first row corresponds to the sentence “the cat sat on the mat”. It shows that “the” appeared twice, while “cat”, “sat”, “on”, and “mat” each appeared once. Every other entry in that row is zero, meaning the word never appears in this document.

This is the essence of bag of words. It reduces each sentence to a row of numbers, where meaning and grammar are ignored and only word counts remain. Instead of working with raw text, the machine now works with vectors of numbers.

It is a simple idea, but it is what first made it possible for computers to analyze and learn from language.
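If you would like to reproduce this without scikit-learn, here is a minimal sketch of the same idea in plain Python. The function name bow_matrix is my own; unlike CountVectorizer, it simply splits on whitespace and keeps every token.

def bow_matrix(docs):
    # Sorted vocabulary, mirroring CountVectorizer's alphabetical ordering
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)  # one column per vocabulary word
        for word in doc.split():
            row[index[word]] += 1
        matrix.append(row)
    return vocab, matrix

docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the mat was sat on by the cat",
]
vocab, matrix = bow_matrix(docs)
print(vocab)
for row in matrix:
    print(row)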

Where the bag of words falls short

As useful as it is, bag of words has limits.

The most obvious is that it ignores meaning. Sentences with opposite meanings can produce identical vectors (“dog chases cat” vs. “cat chases dog”).
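You can check this directly with the same vectorizer as before; a small experiment, assuming scikit-learn as above:

from sklearn.feature_extraction.text import CountVectorizer

# Opposite meanings, identical bag of words vectors
pair = CountVectorizer().fit_transform(["dog chases cat", "cat chases dog"]).toarray()
print(pair[0], pair[1])            # [1 1 1] [1 1 1]
print((pair[0] == pair[1]).all())  # True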

BoW also cannot handle synonyms. Words like “happy” and “joyful” are treated as completely different, even though they mean the same thing.

Another problem is size. If a dataset contains thousands of unique words, the vectors become very large and sparse. Most values are zero, which makes storage and computation inefficient.
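The sparsity is easy to measure, because scikit-learn returns a SciPy sparse matrix rather than a dense array. A small sketch, reusing the three documents from earlier:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the mat was sat on by the cat",
]
X = CountVectorizer().fit_transform(docs)

# nnz counts the stored non-zero entries
rows, cols = X.shape
print(f"{X.nnz} non-zero entries out of {rows * cols}")
# On a real corpus with tens of thousands of unique words,
# the fraction of non-zero entries typically falls well below 1%.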

Beyond bag of words

Bag of words inspired better methods. One improvement was TF-IDF, which gives rare but important words higher weight, and common words like “the” lower weight.
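In scikit-learn, TF-IDF is available as TfidfVectorizer, a drop-in replacement for CountVectorizer. A minimal sketch on the documents from earlier:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the mat was sat on by the cat",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# 'the' occurs in every document, so it gets the lowest possible idf;
# a word like 'barked', unique to one document, gets the highest.
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))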

Later came word embeddings such as Word2Vec and GloVe. Instead of counting words, embeddings map each word to a dense vector in which meaning and relationships are captured. Words like “king” and “queen” end up close together in this space.
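As a sketch of the idea, the gensim library (an assumption here, not used elsewhere in this article) can train a small Word2Vec model. On a corpus this tiny the vectors are meaningless, but the API is the same one used on real corpora.

from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "mat", "was", "sat", "on", "by", "the", "cat"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Every word now maps to a dense 50-dimensional vector
print(model.wv["cat"].shape)         # (50,)
print(model.wv.most_similar("cat"))  # nearest neighbours in embedding space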

Modern transformers, such as BERT and GPT, take this even further. They capture not only the meaning of a word but also its context. The word “bank” in “river bank” and in “bank account” will have different embeddings depending on the sentence. This is something bag of words could never do.
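A sketch with the Hugging Face transformers library (assumed installed, along with PyTorch) makes the difference visible: the same word “bank” receives a different vector in each sentence.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return BERT's contextual embedding of the token "bank"
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1 = bank_vector("she sat by the river bank")
v2 = bank_vector("he opened a bank account")
# Same word, different contexts: cosine similarity is clearly below 1.0
print(torch.cosine_similarity(v1, v2, dim=0))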

Why bag of words still matters

Even today, bag of words is not obsolete. For projects with limited data, it can still produce strong results.

A quick text classifier built on BoW often runs faster and needs far less computing power than training a deep neural network. It is also valuable in teaching, because it demonstrates the first step of turning raw text into something a machine can read.

More importantly, the basic idea behind bag of words lives on. Large language models still turn text into vectors. The difference is that they do it in a far more sophisticated and meaningful way.

Bag of words was the spark that made researchers realize: to process language, we first have to represent it.

Conclusion

Bag of words looks simple, perhaps even primitive, compared to the tools we use now. But it was an important turning point. It gave computers a way to see text as data, and it laid the foundation for everything that came after. Although it cannot capture deep meaning or context, it taught us how to bridge the gap between words and numbers.

Large language models may look like magic, but their roots trace back to the bag of words. The journey from counting words to transformers with billions of parameters is proof that major revolutions in technology often begin with small, simple ideas.

I hope you enjoyed this article. Sign up for my free AI newsletter at turningtalks.ai for more lessons on AI. You can also find me on LinkedIn.
