

# Introduction
Thanks to large language models (LLMs), we have impressive, incredibly useful applications today: ChatGPT, Gemini, and Claude, to name a few. However, few people realize that the underlying architecture behind an LLM is called a transformer. This architecture is carefully designed to “think”, that is, to process the data that describes human language, in a very particular way. Are you interested in gaining a broader understanding of what goes on inside these so-called transformers?
Using a gentle, approachable, and rather non-technical tone, this article examines how the transformer model behind an LLM analyzes input information, such as a user prompt, and produces coherent, meaningful, and relevant output text word by word (or, a bit more technically, token by token).
# First Steps: Making Language Comprehensible to Machines
The first key concept to understand is this: AI models do not really understand human language; they only understand and manipulate numbers, and the transformers behind LLMs are no exception. Therefore, human language, i.e. text, must be converted into a form the transformer can fully digest before it can perform any deep processing on it.
Put another way, the first few steps before entering the core, inner layers of the transformer focus on converting the raw text into a numerical representation that preserves, under the hood, the key features and characteristics of the original text. Let’s review these three steps.
Making language understandable to machines
// Tokenization
The tokenizer is the first actor on the scene. It works in conjunction with the transformer model and is responsible for chunking the raw text into smaller pieces called tokens. Depending on the tokenizer used, tokens are most often whole words, but they can also be parts of words or punctuation marks. Each token in the vocabulary has a unique numeric identifier. At this stage the text is no longer text but a sequence of numbers, i.e. token IDs, as shown in this example in which a simple tokenizer converts a text containing five words into five token identifiers, one per word:
Tokenization of text into token identifiers
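To make this concrete, here is a minimal, toy word-level tokenizer sketch in Python. The vocabulary and the ID values are purely hypothetical; real LLM tokenizers learn large subword vocabularies (for example via byte-pair encoding), but the basic idea of mapping text chunks to numeric IDs is the same.

```python
# Toy word-level tokenizer (hypothetical vocabulary and IDs, for illustration only)
vocab = {"the": 11, "cat": 54, "sat": 29, "on": 7, "mat": 83}

def tokenize(text: str) -> list[int]:
    """Split raw text on whitespace and map each word to its token ID."""
    return [vocab[word] for word in text.lower().split()]

print(tokenize("The cat sat on mat"))  # [11, 54, 29, 7, 83]: five words, five token IDs
```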
// Token Embeddings
Then, each token ID is converted into a \(d\)-dimensional vector called an embedding, which is essentially a list of \(d\) numbers. This representation of a token acts like a description of its overall meaning, whether the token is a word, part of a word, or a punctuation mark. The magic lies in the fact that tokens associated with related meanings, e.g. the words queen and empress, will be mapped to embedding vectors that are close to each other.
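Conceptually, an embedding layer is just a big lookup table whose rows are learned during training. The sketch below uses random numbers as stand-ins for learned weights, and the sizes are hypothetical; it only illustrates the lookup mechanics.

```python
import numpy as np

vocab_size, d = 100, 8                          # hypothetical vocabulary size and embedding dimension
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d))  # stand-in for learned embedding weights

token_ids = [11, 54, 29, 7, 83]                 # IDs produced by the tokenization step
token_embeddings = embedding_table[token_ids]   # look up one d-dimensional vector per token
print(token_embeddings.shape)                   # (5, 8)
```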
// Positional encoding
So far, a token’s embedding contains information in the form of a collection of numbers, yet this information still describes the token in isolation. However, in a “fragment of language” like a text sequence, it matters not only which words or tokens appear, but also where they appear in the text they are part of. Positional encoding is the process that injects into each token embedding some additional information about its position in the original text sequence, using mathematical functions.
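One classic choice of such mathematical functions is the sinusoidal encoding from the original transformer paper; many modern LLMs use learned or rotary variants instead. The sketch below shows the sinusoidal version with hypothetical sizes and a random stand-in for the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Classic sinusoidal positional encoding (one common scheme among several)."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d, 2)[None, :]               # (1, d/2)
    angles = positions / np.power(10000, dims / d)   # (seq_len, d/2)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions use cosine
    return pe

seq_len, d = 5, 8
token_embeddings = np.random.default_rng(0).normal(size=(seq_len, d))  # stand-in embeddings
x = token_embeddings + sinusoidal_positional_encoding(seq_len, d)      # inject position information
```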
# Transformations in the Core of the Transformer Model
Now that the numeric representation of each token also carries information about its position in the text sequence, it is time to enter the first layer of the main body of the transformer model. The transformer is a very dense architecture, with many stacked components replicated throughout the system. For now, just be aware that there are two types of layers in a transformer, although they have a lot in common.
Transformation through the core of the transformer model
// Multi-head attention
This is the first major subprocess within a transformer layer, and it is perhaps the most distinctive feature of transformer models compared to other types of AI systems. Multi-head attention is a mechanism that allows a token to observe, or “pay attention” to, other tokens in the sequence. It collects and incorporates useful contextual information into the token’s representation, capturing linguistic aspects such as grammatical relationships, long-distance dependencies between words that are not necessarily next to each other in the text, and semantic similarities. In short, thanks to this mechanism, diverse aspects of relevance and relationship between parts of the original text are successfully captured. As a token representation travels through this component, it acquires a more context-aware description of itself and of the text.
Some transformer architectures designed for specific tasks, such as translating text from one language to another, also use this mechanism to analyze possible dependencies between tokens across both the input text and the output (translated) text, as shown below:
Multi-head attention in translation transformers
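At the heart of each attention head is scaled dot-product attention. The sketch below shows a single head with random stand-in weights and hypothetical sizes; multi-head attention simply runs several such heads in parallel and concatenates their outputs.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a single head (simplified sketch)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv            # project tokens to queries, keys, and values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # how strongly each token relates to every other token
    weights = softmax(scores, axis=-1)          # attention weights per token
    return weights @ V                          # context-aware token representations

rng = np.random.default_rng(0)
seq_len, d, d_head = 5, 8, 4                    # hypothetical sizes
x = rng.normal(size=(seq_len, d))               # position-aware token embeddings (stand-in)
Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
context = self_attention(x, Wq, Wk, Wv)         # shape: (5, 4)
```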
// Feed-forward neural network sublayer
Simply put, after attention, the second common step within each replicated transformer layer is a set of chained neural network layers that further process the enriched token representations and help learn additional patterns in them. This step amounts to sharpening the representations, identifying and reinforcing the features and patterns that are most relevant. Ultimately, these layers are the main mechanism for the gradual learning that takes place as the entire text is processed.
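The sketch below shows a simplified position-wise feed-forward sublayer: each token representation is expanded to a larger hidden size, passed through a non-linearity, and projected back. The sizes and the ReLU choice are assumptions for illustration; real transformers also wrap both sublayers with residual connections and layer normalization, which are omitted here.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward sublayer: expand, apply non-linearity, project back."""
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU non-linearity (GELU is common in practice)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d, d_ff = 5, 8, 32               # hypothetical sizes; d_ff is typically about 4x d
x = rng.normal(size=(seq_len, d))         # output of the attention sublayer (stand-in)
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
out = feed_forward(x, W1, b1, W2, b2)     # same shape as x: (5, 8)
```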
This pass through the multi-head attention and feed-forward sublayers is repeated several times in this order: as many times as the number of replicated transformer layers the model has.
// The final destination: predicting the next word
After repeating the previous two steps several times in an alternating fashion, the token representations derived from the initial text should have given the model a very deep understanding of it, helping it recognize complex and subtle relationships. At this point, we reach the final component of the transformer stack: a special layer that transforms the final representation into a probability for every possible token. That is, based on all the information learned along the way, the model calculates for each token in its vocabulary the probability that it should be the next word output by the transformer (or LLM). The model then selects the token, or word, with the highest probability as the next piece of output generated for the user. The entire process repeats to produce each word of the model’s response.
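Here is a minimal sketch of that final step, assuming random stand-in weights and hypothetical sizes: the last hidden vector is projected onto the vocabulary, turned into probabilities with a softmax, and the highest-probability token is picked. (Production LLMs often sample from this distribution, with temperature or top-k settings, rather than always taking the maximum.)

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, vocab_size = 8, 100                       # hypothetical sizes
final_hidden = rng.normal(size=d)            # final representation of the last token (stand-in)
W_out = rng.normal(size=(d, vocab_size))     # output projection onto the vocabulary (stand-in)

logits = final_hidden @ W_out                # one score per token in the vocabulary
probs = softmax(logits)                      # probability distribution over the vocabulary
next_token_id = int(np.argmax(probs))        # greedy choice: the highest-probability token
print(next_token_id, probs[next_token_id])
```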
# Wrapping Up
This article provided a gentle, conceptual tour of the journey that text-based information takes as it flows through the signature model architecture behind LLMs: the transformer. After reading it, you will hopefully have a better understanding of what goes on inside the models behind ChatGPT.
Ivan Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in real-world applications of AI.