

Photo by Author | Canva
We use large language models in many of our daily tasks. These models have been trained on billions of online documents and diverse datasets, which enables them to understand and respond in human language. However, not all LLMs are built the same. While the basic idea is similar, they differ in their underlying architecture, and these variations have a significant impact on their capabilities. For example, as seen across various benchmarks, DeepSeek excels at reasoning tasks, Claude performs well at coding, and ChatGPT stands out in creative writing.
In this article, I will walk you through 7 popular LLM architectures to give you a clear overview, all in just a few minutes. So, let's get started.
1. BERT
Paper Link: https://arxiv.org/pdf/1810.04805
Developed by Google in 2018, BERT marked a significant shift in natural language understanding by introducing deep bidirectional attention to language modeling. Unlike previous models that read text left-to-right or right-to-left, BERT uses a transformer encoder to consider both directions simultaneously. It is trained with two objectives: masked language modeling (predicting masked words) and next sentence prediction (determining whether one sentence logically follows another). Architecturally, BERT comes in two sizes: BERT Base (12 layers, 110M parameters) and BERT Large (24 layers, 340M parameters). Its structure relies entirely on encoder stacks and includes special tokens: [CLS] to represent the entire sentence and [SEP] to separate two sentences. You can fine-tune it directly for sentiment analysis, question answering (such as SQuAD), and much more. It was the first model of its kind to truly capture the full meaning of a sentence.
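To make the masked language modeling objective concrete, here is a minimal sketch using the Hugging Face transformers library and the public bert-base-uncased checkpoint (the library, the example sentence, and the printed fields are illustrative assumptions, not part of BERT itself):

```python
# A minimal sketch of BERT's masked language modeling objective in practice,
# assuming the Hugging Face transformers library is installed.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT reads the whole sentence bidirectionally and predicts the [MASK] token.
for prediction in fill_mask("The goal of [MASK] learning is to generalize."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```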
2. GPT
Paper Link (GPT-4): https://arxiv.org/pdf/2303.08774
OpenAI introduced the GPT (Generative Pre-trained Transformer) family. The series started with GPT-1 in 2018 and evolved into GPT-4 in 2023, with the latest version, GPT-4o, released in May 2024, showcasing multimodal capabilities that handle both text and images. The models are pre-trained on very large text corpora with the standard next-token prediction objective: at each step, the model predicts the next word given all the words before it. After this unsupervised pre-training phase, the same model can be fine-tuned on specific tasks or used in a zero-/few-shot manner with minimal additional training. The decoder-only design means that GPT attends only to previous tokens, unlike BERT's bidirectional encoder. What was notable at its introduction was GPT's scale and capability: as each successive generation (GPT-2, GPT-3) grew massively, the models demonstrated very fluent text generation and few-shot learning, establishing the "pre-train and prompt/fine-tune" paradigm for large language models. However, the models are proprietary, usually accessed through APIs, and their exact architecture, especially for the current versions, has not been fully disclosed.
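As a rough illustration of decoder-only, next-token generation, the sketch below uses the openly available GPT-2 checkpoint via Hugging Face transformers as a stand-in for the GPT family (the prompt and generation settings are illustrative):

```python
# A minimal sketch of causal, next-token generation with a decoder-only model,
# assuming the transformers library is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
# The model extends the prompt one token at a time, each step conditioned
# only on the tokens to its left (no bidirectional attention).
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```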
3. Llama
Llama 4 Blog Link: https://ai.meta.com/blog/llama-4-multimodal-intelliance/
Paper Link (Llama 3): https://arxiv.org/abs/2407.21783
Llama, developed by Meta AI and first released in February 2023, is a family of open-source, decoder-only transformer models. The family spans models from 7 billion to 70 billion parameters, with the latest version, Llama 4, released in April 2025. Like GPT, Llama uses only a transformer decoder (each model is an autoregressive transformer), but with some architectural tweaks. For example, the original LLaMA models used SwiGLU activations instead of GELU, rotary positional embeddings (RoPE) instead of fixed positional embeddings, and RMSNorm in place of the usual layer normalization. The family was released in several sizes, from 7B to 65B parameters in LLaMA 1, with even larger models following in Llama 3, making large-scale models more accessible. Notably, despite relatively modest parameter counts, these models performed competitively with much larger contemporaries: Meta reported that LLaMA's 13B model outperformed OpenAI's 175B GPT-3 on several benchmarks, and its 65B model was competitive with contemporaries such as Google's PaLM and DeepMind's Chinchilla. Llama's open release (though initially limited to research use) spurred widespread community adoption. Its key novelty was combining efficient training at scale with more open access to model weights.
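To show one of these tweaks concretely, here is a minimal RMSNorm sketch in plain PyTorch (dimensions and epsilon are illustrative; this is a sketch of the technique, not Meta's exact implementation):

```python
# A minimal sketch of RMSNorm, the LayerNorm replacement used in Llama models.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm, RMSNorm only rescales by the root-mean-square;
        # it does not subtract the mean or learn a bias term.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

x = torch.randn(2, 8, 512)        # (batch, sequence, hidden)
print(RMSNorm(512)(x).shape)      # torch.Size([2, 8, 512])
```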
4. PaLM
PaLM 2 Technical Report: https://arxiv.org/abs/2305.10403
Paper Link (PaLM): https://arxiv.org/pdf/2204.02311
PaLM (Pathways Language Model) is a series of large language models developed by Google Research. The original PaLM (announced in 2022) was a 540-billion-parameter, decoder-only transformer and part of Google's Pathways system. It was trained on a high-quality corpus of 780 billion tokens across thousands of TPU v4 chips on Google's infrastructure, using heavy parallelism to achieve high hardware utilization. The model also uses multi-query attention to reduce memory-bandwidth requirements during inference. PaLM is known for its few-shot learning capabilities, performing well on new tasks with minimal examples thanks to its large and diverse training data, which includes web pages, books, Wikipedia, news, GitHub code, and social media conversations. PaLM 2, announced in May 2023, further improved multilingual, reasoning, and coding capabilities and powers applications such as Google Bard and Workspace AI features.
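A minimal sketch of the multi-query attention idea follows, in plain PyTorch. All sizes are illustrative and this is not PaLM's actual code; it only shows the core trick of many query heads sharing a single key/value head, which shrinks the KV cache at inference time:

```python
# Multi-query attention sketch: 8 query heads, but only one shared K/V head.
import torch
import torch.nn.functional as F

batch, seq, n_heads, head_dim = 2, 16, 8, 64
hidden = n_heads * head_dim

q = torch.randn(batch, seq, n_heads, head_dim)   # one set of queries per head
k = torch.randn(batch, seq, 1, head_dim)         # a single shared key head
v = torch.randn(batch, seq, 1, head_dim)         # a single shared value head

# Broadcast the shared K/V across all query heads, then attend as usual.
scores = torch.einsum("bqhd,bkhd->bhqk", q, k.expand(-1, -1, n_heads, -1))
weights = F.softmax(scores / head_dim ** 0.5, dim=-1)
out = torch.einsum("bhqk,bkhd->bqhd", weights, v.expand(-1, -1, n_heads, -1))
print(out.reshape(batch, seq, hidden).shape)     # torch.Size([2, 16, 512])
```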
5. Gemini
Gemini 2.5 Blog: https://blog.google/technology/google-deepmind/geemini-model-inking-thinting-march-2025/
Paper Link (Gemini 1.5): https://arxiv.org/abs/2403.05530
Paper Link (Gemini): https://arxiv.org/abs/2312.11805
Gemini is Google's next-generation LLM family (from Google DeepMind and Google Research), introduced in late 2023. Gemini models are natively multimodal, meaning they are built from the ground up to handle text, images, audio, video, and even code within a single model. Like PaLM and GPT, Gemini is transformer-based, but its key features include massive scale, support for extremely long contexts, and a mixture-of-experts (MoE) architecture for efficiency (introduced in Gemini 1.5). For example, Gemini 1.5 ("Pro") uses sparsely activated expert layers (hundreds of expert sub-networks, with only a few active per input) to boost capacity without a proportional increase in compute cost. The Gemini 2.5 series, launched in March 2025, built on this foundation with even deeper "thinking" capabilities. In June 2025, Google released Gemini 2.5 Flash and Pro as stable models and introduced Flash-Lite, its most cost-efficient and fastest version yet, optimized for high-throughput tasks while still offering a million-token context window and tool integrations such as search and code execution. The Gemini family comes in multiple sizes (Ultra, Pro, Nano) so it can run everywhere from cloud servers to mobile devices. The combination of multimodal pretraining and MoE-based scaling makes Gemini a flexible, highly capable foundation model.
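To make the MoE idea concrete, here is a toy top-k routing layer in plain PyTorch. The expert count, top-k value, and sizes are illustrative assumptions, not Gemini's real configuration; the point is that only the chosen experts run for each token:

```python
# Toy mixture-of-experts layer: a router picks the top-k experts per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (tokens, hidden)
        gate = F.softmax(self.router(x), dim=-1)
        weights, picks = gate.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = picks[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(4, 256)                   # 4 tokens, hidden size 256
print(TopKMoE(256)(tokens).shape)              # torch.Size([4, 256])
```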
6. Mistral
Paper Link (Mistral 7B): https://arxiv.org/abs/2310.06825
Mistral is a French AI startup that released its first LLMs in 2023. Its flagship model, Mistral 7B (September 2023), is a 7.3-billion-parameter, decoder-only transformer. Architecturally, Mistral 7B is similar to GPT-style models but includes optimizations for efficient inference: it uses grouped-query attention (GQA) to speed up decoding and sliding-window attention to handle long contexts more efficiently. In terms of performance, Mistral 7B outperformed Meta's Llama 2 13B and even delivered strong results against 34B models, while being much smaller. Mistral AI released the model under the Apache 2.0 license, making it freely available for use. Its next major release was Mixtral 8x7B, a sparse mixture-of-experts (MoE) model with eight 7B-parameter expert networks per layer. This design helped Mixtral match or beat GPT-3.5 and Llama 2 70B on math, coding, and multilingual benchmarks. In May 2025, Mistral released Mistral Medium 3, a proprietary mid-sized model aimed at enterprises. It delivers more than 90% of the score of pricier models like Claude 3.7 Sonnet on standard benchmarks, while dramatically reducing cost per token (about $0.40 per million input tokens vs. $3.00 for Sonnet). It supports multimodal tasks (text + images) and professional reasoning, and can be deployed via API or on-premises on as few as four GPUs. However, unlike the earlier models, Medium 3 is closed source, prompting criticism from the community that Mistral is drifting away from its open-source ethos. Shortly afterwards, in June 2025, Mistral introduced Magistral, its first model dedicated to reasoning. The smaller version is open under Apache 2.0, while Magistral Medium is enterprise-only. Magistral Medium scored 73.6% on AIME 2024, and the smaller version scored 70.7%, demonstrating strong math and logic skills across many languages.
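Here is a minimal sketch of a sliding-window attention mask in plain PyTorch. The tiny sequence length and window size are made up for illustration (Mistral 7B's actual window is 4,096 tokens); the idea is simply that each token attends only to itself and a fixed number of previous tokens instead of the full causal history:

```python
# Sliding-window attention mask: causal attention restricted to a local window.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True = attention allowed, False = masked out.
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # no looking ahead
    local = (i - j) < window                 # stay inside the window
    return causal & local

print(sliding_window_mask(6, window=3).int())
# tensor([[1, 0, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [0, 1, 1, 1, 0, 0],
#         [0, 0, 1, 1, 1, 0],
#         [0, 0, 0, 1, 1, 1]])
```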
7. DeepSeek
Paper Link (DeepSeek-R1): https://arxiv.org/abs/2501.12948
DeepSeek is a Chinese AI company (a spin-off of the High-Flyer AI fund, founded in 2023) that builds large LLMs. Its recent models (such as DeepSeek V3 and DeepSeek-R1) employ a sparsely activated mixture-of-experts transformer architecture. In DeepSeek V3/R1, each MoE transformer layer contains hundreds of expert sub-networks, but only a few are activated for each token. Instead of running every part of the model on every input, the model keeps hundreds of specialist networks (257 experts per layer in V3) and routes each token to only the handful it needs. This allows DeepSeek to maintain a very large total model size (over 670 billion parameters) while using only about 37 billion during each forward pass, which makes it faster and cheaper to run than a comparably sized dense model. Like other modern LLMs, it uses SwiGLU activations, rotary embeddings (RoPE), and recent optimizations (including FP8 mixed-precision training) for greater efficiency. This aggressive MoE design lets DeepSeek reach very high capacity (comparable to very large dense models) at a much lower compute cost. DeepSeek's models (released under open licenses) attracted attention by rivaling leading models such as GPT-4 in multilingual generation and reasoning, while using significantly fewer training and inference resources.
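As a back-of-the-envelope check on why this sparsity pays off, the short snippet below computes the fraction of parameters active per token from the figures quoted above (671B total, roughly 37B active); the "18x" ratio is only a rough proxy for per-token compute, not a measured benchmark:

```python
# Rough arithmetic on sparse activation using DeepSeek V3's published sizes.
total_params = 671e9      # all experts across all layers
active_params = 37e9      # parameters actually used for one token

fraction_active = active_params / total_params
print(f"Active fraction per token: {fraction_active:.1%}")            # ~5.5%

# Per-token compute scales roughly with active parameters, so the sparse
# model does far less work than a dense model of the same total size.
print(f"Approx. compute saving vs. dense 671B: ~{1 / fraction_active:.0f}x")
```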
Kanwal Mehreen is a machine learning engineer and a technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM fields.