VTeam AI

Exploring Popular Large Language Models: BERT, GPT-3, T5, and More

With ChatGPT's inception, the Gen-AI era has seen a rise in Large Language Models (LLMs) from top tech firms. The 'Attention Is All You Need' paper spurred models like BERT and GPT. Today's LLMs range from OpenAI's GPT to Meta's Llama 2 and TII's Falcon-40B. Your application needs dictate the ideal LLM choice.

Since the inception of ChatGPT, the world has finally turned its attention to Gen-AI. The impact is big across every industry. Within a few months, the world has witnessed a flurry of LLMs from tech giants such as Google, Meta, and Salesforce, and of course OpenAI, which started the revolution. Our aim with this post is to explore the best of the LLMs released so far and provide you with a curated list of LLMs to try out.

We will be exploring the following topics in this blog:

🤖 Transformers: This is where the Gen-AI revolution started

🖍️ What is Attention? The core of any LLM

🔋  How are BERT and GPT related to Transformers?

🎯 Some of the popular LLMs

  1. GPT 3
  2. GPT 3.5
  3. GPT 4
  4. Llama 2
  5. Falcon-40B
  6. T5
  7. StableLM
  8. Phi-1
  9. PaLM 2
  10. LaMDA

⚖️ Which LLM to choose

🧭 Sample code to run GPT-2 on local for text autocompletion

But before we jump ahead, we will trace back to the revolutionary 2017 paper at the core of this Gen-AI trend, ‘Attention Is All You Need’. This paper introduced the Transformer, the architecture that every LLM discussed here builds on.


In fact, all the LLMs released to date are either bigger transformers or some part of one.

So, without wasting time, let's quickly understand Transformers and the LLMs that came after them.


A transformer is basically a combination of two neural networks, called an Encoder and a Decoder (shown on the left and right, respectively, in the paper's architecture diagram). A special concept introduced in this paper that took the world by storm was ‘Attention’. What is it?

The attention mechanism functions as a dynamic mechanism that assigns varying degrees of importance to specific key tokens within the input sequence by modifying the token embeddings. In every sentence, certain keywords encapsulate the essence of the message conveyed. For instance, in the sentence "He is a good boy," the word "is" does not carry the same significance as "boy" in understanding the overall meaning. To enable our model to concentrate on these crucial words within a sentence, the attention layer proves to be highly beneficial and enhances the output quality produced by neural networks.

The Multi-Head Attention Layer can be visualized as a collection of parallel Attention Layers, allowing us to comprehend various facets of a sentence or language. Think of it as having different individuals interpreting a shared question. Each person perceives the question differently and responds accordingly, with their answers potentially differing from one another. This Multi-Head Attention layer is a crucial part of any LLM.
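To make the idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention and a simplified multi-head wrapper. It is an illustration under stated assumptions, not the paper's implementation: the embeddings are made-up numbers, and the learned projection matrices that real models apply to build queries, keys, and values for each head are left out, so each head simply looks at its own slice of the embedding.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key decides how much each token matters.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # The output is a weighted mix of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_attention(embeddings, num_heads):
    """Run self-attention on each slice of the embedding, then concatenate."""
    dim = len(embeddings[0])
    assert dim % num_heads == 0, "embedding size must divide evenly across heads"
    head_dim = dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        part = [vec[h * head_dim:(h + 1) * head_dim] for vec in embeddings]
        head_outputs.append(attention(part, part, part))
    # Stitch the heads back together, token by token.
    return [sum((head[i] for head in head_outputs), [])
            for i in range(len(embeddings))]

# Hypothetical 4-dimensional embeddings for the tokens of "He is a good boy".
tokens = ["He", "is", "a", "good", "boy"]
embeddings = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.1, 0.0, 0.1], [0.0, 0.1, 0.1, 0.0],
              [0.8, 0.3, 0.9, 0.4], [1.0, 0.2, 0.7, 0.9]]
contextual = multi_head_attention(embeddings, num_heads=2)
for token, vector in zip(tokens, contextual):
    print(token, [round(x, 2) for x in vector])
```

Each output vector is a blend of all the input vectors, weighted by how strongly the token relates to the others. Splitting the work across heads lets each head weigh those relationships differently, which is the "different individuals interpreting a shared question" picture from above.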

How does a transformer work?

  1. Input text is tokenized and converted into embeddings.
  2. Positional encodings are added to the embeddings so the model knows the order of the tokens.
  3. The encoder processes the input using self-attention and feed-forward networks.
  4. The decoder generates the output sequence with masked self-attention.
  5. Cross-attention allows the decoder to focus on relevant input parts.
  6. Output probabilities are computed using softmax.
  7. The model is trained on paired input and target sequences.
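Steps 4 and 6 above can be sketched in a few lines of plain Python. This is a toy illustration, not a full decoder: the attention scores, logits, and three-word vocabulary are hypothetical values chosen only to show the mechanics of the causal mask and the final softmax.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Masked self-attention: position i may attend only to positions 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = softmax(row[: i + 1])  # future positions are left out
        weights.append(visible + [0.0] * (len(row) - i - 1))
    return weights

def next_token_probabilities(logits, vocab):
    """Softmax turns the decoder's raw scores into a probability per word."""
    return dict(zip(vocab, softmax(logits)))

# Hypothetical raw attention scores for a 3-token output sequence.
scores = [[0.2, 1.5, 0.3], [0.9, 0.1, 2.0], [0.4, 0.4, 0.4]]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])

# Hypothetical vocabulary and logits for the next position.
probs = next_token_probabilities([2.0, 0.5, -1.0], ["boy", "girl", "car"])
print(max(probs, key=probs.get))
```

The mask is what lets the decoder be trained on whole target sequences at once without peeking at the tokens it is supposed to predict.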

Moving ahead,

  • The Encoder component of the Transformer evolved into the BERT (Bidirectional Encoder Representations from Transformers) architecture
  • The Decoder component evolved into GPT (Generative Pre-trained Transformer). A variant of GPT is used by ChatGPT.

How are BERT & GPT different from Transformers?

  1. Directionality: BERT is trained bidirectionally, taking into account both the left and right context of each word. GPT, like the original Transformer decoder, is unidirectional: each token attends only to the tokens to its left.
  2. Pre-training Objective: BERT employs a masked language model, where a portion of the input tokens in each training example is masked and the model learns to predict the masked tokens from the surrounding words. GPT is instead pre-trained to predict the next token from the preceding ones (causal language modeling). Both objectives let the models learn contextual representations of words from unlabeled text.
  3. Language Understanding, not Task-Specific: The objective of both GPT & BERT is to develop a deep understanding of language in a general sense, rather than specializing in specific tasks. By learning from a vast amount of text data, they acquire a broader understanding of language, making them more versatile for transfer learning.
  4. Transfer Learning for Downstream Tasks: Both pre-trained models can be fine-tuned on specific downstream tasks, such as sentiment analysis, question answering, or named entity recognition. This fine-tuning process enables GPT & BERT to adapt their knowledge to different tasks, achieving state-of-the-art results in various NLP applications.
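The two pre-training styles are easiest to tell apart by looking at how training examples are built. The sketch below assumes a BERT-style masked objective and a GPT-style next-token objective; the function names, the `[MASK]` placeholder handling, and the 15% default masking rate are simplifications for illustration (real BERT pre-training also sometimes swaps in random tokens or leaves the chosen token unchanged).

```python
import random

MASK = "[MASK]"

def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """BERT-style: hide some tokens; the model sees context on both sides."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # this position is not scored
    return inputs, labels

def causal_lm_examples(tokens):
    """GPT-style: predict each token from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = ["He", "is", "a", "good", "boy"]
print(masked_lm_example(sentence, mask_prob=0.4))
for context, target in causal_lm_examples(sentence):
    print(context, "->", target)
```

The causal examples show why a GPT-style model can generate text: at inference time it keeps predicting the next token and appending it to the context.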

Because GPT is trained to predict the next token, it can also generate text, which BERT's encoder-only design is not built for. Hence, most of today's LLMs are scaled-up, decoder-only models in the style of GPT.


Now that we know the baseline models, let's move on to the SOTA LLMs.


GPT 3

  1. GPT-3 is a powerful language model developed by OpenAI and the third generation of the GPT series of models.
  2. Large Scale Model: GPT-3 is the largest neural network language model ever created (at that time), surpassing its predecessor GPT-2 in size. It contains a staggering 175 billion trainable parameters.
  3. Task-Agnostic Nature: GPT-3 is a task-agnostic language model, meaning it is not specifically designed for a particular task. Instead, it is trained using vast amounts of internet data and can perform a wide range of natural language processing tasks, such as language translation, text classification, sentiment analysis, reading comprehension, question-answering, news article generation, and more.
  4. Few-Shot and Zero-Shot Learning: One of the most remarkable features of GPT-3 is its ability to perform new tasks, even ones it has never been explicitly trained on, with just a few examples or sometimes even none at all. It demonstrates strong few-shot and zero-shot learning capabilities, making it highly adaptable and versatile.
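Few-shot and zero-shot use come down to how the prompt is written. The helper below is a hypothetical sketch: the `Input:`/`Output:` layout and the function name are illustrative choices, not a format required by GPT-3 or any API.

```python
def build_prompt(task, examples, query):
    """Build a prompt; an empty examples list gives a zero-shot prompt."""
    lines = [task, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

task = "Classify the sentiment of each review as Positive or Negative."

# Zero-shot: only the instruction and the new input.
print(build_prompt(task, [], "The battery died after a day."))
print("-" * 40)

# Few-shot: a couple of worked examples before the new input.
demos = [("I love this phone.", "Positive"),
         ("The screen cracked in a week.", "Negative")]
print(build_prompt(task, demos, "The battery died after a day."))
```

The model weights do not change in either case; the examples simply become part of the text the model continues.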

GPT 3.5

  1. GPT-3.5 is a language model released by OpenAI and is an improved version of its predecessor, GPT-3. It was launched as an upgrade to address specific limitations of GPT-3 and enhance its capabilities. It is the model behind ChatGPT.
  2. GPT-3.5 is a marked improvement over GPT-3, largely through instruction tuning and reinforcement learning from human feedback (RLHF). This enables the model to better understand context and distinguish nuances, resulting in more accurate and coherent responses.
  3. One of the significant enhancements in GPT-3.5 is its linguistic finesse. It has a greater ability to understand and generate different dialects and respond to emotions expressed in the text.
  4. GPT-3.5 has also been optimized for information synthesis. It can answer complex questions by synthesizing information from multiple sources, providing more comprehensive and nuanced answers.
  5. In terms of capacity, GPT-3.5 (gpt-3.5-turbo) launched with a context window of 4,096 tokens, roughly 3,000 words, double the 2,048 tokens of the original GPT-3. A 16K-token variant followed later; the 32,000-token window (approximately 25,000 words) arrived with GPT-4.
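Word counts quoted alongside token limits come from a rule of thumb rather than an exact conversion. The sketch below assumes the ratio implied by "32,000 tokens is approximately 25,000 words" (about 0.78 words per token for English text). The helper names are hypothetical, and a real tokenizer will give different counts depending on the text.

```python
import math

WORDS_PER_TOKEN = 25_000 / 32_000  # about 0.78, a rough English-text assumption

def approx_words(token_limit):
    """Estimate how many words fit in a given token budget."""
    return round(token_limit * WORDS_PER_TOKEN)

def approx_tokens(word_count):
    """Estimate how many tokens a given number of words will use."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

def fits(text, token_limit):
    """Rough check of whether a text fits within a context window."""
    return len(text.split()) <= approx_words(token_limit)

for limit in (4_000, 32_000):
    print(limit, "tokens is roughly", approx_words(limit), "words")
```

For anything close to the limit, counting with the model's actual tokenizer is the safer choice.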


GPT 4

  1. GPT-4 is the latest language model released by OpenAI in the GPT series and is available with the paid ChatGPT Plus subscription.
  2. One of the significant advancements in GPT-4 is its multimodal capabilities. Unlike its predecessors, GPT-4 can handle both text and image inputs, making it more versatile in processing various modes of input data. This means the AI can accept images as input and interpret and understand them similarly to text prompts, enabling it to handle more complex and nuanced inputs.
  3. GPT-4 demonstrates human-level performance on various professional and academic benchmarks. It has been tested on exams and standardized tests such as the SAT, the bar exam, and the GRE, and it performed well, outperforming its predecessor GPT-3.5.
  4. The new model has a significantly larger context window than GPT-3.5. OpenAI describes GPT-4 as able to handle over 25,000 words of text, while GPT-3.5 launched with a window of about 4,096 tokens, roughly 3,000 words. This larger limit allows users to provide more detailed and extensive prompts, which gives the model more information to work with and can result in lengthier and more comprehensive outputs.


Llama 2

  1. Meta has released Llama 2, which is the next generation of its open-source large language model (LLM). Llama 2 is a family of state-of-the-art open-access language models that has been trained on 40% more data than its predecessor, Llama 1, and has double the context length. It is available in three sizes: 7B, 13B, and 70B parameters.
  2. Text-Only Model: Llama 2 is a generative, text-only model. Unlike GPT-4, it accepts only text prompts, not images. The family also includes Llama 2-Chat variants that are fine-tuned for dialogue.
  3. Availability: Llama 2 is available for research and commercial use, and the model can be downloaded free of cost.
  4. Performance and Language Support: Llama 2 has undergone rigorous testing and performs well on various external benchmarks, including reasoning, coding, proficiency, and knowledge tests.
  5. Focus on Safety and Responsible AI: Meta emphasizes safety and responsible AI with Llama 2. The company wants to ensure that Llama 2 is used responsibly and not misused. Users need to fill out a download request and agree to Meta's acceptable use policy before downloading the model.


Falcon-40B

  1. Model Architecture and Parameters: Falcon-40B, developed by the Technology Innovation Institute (TII), is a powerful large language model (LLM) with a causal decoder-only architecture. It has an impressive 40 billion parameters, making it one of the largest publicly available LLMs.
  2. Training Data: Falcon-40B was trained on a massive dataset consisting of 1 trillion tokens from RefinedWeb, which was enhanced with curated corpora. This extensive training data contributes to its ability to generate high-quality text and perform well in various language tasks.
  3. Performance and Ranking: Falcon-40B has earned its place as one of the top-ranked AI models on the Hugging Face Open LLM Leaderboard. It has outperformed other notable LLMs like LLaMA, StableLM, RedPajama, and MPT.
  4. Open-Source and Deployment: One of the significant advantages of Falcon-40B is its transparent, open-source nature. As an open-source model, it offers researchers and developers the freedom to use and fine-tune the model for various downstream tasks.


T5

  1. T5 (Text-to-Text Transfer Transformer) is a powerful language model from Google that was introduced in the research paper titled "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." T5 is an encoder-decoder model that is pre-trained on a mixture of unsupervised and supervised tasks and is capable of handling a wide range of language understanding tasks.
  2. Unified Framework: T5 presents a unified framework for transfer learning in natural language processing (NLP) by converting every language problem into a text-to-text format. This means that all language tasks, including translation, summarization, question-answering, text classification, and more, are represented as text-generation problems.
  3. Pretraining and Architecture: The pretraining of T5 includes both supervised and self-supervised training.
  4. Flexible Task Handling: T5 demonstrates flexibility in handling different tasks by using task-specific prefixes for input texts. By prepending a specific prefix to the input, T5 can be utilized for various tasks without requiring extensive fine-tuning. For example, translation tasks can be achieved by using the prefix "translate English to German," and summarization tasks can be performed with the prefix "summarize". This approach simplifies the utilization of the model for different tasks.
  5. State-of-the-Art Performance: Through a systematic study and exploration of various transfer learning techniques, T5 has achieved state-of-the-art results on numerous language understanding benchmarks, covering tasks like summarization, question-answering, and text classification.
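The text-to-text idea means the task is selected by the input string itself. The sketch below only builds those input strings using the two prefixes mentioned above; the dictionary keys and the function name are hypothetical, and no model is loaded or run.

```python
# Task prefixes as used in the T5 text-to-text format (prefix, colon, space).
T5_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}

def to_text_to_text(task, text):
    """Turn any supported task into a plain input string for a T5-style model."""
    if task not in T5_PREFIXES:
        raise ValueError(f"no prefix registered for task {task!r}")
    return T5_PREFIXES[task] + text

print(to_text_to_text("translate_en_de", "He is a good boy."))
print(to_text_to_text("summarize", "Large language models are built on transformers."))
```

Because every task shares one input and output format, a single model and a single training loop can cover translation, summarization, and classification alike.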


StableLM

  1. Open-Source Language Model: StableLM is an open-source suite of large language models (LLMs) developed by Stability AI.
  2. Different Versions and Parameters: The Alpha version of StableLM is available in two variants, one with 3 billion parameters and the other with 7 billion parameters. Additionally, Stability AI is working on models with larger parameter counts, including 15 billion, 30 billion, 65 billion, and even a massive 175 billion parameter model. The model's capacity to deliver high performance with relatively fewer parameters is noteworthy.
  3. Transparency and Accessibility: One of the key features of StableLM is its transparent and open-source nature, allowing researchers to inspect and verify model performance and develop interpretability techniques.
  4. Training Dataset: StableLM is trained on an experimental dataset that is three times larger than The Pile, containing a staggering 1.5 trillion tokens of content.


Phi-1

  1. Phi-1 is a large language model (LLM) developed by Microsoft Research that specializes in Python coding tasks. It is a Transformer-based language model for code with 1.3 billion parameters.
  2. Focus on High-Quality Data: The research team at Microsoft investigated the impact of high-quality data on enhancing the performance of Phi-1 and other large language models. They utilized "textbook quality" data for training the model, including data filtered from The Stack and StackOverflow datasets, along with synthetic generation using GPT-3.5 and GPT-4.
  3. Despite its smaller size compared to competing models, Phi-1 outperformed larger models in benchmarks, even those using 100 times more data.
  4. Limitations and Specialization: While Phi-1 showed promising results, it does have some limitations. Its specialization in Python programming makes it less versatile compared to larger LLMs that have domain-specific knowledge, such as programming with specific APIs. Additionally, Phi-1's structured nature might make it less robust to style variations or input errors in prompts.


PaLM 2

  1. PaLM 2 is developed by Google and is the successor to the original PaLM, which was announced in 2022. This advanced language model builds on Google's fundamental research and infrastructure, is highly capable of performing various tasks, and is the model behind Bard.
  2. Multilingual and Multifunctional: PaLM 2 is trained on a massive dataset of text and code, spanning more than 100 languages and 20 programming languages. It has the ability to perform diverse tasks, including natural language understanding, generation, and translation, code generation, and logical reasoning.
  3. Improvements over PaLM 1: Google has made significant advancements with PaLM 2 compared to its predecessor, PaLM 1, which was announced in April 2022. PaLM 2 is much stronger in logic and reasoning, and it has been trained on a wider range of texts, making it proficient in various language-related tasks.


LaMDA

  1. LaMDA (Language Model for Dialogue Applications) is a breakthrough conversation technology developed by Google. It is a machine learning-powered chatbot designed to engage in free-flowing conversations about a wide range of topics, making it more versatile and adaptive than traditional chatbots.
  2. Core Technology: LaMDA is built on the Transformer neural network architecture. This architecture allows LaMDA to read and process words, predict the context, and generate responses.
  3. Open Domain Model: LaMDA is an "open domain" model, meaning it doesn't require separate training for different conversations or subjects. It can seamlessly switch between topics and provide coherent responses across various domains.


We have discussed many LLMs, but which one should you choose for your business? We can’t answer that for you, but we can help you consider the factors that should decide your final choice:

  1. Review the performance benchmarks of different LLMs. Pay attention to how well they perform on various NLP tasks, such as natural questions, common-sense reasoning, mathematical reasoning, etc.
  2. Evaluate how many parameters each LLM has. Smaller models with fewer parameters might outperform larger ones in certain scenarios, as demonstrated by Meta AI's LLaMA-13B, which achieved competitive performance despite having more than 10 times fewer parameters than the 175-billion-parameter GPT-3.
  3. Understand the underlying model architecture used by each LLM, such as transformer models, which have been fundamental in the NLP advancements.
  4. Consider the training data used for each model. Some LLMs are trained on vast amounts of data from multiple languages, enabling them to handle diverse linguistic contexts.
  5. Check the accessibility of each LLM. Some models may only be available for researchers, while others might have API access or be fully hosted.
  6. Pay attention to the license restrictions, as some models might have limitations on commercial usage, while others might have more permissive licensing.
  7. Assess the ease of fine-tuning each LLM for your specific tasks. Some models might be more straightforward to adapt to specific use cases and might require less labeled data for fine-tuning.
  8. Consider the training time and resource requirements for each LLM. Smaller models might require less training time and computational resources, making them more practical for certain applications.
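One way to apply these factors is to write your constraints down and filter a candidate list against them. The sketch below is purely illustrative: the `Candidate` fields, the thresholds, and the `shortlist` function are hypothetical, and the entries use only the parameter counts and availability mentioned in this post. A real evaluation would add benchmark scores, license terms, and cost.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    params_billion: float
    open_weights: bool  # True if the weights can be downloaded and self-hosted

CANDIDATES = [
    Candidate("GPT-3", 175, False),
    Candidate("Llama 2 70B", 70, True),
    Candidate("Falcon-40B", 40, True),
    Candidate("Llama 2 7B", 7, True),
]

def shortlist(candidates, max_params_billion, need_open_weights):
    """Keep models within the size budget, smallest first."""
    keep = [c for c in candidates
            if c.params_billion <= max_params_billion
            and (c.open_weights or not need_open_weights)]
    return sorted(keep, key=lambda c: c.params_billion)

# Hypothetical constraints: self-hosted, at most 50B parameters.
for model in shortlist(CANDIDATES, max_params_billion=50, need_open_weights=True):
    print(model.name, model.params_billion)
```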

We will end this post with sample code that loads a pretrained GPT-2 model using Hugging Face and runs it locally for text generation. Don’t expect ChatGPT-level performance, but even GPT-2 does a decent job. Let’s check it out.

[Image: code screenshot]
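For readers who want something to copy, here is a minimal sketch that follows the same steps described below. It is a reconstruction, not the original screenshot: it assumes the Hugging Face `transformers` library with PyTorch installed, uses the public `gpt2` checkpoint, and the function names and prompt are hypothetical.

```python
def load_gpt2(model_name="gpt2"):
    """Download (on first use) and load the GPT-2 tokenizer and model."""
    # Imported here so the rest of the file works without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model

def autocomplete(prompt, tokenizer, model, max_new_tokens=16):
    """Extend the prompt by up to max_new_tokens tokens and return the text."""
    # Convert the text into token IDs (PyTorch tensors).
    inputs = tokenizer(prompt, return_tensors="pt")
    # Generate new tokens that continue the input.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
    )
    # Convert the generated token IDs back into readable text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

To try it, call `tokenizer, model = load_gpt2()` and then `print(autocomplete("The weather today is", tokenizer, model))`. The first call downloads the model weights, so it needs an internet connection.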

So the above code

  • Loads a GPT-2 model from Hugging Face
  • Loads tokenizer for GPT-2
  • Passes text for autocompletion to tokenizer
  • Using the model, generates 16 new tokens extending the input
  • Using the tokenizer, the tokens generated by the model are converted to readable format

See the output for yourself

[Image: output screenshot]

With this, we wrap up this long blog post. We’ll be back soon with something really exciting about recent developments in the field of AI. Stay tuned.

Disclaimer: The views and opinions expressed in this blog post are solely those of the authors and do not reflect the official policy or position of any of the mentioned tools. This blog post is not a form of advertising and no remuneration was received for the creation and publication of this post. The intention is to share our findings and experiences using these tools and is intended purely for informational purposes.