Sergii Shelpuk

LLM Practitioner's Guide: Llama-2 Prompt Structure

Llama-2, a family of open-access large language models released by Meta in July 2023, became a model of choice for many who cared about data security and wanted to develop their own custom large language model instead of relying on third-party generic ones.

We've been deeply involved with customizing, fine-tuning, and deploying Llama-2. Through the 'LLM Practitioner's Guide' post series, we aim to share our insights on what Llama-2 can and can't do, along with detailed instructions and best practices for fine-tuning.


In today's post, we will explore the prompt structure of Llama-2, a crucial component for inference and fine-tuning.

Llama-2 Prompt Structure

LLM Development Stages

Modern large language models (LLMs) like ChatGPT, Llama-2, Falcon, and others all function based on the same underlying principle: they predict the next word in a sequence of words. This process, which is purely statistical in nature, creates an illusion of human-level intelligence and understanding.

Let's look at an example. Consider the phrase:

"To be or not to ..."

What do you think the next word is?

Most likely, you'll think of "be". These famous words from Shakespeare are hard to forget, and it's difficult to imagine any other word fitting here.

A simpler example is "I beg your ...". Here, most English speakers would immediately think of "pardon" as the logical next word.

But it's not just about predicting a single word. If you add "be" to the sequence and then ask, "What's the next word?" most people familiar with English literature would suggest "that". Continuing this process, adding "that" leads to the next prediction of "is", and so on. Eventually, you reconstruct the famous line: "To be, or not to be, that is the question."

Predicting the next word becomes more challenging in less clear-cut cases, but this is precisely what ChatGPT, Llama-2, and others do. However, they predict token IDs instead of actual words.
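To make this concrete, here is a minimal sketch of that next-token prediction loop, written with the Huggingface transformers library that we will use throughout this post. We use the small, openly available GPT-2 model purely for speed; any causal language model behaves the same way, and the exact continuation it produces is not guaranteed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small causal language model purely for illustration.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

input_ids = tokenizer('To be or not to', return_tensors='pt')['input_ids']

for _ in range(5):
    with torch.no_grad():
        logits = model(input_ids).logits      # a score for every token ID in the vocabulary
    next_id = logits[0, -1].argmax()          # greedily pick the most likely next token ID
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))         # the original phrase plus five predicted tokens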

Let's refer to a slide from Andrej Karpathy's presentation to understand how these LLMs are developed and trained. He offers a fantastic one-hour lecture on this topic, which I highly recommend to those with the time to watch.

State of GPT by Andrej Karpathy, BRK216HFS

Pretraining

Pretraining is the first step in creating an AI language model. It involves feeding a brand new model a massive amount of text data, usually gathered from the Internet.

The learning process is relatively straightforward. The model is taught to guess what word comes next in a sentence or to fill in a missing word if it's intentionally hidden (imagine a game of fill-in-the-blanks). This is done with billions of sentences taken from the vast data pool.

The model learns an incredibly rich and nuanced understanding of language through this seemingly simple word prediction task. This way, we get what's known as the 'base model.'

Supervised Finetuning

An interesting aspect of these base models is that you can trick them into performing tasks by framing those tasks as document completion. Here is an example from the 2019 GPT-2 paper.

Language Models are Unsupervised Multitask Learners, 2019

Yet, since we want the model to perform various tasks and not merely complete whatever text it receives, we can train it further. At this stage, we need to show the model what the right completion is, not just for Internet-scraped data but for the task at hand.

To achieve this, OpenAI recruited numerous contractors. Their job was to write the desired response to a given prompt. The model was then trained to imitate these contractors' answers. After this stage, the evolved model is known as the SFT (Supervised Fine-Tuning) model.

Reward Modeling and Reinforcement Learning

In the following stage, OpenAI's team compiled another dataset. This involved evaluating various responses ('completions') the model generated for the same prompts. For instance, the model might offer three solutions to a programming challenge, and a human expert assesses and ranks them.

State of GPT by Andrej Karpathy, BRK216HFS

The model is then trained to refine its responses. It learns to predict and produce answers that are more likely to receive higher rankings from human evaluators. This training method leads to more advanced models, known as chat or instruct models. ChatGPT, for example, is an 'instruct' model developed by further training a GPT-3.5 base model.

Crucially, the Supervised Fine-Tuning and Reinforcement Learning stages involve passing the data through the model in a specific structure. Using, and especially fine-tuning, these models requires understanding these structures to fully harness the models' potential.

Getting back to Llama-2

Now, let us get back to Llama-2.

First, we import the model from the transformers library. The library is developed and maintained by Huggingface, a marvelous open-source community and the default destination for modern-day AI scientists and engineers.

Here are the Llama-2 models available on Huggingface.

  • Llama-2-7b - base Llama-2 model weights.

  • Llama-2-7b-hf - base Llama-2 model integrated into the Huggingface transformers library.

  • Llama-2-7b-chat-hf - chat Llama-2 model fine-tuned for responding to questions and task requests and integrated into the Huggingface transformers library.

The 7b part of the model name indicates the number of model weights. Huggingface provides all three Llama-2 variants in each of the three sizes released by Meta:

  • 7b - 7 billion weights

  • 13b - 13 billion weights

  • 70b - 70 billion weights

Let us explore actual Llama-2 models and see how they work.

Llama-2 base model

First, let us load the Llama-2-7b base model by following the instructions provided on the Huggingface model page. By setting device_map="auto", we ask the loader to distribute the model efficiently across all available GPU devices.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', device_map="auto")

Tokenizer overview

The only data type machine learning models, including LLMs, inherently understand is numerical data. Everything else, such as images, sound, and text, must be represented as numbers to be usable by machine learning models. With LLMs, tokenizers are the objects responsible for converting text into numbers.

Let us load the Llama-2 tokenizer. We will also set use_fast=True to use the fast (Rust-based) tokenizer implementation.

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', use_fast=True)

The tokenizer breaks the words down into common parts called tokens and then represents every token with a corresponding number. Let us see how it works.

Let us ask the Llama-2 tokenizer to break the sentence “Meta developed and publicly released the Llama 2 family of large language models” into tokens.

> print(tokenizer.tokenize('Meta developed and publicly released the Llama 2 family of large language models'))

['▁Meta', '▁developed', '▁and', '▁public', 'ly', '▁released', '▁the', '▁L', 'l', 'ama', '▁', '2', '▁family', '▁of', '▁large', '▁language', '▁models']

Now, here are the corresponding integer numbers for each of these tokens.

> sentence = 'Meta developed and publicly released the Llama 2 family of large language models'
> print(tokenizer(sentence))

{'input_ids': [1, 20553, 8906, 322, 970, 368, 5492, 278, 365, 29880, 3304, 29871, 29906, 3942, 310, 2919, 4086, 4733], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Note that the tokenizer also provides us with the attention mask. The attention mask indicates whether the model should pay attention to the corresponding token. If we need to pad the sentence representation to a certain length, the tokenizer will set the attention_mask for the padded tokens to zero so that the model does not learn how to pad sentences (we can do that ourselves).
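Here is a quick sketch of what padding looks like in practice. Note that the Llama-2 tokenizer does not define a padding token out of the box, so one common approach is to reuse the end-of-sequence token for this purpose.

# Reuse the end-of-sequence token as the padding token (the tokenizer has no pad token by default).
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ['Meta developed Llama 2',
     'Llama 2 is a family of large language models'],
    padding=True,            # pad the shorter sentence to the length of the longer one
    return_tensors='pt',
)
print(batch['attention_mask'])   # the shorter sentence has zeros at its padded positions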

Tokenization is perfectly reversible, and we can ask the tokenizer to convert the list of token IDs back into human language.

> tokenizer.decode(tokenizer(sentence)['input_ids'])

'<s> Meta developed and publicly released the Llama 2 family of large language models'

Notice the <s> token we get after reversing the tokenization. Where does it come from?

The model knows nothing about the world or the structure of user input, so we need to enforce some structure ourselves. <s> is the special token that tells Llama-2 that this is the beginning of the input.
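If you are curious, you can inspect these special tokens and their IDs directly on the tokenizer (the values shown are those of the standard Llama-2 tokenizer):

> print(tokenizer.bos_token, tokenizer.bos_token_id)
<s> 1
> print(tokenizer.eos_token, tokenizer.eos_token_id)
</s> 2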

Now, let us do something with this model: let us ask it to explain how language models work.

> request = 'What is the language model, and how does it work?'
> print(tokenizer.tokenize(request))

['▁What', '▁is', '▁the', '▁language', '▁model', ',', '▁and', '▁how', '▁does', '▁it', '▁work', '?']

The tokenizer nicely breaks the sentence into twelve tokens.

So far, so good. Let us turn these tokens into numbers.

> inputs = tokenizer(request, return_tensors="pt")
> print(inputs)

{'input_ids': tensor([[1, 1724, 338, 278, 4086, 1904, 29892, 322, 920,   947, 372, 664, 29973]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

As expected, the twelve-token sentence becomes thirteen integers: twelve for the sentence, plus the "input start" token <s> represented by the integer 1. Now, let us pass this through the LLM.

> outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=1024)

Generation takes some time. We can limit it by reducing max_new_tokens, which defines the maximum number of tokens the model can generate.
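Since generation time grows with the number of new tokens, it can be handy to check how many tokens the model actually produced. Here is a small check using the tensors we already have; the exact number depends on when the model emits its end-of-sequence token.

> print(outputs.shape[1] - inputs["input_ids"].shape[1])   # number of newly generated tokens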

Let us see how the model answered our request.

> print(outputs)

tensor([[ 1, 1724, 338, ..., 5618, 338, 263]], device='cuda:0')

The model returns numbers (LLMs only understand numbers, remember). Let us use the tokenizer to transform these numbers back into text. For the sake of this demonstration, we set skip_special_tokens to False since we want to observe the complete model behavior.

> response = tokenizer.decode(outputs[0], skip_special_tokens=False)
> print(response)

<s> What is the language model, and how does it work?
What is a language model in machine learning?
What is a language model example?
How do you use language model?
What is language model in NLP?
How do you train a language model?
How do you build a language model?
How do you train a language model in Python?
What is a language model in NLP quizlet?
What is a language model in machine learning quizlet?
What is the language model in NLP?
How do you create a language model?
What is a language model in NLP?
How do you create a language model?
What is a language model in NLP?
How do you build a language model in NLP?
How do you train a language model in NLP?
How do you create a language model in Python?
…

The output starts with our input: 'What is the language model, and how does it work?'. Observe that instead of answering our question, the model responds with more questions about language models. Why?

Remember that we are exploring the base model, not the chat model. The base model was trained to predict the next word on a vast set of Internet data, and the Internet is full of web pages listing Q&As or similar structures containing lists of questions. Following our input, the model generated an "average Internet page" starting from our question. So, the base model behaves as expected.

As you can see, the base model is not of much use by itself. It needs to be trained further. Now, let us explore the Llama-2 chat model.

Llama-2 chat model

Let us load Llama-2-7b-chat-hf, the Llama-2 chat model, and its tokenizer.

The tokenizer is likely identical to the base model's, but it is always good practice to load the tokenizer that corresponds to the model. Since the model operates on numbers only and the tokenizer is responsible for translating words into numbers, using a tokenizer the model was not trained with will break the model's behavior.

> model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', device_map="auto")
> tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf', use_fast=True)

Now, let us ask the chat model the same question: 'What is the language model, and how does it work?'

> inputs = tokenizer(request, return_tensors="pt").to('cuda')
> outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=1024)
> response = tokenizer.decode(outputs[0], skip_special_tokens=False) 
> print(response)

<s> What is the language model, and how does it work?
A language model is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to predict the likelihood of a given sequence of words or characters. The model learns to predict the next word in a sequence, given the previous words, by analyzing the patterns and structures of language.
There are different types of language models, such as: 
1. Recurrent Neural Networks (RNNs): These models use a loop of recurrent connections to capture the sequential dependencies in language.
2. Long Short-Term Memory (LSTM) networks: These models use a special type of RNN that can learn long-term dependencies in language.
3. Transformers: These models use self-attention mechanisms to capture the relationships between different words in a sequence.
The process of how a language model works can be broken down into the following steps:
1. Data Preprocessing: The training data is preprocessed by tokenizing the text, removing stop words, punctuation, and converting all words to lowercase.
2. Model Training: The preprocessed data is then fed into the language model, which is trained using a loss function such as cross-entropy. The model learns to predict the next word in a sequence, given the previous words.
3. Model Evaluation: Once the model is trained, it is evaluated on a test set to measure its performance. The evaluation metrics used to measure the performance of a language model include perplexity, accuracy, and F1 score.
4. Model Deployment: Once the model is trained and evaluated, it can be deployed in various applications such as language translation, language generation, and text summarization.
Some of the applications of language models include:
1. Language Translation: Language models can be used to translate text from one language to another.
2. Language Generation: Language models can be used to generate new text that is similar in style and structure to a given input text.
3. Text Summarization: Language models can be used to summarize long documents or articles by identifying the most important phrases and sentences.
4. Chatbots: Language models can be used to power chatbots and other conversational AI systems by generating responses to user input.
5. Sentiment Analysis: Language models can be used to analyze the sentiment of text, such as determining whether a piece of text is positive, negative, or neutral.
6. Named Entity Recognition: Language models can be used to identify and classify named entities in text, such as people, organizations, and locations.
7. Question Answering: Language models can be used to answer questions based on the content of a document or article.
8. Text Classification: Language models can be used to classify text into categories such as spam/not spam, positive/negative review, etc.
In conclusion, language models are a fundamental component of many natural language processing (NLP) applications, and can be used for a wide range of tasks, including language translation, language generation, text summarization, and sentiment analysis.</s>

That is much better. Yet, there is more to it. Note the </s> symbol at the end of the generated text. Where does it come from?

Meta trained Llama-2 for chatting, and a chat consists of multiple messages. As you already know, LLMs are stateless by nature: they produce output for a given input and do not remember any previous inputs. Yet, when you chat with ChatGPT, it clearly remembers not only your last message but the whole conversation. How does it do that?

The answer is that whenever you ask ChatGPT something, the system passes not just your last message but the whole conversation through its LLM.

Llama-2 was trained the same way. To separate the different messages (and also to provide the general chat "instructions"), Meta introduced a special format for Llama-2 messages.

Here is this format.

<s>[INST] <<SYS>>
{your_system_message}
<</SYS>>

{user_message_1} [/INST] {model_reply_1}</s><s>[INST] {user_message_2} [/INST]

Thus, </s> denotes the end of the output. This token informs you that the model has completed its generation.
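To make the format easier to work with, here is a small hypothetical helper (the function name and structure are our own, not part of the transformers library) that assembles a Llama-2 prompt from a system message and an alternating history of user messages and model replies. The leading <s> of the first turn is omitted because, as we will see below, the tokenizer adds it automatically.

def build_llama2_prompt(system_message, turns):
    # turns: a list of (user_message, model_reply) pairs;
    # the reply of the last turn may be None when we want the model to generate it.
    prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
    for i, (user_message, model_reply) in enumerate(turns):
        if i > 0:
            # every turn after the first starts with its own <s>[INST]
            prompt += f"<s>[INST] {user_message} [/INST]"
        else:
            prompt += f"{user_message} [/INST]"
        if model_reply is not None:
            prompt += f" {model_reply}</s>"
    return prompt

prompt = build_llama2_prompt(
    "You are a helpful assistant.",
    [("What is the language model, and how does it work?", None)],
)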

Let us use this format to ask the Llama-2 chat model the same question but also instruct it to provide simple explanations using examples.

> prompt = f"""[INST] <<SYS>>
You are an expert machine learning engineer skillful in explaining complex concepts in a simple manner. In your explanations, you use examples and helpful analogies to make sure the audience understands the details.
<</SYS>>

{request} [/INST]"""
> print(prompt)

[INST] <<SYS>>
You are an expert machine learning engineer skillful in explaining complex concepts in a simple manner. In your explanations, you use examples and helpful analogies to make sure the audience understands the details.
<</SYS>>

What is the language model, and how does it work? [/INST]

Note that we do not add <s> at the beginning of the prompt since the tokenizer takes care of it. Now, let us pass this new prompt through the model and see how it works.

> inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
> outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=1024)
> response = tokenizer.decode(outputs[0], skip_special_tokens=False)
> print(response)

<s> [INST] <<SYS>>
You are an expert machine learning engineer skillful in explaining complex concepts in a simple manner. In your explanations, you use examples and helpful analogies to make sure the audience understands the details.
<</SYS>>

What is the language model, and how does it work? [/INST]  Ah, a fellow curious mind! *adjusts glasses* Let me tell you, my friend, about the magical world of language models! 🧙‍♂️
A language model is a type of artificial intelligence (AI) that is trained to process and generate human-like language. Think of it like a super-powerful word processor that can not only type out sentences for you but also create entirely new sentences that you've never seen before! 💻
Now, how does it work? Well, the model is trained on a massive dataset of text, which can be anything from books and articles to social media posts and chat logs. The model learns to predict the next word in a sentence based on the context provided by the previous words. It's like a game of word association, but instead of just guessing, the model uses complex algorithms to analyze the patterns and structures of language. 🤔
But wait, there's more! Once the model is trained, it can be used for all sorts of cool things like:
1. Language Translation: Imagine being able to translate text from one language to another with near-perfect accuracy! The model can learn to recognize patterns in language and translate them into different languages. 🌎
2. Text Summarization: Need to summarize a long article or document? The model can condense it into a shorter, more digestible version while still retaining the important information. 📰
3. Chatbots: Have you ever interacted with a chatbot that can understand and respond to your queries? That's the magic of language models! They can generate responses based on the context of your messages. 💬
4. Content Generation: Imagine being able to generate high-quality content, like articles or social media posts, with just a few clicks! The model can generate text based on a given prompt or topic. 📱
These are just a few examples of the incredible things that language models can do. They are a powerful tool for anyone looking to work with language, from writers and content creators to language learners and researchers. 🎯
So there you have it, my friend! Language models are like superheroes of the language world, capable of feats that were once thought impossible. 🦸‍♂️ And with their help, we can create, communicate, and connect in ways we never thought possible! 🌟
Now, if you'll excuse me, I have some language modeling to do... 😉</s>

Well, that is not quite what we expected, but it still demonstrates the power of system prompts as well as the flexibility of the model :) It is also a good illustration of how hard it is to predict the model's behavior by simply reading the prompt: the output may still surprise you (just like this one surprised us).

Chat history

Finally, the prompt structure lets you have a multi-message conversation with Llama-2. Remember that Llama-2, just like every other LLM, is stateless. Thus, you need to pass the whole conversation to the model every time you ask for the next response.

Consider a short dialog like the following:

User: What is the capital of France?

AI: The capital of France is Paris.

User: And what is its population size?

Let us use the Llama-2 prompt structure to represent it for the language model.

> prompt = f"""[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

What is the capital of France? [/INST] The capital of France is Paris.</s><s>[INST] And what is its population size? [/INST]"""

> inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
> outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=256)
> response = tokenizer.decode(outputs[0], skip_special_tokens=False)
> print(response)

<s> [INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

What is the capital of France? [/INST] The capital of France is Paris.</s><s> [INST] And what is its population size? [/INST]  As of January 2023, the estimated population of Paris, France is around 1.4 million people (within the city limits) and over 12.5 million in the metropolitan area.</s>

Conclusion

LLM prompt structure can be tricky because getting it wrong does not result in an outright model failure. Yet, it can quietly hinder the model's performance.

Harnessing Llama-2's full potential depends heavily on structuring the input data correctly. It is also crucial for preparing the dataset for model fine-tuning, which we will explain in detail in the next post.

Building your custom LLM could enhance data security and compliance and enable an AI competitive advantage for your product. You can check our other posts to get an extensive explanation of what the network effect is and how AI enables it, how to build an AI competitive advantage for your company, what culture helps you build the right AI products, what to avoid in your AI strategy and execution, and more.

If you need help building an AI product for your business, look no further. Our team of AI technology consultants and engineers has years of experience in helping technology companies like yours build sustainable competitive advantages through AI technology. From data collection to algorithm development, we can help you stay ahead of the competition and secure your market share for years to come.

Contact us today to learn more about our AI technology consulting offering.

If you want to keep posted on how to build a sustainable competitive advantage with AI technologies, please subscribe to our blog post updates below.
