Prompting Fundamentals and How to Wield them Effectively

Writing good prompts is the most straightforward way to get value out of large language models (LLMs). However, it’s important to understand the fundamentals even as we apply advanced techniques and prompt optimization tools. For example, there’s more to Chain-of-Thought (CoT) than simply adding “think step by step”. Here, I’d like to share some prompting fundamentals to help you get the most out of LLMs.

Aside: By now, we should know that we need reliable evals before doing any major prompt engineering. Without evals, how would we measure improvements and regressions? Here’s my usual workflow: (i) manually label ~100 eval examples, (ii) write the initial prompt, (iii) run the eval and iterate on the prompt and evals, (iv) evaluate on a held-out test set before deployment.
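
To make that workflow concrete, here’s a minimal sketch for a hypothetical sentiment-classification task; the labeled examples, prompt, and crude substring-match metric are illustrative stand-ins for your own data and scoring logic.

import anthropic

client = anthropic.Anthropic()

# A couple of manually labeled examples for illustration; in practice, aim for ~100.
labeled_examples = [
    {"text": "The battery easily lasts two days.", "label": "positive"},
    {"text": "The app crashes every time I open it.", "label": "negative"},
]

PROMPT = "Classify the sentiment of this review as positive or negative.\n\nReview: {text}\n\nSentiment:"

def evaluate(prompt_template, examples):
    correct = 0
    for example in examples:
        message = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=10,
            messages=[{"role": "user", "content": prompt_template.format(text=example["text"])}],
        )
        prediction = message.content[0].text.strip().lower()
        correct += int(example["label"] in prediction)  # crude substring match as the metric
    return correct / len(examples)

print(evaluate(PROMPT, labeled_examples))  # iterate on the prompt, then eval on a held-out test set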

We’ll use the Claude Messages API for the code examples below. The API provides specific roles for the user and assistant, as well as a system prompt.

import anthropic

message = anthropic.Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system="Today is 26th May 2024.",  # system prompt
    messages=[
        {"role": "user", "content": "Hello there."},
        {"role": "assistant", "content": "Hi, I'm Claude. How can I help?"},
        {"role": "user", "content": "What is prompt engineering?"},
    ],
)

Mental model: Prompts as conditioning

At its core, prompt engineering is about conditioning the probabilistic LLM to generate the desired output. Thus, each additional instruction or piece of context in the prompt steers the LLM’s generation in a particular direction.

Consider these prompts. The first will likely generate a response about Apple the tech company. The second will describe the fruit. And the third will explain the idiom.

# Prompt 1
Tell me about: Apple

# Prompt 2
Tell me about: Apple fruit

# Prompt 3
Tell me about: Apple of my eye

By adding a couple of tokens, we can condition the model to return completely different outputs. By extension, prompt engineering techniques like n-shot prompting, structured input and output, CoT, etc. are simply more sophisticated ways of conditioning the LLM.

Assign roles and responsibilities

One way to condition the model’s output is to assign it a specific role or responsibility. This provides it with context that steers its responses in terms of content, tone, style, etc.

Consider the two prompts below. The assigned role in each prompt will lead to different responses. The preschool teacher will likely respond with simple language and analogies while the NLP professor may dive into the technical details of attention mechanisms.

# Prompt 1
You are a preschool teacher. Explain how attention in LLMs works.

# Prompt 2
You are an NLP professor. Explain how attention in LLMs works.

Roles can also improve the model’s accuracy on many tasks. Imagine we’re building a system to exclude NSFW image generation prompts. While a basic prompt like prompt 1 might work, we can improve the model’s accuracy by providing it with a role (prompt 2) or responsibility (prompt 3). The additional context in prompts 2 and 3 encourages the LLM to scrutinize the input more carefully, thus increasing recall on more subtle issues.

# Prompt 1
Is this image generation prompt safe?

# Prompt 2
Claude, you are an expert content moderator who identifies harmful aspects in prompts.
Is this image generation prompt safe?

# Prompt 3
Claude, you are responsible for identifying harmful aspects in prompts.
Is this image generation prompt safe?
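
As a concrete sketch, the role from prompt 2 can live in the system prompt while the image generation prompt being screened goes in the user turn. The model name and the <verdict> output convention below are assumptions for illustration, not part of the original prompts.

import anthropic

image_prompt = "A cozy reading nook with warm lighting"  # the prompt to be screened

message = anthropic.Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=100,
    # The role / responsibility lives in the system prompt.
    system="You are an expert content moderator who identifies harmful aspects in prompts.",
    messages=[
        {
            "role": "user",
            "content": (
                f"Is this image generation prompt safe?\n\n<prompt>{image_prompt}</prompt>\n\n"
                "Answer with <verdict>safe</verdict> or <verdict>unsafe</verdict>."
            ),
        }
    ],
)
print(message.content[0].text)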

Structured input and output

Structured input helps the LLM better understand the task and input, thus improving the quality of output. Structured output makes it easier to parse responses, thus simplifying integration with downstream systems. For Claude, XML tags work particularly well while other LLMs may prefer Markdown, JSON, etc.

In this example, we ask Claude to extract product attributes from a <description>:

<description>
The SmartHome Mini is a compact smart home assistant available in black or white for 
only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other 
connected devices via voice or app—no matter where you place it in your home. This 
affordable little hub brings convenient hands-free control to your smart devices.
</description>

Extract the <name>, <size>, <price>, and <color> from this product <description>.

Claude can reliably follow these instructions and almost always generates output in the requested format.

<name>SmartHome Mini</name>
<size>5 inches wide</size>  
<price>$49.99</price>
<color>black or white</color>

We can scale this to process multiple documents at once. Here’s an example where we provide product reviews as an array of dicts which we then convert to XML input. (While the example only shows three documents, you can imagine how we can increase the input to dozens, if not hundreds of documents).

from dicttoxml import dicttoxml
from xml.dom.minidom import parseString

def custom_item_func(item):
    # Name each list item <review> instead of the default <item>
    return 'review'

docs = {
    "reviews": [
        {
            "id": 1,
            "text": "The SmartHome Mini is a compact, powerful, and user-friendly smart "
                    "home hub. It offers great value for its price."
        },
        {
            "id": 2,
            "text": "The SmartHome Mini is a decent entry-level smart home hub, but it "
                    "has some connectivity issues and the app needs improvement."
        },
        {
            "id": 3,
            "text": "Despite being affordable and compact, the SmartHome Mini's "
                    "performance is disappointing, with poor voice command "
                    "interpretation and unreliable device connections."
        }
    ]
}

# Convert the dictionary to XML
xml_output = dicttoxml(docs, attr_type=False, root=False, item_func=custom_item_func)

# Parse the XML string
dom = parseString(xml_output)

# Pretty-print the XML with indentation and spaces
pretty_xml = dom.toprettyxml(indent="  ")

# Print the pretty-printed XML
print(pretty_xml)

This gives us the following XML.

<reviews>
  <review>
    <id>1</id>
    <text>
      The SmartHome Mini is a compact, powerful, and user-friendly smart home hub. It 
      offers great value for its price.
    </text>
  </review>
  <review>
    <id>2</id>
    <text>
      The SmartHome Mini is a decent entry-level smart home hub, but it has some 
      connectivity issues and the app needs improvement.
    </text>
  </review>
  <review>
    <id>3</id>
    <text>
      Despite being affordable and compact, the SmartHome Mini's performance is 
      disappointing, with poor voice command interpretation and unreliable 
      device connections.
    </text>
  </review>
</reviews>

We can then prompt Claude to provide a <summary> of the <reviews>, with references to the relevant <id> tags. This gives us the following output.

<summary>
The SmartHome Mini receives mixed reviews. <id>1</id> praises its compact size, power, 
user-friendliness, and value for money. However, <id>2</id> notes connectivity issues 
and room for app improvement, while <id>3</id> finds the performance disappointing 
due to poor voice command interpretation and unreliable device connections.
</summary>

We can also prompt it to extract the <aspect>, <sentiment>, and corresponding review <id>, leading to the following:

<aspect>compact size</aspect> <sentiment>positive</sentiment> <id>1</id>
<aspect>power</aspect> <sentiment>positive</sentiment> <id>1</id>
<aspect>connectivity</aspect> <sentiment>negative</sentiment> <id>2</id>
<aspect>app</aspect> <sentiment>negative</sentiment> <id>2</id>
<aspect>affordability</aspect> <sentiment>positive</sentiment> <id>3</id>
<aspect>performance</aspect> <sentiment>negative</sentiment> <id>3</id>
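
Because the output follows a fixed tag structure, parsing it downstream takes only a couple of lines. Here’s a sketch using the format shown above.

import re

output = """
<aspect>compact size</aspect> <sentiment>positive</sentiment> <id>1</id>
<aspect>power</aspect> <sentiment>positive</sentiment> <id>1</id>
<aspect>connectivity</aspect> <sentiment>negative</sentiment> <id>2</id>
"""

# Each line carries one (aspect, sentiment, review id) triple in predictable tags.
pattern = r"<aspect>(.*?)</aspect>\s*<sentiment>(.*?)</sentiment>\s*<id>(.*?)</id>"
print(re.findall(pattern, output))
# [('compact size', 'positive', '1'), ('power', 'positive', '1'), ('connectivity', 'negative', '2')]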

Prefill Claude’s responses

Prefilling an LLM’s response is akin to “putting words in its mouth”. For Claude, this guarantees that the generated text will start with the provided tokens (at least in my experience across millions of requests).

Here’s how we would do this via Claude’s Messages API, where we prefill the assistant’s response with <response><name>. This ensures that Claude will start with these exact tokens, and also makes it easier to parse the <response> downstream.

input = """
<description>
The SmartHome Mini is a compact smart home assistant available in black or white for 
only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other 
connected devices via voice or app—no matter where you place it in your home. This 
affordable little hub brings convenient hands-free control to your smart devices.
</description>

Extract the <name>, <size>, <price>, and <color> from this product <description>.

Return the extracted attributes within <response>.
"""

messages=[
    {
        "role": "user",
        "content": input,
    },
    {
        "role": "assistant",
        "content": "<response><name>"
    }
]

n-shot prompting

Perhaps the single most effective technique for conditioning an LLM’s behavior is n-shot prompting. The idea is to provide the LLM with n examples that demonstrate the task and desired output. This steers the model towards the example output and usually leads to improvements in output quality and consistency.

However, n-shot prompting can be a double-edged sword. If we provide too few examples, say three to five, we risk overfitting the model to those specific instances. As a result, if the input differs from the narrow set of examples, output quality could degrade.

I typically use a dozen samples or more. Most academic evals use 32-shot or 64-shot prompts. (This is also why I tend not to call this technique few-shot prompting, because “few” can be misleading about what it takes to get reliable performance.)

We’ll also want to ensure that our n-shot examples are representative of expected production inputs. If we’re building a system to extract aspects and sentiments from product reviews, we’ll want to include samples from multiple categories such as electronics, fashion, groceries, media, etc. We’ll also want to match the distribution of examples to production data. If 80% of production aspects are positive, the n-shot prompt should reflect that.

That said, the number of examples needed will vary based on the complexity of the task. For simpler goals such as enforcing output structure or response tone, as few as five examples may suffice. In these cases, we’ll often only need to provide the desired output rather than input-output pairs.
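
One way to keep n-shot prompts manageable is to store the examples as data and render them into the prompt at request time. Here’s a sketch for the aspect-sentiment task; only two illustrative examples are shown, but in practice we’d include a dozen or more that match the production distribution.

# Hypothetical labeled examples; sample these to mirror production categories and sentiment mix.
examples = [
    {"review": "Battery life is fantastic.", "aspect": "battery", "sentiment": "positive"},
    {"review": "Shipping took three weeks.", "aspect": "shipping", "sentiment": "negative"},
]

def build_nshot_prompt(examples, new_review):
    # Render each example as an input-output pair in the same tags we expect back.
    shots = "\n\n".join(
        f"<review>{ex['review']}</review>\n"
        f"<aspect>{ex['aspect']}</aspect> <sentiment>{ex['sentiment']}</sentiment>"
        for ex in examples
    )
    return (
        "Extract the <aspect> and <sentiment> from each <review>.\n\n"
        f"{shots}\n\n<review>{new_review}</review>"
    )

print(build_nshot_prompt(examples, "The fabric feels cheap."))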

Diving deeper into Chain-of-Thought

The basic idea of CoT is to give the LLM “space to think” before generating its final output, especially if the task is complex. The intermediate reasoning step allows the model to break down the problem and conditions its own response, often leading to better results.

The standard approach is to simply add the phrase “Think step by step”.

Claude, you are responsible for summarizing meeting <transcripts>.

<transcript>
{transcript}
</transcript>

Think step by step and return a <summary> of the <transcript>.

However, we can make a few simple tweaks to make this more effective.

One idea is to contain the model’s CoT within a designated <sketchpad>. This makes it easier to parse the final output and exclude the CoT if needed. We can then prefill Claude’s response with the opening <sketchpad> tag.

Claude, you are responsible for summarizing meeting <transcripts>.

<transcript>
{transcript}
</transcript>

Think step by step on how to summarize the <transcript> within the provided <sketchpad>.

Then, return a <summary> based on the <sketchpad>.

Another way to improve CoT is to provide more specific instructions for the reasoning process. For example:

Claude, you are responsible for summarizing meeting <transcripts>.

<transcript>
{transcript}
</transcript>

Think step by step on how to summarize the <transcript> within the provided <sketchpad>.

In the <sketchpad>, identify the <key decisions>, <action items>, and their <owners>.

Then, check that the <sketchpad> items are factually consistent with the <transcript>.

Finally, return a <summary> based on the <sketchpad>.

By guiding the model to look for specific information and verify its intermediate outputs against the source document, we can significantly improve factual consistency (i.e., reduce hallucination). In some use cases, we’ve observed that adding a sentence or two to the CoT prompt reduced hallucination by 75%.
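
Since the reasoning lives inside <sketchpad>, post-processing only needs to pull out the <summary>. Here’s a sketch with a mocked-up completion.

import re

# A mocked-up completion in the format requested above.
completion = """
<sketchpad>
Key decisions: migrate to the new billing system by Q3 (owner: Priya).
Action items: draft the migration plan (owner: Sam).
</sketchpad>
<summary>
The team agreed to migrate to the new billing system by Q3, with Priya owning the
migration and Sam drafting the plan.
</summary>
"""

# Keep the final <summary> and discard the intermediate <sketchpad> reasoning.
summary = re.search(r"<summary>(.*?)</summary>", completion, re.DOTALL).group(1).strip()
print(summary)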

Optimal placement of context

A question I often get is where to place the input document or context within the prompt. For Claude, I’ve found that putting the context near the beginning tends to work best, with a structure like:

  • Role or responsibility (usually brief)
  • Context or document
  • Specific instructions
  • Prefilled response

This aligns with the role-context-task pattern used in many of Anthropic’s own examples.

Nonetheless, the optimal placement may vary across different models depending on how they were trained. If you have reliable evals, it’s worth experimenting with different context locations and measuring the impact on performance.
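
Putting that order together, a sketch of the assembled messages might look like this; the transcript and task are illustrative.

# Sketch of the role -> context -> instructions -> prefilled-response ordering described above.
transcript = "Priya: Let's target Q3 for the billing migration. Sam: I'll draft the plan."

role = "You are responsible for summarizing meeting <transcripts>."
context = f"<transcript>\n{transcript}\n</transcript>"
instructions = (
    "Think step by step on how to summarize the <transcript> within a <sketchpad>.\n"
    "Then, return a <summary> based on the <sketchpad>."
)

messages = [
    {"role": "user", "content": f"{role}\n\n{context}\n\n{instructions}"},
    {"role": "assistant", "content": "<sketchpad>"},  # prefill to start the reasoning
]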

Crafting effective instructions

Using short, focused sentences separated by new lines tends to work best. This is akin to writing small, single-purpose functions in programming. For some reason, other formats like paragraphs, bullet points, or numbered lists don’t seem to work as well, at least in my experience. Nonetheless, the meta on prompt engineering constantly evolves so it’s helpful to observe the latest system prompts. Here’s Claude’s prompt, and here’s ChatGPT’s.

Also, it’s natural to add more and more instructions to our prompts to better handle edge cases and eke out more performance. But just like software, prompts can get bloated over time. Before we know it, our once-simple prompt has grown into a hundred-line prompt. To add insult to injury, the Frankensteined prompt may actually perform worse on more common and straightforward inputs. Thus, periodically refactor prompts (just like software!) and prune instructions that are no longer needed.

Dealing with hallucinations

This is a tricky one. While some techniques help with hallucinations, none are foolproof. Thus, do not assume that applying these techniques will eliminate hallucinations.

For tasks involving extraction or question answering, include an instruction that allows the LLM to say “I don’t know” or “Not applicable”. Additionally, try instructing the model to only provide an answer if it’s highly confident. Here’s an example:

Answer the following question based on the provided <context>.

If the question CANNOT be answered based on the <context>, respond with "I don't know".

Only provide an answer if you are highly confident it is factually correct.

<context>
{context}
</context>

Question: {question}

Answer:
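
Downstream, we can check for the abstention phrase and fall back gracefully instead of surfacing a potentially hallucinated answer. The fallback message below is just a placeholder.

def handle_answer(answer: str) -> str:
    # If the model abstained, return a fallback instead of a possibly wrong answer.
    if "i don't know" in answer.strip().lower():
        return "No answer found in the provided context."
    return answer

print(handle_answer("I don't know"))                      # -> fallback message
print(handle_answer("The SmartHome Mini costs $49.99."))  # -> passes through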

For tasks that involve reasoning or multi-step processing, using CoT can help reduce hallucinations. By providing a <sketchpad> for the model to think and check its intermediate output before providing the final answer, we can improve the factual grounding of the output. The previous example of summarizing meeting transcripts (reproduced below) is a good example.

Claude, you are responsible for summarizing meeting <transcripts>.

<transcript>
{transcript}
</transcript>

Think step by step on how to summarize the <transcript> within the provided <sketchpad>.

In the <sketchpad>, identify the <key decisions>, <action items>, and their <owners>.

Then, check that the <sketchpad> items are factually consistent with the <transcript>.

Finally, return a <summary> based on the <sketchpad>.

Using the stop sequence

The stop sequence parameter allows us to specify words or phrases that signal the end of the desired output. This prevents trailing text, reduces latency, and makes the model’s responses easier to parse. When working with Claude, a convenient option is to use the closing XML tag (e.g., </response>) as the stop sequence.

input = """
<description>
The SmartHome Mini is a compact smart home assistant available in black or white for 
only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other 
connected devices via voice or app—no matter where you place it in your home. This 
affordable little hub brings convenient hands-free control to your smart devices.
</description>

Extract the <name>, <size>, <price>, and <color> from this product <description>.

Return the extracted attributes within <response>.
"""

message = anthropic.Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": input,
        },
        {
            "role": "assistant",
            "content": "<response><name>"
        }
    ],
    stop_sequences=["</response>"]  # Added the stop sequence here
)
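
Since the response was prefilled with <response><name> and generation stops at the closing tag, the returned text contains only the newly generated tokens. Here’s a sketch of stitching the pieces back together for parsing, continuing from the snippet above.

# The prefill and stop sequence are not included in message.content,
# so we add them back to get a well-formed <response> block.
prefill = "<response><name>"
stop_sequence = "</response>"

full_response = prefill + message.content[0].text + stop_sequence
print(full_response)  # e.g., <response><name>SmartHome Mini</name> ... <color>black or white</color></response>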

Selecting a temperature

The temperature parameter controls the “creativity” of a model’s output. It ranges from 0.0 to 1.0, with higher values resulting in more diverse and unpredictable responses while lower values produce more focused and deterministic outputs. (Confusingly, OpenAI APIs allow temperature values as high as 2.0, but this is not the norm.)

My rule of thumb is to use the highest temperature that still leads to good results for the specific task. I often start with a temperature of 0.8 and lower it as necessary.

Another heuristic is to use lower temperatures (closer to 0) for analytical or multiple-choice tasks, and higher temperatures (closer to 1) for creative or open-ended tasks. Nonetheless, I’ve found that too low a temperature reduces the model’s intelligence, thus my preferred approach of starting from 0.8 and lowering it only if absolutely necessary.
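
Temperature is just another parameter on the same API call; here’s a minimal sketch, starting from my default of 0.8.

import anthropic

message = anthropic.Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    temperature=0.8,  # start high; lower it only if outputs are too erratic for the task
    messages=[{"role": "user", "content": "Suggest three names for a compact smart home hub."}],
)
print(message.content[0].text)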

What doesn’t seem to matter

There are a few things that, based on my experience and discussions with others, don’t have a practical impact on performance (at least for recent models):

  • Courtesy: Adding phrases like “please” and “thank you” doesn’t affect the model’s outputs, even if it might earn us some goodwill with our future AI overlords.
  • Tips and threats: Recent models are generally good at following instructions without the need to offer a $200 tip or threaten that we’ll lose our job.

Of course, it doesn’t hurt to be polite or playful in our prompts. Nonetheless, it’s useful to know that they’re not as critical for getting good results.

• • •

As LLMs continue to improve, effective prompt engineering will remain a valuable skill for getting the most out of these models (though we may transition to “dictionary learning” soon). What other prompting techniques have you found useful? Please reach out!

If you found this useful, please cite this write-up as:

Yan, Ziyou. (May 2024). Prompting Fundamentals and How to Wield them Effectively. eugeneyan.com.
https://eugeneyan.com/writing/prompting/.

or

@article{yan2024prompting,
  title   = {Prompting Fundamentals and How to Wield them Effectively},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2024},
  month   = {May},
  url     = {https://eugeneyan.com/writing/prompting/}
}
