These notes are copy/pasted from the emails I received from enrolling in Google’s “5-Day Gen AI Intensive”, and the corresponding Kaggle notebooks. All content credit goes to Google and Kaggle.
===================================================
11/10/24
Today’s Assignments
1. Complete the Intro Unit – “Foundational Large Language Models & Text Generation”, which is:
[Optional] Listen to the summary podcast episode for this unit (created by NotebookLM).
Read the “Foundational Large Language Models & Text Generation” whitepaper.
2. Complete Unit 1 – “Prompt Engineering”, which is:
[Optional] Listen to the summary podcast episode for this unit (created by NotebookLM).
Read the “Prompt Engineering” whitepaper.
Complete this code lab on Kaggle where you’ll learn prompting fundamentals. Make sure you phone-verify your account before starting; it's required for the code labs.
What You’ll Learn
Today you’ll explore the evolution of LLMs, from transformers to techniques like fine-tuning and inference acceleration. You’ll also get trained in the art of prompt engineering for optimal LLM interaction.
The code lab will walk you through getting started with the Gemini API and cover several prompt techniques and how different parameters impact the prompts.
===================================================
Output length
When generating text with an LLM, the output length affects cost and performance. Generating more tokens increases computation, leading to higher energy consumption, latency, and cost.
To stop the model from generating tokens past a limit, you can specify the max_output_tokens parameter when using the Gemini API. This parameter doesn't influence how the output tokens are generated, so the output won't become more stylistically or textually succinct; the model simply stops once the specified number of tokens has been produced. Prompt engineering may be required to generate a more complete output within your given limit.
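For illustration, here's a minimal sketch of capping output length, assuming the google-generativeai Python SDK with a placeholder API key and model name (verify the exact setup against the code lab):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")

# Cap the response at 200 tokens; generation simply stops when the limit is hit,
# often mid-sentence, rather than producing a shorter but complete answer.
response = model.generate_content(
    "Write a 1000-word essay on the importance of olives in modern society.",
    generation_config=genai.GenerationConfig(max_output_tokens=200),
)
print(response.text)
```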
Temperature
Temperature controls the degree of randomness in token selection. Higher temperatures give more candidate tokens a realistic chance of being selected, which can produce more diverse results; lower temperatures have the opposite effect. A temperature of 0 results in greedy decoding, where the most probable token is selected at each step.
Temperature doesn't provide any guarantees of randomness, but it can be used to "nudge" the output somewhat.
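A hedged sketch (same SDK and placeholder-key assumptions as above) that runs the same prompt at two temperatures so the difference is visible:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

prompt = "Pick a random colour. Respond with a single word."

for temperature in (0.0, 2.0):
    config = genai.GenerationConfig(temperature=temperature)
    # Run the prompt a few times at each temperature; expect near-identical
    # answers at 0.0 (greedy decoding) and more varied answers at 2.0.
    for _ in range(3):
        response = model.generate_content(prompt, generation_config=config)
        print(f"temperature={temperature}: {response.text.strip()}")
```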
Top-K and top-P
Like temperature, top-K and top-P parameters are also used to control the diversity of the model's output.
Top-K is a positive integer that defines the number of most probable tokens from which to select the output token. A top-K of 1 selects a single token, performing greedy decoding.
Top-P sets a cumulative probability threshold: tokens are added as candidates, from most to least probable, until their combined probability exceeds the threshold. A top-P of 0 is typically equivalent to greedy decoding, and a top-P of 1 typically includes every token in the model's vocabulary.
When both are supplied, the Gemini API will filter top-K tokens first, then top-P and then finally sample from the candidate tokens using the supplied temperature.
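The sketch below (same assumptions as the earlier examples; the specific values are illustrative, not recommendations) sets all three sampling parameters on one request:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# top_k keeps only the 64 most probable tokens, top_p then trims that set to a
# 95% cumulative probability mass, and temperature controls sampling among what's left.
config = genai.GenerationConfig(
    temperature=1.0,
    top_k=64,
    top_p=0.95,
    max_output_tokens=150,
)

story_prompt = "You are a creative writer. Write a short story about a cat who goes on an adventure."
response = model.generate_content(story_prompt, generation_config=config)
print(response.text)
```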
Run this example a number of times, changing the settings, and observe how the output changes.
Zero-shot
Zero-shot prompts describe the request to the model directly, without providing any examples of the expected output.
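As an illustration (the review text and settings are just examples), a zero-shot sentiment classification prompt might look like this:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# No examples are supplied; the prompt states the task and the input directly.
zero_shot_prompt = """Classify movie reviews as POSITIVE, NEUTRAL or NEGATIVE.
Review: "Her" is a disturbing study revealing the direction
humanity is headed if AI is allowed to keep evolving,
unchecked. I wish there were more movies like this masterpiece.
Sentiment: """

response = model.generate_content(
    zero_shot_prompt,
    generation_config=genai.GenerationConfig(temperature=0.1, max_output_tokens=5),
)
print(response.text)
```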
Enum mode
The models are trained to generate text, and can sometimes produce more text than you may wish for. In the preceding example, the model outputs the label, but it may also prepend a "Sentiment:" prefix, and without an output token limit it may append explanatory text afterwards.
The Gemini API has an Enum mode feature that allows you to constrain the output to a fixed set of values.
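A sketch of enum mode, assuming an SDK version that accepts a Python enum.Enum as the response_schema together with the "text/x.enum" response MIME type (both worth verifying against your installed google-generativeai version):

```python
import enum
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

class Sentiment(enum.Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"

response = model.generate_content(
    "Classify this review: 'Her' is a masterpiece. I wish there were more movies like it.",
    generation_config=genai.GenerationConfig(
        # Constrain the output to exactly one of the enum values (assumed
        # response_schema enum support; check your SDK version).
        response_mime_type="text/x.enum",
        response_schema=Sentiment,
    ),
)
print(response.text)
```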
One-shot and few-shot
Providing an example of the expected response is known as a "one-shot" prompt. When you provide multiple examples, it is a "few-shot" prompt.
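For example (the pizza-order task and examples are illustrative), a few-shot prompt interleaves example inputs and outputs before the real input:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Two worked examples establish the expected output format before the real order.
few_shot_prompt = """Parse a customer's pizza order into JSON.

EXAMPLE:
I want a small pizza with cheese, tomato sauce, and pepperoni.
JSON Response:
{"size": "small", "type": "normal", "ingredients": ["cheese", "tomato sauce", "pepperoni"]}

EXAMPLE:
Can I get a large pizza with tomato sauce, basil and mozzarella?
JSON Response:
{"size": "large", "type": "normal", "ingredients": ["tomato sauce", "basil", "mozzarella"]}

ORDER:
Give me a large with cheese and pineapple.
JSON Response:
"""

response = model.generate_content(
    few_shot_prompt,
    generation_config=genai.GenerationConfig(temperature=0.1, max_output_tokens=250),
)
print(response.text)
```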
JSON mode
To provide control over the schema, and to ensure that you only receive JSON (with no other text or markdown), you can use the Gemini API's JSON mode. This forces the model to constrain decoding, such that token selection is guided by the supplied schema.
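A sketch of JSON mode, assuming a google-generativeai version that accepts a TypedDict as the response_schema (the PizzaOrder class here is a hypothetical example):

```python
import typing_extensions as typing
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# The schema guides constrained decoding, so the response is valid JSON in this
# shape, with no surrounding prose or markdown.
class PizzaOrder(typing.TypedDict):
    size: str
    type: str
    ingredients: list[str]

response = model.generate_content(
    "Can I have a large dessert pizza with apple and chocolate?",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=PizzaOrder,
    ),
)
print(response.text)  # e.g. {"size": "large", "type": "dessert", "ingredients": [...]}
```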
Chain of Thought (CoT)
Direct prompting of LLMs can return answers quickly and (in terms of output token usage) efficiently, but the answers can be prone to hallucination. An answer may "look" correct in terms of language and syntax while being incorrect in terms of factuality and reasoning.
Chain-of-Thought prompting is a technique where you instruct the model to output intermediate reasoning steps, and it typically gets better results, especially when combined with few-shot examples. It is worth noting that this technique doesn't completely eliminate hallucinations, and that it tends to cost more to run, due to the increased token count.
As models like the Gemini family are trained to be "chatty" and provide reasoning steps, you can ask the model to be more direct in the prompt.
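A small sketch contrasting a direct prompt with a chain-of-thought prompt (the question and wording are illustrative):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

question = """When I was 4 years old, my partner was 3 times my age.
Now I am 20 years old. How old is my partner?"""

# Direct prompt: fast and cheap, but more likely to state a wrong answer confidently.
direct = model.generate_content(question + " Return the answer directly.")
print(direct.text)

# Chain-of-thought prompt: asks for intermediate reasoning before the final answer,
# at the cost of more output tokens.
cot = model.generate_content(question + " Let's think step by step.")
print(cot.text)
```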
Code prompting
Generating code
The Gemini family of models can be used to generate code, configuration and scripts. Generating code can be helpful when you're learning to code, learning a new language, or rapidly producing a first draft.
Since LLMs can't reason and may repeat training data, it's essential to read and test any generated code before using it, and to comply with any relevant licenses.
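As one hedged example (the system instruction and task are made up), a system instruction can keep the response to code only:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# A system instruction keeps the response focused on code rather than explanation.
model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction=(
        "You are a coding assistant. Respond with a single Bash script "
        "and no additional explanation."
    ),
)

response = model.generate_content(
    "Write a script that renames all .txt files in the current directory to .bak."
)
print(response.text)  # Read and test before running anything it produces.
```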
Code execution
The Gemini API can also run generated code automatically and return the output.
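A sketch of enabling this in the google-generativeai SDK via its code-execution tool (flag name per the SDK's documentation; verify against your installed version):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Enabling the code-execution tool lets the model write and run Python code
# server-side, then return the final result along with the code it ran.
model = genai.GenerativeModel("gemini-1.5-flash", tools="code_execution")

response = model.generate_content(
    "Generate the first 14 odd prime numbers, then calculate their sum."
)
print(response.text)
```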
Explaining code
The Gemini family of models can explain code to you too.
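For instance (the filename is a placeholder), you can read existing code into the prompt and ask for an explanation:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Read an existing script (placeholder path) and ask the model to explain it.
with open("git_setup.sh") as f:
    file_contents = f.read()

explain_prompt = (
    "Please explain what this file does at a very high level. "
    "What is it, and why would I use it?\n\n" + file_contents
)

response = model.generate_content(explain_prompt)
print(response.text)
```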
===================================================