From 2898a9bc9e3578473a4898c3aacd717c00551f62 Mon Sep 17 00:00:00 2001 From: Kai Wu Date: Mon, 4 Nov 2024 12:38:44 -0800 Subject: [PATCH] prompt guide added --- docs/Prompt_Engineering_with_Llama_3.ipynb | 483 --------------------- docs/safety101.md | 4 +- 2 files changed, 2 insertions(+), 485 deletions(-) diff --git a/docs/Prompt_Engineering_with_Llama_3.ipynb b/docs/Prompt_Engineering_with_Llama_3.ipynb index f9e705666..681c2b8a8 100644 --- a/docs/Prompt_Engineering_with_Llama_3.ipynb +++ b/docs/Prompt_Engineering_with_Llama_3.ipynb @@ -34,296 +34,6 @@ "Programming foundational LLMs is done with natural language – it doesn't require training/tuning like ML models of the past. This has opened the door to a massive amount of innovation and a paradigm shift in how technology can be deployed. The science/art of using natural language to program language models to accomplish a task is referred to as **Prompt Engineering**." ] }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Llama Models\n", - "\n", - "In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.\n", - "\n", - "Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.\n", - "\n", - "#### Llama 3.1\n", - "1. `llama-3.1-8b` - base pretrained 8 billion parameter model\n", - "1. `llama-3.1-70b` - base pretrained 70 billion parameter model\n", - "1. `llama-3.1-405b` - base pretrained 405 billion parameter model\n", - "1. `llama-3.1-8b-instruct` - instruction fine-tuned 8 billion parameter model\n", - "1. `llama-3.1-70b-instruct` - instruction fine-tuned 70 billion parameter model\n", - "1. `llama-3.1-405b-instruct` - instruction fine-tuned 405 billion parameter model (flagship)\n", - "\n", - "\n", - "#### Llama 3\n", - "1. `llama-3-8b` - base pretrained 8 billion parameter model\n", - "1. `llama-3-70b` - base pretrained 70 billion parameter model\n", - "1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model\n", - "1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)\n", - "\n", - "#### Llama 2\n", - "1. `llama-2-7b` - base pretrained 7 billion parameter model\n", - "1. `llama-2-13b` - base pretrained 13 billion parameter model\n", - "1. `llama-2-70b` - base pretrained 70 billion parameter model\n", - "1. `llama-2-7b-chat` - chat fine-tuned 7 billion parameter model\n", - "1. `llama-2-13b-chat` - chat fine-tuned 13 billion parameter model\n", - "1. `llama-2-70b-chat` - chat fine-tuned 70 billion parameter model (flagship)\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Code Llama is a code-focused LLM built on top of Llama 2 also available in various sizes and finetunes:" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Code Llama\n", - "1. `codellama-7b` - code fine-tuned 7 billion parameter model\n", - "1. `codellama-13b` - code fine-tuned 13 billion parameter model\n", - "1. `codellama-34b` - code fine-tuned 34 billion parameter model\n", - "1. `codellama-70b` - code fine-tuned 70 billion parameter model\n", - "1. `codellama-7b-instruct` - code & instruct fine-tuned 7 billion parameter model\n", - "2. `codellama-13b-instruct` - code & instruct fine-tuned 13 billion parameter model\n", - "3. `codellama-34b-instruct` - code & instruct fine-tuned 34 billion parameter model\n", - "3. `codellama-70b-instruct` - code & instruct fine-tuned 70 billion parameter model\n", - "1. `codellama-7b-python` - Python fine-tuned 7 billion parameter model\n", - "2. `codellama-13b-python` - Python fine-tuned 13 billion parameter model\n", - "3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model\n", - "3. `codellama-70b-python` - Python fine-tuned 70 billion parameter model" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Getting an LLM\n", - "\n", - "Large language models are deployed and accessed in a variety of ways, including:\n", - "\n", - "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n", - " * Best for privacy/security or if you already have a GPU.\n", - "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.\n", - " * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).\n", - "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n", - " * Easiest option overall." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Hosted APIs\n", - "\n", - "Hosted APIs are the easiest way to get started. We'll use them here. There are usually two main endpoints:\n", - "\n", - "1. **`completion`**: generate a response to a given prompt (a string).\n", - "1. **`chat_completion`**: generate the next message in a list of messages, enabling more explicit instruction and context for use cases like chatbots." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Tokens\n", - "\n", - "LLMs process inputs and outputs in chunks called *tokens*. Think of these, roughly, as words – each model will have its own tokenization scheme. For example, this sentence...\n", - "\n", - "> Our destiny is written in the stars.\n", - "\n", - "...is tokenized into `[\"Our\", \" destiny\", \" is\", \" written\", \" in\", \" the\", \" stars\", \".\"]` for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.\n", - "\n", - "Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n", - "\n", - "Each model has a maximum context length that your prompt cannot exceed. That's 128k tokens for Llama 3.1, 4K for Llama 2, and 100K for Code Llama.\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook Setup\n", - "\n", - "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3.1 chat using [Grok](https://console.groq.com/playground?model=llama3-70b-8192).\n", - "\n", - "To install prerequisites run:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import sys\n", - "!{sys.executable} -m pip install groq" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from typing import Dict, List\n", - "from groq import Groq\n", - "\n", - "# Get a free API key from https://console.groq.com/keys\n", - "os.environ[\"GROQ_API_KEY\"] = \"YOUR_GROQ_API_KEY\"\n", - "\n", - "LLAMA3_405B_INSTRUCT = \"llama-3.1-405b-reasoning\" # Note: Groq currently only gives access here to paying customers for 405B model\n", - "LLAMA3_70B_INSTRUCT = \"llama-3.1-70b-versatile\"\n", - "LLAMA3_8B_INSTRUCT = \"llama3.1-8b-instant\"\n", - "\n", - "DEFAULT_MODEL = LLAMA3_70B_INSTRUCT\n", - "\n", - "client = Groq()\n", - "\n", - "def assistant(content: str):\n", - " return { \"role\": \"assistant\", \"content\": content }\n", - "\n", - "def user(content: str):\n", - " return { \"role\": \"user\", \"content\": content }\n", - "\n", - "def chat_completion(\n", - " messages: List[Dict],\n", - " model = DEFAULT_MODEL,\n", - " temperature: float = 0.6,\n", - " top_p: float = 0.9,\n", - ") -> str:\n", - " response = client.chat.completions.create(\n", - " messages=messages,\n", - " model=model,\n", - " temperature=temperature,\n", - " top_p=top_p,\n", - " )\n", - " return response.choices[0].message.content\n", - " \n", - "\n", - "def completion(\n", - " prompt: str,\n", - " model: str = DEFAULT_MODEL,\n", - " temperature: float = 0.6,\n", - " top_p: float = 0.9,\n", - ") -> str:\n", - " return chat_completion(\n", - " [user(prompt)],\n", - " model=model,\n", - " temperature=temperature,\n", - " top_p=top_p,\n", - " )\n", - "\n", - "def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):\n", - " print(f'==============\\n{prompt}\\n==============')\n", - " response = completion(prompt, model)\n", - " print(response, end='\\n\\n')\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Completion APIs\n", - "\n", - "Let's try Llama 3.1!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "complete_and_print(\"The typical color of the sky is: \")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "complete_and_print(\"which model version are you?\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Chat Completion APIs\n", - "Chat completion models provide additional structure to interacting with an LLM. An array of structured message objects is sent to the LLM instead of a single piece of text. This message list provides the LLM with some \"context\" or \"history\" from which to continue.\n", - "\n", - "Typically, each message contains `role` and `content`:\n", - "* Messages with the `system` role are used to provide core instruction to the LLM by developers.\n", - "* Messages with the `user` role are typically human-provided messages.\n", - "* Messages with the `assistant` role are typically generated by the LLM." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "response = chat_completion(messages=[\n", - " user(\"My favorite color is blue.\"),\n", - " assistant(\"That's great to hear!\"),\n", - " user(\"What is my favorite color?\"),\n", - "])\n", - "print(response)\n", - "# \"Sure, I can help you with that! Your favorite color is blue.\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### LLM Hyperparameters\n", - "\n", - "#### `temperature` & `top_p`\n", - "\n", - "These APIs also take parameters which influence the creativity and determinism of your output.\n", - "\n", - "At each step, LLMs generate a list of most likely tokens and their respective probabilities. The least likely tokens are \"cut\" from the list (based on `top_p`), and then a token is randomly selected from the remaining candidates (`temperature`).\n", - "\n", - "In other words: `top_p` controls the breadth of vocabulary in a generation and `temperature` controls the randomness within that vocabulary. A temperature of ~0 produces *almost* deterministic results.\n", - "\n", - "[Read more about temperature setting here](https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683).\n", - "\n", - "Let's try it out:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def print_tuned_completion(temperature: float, top_p: float):\n", - " response = completion(\"Write a haiku about llamas\", temperature=temperature, top_p=top_p)\n", - " print(f'[temperature: {temperature} | top_p: {top_p}]\\n{response.strip()}\\n')\n", - "\n", - "print_tuned_completion(0.01, 0.01)\n", - "print_tuned_completion(0.01, 0.01)\n", - "# These two generations are highly likely to be the same\n", - "\n", - "print_tuned_completion(1.0, 1.0)\n", - "print_tuned_completion(1.0, 1.0)\n", - "# These two generations are highly likely to be different" - ] - }, { "attachments": {}, "cell_type": "markdown", @@ -559,199 +269,6 @@ "# ['50', '50', '60', '50', '50'] -> 50" ] }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Retrieval-Augmented Generation\n", - "\n", - "You'll probably want to use factual knowledge in your application. You can extract common facts from today's large models out-of-the-box (i.e. using just the model weights):" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "complete_and_print(\"What is the capital of the California?\")\n", - "# Gives the correct answer \"Sacramento\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "However, more specific facts, or private information, cannot be reliably retrieved. The model will either declare it does not know or hallucinate an incorrect answer:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "complete_and_print(\"What was the temperature in Menlo Park on December 12th, 2023?\")\n", - "# \"I'm just an AI, I don't have access to real-time weather data or historical weather records.\"\n", - "\n", - "complete_and_print(\"What time is my dinner reservation on Saturday and what should I wear?\")\n", - "# \"I'm not able to access your personal information [..] I can provide some general guidance\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt you've retrived from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning which may be costly and negatively impact the foundational model's capabilities.\n", - "\n", - "This could be as simple as a lookup table or as sophisticated as a [vector database]([FAISS](https://github.com/facebookresearch/faiss)) containing all of your company's knowledge:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "MENLO_PARK_TEMPS = {\n", - " \"2023-12-11\": \"52 degrees Fahrenheit\",\n", - " \"2023-12-12\": \"51 degrees Fahrenheit\",\n", - " \"2023-12-13\": \"51 degrees Fahrenheit\",\n", - "}\n", - "\n", - "\n", - "def prompt_with_rag(retrived_info, question):\n", - " complete_and_print(\n", - " f\"Given the following information: '{retrived_info}', respond to: '{question}'\"\n", - " )\n", - "\n", - "\n", - "def ask_for_temperature(day):\n", - " temp_on_day = MENLO_PARK_TEMPS.get(day) or \"unknown temperature\"\n", - " prompt_with_rag(\n", - " f\"The temperature in Menlo Park was {temp_on_day} on {day}'\", # Retrieved fact\n", - " f\"What is the temperature in Menlo Park on {day}?\", # User question\n", - " )\n", - "\n", - "\n", - "ask_for_temperature(\"2023-12-12\")\n", - "# \"Sure! The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit.\"\n", - "\n", - "ask_for_temperature(\"2023-07-18\")\n", - "# \"I'm not able to provide the temperature in Menlo Park on 2023-07-18 as the information provided states that the temperature was unknown.\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Program-Aided Language Models\n", - "\n", - "LLMs, by nature, aren't great at performing calculations. Let's try:\n", - "\n", - "$$\n", - "((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n", - "$$\n", - "\n", - "(The correct answer is 91383.)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "complete_and_print(\"\"\"\n", - "Calculate the answer to the following math problem:\n", - "\n", - "((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n", - "\"\"\")\n", - "# Gives incorrect answers like 92448, 92648, 95463" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Gao et al. (2022)](https://arxiv.org/abs/2211.10435) introduced the concept of \"Program-aided Language Models\" (PAL). While LLMs are bad at arithmetic, they're great for code generation. PAL leverages this fact by instructing the LLM to write code to solve calculation tasks." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "complete_and_print(\n", - " \"\"\"\n", - " # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n", - " \"\"\",\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The following code was generated by Llama 3 70B:\n", - "\n", - "result = ((-5 + 93 * 4 - 0) * (4**4 - 7 + 0 * 5))\n", - "print(result)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Limiting Extraneous Tokens\n", - "\n", - "A common struggle with Llama 2 is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\"), even if explicit instructions are given to Llama 2 to be concise and no preamble. Llama 3.x can better follow instructions.\n", - "\n", - "Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "complete_and_print(\n", - " \"Give me the zip code for Menlo Park in JSON format with the field 'zip_code'\",\n", - ")\n", - "# Likely returns the JSON and also \"Sure! Here's the JSON...\"\n", - "\n", - "complete_and_print(\n", - " \"\"\"\n", - " You are a robot that only outputs JSON.\n", - " You reply in JSON format with the field 'zip_code'.\n", - " Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}\n", - " Now here is my question: What is the zip code of Menlo Park?\n", - " \"\"\",\n", - ")\n", - "# \"{'zip_code': 94025}\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Additional References\n", - "- [PromptingGuide.ai](https://www.promptingguide.ai/)\n", - "- [LearnPrompting.org](https://learnprompting.org/)\n", - "- [Lil'Log Prompt Engineering Guide](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)\n" - ] - }, { "attachments": {}, "cell_type": "markdown", diff --git a/docs/safety101.md b/docs/safety101.md index 2bf8f1bfe..43f67892b 100644 --- a/docs/safety101.md +++ b/docs/safety101.md @@ -9,9 +9,9 @@ To that goal, Llama Stack uses **Prompt Guard** and **Llama Guard 3** to secure **Prompt Guard**: -PromptGuard is a classifier model trained on a large corpus of attacks, which is capable of detecting both explicitly malicious prompts (Jailbreaks) as well as prompts that contain injected inputs (Prompt Injections). We suggest a methodology of fine-tuning the model to application-specific data to achieve optimal results. +Prompt Guard is a classifier model trained on a large corpus of attacks, which is capable of detecting both explicitly malicious prompts (Jailbreaks) as well as prompts that contain injected inputs (Prompt Injections). We suggest a methodology of fine-tuning the model to application-specific data to achieve optimal results. -PromptGuard is a BERT model that outputs only labels; unlike LlamaGuard, it doesn't need a specific prompt structure or configuration. The input is a string that the model labels as safe or unsafe (at two different levels). +PromptGuard is a BERT model that outputs only labels; unlike Llama Guard, it doesn't need a specific prompt structure or configuration. The input is a string that the model labels as safe or unsafe (at two different levels). For more detail on PromptGuard, please checkout [PromptGuard model card and prompt formats](https://www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard)