diff --git a/docs/Prompt_Engineering_with_Llama_3.ipynb b/docs/Prompt_Engineering_with_Llama_3.ipynb new file mode 100644 index 000000000..f9e705666 --- /dev/null +++ b/docs/Prompt_Engineering_with_Llama_3.ipynb @@ -0,0 +1,795 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open\n", + "\n", + "# Prompt Engineering with Llama 3.1\n", + "\n", + "Prompt engineering is using natural language to produce a desired response from a large language model (LLM).\n", + "\n", + "This interactive guide covers prompt engineering & best practices with Llama 3.1." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Why now?\n", + "\n", + "[Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762) introduced the world to transformer neural networks (originally for machine translation). Transformers ushered an era of generative AI with diffusion models for image creation and large language models (`LLMs`) as **programmable deep learning networks**.\n", + "\n", + "Programming foundational LLMs is done with natural language – it doesn't require training/tuning like ML models of the past. This has opened the door to a massive amount of innovation and a paradigm shift in how technology can be deployed. The science/art of using natural language to program language models to accomplish a task is referred to as **Prompt Engineering**." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Llama Models\n", + "\n", + "In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.\n", + "\n", + "Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.\n", + "\n", + "#### Llama 3.1\n", + "1. `llama-3.1-8b` - base pretrained 8 billion parameter model\n", + "1. `llama-3.1-70b` - base pretrained 70 billion parameter model\n", + "1. `llama-3.1-405b` - base pretrained 405 billion parameter model\n", + "1. `llama-3.1-8b-instruct` - instruction fine-tuned 8 billion parameter model\n", + "1. `llama-3.1-70b-instruct` - instruction fine-tuned 70 billion parameter model\n", + "1. `llama-3.1-405b-instruct` - instruction fine-tuned 405 billion parameter model (flagship)\n", + "\n", + "\n", + "#### Llama 3\n", + "1. `llama-3-8b` - base pretrained 8 billion parameter model\n", + "1. `llama-3-70b` - base pretrained 70 billion parameter model\n", + "1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model\n", + "1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)\n", + "\n", + "#### Llama 2\n", + "1. `llama-2-7b` - base pretrained 7 billion parameter model\n", + "1. `llama-2-13b` - base pretrained 13 billion parameter model\n", + "1. `llama-2-70b` - base pretrained 70 billion parameter model\n", + "1. `llama-2-7b-chat` - chat fine-tuned 7 billion parameter model\n", + "1. `llama-2-13b-chat` - chat fine-tuned 13 billion parameter model\n", + "1. `llama-2-70b-chat` - chat fine-tuned 70 billion parameter model (flagship)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Code Llama is a code-focused LLM built on top of Llama 2 also available in various sizes and finetunes:" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Code Llama\n", + "1. `codellama-7b` - code fine-tuned 7 billion parameter model\n", + "1. `codellama-13b` - code fine-tuned 13 billion parameter model\n", + "1. `codellama-34b` - code fine-tuned 34 billion parameter model\n", + "1. `codellama-70b` - code fine-tuned 70 billion parameter model\n", + "1. `codellama-7b-instruct` - code & instruct fine-tuned 7 billion parameter model\n", + "2. `codellama-13b-instruct` - code & instruct fine-tuned 13 billion parameter model\n", + "3. `codellama-34b-instruct` - code & instruct fine-tuned 34 billion parameter model\n", + "3. `codellama-70b-instruct` - code & instruct fine-tuned 70 billion parameter model\n", + "1. `codellama-7b-python` - Python fine-tuned 7 billion parameter model\n", + "2. `codellama-13b-python` - Python fine-tuned 13 billion parameter model\n", + "3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model\n", + "3. `codellama-70b-python` - Python fine-tuned 70 billion parameter model" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Getting an LLM\n", + "\n", + "Large language models are deployed and accessed in a variety of ways, including:\n", + "\n", + "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n", + " * Best for privacy/security or if you already have a GPU.\n", + "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.\n", + " * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).\n", + "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n", + " * Easiest option overall." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Hosted APIs\n", + "\n", + "Hosted APIs are the easiest way to get started. We'll use them here. There are usually two main endpoints:\n", + "\n", + "1. **`completion`**: generate a response to a given prompt (a string).\n", + "1. **`chat_completion`**: generate the next message in a list of messages, enabling more explicit instruction and context for use cases like chatbots." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tokens\n", + "\n", + "LLMs process inputs and outputs in chunks called *tokens*. Think of these, roughly, as words – each model will have its own tokenization scheme. For example, this sentence...\n", + "\n", + "> Our destiny is written in the stars.\n", + "\n", + "...is tokenized into `[\"Our\", \" destiny\", \" is\", \" written\", \" in\", \" the\", \" stars\", \".\"]` for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.\n", + "\n", + "Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n", + "\n", + "Each model has a maximum context length that your prompt cannot exceed. That's 128k tokens for Llama 3.1, 4K for Llama 2, and 100K for Code Llama.\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Notebook Setup\n", + "\n", + "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3.1 chat using [Grok](https://console.groq.com/playground?model=llama3-70b-8192).\n", + "\n", + "To install prerequisites run:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "!{sys.executable} -m pip install groq" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from typing import Dict, List\n", + "from groq import Groq\n", + "\n", + "# Get a free API key from https://console.groq.com/keys\n", + "os.environ[\"GROQ_API_KEY\"] = \"YOUR_GROQ_API_KEY\"\n", + "\n", + "LLAMA3_405B_INSTRUCT = \"llama-3.1-405b-reasoning\" # Note: Groq currently only gives access here to paying customers for 405B model\n", + "LLAMA3_70B_INSTRUCT = \"llama-3.1-70b-versatile\"\n", + "LLAMA3_8B_INSTRUCT = \"llama3.1-8b-instant\"\n", + "\n", + "DEFAULT_MODEL = LLAMA3_70B_INSTRUCT\n", + "\n", + "client = Groq()\n", + "\n", + "def assistant(content: str):\n", + " return { \"role\": \"assistant\", \"content\": content }\n", + "\n", + "def user(content: str):\n", + " return { \"role\": \"user\", \"content\": content }\n", + "\n", + "def chat_completion(\n", + " messages: List[Dict],\n", + " model = DEFAULT_MODEL,\n", + " temperature: float = 0.6,\n", + " top_p: float = 0.9,\n", + ") -> str:\n", + " response = client.chat.completions.create(\n", + " messages=messages,\n", + " model=model,\n", + " temperature=temperature,\n", + " top_p=top_p,\n", + " )\n", + " return response.choices[0].message.content\n", + " \n", + "\n", + "def completion(\n", + " prompt: str,\n", + " model: str = DEFAULT_MODEL,\n", + " temperature: float = 0.6,\n", + " top_p: float = 0.9,\n", + ") -> str:\n", + " return chat_completion(\n", + " [user(prompt)],\n", + " model=model,\n", + " temperature=temperature,\n", + " top_p=top_p,\n", + " )\n", + "\n", + "def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):\n", + " print(f'==============\\n{prompt}\\n==============')\n", + " response = completion(prompt, model)\n", + " print(response, end='\\n\\n')\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Completion APIs\n", + "\n", + "Let's try Llama 3.1!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\"The typical color of the sky is: \")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\"which model version are you?\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Chat Completion APIs\n", + "Chat completion models provide additional structure to interacting with an LLM. An array of structured message objects is sent to the LLM instead of a single piece of text. This message list provides the LLM with some \"context\" or \"history\" from which to continue.\n", + "\n", + "Typically, each message contains `role` and `content`:\n", + "* Messages with the `system` role are used to provide core instruction to the LLM by developers.\n", + "* Messages with the `user` role are typically human-provided messages.\n", + "* Messages with the `assistant` role are typically generated by the LLM." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = chat_completion(messages=[\n", + " user(\"My favorite color is blue.\"),\n", + " assistant(\"That's great to hear!\"),\n", + " user(\"What is my favorite color?\"),\n", + "])\n", + "print(response)\n", + "# \"Sure, I can help you with that! Your favorite color is blue.\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### LLM Hyperparameters\n", + "\n", + "#### `temperature` & `top_p`\n", + "\n", + "These APIs also take parameters which influence the creativity and determinism of your output.\n", + "\n", + "At each step, LLMs generate a list of most likely tokens and their respective probabilities. The least likely tokens are \"cut\" from the list (based on `top_p`), and then a token is randomly selected from the remaining candidates (`temperature`).\n", + "\n", + "In other words: `top_p` controls the breadth of vocabulary in a generation and `temperature` controls the randomness within that vocabulary. A temperature of ~0 produces *almost* deterministic results.\n", + "\n", + "[Read more about temperature setting here](https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683).\n", + "\n", + "Let's try it out:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def print_tuned_completion(temperature: float, top_p: float):\n", + " response = completion(\"Write a haiku about llamas\", temperature=temperature, top_p=top_p)\n", + " print(f'[temperature: {temperature} | top_p: {top_p}]\\n{response.strip()}\\n')\n", + "\n", + "print_tuned_completion(0.01, 0.01)\n", + "print_tuned_completion(0.01, 0.01)\n", + "# These two generations are highly likely to be the same\n", + "\n", + "print_tuned_completion(1.0, 1.0)\n", + "print_tuned_completion(1.0, 1.0)\n", + "# These two generations are highly likely to be different" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prompting Techniques" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Explicit Instructions\n", + "\n", + "Detailed, explicit instructions produce better results than open-ended prompts:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(prompt=\"Describe quantum physics in one short sentence of no more than 12 words\")\n", + "# Returns a succinct explanation of quantum physics that mentions particles and states existing simultaneously." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can think about giving explicit instructions as using rules and restrictions to how Llama 3 responds to your prompt.\n", + "\n", + "- Stylization\n", + " - `Explain this to me like a topic on a children's educational network show teaching elementary students.`\n", + " - `I'm a software engineer using large language models for summarization. Summarize the following text in under 250 words:`\n", + " - `Give your answer like an old timey private investigator hunting down a case step by step.`\n", + "- Formatting\n", + " - `Use bullet points.`\n", + " - `Return as a JSON object.`\n", + " - `Use less technical terms and help me apply it in my work in communications.`\n", + "- Restrictions\n", + " - `Only use academic papers.`\n", + " - `Never give sources older than 2020.`\n", + " - `If you don't know the answer, say that you don't know.`\n", + "\n", + "Here's an example of giving explicit instructions to give more specific results by limiting the responses to recently created sources." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\"Explain the latest advances in large language models to me.\")\n", + "# More likely to cite sources from 2017\n", + "\n", + "complete_and_print(\"Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.\")\n", + "# Gives more specific advances and only cites sources from 2020" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example Prompting using Zero- and Few-Shot Learning\n", + "\n", + "A shot is an example or demonstration of what type of prompt and response you expect from a large language model. This term originates from training computer vision models on photographs, where one shot was one example or instance that the model used to classify an image ([Fei-Fei et al. (2006)](http://vision.stanford.edu/documents/Fei-FeiFergusPerona2006.pdf)).\n", + "\n", + "#### Zero-Shot Prompting\n", + "\n", + "Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n", + "\n", + "Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\"Text: This was the best movie I've ever seen! \\n The sentiment of the text is: \")\n", + "# Returns positive sentiment\n", + "\n", + "complete_and_print(\"Text: The director was trying too hard. \\n The sentiment of the text is: \")\n", + "# Returns negative sentiment" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### Few-Shot Prompting\n", + "\n", + "Adding specific examples of your desired output generally results in more accurate, consistent output. This technique is called \"few-shot prompting\".\n", + "\n", + "In this example, the generated response follows our desired format that offers a more nuanced sentiment classifer that gives a positive, neutral, and negative response confidence percentage.\n", + "\n", + "See also: [Zhao et al. (2021)](https://arxiv.org/abs/2102.09690), [Liu et al. (2021)](https://arxiv.org/abs/2101.06804), [Su et al. (2022)](https://arxiv.org/abs/2209.01975), [Rubin et al. (2022)](https://arxiv.org/abs/2112.08633).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def sentiment(text):\n", + " response = chat_completion(messages=[\n", + " user(\"You are a sentiment classifier. For each message, give the percentage of positive/netural/negative.\"),\n", + " user(\"I liked it\"),\n", + " assistant(\"70% positive 30% neutral 0% negative\"),\n", + " user(\"It could be better\"),\n", + " assistant(\"0% positive 50% neutral 50% negative\"),\n", + " user(\"It's fine\"),\n", + " assistant(\"25% positive 50% neutral 25% negative\"),\n", + " user(text),\n", + " ])\n", + " return response\n", + "\n", + "def print_sentiment(text):\n", + " print(f'INPUT: {text}')\n", + " print(sentiment(text))\n", + "\n", + "print_sentiment(\"I thought it was okay\")\n", + "# More likely to return a balanced mix of positive, neutral, and negative\n", + "print_sentiment(\"I loved it!\")\n", + "# More likely to return 100% positive\n", + "print_sentiment(\"Terrible service 0/10\")\n", + "# More likely to return 100% negative" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Role Prompting\n", + "\n", + "Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n", + "\n", + "Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\"Explain the pros and cons of using PyTorch.\")\n", + "# More likely to explain the pros and cons of PyTorch covers general areas like documentation, the PyTorch community, and mentions a steep learning curve\n", + "\n", + "complete_and_print(\"Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.\")\n", + "# Often results in more technical benefits and drawbacks that provide more technical details on how model layers" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Chain-of-Thought\n", + "\n", + "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting.\n", + "\n", + "Llama 3.1 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompt = \"Who lived longer, Mozart or Elvis?\"\n", + "\n", + "complete_and_print(prompt)\n", + "# Llama 2 would often give the incorrect answer of \"Mozart\"\n", + "\n", + "complete_and_print(f\"{prompt} Let's think through this carefully, step by step.\")\n", + "# Gives the correct answer \"Elvis\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Self-Consistency\n", + "\n", + "LLMs are probablistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) introduces enhanced accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "from statistics import mode\n", + "\n", + "def gen_answer():\n", + " response = completion(\n", + " \"John found that the average of 15 numbers is 40.\"\n", + " \"If 10 is added to each number then the mean of the numbers is?\"\n", + " \"Report the answer surrounded by backticks (example: `123`)\",\n", + " )\n", + " match = re.search(r'`(\\d+)`', response)\n", + " if match is None:\n", + " return None\n", + " return match.group(1)\n", + "\n", + "answers = [gen_answer() for i in range(5)]\n", + "\n", + "print(\n", + " f\"Answers: {answers}\\n\",\n", + " f\"Final answer: {mode(answers)}\",\n", + " )\n", + "\n", + "# Sample runs of Llama-3-70B (all correct):\n", + "# ['60', '50', '50', '50', '50'] -> 50\n", + "# ['50', '50', '50', '60', '50'] -> 50\n", + "# ['50', '50', '60', '50', '50'] -> 50" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Retrieval-Augmented Generation\n", + "\n", + "You'll probably want to use factual knowledge in your application. You can extract common facts from today's large models out-of-the-box (i.e. using just the model weights):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\"What is the capital of the California?\")\n", + "# Gives the correct answer \"Sacramento\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, more specific facts, or private information, cannot be reliably retrieved. The model will either declare it does not know or hallucinate an incorrect answer:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\"What was the temperature in Menlo Park on December 12th, 2023?\")\n", + "# \"I'm just an AI, I don't have access to real-time weather data or historical weather records.\"\n", + "\n", + "complete_and_print(\"What time is my dinner reservation on Saturday and what should I wear?\")\n", + "# \"I'm not able to access your personal information [..] I can provide some general guidance\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt you've retrived from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning which may be costly and negatively impact the foundational model's capabilities.\n", + "\n", + "This could be as simple as a lookup table or as sophisticated as a [vector database]([FAISS](https://github.com/facebookresearch/faiss)) containing all of your company's knowledge:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MENLO_PARK_TEMPS = {\n", + " \"2023-12-11\": \"52 degrees Fahrenheit\",\n", + " \"2023-12-12\": \"51 degrees Fahrenheit\",\n", + " \"2023-12-13\": \"51 degrees Fahrenheit\",\n", + "}\n", + "\n", + "\n", + "def prompt_with_rag(retrived_info, question):\n", + " complete_and_print(\n", + " f\"Given the following information: '{retrived_info}', respond to: '{question}'\"\n", + " )\n", + "\n", + "\n", + "def ask_for_temperature(day):\n", + " temp_on_day = MENLO_PARK_TEMPS.get(day) or \"unknown temperature\"\n", + " prompt_with_rag(\n", + " f\"The temperature in Menlo Park was {temp_on_day} on {day}'\", # Retrieved fact\n", + " f\"What is the temperature in Menlo Park on {day}?\", # User question\n", + " )\n", + "\n", + "\n", + "ask_for_temperature(\"2023-12-12\")\n", + "# \"Sure! The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit.\"\n", + "\n", + "ask_for_temperature(\"2023-07-18\")\n", + "# \"I'm not able to provide the temperature in Menlo Park on 2023-07-18 as the information provided states that the temperature was unknown.\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Program-Aided Language Models\n", + "\n", + "LLMs, by nature, aren't great at performing calculations. Let's try:\n", + "\n", + "$$\n", + "((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n", + "$$\n", + "\n", + "(The correct answer is 91383.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\"\"\"\n", + "Calculate the answer to the following math problem:\n", + "\n", + "((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n", + "\"\"\")\n", + "# Gives incorrect answers like 92448, 92648, 95463" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Gao et al. (2022)](https://arxiv.org/abs/2211.10435) introduced the concept of \"Program-aided Language Models\" (PAL). While LLMs are bad at arithmetic, they're great for code generation. PAL leverages this fact by instructing the LLM to write code to solve calculation tasks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\n", + " \"\"\"\n", + " # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n", + " \"\"\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The following code was generated by Llama 3 70B:\n", + "\n", + "result = ((-5 + 93 * 4 - 0) * (4**4 - 7 + 0 * 5))\n", + "print(result)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Limiting Extraneous Tokens\n", + "\n", + "A common struggle with Llama 2 is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\"), even if explicit instructions are given to Llama 2 to be concise and no preamble. Llama 3.x can better follow instructions.\n", + "\n", + "Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "complete_and_print(\n", + " \"Give me the zip code for Menlo Park in JSON format with the field 'zip_code'\",\n", + ")\n", + "# Likely returns the JSON and also \"Sure! Here's the JSON...\"\n", + "\n", + "complete_and_print(\n", + " \"\"\"\n", + " You are a robot that only outputs JSON.\n", + " You reply in JSON format with the field 'zip_code'.\n", + " Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}\n", + " Now here is my question: What is the zip code of Menlo Park?\n", + " \"\"\",\n", + ")\n", + "# \"{'zip_code': 94025}\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Additional References\n", + "- [PromptingGuide.ai](https://www.promptingguide.ai/)\n", + "- [LearnPrompting.org](https://learnprompting.org/)\n", + "- [Lil'Log Prompt Engineering Guide](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Author & Contact\n", + "\n", + "Edited by [Dalton Flanagan](https://www.linkedin.com/in/daltonflanagan/) (dalton@meta.com) with contributions from Mohsen Agsen, Bryce Bortree, Ricardo Juan Palma Duran, Kaolin Fire, Thomas Scialom." + ] + } + ], + "metadata": { + "captumWidgetMessage": [], + "dataExplorerConfig": [], + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.14" + }, + "last_base_url": "https://bento.edge.x2p.facebook.net/", + "last_kernel_id": "161e2a7b-2d2b-4995-87f3-d1539860ecac", + "last_msg_id": "4eab1242-d815b886ebe4f5b1966da982_543", + "last_server_session_id": "4a7b41c5-ed66-4dcb-a376-22673aebb469", + "operator_data": [], + "outputWidgetContext": [] + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/safety101.md b/docs/safety101.md new file mode 100644 index 000000000..2bf8f1bfe --- /dev/null +++ b/docs/safety101.md @@ -0,0 +1,52 @@ +## Safety API 101 + +This document talks about the Safety APIs in Llama Stack. + +As outlined in our [Responsible Use Guide](https://www.llama.com/docs/how-to-guides/responsible-use-guide-resources/), LLM apps should deploy appropriate system level safeguards to mitigate safety and security risks of LLM system, similar to the following diagram: +![Figure 1: Safety System](./safety_system.webp) + +To that goal, Llama Stack uses **Prompt Guard** and **Llama Guard 3** to secure our system. Here are the quick introduction about them. + +**Prompt Guard**: + +PromptGuard is a classifier model trained on a large corpus of attacks, which is capable of detecting both explicitly malicious prompts (Jailbreaks) as well as prompts that contain injected inputs (Prompt Injections). We suggest a methodology of fine-tuning the model to application-specific data to achieve optimal results. + +PromptGuard is a BERT model that outputs only labels; unlike LlamaGuard, it doesn't need a specific prompt structure or configuration. The input is a string that the model labels as safe or unsafe (at two different levels). + +For more detail on PromptGuard, please checkout [PromptGuard model card and prompt formats](https://www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard) + +**Llama Guard 3**: + +Llama Guard 3 comes in three flavors now: Llama Guard 3 1B, Llama Guard 3 8B and Llama Guard 3 11B-Vision. The first two models are text only, and the third supports the same vision understanding capabilities as the base Llama 3.2 11B-Vision model. All the models are multilingual–for text-only prompts–and follow the categories defined by the ML Commons consortium. Check their respective model cards for additional details on each model and its performance. + +For more detail on Llama Guard 3, please checkout [Llama Guard 3 model card and prompt formats](https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/) + +**CodeShield**: We use [code shield](https://github.com/meta-llama/llama-stack/tree/f04b566c5cfc0d23b59e79103f680fe05ade533d/llama_stack/providers/impls/meta_reference/codeshield) + +### Configure Safety + +```bash +$ llama stack configure ~/.llama/distributions/conda/tgi-build.yaml + +.... +Configuring API: safety (meta-reference) +Do you want to configure llama_guard_shield? (y/n): y +Entering sub-configuration for llama_guard_shield: +Enter value for model (default: Llama-Guard-3-1B) (required): +Enter value for excluded_categories (default: []) (required): +Enter value for disable_input_check (default: False) (required): +Enter value for disable_output_check (default: False) (required): +Do you want to configure prompt_guard_shield? (y/n): y +Entering sub-configuration for prompt_guard_shield: +Enter value for model (default: Prompt-Guard-86M) (required): +.... +``` +As you can see, we did basic configuration above and configured: +- Llama Guard safety shield with model `Llama-Guard-3-1B` +- Prompt Guard safety shield with model `Prompt-Guard-86M` + +you can test safety (if you configured llama-guard and/or prompt-guard shields) by: + +```bash +python -m llama_stack.apis.safety.client localhost 5000 +``` diff --git a/docs/safety_system.webp b/docs/safety_system.webp new file mode 100644 index 000000000..e153da05e Binary files /dev/null and b/docs/safety_system.webp differ