forked from phoenix/litellm-mirror
add everyting for docs
This commit is contained in:
parent
de45a738ee
commit
0fe8799f94
1015 changed files with 185353 additions and 0 deletions
|
@ -0,0 +1,269 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Doctran Extract Properties\n",
|
||||
"\n",
|
||||
"We can extract useful features of documents using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to extract specific metadata.\n",
|
||||
"\n",
|
||||
"Extracting metadata from documents is helpful for a variety of tasks, including:\n",
|
||||
"* Classification: classifying documents into different categories\n",
|
||||
"* Data mining: Extract structured data that can be used for data analysis\n",
|
||||
"* Style transfer: Change the way text is written to more closely match expected user input, improving vector search results"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install doctran"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"from langchain.schema import Document\n",
|
||||
"from langchain.document_transformers import DoctranPropertyExtractor"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"True"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from dotenv import load_dotenv\n",
|
||||
"\n",
|
||||
"load_dotenv()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Input\n",
|
||||
"This is the document we'll extract properties from."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[Generated with ChatGPT]\n",
|
||||
"\n",
|
||||
"Confidential Document - For Internal Use Only\n",
|
||||
"\n",
|
||||
"Date: July 1, 2023\n",
|
||||
"\n",
|
||||
"Subject: Updates and Discussions on Various Topics\n",
|
||||
"\n",
|
||||
"Dear Team,\n",
|
||||
"\n",
|
||||
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
|
||||
"\n",
|
||||
"Security and Privacy Measures\n",
|
||||
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
|
||||
"\n",
|
||||
"HR Updates and Employee Benefits\n",
|
||||
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
|
||||
"\n",
|
||||
"Marketing Initiatives and Campaigns\n",
|
||||
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
|
||||
"\n",
|
||||
"Research and Development Projects\n",
|
||||
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
|
||||
"\n",
|
||||
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
|
||||
"\n",
|
||||
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
|
||||
"\n",
|
||||
"Best regards,\n",
|
||||
"\n",
|
||||
"Jason Fan\n",
|
||||
"Cofounder & CEO\n",
|
||||
"Psychic\n",
|
||||
"jason@psychic.dev\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"sample_text = \"\"\"[Generated with ChatGPT]\n",
|
||||
"\n",
|
||||
"Confidential Document - For Internal Use Only\n",
|
||||
"\n",
|
||||
"Date: July 1, 2023\n",
|
||||
"\n",
|
||||
"Subject: Updates and Discussions on Various Topics\n",
|
||||
"\n",
|
||||
"Dear Team,\n",
|
||||
"\n",
|
||||
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
|
||||
"\n",
|
||||
"Security and Privacy Measures\n",
|
||||
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
|
||||
"\n",
|
||||
"HR Updates and Employee Benefits\n",
|
||||
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
|
||||
"\n",
|
||||
"Marketing Initiatives and Campaigns\n",
|
||||
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
|
||||
"\n",
|
||||
"Research and Development Projects\n",
|
||||
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
|
||||
"\n",
|
||||
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
|
||||
"\n",
|
||||
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
|
||||
"\n",
|
||||
"Best regards,\n",
|
||||
"\n",
|
||||
"Jason Fan\n",
|
||||
"Cofounder & CEO\n",
|
||||
"Psychic\n",
|
||||
"jason@psychic.dev\n",
|
||||
"\"\"\"\n",
|
||||
"print(sample_text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"documents = [Document(page_content=sample_text)]\n",
|
||||
"properties = [\n",
|
||||
" {\n",
|
||||
" \"name\": \"category\",\n",
|
||||
" \"description\": \"What type of email this is.\",\n",
|
||||
" \"type\": \"string\",\n",
|
||||
" \"enum\": [\"update\", \"action_item\", \"customer_feedback\", \"announcement\", \"other\"],\n",
|
||||
" \"required\": True,\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"name\": \"mentions\",\n",
|
||||
" \"description\": \"A list of all people mentioned in this email.\",\n",
|
||||
" \"type\": \"array\",\n",
|
||||
" \"items\": {\n",
|
||||
" \"name\": \"full_name\",\n",
|
||||
" \"description\": \"The full name of the person mentioned.\",\n",
|
||||
" \"type\": \"string\",\n",
|
||||
" },\n",
|
||||
" \"required\": True,\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"name\": \"eli5\",\n",
|
||||
" \"description\": \"Explain this email to me like I'm 5 years old.\",\n",
|
||||
" \"type\": \"string\",\n",
|
||||
" \"required\": True,\n",
|
||||
" },\n",
|
||||
"]\n",
|
||||
"property_extractor = DoctranPropertyExtractor(properties=properties)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Output\n",
|
||||
"After extracting properties from a document, the result will be returned as a new document with properties provided in the metadata"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"extracted_document = await property_extractor.atransform_documents(\n",
|
||||
" documents, properties=properties\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"{\n",
|
||||
" \"extracted_properties\": {\n",
|
||||
" \"category\": \"update\",\n",
|
||||
" \"mentions\": [\n",
|
||||
" \"John Doe\",\n",
|
||||
" \"Jane Smith\",\n",
|
||||
" \"Michael Johnson\",\n",
|
||||
" \"Sarah Thompson\",\n",
|
||||
" \"David Rodriguez\",\n",
|
||||
" \"Jason Fan\"\n",
|
||||
" ],\n",
|
||||
" \"eli5\": \"This is an email from the CEO, Jason Fan, giving updates about different areas in the company. He talks about new security measures and praises John Doe for his work. He also mentions new hires and praises Jane Smith for her work in customer service. The CEO reminds everyone about the upcoming benefits enrollment and says to contact Michael Johnson with any questions. He talks about the marketing team's work and praises Sarah Thompson for increasing their social media followers. There's also a product launch event on July 15th. Lastly, he talks about the research and development projects and praises David Rodriguez for his work. There's a brainstorming session on July 10th.\"\n",
|
||||
" }\n",
|
||||
"}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(json.dumps(extracted_document[0].metadata, indent=2))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,266 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Doctran Interrogate Documents\n",
|
||||
"Documents used in a vector store knowledge base are typically stored in narrative or conversational format. However, most user queries are in question format. If we convert documents into Q&A format before vectorizing them, we can increase the liklihood of retrieving relevant documents, and decrease the liklihood of retrieving irrelevant documents.\n",
|
||||
"\n",
|
||||
"We can accomplish this using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to \"interrogate\" documents.\n",
|
||||
"\n",
|
||||
"See [this notebook](https://github.com/psychic-api/doctran/blob/main/benchmark.ipynb) for benchmarks on vector similarity scores for various queries based on raw documents versus interrogated documents."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install doctran"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"from langchain.schema import Document\n",
|
||||
"from langchain.document_transformers import DoctranQATransformer"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"True"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from dotenv import load_dotenv\n",
|
||||
"\n",
|
||||
"load_dotenv()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Input\n",
|
||||
"This is the document we'll interrogate"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[Generated with ChatGPT]\n",
|
||||
"\n",
|
||||
"Confidential Document - For Internal Use Only\n",
|
||||
"\n",
|
||||
"Date: July 1, 2023\n",
|
||||
"\n",
|
||||
"Subject: Updates and Discussions on Various Topics\n",
|
||||
"\n",
|
||||
"Dear Team,\n",
|
||||
"\n",
|
||||
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
|
||||
"\n",
|
||||
"Security and Privacy Measures\n",
|
||||
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
|
||||
"\n",
|
||||
"HR Updates and Employee Benefits\n",
|
||||
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
|
||||
"\n",
|
||||
"Marketing Initiatives and Campaigns\n",
|
||||
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
|
||||
"\n",
|
||||
"Research and Development Projects\n",
|
||||
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
|
||||
"\n",
|
||||
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
|
||||
"\n",
|
||||
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
|
||||
"\n",
|
||||
"Best regards,\n",
|
||||
"\n",
|
||||
"Jason Fan\n",
|
||||
"Cofounder & CEO\n",
|
||||
"Psychic\n",
|
||||
"jason@psychic.dev\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"sample_text = \"\"\"[Generated with ChatGPT]\n",
|
||||
"\n",
|
||||
"Confidential Document - For Internal Use Only\n",
|
||||
"\n",
|
||||
"Date: July 1, 2023\n",
|
||||
"\n",
|
||||
"Subject: Updates and Discussions on Various Topics\n",
|
||||
"\n",
|
||||
"Dear Team,\n",
|
||||
"\n",
|
||||
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
|
||||
"\n",
|
||||
"Security and Privacy Measures\n",
|
||||
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
|
||||
"\n",
|
||||
"HR Updates and Employee Benefits\n",
|
||||
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
|
||||
"\n",
|
||||
"Marketing Initiatives and Campaigns\n",
|
||||
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
|
||||
"\n",
|
||||
"Research and Development Projects\n",
|
||||
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
|
||||
"\n",
|
||||
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
|
||||
"\n",
|
||||
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
|
||||
"\n",
|
||||
"Best regards,\n",
|
||||
"\n",
|
||||
"Jason Fan\n",
|
||||
"Cofounder & CEO\n",
|
||||
"Psychic\n",
|
||||
"jason@psychic.dev\n",
|
||||
"\"\"\"\n",
|
||||
"print(sample_text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"documents = [Document(page_content=sample_text)]\n",
|
||||
"qa_transformer = DoctranQATransformer()\n",
|
||||
"transformed_document = await qa_transformer.atransform_documents(documents)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Output\n",
|
||||
"After interrogating a document, the result will be returned as a new document with questions and answers provided in the metadata."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"{\n",
|
||||
" \"questions_and_answers\": [\n",
|
||||
" {\n",
|
||||
" \"question\": \"What is the purpose of this document?\",\n",
|
||||
" \"answer\": \"The purpose of this document is to provide important updates and discuss various topics that require the team's attention.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"Who is responsible for enhancing the network security?\",\n",
|
||||
" \"answer\": \"John Doe from the IT department is responsible for enhancing the network security.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"Where should potential security risks or incidents be reported?\",\n",
|
||||
" \"answer\": \"Potential security risks or incidents should be reported to the dedicated team at security@example.com.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"Who has been recognized for outstanding performance in customer service?\",\n",
|
||||
" \"answer\": \"Jane Smith has been recognized for her outstanding performance in customer service.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"When is the open enrollment period for the employee benefits program?\",\n",
|
||||
" \"answer\": \"The document does not specify the exact dates for the open enrollment period for the employee benefits program, but it mentions that it is fast approaching.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"Who should be contacted for questions or assistance regarding the employee benefits program?\",\n",
|
||||
" \"answer\": \"For questions or assistance regarding the employee benefits program, the HR representative, Michael Johnson, should be contacted.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"Who has been acknowledged for managing the company's social media platforms?\",\n",
|
||||
" \"answer\": \"Sarah Thompson has been acknowledged for managing the company's social media platforms.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"When is the upcoming product launch event?\",\n",
|
||||
" \"answer\": \"The upcoming product launch event is on July 15th.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"Who has been recognized for their contributions to the development of the company's technology?\",\n",
|
||||
" \"answer\": \"David Rodriguez has been recognized for his contributions to the development of the company's technology.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"When is the monthly R&D brainstorming session?\",\n",
|
||||
" \"answer\": \"The monthly R&D brainstorming session is scheduled for July 10th.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"question\": \"Who should be contacted for questions or concerns regarding the topics discussed in the document?\",\n",
|
||||
" \"answer\": \"For questions or concerns regarding the topics discussed in the document, Jason Fan, the Cofounder & CEO, should be contacted.\"\n",
|
||||
" }\n",
|
||||
" ]\n",
|
||||
"}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"transformed_document = await qa_transformer.atransform_documents(documents)\n",
|
||||
"print(json.dumps(transformed_document[0].metadata, indent=2))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,208 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Doctran Translate Documents\n",
|
||||
"Comparing documents through embeddings has the benefit of working across multiple languages. \"Harrison says hello\" and \"Harrison dice hola\" will occupy similar positions in the vector space because they have the same meaning semantically.\n",
|
||||
"\n",
|
||||
"However, it can still be useful to use a LLM translate documents into other languages before vectorizing them. This is especially helpful when users are expected to query the knowledge base in different languages, or when state of the art embeddings models are not available for a given language.\n",
|
||||
"\n",
|
||||
"We can accomplish this using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to translate documents between languages."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install doctran"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.schema import Document\n",
|
||||
"from langchain.document_transformers import DoctranTextTranslator"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"True"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from dotenv import load_dotenv\n",
|
||||
"\n",
|
||||
"load_dotenv()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Input\n",
|
||||
"This is the document we'll translate"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sample_text = \"\"\"[Generated with ChatGPT]\n",
|
||||
"\n",
|
||||
"Confidential Document - For Internal Use Only\n",
|
||||
"\n",
|
||||
"Date: July 1, 2023\n",
|
||||
"\n",
|
||||
"Subject: Updates and Discussions on Various Topics\n",
|
||||
"\n",
|
||||
"Dear Team,\n",
|
||||
"\n",
|
||||
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
|
||||
"\n",
|
||||
"Security and Privacy Measures\n",
|
||||
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
|
||||
"\n",
|
||||
"HR Updates and Employee Benefits\n",
|
||||
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
|
||||
"\n",
|
||||
"Marketing Initiatives and Campaigns\n",
|
||||
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
|
||||
"\n",
|
||||
"Research and Development Projects\n",
|
||||
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
|
||||
"\n",
|
||||
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
|
||||
"\n",
|
||||
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
|
||||
"\n",
|
||||
"Best regards,\n",
|
||||
"\n",
|
||||
"Jason Fan\n",
|
||||
"Cofounder & CEO\n",
|
||||
"Psychic\n",
|
||||
"jason@psychic.dev\n",
|
||||
"\"\"\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"documents = [Document(page_content=sample_text)]\n",
|
||||
"qa_translator = DoctranTextTranslator(language=\"spanish\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Output\n",
|
||||
"After translating a document, the result will be returned as a new document with the page_content translated into the target language"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"translated_document = await qa_translator.atransform_documents(documents)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[Generado con ChatGPT]\n",
|
||||
"\n",
|
||||
"Documento confidencial - Solo para uso interno\n",
|
||||
"\n",
|
||||
"Fecha: 1 de julio de 2023\n",
|
||||
"\n",
|
||||
"Asunto: Actualizaciones y discusiones sobre varios temas\n",
|
||||
"\n",
|
||||
"Estimado equipo,\n",
|
||||
"\n",
|
||||
"Espero que este correo electrónico les encuentre bien. En este documento, me gustaría proporcionarles algunas actualizaciones importantes y discutir varios temas que requieren nuestra atención. Por favor, traten la información contenida aquí como altamente confidencial.\n",
|
||||
"\n",
|
||||
"Medidas de seguridad y privacidad\n",
|
||||
"Como parte de nuestro compromiso continuo para garantizar la seguridad y privacidad de los datos de nuestros clientes, hemos implementado medidas robustas en todos nuestros sistemas. Nos gustaría elogiar a John Doe (correo electrónico: john.doe@example.com) del departamento de TI por su diligente trabajo en mejorar nuestra seguridad de red. En adelante, recordamos amablemente a todos que se adhieran estrictamente a nuestras políticas y directrices de protección de datos. Además, si se encuentran con cualquier riesgo de seguridad o incidente potencial, por favor repórtelo inmediatamente a nuestro equipo dedicado en security@example.com.\n",
|
||||
"\n",
|
||||
"Actualizaciones de RRHH y beneficios para empleados\n",
|
||||
"Recientemente, dimos la bienvenida a varios nuevos miembros del equipo que han hecho contribuciones significativas a sus respectivos departamentos. Me gustaría reconocer a Jane Smith (SSN: 049-45-5928) por su sobresaliente rendimiento en el servicio al cliente. Jane ha recibido constantemente comentarios positivos de nuestros clientes. Además, recuerden que el período de inscripción abierta para nuestro programa de beneficios para empleados se acerca rápidamente. Si tienen alguna pregunta o necesitan asistencia, por favor contacten a nuestro representante de RRHH, Michael Johnson (teléfono: 418-492-3850, correo electrónico: michael.johnson@example.com).\n",
|
||||
"\n",
|
||||
"Iniciativas y campañas de marketing\n",
|
||||
"Nuestro equipo de marketing ha estado trabajando activamente en el desarrollo de nuevas estrategias para aumentar la conciencia de marca y fomentar la participación del cliente. Nos gustaría agradecer a Sarah Thompson (teléfono: 415-555-1234) por sus excepcionales esfuerzos en la gestión de nuestras plataformas de redes sociales. Sarah ha aumentado con éxito nuestra base de seguidores en un 20% solo en el último mes. Además, por favor marquen sus calendarios para el próximo evento de lanzamiento de producto el 15 de julio. Animamos a todos los miembros del equipo a asistir y apoyar este emocionante hito para nuestra empresa.\n",
|
||||
"\n",
|
||||
"Proyectos de investigación y desarrollo\n",
|
||||
"En nuestra búsqueda de la innovación, nuestro departamento de investigación y desarrollo ha estado trabajando incansablemente en varios proyectos. Me gustaría reconocer el excepcional trabajo de David Rodríguez (correo electrónico: david.rodriguez@example.com) en su papel de líder de proyecto. Las contribuciones de David al desarrollo de nuestra tecnología de vanguardia han sido fundamentales. Además, nos gustaría recordar a todos que compartan sus ideas y sugerencias para posibles nuevos proyectos durante nuestra sesión de lluvia de ideas de I+D mensual, programada para el 10 de julio.\n",
|
||||
"\n",
|
||||
"Por favor, traten la información de este documento con la máxima confidencialidad y asegúrense de que no se comparte con personas no autorizadas. Si tienen alguna pregunta o inquietud sobre los temas discutidos, no duden en ponerse en contacto conmigo directamente.\n",
|
||||
"\n",
|
||||
"Gracias por su atención, y sigamos trabajando juntos para alcanzar nuestros objetivos.\n",
|
||||
"\n",
|
||||
"Saludos cordiales,\n",
|
||||
"\n",
|
||||
"Jason Fan\n",
|
||||
"Cofundador y CEO\n",
|
||||
"Psychic\n",
|
||||
"jason@psychic.dev\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(translated_document[0].page_content)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
133
docs/extras/integrations/document_transformers/html2text.ipynb
Normal file
133
docs/extras/integrations/document_transformers/html2text.ipynb
Normal file
|
@ -0,0 +1,133 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fe6e5c82",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# html2text\n",
|
||||
"\n",
|
||||
"[html2text](https://github.com/Alir3z4/html2text/) is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. \n",
|
||||
"\n",
|
||||
"The ASCII also happens to be valid Markdown (a text-to-HTML format)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "ce77e0cb",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install html2text"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "8ca0974b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Fetching pages: 100%|############| 2/2 [00:00<00:00, 10.75it/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain.document_loaders import AsyncHtmlLoader\n",
|
||||
"\n",
|
||||
"urls = [\"https://www.espn.com\", \"https://lilianweng.github.io/posts/2023-06-23-agent/\"]\n",
|
||||
"loader = AsyncHtmlLoader(urls)\n",
|
||||
"docs = loader.load()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "ddf2be97",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_transformers import Html2TextTransformer"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "a95a928c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"urls = [\"https://www.espn.com\", \"https://lilianweng.github.io/posts/2023-06-23-agent/\"]\n",
|
||||
"html2text = Html2TextTransformer()\n",
|
||||
"docs_transformed = html2text.transform_documents(docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "18ef9fe9",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\" * ESPNFC\\n\\n * X Games\\n\\n * SEC Network\\n\\n## ESPN Apps\\n\\n * ESPN\\n\\n * ESPN Fantasy\\n\\n## Follow ESPN\\n\\n * Facebook\\n\\n * Twitter\\n\\n * Instagram\\n\\n * Snapchat\\n\\n * YouTube\\n\\n * The ESPN Daily Podcast\\n\\n2023 FIFA Women's World Cup\\n\\n## Follow live: Canada takes on Nigeria in group stage of Women's World Cup\\n\\n2m\\n\\nEPA/Morgan Hancock\\n\\n## TOP HEADLINES\\n\\n * Snyder fined $60M over findings in investigation\\n * NFL owners approve $6.05B sale of Commanders\\n * Jags assistant comes out as gay in NFL milestone\\n * O's alone atop East after topping slumping Rays\\n * ACC's Phillips: Never condoned hazing at NU\\n\\n * Vikings WR Addison cited for driving 140 mph\\n * 'Taking his time': Patient QB Rodgers wows Jets\\n * Reyna got U.S. assurances after Berhalter rehire\\n * NFL Future Power Rankings\\n\\n## USWNT AT THE WORLD CUP\\n\\n### USA VS. VIETNAM: 9 P.M. ET FRIDAY\\n\\n## How do you defend against Alex Morgan? Former opponents sound off\\n\\nThe U.S. forward is unstoppable at this level, scoring 121 goals and adding 49\""
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"docs_transformed[0].page_content[1000:2000]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "6045d660",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\"t's brain,\\ncomplemented by several key components:\\n\\n * **Planning**\\n * Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\\n * Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\\n * **Memory**\\n * Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.\\n * Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\\n * **Tool use**\\n * The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution c\""
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"docs_transformed[1].page_content[1000:2000]"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.16"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
9
docs/extras/integrations/document_transformers/index.mdx
Normal file
9
docs/extras/integrations/document_transformers/index.mdx
Normal file
|
@ -0,0 +1,9 @@
|
|||
---
|
||||
sidebar_position: 0
|
||||
---
|
||||
|
||||
# Document transformers
|
||||
|
||||
import DocCardList from "@theme/DocCardList";
|
||||
|
||||
<DocCardList />
|
|
@ -0,0 +1,261 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# OpenAI Functions Metadata Tagger\n",
|
||||
"\n",
|
||||
"It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.\n",
|
||||
"\n",
|
||||
"The `OpenAIMetadataTagger` document transformer automates this process by extracting metadata from each provided document according to a provided schema. It uses a configurable OpenAI Functions-powered chain under the hood, so if you pass a custom LLM instance, it must be an OpenAI model with functions support. \n",
|
||||
"\n",
|
||||
"**Note:** This document transformer works best with complete documents, so it's best to run it first with whole documents before doing any other splitting or processing!\n",
|
||||
"\n",
|
||||
"For example, let's say you wanted to index a set of movie reviews. You could initialize the document transformer with a valid JSON Schema object as follows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.schema import Document\n",
|
||||
"from langchain.chat_models import ChatOpenAI\n",
|
||||
"from langchain.document_transformers.openai_functions import create_metadata_tagger"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"schema = {\n",
|
||||
" \"properties\": {\n",
|
||||
" \"movie_title\": {\"type\": \"string\"},\n",
|
||||
" \"critic\": {\"type\": \"string\"},\n",
|
||||
" \"tone\": {\"type\": \"string\", \"enum\": [\"positive\", \"negative\"]},\n",
|
||||
" \"rating\": {\n",
|
||||
" \"type\": \"integer\",\n",
|
||||
" \"description\": \"The number of stars the critic rated the movie\",\n",
|
||||
" },\n",
|
||||
" },\n",
|
||||
" \"required\": [\"movie_title\", \"critic\", \"tone\"],\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"# Must be an OpenAI model that supports functions\n",
|
||||
"llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\")\n",
|
||||
"\n",
|
||||
"document_transformer = create_metadata_tagger(metadata_schema=schema, llm=llm)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can then simply pass the document transformer a list of documents, and it will extract metadata from the contents:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"original_documents = [\n",
|
||||
" Document(\n",
|
||||
" page_content=\"Review of The Bee Movie\\nBy Roger Ebert\\n\\nThis is the greatest movie ever made. 4 out of 5 stars.\"\n",
|
||||
" ),\n",
|
||||
" Document(\n",
|
||||
" page_content=\"Review of The Godfather\\nBy Anonymous\\n\\nThis movie was super boring. 1 out of 5 stars.\",\n",
|
||||
" metadata={\"reliable\": False},\n",
|
||||
" ),\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"enhanced_documents = document_transformer.transform_documents(original_documents)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Review of The Bee Movie\n",
|
||||
"By Roger Ebert\n",
|
||||
"\n",
|
||||
"This is the greatest movie ever made. 4 out of 5 stars.\n",
|
||||
"\n",
|
||||
"{\"movie_title\": \"The Bee Movie\", \"critic\": \"Roger Ebert\", \"tone\": \"positive\", \"rating\": 4}\n",
|
||||
"\n",
|
||||
"---------------\n",
|
||||
"\n",
|
||||
"Review of The Godfather\n",
|
||||
"By Anonymous\n",
|
||||
"\n",
|
||||
"This movie was super boring. 1 out of 5 stars.\n",
|
||||
"\n",
|
||||
"{\"movie_title\": \"The Godfather\", \"critic\": \"Anonymous\", \"tone\": \"negative\", \"rating\": 1, \"reliable\": false}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"\n",
|
||||
"print(\n",
|
||||
" *[d.page_content + \"\\n\\n\" + json.dumps(d.metadata) for d in enhanced_documents],\n",
|
||||
" sep=\"\\n\\n---------------\\n\\n\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The new documents can then be further processed by a text splitter before being loaded into a vector store. Extracted fields will not overwrite existing metadata.\n",
|
||||
"\n",
|
||||
"You can also initialize the document transformer with a Pydantic schema:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Review of The Bee Movie\n",
|
||||
"By Roger Ebert\n",
|
||||
"\n",
|
||||
"This is the greatest movie ever made. 4 out of 5 stars.\n",
|
||||
"\n",
|
||||
"{\"movie_title\": \"The Bee Movie\", \"critic\": \"Roger Ebert\", \"tone\": \"positive\", \"rating\": 4}\n",
|
||||
"\n",
|
||||
"---------------\n",
|
||||
"\n",
|
||||
"Review of The Godfather\n",
|
||||
"By Anonymous\n",
|
||||
"\n",
|
||||
"This movie was super boring. 1 out of 5 stars.\n",
|
||||
"\n",
|
||||
"{\"movie_title\": \"The Godfather\", \"critic\": \"Anonymous\", \"tone\": \"negative\", \"rating\": 1, \"reliable\": false}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from typing import Literal\n",
|
||||
"\n",
|
||||
"from pydantic import BaseModel, Field\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"class Properties(BaseModel):\n",
|
||||
" movie_title: str\n",
|
||||
" critic: str\n",
|
||||
" tone: Literal[\"positive\", \"negative\"]\n",
|
||||
" rating: int = Field(description=\"Rating out of 5 stars\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"document_transformer = create_metadata_tagger(Properties, llm)\n",
|
||||
"enhanced_documents = document_transformer.transform_documents(original_documents)\n",
|
||||
"\n",
|
||||
"print(\n",
|
||||
" *[d.page_content + \"\\n\\n\" + json.dumps(d.metadata) for d in enhanced_documents],\n",
|
||||
" sep=\"\\n\\n---------------\\n\\n\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"\n",
|
||||
"## Customization\n",
|
||||
"\n",
|
||||
"You can pass the underlying tagging chain the standard LLMChain arguments in the document transformer constructor. For example, if you wanted to ask the LLM to focus specific details in the input documents, or extract metadata in a certain style, you could pass in a custom prompt:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Review of The Bee Movie\n",
|
||||
"By Roger Ebert\n",
|
||||
"\n",
|
||||
"This is the greatest movie ever made. 4 out of 5 stars.\n",
|
||||
"\n",
|
||||
"{\"movie_title\": \"The Bee Movie\", \"critic\": \"Roger Ebert\", \"tone\": \"positive\", \"rating\": 4}\n",
|
||||
"\n",
|
||||
"---------------\n",
|
||||
"\n",
|
||||
"Review of The Godfather\n",
|
||||
"By Anonymous\n",
|
||||
"\n",
|
||||
"This movie was super boring. 1 out of 5 stars.\n",
|
||||
"\n",
|
||||
"{\"movie_title\": \"The Godfather\", \"critic\": \"Roger Ebert\", \"tone\": \"negative\", \"rating\": 1, \"reliable\": false}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain.prompts import ChatPromptTemplate\n",
|
||||
"\n",
|
||||
"prompt = ChatPromptTemplate.from_template(\n",
|
||||
" \"\"\"Extract relevant information from the following text.\n",
|
||||
"Anonymous critics are actually Roger Ebert.\n",
|
||||
"\n",
|
||||
"{input}\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"document_transformer = create_metadata_tagger(schema, llm, prompt=prompt)\n",
|
||||
"enhanced_documents = document_transformer.transform_documents(original_documents)\n",
|
||||
"\n",
|
||||
"print(\n",
|
||||
" *[d.page_content + \"\\n\\n\" + json.dumps(d.metadata) for d in enhanced_documents],\n",
|
||||
" sep=\"\\n\\n---------------\\n\\n\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "venv",
|
||||
"language": "python",
|
||||
"name": "venv"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
Loading…
Add table
Add a link
Reference in a new issue