The default recommended text splitter is the `RecursiveCharacterTextSplitter`. This text splitter takes a list of characters. It tries to create chunks by splitting on the first character, but if any chunks are too large it moves on to the next character, and so forth. By default, the characters it tries to split on are `["\n\n", "\n", " ", ""]`.

In addition to controlling which characters to split on, you can also control a few other things:

- `length_function`: how the length of chunks is calculated. Defaults to just counting the number of characters, but it's pretty common to pass a token counter here (see the token-based sketch after the example below).
- `chunk_size`: the maximum size of your chunks (as measured by the length function).
- `chunk_overlap`: the maximum overlap between chunks. It can be nice to have some overlap to maintain continuity between chunks (e.g., with a sliding window).
- `add_start_index`: whether to include the starting position of each chunk within the original document in the metadata.


```python
# This is a long document we can split up.
with open('../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
```


```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
```


```python
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    add_start_index=True,
)
```


```python
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
```

<CodeOutputBlock lang="python">

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' metadata={'start_index': 0}
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' metadata={'start_index': 82}
```

</CodeOutputBlock>
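
As noted in the parameter list above, `length_function` doesn't have to count characters. Here's a minimal sketch of measuring chunk size in tokens instead, assuming the `tiktoken` package is installed; the `cl100k_base` encoding and the `token_length` helper are illustrative choices, not part of the example above:

```python
import tiktoken

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Measure chunks in tokens rather than characters.
encoding = tiktoken.get_encoding("cl100k_base")

def token_length(text: str) -> int:
    return len(encoding.encode(text))

token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # now 100 tokens, not 100 characters
    chunk_overlap=20,
    length_function=token_length,
)
token_texts = token_splitter.create_documents([state_of_the_union])
```

Text splitters also expose a `from_tiktoken_encoder` convenience constructor that wires up the same idea for you.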

## Other transformations

### Filter redundant docs, translate docs, extract metadata, and more

We can perform a number of transformations on docs beyond simply splitting the text. With the
`EmbeddingsRedundantFilter` we can identify similar documents and filter out redundancies. With integrations like
[doctran](https://github.com/psychic-api/doctran/tree/main) we can do things like translate documents from one language
to another, extract desired properties and add them to metadata, and convert conversational dialogue into a Q/A-formatted
set of documents.
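
For instance, here's a minimal sketch of the `EmbeddingsRedundantFilter`, assuming an OpenAI API key is configured; `OpenAIEmbeddings` is just one possible embeddings implementation, and `texts` is the list of documents produced earlier:

```python
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.embeddings import OpenAIEmbeddings

# Compare document embeddings and drop near-duplicates of earlier documents.
redundant_filter = EmbeddingsRedundantFilter(embeddings=OpenAIEmbeddings())
filtered_docs = redundant_filter.transform_documents(texts)
```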

## Split by character

In contrast to the recursive splitter, the `CharacterTextSplitter` splits on a single separator:

```python
# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
```


```python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
```


```python
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
```

<CodeOutputBlock lang="python">

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={} lookup_index=0
```

</CodeOutputBlock>

Here's an example of passing metadata along with the documents; notice that the metadata is split along with the documents.

```python
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)
print(documents[0])
```

<CodeOutputBlock lang="python">

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={'document': 1} lookup_index=0
```

</CodeOutputBlock>


```python
text_splitter.split_text(state_of_the_union)[0]
```

<CodeOutputBlock lang="python">

```
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```

</CodeOutputBlock>

Note that `split_text` returns plain strings, while `create_documents` wraps each chunk in a `Document`.

## Split code

`RecursiveCharacterTextSplitter` also ships pre-built separator lists for a number of programming and markup languages, exposed through the `Language` enum:

```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)
```


```python
# Full list of supported languages
[e.value for e in Language]
```

<CodeOutputBlock lang="python">

```
['cpp',
 'go',
 'java',
 'js',
 'php',
 'proto',
 'python',
 'rst',
 'ruby',
 'rust',
 'scala',
 'swift',
 'markdown',
 'latex',
 'html',
 'sol']
```

</CodeOutputBlock>


```python
# You can also see the separators used for a given language
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
```

<CodeOutputBlock lang="python">

```
['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
```

</CodeOutputBlock>

## Python

Here's an example using the Python text splitter:

```python
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='def hello_world():\n    print("Hello, World!")', metadata={}),
 Document(page_content='# Call the function\nhello_world()', metadata={})]
```

</CodeOutputBlock>

## JS

Here's an example using the JS text splitter:

```python
JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}', metadata={}),
 Document(page_content='// Call the function\nhelloWorld();', metadata={})]
```

</CodeOutputBlock>

## Markdown

Here's an example using the Markdown text splitter:

````python
markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
# Hopefully this code block isn't split
pip install langchain
```

As an open source project in a rapidly developing field, we are extremely open to contributions.
"""
````


```python
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='# 🦜️🔗 LangChain', metadata={}),
 Document(page_content='⚡ Building applications with LLMs through composability ⚡', metadata={}),
 Document(page_content='## Quick Install', metadata={}),
 Document(page_content="```bash\n# Hopefully this code block isn't split", metadata={}),
 Document(page_content='pip install langchain', metadata={}),
 Document(page_content='```', metadata={}),
 Document(page_content='As an open source project in a rapidly developing field, we', metadata={}),
 Document(page_content='are extremely open to contributions.', metadata={})]
```

</CodeOutputBlock>

## LaTeX

Here's an example with LaTeX text:

```python
# Use a raw string so backslash commands like \begin aren't interpreted as escapes.
latex_text = r"""
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""
```


```python
latex_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.LATEX, chunk_size=60, chunk_overlap=0
)
latex_docs = latex_splitter.create_documents([latex_text])
latex_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='\\documentclass{article}\n\n\\begin{document}\n\n\\maketitle', metadata={}),
 Document(page_content='\\section{Introduction}', metadata={}),
 Document(page_content='Large language models (LLMs) are a type of machine learning', metadata={}),
 Document(page_content='model that can be trained on vast amounts of text data to', metadata={}),
 Document(page_content='generate human-like language. In recent years, LLMs have', metadata={}),
 Document(page_content='made significant advances in a variety of natural language', metadata={}),
 Document(page_content='processing tasks, including language translation, text', metadata={}),
 Document(page_content='generation, and sentiment analysis.', metadata={}),
 Document(page_content='\\subsection{History of LLMs}', metadata={}),
 Document(page_content='The earliest LLMs were developed in the 1980s and 1990s,', metadata={}),
 Document(page_content='but they were limited by the amount of data that could be', metadata={}),
 Document(page_content='processed and the computational power available at the', metadata={}),
 Document(page_content='time. In the past decade, however, advances in hardware and', metadata={}),
 Document(page_content='software have made it possible to train LLMs on massive', metadata={}),
 Document(page_content='datasets, leading to significant improvements in', metadata={}),
 Document(page_content='performance.', metadata={}),
 Document(page_content='\\subsection{Applications of LLMs}', metadata={}),
 Document(page_content='LLMs have many applications in industry, including', metadata={}),
 Document(page_content='chatbots, content creation, and virtual assistants. They', metadata={}),
 Document(page_content='can also be used in academia for research in linguistics,', metadata={}),
 Document(page_content='psychology, and computational linguistics.', metadata={}),
 Document(page_content='\\end{document}', metadata={})]
```

</CodeOutputBlock>

## HTML

Here's an example using an HTML text splitter:

```python
html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""
```


```python
html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)
html_docs = html_splitter.create_documents([html_text])
html_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='<!DOCTYPE html>\n<html>', metadata={}),
 Document(page_content='<head>\n        <title>🦜️🔗 LangChain</title>', metadata={}),
 Document(page_content='<style>\n            body {\n                font-family: Aria', metadata={}),
 Document(page_content='l, sans-serif;\n            }\n            h1 {', metadata={}),
 Document(page_content='color: darkblue;\n            }\n        </style>\n    </head', metadata={}),
 Document(page_content='>', metadata={}),
 Document(page_content='<body>', metadata={}),
 Document(page_content='<div>\n            <h1>🦜️🔗 LangChain</h1>', metadata={}),
 Document(page_content='<p>⚡ Building applications with LLMs through composability ⚡', metadata={}),
 Document(page_content='</p>\n        </div>', metadata={}),
 Document(page_content='<div>\n            As an open source project in a rapidly dev', metadata={}),
 Document(page_content='eloping field, we are extremely open to contributions.', metadata={}),
 Document(page_content='</div>\n    </body>\n</html>', metadata={})]
```

</CodeOutputBlock>

## Solidity

Here's an example using the Solidity text splitter:

```python
SOL_CODE = """
pragma solidity ^0.8.20;

contract HelloWorld {
    function add(uint a, uint b) pure public returns(uint) {
        return a + b;
    }
}
"""

sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)
sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs
```

<CodeOutputBlock>

```
[
    Document(page_content='pragma solidity ^0.8.20;', metadata={}),
    Document(page_content='contract HelloWorld {\n    function add(uint a, uint b) pure public returns(uint) {\n        return a + b;\n    }\n}', metadata={})
]
```

</CodeOutputBlock>

## Recursively split by character

```python
# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
```


```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
```


```python
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)
```


```python
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
```

<CodeOutputBlock lang="python">

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' lookup_str='' metadata={} lookup_index=0
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0
```

</CodeOutputBlock>


```python
text_splitter.split_text(state_of_the_union)[:2]
```

<CodeOutputBlock lang="python">

```
['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']
```

</CodeOutputBlock>