The simplest loader reads in a file as text and places it all into one `Document`.

```python
from langchain.document_loaders import TextLoader

loader = TextLoader("./index.md")
loader.load()
```

<CodeOutputBlock lang="python">

```
[
Document(page_content='---\nsidebar_position: 0\n---\n# Document loaders\n\nUse document loaders to load data from a source as `Document`\'s. A `Document` is a piece of text\nand associated metadata. For example, there are document loaders for loading a simple `.txt` file, for loading the text\ncontents of any web page, or even for loading a transcript of a YouTube video.\n\nEvery document loader exposes two methods:\n1. "Load": load documents from the configured source\n2. "Load and split": load documents from the configured source and split them using the passed in text splitter\n\nThey optionally implement:\n\n3. "Lazy load": load documents into memory lazily\n', metadata={'source': '../docs/docs_skeleton/docs/modules/data_connection/document_loaders/index.md'})
]
```

</CodeOutputBlock>
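
The loaded `Document` above mentions a second method every loader exposes: "load and split." A minimal sketch of that method, assuming the same `./index.md` file and the `RecursiveCharacterTextSplitter` covered later in this guide:

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("./index.md")

# Chunk sizes here are illustrative, not recommendations.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# load_and_split() loads the file and splits it in one step.
docs = loader.load_and_split(text_splitter=splitter)
```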
Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html).

```python
from langchain.document_loaders import DirectoryLoader
```

We can use the `glob` parameter to control which files to load. Note that here it doesn't load the `.rst` file or the `.html` files.

```python
loader = DirectoryLoader('../', glob="**/*.md")
```

```python
docs = loader.load()
```

```python
len(docs)
```

<CodeOutputBlock lang="python">

```
1
```

</CodeOutputBlock>

## Show a progress bar

By default a progress bar will not be shown. To show a progress bar, install the `tqdm` library (e.g. `pip install tqdm`), and set the `show_progress` parameter to `True`.

```python
loader = DirectoryLoader('../', glob="**/*.md", show_progress=True)
docs = loader.load()
```

<CodeOutputBlock lang="python">

```
Requirement already satisfied: tqdm in /Users/jon/.pyenv/versions/3.9.16/envs/microbiome-app/lib/python3.9/site-packages (4.65.0)


0it [00:00, ?it/s]
```

</CodeOutputBlock>

## Use multithreading

By default the loading happens in one thread. To utilize several threads, set the `use_multithreading` flag to `True`.

```python
loader = DirectoryLoader('../', glob="**/*.md", use_multithreading=True)
docs = loader.load()
```

## Change loader class

By default this uses the `UnstructuredLoader` class. However, you can easily change the type of loader.

```python
from langchain.document_loaders import TextLoader
```

```python
loader = DirectoryLoader('../', glob="**/*.md", loader_cls=TextLoader)
```

```python
docs = loader.load()
```

```python
len(docs)
```

<CodeOutputBlock lang="python">

```
1
```

</CodeOutputBlock>

If you need to load Python source code files, use the `PythonLoader`.

```python
from langchain.document_loaders import PythonLoader
```

```python
loader = DirectoryLoader('../../../../../', glob="**/*.py", loader_cls=PythonLoader)
```

```python
docs = loader.load()
```

```python
len(docs)
```

<CodeOutputBlock lang="python">

```
691
```

</CodeOutputBlock>

## Auto-detect file encodings with TextLoader

In this example we will look at some strategies that are useful when loading a large number of arbitrary files from a directory using the `TextLoader` class.

First, to illustrate the problem, let's try to load multiple text files with arbitrary encodings.

```python
path = '../../../../../tests/integration_tests/examples'
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader)
```

### A. Default behavior

```python
loader.load()
```

<HTMLOutputBlock center>

```html
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">
Traceback (most recent call last):

/data/source/langchain/langchain/document_loaders/text.py:29 in load

   26 │   │   text = ""
   27 │   │   with open(self.file_path, encoding=self.encoding) as f:
   28 │   │   │   try:
 ❱ 29 │   │   │   │   text = f.read()
   30 │   │   │   except UnicodeDecodeError as e:
   31 │   │   │   │   if self.autodetect_encoding:
   32 │   │   │   │   │   detected_encodings = self.detect_file_encodings()

/home/spike/.pyenv/versions/3.9.11/lib/python3.9/codecs.py:322 in decode

    319 │   def decode(self, input, final=False):
    320 │   │   # decode input (taking the buffer into account)
    321 │   │   data = self.buffer + input
 ❱  322 │   │   (result, consumed) = self._buffer_decode(data, self.errors, final)
    323 │   │   # keep undecoded input until the next call
    324 │   │   self.buffer = data[consumed:]
    325 │   │   return result

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

in &lt;module&gt;:1

 ❱ 1 loader.load()
   2

/data/source/langchain/langchain/document_loaders/directory.py:84 in load

   81 │   │   │   │   │   │   if self.silent_errors:
   82 │   │   │   │   │   │   │   logger.warning(e)
   83 │   │   │   │   │   │   else:
 ❱ 84 │   │   │   │   │   │   │   raise e
   85 │   │   │   │   │   finally:
   86 │   │   │   │   │   │   if pbar:
   87 │   │   │   │   │   │   │   pbar.update(1)

/data/source/langchain/langchain/document_loaders/directory.py:78 in load

   75 │   │   │   if i.is_file():
   76 │   │   │   │   if _is_visible(i.relative_to(p)) or self.load_hidden:
   77 │   │   │   │   │   try:
 ❱ 78 │   │   │   │   │   │   sub_docs = self.loader_cls(str(i), **self.loader_kwargs).load()
   79 │   │   │   │   │   │   docs.extend(sub_docs)
   80 │   │   │   │   │   except Exception as e:
   81 │   │   │   │   │   │   if self.silent_errors:

/data/source/langchain/langchain/document_loaders/text.py:44 in load

   41 │   │   │   │   │   │   except UnicodeDecodeError:
   42 │   │   │   │   │   │   │   continue
   43 │   │   │   │   else:
 ❱ 44 │   │   │   │   │   raise RuntimeError(f"Error loading {self.file_path}") from e
   45 │   │   │   except Exception as e:
   46 │   │   │   │   raise RuntimeError(f"Error loading {self.file_path}") from e
   47

RuntimeError: Error loading ../../../../../tests/integration_tests/examples/example-non-utf8.txt
</pre>
```

</HTMLOutputBlock>

The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.

With the default behavior of `TextLoader`, any failure to load a document will fail the whole loading process, and no documents are loaded.

### B. Silent fail

We can pass the parameter `silent_errors` to the `DirectoryLoader` to skip the files which could not be loaded and continue the load process.

```python
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, silent_errors=True)
docs = loader.load()
```

<CodeOutputBlock lang="python">

```
Error loading ../../../../../tests/integration_tests/examples/example-non-utf8.txt
```

</CodeOutputBlock>

```python
doc_sources = [doc.metadata['source'] for doc in docs]
doc_sources
```

<CodeOutputBlock lang="python">

```
['../../../../../tests/integration_tests/examples/whatsapp_chat.txt',
 '../../../../../tests/integration_tests/examples/example-utf8.txt']
```

</CodeOutputBlock>

### C. Auto-detect encodings

We can also ask `TextLoader` to auto-detect the file encoding before failing, by passing the `autodetect_encoding` parameter to the loader class.

```python
text_loader_kwargs = {'autodetect_encoding': True}
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
docs = loader.load()
```

```python
doc_sources = [doc.metadata['source'] for doc in docs]
doc_sources
```

<CodeOutputBlock lang="python">

```
['../../../../../tests/integration_tests/examples/example-non-utf8.txt',
 '../../../../../tests/integration_tests/examples/whatsapp_chat.txt',
 '../../../../../tests/integration_tests/examples/example-utf8.txt']
```

</CodeOutputBlock>
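
The two strategies can also be combined. A minimal sketch (reusing `path` from above, and not shown in the original examples) that first tries to auto-detect each file's encoding and silently skips any file that still fails:

```python
loader = DirectoryLoader(
    path,
    glob="**/*.txt",
    loader_cls=TextLoader,
    silent_errors=True,                              # skip files that cannot be loaded
    loader_kwargs={"autodetect_encoding": True},     # try to detect each file's encoding first
)
docs = loader.load()
```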
We can load HTML documents with the `UnstructuredHTMLLoader`.

```python
from langchain.document_loaders import UnstructuredHTMLLoader
```

```python
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
```

```python
data = loader.load()
```

```python
data
```

<CodeOutputBlock lang="python">

```
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]
```

</CodeOutputBlock>

## Loading HTML with BeautifulSoup4

We can also use `BeautifulSoup4` to load HTML documents using the `BSHTMLLoader`. This will extract the text from the HTML into `page_content`, and the page title as `title` into `metadata`.

```python
from langchain.document_loaders import BSHTMLLoader
```

```python
loader = BSHTMLLoader("example_data/fake-content.html")
data = loader.load()
data
```

<CodeOutputBlock lang="python">

```
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]
```

</CodeOutputBlock>
>The `JSONLoader` uses a specified [jq schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files. It uses the `jq` python package.
>Check this [manual](https://stedolan.github.io/jq/manual/#Basicfilters) for detailed documentation of the `jq` syntax.

```python
#!pip install jq
```

```python
from langchain.document_loaders import JSONLoader
```

```python
import json
from pathlib import Path
from pprint import pprint


file_path = './example_data/facebook_chat.json'
data = json.loads(Path(file_path).read_text())
```

```python
pprint(data)
```

<CodeOutputBlock lang="python">

```
{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              {'content': 'I thought you were selling the blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595140251},
              {'content': 'Im not interested in this bag. Im interested in the '
                          'blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595109305},
              {'content': 'Here is $129',
               'sender_name': 'User 2',
               'timestamp_ms': 1675595068468},
              {'photos': [{'creation_timestamp': 1675595059,
                           'uri': 'url_of_some_picture.jpg'}],
               'sender_name': 'User 2',
               'timestamp_ms': 1675595060730},
              {'content': 'Online is at least $100',
               'sender_name': 'User 2',
               'timestamp_ms': 1675595045152},
              {'content': 'How much do you want?',
               'sender_name': 'User 1',
               'timestamp_ms': 1675594799696},
              {'content': 'Goodmorning! $50 is too low.',
               'sender_name': 'User 2',
               'timestamp_ms': 1675577876645},
              {'content': 'Hi! Im interested in your bag. Im offering $50. Let '
                          'me know if you are interested. Thanks!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675549022673}],
 'participants': [{'name': 'User 1'}, {'name': 'User 2'}],
 'thread_path': 'inbox/User 1 and User 2 chat',
 'title': 'User 1 and User 2 chat'}
```

</CodeOutputBlock>

## Using `JSONLoader`

Suppose we are interested in extracting the values under the `content` field within the `messages` key of the JSON data. This can easily be done through the `JSONLoader` as shown below.

### JSON file

```python
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content')

data = loader.load()
```

```python
pprint(data)
```

<CodeOutputBlock lang="python">

```
[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),
Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3}),
Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4}),
Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5}),
Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6}),
Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7}),
Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8}),
Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9}),
Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),
Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]
```

</CodeOutputBlock>

### JSON Lines file

If you want to load documents from a JSON Lines file, pass `json_lines=True`
and specify a `jq_schema` that extracts `page_content` from a single JSON object.

```python
file_path = './example_data/facebook_chat_messages.jsonl'
pprint(Path(file_path).read_text())
```

<CodeOutputBlock lang="python">

```
('{"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"}\n'
 '{"sender_name": "User 1", "timestamp_ms": 1675597435669, "content": "Oh no '
 'worries! Bye"}\n'
 '{"sender_name": "User 2", "timestamp_ms": 1675596277579, "content": "No Im '
 'sorry it was my mistake, the blue one is not for sale"}\n')
```

</CodeOutputBlock>

```python
loader = JSONLoader(
    file_path='./example_data/facebook_chat_messages.jsonl',
    jq_schema='.content',
    json_lines=True)

data = loader.load()
```

```python
pprint(data)
```

<CodeOutputBlock lang="python">

```
[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}),
Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 2}),
Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 3})]
```

</CodeOutputBlock>

Another option is to set `jq_schema='.'` and provide `content_key`:

```python
loader = JSONLoader(
    file_path='./example_data/facebook_chat_messages.jsonl',
    jq_schema='.',
    content_key='sender_name',
    json_lines=True)

data = loader.load()
```

```python
pprint(data)
```

<CodeOutputBlock lang="python">

```
[Document(page_content='User 2', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}),
Document(page_content='User 1', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 2}),
Document(page_content='User 2', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 3})]
```

</CodeOutputBlock>

## Extracting metadata

Generally, we want to include metadata available in the JSON file in the documents that we create from the content.

The following demonstrates how metadata can be extracted using the `JSONLoader`.

There are some key changes to be noted. In the previous example, where we didn't collect the metadata, we were able to specify directly in the schema where the value for the `page_content` can be extracted from.

```
.messages[].content
```

In the current example, we have to tell the loader to iterate over the records in the `messages` field. The `jq_schema` then has to be:

```
.messages[]
```

This allows us to pass the records (dict) into the `metadata_func` that has to be implemented. The `metadata_func` is responsible for identifying which pieces of information in the record should be included in the metadata stored in the final `Document` object.

Additionally, we now have to explicitly specify in the loader, via the `content_key` argument, the key in the record from which the value for the `page_content` should be extracted.
```python
# Define the metadata extraction function.
def metadata_func(record: dict, metadata: dict) -> dict:

    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")

    return metadata


loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func
)

data = loader.load()
```

```python
pprint(data)
```

<CodeOutputBlock lang="python">

```
[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}),
Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}),
Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5, 'sender_name': 'User 1', 'timestamp_ms': 1675595109305}),
Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6, 'sender_name': 'User 2', 'timestamp_ms': 1675595068468}),
Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7, 'sender_name': 'User 2', 'timestamp_ms': 1675595060730}),
Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8, 'sender_name': 'User 2', 'timestamp_ms': 1675595045152}),
Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9, 'sender_name': 'User 1', 'timestamp_ms': 1675594799696}),
Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10, 'sender_name': 'User 2', 'timestamp_ms': 1675577876645}),
Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11, 'sender_name': 'User 1', 'timestamp_ms': 1675549022673})]
```

</CodeOutputBlock>

Now you will see that the documents contain the metadata associated with the content we extracted.

## The `metadata_func`

As shown above, the `metadata_func` accepts the default metadata generated by the `JSONLoader`. This gives the user full control over how the metadata is formatted.

For example, the default metadata contains the `source` and the `seq_num` keys. However, the JSON data may contain these keys as well. The user can then use the `metadata_func` to rename the default keys and use the ones from the JSON data.

The example below shows how we can modify the `source` so it contains only the file path relative to the `langchain` directory.
```python
# Define the metadata extraction function.
def metadata_func(record: dict, metadata: dict) -> dict:

    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")

    if "source" in metadata:
        source = metadata["source"].split("/")
        source = source[source.index("langchain"):]
        metadata["source"] = "/".join(source)

    return metadata


loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func
)

data = loader.load()
```

```python
pprint(data)
```

<CodeOutputBlock lang="python">

```
[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}),
Document(page_content='I thought you were selling the blue one!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}),
Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5, 'sender_name': 'User 1', 'timestamp_ms': 1675595109305}),
Document(page_content='Here is $129', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6, 'sender_name': 'User 2', 'timestamp_ms': 1675595068468}),
Document(page_content='', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7, 'sender_name': 'User 2', 'timestamp_ms': 1675595060730}),
Document(page_content='Online is at least $100', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8, 'sender_name': 'User 2', 'timestamp_ms': 1675595045152}),
Document(page_content='How much do you want?', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9, 'sender_name': 'User 1', 'timestamp_ms': 1675594799696}),
Document(page_content='Goodmorning! $50 is too low.', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10, 'sender_name': 'User 2', 'timestamp_ms': 1675577876645}),
Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11, 'sender_name': 'User 1', 'timestamp_ms': 1675549022673})]
```

</CodeOutputBlock>

## Common JSON structures with jq schema

The list below provides a reference to the possible `jq_schema` the user can use to extract content from the JSON data, depending on its structure.

```
JSON        -> [{"text": ...}, {"text": ...}, {"text": ...}]
jq_schema   -> ".[].text"

JSON        -> {"key": [{"text": ...}, {"text": ...}, {"text": ...}]}
jq_schema   -> ".key[].text"

JSON        -> ["...", "...", "..."]
jq_schema   -> ".[]"
```
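
For instance, a minimal sketch applying the first pattern above to a hypothetical `./example_data/notes.json` file shaped like `[{"text": ...}, ...]` (the file name is illustrative, not part of the examples above):

```python
from langchain.document_loaders import JSONLoader

# Hypothetical file: a top-level JSON array of objects with a "text" field.
loader = JSONLoader(
    file_path='./example_data/notes.json',
    jq_schema='.[].text')

data = loader.load()
```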
Markdown files can be loaded with the `UnstructuredMarkdownLoader`.

```python
# !pip install unstructured > /dev/null
```

```python
from langchain.document_loaders import UnstructuredMarkdownLoader
```

```python
markdown_path = "../../../../../README.md"
loader = UnstructuredMarkdownLoader(markdown_path)
```

```python
data = loader.load()
```

```python
data
```

<CodeOutputBlock lang="python">

```
[Document(page_content="ð\x9f¦\x9cï¸\x8fð\x9f”\x97 LangChain\n\nâ\x9a¡ Building applications with LLMs through composability â\x9a¡\n\nLooking for the JS/TS version? Check out LangChain.js.\n\nProduction Support: As you move your LangChains into production, we'd love to offer more comprehensive support.\nPlease fill out this form and we'll set up a dedicated support Slack channel.\n\nQuick Install\n\npip install langchain\nor\nconda install langchain -c conda-forge\n\nð\x9f¤” What is this?\n\nLarge language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. However, using these LLMs in isolation is often insufficient for creating a truly powerful app - the real power comes when you can combine them with other sources of computation or knowledge.\n\nThis library aims to assist in the development of those types of applications. Common examples of these applications include:\n\nâ\x9d“ Question Answering over specific documents\n\nDocumentation\n\nEnd-to-end Example: Question Answering over Notion Database\n\nð\x9f’¬ Chatbots\n\nDocumentation\n\nEnd-to-end Example: Chat-LangChain\n\nð\x9f¤\x96 Agents\n\nDocumentation\n\nEnd-to-end Example: GPT+WolframAlpha\n\nð\x9f“\x96 Documentation\n\nPlease see here for full documentation on:\n\nGetting started (installation, setting up the environment, simple examples)\n\nHow-To examples (demos, integrations, helper functions)\n\nReference (full API docs)\n\nResources (high-level explanation of core concepts)\n\nð\x9f\x9a\x80 What can this help with?\n\nThere are six main areas that LangChain is designed to help with.\nThese are, in increasing order of complexity:\n\nð\x9f“\x83 LLMs and Prompts:\n\nThis includes prompt management, prompt optimization, a generic interface for all LLMs, and common utilities for working with LLMs.\n\nð\x9f”\x97 Chains:\n\nChains go beyond a single LLM call and involve sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.\n\nð\x9f“\x9a Data Augmented Generation:\n\nData Augmented Generation involves specific types of chains that first interact with an external data source to fetch data for use in the generation step. Examples include summarization of long pieces of text and question/answering over specific data sources.\n\nð\x9f¤\x96 Agents:\n\nAgents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end-to-end agents.\n\nð\x9f§\xa0 Memory:\n\nMemory refers to persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.\n\nð\x9f§\x90 Evaluation:\n\n[BETA] Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. 
LangChain provides some prompts/chains for assisting in this.\n\nFor more information on these concepts, please see our full documentation.\n\nð\x9f’\x81 Contributing\n\nAs an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.\n\nFor detailed information on how to contribute, see here.", metadata={'source': '../../../../../README.md'})]
```

</CodeOutputBlock>

## Retain Elements

Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode="elements"`.

```python
loader = UnstructuredMarkdownLoader(markdown_path, mode="elements")
```

```python
data = loader.load()
```

```python
data[0]
```

<CodeOutputBlock lang="python">

```
Document(page_content='ð\x9f¦\x9cï¸\x8fð\x9f”\x97 LangChain', metadata={'source': '../../../../../README.md', 'page_number': 1, 'category': 'Title'})
```

</CodeOutputBlock>
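
Once elements are retained, the `category` metadata shown above can be used to filter them. A minimal sketch that keeps only the title elements from the `data` loaded above:

```python
# Filter the loaded elements down to titles using their metadata.
titles = [doc for doc in data if doc.metadata.get("category") == "Title"]
```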
The default recommended text splitter is the `RecursiveCharacterTextSplitter`. This text splitter takes a list of characters. It tries to create chunks by splitting on the first character, but if any chunks are too large it moves on to the next character, and so forth. By default the characters it tries to split on are `["\n\n", "\n", " ", ""]`.

In addition to controlling which characters you can split on, you can also control a few other things:

- `length_function`: how the length of chunks is calculated. Defaults to just counting the number of characters, but it's pretty common to pass a token counter here (see the sketch after the example below).
- `chunk_size`: the maximum size of your chunks (as measured by the length function).
- `chunk_overlap`: the maximum overlap between chunks. It can be nice to have some overlap to maintain continuity between chunks (e.g. do a sliding window).
- `add_start_index`: whether to include the starting position of each chunk within the original document in the metadata.

```python
# This is a long document we can split up.
with open('../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
```

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
```

```python
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 100,
    chunk_overlap = 20,
    length_function = len,
    add_start_index = True,
)
```

```python
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
```

<CodeOutputBlock lang="python">

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' metadata={'start_index': 0}
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' metadata={'start_index': 82}
```

</CodeOutputBlock>
## Other transformations

### Filter redundant docs, translate docs, extract metadata, and more

We can perform a number of transformations on docs beyond simply splitting the text. With the `EmbeddingsRedundantFilter` we can identify similar documents and filter out redundancies. With integrations like [doctran](https://github.com/psychic-api/doctran/tree/main) we can do things like translate documents from one language to another, extract desired properties and add them to metadata, and convert conversational dialogue into a set of Q/A-formatted documents.
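For instance, here is a rough sketch of the redundancy filter (the sample documents are hypothetical, and we assume OpenAI credentials are configured):

```python
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document

docs = [
    Document(page_content="LangChain helps you build LLM apps."),
    Document(page_content="LangChain helps you build LLM applications."),  # near-duplicate
    Document(page_content="Paris is the capital of France."),
]

redundant_filter = EmbeddingsRedundantFilter(embeddings=OpenAIEmbeddings())
# Documents whose embeddings are nearly identical to an earlier one are dropped.
filtered = redundant_filter.transform_documents(docs)
```

Here the second document would likely be filtered out, since its embedding is almost identical to the first.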
@ -1,60 +0,0 @@
```python
# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
```


```python
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator = "\n\n",
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
)
```


```python
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
```

<CodeOutputBlock lang="python">

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={} lookup_index=0
```

</CodeOutputBlock>

Here's an example of passing metadata along with the documents; notice that it is split along with the documents.


```python
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)
print(documents[0])
```

<CodeOutputBlock lang="python">

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={'document': 1} lookup_index=0
```

</CodeOutputBlock>


```python
text_splitter.split_text(state_of_the_union)[0]
```

<CodeOutputBlock lang="python">

```
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```

</CodeOutputBlock>

@ -1,312 +0,0 @@
```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)
```


```python
# Full list of supported languages
[e.value for e in Language]
```

<CodeOutputBlock lang="python">

```
['cpp',
 'go',
 'java',
 'js',
 'php',
 'proto',
 'python',
 'rst',
 'ruby',
 'rust',
 'scala',
 'swift',
 'markdown',
 'latex',
 'html',
 'sol']
```

</CodeOutputBlock>


```python
# You can also see the separators used for a given language
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
```

<CodeOutputBlock lang="python">

```
['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
```

</CodeOutputBlock>

## Python

Here's an example using the Python text splitter:


```python
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='def hello_world():\n print("Hello, World!")', metadata={}),
Document(page_content='# Call the function\nhello_world()', metadata={})]
```

</CodeOutputBlock>

## JS

Here's an example using the JS text splitter:


```python
JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='function helloWorld() {\n console.log("Hello, World!");\n}', metadata={}),
Document(page_content='// Call the function\nhelloWorld();', metadata={})]
```

</CodeOutputBlock>

## Markdown

Here's an example using the Markdown text splitter.


````python
markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
# Hopefully this code block isn't split
pip install langchain
```

As an open source project in a rapidly developing field, we are extremely open to contributions.
"""
````


```python
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='# 🦜️🔗 LangChain', metadata={}),
 Document(page_content='⚡ Building applications with LLMs through composability ⚡', metadata={}),
 Document(page_content='## Quick Install', metadata={}),
 Document(page_content="```bash\n# Hopefully this code block isn't split", metadata={}),
 Document(page_content='pip install langchain', metadata={}),
 Document(page_content='```', metadata={}),
 Document(page_content='As an open source project in a rapidly developing field, we', metadata={}),
 Document(page_content='are extremely open to contributions.', metadata={})]
```

</CodeOutputBlock>

## LaTeX

Here's an example using the LaTeX text splitter. Note the raw string (`r"""`), which keeps LaTeX commands like `\begin` from being read as Python escape sequences:


```python
latex_text = r"""
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""
```


```python
latex_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.LATEX, chunk_size=60, chunk_overlap=0
)
latex_docs = latex_splitter.create_documents([latex_text])
latex_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='\\documentclass{article}\n\n\\begin{document}\n\n\\maketitle', metadata={}),
 Document(page_content='\\section{Introduction}', metadata={}),
 Document(page_content='Large language models (LLMs) are a type of machine learning', metadata={}),
 Document(page_content='model that can be trained on vast amounts of text data to', metadata={}),
 Document(page_content='generate human-like language. In recent years, LLMs have', metadata={}),
 Document(page_content='made significant advances in a variety of natural language', metadata={}),
 Document(page_content='processing tasks, including language translation, text', metadata={}),
 Document(page_content='generation, and sentiment analysis.', metadata={}),
 Document(page_content='\\subsection{History of LLMs}', metadata={}),
 Document(page_content='The earliest LLMs were developed in the 1980s and 1990s,', metadata={}),
 Document(page_content='but they were limited by the amount of data that could be', metadata={}),
 Document(page_content='processed and the computational power available at the', metadata={}),
 Document(page_content='time. In the past decade, however, advances in hardware and', metadata={}),
 Document(page_content='software have made it possible to train LLMs on massive', metadata={}),
 Document(page_content='datasets, leading to significant improvements in', metadata={}),
 Document(page_content='performance.', metadata={}),
 Document(page_content='\\subsection{Applications of LLMs}', metadata={}),
 Document(page_content='LLMs have many applications in industry, including', metadata={}),
 Document(page_content='chatbots, content creation, and virtual assistants. They', metadata={}),
 Document(page_content='can also be used in academia for research in linguistics,', metadata={}),
 Document(page_content='psychology, and computational linguistics.', metadata={}),
 Document(page_content='\\end{document}', metadata={})]
```

</CodeOutputBlock>
## HTML

Here's an example using the HTML text splitter:


```python
html_text = """
<!DOCTYPE html>
<html>
  <head>
    <title>🦜️🔗 LangChain</title>
    <style>
        body {
            font-family: Arial, sans-serif;
        }
        h1 {
            color: darkblue;
        }
    </style>
  </head>
  <body>
    <div>
        <h1>🦜️🔗 LangChain</h1>
        <p>⚡ Building applications with LLMs through composability ⚡</p>
    </div>
    <div>
        As an open source project in a rapidly developing field, we are extremely open to contributions.
    </div>
  </body>
</html>
"""
```


```python
html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)
html_docs = html_splitter.create_documents([html_text])
html_docs
```

<CodeOutputBlock lang="python">

```
[Document(page_content='<!DOCTYPE html>\n<html>', metadata={}),
 Document(page_content='<head>\n    <title>🦜️🔗 LangChain</title>', metadata={}),
 Document(page_content='<style>\n        body {\n            font-family: Aria', metadata={}),
 Document(page_content='l, sans-serif;\n        }\n        h1 {', metadata={}),
 Document(page_content='color: darkblue;\n        }\n    </style>\n  </head', metadata={}),
 Document(page_content='>', metadata={}),
 Document(page_content='<body>', metadata={}),
 Document(page_content='<div>\n        <h1>🦜️🔗 LangChain</h1>', metadata={}),
 Document(page_content='<p>⚡ Building applications with LLMs through composability ⚡', metadata={}),
 Document(page_content='</p>\n    </div>', metadata={}),
 Document(page_content='<div>\n        As an open source project in a rapidly dev', metadata={}),
 Document(page_content='eloping field, we are extremely open to contributions.', metadata={}),
 Document(page_content='</div>\n  </body>\n</html>', metadata={})]
```

</CodeOutputBlock>
## Solidity

Here's an example using the Solidity text splitter:


```python
SOL_CODE = """
pragma solidity ^0.8.20;
contract HelloWorld {
    function add(uint a, uint b) pure public returns(uint) {
        return a + b;
    }
}
"""

sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)
sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs
```

<CodeOutputBlock>

```
[
    Document(page_content='pragma solidity ^0.8.20;', metadata={}),
    Document(page_content='contract HelloWorld {\n    function add(uint a, uint b) pure public returns(uint) {\n        return a + b;\n    }\n}', metadata={})
]
```

</CodeOutputBlock>
@ -1,50 +0,0 @@
```python
# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
```


```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
```


```python
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 100,
    chunk_overlap = 20,
    length_function = len,
)
```


```python
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
```

<CodeOutputBlock lang="python">

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' lookup_str='' metadata={} lookup_index=0
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0
```

</CodeOutputBlock>


```python
text_splitter.split_text(state_of_the_union)[:2]
```

<CodeOutputBlock lang="python">

```
['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']
```

</CodeOutputBlock>
@ -1,261 +0,0 @@
```python
# Helper function for printing docs

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))
```

## Using a vanilla vector store retriever

Let's start by initializing a simple vector store retriever and storing the 2023 State of the Union speech (in chunks). We can see that given an example question our retriever returns one or two relevant docs and a few irrelevant docs. And even the relevant docs have a lot of irrelevant information in them.


```python
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.vectorstores import FAISS

documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()

docs = retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
pretty_print_docs(docs)
```

<CodeOutputBlock lang="python">

```
Document 1:

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.

We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.

We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
----------------------------------------------------------------------------------------------------
Document 3:

And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong.

As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.

And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things.

So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.

First, beat the opioid epidemic.
----------------------------------------------------------------------------------------------------
Document 4:

Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers.

And as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up.

That ends on my watch.

Medicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect.

We’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees.

Let’s pass the Paycheck Fairness Act and paid leave.

Raise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty.

Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.
```
</CodeOutputBlock>

## Adding contextual compression with an `LLMChainExtractor`

Now let's wrap our base retriever with a `ContextualCompressionRetriever`. We'll add an `LLMChainExtractor`, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.


```python
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)
```

<CodeOutputBlock lang="python">

```
Document 1:

"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence."
----------------------------------------------------------------------------------------------------
Document 2:

"A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."
```
</CodeOutputBlock>

## More built-in compressors: filters

### `LLMChainFilter`

The `LLMChainFilter` is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents.


```python
from langchain.retrievers.document_compressors import LLMChainFilter

_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)
```

<CodeOutputBlock lang="python">

```
Document 1:

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
```
</CodeOutputBlock>

### `EmbeddingsFilter`

Making an extra LLM call over each retrieved document is expensive and slow. The `EmbeddingsFilter` provides a cheaper and faster option by embedding the documents and query and only returning those documents which have sufficiently similar embeddings to the query.


```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings = OpenAIEmbeddings()
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)
```

<CodeOutputBlock lang="python">

```
Document 1:

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.

We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.

We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
----------------------------------------------------------------------------------------------------
Document 3:

And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong.

As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.

And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things.

So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.

First, beat the opioid epidemic.
```
</CodeOutputBlock>

# Stringing compressors and document transformers together

Using the `DocumentCompressorPipeline` we can also easily combine multiple compressors in sequence. Along with compressors we can add `BaseDocumentTransformer`s to our pipeline, which don't perform any contextual compression but simply perform some transformation on a set of documents. For example `TextSplitter`s can be used as document transformers to split documents into smaller pieces, and the `EmbeddingsRedundantFilter` can be used to filter out redundant documents based on embedding similarity between documents.

Below we create a compressor pipeline by first splitting our docs into smaller chunks, then removing redundant documents, and then filtering based on relevance to the query.


```python
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)
```


```python
compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)
```

<CodeOutputBlock lang="python">

```
Document 1:

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson
----------------------------------------------------------------------------------------------------
Document 2:

As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year
----------------------------------------------------------------------------------------------------
Document 3:

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder
```
</CodeOutputBlock>

@ -1,254 +0,0 @@
The public API of the `BaseRetriever` class in LangChain is as follows:

```python
from abc import ABC, abstractmethod
from typing import Any, List
from langchain.schema import Document
from langchain.callbacks.manager import Callbacks

class BaseRetriever(ABC):
    ...
    def get_relevant_documents(
        self, query: str, *, callbacks: Callbacks = None, **kwargs: Any
    ) -> List[Document]:
        """Retrieve documents relevant to a query.
        Args:
            query: string to find relevant documents for
            callbacks: Callback manager or list of callbacks
        Returns:
            List of relevant documents
        """
        ...

    async def aget_relevant_documents(
        self, query: str, *, callbacks: Callbacks = None, **kwargs: Any
    ) -> List[Document]:
        """Asynchronously get documents relevant to a query.
        Args:
            query: string to find relevant documents for
            callbacks: Callback manager or list of callbacks
        Returns:
            List of relevant documents
        """
        ...
```

It's that simple! You can call the `get_relevant_documents` or the async `aget_relevant_documents` method to retrieve documents relevant to a query, where "relevance" is defined by the specific retriever object you are calling.
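For example, a quick (hypothetical) usage sketch, where `retriever` stands in for any concrete retriever you've constructed:

```python
# `retriever` is any object implementing the BaseRetriever interface.
docs = retriever.get_relevant_documents("What is LangChain?")
for doc in docs:
    print(doc.page_content[:80])
```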
Of course, we also help construct what we think are useful retrievers. The main type of retriever that we focus on is a vectorstore retriever. We will focus on that for the rest of this guide.

In order to understand what a vectorstore retriever is, it's important to understand what a Vectorstore is. So let's look at that.

By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vectorstore to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.

```
pip install chromadb
```

This example showcases question answering over documents.
We have chosen this as the example for getting started because it nicely combines a lot of different elements (text splitters, embeddings, vectorstores) and then also shows how to use them in a chain.

Question answering over documents consists of four steps:

1. Create an index
2. Create a Retriever from that index
3. Create a question answering chain
4. Ask questions!

Each of the steps has multiple sub-steps and potential configurations. In this notebook we will primarily focus on (1). We will start by showing the one-liner for doing so, but then break down what is actually going on.

First, let's import some common classes we'll use no matter what.


```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
```

Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt).


```python
from langchain.document_loaders import TextLoader
loader = TextLoader('../state_of_the_union.txt', encoding='utf8')
```

## One Line Index Creation

To get started as quickly as possible, we can use the `VectorstoreIndexCreator`.


```python
from langchain.indexes import VectorstoreIndexCreator
```


```python
index = VectorstoreIndexCreator().from_loaders([loader])
```

<CodeOutputBlock lang="python">

```
Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
```

</CodeOutputBlock>

Now that the index is created, we can use it to ask questions of the data! Note that under the hood this is actually doing a few steps as well, which we will cover later in this guide.


```python
query = "What did the president say about Ketanji Brown Jackson"
index.query(query)
```

<CodeOutputBlock lang="python">

```
" The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."
```

</CodeOutputBlock>


```python
query = "What did the president say about Ketanji Brown Jackson"
index.query_with_sources(query)
```

<CodeOutputBlock lang="python">

```
{'question': 'What did the president say about Ketanji Brown Jackson',
'answer': " The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, one of the nation's top legal minds, to continue Justice Breyer's legacy of excellence, and that she has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\n",
 'sources': '../state_of_the_union.txt'}
```

</CodeOutputBlock>

What is returned from the `VectorstoreIndexCreator` is a `VectorStoreIndexWrapper`, which provides the nice `query` and `query_with_sources` functionality. If we just wanted to access the vectorstore directly, we can also do that.


```python
index.vectorstore
```

<CodeOutputBlock lang="python">

```
<langchain.vectorstores.chroma.Chroma at 0x119aa5940>
```

</CodeOutputBlock>

If we then want to access the VectorstoreRetriever, we can do that with:


```python
index.vectorstore.as_retriever()
```

<CodeOutputBlock lang="python">

```
VectorStoreRetriever(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x119aa5940>, search_kwargs={})
```

</CodeOutputBlock>
## Walkthrough

Okay, so what's actually going on? How is this index getting created?

A lot of the magic is being hidden in this `VectorstoreIndexCreator`. What is it doing?

There are three main steps going on after the documents are loaded:

1. Splitting documents into chunks
2. Creating embeddings for each document
3. Storing documents and embeddings in a vectorstore

Let's walk through this in code.


```python
documents = loader.load()
```

Next, we will split the documents into chunks.


```python
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
```

We will then select which embeddings we want to use.


```python
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
```

We now create the vectorstore to use as the index.


```python
from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
```

<CodeOutputBlock lang="python">

```
Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
```

</CodeOutputBlock>

So that's creating the index. Then, we expose this index in a retriever interface.


```python
retriever = db.as_retriever()
```

Then, as before, we create a chain and use it to answer questions!


```python
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
```


```python
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)
```

<CodeOutputBlock lang="python">

```
" The President said that Judge Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He said she is a consensus builder and has received a broad range of support from organizations such as the Fraternal Order of Police and former judges appointed by Democrats and Republicans."
```

</CodeOutputBlock>

`VectorstoreIndexCreator` is just a wrapper around all this logic. It is configurable in the text splitter it uses, the embeddings it uses, and the vectorstore it uses. For example, you can configure it as below:


```python
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=OpenAIEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
)
```

Hopefully this highlights what is going on under the hood of `VectorstoreIndexCreator`. While we think it's important to have a simple way to create indexes, we also think it's important to understand what's going on under the hood.

@ -1,161 +0,0 @@
# Implement a Custom Retriever

In this walkthrough, you will implement a custom retriever in LangChain using a simple dot product distance lookup.

All retrievers inherit from the `BaseRetriever` class and override the following abstract methods:

```python
from abc import ABC, abstractmethod
from typing import Any, List
from langchain.schema import Document
from langchain.callbacks.manager import (
    AsyncCallbackManagerForRetrieverRun,
    CallbackManagerForRetrieverRun,
)

class BaseRetriever(ABC):

    @abstractmethod
    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Get documents relevant to a query.
        Args:
            query: string to find relevant documents for
            run_manager: The callbacks handler to use
        Returns:
            List of relevant documents
        """

    @abstractmethod
    async def _aget_relevant_documents(
        self,
        query: str,
        *,
        run_manager: AsyncCallbackManagerForRetrieverRun,
    ) -> List[Document]:
        """Asynchronously get documents relevant to a query.
        Args:
            query: string to find relevant documents for
            run_manager: The callbacks handler to use
        Returns:
            List of relevant documents
        """
```

The `_get_relevant_documents` and async `_aget_relevant_documents` methods can be implemented however you see fit. The `run_manager` is useful if your retriever calls other traceable LangChain primitives like LLMs, chains, or tools.

Below, we implement an example that fetches the most similar documents from a list of documents using a numpy array of embeddings.


```python
from typing import Any, List, Optional

import numpy as np

from langchain.callbacks.manager import (
    AsyncCallbackManagerForRetrieverRun,
    CallbackManagerForRetrieverRun,
)
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings.base import Embeddings
from langchain.schema import BaseRetriever, Document


class NumpyRetriever(BaseRetriever):
    """Retrieves documents from a numpy array."""

    def __init__(
        self,
        texts: List[str],
        vectors: np.ndarray,
        embeddings: Optional[Embeddings] = None,
        num_to_return: int = 1,
    ) -> None:
        super().__init__()
        self.embeddings = embeddings or OpenAIEmbeddings()
        self.texts = texts
        self.vectors = vectors
        self.num_to_return = num_to_return

    @classmethod
    def from_texts(
        cls,
        texts: List[str],
        embeddings: Optional[Embeddings] = None,
        **kwargs: Any,
    ) -> "NumpyRetriever":
        embeddings = embeddings or OpenAIEmbeddings()
        vectors = np.array(embeddings.embed_documents(texts))
        return cls(texts, vectors, embeddings)

    def _get_relevant_documents_from_query_vector(
        self, vector_query: np.ndarray
    ) -> List[Document]:
        dot_product = np.dot(self.vectors, vector_query)
        # Get the indices of the `num_to_return` documents with the highest dot product
        indices = np.argpartition(
            dot_product, -min(self.num_to_return, len(self.vectors))
        )[-self.num_to_return :]
        # Sort the selected indices by similarity (ascending)
        indices = indices[np.argsort(dot_product[indices])]
        return [
            Document(
                page_content=self.texts[idx],
                metadata={"index": idx},
            )
            for idx in indices
        ]

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Get documents relevant to a query.
        Args:
            query: string to find relevant documents for
            run_manager: The callbacks handler to use
        Returns:
            List of relevant documents
        """
        vector_query = np.array(self.embeddings.embed_query(query))
        return self._get_relevant_documents_from_query_vector(vector_query)

    async def _aget_relevant_documents(
        self,
        query: str,
        *,
        run_manager: AsyncCallbackManagerForRetrieverRun,
    ) -> List[Document]:
        """Asynchronously get documents relevant to a query.
        Args:
            query: string to find relevant documents for
            run_manager: The callbacks handler to use
        Returns:
            List of relevant documents
        """
        query_emb = await self.embeddings.aembed_query(query)
        return self._get_relevant_documents_from_query_vector(np.array(query_emb))
```

The retriever can be instantiated through the class method `from_texts`. It embeds the texts and stores them in a numpy array. To look up documents, it embeds the query and finds the most similar documents using a simple dot product distance.
Once the retriever is implemented, you can use it like any other retriever in LangChain.


```python
retriever = NumpyRetriever.from_texts(texts=["hello world", "goodbye world"])
```

You can then use the retriever to get relevant documents.

```python
retriever.get_relevant_documents("Hi there!")

# [Document(page_content='hello world', metadata={'index': 0})]
```

```python
retriever.get_relevant_documents("Bye!")
# [Document(page_content='goodbye world', metadata={'index': 1})]
```

@ -1,124 +0,0 @@
```python
import faiss

from datetime import datetime, timedelta
from langchain.docstore import InMemoryDocstore
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS
```

## Low Decay Rate

A low `decay rate` (which, to be extreme, we will set close to 0 here) means memories will be "remembered" for longer. A `decay rate` of 0 means memories will never be forgotten, making this retriever equivalent to a vector lookup.
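As a rough sketch of the idea (the formula below is our paraphrase of the retriever's scoring, not code lifted from the library), each document is scored by semantic similarity plus a recency term that decays with the hours since the document was last accessed:

```python
# Assumed scoring sketch: score = semantic_similarity + (1.0 - decay_rate) ** hours_passed
decay_rate = 1e-25  # essentially zero
hours_passed = 24   # last accessed "yesterday"

recency = (1.0 - decay_rate) ** hours_passed
print(recency)  # ~1.0: with a near-zero decay rate, day-old memories stay fresh
```

With a decay rate this small the recency term barely moves, which is why the retriever below behaves like a plain vector lookup.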
```python
# Define your embedding model
embeddings_model = OpenAIEmbeddings()
# Initialize the vectorstore as empty
embedding_size = 1536
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(embeddings_model.embed_query, index, InMemoryDocstore({}), {})
retriever = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore, decay_rate=.0000000000000000000000001, k=1)
```


```python
yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="hello world", metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="hello foo")])
```

<CodeOutputBlock lang="python">

```
['d7f85756-2371-4bdf-9140-052780a0f9b3']
```

</CodeOutputBlock>


```python
# "Hello World" is returned first because it is most salient, and the decay rate is close to 0., meaning it's still recent enough
retriever.get_relevant_documents("hello world")
```

<CodeOutputBlock lang="python">

```
[Document(page_content='hello world', metadata={'last_accessed_at': datetime.datetime(2023, 5, 13, 21, 0, 27, 678341), 'created_at': datetime.datetime(2023, 5, 13, 21, 0, 27, 279596), 'buffer_idx': 0})]
```

</CodeOutputBlock>

## High Decay Rate

With a high `decay rate` (e.g., several 9's), the `recency score` quickly goes to 0! If you set this all the way to 1, `recency` is 0 for all objects, once again making this equivalent to a vector lookup.


```python
# Define your embedding model
embeddings_model = OpenAIEmbeddings()
# Initialize the vectorstore as empty
embedding_size = 1536
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(embeddings_model.embed_query, index, InMemoryDocstore({}), {})
retriever = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore, decay_rate=.999, k=1)
```


```python
yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="hello world", metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="hello foo")])
```

<CodeOutputBlock lang="python">

```
['40011466-5bbe-4101-bfd1-e22e7f505de2']
```

</CodeOutputBlock>


```python
# "Hello Foo" is returned first because "hello world" is mostly forgotten
retriever.get_relevant_documents("hello world")
```

<CodeOutputBlock lang="python">

```
[Document(page_content='hello foo', metadata={'last_accessed_at': datetime.datetime(2023, 4, 16, 22, 9, 2, 494798), 'created_at': datetime.datetime(2023, 4, 16, 22, 9, 2, 178722), 'buffer_idx': 1})]
```

</CodeOutputBlock>
## Virtual Time

Using some utils in LangChain, you can mock out the time component.


```python
from langchain.utils import mock_now
import datetime
```


```python
# Notice the last accessed time is the mocked datetime
with mock_now(datetime.datetime(2011, 2, 3, 10, 11)):
    print(retriever.get_relevant_documents("hello world"))
```

<CodeOutputBlock lang="python">

```
[Document(page_content='hello world', metadata={'last_accessed_at': MockDateTime(2011, 2, 3, 10, 11), 'created_at': datetime.datetime(2023, 5, 13, 21, 0, 27, 279596), 'buffer_idx': 0})]
```

</CodeOutputBlock>

@ -1,88 +0,0 @@
```python
from langchain.document_loaders import TextLoader
loader = TextLoader('../../../state_of_the_union.txt')
```


```python
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(texts, embeddings)
```

<CodeOutputBlock lang="python">

```
Exiting: Cleaning up .chroma directory
```

</CodeOutputBlock>


```python
retriever = db.as_retriever()
```


```python
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```

## Maximum Marginal Relevance Retrieval

By default, the vectorstore retriever uses similarity search. If the underlying vectorstore supports maximum marginal relevance search, you can specify that as the search type.


```python
retriever = db.as_retriever(search_type="mmr")
```


```python
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```

## Similarity Score Threshold Retrieval

You can also use a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold.


```python
retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .5})
```


```python
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```

## Specifying top k

You can also specify search kwargs like `k` to use when doing retrieval.


```python
retriever = db.as_retriever(search_kwargs={"k": 1})
```


```python
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```


```python
len(docs)
```

<CodeOutputBlock lang="python">

```
1
```

</CodeOutputBlock>

@ -1,201 +0,0 @@
## Get started
|
||||
We'll use a Pinecone vector store in this example.
|
||||
|
||||
First we'll want to create a `Pinecone` VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
|
||||
|
||||
To use Pinecone, you to have `pinecone` package installed and you must have an API key and an Environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
|
||||
|
||||
NOTE: The self-query retriever requires you to have `lark` package installed.
|
||||
|
||||
|
||||
```python
|
||||
# !pip install lark pinecone-client
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
import os
|
||||
|
||||
import pinecone
|
||||
|
||||
|
||||
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment=os.environ["PINECONE_ENV"])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
from langchain.schema import Document
|
||||
from langchain.embeddings.openai import OpenAIEmbeddings
|
||||
from langchain.vectorstores import Pinecone
|
||||
|
||||
embeddings = OpenAIEmbeddings()
|
||||
# create new index
|
||||
pinecone.create_index("langchain-self-retriever-demo", dimension=1536)
|
||||
```
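
Note that re-running `create_index` against a name that already exists raises an error. A minimal guard, assuming the same `pinecone` client as above (`list_indexes` returns the names of existing indexes; the 1536 dimension matches the OpenAI embedding size):

```python
# Create the index only if it doesn't exist yet; the dimension must match
# the embedding model's output size (1536 for OpenAI embeddings).
if "langchain-self-retriever-demo" not in pinecone.list_indexes():
    pinecone.create_index("langchain-self-retriever-demo", dimension=1536)
```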

```python
docs = [
    Document(page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose", metadata={"year": 1993, "rating": 7.7, "genre": ["action", "science fiction"]}),
    Document(page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...", metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2}),
    Document(page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea", metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6}),
    Document(page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them", metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3}),
    Document(page_content="Toys come alive and have a blast doing so", metadata={"year": 1995, "genre": "animated"}),
    Document(page_content="Three men walk into the Zone, three men walk out of the Zone", metadata={"year": 1979, "rating": 9.9, "director": "Andrei Tarkovsky", "genre": ["science fiction", "thriller"]})
]
vectorstore = Pinecone.from_documents(
    docs, embeddings, index_name="langchain-self-retriever-demo"
)
```

## Creating our self-querying retriever

Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents.

```python
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float"
    ),
]
document_content_description = "Brief summary of a movie"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)
```

## Testing it out

And now we can try actually using our retriever!

```python
# This example only specifies a relevant query
retriever.get_relevant_documents("What are some movies about dinosaurs")
```

<CodeOutputBlock lang="python">

```
query='dinosaur' filter=None

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': ['action', 'science fiction'], 'rating': 7.7, 'year': 1993.0}),
 Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0}),
 Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'director': 'Christopher Nolan', 'rating': 8.2, 'year': 2010.0})]
```

</CodeOutputBlock>

```python
# This example only specifies a filter
retriever.get_relevant_documents("I want to watch a movie rated higher than 8.5")
```

<CodeOutputBlock lang="python">

```
query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5)

[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0}),
 Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'thriller'], 'rating': 9.9, 'year': 1979.0})]
```

</CodeOutputBlock>

```python
# This example specifies a query and a filter
retriever.get_relevant_documents("Has Greta Gerwig directed any movies about women")
```

<CodeOutputBlock lang="python">

```
query='women' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Greta Gerwig')

[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019.0})]
```

</CodeOutputBlock>

```python
# This example specifies a composite filter
retriever.get_relevant_documents("What's a highly rated (above 8.5) science fiction film?")
```

<CodeOutputBlock lang="python">

```
query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction'), Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5)])

[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'thriller'], 'rating': 9.9, 'year': 1979.0})]
```

</CodeOutputBlock>

```python
# This example specifies a query and composite filter
retriever.get_relevant_documents("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated")
```

<CodeOutputBlock lang="python">

```
query='toys' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GT: 'gt'>, attribute='year', value=1990.0), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2005.0), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='animated')])

[Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0})]
```

</CodeOutputBlock>

## Filter k

We can also use the self-query retriever to specify `k`: the number of documents to fetch.

We can do this by passing `enable_limit=True` to the constructor. With that flag set, a phrase like "two movies" in the query should be parsed into a limit on the structured query, capping the number of results returned.

```python
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True
)
```

```python
# This example only specifies a relevant query
retriever.get_relevant_documents("What are two movies about dinosaurs")
```
@ -1,73 +0,0 @@
### Setup

To start we'll need to install the OpenAI Python package:

```bash
pip install openai
```

Accessing the API requires an API key, which you can get by creating an account and heading [here](https://platform.openai.com/account/api-keys). Once we have a key we'll want to set it as an environment variable by running:

```bash
export OPENAI_API_KEY="..."
```

If you'd prefer not to set an environment variable, you can pass the key in directly via the `openai_api_key` named parameter when initializing the `OpenAIEmbeddings` class:

```python
from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(openai_api_key="...")
```

Otherwise you can initialize without any parameters:

```python
from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()
```

### `embed_documents`
#### Embed list of texts

```python
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embeddings), len(embeddings[0])
```

<CodeOutputBlock language="python">

```
(5, 1536)
```

</CodeOutputBlock>

### `embed_query`
#### Embed single query
Embed a single piece of text for the purpose of comparing it to other embedded pieces of text.

```python
embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
embedded_query[:5]
```

<CodeOutputBlock language="python">

```
[0.0053587136790156364,
 -0.0004999046213924885,
 0.038883671164512634,
 -0.003001077566295862,
 -0.00900818221271038]
```

</CodeOutputBlock>
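
The embeddings API only produces vectors; comparing them is up to the caller. A minimal sketch of cosine similarity with `numpy` (not part of the original walkthrough; it reuses the `embeddings` list and `embedded_query` from above):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors over the product of their norms.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score each embedded document against the embedded query and report the
# index and score of the best match.
scores = [cosine_similarity(embedded_query, doc_vec) for doc_vec in embeddings]
print(max(enumerate(scores), key=lambda pair: pair[1]))
```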
@ -1,89 +0,0 @@
LangChain supports async operations on vector stores. All the methods can be called using their async counterparts, with the prefix `a`, meaning `async`.

`Qdrant` is a vector store which supports all the async operations, so it is used in this walkthrough.

```bash
pip install qdrant-client
```

```python
from langchain.vectorstores import Qdrant
```

### Create a vector store asynchronously

```python
db = await Qdrant.afrom_documents(documents, embeddings, "http://localhost:6333")
```
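
Top-level `await` works in notebooks; a plain script needs an event loop. A minimal sketch with `asyncio`, assuming `documents` and `embeddings` have been prepared as in the earlier walkthroughs:

```python
import asyncio

from langchain.vectorstores import Qdrant

async def main():
    # Build the store and run a query without blocking the event loop.
    db = await Qdrant.afrom_documents(documents, embeddings, "http://localhost:6333")
    docs = await db.asimilarity_search("What did the president say about Ketanji Brown Jackson")
    print(docs[0].page_content)

asyncio.run(main())
```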

### Similarity search

```python
query = "What did the president say about Ketanji Brown Jackson"
docs = await db.asimilarity_search(query)
print(docs[0].page_content)
```

<CodeOutputBlock lang="python">

```
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
```

</CodeOutputBlock>

### Similarity search by vector

```python
embedding_vector = embeddings.embed_query(query)
docs = await db.asimilarity_search_by_vector(embedding_vector)
```

## Maximum marginal relevance search (MMR)

Maximal marginal relevance optimizes for similarity to the query AND diversity among the selected documents. It is also supported in the async API.

```python
query = "What did the president say about Ketanji Brown Jackson"
found_docs = await db.amax_marginal_relevance_search(query, k=2, fetch_k=10)
for i, doc in enumerate(found_docs):
    print(f"{i + 1}.", doc.page_content, "\n")
```

<CodeOutputBlock lang="python">

```
1. Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

2. We can’t change how divided we’ve been. But we can change how we move forward—on COVID-19 and other issues we must face together.

I recently visited the New York City Police Department days after the funerals of Officer Wilbert Mora and his partner, Officer Jason Rivera.

They were responding to a 9-1-1 call when a man shot and killed them with a stolen gun.

Officer Mora was 27 years old.

Officer Rivera was 22.

Both Dominican Americans who’d grown up on the same streets they later chose to patrol as police officers.

I spoke with their families and told them that we are forever in debt for their sacrifice, and we will carry on their mission to restore the trust and safety every community deserves.

I’ve worked on these issues a long time.

I know what works: Investing in crime prevention and community police officers who’ll walk the beat, who’ll know the neighborhood, and who can restore trust and safety.
```

</CodeOutputBlock>
@ -1,168 +0,0 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

There are many great vector store options; here are a few that are free, open source, and run entirely on your local machine. Review all integrations for many great hosted offerings.

<Tabs>
<TabItem value="chroma" label="Chroma" default>

This walkthrough uses the `chroma` vector database, which runs on your local machine as a library.

```bash
pip install chromadb
```

We want to use OpenAIEmbeddings, so we have to get the OpenAI API key.

```python
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, OpenAIEmbeddings())
```

</TabItem>
<TabItem value="faiss" label="FAISS">

This walkthrough uses the `FAISS` vector database, which makes use of the Facebook AI Similarity Search (FAISS) library.

```bash
pip install faiss-cpu
```

We want to use OpenAIEmbeddings, so we have to get the OpenAI API key.

```python
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = FAISS.from_documents(documents, OpenAIEmbeddings())
```

</TabItem>
<TabItem value="lance" label="Lance">

This notebook shows how to use functionality related to the LanceDB vector database, based on the Lance data format.

```bash
pip install lancedb
```

We want to use OpenAIEmbeddings, so we have to get the OpenAI API key.

```python
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import LanceDB

import lancedb

# Define the embeddings model before seeding the table (it is used below).
embeddings = OpenAIEmbeddings()

db = lancedb.connect("/tmp/lancedb")
table = db.create_table(
    "my_table",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = LanceDB.from_documents(documents, embeddings, connection=table)
```

</TabItem>
</Tabs>

### Similarity search

```python
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
```

<CodeOutputBlock lang="python">

```
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
```

</CodeOutputBlock>

### Similarity search by vector

It is also possible to search for documents similar to a given embedding vector using `similarity_search_by_vector`, which accepts an embedding vector as a parameter instead of a string.

```python
embedding_vector = OpenAIEmbeddings().embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)
```

The query is the same, and so the result is also the same.

<CodeOutputBlock lang="python">

```
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
```

</CodeOutputBlock>