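Pinecone (Sparse) Integration - LangChain Chinese Documentation

Pinecone is a vector database with broad functionality.

This notebook shows how to use functionality related to the Pinecone vector database.

Setup

To use PineconeSparseVectorStore, you first need to install the partner package, along with the other packages used in this notebook.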

pip install -qU "langchain-pinecone==0.2.5"

WARNING: pinecone 6.0.2 does not provide the extra 'async'

Credentials

Create a new Pinecone account, or sign in to an existing one, and create an API key to use in this notebook.

import os
from getpass import getpass

from pinecone import Pinecone

# get API key at app.pinecone.io
os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") or getpass(
    "Enter your Pinecone API key: "
)

# initialize client
pc = Pinecone()

Enter your Pinecone API key: ··········

Initialization

Before initializing our vector store, let's connect to a Pinecone index. If an index named index_name does not exist, it will be created.

from pinecone import AwsRegion, CloudProvider, Metric, ServerlessSpec

index_name = "langchain-sparse-vector-search"  # change if desired
model_name = "pinecone-sparse-english-v0"

if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        cloud=CloudProvider.AWS,
        region=AwsRegion.US_EAST_1,
        embed={
            "model": model_name,
            "field_map": {"text": "chunk_text"},
            "metric": Metric.DOTPRODUCT,
        },
    )

index = pc.Index(index_name)
print(f"Index `{index_name}` host: {index.config.host}")

Index `langchain-sparse-vector-search` host: https://langchain-sparse-vector-search-yrrgefy.svc.aped-4627-b74a.pinecone.io

For our sparse embedding model we use pinecone-sparse-english-v0, which we initialize like this:

from langchain_pinecone.embeddings import PineconeSparseEmbeddings

sparse_embeddings = PineconeSparseEmbeddings(model=model_name)
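To make the idea of a sparse embedding concrete, here is a minimal local sketch of dot-product scoring, the metric chosen for the index above. It assumes a sparse vector can be represented as a mapping from dimension index to weight; the `SparseVector` alias, the `sparse_dot` helper, and the indices and weights are all hypothetical and are not part of the Pinecone or LangChain APIs.

```python
from typing import Dict

# A sparse vector stored as {dimension index: weight}; absent indices are zero.
SparseVector = Dict[int, float]


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product of two sparse vectors, summed over their shared indices."""
    # Iterate over the smaller vector so the loop is as short as possible.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[index] for index, weight in a.items() if index in b)


# Hypothetical indices and weights, for illustration only.
query = {101: 1.5, 205: 0.5}
doc_a = {101: 2.0, 205: 1.0, 999: 0.3}
doc_b = {42: 1.2, 999: 0.7}

score_a = sparse_dot(query, doc_a)  # shares indices 101 and 205 with the query
score_b = sparse_dot(query, doc_b)  # shares no indices with the query
```

Only dimensions present in both vectors contribute, so a document that shares no weighted terms with the query scores zero, and scores are not bounded above by 1.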

Now that our Pinecone index and embedding model are both ready, we can initialize our sparse vector store in LangChain:

from langchain_pinecone import PineconeSparseVectorStore

vector_store = PineconeSparseVectorStore(index=index, embedding=sparse_embeddings)

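Manage vector store

Once your vector store has been created, you can interact with it by adding and deleting items.

Add items to vector store

We can add items to our vector store using the add_documents method.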

from uuid import uuid4

from langchain_core.documents import Document

documents = [
    Document(
        page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
        metadata={"source": "social"},
    ),
    Document(
        page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
        metadata={"source": "news"},
    ),
    Document(
        page_content="Building an exciting new project with LangChain - come check it out!",
        metadata={"source": "social"},
    ),
    Document(
        page_content="Robbers broke into the city bank and stole $1 million in cash.",
        metadata={"source": "news"},
    ),
    Document(
        page_content="Wow! That was an amazing movie. I can't wait to see it again.",
        metadata={"source": "social"},
    ),
    Document(
        page_content="Is the new iPhone worth the price? Read this review to find out.",
        metadata={"source": "website"},
    ),
    Document(
        page_content="The top 10 soccer players in the world right now.",
        metadata={"source": "website"},
    ),
    Document(
        page_content="LangGraph is the best framework for building stateful, agentic applications!",
        metadata={"source": "social"},
    ),
    Document(
        page_content="The stock market is down 500 points today due to fears of a recession.",
        metadata={"source": "news"},
    ),
    Document(
        page_content="I have a bad feeling I am going to get deleted :(",
        metadata={"source": "social"},
    ),
]

uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

['95b598af-c3dc-4a8a-bdb7-5d21283e5a86',
 '838614a5-5635-4efd-9ac3-5237a37a542b',
 '093fd11f-c85b-4c83-83f0-117df64ff442',
 'fb3ba32f-f802-410a-ad79-56f7bce938fe',
 '75cde9bf-7e91-4f06-8bae-c824dab16a08',
 '9de8f769-d604-4e56-b677-ee333cbc8e34',
 'f5f4ae97-88e6-4669-bcf7-87072bb08550',
 'f9f82811-187c-4b25-85b5-7a42b4da3bff',
 'ce45957c-e8fc-41ef-819b-1bd52b6fc815',
 '66cacc6f-b8e2-441b-9f7f-468788aad88f']
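Because uuid4 generates a fresh random ID on every call, running the cell above a second time stores a second copy of each document under a new ID. One possible alternative, sketched here under the assumption that a document is identified by its source and text, is to derive the ID deterministically with uuid5 so that a re-run produces the same IDs. The `stable_id` helper is hypothetical and not part of LangChain or Pinecone.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_id(text: str, source: str) -> str:
    """Derive a repeatable ID from a document's source and text."""
    # uuid5 hashes the namespace and the name, so equal inputs give equal IDs.
    return str(uuid5(NAMESPACE_URL, f"{source}:{text}"))


text = "Building an exciting new project with LangChain - come check it out!"

first_run = stable_id(text, "social")
second_run = stable_id(text, "social")
other_doc = stable_id("The top 10 soccer players in the world right now.", "website")
```

With the documents list above, the IDs could then be built as `[stable_id(d.page_content, d.metadata["source"]) for d in documents]` and passed as `ids` to add_documents.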

Delete items from vector store

We can delete records from our vector store using the delete method, passing it a list of the document IDs to delete.

vector_store.delete(ids=[uuids[-1]])

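Query vector store

Once our documents are loaded into the vector store, we are ready to begin querying. There are several ways to do this in LangChain. First, we will run a simple vector search by querying our vector_store directly through the similarity_search method: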

results = vector_store.similarity_search("I'm building a new LangChain project!", k=3)

for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
* Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'social'}]

We can also add metadata filtering to a query to restrict the search by various criteria. Let's try a simple filter that limits the search to records with source=="social":

results = vector_store.similarity_search(
    "I'm building a new LangChain project!",
    k=3,
    filter={"source": "social"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
* Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'social'}]

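Comparing the two queries: the first is unfiltered, so it can return records from any source, including "website" or "news". The filtered query can only return records whose source is "social". In the outputs shown above, the top three matches happen to be "social" records in both cases, so the two result lists are identical.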
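To illustrate what a metadata filter expresses, the following sketch evaluates a filter dictionary against plain metadata dictionaries in local Python. It is an illustration only, not Pinecone's implementation: the `matches` helper is hypothetical, it covers just the `$eq` and `$in` operators from Pinecone's filter language, and it assumes a bare value such as `{"source": "social"}` is shorthand for `$eq`.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies every condition in `flt`."""
    for field, condition in flt.items():
        value = metadata.get(field)
        if not isinstance(condition, dict):
            # A bare value is treated as shorthand for {"$eq": value}.
            condition = {"$eq": condition}
        for op, operand in condition.items():
            if op == "$eq":
                ok = value == operand
            elif op == "$in":
                ok = value in operand
            else:
                raise ValueError(f"operator not covered by this sketch: {op}")
            if not ok:
                return False
    return True


records = [
    {"source": "social"},
    {"source": "news"},
    {"source": "website"},
]

only_social = [r for r in records if matches(r, {"source": "social"})]
social_or_news = [
    r for r in records if matches(r, {"source": {"$in": ["social", "news"]}})
]
```

The same dictionaries could be passed as the `filter` argument of similarity_search, where the filtering is applied by the index rather than in local code.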

Similarity search with score

We can also run a search that returns a list of (document, score) tuples, where document is a LangChain Document object containing our text content and metadata.

results = vector_store.similarity_search_with_score(
    "I'm building a new LangChain project!", k=3, filter={"source": "social"}
)
for doc, score in results:
    print(f"[SIM={score:3f}] {doc.page_content} [{doc.metadata}]")

[SIM=12.959961] Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
[SIM=12.959961] Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
[SIM=1.942383] LangGraph is the best framework for building stateful, agentic applications! [{'source': 'social'}]
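Since the scores are returned alongside the documents, weak matches can be dropped in ordinary Python after the search. The sketch below assumes the result is a list of (item, score) tuples as described above; the `keep_above` helper and the cutoff of 5.0 are hypothetical choices for illustration, and plain strings stand in for Document objects so the example runs without a Pinecone index.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def keep_above(results: List[Tuple[T, float]], min_score: float) -> List[Tuple[T, float]]:
    """Keep only the (item, score) pairs whose raw score reaches min_score."""
    return [(item, score) for item, score in results if score >= min_score]


# Scores copied from the output above; strings stand in for Document objects.
results = [
    ("Building an exciting new project with LangChain - come check it out!", 12.959961),
    ("Building an exciting new project with LangChain - come check it out!", 12.959961),
    ("LangGraph is the best framework for building stateful, agentic applications!", 1.942383),
]

strong = keep_above(results, min_score=5.0)  # 5.0 is an arbitrary example cutoff
```

A suitable cutoff depends on the query and the data, because these dot-product scores are not normalized to a fixed range.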

Use as a retriever

We often use a vector store as a VectorStoreRetriever in our chains and agents. To create one, we use the as_retriever method:

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.5},
)
retriever

VectorStoreRetriever(tags=['PineconeSparseVectorStore', 'PineconeSparseEmbeddings'], vectorstore=<langchain_pinecone.vectorstores_sparse.PineconeSparseVectorStore object at 0x7c8087b24290>, search_type='similarity_score_threshold', search_kwargs={'k': 3, 'score_threshold': 0.5})

We can now query our retriever using the invoke method:

retriever.invoke(
    input="I'm building a new LangChain project!", filter={"source": "social"}
)

/usr/local/lib/python3.11/dist-packages/langchain_core/vectorstores/base.py:1082: UserWarning: Relevance scores must be between 0 and 1, got [(Document(id='093fd11f-c85b-4c83-83f0-117df64ff442', metadata={'source': 'social'}, page_content='Building an exciting new project with LangChain - come check it out!'), 6.97998045), (Document(id='54f8f645-9f77-4aab-b9fa-709fd91ae3b3', metadata={'source': 'social'}, page_content='Building an exciting new project with LangChain - come check it out!'), 6.97998045), (Document(id='f9f82811-187c-4b25-85b5-7a42b4da3bff', metadata={'source': 'social'}, page_content='LangGraph is the best framework for building stateful, agentic applications!'), 1.471191405)]
  self.vectorstore.similarity_search_with_relevance_scores(

[Document(id='093fd11f-c85b-4c83-83f0-117df64ff442', metadata={'source': 'social'}, page_content='Building an exciting new project with LangChain - come check it out!'),
 Document(id='54f8f645-9f77-4aab-b9fa-709fd91ae3b3', metadata={'source': 'social'}, page_content='Building an exciting new project with LangChain - come check it out!'),
 Document(id='f9f82811-187c-4b25-85b5-7a42b4da3bff', metadata={'source': 'social'}, page_content='LangGraph is the best framework for building stateful, agentic applications!')]
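The warning in the output above reports that relevance scores are expected to lie between 0 and 1, whereas the scores returned here are larger than 1. One way to obtain values in that range, sketched below, is to rescale each score relative to the top score of the same result set. This is an illustrative assumption, not what LangChain or Pinecone does internally, and the `relative_scores` helper is hypothetical.

```python
from typing import List


def relative_scores(scores: List[float]) -> List[float]:
    """Rescale non-negative scores so the best match in the list maps to 1.0."""
    top = max(scores, default=0.0)
    if top <= 0.0:
        # No positive score to scale against; treat everything as irrelevant.
        return [0.0 for _ in scores]
    # Negative scores are clamped to zero so every result stays within [0, 1].
    return [max(score, 0.0) / top for score in scores]


# Raw scores copied from the warning message above.
raw = [6.97998045, 6.97998045, 1.471191405]
scaled = relative_scores(raw)
```

Rescaled values are only comparable within a single result set, since the top match always maps to 1.0 regardless of how well it actually fits the query.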

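Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:

API reference

For detailed documentation of all features and configurations, head to the API reference: API reference. Sparse embeddings: API reference.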
