Apify Actor Integration - LangChain Documentation (Chinese Edition)

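Apify Actors are cloud programs designed for a wide range of web scraping, crawling, and data extraction tasks. They automate data collection from the web, letting users extract, process, and store information efficiently. Actors can handle tasks such as scraping product details from e-commerce sites, monitoring price changes, or collecting search engine results. They integrate with Apify Datasets, so the structured data an Actor collects can be stored, managed, and exported in formats such as JSON, CSV, or Excel for further analysis or use.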

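Overview

This notebook walks you through using Apify Actors with LangChain to automate web scraping and data extraction. The langchain-apify package integrates Apify's cloud-based tools with LangChain agents, enabling efficient data collection and processing for AI applications.

Integration details

Class	Package	Serializable	JS support	Version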
`ApifyActorsTool`	`langchain-apify`	✅	✅

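Tool features

Returns artifact	Native async	Return data	Pricing
❌	✅	Actor output (varies by Actor)	Pay-per-use, with a free tier

Setup

This integration lives in the langchain-apify package, which can be installed with pip.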

pip install langchain-apify

Prerequisites

Apify account: register for a free Apify account.
Apify API token: see the Apify documentation to learn how to get your API token.

import os

os.environ["APIFY_TOKEN"] = "your-apify-token"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
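Hardcoding secrets in a notebook makes them easy to leak. As a minimal sketch of an alternative, assuming an interactive session, the helper below prompts for a value only when the variable is not already set. The name `ensure_env` is hypothetical and is not part of langchain-apify.

```python
import getpass
import os
from typing import Callable


def ensure_env(name: str, prompt: Callable[[str], str] = getpass.getpass) -> str:
    """Return os.environ[name], prompting for it first if it is unset or empty.

    `ensure_env` is an illustrative helper, not a langchain-apify API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # Ask the user without echoing the secret to the terminal.
        value = prompt(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} must not be empty")
        os.environ[name] = value
    return value
```

Calling `ensure_env("APIFY_TOKEN")` and `ensure_env("OPENAI_API_KEY")` would then replace the two hardcoded assignments.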

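Pricing

Apify uses a pay-per-use pricing model with a free tier. Pricing varies by Actor: some Actors are free to use (you pay only for platform usage), while others charge per result or per event.

Instantiation

Here we instantiate ApifyActorsTool to call the RAG Web Browser Apify Actor. This Actor provides web browsing for AI and LLM applications, similar to the web browsing feature in ChatGPT. Any Actor from the Apify Store can be used this way.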

from langchain_apify import ApifyActorsTool

tool = ApifyActorsTool("apify/rag-web-browser")

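Invocation

ApifyActorsTool takes a single argument, run_input, a dictionary that is passed to the Actor as its run input. The run input schema is documented in the Input section of the Actor's detail page. See the RAG Web Browser input schema.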

tool.invoke({"run_input": {"query": "what is apify?", "maxResults": 2}})
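The call above nests the Actor's own fields inside a `run_input` key, which is easy to get wrong by hand. A small sketch of a builder for that envelope follows. The function name `build_tool_input` is hypothetical; the field names `query` and `maxResults` are the ones used in the call above, and any extra fields are passed through unchecked, so consult the Actor's input schema for what it accepts.

```python
from typing import Any


def build_tool_input(query: str, max_results: int = 3, **extra: Any) -> dict[str, Any]:
    """Wrap RAG Web Browser input fields in the {"run_input": ...} envelope.

    Illustrative helper only; it validates the two fields shown in this
    document and forwards any other keyword arguments unchanged.
    """
    if not query.strip():
        raise ValueError("query must not be empty")
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    run_input = {"query": query.strip(), "maxResults": max_results, **extra}
    return {"run_input": run_input}
```

With this helper, the invocation becomes `tool.invoke(build_tool_input("what is apify?", max_results=2))`.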

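Chaining

We can provide the created tool to an agent. When asked to search for information, the agent calls the Apify Actor, which searches the web and then retrieves the search results.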

pip install langchain langgraph langchain-openai

from langchain.messages import ToolMessage
from langchain_openai import ChatOpenAI
from langchain.agents import create_agent


model = ChatOpenAI(model="gpt-5-mini")
tools = [tool]
graph = create_agent(model, tools=tools)

inputs = {"messages": [("user", "search for what is Apify")]}
for s in graph.stream(inputs, stream_mode="values"):
    message = s["messages"][-1]
    # skip tool messages
    if isinstance(message, ToolMessage):
        continue
    message.pretty_print()

================================ Human Message =================================

search for what is Apify
================================== Ai Message ==================================
Tool Calls:
  apify_actor_apify_rag-web-browser (call_27mjHLzDzwa5ZaHWCMH510lm)
 Call ID: call_27mjHLzDzwa5ZaHWCMH510lm
  Args:
    run_input: {"run_input":{"query":"Apify","maxResults":3,"outputFormats":["markdown"]}}
================================== Ai Message ==================================

Apify is a comprehensive platform for web scraping, browser automation, and data extraction. It offers a wide array of tools and services that cater to developers and businesses looking to extract data from websites efficiently and effectively. Here's an overview of Apify:

1. **Ecosystem and Tools**:
   - Apify provides an ecosystem where developers can build, deploy, and publish data extraction and web automation tools called Actors.
   - The platform supports various use cases such as extracting data from social media platforms, conducting automated browser-based tasks, and more.

2. **Offerings**:
   - Apify offers over 10,000 ready-made scraping tools and code templates.
   - Users can also build custom solutions or hire Apify's professional services for more tailored data extraction needs.

3. **Technology and Integration**:
   - The platform supports integration with popular tools and services like Zapier, GitHub, Google Sheets, Pinecone, and more.
   - Apify supports open-source tools and technologies such as JavaScript, Python, Puppeteer, Playwright, Selenium, and its own Crawlee library for web crawling and browser automation.

4. **Community and Learning**:
   - Apify hosts a community on Discord where developers can get help and share expertise.
   - It offers educational resources through the Web Scraping Academy to help users become proficient in data scraping and automation.

5. **Enterprise Solutions**:
   - Apify provides enterprise-grade web data extraction solutions with high reliability, 99.95% uptime, and compliance with SOC2, GDPR, and CCPA standards.

For more information, you can visit [Apify's official website](https://apify.com/) or their [GitHub page](https://github.com/apify) which contains their code repositories and further details about their projects.
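The filtering step in the streaming loop above can be factored into a reusable generator. The sketch below is one way to do it, assuming each message exposes a `type` attribute that equals "tool" for tool output, so it checks that attribute instead of importing ToolMessage. The name `iter_visible_messages` and the stand-in states are hypothetical; a real run would pass `graph.stream(inputs, stream_mode="values")`.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator


def iter_visible_messages(states: Iterable[dict[str, Any]]) -> Iterator[Any]:
    """Yield the newest message of each streamed state, skipping tool messages."""
    for state in states:
        messages = state.get("messages") or []
        if not messages:
            continue
        latest = messages[-1]
        # Assumption: tool output is marked with type == "tool".
        if getattr(latest, "type", None) == "tool":
            continue
        yield latest


# Stand-in objects for a dry run without any API calls.
human = SimpleNamespace(type="human", content="question")
ai_call = SimpleNamespace(type="ai", content="calling tool")
tool_out = SimpleNamespace(type="tool", content="tool output")
ai_final = SimpleNamespace(type="ai", content="final answer")

demo_states = [
    {"messages": [human]},
    {"messages": [human, ai_call]},
    {"messages": [human, ai_call, tool_out]},
    {"messages": [human, ai_call, tool_out, ai_final]},
]
visible = [m.content for m in iter_visible_messages(demo_states)]
```

In the dry run, the state whose newest message is the tool output is dropped, which mirrors the `isinstance(message, ToolMessage)` check in the loop.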

Other Actor examples

The Apify Store contains thousands of pre-built Actors. Here are examples of other popular Actors:

Instagram scraper

from langchain_apify import ApifyActorsTool

instagram_tool = ApifyActorsTool("apify/instagram-scraper")

# Scrape Instagram posts
result = instagram_tool.invoke({
    "run_input": {
        "directUrls": ["https://www.instagram.com/humansofny/"],
        "resultsLimit": 10
    }
})

Google Search results scraper

google_search_tool = ApifyActorsTool("apify/google-search-scraper")

# Scrape Google Search results
result = google_search_tool.invoke({
    "run_input": {
        "queries": "langchain python tutorial",
        "maxPagesPerQuery": 1
    }
})
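Note that `queries` in the example above is a single string, not a list. The sketch below builds that string from several search terms, assuming the Actor reads one query per line; that is an assumption to verify against the Actor's input schema. The function name `build_google_search_input` is hypothetical.

```python
from typing import Any, Iterable


def build_google_search_input(
    queries: Iterable[str], max_pages_per_query: int = 1
) -> dict[str, Any]:
    """Build the tool payload for several search terms.

    Assumption: the Actor accepts multiple queries as one newline-separated
    string. Check the input schema before relying on this.
    """
    cleaned = [q.strip() for q in queries if q.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty query is required")
    return {
        "run_input": {
            "queries": "\n".join(cleaned),
            "maxPagesPerQuery": max_pages_per_query,
        }
    }
```

A call such as `google_search_tool.invoke(build_google_search_input(["langchain python tutorial", "apify actors"]))` would then run both searches in one Actor run, if the one-per-line assumption holds.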

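Browse the Apify Store to discover more Actors for your use case.

When to use Apify

Apify is a good fit when you need:

Access to thousands of pre-built Actors for various platforms (social media, e-commerce, search engines, and more)
Custom web scraping and automation workflows that go beyond simple search
Scraping without infrastructure (the serverless platform handles scaling and maintenance)
A flexible Actor ecosystem: run any Actor from the Apify Store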

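API reference

For more information on how to use this integration, see the git repository or the Apify integration documentation.

Using the Apify MCP server

Not sure which Actor to use or which parameters it requires? The Apify MCP (Model Context Protocol) server can help you discover available Actors, explore their input schemas, and understand their parameter requirements. To use the Apify MCP server with LangChain: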

import os
from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain.agents import create_agent

client = MultiServerMCPClient({
    "apify": {
        "transport": "http",
        "url": "https://mcp.apify.com",
        "headers": {
            "Authorization": f"Bearer {os.environ['APIFY_TOKEN']}",
        },
    }
})

tools = await client.get_tools()
agent = create_agent("gpt-5-mini", tools)
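The snippet above uses a top-level `await`, which works in a notebook but not in a plain Python script. A minimal sketch for script use follows, wrapping the same calls in a coroutine. The helper names `apify_mcp_config` and `load_apify_tools` are hypothetical; the server URL, transport value, and header format are copied from the snippet above.

```python
import asyncio
import os
from typing import Any


def apify_mcp_config(token: str) -> dict[str, Any]:
    """Build the same server config as above from an explicit token."""
    if not token:
        raise ValueError("an Apify API token is required")
    return {
        "apify": {
            "transport": "http",
            "url": "https://mcp.apify.com",
            "headers": {"Authorization": f"Bearer {token}"},
        }
    }


async def load_apify_tools() -> list:
    """Connect to the Apify MCP server and return its tools."""
    # Imported lazily so the config helper works without the adapter installed.
    from langchain_mcp_adapters.client import MultiServerMCPClient

    client = MultiServerMCPClient(apify_mcp_config(os.environ["APIFY_TOKEN"]))
    return await client.get_tools()
```

In a script, `tools = asyncio.run(load_apify_tools())` takes the place of the `await client.get_tools()` line, and the resulting list can be passed to `create_agent` as before.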

For more information, see the LangChain MCP documentation and the Apify MCP server.
