SitemapLoader integration - Docs by LangChain

Extending WebBaseLoader, SitemapLoader loads a sitemap from a given URL, then scrapes and loads all pages in the sitemap, returning each page as a Document. The scraping is done concurrently, with a rate limit that defaults to 2 requests per second. If you aren’t concerned about being a good citizen, or you control the scraped server, or you don’t care about load, you can increase this limit. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!

Overview

Integration details

Class	Package	Local	Serializable	JS support
`SitemapLoader`	`langchain-community`	✅	❌	✅

Loader features

Source	Document Lazy Loading	Native Async Support
`SitemapLoader`	✅	❌

Setup

To access the SitemapLoader document loader, you’ll need to install the langchain-community integration package.

Credentials

No credentials are needed to run this. To enable automated tracing of your model calls, set your LangSmith API key:

import getpass
import os

os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_TRACING"] = "true"

Installation

Install langchain-community.

pip install -qU langchain-community

Fix notebook asyncio bug

import nest_asyncio

nest_asyncio.apply()

Initialization

Now we can instantiate our loader object and load documents:

from langchain_community.document_loaders.sitemap import SitemapLoader

sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")

Load

docs = sitemap_loader.load()
docs[0]

Fetching pages: 100%|##########| 28/28 [00:04<00:00,  6.83it/s]

Document(metadata={'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}, page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n')

print(docs[0].metadata)

{'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}

You can change the requests_per_second parameter to raise the limit on concurrent requests, and use requests_kwargs to pass keyword arguments to each request.

sitemap_loader.requests_per_second = 2
# Optional: work around `[SSL: CERTIFICATE_VERIFY_FAILED]` by disabling certificate verification
sitemap_loader.requests_kwargs = {"verify": False}

Lazy load

You can also load the pages lazily in order to minimize the memory load.

page = []
for doc in sitemap_loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        page = []

Fetching pages: 100%|##########| 28/28 [00:01<00:00, 19.06it/s]
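Note that the loop above only processes full pages of 10, so any documents left over at the end of the iteration are never handled. A minimal sketch of one way to batch a lazy iterator while also flushing the final partial batch follows. The helper name `batched` and the stand-in generator `fake_docs` are illustrative assumptions, not part of SitemapLoader; in practice you would pass `sitemap_loader.lazy_load()` instead.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items, including a final partial batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls up to `size` items without consuming the rest.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for sitemap_loader.lazy_load(); any iterator works.
fake_docs = (f"doc-{i}" for i in range(25))
batch_sizes = [len(batch) for batch in batched(fake_docs, 10)]
print(batch_sizes)
```

Because the helper consumes the iterator one slice at a time, only one batch of documents is held in memory at once.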

Filtering sitemap URLs

Sitemaps can be massive files, with thousands of URLs. Often you don’t need every single one of them. You can filter the URLs by passing a list of strings or regex patterns to the filter_urls parameter. Only URLs that match one of the patterns will be loaded.

loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest"],
)
documents = loader.load()

documents[0]

Document(page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n', metadata={'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-02-12T05:26:10.971077+00:00', 'changefreq': 'daily', 'priority': '0.9'})
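To illustrate the filtering idea, here is a small standalone sketch. It assumes each pattern is treated as a regular expression matched from the start of the URL (as `re.match` does); check the API reference to confirm how the loader itself applies patterns. The function name `filter_sitemap_urls` and the `index.html` URL are hypothetical and used only for illustration.

```python
import re
from typing import List, Sequence


def filter_sitemap_urls(urls: Sequence[str], patterns: Sequence[str]) -> List[str]:
    """Keep URLs matching at least one pattern; keep everything if no patterns."""
    if not patterns:
        return list(urls)
    compiled = [re.compile(pattern) for pattern in patterns]
    # re.match anchors at the start of the string, so a pattern acts as a prefix test.
    return [url for url in urls if any(rx.match(url) for rx in compiled)]


urls = [
    "https://api.python.langchain.com/en/latest/",
    "https://api.python.langchain.com/en/stable/",
    "https://api.python.langchain.com/en/latest/index.html",  # hypothetical page
]
kept = filter_sitemap_urls(urls, [r"https://api\.python\.langchain\.com/en/latest"])
print(kept)
```

Under this assumption, a plain URL string works as a pattern too, but its unescaped dots match any character, so escaping them (as above) makes the filter stricter.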

Add custom scraping rules

The SitemapLoader uses beautifulsoup4 for the scraping process, and it scrapes every element on the page by default. The SitemapLoader constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements. The following example shows how to develop and use a custom function to avoid navigation and header elements. Import the beautifulsoup4 library and define the custom function.

pip install beautifulsoup4

from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())

Add your custom function to the SitemapLoader object.

loader = SitemapLoader(
    "https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest/"],
    parsing_function=remove_nav_and_header_elements,
)

Local sitemap

The sitemap loader can also be used to load local files.

sitemap_loader = SitemapLoader(web_path="example_data/sitemap.xml", is_local=True)

docs = sitemap_loader.load()
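For reference, the sketch below shows what a small sitemap file in the standard sitemaps.org format might contain and how its entries map to the `loc`, `lastmod`, `changefreq`, and `priority` metadata fields shown earlier. The XML content is a hypothetical example built from the values in the outputs above, and the parsing uses only the Python standard library; it is not how SitemapLoader itself is implemented.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical sitemap content, e.g. what example_data/sitemap.xml could hold.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://api.python.langchain.com/en/stable/</loc>
    <lastmod>2024-05-15T00:29:42.163001+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://api.python.langchain.com/en/latest/</loc>
    <lastmod>2024-02-12T05:26:10.971077+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
</urlset>
"""

# Namespace used by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_entries(xml_text: str) -> List[Dict[str, str]]:
    """Return one dict per <url> element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for child in url:
            tag = child.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
            entry[tag] = (child.text or "").strip()
        entries.append(entry)
    return entries


entries = read_sitemap_entries(SITEMAP_XML)
print([entry["loc"] for entry in entries])
```

Inspecting a local sitemap this way can be a quick check of which URLs would be visited before running the loader.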

API reference

For detailed documentation of all SitemapLoader features and configurations, head to the API reference.
