BSHTMLLoader integration - Docs by LangChain

This guide provides a quick overview for getting started with the BeautifulSoup4 document loader. For detailed documentation of all BeautifulSoup4 features and configurations, head to the API reference.

Overview

Integration details

Class	Package	Local	Serializable	JS support
`BSHTMLLoader`	`langchain-community`	✅	❌	❌

Loader features

Source	Document Lazy Loading	Native Async Support
`BSHTMLLoader`	✅	❌

Setup

To access the BSHTMLLoader document loader, you'll need to install the langchain-community integration package and the bs4 Python package.

Credentials

No credentials are needed to use the BSHTMLLoader class. To enable automated tracing of your model calls, set your LangSmith API key:

import getpass
import os
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_TRACING"] = "true"
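
In a notebook, re-running the cell above prompts for the key every time. A minimal sketch of a guard that prompts only when the key is missing might look like the following; the helper name `ensure_langsmith_env` and its `prompt` parameter are hypothetical and not part of LangChain or LangSmith.

```python
import getpass
import os
from typing import Callable


def ensure_langsmith_env(prompt: Callable[[str], str] = getpass.getpass) -> None:
    """Prompt for the LangSmith key only if it is not already set (hypothetical helper)."""
    if not os.environ.get("LANGSMITH_API_KEY"):
        os.environ["LANGSMITH_API_KEY"] = prompt("Enter your LangSmith API key: ")
    # Tracing is switched on either way, matching the snippet above.
    os.environ["LANGSMITH_TRACING"] = "true"
```

Calling `ensure_langsmith_env()` once near the top of a notebook leaves an existing key untouched.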

Installation

Install langchain-community and bs4.

pip install -qU langchain-community bs4

Initialization

Now we can instantiate the loader object and load documents:


from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader(
    file_path="./example_data/fake-content.html",
)
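
The examples in this guide assume that `./example_data/fake-content.html` already exists. If it does not, a file along these lines could be created first. The markup here is an assumption reconstructed from the loader output shown in this guide (a title, one heading, one paragraph); the real example file may differ, so the exact whitespace in `page_content` may not match.

```python
from pathlib import Path

# Assumed markup, inferred from the output shown in this guide;
# the original example file may differ in whitespace or attributes.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>Test Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""

sample_path = Path("./example_data/fake-content.html")
# Create the example_data directory if it is missing, then write the file.
sample_path.parent.mkdir(parents=True, exist_ok=True)
sample_path.write_text(SAMPLE_HTML, encoding="utf-8")
```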

Load

docs = loader.load()
docs[0]

Document(metadata={'source': './example_data/fake-content.html', 'title': 'Test Title'}, page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n')

print(docs[0].metadata)

{'source': './example_data/fake-content.html', 'title': 'Test Title'}

Lazy load

page = []
for doc in loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        page = []
page[0]

Document(metadata={'source': './example_data/fake-content.html', 'title': 'Test Title'}, page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n')
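
The loop above resets `page` after every 10 documents, so any documents left over after the last full page still have to be handled once the loop ends. One way to make that explicit is a small paging generator, sketched here with the standard library only. The helper name `paged` is hypothetical, and `range(25)` stands in for `loader.lazy_load()`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def paged(items: Iterable[T], page_size: int) -> Iterator[List[T]]:
    """Yield lists of up to page_size items, including a final partial page."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    iterator = iter(items)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


# Stand-in data; with the loader this would be paged(loader.lazy_load(), 10).
page_sizes = [len(page) for page in paged(range(25), 10)]
print(page_sizes)  # [10, 10, 5]
```

Because the generator pulls items one page at a time, it keeps the memory benefit of lazy loading while still delivering the trailing partial page.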

Adding a separator to BS4

We can also pass a separator to use when calling get_text on the soup.

loader = BSHTMLLoader(
    file_path="./example_data/fake-content.html", get_text_separator=", "
)

docs = loader.load()
print(docs[0])

page_content='
, Test Title,
,
,
, My First Heading,
, My first paragraph.,
,
,
' metadata={'source': './example_data/fake-content.html', 'title': 'Test Title'}
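
The lines containing only `,` appear because get_text joins every text node with the separator, including the whitespace-only nodes that sit between tags. The sketch below imitates that joining with the standard library's `html.parser` as a stand-in; it is not BeautifulSoup and not how BSHTMLLoader is implemented, and the names `TextNodeCollector` and `join_text_nodes` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class TextNodeCollector(HTMLParser):
    """Collect every text node, whitespace-only ones included."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def join_text_nodes(html: str, separator: str = "") -> str:
    """Join all text nodes with the separator (rough stand-in for get_text)."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


html = "<h1>My First Heading</h1>\n<p>My first paragraph.</p>"
joined = join_text_nodes(html, ", ")
print(repr(joined))  # 'My First Heading, \n, My first paragraph.'
```

The newline between the two tags is its own text node, so it ends up surrounded by separators, which is the same pattern visible in the loader output above.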

API reference

For detailed documentation of all BSHTMLLoader features and configurations, head to the API reference.
