The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the unstructured ecosystem within LangChain.
Installation and Setup
If you are using a loader that runs locally, use the following steps to get unstructured and its dependencies running.
- For the smallest installation footprint and to take advantage of features not available in the open-source unstructured package, install the Python SDK with pip install unstructured-client along with pip install langchain-unstructured to use the UnstructuredLoader and partition remotely against the Unstructured API. This loader lives in a LangChain partner repo instead of the langchain-community repo, and you will need an api_key. You can generate a free key on the Unstructured API key page.
  - Unstructured's documentation for the SDK can be found here: https://docs.unstructured.io/api-reference/api-services/sdk
- To run everything locally, install the open-source Python package with pip install unstructured along with pip install langchain-community and use the same UnstructuredLoader as mentioned above.
  - You can install document-specific dependencies with extras, e.g. pip install "unstructured[docx]". Learn more about extras in the full installation documentation.
  - To install the dependencies for all document types, use pip install "unstructured[all-docs]".
- Install the following system dependencies if they are not already available on your system, e.g. with brew install on Mac. Depending on what document types you're parsing, you may not need all of these.
  - libmagic-dev (filetype detection)
  - poppler-utils (images and PDFs)
  - tesseract-ocr (images and PDFs)
  - qpdf (PDFs)
  - libreoffice (MS Office docs)
  - pandoc (EPUBs)
- When running locally, Unstructured also recommends using Docker by following this guide to ensure all system dependencies are installed correctly.
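Before running a local loader, it can help to check which of the system dependencies listed above are actually present. The following is a minimal sketch using only the Python standard library. The mapping from package name to executable name (for example poppler-utils to pdftotext, libreoffice to soffice) is an assumption for illustration and may differ on your platform; the helper names are hypothetical and are not part of unstructured or LangChain.

```python
import shutil
from ctypes.util import find_library

# Assumed mapping: system package -> one executable it typically installs.
# Adjust for your platform; these names are illustrative, not authoritative.
REQUIRED_TOOLS = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "qpdf": "qpdf",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def missing_system_dependencies(tools=REQUIRED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in tools.items() if shutil.which(exe) is None]


def has_libmagic():
    """libmagic is a shared library, not a command, so look it up as a library."""
    return find_library("magic") is not None


if __name__ == "__main__":
    print("Missing executables:", missing_system_dependencies())
    print("libmagic found:", has_libmagic())
```

A missing entry is not necessarily a problem: as noted above, you only need the dependencies for the document types you parse.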
Data loaders
The primary usage of Unstructured is in data loaders.
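As a rough illustration of the loader pattern, the sketch below switches between local and remote partitioning depending on whether an API key is available. It assumes the langchain-unstructured package is installed and that UnstructuredLoader accepts file_path, api_key, and partition_via_api arguments; the UNSTRUCTURED_API_KEY environment variable name and the helper functions are choices made for this example, so check them against the API documentation before relying on them.

```python
import os


def build_loader_kwargs(file_path, api_key=None):
    """Build loader arguments: remote partitioning if a key is given, local otherwise."""
    kwargs = {"file_path": file_path}
    if api_key:
        # Assumed parameters of UnstructuredLoader for partitioning via the API.
        kwargs.update(api_key=api_key, partition_via_api=True)
    return kwargs


def load_documents(file_path):
    """Load a file into LangChain Document objects (hypothetical helper)."""
    # Imported lazily so the rest of this module works without the package installed.
    from langchain_unstructured import UnstructuredLoader

    # Environment variable name is an assumption for this sketch.
    api_key = os.getenv("UNSTRUCTURED_API_KEY")
    loader = UnstructuredLoader(**build_loader_kwargs(file_path, api_key))
    return loader.load()
```

Calling load_documents("example.pdf") would then return a list of documents, partitioned remotely when a key is set and locally otherwise; the file name here is a placeholder.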
UnstructuredLoader
See a usage example showing how to use this loader to partition both locally and remotely with the serverless Unstructured API.
UnstructuredCHMLoader
CHM means Microsoft Compiled HTML Help.
UnstructuredCSVLoader
A comma-separated values (CSV) file is a delimited text file that uses
a comma to separate values. Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.
See a usage example.
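To make the record and field structure concrete, here is a small sketch that parses an in-memory CSV string with the standard library, followed by a lazily imported call to UnstructuredCSVLoader. The sample data is made up, the helper names are hypothetical, and the mode="elements" argument is assumed from the langchain-community loader; treat it as an outline rather than a reference implementation.

```python
import csv
import io

# Made-up sample: one header record and two data records.
SAMPLE_CSV = "name,role\nAda,engineer\nGrace,admiral\n"


def parse_csv(text):
    """Split CSV text into records (lines) and fields (comma-separated values)."""
    return list(csv.reader(io.StringIO(text)))


def load_csv_elements(path):
    """Load a CSV file with the Unstructured-backed loader (hypothetical helper)."""
    # Imported lazily; requires langchain-community and unstructured to be installed.
    from langchain_community.document_loaders import UnstructuredCSVLoader

    # mode="elements" is assumed to keep each parsed element as its own document.
    return UnstructuredCSVLoader(file_path=path, mode="elements").load()


rows = parse_csv(SAMPLE_CSV)
for record in rows:
    print(record)
```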
UnstructuredEmailLoader
See a usage example.
UnstructuredEPubLoader
EPUB is an e-book file format that uses
the “.epub” file extension. The term is short for electronic publication and
is sometimes styled ePub. EPUB is supported by many e-readers, and compatible
software is available for most smartphones, tablets, and computers.
See a usage example.
UnstructuredExcelLoader
See a usage example.
UnstructuredFileIOLoader
See a usage example.
UnstructuredHTMLLoader
UnstructuredImageLoader
See a usage example.
UnstructuredMarkdownLoader
See a usage example.
UnstructuredODTLoader
The Open Document Format for Office Applications (ODF), also known as OpenDocument,
is an open file format for word processing documents, spreadsheets, presentations,
and graphics that uses ZIP-compressed XML files. It was developed with the aim of
providing an open, XML-based file format specification for office applications.
See a usage example.
UnstructuredOrgModeLoader
Org Mode is a document editing, formatting, and organizing mode, designed for notes, planning, and authoring within the free software text editor Emacs.
See a usage example.
UnstructuredPDFLoader
UnstructuredPowerPointLoader
See a usage example.
UnstructuredRSTLoader
A reStructured Text (RST) file is a file format for textual data
used primarily in the Python programming language community for technical documentation.
See a usage example.
UnstructuredRTFLoader
See a usage example in the API documentation.
UnstructuredTSVLoader
A tab-separated values (TSV) file is a simple, text-based file format for storing tabular data.
Records are separated by newlines, and values within a record are separated by tab characters.
See a usage example.
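The TSV layout described above can be sketched with the standard library alone. The sample data and the parse_tsv helper are invented for illustration; this only shows the file format and does not call the UnstructuredTSVLoader itself.

```python
import csv
import io

# Made-up sample: records end with newlines, values are separated by tabs.
SAMPLE_TSV = "id\tlabel\n1\talpha\n2\tbeta\n"


def parse_tsv(text):
    """Split TSV text into records, then split each record on tab characters."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


tsv_rows = parse_tsv(SAMPLE_TSV)
for record in tsv_rows:
    print(record)
```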
UnstructuredURLLoader
See a usage example.
UnstructuredWordDocumentLoader
See a usage example.
UnstructuredXMLLoader
See a usage example.

