`langchain_community.document_loaders.dedoc`.DedocAPIFileLoader

class langchain_community.document_loaders.dedoc.DedocAPIFileLoader(file_path: str, *, url: str = 'http://0.0.0.0:1231', split: str = 'document', with_tables: bool = True, with_attachments: Union[str, bool] = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: Union[str, bool] = False, need_binarization: Union[str, bool] = False, need_pdf_table_analysis: Union[str, bool] = True, delimiter: Optional[str] = None, encoding: Optional[str] = None)

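Load files using the dedoc API. The file loader detects the file type automatically, even when the extension is wrong. By default, the loader calls a locally hosted dedoc API. More information about the dedoc API can be found in the dedoc documentation:

https://dedoc.readthedocs.io/en/latest/dedoc_api_usage/api.html

See the DedocBaseLoader documentation for more details.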

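Setup

The dedoc library does not need to be installed to use this loader. Instead, the dedoc API must be running. You can use a Docker container for this. See the dedoc documentation for details:

https://dedoc.readthedocs.io/en/latest/getting_started/installation.html#install-and-run-dedoc-using-docker

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc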
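
Because the loader depends on a running API rather than an installed library, it can help to confirm that something is listening on the mapped port before loading files. The following is a minimal sketch using only the standard library. The helper name `api_is_reachable` and the address `127.0.0.1` are assumptions for illustration, and the check only opens a TCP connection; it does not verify that the service behind the port is dedoc.

```python
import socket
from urllib.parse import urlsplit


def api_is_reachable(url: str = "http://127.0.0.1:1231", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the host and port in `url` succeeds."""
    parts = urlsplit(url)
    host = parts.hostname or "127.0.0.1"
    # Fall back to the scheme's usual port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False
```

Calling `api_is_reachable()` after `docker run` should return True once the container has started and the port mapping is in place.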

Instantiate

from langchain_community.document_loaders import DedocAPIFileLoader

loader = DedocAPIFileLoader(
    file_path="example.pdf",
    # url=...,
    # split=...,
    # with_tables=...,
    # pdf_with_text_layer=...,
    # pages=...,
    # ...
)

Load

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Lazy load

docs = []
docs_lazy = loader.lazy_load()

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}
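
Since lazy_load() returns an iterator, documents can also be consumed in groups as they are yielded instead of being appended to one list. The sketch below shows one way to do that with a generic batching helper. The name `in_batches` and the callback `handle` are hypothetical and not part of the loader.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, consuming `items` lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most `size` items from the shared iterator.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Usage with a loader instance (needs a running dedoc API):
# for batch in in_batches(loader.lazy_load(), size=50):
#     handle(batch)  # `handle` is a hypothetical callback
```

This is most useful when split is set to "page", "node", or "line", where a single file can produce many Document objects.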

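Initialize the loader with a file path, API URL, and parsing parameters.

Parameters

file_path (str) – path to the file to process
url (str) – URL used to call the dedoc API
split (str) – how the document is split into parts (each part is returned separately); default "document"
    "document": the document is returned as a single langchain Document object (no splitting)
    "page": split the document into pages (works for PDF, DJVU, PPTX, PPT, ODP)
    "node": split the document into tree nodes (title nodes, list item nodes, raw text nodes)
    "line": split the document into lines
with_tables (bool) – add tables to the result; each table is returned as a single langchain Document object

The remaining parameters are passed to dedoc for document parsing (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

with_attachments (Union[str, bool]) – enable extraction of attached files
recursion_deep_attachments (int) – recursion depth for attached file extraction; only takes effect when with_attachments==True
pdf_with_text_layer (str) – type of handler used to parse PDF documents; available options ["true", "false", "tabby", "auto", "auto_tabby" (default)]
language (str) – document language for PDFs without a text layer and for images; available options ["eng", "rus", "rus+eng" (default)]. The list of languages can be extended, see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages (str) – page slice that defines the range of pages to read when parsing PDF documents
is_one_column_document (str) – detect the number of columns for PDFs without a text layer and for images; available options ["true", "false", "auto" (default)]
document_orientation (str) – fix the document orientation (90, 180, 270 degrees) for PDFs without a text layer and for images; available options ["auto" (default), "no_change"]
need_header_footer_analysis (Union[str, bool]) – remove headers and footers from the output when parsing PDFs and images
need_binarization (Union[str, bool]) – clean the page background (binarize) for PDFs without a text layer and for images
need_pdf_table_analysis (Union[str, bool]) – parse tables for PDFs without a text layer and for images
delimiter (Optional[str]) – column separator for CSV and TSV files
encoding (Optional[str]) – encoding of TXT, CSV, and TSV files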
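
Several parameters accept only a fixed set of string options, so a typo such as split="pages" is easy to make. The sketch below checks two of them on the client side before constructing the loader. The helpers `build_loader_kwargs` and `make_loader` are hypothetical names introduced for illustration; the option sets are copied from the parameter descriptions, and whether the loader itself rejects unknown values is not stated there.

```python
from typing import Any, Dict

# Allowed values copied from the parameter descriptions.
SPLIT_OPTIONS = {"document", "page", "node", "line"}
PDF_TEXT_LAYER_OPTIONS = {"true", "false", "tabby", "auto", "auto_tabby"}


def build_loader_kwargs(
    split: str = "document",
    pdf_with_text_layer: str = "auto_tabby",
    **extra: Any,
) -> Dict[str, Any]:
    """Validate two option strings and return keyword arguments for the loader."""
    if split not in SPLIT_OPTIONS:
        raise ValueError(f"split must be one of {sorted(SPLIT_OPTIONS)}, got {split!r}")
    if pdf_with_text_layer not in PDF_TEXT_LAYER_OPTIONS:
        raise ValueError(
            f"pdf_with_text_layer must be one of {sorted(PDF_TEXT_LAYER_OPTIONS)}, "
            f"got {pdf_with_text_layer!r}"
        )
    # Any other keyword arguments are passed through unchanged.
    return {"split": split, "pdf_with_text_layer": pdf_with_text_layer, **extra}


def make_loader(file_path: str, **kwargs: Any):
    """Create a DedocAPIFileLoader; the import is deferred until it is needed."""
    from langchain_community.document_loaders import DedocAPIFileLoader

    return DedocAPIFileLoader(file_path, **build_loader_kwargs(**kwargs))
```

For example, `make_loader("example.pdf", split="page", with_tables=False)` would request one Document per page and leave tables out of the result.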

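Methods

__init__(file_path, *[, url, split, ...])

Initialize the loader with a file path, API URL, and parsing parameters.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

Lazily load documents.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split them into chunks.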

__init__(file_path: str, *, url: str = 'http://0.0.0.0:1231', split: str = 'document', with_tables: bool = True, with_attachments: Union[str, bool] = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: Union[str, bool] = False, need_binarization: Union[str, bool] = False, need_pdf_table_analysis: Union[str, bool] = True, delimiter: Optional[str] = None, encoding: Optional[str] = None) → None

Initialize the loader with a file path, API URL, and parsing parameters.

Return type

None

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type

AsyncIterator[Document]

async aload() → List[Document]

Load data into Document objects.

Return type

List[Document]
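
The two async methods can be driven with asyncio. The sketch below collects a bounded number of items from alazy_load(); the helper names `collect` and `first_documents` are hypothetical, and `loader` stands for any object that exposes an `alazy_load()` async iterator such as the one documented here.

```python
import asyncio
from typing import Any, AsyncIterator, List, TypeVar

T = TypeVar("T")


async def collect(stream: AsyncIterator[T], limit: int) -> List[T]:
    """Gather at most `limit` items from an async iterator."""
    items: List[T] = []
    if limit <= 0:
        return items
    async for item in stream:
        items.append(item)
        if len(items) >= limit:
            break
    return items


def first_documents(loader: Any, limit: int = 10) -> List[Any]:
    """Run the event loop and return up to `limit` documents from the loader."""
    return asyncio.run(collect(loader.alazy_load(), limit))


# To fetch everything at once instead:
# all_docs = asyncio.run(loader.aload())
```

`asyncio.run` starts a new event loop, so inside code that is already async, `await collect(loader.alazy_load(), limit)` would be used directly instead.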

lazy_load() → Iterator[Document]

Lazily load documents.

Return type

Iterator[Document]

load() → List[Document]

Load data into Document objects.

Return type

List[Document]

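load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]

Load Documents and split them into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated!

Parameters

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns

List of Documents.

Return type

List[Document]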
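
Given the note that this method should be considered deprecated, one alternative is to load the documents first and then pass them to a splitter explicitly. This is a minimal sketch, not the library's recommended replacement. The helper name `load_then_split` is hypothetical; it assumes `loader` has a `load()` method and `splitter` has a `split_documents()` method, as TextSplitter instances do.

```python
from typing import Any, List


def load_then_split(loader: Any, splitter: Any) -> List[Any]:
    """Load all documents, then split them with an explicitly chosen splitter."""
    documents = loader.load()
    return splitter.split_documents(documents)


# Usage (assumes the langchain-text-splitters package is installed;
# the chunk sizes are arbitrary example values):
# from langchain_text_splitters import RecursiveCharacterTextSplitter
# splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# chunks = load_then_split(loader, splitter)
```

Keeping the two steps separate makes the choice of splitter and its settings visible at the call site.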
