`langchain_community.document_loaders.dedoc`.DedocBaseLoader¶

class langchain_community.document_loaders.dedoc.DedocBaseLoader(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: Union[str, bool] = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: Union[str, bool] = False, need_binarization: Union[str, bool] = False, need_pdf_table_analysis: Union[str, bool] = True, delimiter: Optional[str] = None, encoding: Optional[str] = None)[source]¶

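Base loader that uses dedoc (https://dedoc.readthedocs.io).

The loader extracts text, tables, and attached files from the given file:

Text can be split by pages, dedoc tree nodes, or text lines (according to the split parameter).
Attached files (when with_attachments=True) are split according to the split parameter. For attachments, the langchain Document object has an additional metadata field `type`="attachment".
Tables (when with_tables=True) are not split: each table corresponds to one langchain Document object. For tables, the Document object has the additional metadata fields `type`="table" and `text_as_html`, which contains the HTML representation of the table.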
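
As a rough illustration of the metadata convention described above, the sketch below groups Document-like objects by their `type` metadata field and pulls out the HTML of each table. It uses `SimpleNamespace` stand-ins rather than real loader output, and both the helper name `group_by_type` and the "text" bucket for documents without a `type` field are assumptions made for this example.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_type(documents):
    """Group Document-like objects by their metadata 'type' field.

    Objects without a 'type' key go into a bucket named 'text'
    (a label chosen for this sketch, not something the loader defines).
    """
    groups = defaultdict(list)
    for doc in documents:
        kind = doc.metadata.get("type", "text")
        groups[kind].append(doc)
    return dict(groups)


# Stand-ins that mimic the shape of a langchain Document (page_content + metadata).
sample_docs = [
    SimpleNamespace(page_content="Intro paragraph", metadata={}),
    SimpleNamespace(
        page_content="a b",
        metadata={
            "type": "table",
            "text_as_html": "<table><tr><td>a</td><td>b</td></tr></table>",
        },
    ),
    SimpleNamespace(page_content="Attached note", metadata={"type": "attachment"}),
]

groups = group_by_type(sample_docs)
# Tables carry their HTML representation in the text_as_html field.
table_html = [d.metadata["text_as_html"] for d in groups.get("table", [])]
```

With real loader output, the same function should work unchanged as long as each item exposes a `metadata` dictionary.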

Initialize with a file path and parsing parameters.

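Parameters

file_path (str) – path to the file to process
split (str) – how the document is split into parts (each part is returned separately); default value "document"

"document": the document text is returned as a single langchain Document object (no splitting)

"page": split the document text into pages (works for PDF, DJVU, PPTX, PPT, ODP)

"node": split the document text into tree nodes (title nodes, list item nodes, raw text nodes)

"line": split the document text into lines
with_tables (bool) – add tables to the result; each table is returned as a single langchain Document object

The remaining parameters are used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

with_attachments (Union[str, bool]) – enable extraction of attached files
recursion_deep_attachments (int) – recursion depth for attached-file extraction; only takes effect when with_attachments==True
pdf_with_text_layer (str) – type of handler used to parse PDF documents; available options: ["true", "false", "tabby", "auto", "auto_tabby" (default)]
language (str) – language of the document, for PDFs without a text layer and for images; available options: ["eng", "rus", "rus+eng" (default)]. The list of languages can be extended, see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages (str) – page slice that defines the reading range when parsing PDF documents
is_one_column_document (str) – detect the number of columns, for PDFs without a text layer and for images; available options: ["true", "false", "auto" (default)]
document_orientation (str) – fix the document orientation (90, 180, 270 degrees), for PDFs without a text layer and for images; available options: ["auto" (default), "no_change"]
need_header_footer_analysis (Union[str, bool]) – remove headers and footers from the output, for PDFs and images
need_binarization (Union[str, bool]) – clean the page background (binarize), for PDFs without a text layer and for images
need_pdf_table_analysis (Union[str, bool]) – parse tables, for PDFs without a text layer and for images
delimiter (Optional[str]) – column separator for CSV and TSV files
encoding (Optional[str]) –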
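
Several of the string parameters above accept only a fixed set of values. The sketch below collects those documented values in a table and checks a set of keyword arguments against it before they are handed to a loader. The names `ALLOWED` and `check_options` are hypothetical helpers for this example and are not part of langchain or dedoc; `language` is deliberately left unchecked because its list can be extended.

```python
# Allowed values as listed in the parameter descriptions above.
ALLOWED = {
    "split": {"document", "page", "node", "line"},
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}


def check_options(**options):
    """Return a list of problems; an empty list means every checked option is valid.

    Options that are not in ALLOWED (for example `language` or `pages`)
    are passed through without any check.
    """
    problems = []
    for name, value in options.items():
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems


# Keyword arguments one might pass to a concrete loader built on this base class.
loader_kwargs = {
    "split": "page",
    "with_tables": True,
    "pdf_with_text_layer": "auto_tabby",
    "language": "eng",
    "pages": ":",
}
problems = check_options(**loader_kwargs)

# "paragraph" is not a documented split mode, so it is reported.
bad = check_options(split="paragraph")
```

Checking values up front is only a convenience; it does not replace whatever validation the loader or dedoc performs itself.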

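Methods

`__init__`(file_path, *[, split, with_tables, ...])	Initialize with a file path and parsing parameters.
`alazy_load`()	A lazy loader for Documents.
`aload`()	Load data into Document objects.
`lazy_load`()	Lazily load documents.
`load`()	Load data into Document objects.
`load_and_split`([text_splitter])	Load Documents and split them into chunks.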

__init__(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: Union[str, bool] = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: Union[str, bool] = False, need_binarization: Union[str, bool] = False, need_pdf_table_analysis: Union[str, bool] = True, delimiter: Optional[str] = None, encoding: Optional[str] = None) → None[source]¶

Initialize with a file path and parsing parameters.

Parameters

The arguments are identical to those documented for the class above: file_path, split, with_tables, and the dedoc parsing options with_attachments through encoding.

Return type

None

async alazy_load() → AsyncIterator[Document]¶

A lazy loader for Documents.

Return type: AsyncIterator[Document]

async aload() → List[Document]¶

Load data into Document objects.

Return type: List[Document]
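
The difference between the two async methods can be sketched with a stand-in loader: `alazy_load` is consumed with `async for`, while `aload` is awaited and returns the full list. `FakeAsyncLoader` below is a hypothetical class written for this example; it only mirrors the method names and return shapes documented here and does not call dedoc.

```python
import asyncio
from types import SimpleNamespace


class FakeAsyncLoader:
    """Hypothetical stand-in exposing the alazy_load/aload method names."""

    def __init__(self, texts):
        self._texts = texts

    async def alazy_load(self):
        # Async generator: yields one Document-like object at a time.
        for text in self._texts:
            await asyncio.sleep(0)  # hand control back to the event loop
            yield SimpleNamespace(page_content=text, metadata={})

    async def aload(self):
        # Collect everything the lazy iterator produces into a list.
        return [doc async for doc in self.alazy_load()]


async def main():
    loader = FakeAsyncLoader(["first", "second"])
    streamed = [doc.page_content async for doc in loader.alazy_load()]
    loaded = await loader.aload()
    return streamed, [doc.page_content for doc in loaded]


streamed, loaded = asyncio.run(main())
```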

lazy_load() → Iterator[Document][source]¶

Lazily load documents.

Return type: Iterator[Document]

load() → List[Document]¶

Load data into Document objects.

Return type: List[Document]
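
To show why one might prefer `lazy_load` over `load`, the sketch below uses a stand-in loader that counts how many documents it has built. `FakeLoader` is hypothetical and written only for this example; it assumes, as in the langchain base loader, that `load` simply collects the output of `lazy_load` into a list.

```python
from itertools import islice
from types import SimpleNamespace


class FakeLoader:
    """Hypothetical stand-in exposing the lazy_load/load method names."""

    def __init__(self, texts):
        self._texts = texts
        self.produced = 0  # how many documents have been built so far

    def lazy_load(self):
        # Generator: each document is built only when the caller asks for it.
        for text in self._texts:
            self.produced += 1
            yield SimpleNamespace(page_content=text, metadata={})

    def load(self):
        # Eager variant: exhaust the generator and keep everything in memory.
        return list(self.lazy_load())


lazy = FakeLoader(["p1", "p2", "p3"])
first_two = [d.page_content for d in islice(lazy.lazy_load(), 2)]
produced_after_peek = lazy.produced  # the third document was never built

eager = FakeLoader(["p1", "p2", "p3"])
all_docs = eager.load()
```

For large files split by page, node, or line, iterating lazily keeps only the current document in memory.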

load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]¶

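Load Documents and split them into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated!

Parameters: text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
Returns: List of Documents.
Return type: List[Document]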
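
Conceptually, load_and_split loads the documents and then passes them through a text splitter. The sketch below imitates that flow with a toy fixed-width splitter; it is not RecursiveCharacterTextSplitter, which splits on separators rather than at fixed offsets. The names `chunk_text` and `load_and_split_sketch` are hypothetical helpers for this example.

```python
from types import SimpleNamespace


def chunk_text(text, size):
    """Toy splitter: cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def load_and_split_sketch(documents, size=10):
    """Split each loaded document into chunks, copying its metadata to every chunk."""
    chunks = []
    for doc in documents:
        for piece in chunk_text(doc.page_content, size):
            chunks.append(
                SimpleNamespace(page_content=piece, metadata=dict(doc.metadata))
            )
    return chunks


# One 25-character stand-in document, as if it had been returned by load().
docs = [
    SimpleNamespace(
        page_content="abcdefghijklmnopqrstuvwxy",
        metadata={"source": "demo.txt"},
    )
]
chunks = load_and_split_sketch(docs, size=10)
```

Since the method is described as deprecated, calling load() and applying a splitter explicitly, as this sketch does, is the more transparent pattern.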
