langchain_community.document_loaders.pdf.DedocPDFLoader
- class langchain_community.document_loaders.pdf.DedocPDFLoader(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: Union[str, bool] = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: Union[str, bool] = False, need_binarization: Union[str, bool] = False, need_pdf_table_analysis: Union[str, bool] = True, delimiter: Optional[str] = None, encoding: Optional[str] = None)
Document loader integration that loads PDF files using dedoc. The loader can automatically detect whether the textual layer of the PDF document is correct.
Note that the __init__ method supports parameters that differ from those of DedocBaseLoader.
- Setup:
Install the dedoc package.
    pip install -U dedoc
- Instantiate:
    from langchain_community.document_loaders import DedocPDFLoader

    loader = DedocPDFLoader(
        file_path="example.pdf",
        # split=...,
        # with_tables=...,
        # pdf_with_text_layer=...,
        # pages=...,
        # ...
    )
- Load:
    docs = loader.load()
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }
- Lazy load:
    docs = []
    docs_lazy = loader.lazy_load()

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }
- Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/pdf_handling.html):
    - with_attachments: enable extraction of attached files
    - recursion_deep_attachments: recursion level for extraction of attached files, works only when with_attachments==True
    - pdf_with_text_layer: type of handler for parsing, available options ["true", "false", "tabby", "auto", "auto_tabby" (default)]
    - language: language of the document for PDF without a textual layer, available options ["eng", "rus", "rus+eng" (default)]; the list of languages can be extended, see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
    - pages: page slice that defines the reading range for parsing
    - is_one_column_document: detect the number of columns for PDF without a textual layer, available options ["true", "false", "auto" (default)]
    - document_orientation: fix document orientation (90, 180, 270 degrees) for PDF without a textual layer, available options ["auto" (default), "no_change"]
    - need_header_footer_analysis: remove headers and footers from the output result
    - need_binarization: clean the page background (binarize) for PDF without a textual layer
    - need_pdf_table_analysis: parse tables for PDF without a textual layer
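The option values listed above can be checked before a loader is constructed, so that a typo fails early. The following is a minimal sketch and not part of langchain or dedoc: `PDF_OPTION_CHOICES`, `check_pdf_options`, and `make_loader` are hypothetical helper names, the allowed values are copied from the list above, and `make_loader` assumes that `langchain_community` and `dedoc` are installed.

```python
# Allowed values copied from the parameter list above. The language
# parameter is left unchecked because its list can be extended.
PDF_OPTION_CHOICES = {
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}


def check_pdf_options(**options):
    """Return the options unchanged, or raise ValueError on an unknown value."""
    for name, value in options.items():
        allowed = PDF_OPTION_CHOICES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(
                f"{name}={value!r} is not one of {sorted(allowed)}"
            )
    return options


def make_loader(file_path, **options):
    """Hypothetical wrapper: validate the options, then build the loader."""
    # Imported lazily so the validation helper works without dedoc installed.
    from langchain_community.document_loaders import DedocPDFLoader

    return DedocPDFLoader(file_path, **check_pdf_options(**options))
```

A call such as `make_loader("example.pdf", pdf_with_text_layer="tabby")` would then pass the checked keyword arguments straight through to `DedocPDFLoader`.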
Initialize with file path and parsing parameters.
- Parameters
file_path (str) – path to the file for processing
split (str) – type of document splitting into parts (each part is returned separately), default value "document":
    - "document": document text is returned as a single langchain Document object (don't split)
    - "page": split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP)
    - "node": split document text into tree nodes (title nodes, list item nodes, raw text nodes)
    - "line": split document text into lines
with_tables (bool) – add tables to the result; each table is returned as a single langchain Document object
Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):
with_attachments (Union[str, bool]) – enable extraction of attached files
recursion_deep_attachments (int) – recursion level for extraction of attached files, works only when with_attachments==True
pdf_with_text_layer (str) – type of handler for parsing PDF documents, available options ["true", "false", "tabby", "auto", "auto_tabby" (default)]
language (str) – language of the document for PDF without a textual layer and images, available options ["eng", "rus", "rus+eng" (default)]; the list of languages can be extended, see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages (str) – page slice that defines the reading range for parsing PDF documents
is_one_column_document (str) – detect the number of columns for PDF without a textual layer and images, available options ["true", "false", "auto" (default)]
document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images, available options ["auto" (default), "no_change"]
need_header_footer_analysis (Union[str, bool]) – remove headers and footers from the output result when parsing PDF and images
need_binarization (Union[str, bool]) – clean the page background (binarize) for PDF without a textual layer and images
need_pdf_table_analysis (Union[str, bool]) – parse tables for PDF without a textual layer and images
delimiter (Optional[str]) – column separator for CSV, TSV files
encoding (Optional[str]) – encoding of TXT, CSV, TSV
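To see how the `split` parameter changes the result, one option is to load the same file once per mode and compare how many Document objects come back. This is a rough sketch under stated assumptions: `SPLIT_MODES` and `count_parts_by_split` are hypothetical names, and the default factory assumes `langchain_community` and `dedoc` are installed. Any callable that accepts `(file_path, split=...)` and returns an object with a `load()` method can be passed instead.

```python
# The four split modes described in the parameter list above.
SPLIT_MODES = ("document", "page", "node", "line")


def count_parts_by_split(file_path, loader_factory=None, modes=SPLIT_MODES):
    """Load file_path once per split mode and count the returned parts."""
    if loader_factory is None:
        # Imported lazily so the helper itself has no hard dependency.
        from langchain_community.document_loaders import DedocPDFLoader

        loader_factory = DedocPDFLoader
    return {
        mode: len(loader_factory(file_path, split=mode).load())
        for mode in modes
    }
```

Note that with `with_tables=True` (the default) each table is returned as its own Document, so the counts include tables as well as text parts.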
Methods
__init__(file_path, *[, split, with_tables, ...]) – Initialize with file path and parsing parameters.
alazy_load() – A lazy loader for Documents.
aload() – Load data into Document objects.
lazy_load() – Lazily load documents.
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: Union[str, bool] = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: Union[str, bool] = False, need_binarization: Union[str, bool] = False, need_pdf_table_analysis: Union[str, bool] = True, delimiter: Optional[str] = None, encoding: Optional[str] = None) -> None
Initialize with file path and parsing parameters.
- Return type
None
- async alazy_load() -> AsyncIterator[Document]
A lazy loader for Documents.
- Return type
AsyncIterator[Document]
- load_and_split(text_splitter: Optional[TextSplitter] = None) -> List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered deprecated.
- Parameters
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns
List of Documents.
- Return type
List[Document]
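Since `load_and_split` is flagged as something to treat as deprecated, one alternative is to call the loader and the splitter explicitly. This is a minimal sketch: `load_and_chunk` and `chunk_pdf` are hypothetical helper names, the chunk sizes are arbitrary illustration values, and `chunk_pdf` assumes that `langchain_community`, `langchain_text_splitters`, and `dedoc` are installed.

```python
def load_and_chunk(loader, splitter):
    """Load all documents, then split them into chunks with the splitter."""
    return splitter.split_documents(loader.load())


def chunk_pdf(file_path, chunk_size=1000, chunk_overlap=100):
    """Hypothetical convenience wrapper around DedocPDFLoader."""
    # Imported lazily so load_and_chunk stays usable with any loader/splitter.
    from langchain_community.document_loaders import DedocPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return load_and_chunk(DedocPDFLoader(file_path), splitter)
```

Keeping the two steps separate makes it easy to swap in a different splitter or to inspect the unsplit documents first.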