`langchain_community.document_transformers.beautiful_soup_transformer`.BeautifulSoupTransformer

class langchain_community.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer[source]

Transform HTML content by extracting specific tags and removing unwanted ones.

Example

from langchain_community.document_transformers import BeautifulSoupTransformer

# `docs` is assumed to be a sequence of Document objects whose page_content holds HTML.
bs4_transformer = BeautifulSoupTransformer()
docs_transformed = bs4_transformer.transform_documents(docs)

Initialize the transformer.

Checks whether the BeautifulSoup4 package is installed and raises an ImportError if it is not.

Methods

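`__init__`()	Initialize the transformer.
`atransform_documents`(documents, **kwargs)	Asynchronously transform a sequence of documents.
`extract_tags`(html_content, tags, *[, ...])	Extract specific tags from the given HTML content.
`remove_unnecessary_lines`(content)	Clean up content by removing unnecessary lines.
`remove_unwanted_classnames`(html_content, ...)	Remove unwanted class names from the given HTML content.
`remove_unwanted_tags`(html_content, unwanted_tags)	Remove unwanted tags from the given HTML content.
`transform_documents`(documents[, ...])	Transform a list of Document objects by cleaning their HTML content.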

__init__() → None[source]

Initialize the transformer.

Checks whether the BeautifulSoup4 package is installed and raises an ImportError if it is not.

Return type: None
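
Because construction can raise ImportError, callers that treat HTML cleaning as optional may want to guard it. The following is a minimal sketch of that pattern, not part of the library's documentation; the fallback of setting the variable to None is an illustrative choice.

```python
# Guarded construction: fall back gracefully when the optional
# dependencies (langchain_community, BeautifulSoup4) are not installed.
try:
    from langchain_community.document_transformers import (
        BeautifulSoupTransformer,
    )

    # The constructor itself checks for BeautifulSoup4.
    bs4_transformer = BeautifulSoupTransformer()
except ImportError as exc:
    # Illustrative fallback (an assumption): disable HTML cleaning.
    bs4_transformer = None
    print(f"BeautifulSoupTransformer is unavailable: {exc}")
```

Installing the missing package (for BeautifulSoup4, typically `pip install beautifulsoup4`) should resolve the error.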

async atransform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document][source]

Asynchronously transform a sequence of documents.

Parameters

documents (Sequence[Document]) – The sequence of documents to transform.
kwargs (Any) –

Returns

The transformed sequence of documents.

Return type

Sequence[Document]

static extract_tags(html_content: str, tags: Union[List[str], Tuple[str, ...]], *, remove_comments: bool = False) → str[source]

Extract specific tags from the given HTML content.

Parameters

html_content (str) – The original HTML content string.
tags (Union[List[str], Tuple[str, ...]]) – The list of tags to extract from the HTML.
remove_comments (bool) –

Returns

A string combining the content of the selected tags.

Return type

str
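
A minimal sketch of calling extract_tags directly, assuming langchain_community and BeautifulSoup4 are installed. The sample HTML and the helper name `extract_paragraph_text` are made up for illustration; the helper returns None when the optional dependencies are missing.

```python
from typing import Optional

# Hypothetical sample input, not taken from the library's documentation.
SAMPLE_HTML = (
    "<html><body><h1>Title</h1>"
    "<p>First paragraph.</p>"
    "<ul><li>An item</li></ul>"
    "</body></html>"
)


def extract_paragraph_text(html: str) -> Optional[str]:
    """Return the text of <p> and <li> tags, or None if dependencies are missing."""
    try:
        from langchain_community.document_transformers import (
            BeautifulSoupTransformer,
        )

        # extract_tags is a static method, so no instance is needed.
        return BeautifulSoupTransformer.extract_tags(html, ["p", "li"])
    except ImportError:
        return None


text = extract_paragraph_text(SAMPLE_HTML)
print(text)
```

Only the tags named in `tags` contribute text, so the heading in this sample should be left out of the result.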

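static remove_unnecessary_lines(content: str) → str[source]

Clean up content by removing unnecessary lines.

Parameters: content (str) – A string that may contain unnecessary lines or whitespace.
Returns: The cleaned string with unnecessary lines removed.
Return type: str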
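
To make "unnecessary lines" concrete, here is a plain-Python approximation of this kind of cleanup. It assumes the cleanup strips whitespace around each line, drops empty lines, and joins the rest with single spaces; the function name is hypothetical and the library's actual behavior may differ.

```python
def remove_unnecessary_lines_sketch(content: str) -> str:
    """Approximate line cleanup: strip each line, drop blanks, join with spaces."""
    # Strip leading and trailing whitespace from every line.
    stripped = (line.strip() for line in content.split("\n"))
    # Keep only non-empty lines and join them into one string.
    return " ".join(line for line in stripped if line)


cleaned = remove_unnecessary_lines_sketch("  Heading \n\n  Body text\n")
print(cleaned)
```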

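static remove_unwanted_classnames(html_content: str, unwanted_classnames: Union[List[str], Tuple[str, ...]]) → str[source]

Remove unwanted class names from the given HTML content.

Parameters

html_content (str) – The original HTML content string.
unwanted_classnames (Union[List[str], Tuple[str, ...]]) – The list of class names to remove from the HTML.

Returns

The cleaned HTML string with the unwanted class names removed.

Return type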

str

static remove_unwanted_tags(html_content: str, unwanted_tags: Union[List[str], Tuple[str, ...]]) → str[source]

Remove unwanted tags from the given HTML content.

Parameters

html_content (str) – The original HTML content string.
unwanted_tags (Union[List[str], Tuple[str, ...]]) – The list of tags to remove from the HTML.

Returns

The cleaned HTML string with the unwanted tags removed.

Return type

str
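
The two removal helpers can be chained on a raw HTML string. This is a sketch under the assumption that langchain_community and BeautifulSoup4 are installed; the sample HTML, the class name `ad-banner`, and the helper `strip_noise` are invented for illustration.

```python
from typing import Optional

# Hypothetical sample input with a script and a promotional paragraph.
SAMPLE_HTML = (
    '<div><script>alert("hi")</script>'
    '<p class="ad-banner">Buy now</p>'
    "<p>Real content</p></div>"
)


def strip_noise(html: str) -> Optional[str]:
    """Remove unwanted classes and tags, or return None if dependencies are missing."""
    try:
        from langchain_community.document_transformers import (
            BeautifulSoupTransformer,
        )

        # Both helpers are static methods that take and return HTML strings.
        html = BeautifulSoupTransformer.remove_unwanted_classnames(
            html, ["ad-banner"]
        )
        return BeautifulSoupTransformer.remove_unwanted_tags(
            html, ["script", "style"]
        )
    except ImportError:
        return None


cleaned_html = strip_noise(SAMPLE_HTML)
print(cleaned_html)
```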

transform_documents(documents: Sequence[Document], unwanted_tags: Union[List[str], Tuple[str, ...]] = ('script', 'style'), tags_to_extract: Union[List[str], Tuple[str, ...]] = ('p', 'li', 'div', 'a'), remove_lines: bool = True, *, unwanted_classnames: Union[Tuple[str, ...], List[str]] = (), remove_comments: bool = False, **kwargs: Any) → Sequence[Document][source]

Transform a list of Document objects by cleaning their HTML content.

Parameters

documents (Sequence[Document]) – A sequence of Document objects containing HTML content.
unwanted_tags (Union[List[str], Tuple[str, ...]]) – The list of tags to remove from the HTML.
tags_to_extract (Union[List[str], Tuple[str, ...]]) – The list of tags whose content will be extracted.
remove_lines (bool) – If set to True, unnecessary lines are removed.
unwanted_classnames (Union[Tuple[str, ...], List[str]]) – The list of class names to remove from the HTML.
remove_comments (bool) – If set to True, comments are removed.
kwargs (Any) –

Returns

A sequence of Document objects with transformed content.

Return type

Sequence[Document]
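
An end-to-end sketch of transform_documents with explicit arguments, assuming langchain_community, langchain_core, and BeautifulSoup4 are installed. The sample HTML, the metadata, and the helper `clean_html` are hypothetical; the helper returns an empty list when the dependencies are missing.

```python
from typing import List

# Hypothetical page: a script to drop and a paragraph to keep.
RAW_HTML = (
    "<html><head><script>alert('x')</script></head>"
    "<body><p>Main text.</p></body></html>"
)


def clean_html(raw_html: str) -> List[str]:
    """Return cleaned page contents, or [] if dependencies are missing."""
    try:
        from langchain_community.document_transformers import (
            BeautifulSoupTransformer,
        )
        from langchain_core.documents import Document

        transformer = BeautifulSoupTransformer()
    except ImportError:
        return []

    docs = [Document(page_content=raw_html, metadata={"source": "example"})]
    transformed = transformer.transform_documents(
        docs,
        unwanted_tags=("script", "style"),  # removed before extraction
        tags_to_extract=("p", "li"),        # only these tags contribute text
        remove_lines=True,
    )
    return [doc.page_content for doc in transformed]


contents = clean_html(RAW_HTML)
print(contents)
```

Narrowing `tags_to_extract` from its default, as done here, is one way to keep navigation and layout text out of the result.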

Examples using BeautifulSoupTransformer

Beautiful Soup
