langchain_text_splitters.base.Tokenizer

class langchain_text_splitters.base.Tokenizer(chunk_overlap: int, tokens_per_chunk: int, decode: Callable[[List[int]], str], encode: Callable[[str], List[int]])[source]

Tokenizer data class.

Attributes

chunk_overlap

Overlap in tokens between chunks

tokens_per_chunk

Maximum number of tokens per chunk

decode

Function to decode a list of token ids into a string

encode

Function to encode a string into a list of token ids

Methods

__init__(chunk_overlap, tokens_per_chunk, ...)

Parameters
  • chunk_overlap (int) –

  • tokens_per_chunk (int) –

  • decode (Callable[[List[int]], str]) –

  • encode (Callable[[str], List[int]]) –

__init__(chunk_overlap: int, tokens_per_chunk: int, decode: Callable[[List[int]], str], encode: Callable[[str], List[int]]) None
Parameters
  • chunk_overlap (int) –

  • tokens_per_chunk (int) –

  • decode (Callable[[List[int]], str]) –

  • encode (Callable[[str], List[int]]) –

Return type

None
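To illustrate how these four fields fit together, here is a minimal, self-contained sketch. The `Tokenizer` dataclass below is a local stand-in mirroring the documented signature (the real class lives in `langchain_text_splitters.base`), the whitespace `encode`/`decode` pair is a toy tokenizer, and `split_on_tokens` is a hypothetical helper showing the kind of sliding-window chunking such a tokenizer drives: windows of `tokens_per_chunk` ids, advancing by `tokens_per_chunk - chunk_overlap` each step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tokenizer:
    """Local stand-in mirroring langchain_text_splitters.base.Tokenizer."""
    chunk_overlap: int                    # overlap in tokens between chunks
    tokens_per_chunk: int                 # maximum number of tokens per chunk
    decode: Callable[[List[int]], str]    # token ids -> string
    encode: Callable[[str], List[int]]    # string -> token ids


def split_on_tokens(text: str, tokenizer: Tokenizer) -> List[str]:
    """Hypothetical sliding-window chunker built on the Tokenizer fields."""
    ids = tokenizer.encode(text)
    step = tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
    chunks = []
    start = 0
    while start < len(ids):
        window = ids[start : start + tokenizer.tokens_per_chunk]
        chunks.append(tokenizer.decode(window))
        start += step
    return chunks


# Toy tokenizer: each whitespace-separated word gets an id from a shared vocab.
vocab: List[str] = []


def encode(s: str) -> List[int]:
    ids = []
    for word in s.split():
        if word not in vocab:
            vocab.append(word)
        ids.append(vocab.index(word))
    return ids


def decode(ids: List[int]) -> str:
    return " ".join(vocab[i] for i in ids)


tok = Tokenizer(chunk_overlap=1, tokens_per_chunk=3, decode=decode, encode=encode)
print(split_on_tokens("a b c d e", tok))
# → ['a b c', 'c d e', 'e']  (each chunk shares one token with the previous)
```

Note that `chunk_overlap` must be strictly smaller than `tokens_per_chunk`, otherwise the window never advances.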