chunk
llmfy.llmfy_utils.chunk.chunk
chunk_text(text, chunk_size=800, chunk_overlap=100)
Split text into overlapping chunks.
Example:
from llmfy.llmfy_utils.chunk.chunk import chunk_text

# Plain string input
text = "This is a long text " * 200
chunks = chunk_text(text=text, chunk_size=100, chunk_overlap=20)
for chunk in chunks:
    print(f"{chunk.id}: {chunk.content}")
    print(f"{chunk.metadata}")

# OR, with metadata passed as a (text, metadata) tuple
text = "This is a long text " * 200
data = (text, {"source": "doc1.pdf", "page": 2})
chunks = chunk_text(text=data, chunk_size=100, chunk_overlap=20)
for chunk in chunks:
    print(f"{chunk.id}: {chunk.content}")
    print(f"{chunk.metadata}")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str \| tuple` | Text to chunk, or a `(text, metadata)` tuple. | required |
| `chunk_size` | `int` | Size of each chunk. Defaults to 800. | `800` |
| `chunk_overlap` | `int` | Overlap between consecutive chunks. Defaults to 100. | `100` |
Returns:
| Name | Type | Description |
|---|---|---|
| `chunks` | `List[BaseChunkResult]` | Each chunk has the properties `id` (chunk id), `content` (chunk content), and `metadata` (optional metadata, if provided). |
Source code in llmfy/llmfy_utils/chunk/chunk.py
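To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is not llmfy's implementation: `SketchChunk` and `sketch_chunk_text` are hypothetical names, the `chunk-N` id format is invented for illustration, and the sketch assumes sizes are measured in characters, which the reference above does not state.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union


@dataclass
class SketchChunk:
    """Hypothetical stand-in for BaseChunkResult (id, content, metadata)."""

    id: str
    content: str
    metadata: Optional[Dict[str, Any]] = None


def sketch_chunk_text(
    text: Union[str, Tuple[str, Dict[str, Any]]],
    chunk_size: int = 800,
    chunk_overlap: int = 100,
) -> List[SketchChunk]:
    # Accept either a plain string or a (text, metadata) tuple.
    if isinstance(text, tuple):
        content, metadata = text
    else:
        content, metadata = text, None

    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Assumption: sizes are counted in characters. Each window starts
    # (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks: List[SketchChunk] = []
    for index, start in enumerate(range(0, len(content), step)):
        chunks.append(
            SketchChunk(
                id=f"chunk-{index}",  # hypothetical id format
                content=content[start:start + chunk_size],
                metadata=metadata,
            )
        )
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(content):
            break
    return chunks
```

Under these assumptions, the last `chunk_overlap` characters of one chunk equal the first `chunk_overlap` characters of the next, and the same metadata object is attached to every chunk.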
chunk_markdown_by_header(markdown_text, header_level=None)
Split Markdown into chunks based on header levels. The content of each chunk includes the header itself. Optionally attaches metadata if provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `markdown_text` | `Union[str, Tuple[str, Any]]` | `str` or `(str, metadata_dict)`. Example: `("# Title", {"source": "doc1.md", "page": 2})` | required |
| `header_level` | `int \| None` | `None`: include all headers (`#` to `######`). `int`: include headers up to that level. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `chunks` | `List[MarkdownChunkResult]` | Each chunk has the properties `id` (chunk id), `header` (the header text, without `#`), `level` (header level, 1-6), `content` (header + content text), and `metadata` (optional metadata, if provided). |
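The header-based splitting described above can be sketched roughly as follows. This is an illustration under stated assumptions, not llmfy's code: `SketchMarkdownChunk` and `sketch_chunk_markdown_by_header` are hypothetical names, the `chunk-N` id format is invented, text before the first header is dropped, and headers deeper than `header_level` are assumed to stay inside the enclosing chunk's content.

```python
import re
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple, Union


@dataclass
class SketchMarkdownChunk:
    """Hypothetical stand-in for MarkdownChunkResult."""

    id: str
    header: str
    level: int
    content: str
    metadata: Optional[Any] = None


def sketch_chunk_markdown_by_header(
    markdown_text: Union[str, Tuple[str, Any]],
    header_level: Optional[int] = None,
) -> List[SketchMarkdownChunk]:
    # Accept either a plain string or a (text, metadata) tuple.
    if isinstance(markdown_text, tuple):
        text, metadata = markdown_text
    else:
        text, metadata = markdown_text, None

    max_level = 6 if header_level is None else header_level
    if not 1 <= max_level <= 6:
        raise ValueError("header_level must be between 1 and 6, or None")

    # Match ATX headers with 1..max_level '#' characters followed by text.
    # Limitation of this sketch: '#' lines inside fenced code blocks are
    # also treated as headers.
    pattern = re.compile(rf"^(#{{1,{max_level}}})[ \t]+(.+?)\s*$")

    chunks: List[SketchMarkdownChunk] = []
    header, level, lines = None, 0, []

    def flush() -> None:
        # Emit the chunk collected so far, if any.
        if header is not None:
            chunks.append(
                SketchMarkdownChunk(
                    id=f"chunk-{len(chunks)}",  # hypothetical id format
                    header=header,
                    level=level,
                    content="\n".join(lines),  # header line + body
                    metadata=metadata,
                )
            )

    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            flush()
            header, level, lines = match.group(2), len(match.group(1)), [line]
        elif header is not None:
            lines.append(line)
        # Assumption: lines before the first header are skipped.
    flush()
    return chunks
```

With `header_level=2`, a `###` heading does not start a new chunk in this sketch; it remains part of the content of the preceding `##` section.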