Chunking Example

Chunking text without metadata

from llmfy import chunk_text

text = """Artificial intelligence (AI) is one of the most transformative technologies of our time. It refers to computer systems that can perform tasks traditionally requiring human intelligence, such as learning, reasoning, and problem-solving. From voice assistants to recommendation engines, AI has become deeply embedded in our daily lives.

The rapid growth of machine learning, a subset of AI, has accelerated progress across industries. By training algorithms on vast amounts of data, systems can now recognize patterns, make predictions, and even generate new content. This capability is driving innovations in healthcare, finance, education, and many other fields.

While AI offers enormous benefits, it also raises challenges and ethical questions. Concerns about privacy, bias in algorithms, and the potential loss of jobs highlight the need for responsible development. Balancing innovation with accountability is essential to ensure AI works for the benefit of society as a whole.

Looking ahead, AI will likely continue shaping the future in profound ways. Advancements in natural language processing, robotics, and autonomous systems suggest possibilities we are only beginning to imagine. With careful oversight and thoughtful use, AI has the potential to enhance human capabilities and solve some of the world’s most complex problems.
"""

chunks = chunk_text(text=text, chunk_size=100, chunk_overlap=20)
for chunk in chunks:
    print(f"{chunk.id}: {chunk.content} \n{chunk.metadata}\n")
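
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a minimal standalone sketch of sliding-window chunking. It is not llmfy's implementation: the helper `sliding_window_chunks` is hypothetical, and it assumes sizes are counted in characters, which may differ from how `chunk_text` measures them.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows that overlap by chunk_overlap.

    Hypothetical helper for illustration only; not part of llmfy.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of one chunk reappear at the start of the next.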

Chunking text with metadata

text = """Artificial intelligence (AI) is one of the most transformative technologies of our time. It refers to computer systems that can perform tasks traditionally requiring human intelligence, such as learning, reasoning, and problem-solving. From voice assistants to recommendation engines, AI has become deeply embedded in our daily lives.

The rapid growth of machine learning, a subset of AI, has accelerated progress across industries. By training algorithms on vast amounts of data, systems can now recognize patterns, make predictions, and even generate new content. This capability is driving innovations in healthcare, finance, education, and many other fields.

While AI offers enormous benefits, it also raises challenges and ethical questions. Concerns about privacy, bias in algorithms, and the potential loss of jobs highlight the need for responsible development. Balancing innovation with accountability is essential to ensure AI works for the benefit of society as a whole.

Looking ahead, AI will likely continue shaping the future in profound ways. Advancements in natural language processing, robotics, and autonomous systems suggest possibilities we are only beginning to imagine. With careful oversight and thoughtful use, AI has the potential to enhance human capabilities and solve some of the world’s most complex problems.
"""

# Pair the text with a metadata dict by passing a (text, metadata) tuple.
data = (text, {"source": "doc1.pdf", "page": 2})

chunks = chunk_text(text=data, chunk_size=100, chunk_overlap=20)
for chunk in chunks:
    print(f"{chunk.id}: {chunk.content} \n{chunk.metadata}\n")
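
The loop above reads `id`, `content`, and `metadata` from every chunk. As a rough sketch of how document-level metadata could be carried onto each chunk, the example below uses a hypothetical `Chunk` dataclass and `attach_metadata` helper. Both are assumptions for illustration: the field names mirror the attributes printed above, while the integer `id` and the per-chunk copy of the dict are design choices of this sketch, not confirmed llmfy behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk record; not llmfy's actual class."""
    id: int
    content: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(pieces: list[str], metadata: dict) -> list[Chunk]:
    """Wrap each text piece in a Chunk carrying its own copy of the metadata."""
    return [
        Chunk(id=i, content=piece, metadata=dict(metadata))
        for i, piece in enumerate(pieces)
    ]


records = attach_metadata(
    ["first piece", "second piece"],
    {"source": "doc1.pdf", "page": 2},
)
records[0].metadata["page"] = 3  # editing one chunk leaves the others untouched
print(records[1].metadata)
```

Copying the dict per chunk avoids accidental shared state when a later step annotates individual chunks.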

Chunking markdown by headers

from llmfy import chunk_markdown_by_header

md_text = """
# Main Title

Intro paragraph for the document.

## Section 1
Details for section 1.

### Subsection 1.1
Information about subsection 1.1.

## Section 2
Content for section 2.

### Subsection 2.1
Nested content here.

#### Sub-subsection 2.1.1
Even more nested content.
"""
print("🔹 All headers (default):")
chunks_all = chunk_markdown_by_header(md_text)
for c in chunks_all:
    print(f"\nLevel {c.level} - {c.header}")
    print(c.content)
    print("-" * 60)

print("\n🔹 Only up to level 2:")
chunks_lvl2 = chunk_markdown_by_header(md_text, header_level=2)
for c in chunks_lvl2:
    print(f"\nLevel {c.level} - {c.header}")
    print(c.content)
    print("-" * 60)

Output:

🔹 All headers (default):

Level 1 - Main Title
# Main Title

Intro paragraph for the document.
------------------------------------------------------------

Level 2 - Section 1
## Section 1
Details for section 1.
------------------------------------------------------------

Level 3 - Subsection 1.1
### Subsection 1.1
Information about subsection 1.1.
------------------------------------------------------------

Level 2 - Section 2
## Section 2
Content for section 2.
------------------------------------------------------------

Level 3 - Subsection 2.1
### Subsection 2.1
Nested content here.
------------------------------------------------------------

Level 4 - Sub-subsection 2.1.1
#### Sub-subsection 2.1.1
Even more nested content.
------------------------------------------------------------

🔹 Only up to level 2:

Level 1 - Main Title
# Main Title

Intro paragraph for the document.
------------------------------------------------------------

Level 2 - Section 1
## Section 1
Details for section 1.

### Subsection 1.1
Information about subsection 1.1.
------------------------------------------------------------

Level 2 - Section 2
## Section 2
Content for section 2.

### Subsection 2.1
Nested content here.

#### Sub-subsection 2.1.1
Even more nested content.
------------------------------------------------------------
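
The output above shows that with `header_level=2`, deeper headers stay inside their parent section instead of starting new chunks. A minimal standalone sketch of that splitting rule follows. The function `split_by_header` and its `max_level` parameter are hypothetical and are not llmfy's implementation. The sketch also ignores edge cases such as `#` lines inside fenced code blocks and text before the first header.

```python
import re

# ATX-style header: 1 to 6 '#' characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_by_header(md: str, max_level: int = 6) -> list[dict]:
    """Start a new section at every header whose level is <= max_level.

    Hypothetical helper for illustration only; not part of llmfy.
    """
    sections = []
    current = None
    for line in md.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            current = {
                "level": len(match.group(1)),
                "header": match.group(2),
                "lines": [line],
            }
            sections.append(current)
        elif current is not None:
            # Body text and deeper headers stay in the current section.
            current["lines"].append(line)
    return [
        {
            "level": s["level"],
            "header": s["header"],
            "content": "\n".join(s["lines"]).strip(),
        }
        for s in sections
    ]


doc = """# Title

Intro.

## A
Text A.

### A.1
Text A.1.
"""

for section in split_by_header(doc, max_level=2):
    print(section["level"], section["header"])
```

With `max_level=2`, the `### A.1` header does not open a new section, so its text remains part of the `## A` chunk.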

Chunking markdown by headers with metadata

# A (markdown, metadata) tuple; the output below shows the metadata on every chunk.
md_w_data = (
    """
# Main Title

Intro paragraph.

## Section 1
Details for section 1.

### Subsection 1.1
More info here.
""",
    {"source": "doc1.md", "author": "irufano"},
)
print("\n🔹 Meta data:")
chunks_lvl2 = chunk_markdown_by_header(md_w_data, header_level=2)
for c in chunks_lvl2:
    print(f"Level {c.level} - {c.header}")
    print(f"Metadata: {c.metadata}")
    print(f"Content:\n{c.content}")
    print("-" * 60)

Output:

🔹 Meta data:
Level 1 - Main Title
Metadata: {'source': 'doc1.md', 'author': 'irufano'}
Content:
# Main Title

Intro paragraph.
------------------------------------------------------------
Level 2 - Section 1
Metadata: {'source': 'doc1.md', 'author': 'irufano'}
Content:
## Section 1
Details for section 1.

### Subsection 1.1
More info here.
------------------------------------------------------------