text_preprocessing

llmfy.llmfy_utils.text_preprocessing.text_preprocessing

clean_text_for_embedding(text)

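Light cleaning for embeddings/vector search: normalizes Unicode to NFKC, collapses each run of whitespace into a single space, and trims leading and trailing whitespace.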

Parameters:

Name    Type    Description       Default
text    str     Text to clean.    required

Returns:

Name    Type    Description
str     str     Cleaned text.

Source code in llmfy/llmfy_utils/text_preprocessing/text_preprocessing.py
import re
import unicodedata


def clean_text_for_embedding(text: str) -> str:
    """
    Light cleaning for embeddings/vector search.

    Args:
        text (str): Text to clean.

    Returns:
        str: Cleaned text.
    """
    # Normalize Unicode (e.g., full-width chars → normal width)
    text = unicodedata.normalize("NFKC", text)

    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r"\s+", " ", text)

    # Trim leading/trailing whitespace
    return text.strip()
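
A minimal usage sketch for the function shown above. To keep the example runnable without llmfy installed, it redefines the function locally with the same body as the listed source; the sample string `raw` is made up for illustration. In a real project you would presumably import the function from `llmfy.llmfy_utils.text_preprocessing.text_preprocessing` instead.

```python
import re
import unicodedata


def clean_text_for_embedding(text: str) -> str:
    # Local copy of the documented function, so this sketch is self-contained.
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Hypothetical input: full-width letters, a full-width "！", an "ﬁ" ligature,
# and irregular whitespace (leading spaces, blank lines, a tab).
raw = "  Ｈｅｌｌｏ,\n\n  ｗｏｒｌｄ！\t ﬁne  "

cleaned = clean_text_for_embedding(raw)
print(cleaned)  # Hello, world! fine
```

The cleaning is deliberately light: case and punctuation are left unchanged, so any lowercasing or punctuation stripping would have to happen in a separate step.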