Chunking - Isaacus Docs

Every language model, Isaacus models included, has a limit to the number of tokens it can take at a time. This limit is known as its maximum sequence length or context window. In some cases, we work around this limit by breaking down long texts into smaller chunks through a process called chunking. We use semchunk, the most popular semantic chunking algorithm (which we developed ourselves), to chunk texts in such a way that the chunks created are unlikely to cut off right in the middle of important sentences and paragraphs. Chunks created by the Isaacus API will often correspond to separate clauses and sections in a document. We give you the option to customize how chunking is performed by providing chunk size and chunk overlap ratio parameters in our API. It is worth noting that the default chunk size is the maximum input length of whatever model is being used less overhead, which includes not only boilerplate tokens but also, if a model that takes an Isaacus Query Language query as input is being used, the number of tokens in the longest statement in that query. You also have the freedom to prechunk your text before sending it to an Isaacus model or to not chunk it at all, in which case, we will have to truncate your text to fit within the context window of the model if it is too long. This code snippet shows you how you can use our semchunk algorithm to chunk text like we do:

import semchunk

# NOTE We use a low chunk size here for demonstration purposes.
chunker = semchunk.chunkerify('isaacus/kanon-2-tokenizer', chunk_size=3)

text = "The client is happy."

chunks = chunker(text)

print(chunks) # expected output: ['The client is', 'happy.']

If you ever encounter issues with the way your text is being chunked, you can always create an issue on the semchunk GitHub repository or reach out to us directly.