Any Isaacus model that analyzes text does so by breaking it down into tokens. This page explains how that works and how you can use the Kanon tokenizer to tokenize text in the same way that Isaacus models do.

If you’re actually looking to use Isaacus APIs, please see our quickstart guide or API documentation instead.

Tokens

Like all computer-based AI models, Isaacus models work by converting their inputs into numbers.

Specifically, they convert text into a restricted set of numbers, known as a vocabulary.

Each number in the vocabulary represents a token: a unit of text, such as a word, part of a word, a number, a punctuation mark, or a space.

Any text input sent to an Isaacus model gets broken down into tokens.

The process of breaking text down into tokens is called tokenization, and a tokenizer is a tool that performs tokenization.

Different models have different tokenizers, and different tokenizers have different vocabularies and different tokenization rules.

If we were to send the sentence “The client is happy.” to an Isaacus model that uses the Kanon tokenizer, the model would first break the sentence down into the tokens “The”, “client”, “is”, “happy”, and “.”.
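To make the idea concrete, here is a minimal sketch of that mapping using a toy, made-up vocabulary. The token IDs below are invented for illustration only; the real Kanon vocabulary is far larger and numbers its tokens differently.

# A toy vocabulary mapping tokens to IDs. These IDs are invented
# for illustration; they are not the real Kanon vocabulary.
toy_vocab = {"The": 0, "client": 1, "is": 2, "happy": 3, ".": 4}

tokens = ["The", "client", "is", "happy", "."]

# The model never sees the text itself, only the corresponding numbers.
token_ids = [toy_vocab[token] for token in tokens]

print(token_ids)  # [0, 1, 2, 3, 4]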

A vocabulary can only contain so many tokens.

Most of the tokens in a vocabulary will be common words, but they can also be subwords, as mentioned earlier. For example, a great many words end in the suffix “-ing”, so you will often see “-ing” as a token in vocabularies.

Having subwords in a vocabulary means that if a tokenizer encounters a rare word, such as “discombobulation”, it can break it down into more common subwords, like “disc”, “omb”, “ob”, and “ulation”.
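Here is a minimal sketch of one way such splitting can work: greedy longest-match segmentation against a toy subword vocabulary. Real tokenizers use more sophisticated, data-driven algorithms, and the actual subwords the Kanon tokenizer produces depend on its vocabulary, so treat this purely as an illustration of the idea.

# A toy subword vocabulary. Real vocabularies are far larger and are
# learned from data; these entries are invented for illustration.
toy_vocab = {"disc", "omb", "ob", "ulation", "ing", "un"}

def segment(word: str) -> list[str]:
    """Greedily match the longest known subword at each position."""
    subwords = []
    start = 0
    while start < len(word):
        # Try the longest possible match first.
        for end in range(len(word), start, -1):
            if word[start:end] in toy_vocab:
                subwords.append(word[start:end])
                start = end
                break
        else:
            # No match found; a real tokenizer would fall back to a
            # special "unknown" token or smaller pieces here.
            subwords.append(word[start])
            start += 1
    return subwords

print(segment("discombobulation"))  # ['disc', 'omb', 'ob', 'ulation']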

The Kanon tokenizer

The Kanon tokenizer is the tokenizer currently used by all Isaacus models.

The tokenizer is freely available on Hugging Face. Consequently, you can use it with the Transformers library to tokenize text in the same way that Isaacus models do.

from transformers import AutoTokenizer

# Download and load the Kanon tokenizer from Hugging Face.
kanon_tokenizer = AutoTokenizer.from_pretrained("isaacus/kanon-tokenizer")

text = "The client is happy."

# Break the text down into tokens, just as Isaacus models do.
tokens = kanon_tokenizer.tokenize(text)

print(tokens)
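The tokenize method returns the tokens as strings (depending on the tokenizer’s conventions, some may carry subword or whitespace markers). If you want the vocabulary IDs that the model actually consumes, the same tokenizer object can produce those too; this is standard Transformers usage rather than anything Isaacus-specific, and the exact IDs you see will depend on the Kanon vocabulary.

# Convert the token strings to their vocabulary IDs.
token_ids = kanon_tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# Or encode the text directly; note that encode may also add special
# tokens (such as beginning- and end-of-sequence markers).
print(kanon_tokenizer.encode(text))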