# gotok

gotok is a lightweight command-line utility that counts tokens in text using OpenAI's tokenization schemes, making it easy to estimate input sizes for models such as GPT-4o and its predecessors.
## Features

- Supports multiple OpenAI models and encodings.
- Flexible input and output options.
- Quiet mode for streamlined token counting.
## Requirements

- Go 1.19 or higher
## Installation

Clone the repository and build the binary using Go:

```shell
git clone https://github.com/mattjoyce/gotok.git
cd gotok
go install
```

This will compile the gotok binary and place it in your `$GOPATH/bin` directory, making it accessible from your command line.
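If your shell cannot find gotok after `go install`, Go's bin directory is probably missing from your `PATH`. A minimal sketch, assuming a POSIX shell:

```shell
# Skip gracefully if the Go toolchain is not installed.
command -v go >/dev/null 2>&1 || exit 0

# `go env GOPATH` prints the active GOPATH; installed binaries land in its bin/.
export PATH="$PATH:$(go env GOPATH)/bin"

if command -v gotok >/dev/null 2>&1; then
  gotok --list
else
  echo "gotok not found; check that 'go install' completed successfully"
fi
```

Add the `export PATH=...` line to your shell profile to make the change permanent.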
## Usage

```shell
gotok [options] < [input]
```
### Options

- `--model string`
  Specifies the model whose tokenizer to use (default: `gpt-4o`). If set, the model's default encoding is applied.
- `--encoding string`
  Sets a specific encoding manually, overriding the model's default. Must be one of the listed encodings (e.g., `cl100k_base`).
- `--input string`
  File path to the input text file. If both `--input` and stdin are provided, gotok will concatenate their contents.
- `--output string`
  Designates where to send the output: `stderr`, `stdout`, or a file path (default: `stderr`).
- `--passthrough`
  Outputs the original input text to `stdout` in addition to the token count (default: `false`).
- `--quiet`
  Suppresses all output except the token count, overriding `--passthrough`.
- `--list`
  Lists the available models and encodings, providing an easy reference for compatible options.
## Supported Models and Encodings

Here is a list of available models and their associated encodings that can be used with gotok:

- `code-search-ada-code-001` (r50k_base)
- `code-davinci-001` (p50k_base)
- `text-embedding-ada-002` (cl100k_base)
- `text-similarity-curie-001` (r50k_base)
- `code-search-babbage-code-001` (r50k_base)
- `text-ada-001` (r50k_base)
- `code-davinci-002` (p50k_base)
- `text-embedding-3-small` (cl100k_base)
- `text-davinci-002` (p50k_base)
- `ada` (r50k_base)
- `cushman-codex` (p50k_base)
- `code-davinci-edit-001` (p50k_edit)
- `text-search-davinci-doc-001` (r50k_base)
- `text-babbage-001` (r50k_base)
- `code-cushman-002` (p50k_base)
- `code-cushman-001` (p50k_base)
- `text-similarity-davinci-001` (r50k_base)
- `text-similarity-babbage-001` (r50k_base)
- `text-similarity-ada-001` (r50k_base)
- `text-search-ada-doc-001` (r50k_base)
- `gpt2` (gpt2)
- `davinci` (r50k_base)
- `babbage` (r50k_base)
- `text-search-curie-doc-001` (r50k_base)
- `curie` (r50k_base)
- `davinci-codex` (p50k_base)
- `text-davinci-edit-001` (p50k_edit)
- `text-embedding-3-large` (cl100k_base)
- `gpt-4` (cl100k_base)
- `gpt-3.5-turbo` (cl100k_base)
- `text-curie-001` (r50k_base)
- `text-search-babbage-doc-001` (r50k_base)
- `gpt-4o` (o200k_base)
- `text-davinci-003` (p50k_base)
- `text-davinci-001` (r50k_base)
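Because the chosen encoding determines how text is split into tokens, the same input can yield different counts under different encodings. A quick comparison sketch (it assumes gotok is on your `PATH`):

```shell
# Skip gracefully if gotok is not installed.
command -v gotok >/dev/null 2>&1 || exit 0

TEXT="Tokenization rules differ between encodings."

# --quiet prints only the token count, so the two numbers are easy to compare.
echo "$TEXT" | gotok --quiet --encoding "cl100k_base"
echo "$TEXT" | gotok --quiet --encoding "o200k_base"
```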
## Examples

### Basic Usage

Count tokens in `input.txt`, displaying output in `stderr` (default):

```shell
gotok --input input.txt
```

### Using Stdin Input

Pipe input text from stdin:

```shell
gotok < input.txt
```

### Specify Encoding and Model

Specify a particular encoding and model:

```shell
gotok --encoding "cl100k_base" --model "gpt-4" < input.txt
```

### Quiet Mode

Count tokens only (suppress all other output):

```shell
gotok --quiet --input input.txt
```

### List Available Models and Encodings

```shell
gotok --list
```
## Notes

- When both `--input` and stdin input are provided, the content from both sources will be concatenated before processing.
- Use the `--passthrough` option to print the original input text to `stdout` while viewing token counts.
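Both notes can be combined in a pipeline. The sketch below (file names are hypothetical, and it assumes gotok is on your `PATH`) concatenates a file with stdin and uses `--passthrough` so the combined text continues down the pipeline while the count goes to `stderr`:

```shell
# Skip gracefully if gotok is not installed.
command -v gotok >/dev/null 2>&1 || exit 0

# header.txt and body.txt are hypothetical stand-ins for real input files.
printf 'system header\n' > header.txt
printf 'prompt body\n' > body.txt

# --input and stdin are concatenated before counting; --passthrough echoes
# the combined text to stdout while the token count goes to stderr (the
# default --output), so the downstream command sees only the text.
gotok --input header.txt --passthrough < body.txt | cat -n
```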
This utility simplifies text preprocessing for token-based models by giving you quick insight into input size, helping you manage prompt limits and structure inputs for optimal model performance.