miyako/CTranslate2


ct2-server

CTranslate2 Inference Engine

Usage:  ct2-server -s -e embedding_model -p port 

 -m path     : translation model
 -e path     : embedding model (pooling=mean)
 -r path     : reranker model
 -a path     : generate model
 -g path     : chat completion model
 -t path     : chat template
 -j          : chat template from stdin
 -l          : pooling=last-token
 -c          : pooling=cls
 -s          : server
 -p          : server listening port (default=8080)
 -h host     : server host (default=127.0.0.1)
 -i          : input
 -o          : output (default=stdout)
 -           : use stdin for input
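The flags above compose as in the following hypothetical invocations (the model paths are placeholders for converted CT2 model directories; these combinations are inferred from the flag list, not documented examples):

```shell
# serve an embedding model (pooling=mean) on the default host/port
ct2-server -s -e ./multilingual-e5-small-ct2

# serve a chat completion model with its chat template on port 9000
ct2-server -s -g ./Qwen3-0.6B-ct2 -t ./chat_template.jinja -p 9000

# translate a file without starting the server, writing to stdout
ct2-server -m ./translation-model-ct2 -i input.txt
```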

The CLI is built for 4 platforms:

  • macOS Apple Silicon
  • macOS Intel
  • Windows AMD
  • Windows ARM

Dependencies

  • ctranslate2-4.7.1

OpenAI Compatible Endpoints

  • /v1/models
  • /v1/embeddings
  • /v1/chat/completion

Cohere Compatible Endpoints

  • /v1/rerank

MongoDB Compatible Endpoints

  • /v1/contextualizedembeddings
  • /v1/contextualized/embeddings (alias)

Other Endpoints

  • /v1/generate
  • /v1/translate
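Because the endpoints follow the OpenAI wire format, any plain HTTP client works. A minimal sketch using only the standard library, assuming the server was started with -s -e on the default host and port (the model name and "query:" prefix are illustrative, not required by the server):

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8080"  # default host and port from the usage above

def build_embedding_request(texts, model="multilingual-e5-small"):
    """Build an OpenAI-style /v1/embeddings request body."""
    return json.dumps({"model": model, "input": texts}).encode("utf-8")

def embed(texts):
    """POST to /v1/embeddings and return one vector per input text."""
    req = urllib.request.Request(
        BASE + "/v1/embeddings",
        data=build_embedding_request(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses put each vector under data[i]["embedding"]
    return [item["embedding"] for item in body["data"]]
```

With the server running, `embed(["query: what is CTranslate2?"])` returns a list of float vectors.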

Converted CT2 Models

Quantisation

The int8_float16 format is designed primarily for NVIDIA GPUs. It stores weights as 8-bit integers and dequantises them to 16-bit floating point at compute time, combining compact storage with fast maths. If CUDA is unavailable, CTranslate2 falls back to float32, which defeats the purpose of this hybrid format.

The float16 format likewise targets GPUs with native 16-bit maths. The CPU backend of CTranslate2 computes in float32 even on CPUs, such as Apple Silicon, that do have native 16-bit maths; the weights are simply converted to 32-bit at startup.

The int8 format takes advantage of NEON instructions on Apple Silicon and AVX2/AVX-512 VNNI instructions on Intel or AMD to accelerate the maths. On a PC or Mac without a GPU, always use int8.
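The idea behind int8 storage can be sketched with per-tensor symmetric quantisation: keep the weights as int8 plus one float scale, and dequantise at compute time. This is a minimal illustration, not CTranslate2's exact scheme (which differs in detail, e.g. per-channel scales):

```python
def quantize_int8(weights):
    """Map floats to int8 with a single symmetric per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from int8 values and the scale."""
    return [v * scale for v in q]

w = [0.02, -1.3, 0.7, 1.27]
q, s = quantize_int8(w)
w2 = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w2))
print(q, round(max_err, 4))
```

Storage drops from 4 bytes to 1 byte per weight, at the cost of a rounding error bounded by half the scale.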

Chat Completion with Tool Calling

| Model | int8 (MB) | max_position_embeddings | hidden_size | num_hidden_layers |
|---|---|---|---|---|
| Rakuten/RakutenAI-2.0-mini-instruct | 1540 | 131072 | 2048 | 22 |
| Qwen/Qwen3-0.6B | 598 | 40960 | 1024 | 28 |
| Qwen/Qwen3-1.7B | 1720 | 40960 | 2048 | 28 |
| katanemo/Arch-Agent-1.5B | 1550 | 32768 | 1536 | 28 |
| katanemo/Arch-Agent-3B | 3090 | 32768 | 2048 | 36 |
| meta-llama/Llama-3.2-1B-Instruct | 1240 | 131072 | 2048 | 16 |
| meta-llama/Llama-3.2-3B-Instruct | 3220 | 131072 | 3072 | 28 |
| Salesforce/xLAM-2-1b-fc-r | 1550 | 32768 | 1536 | 28 |
| Salesforce/xLAM-2-3b-fc-r | 3090 | 32768 | 2048 | 36 |
| tiiuae/Falcon3-1B-Instruct | 1670 | 8192 | 2048 | 18 |
| tiiuae/Falcon3-3B-Instruct | 3230 | 32768 | 3072 | 22 |

Chat Completion

| Model | int8 (MB) | max_position_embeddings | hidden_size | num_hidden_layers |
|---|---|---|---|---|
| elyza/ELYZA-japanese-Llama-2-7b-fast-instruct | 6850 | 4096 | 4096 | 32 |
| Rakuten/RakutenAI-7B-instruct | 7380 | 32768 | 4096 | 32 |
| llm-jp/llm-jp-3-1.8b-instruct | 1870 | 4096 | 2048 | 24 |
| llm-jp/llm-jp-3-3.7b-instruct | 3790 | 4096 | 3072 | 28 |
| SakanaAI/TinySwallow-1.5B-Instruct | 1550 | 32768 | 1536 | 28 |
| sbintuitions/sarashina2.2-0.5b-instruct-v0.1 | 795 | 8192 | 1280 | 24 |
| sbintuitions/sarashina2.2-1b-instruct-v0.1 | 1410 | 8192 | 1792 | 24 |
| sbintuitions/sarashina2.2-3b-instruct-v0.1 | 3360 | 8192 | 2560 | 32 |
| google/gemma-2-2b-jpn-it | 2620 | 8192 | 2304 | 26 |
| tokyotech-llm/Gemma-2-Llama-Swallow-2b-it-v0.1 | 2620 | 8192 | 2304 | 26 |
| SakanaAI/Llama-3-Karamaru-v1 | 8040 | 8192 | 4096 | 32 |

Rerank

| Model | int8 (MB) | int8_float16 (MB) | float16 (MB) | max_position_embeddings | hidden_size | num_hidden_layers |
|---|---|---|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L6-v2 | 23 | 23 | 45 | 512 | 384 | 6 |
| cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 | 119 | 119 | 235 | 512 | 384 | 12 |
| BAAI/bge-reranker-v2-m3 | 594 | 577 | 1130 | 8192 | 1024 | 24 |
| BAAI/bge-reranker-base | 280 | 279 | 555 | 8192 | 768 | 12 |
| BAAI/bge-reranker-large | 563 | 561 | 1120 | 8192 | 1024 | 24 |

Embedding

| Model | int8 (MB) | int8_float16 (MB) | float16 (MB) | max_position_embeddings | hidden_size | num_hidden_layers | pooling |
|---|---|---|---|---|---|---|---|
| sentence-transformers/LaBSE | 475 | 474 | 942 | 512 | 768 | 12 | cls |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 280 | 279 | 555 | 512 | 768 | 12 | mean |
| BAAI/bge-small-en-v1.5 | 34 | 33 | 66 | 512 | 384 | 12 | cls |
| BAAI/bge-base-en-v1.5 | 111 | 110 | 219 | 512 | 768 | 12 | cls |
| BAAI/bge-large-en-v1.5 | 338 | 337 | 670 | 512 | 1024 | 24 | cls |
| BAAI/bge-m3 | 595 | 577 | 1130 | 8192 | 1024 | 24 | cls |
| intfloat/e5-small-v2 | 34 | 33 | 66 | 512 | 384 | 12 | mean |
| intfloat/e5-base-v2 | 111 | 110 | 219 | 512 | 768 | 12 | mean |
| intfloat/e5-large-v2 | 339 | 337 | 670 | 512 | 1024 | 24 | mean |
| intfloat/multilingual-e5-small | 120 | 119 | 235 | 512 | 384 | 12 | mean |
| intfloat/multilingual-e5-base | 280 | 279 | 555 | 512 | 768 | 12 | mean |
| intfloat/multilingual-e5-large | 563 | 562 | 1120 | 512 | 1024 | 24 | mean |
| Snowflake/snowflake-arctic-embed-s | 34 | 33 | 66 | 512 | 384 | 12 | cls |
| Snowflake/snowflake-arctic-embed-l | 338 | 337 | 670 | 512 | 1024 | 24 | cls |
| sentence-transformers/all-MiniLM-L6-v2 | 23 | 23 | 45 | 512 | 384 | 6 | mean |
| sentence-transformers/all-MiniLM-L12-v2 | 34 | 33 | 66 | 512 | 384 | 12 | mean |
| ibm-granite/granite-embedding-30m-english | 30 | 30 | 60 | 512 | 384 | 6 | cls |
| ibm-granite/granite-embedding-125m-english | 126 | 126 | 249 | 512 | 768 | 12 | cls |
| ibm-granite/granite-embedding-107m-multilingual | 108 | 108 | 214 | 512 | 384 | 6 | cls |
| ibm-granite/granite-embedding-278m-multilingual | 279 | 279 | 555 | 512 | 768 | 12 | cls |
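The pooling column (and the -l and -c flags) refers to how a single sentence vector is derived from the per-token encoder outputs. A minimal sketch of the three strategies on a toy 3-token, 4-dimensional example, ignoring attention masks and padding:

```python
# One vector per token (toy values).
tokens = [
    [1.0, 0.0, 2.0, 0.0],   # first token, e.g. [CLS]
    [0.0, 2.0, 2.0, 4.0],
    [2.0, 4.0, 2.0, 2.0],   # last token
]

def mean_pool(vecs):
    """Average each dimension across all tokens (pooling=mean)."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def cls_pool(vecs):
    """Take the first token's vector (pooling=cls)."""
    return vecs[0]

def last_token_pool(vecs):
    """Take the last token's vector (pooling=last-token, the -l flag)."""
    return vecs[-1]

print(mean_pool(tokens))  # [1.0, 2.0, 2.0, 2.0]
```

Using the pooling mode the model was trained with matters: mean pooling on a cls-trained model (or vice versa) silently degrades retrieval quality.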

CTranslate2 cannot use models whose tokenisation depends on external Python code:

BertJapaneseTokenizer

  • ruri-base-v2
  • ruri-large-v2

CTranslate2 does not store or return the dense vector representation of the sentence, which is required to use decoder models for embeddings. That excludes:

Qwen2ForCausalLM

  • mixedbread-ai/mxbai-rerank-base-v2
  • mixedbread-ai/mxbai-rerank-large-v2
  • gte-Qwen2-1.5B-instruct
  • gte-Qwen2-7B-instruct

LlamaModel

  • sarashina-embedding-v1-1b
  • sarashina-embedding-v2-1b

As of February 2026, CTranslate2 does not support several notable model architectures:

NomicBertModel

  • nomic-embed-text-v1
  • nomic-embed-text-v1.5

NewModel

  • gte-base-en-v1.5
  • gte-large-en-v1.5

NewForTokenClassification

  • gte-multilingual-base

Gemma3TextModel

  • embeddinggemma-300m
  • ruri-v3-30m
  • ruri-v3-70m
  • ruri-v3-130m
  • ruri-v3-310m
  • modernbert-ja-30m
  • modernbert-ja-70m
  • modernbert-ja-130m
  • modernbert-ja-310m
  • gte-modernbert-base
  • amber-base
  • amber-large
  • granite-embedding-small-english-r2
  • granite-embedding-english-r2

RoPE

NotImplementedError: RoPE scaling type 'yarn' is not yet implemented. The following RoPE scaling types are currently supported: linear, su, llama3, longrope

  • MadeAgents/Hammer2.1-0.5b
  • MadeAgents/Hammer2.1-1.5b

GraniteMoeHybridConfig

  • ibm-granite/granite-4.0-350m
  • ibm-granite/granite-4.0-1b

DebertaV2ForSequenceClassification

  • mixedbread-ai/mxbai-rerank-base-v1
  • mixedbread-ai/mxbai-rerank-large-v1

Qwen 3.5

  • Qwen/Qwen3.5-0.8B
  • Qwen/Qwen3.5-2B

About

Local inference engine
