CTranslate2 Inference Engine
Usage: ct2-server -s -e embedding_model -p port
-m path : translation model
-e path : embedding model (pooling=mean)
-r path : reranker model
-a path : generate model
-g path : chat completion model
-t path : chat template file
-j : read the chat template from stdin
-l : pooling=last-token
-c : pooling=cls
-s : run as a server
-p port : server listening port (default=8080)
-h host : server host (default=127.0.0.1)
-i path : input file
-o path : output file (default=stdout)
- : read input from stdin
The CLI is built for 4 platforms:
- macOS Apple Silicon
- macOS Intel
- Windows AMD
- Windows ARM
ctranslate2-4.7.1
Endpoints:
- /v1/models
- /v1/embeddings
- /v1/chat/completion
- /v1/rerank
- /v1/contextualizedembeddings
- /v1/contextualized/embeddings (alias)
- /v1/generate
- /v1/translate
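As a sketch of how a client might call the server, the snippet below builds a request body for `/v1/embeddings`. The field names (`model`, `input`) follow the common OpenAI-style embeddings schema and are an assumption, not something this document specifies; the model name is hypothetical.

```python
import json

# Hypothetical request body for POST /v1/embeddings.
# The "model"/"input" field names are assumed (OpenAI-style), not documented here.
payload = {
    "model": "bge-m3",            # hypothetical name of a converted model
    "input": ["Hello, world!"],
}

body = json.dumps(payload)

# Equivalent curl call against the default host and port (127.0.0.1:8080):
curl_cmd = (
    "curl -s http://127.0.0.1:8080/v1/embeddings "
    "-H 'Content-Type: application/json' "
    f"-d '{body}'"
)
print(curl_cmd)
```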
The int8_float16 format is primarily designed for NVIDIA GPUs. It stores weights as 8-bit integers and runs the computation in 16-bit floating point, combining compact storage with fast maths. If CUDA is unavailable, CTranslate2 falls back to float32 computation, which defeats the purpose of this hybrid format.
The float16 format is likewise designed for GPUs with native 16-bit maths. The CPU backend of CTranslate2 usually performs calculations in float32, even on a chip like Apple Silicon that does have native 16-bit maths; the weights are automatically converted to 32-bit at load time.
The int8 format takes advantage of NEON instructions on Apple Silicon, and of AVX2, AVX-512 and VNNI instructions on Intel or AMD, to accelerate the maths. On a PC or Mac with no GPU, always use the int8 format.
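The guidance above boils down to a one-line selection rule. This helper is only an illustration of that rule, not part of the ct2-server CLI:

```python
def pick_compute_type(has_cuda_gpu: bool) -> str:
    """Choose a CTranslate2 compute type following the notes above:
    int8_float16 only pays off on an NVIDIA GPU; on any CPU-only
    machine (Apple Silicon, Intel, AMD) plain int8 is the best choice.
    """
    return "int8_float16" if has_cuda_gpu else "int8"

print(pick_compute_type(False))  # → int8
```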
| model | int8 (MB) | max_position_embeddings | hidden_size | num_hidden_layers |
|---|---|---|---|---|
| Rakuten/RakutenAI-2.0-mini-instruct | 1540 | 131072 | 2048 | 22 |
| Qwen/Qwen3-0.6B | 598 | 40960 | 1024 | 28 |
| Qwen/Qwen3-1.7B | 1720 | 40960 | 2048 | 28 |
| katanemo/Arch-Agent-1.5B | 1550 | 32768 | 1536 | 28 |
| katanemo/Arch-Agent-3B | 3090 | 32768 | 2048 | 36 |
| meta-llama/Llama-3.2-1B-Instruct | 1240 | 131072 | 2048 | 16 |
| meta-llama/Llama-3.2-3B-Instruct | 3220 | 131072 | 3072 | 28 |
| Salesforce/xLAM-2-1b-fc-r | 1550 | 32768 | 1536 | 28 |
| Salesforce/xLAM-2-3b-fc-r | 3090 | 32768 | 2048 | 36 |
| tiiuae/Falcon3-1B-Instruct | 1670 | 8192 | 2048 | 18 |
| tiiuae/Falcon3-3B-Instruct | 3230 | 32768 | 3072 | 22 |
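Beyond the on-disk size, max_position_embeddings, hidden_size and num_hidden_layers bound how much memory the KV cache can consume at long contexts. A rough upper-bound estimate is sketched below; it assumes full multi-head attention (grouped-query attention, used by several of these models, shrinks the cache) and float32 values as on the CTranslate2 CPU backend:

```python
def kv_cache_bytes(num_layers: int, hidden_size: int,
                   context_tokens: int, bytes_per_value: int = 4) -> int:
    """Upper-bound KV-cache size: 2 tensors (K and V) per layer, each of
    shape (context_tokens, hidden_size). Assumes the key/value width
    equals hidden_size (i.e. no grouped-query attention) and float32
    values (bytes_per_value=4), as on the CTranslate2 CPU backend.
    """
    return 2 * num_layers * hidden_size * context_tokens * bytes_per_value

# meta-llama/Llama-3.2-1B-Instruct at a 4096-token context:
print(kv_cache_bytes(16, 2048, 4096) / 2**20, "MiB")  # → 1024.0 MiB
```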
| model | int8 (MB) | max_position_embeddings | hidden_size | num_hidden_layers |
|---|---|---|---|---|
| elyza/ELYZA-japanese-Llama-2-7b-fast-instruct | 6850 | 4096 | 4096 | 32 |
| Rakuten/RakutenAI-7B-instruct | 7380 | 32768 | 4096 | 32 |
| llm-jp/llm-jp-3-1.8b-instruct | 1870 | 4096 | 2048 | 24 |
| llm-jp/llm-jp-3-3.7b-instruct | 3790 | 4096 | 3072 | 28 |
| SakanaAI/TinySwallow-1.5B-Instruct | 1550 | 32768 | 1536 | 28 |
| sbintuitions/sarashina2.2-0.5b-instruct-v0.1 | 795 | 8192 | 1280 | 24 |
| sbintuitions/sarashina2.2-1b-instruct-v0.1 | 1410 | 8192 | 1792 | 24 |
| sbintuitions/sarashina2.2-3b-instruct-v0.1 | 3360 | 8192 | 2560 | 32 |
| google/gemma-2-2b-jpn-it | 2620 | 8192 | 2304 | 26 |
| tokyotech-llm/Gemma-2-Llama-Swallow-2b-it-v0.1 | 2620 | 8192 | 2304 | 26 |
| SakanaAI/Llama-3-Karamaru-v1 | 8040 | 8192 | 4096 | 32 |
| model | int8 (MB) | int8_float16 (MB) | float16 (MB) | max_position_embeddings | hidden_size | num_hidden_layers |
|---|---|---|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L6-v2 | 23 | 23 | 45 | 512 | 384 | 6 |
| cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 | 119 | 119 | 235 | 512 | 384 | 12 |
| BAAI/bge-reranker-v2-m3 | 594 | 577 | 1130 | 8192 | 1024 | 24 |
| BAAI/bge-reranker-base | 280 | 279 | 555 | 8192 | 768 | 12 |
| BAAI/bge-reranker-large | 563 | 561 | 1120 | 8192 | 1024 | 24 |
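A `/v1/rerank` request for one of these cross-encoders might be shaped as follows. The field names (`model`, `query`, `documents`) mirror the common rerank APIs (Cohere/Jina style) and are an assumption, not taken from this document; the model name is hypothetical:

```python
import json

# Hypothetical body for POST /v1/rerank; the schema is assumed, not documented here.
payload = {
    "model": "bge-reranker-v2-m3",  # hypothetical name of a converted model
    "query": "What is CTranslate2?",
    "documents": [
        "CTranslate2 is a fast inference engine for Transformer models.",
        "Bananas are rich in potassium.",
    ],
}
print(json.dumps(payload, indent=2))
```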
CTranslate2 can't use models that depend on external Python code for tokenisation:
- ruri-base-v2
- ruri-large-v2
CTranslate2 does not store or return the dense vector representation of the sentence, which is necessary for using decoder models as embedding models. That excludes:
- mixedbread-ai/mxbai-rerank-base-v2
- mixedbread-ai/mxbai-rerank-large-v2
- gte-Qwen2-1.5B-instruct
- gte-Qwen2-7B-instruct
- sarashina-embedding-v1-1b
- sarashina-embedding-v2-1b
As of February 2026, CTranslate2 does not support several notable model architectures:
- nomic-embed-text-v1
- nomic-embed-text-v1.5
- gte-base-en-v1.5
- gte-large-en-v1.5
- gte-multilingual-base
- embeddinggemma-300m
- ruri-v3-30m
- ruri-v3-70m
- ruri-v3-130m
- ruri-v3-310m
- modernbert-ja-30m
- modernbert-ja-70m
- modernbert-ja-130m
- modernbert-ja-310m
- gte-modernbert-base
- amber-base
- amber-large
- granite-embedding-small-english-r2
- granite-embedding-english-r2
Some models fail to convert with an unsupported RoPE scaling type:
`NotImplementedError: RoPE scaling type 'yarn' is not yet implemented. The following RoPE scaling types are currently supported: linear, su, llama3, longrope`
- MadeAgents/Hammer2.1-0.5b
- MadeAgents/Hammer2.1-1.5b
- ibm-granite/granite-4.0-350m
- ibm-granite/granite-4.0-1b
- mixedbread-ai/mxbai-rerank-base-v1
- mixedbread-ai/mxbai-rerank-large-v1
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-2B