CTranslate2 Inference Engine
Usage: ct2-server -s -e embedding_model -p port
-m path : translation model
-e path : embedding model (pooling=mean)
-r path : reranker model
-a path : generate model
-g path : chat completion model
-t path : chat template file
-j : read the chat template from stdin
-l : pooling=last-token
-c : pooling=cls
-s : run as a server
-p port : server listening port (default=8080)
-h host : server host (default=127.0.0.1)
-i path : input file
-o path : output file (default=stdout)
- : read input from stdin
The CLI is built for 4 platforms:
- macOS Apple Silicon
- macOS Intel
- Windows AMD
- Windows ARM
ctranslate2-4.7.1
Endpoints:
- /v1/models
- /v1/embeddings
- /v1/chat/completion
- /v1/rerank
- /v1/contextualizedembeddings
- /v1/contextualized/embeddings (alias)
- /v1/generate
- /v1/translate
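As a sketch of how a client might call the server, the snippet below builds a request body for `/v1/embeddings`. The field names (`model`, `input`) follow the common OpenAI-style embeddings schema and are an assumption, not something this document specifies; the model name is hypothetical.

```python
import json

# Hypothetical request body for POST /v1/embeddings.
# The "model"/"input" field names are assumed (OpenAI-style), not documented here.
payload = {
    "model": "bge-m3",            # hypothetical name of a converted model
    "input": ["Hello, world!"],
}

body = json.dumps(payload)

# Equivalent curl call against the default host and port (127.0.0.1:8080):
curl_cmd = (
    "curl -s http://127.0.0.1:8080/v1/embeddings "
    "-H 'Content-Type: application/json' "
    f"-d '{body}'"
)
print(curl_cmd)
```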
The int8_float16 format is primarily designed for NVIDIA GPUs. It stores weights as 8-bit integers and runs the computation in 16-bit floating point, combining compact storage with fast maths. If CUDA is unavailable, CTranslate2 falls back to float32 computation, which defeats the purpose of this hybrid format.
The float16 format is likewise designed for GPUs with native 16-bit maths. The CPU backend of CTranslate2 usually performs calculations in float32, even on a chip like Apple Silicon that does have native 16-bit maths; the weights are automatically converted to 32-bit at load time.
The int8 format takes advantage of NEON instructions on Apple Silicon, and of AVX2, AVX-512 and VNNI instructions on Intel or AMD, to accelerate the maths. On a PC or Mac with no GPU, always use the int8 format.
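The guidance above boils down to a one-line selection rule. This helper is only an illustration of that rule, not part of the ct2-server CLI:

```python
def pick_compute_type(has_cuda_gpu: bool) -> str:
    """Choose a CTranslate2 compute type following the notes above:
    int8_float16 only pays off on an NVIDIA GPU; on any CPU-only
    machine (Apple Silicon, Intel, AMD) plain int8 is the best choice.
    """
    return "int8_float16" if has_cuda_gpu else "int8"

print(pick_compute_type(False))  # → int8
```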
| model | int8 (MB) | max_position_embeddings | hidden_size | num_hidden_layers |
|---|---|---|---|---|
| Rakuten/RakutenAI-2.0-mini-instruct | 1540 | 131072 | 2048 | 22 |
| Qwen/Qwen3-0.6B | 598 | 40960 | 1024 | 28 |
| Qwen/Qwen3-1.7B | 1720 | 40960 | 2048 | 28 |
| katanemo/Arch-Agent-1.5B | 1550 | 32768 | 1536 | 28 |
| katanemo/Arch-Agent-3B | 3090 | 32768 | 2048 | 36 |
| meta-llama/Llama-3.2-1B-Instruct | 1240 | 131072 | 2048 | 16 |
| meta-llama/Llama-3.2-3B-Instruct | 3220 | 131072 | 3072 | 28 |
| Salesforce/xLAM-2-1b-fc-r | 1550 | 32768 | 1536 | 28 |
| Salesforce/xLAM-2-3b-fc-r | 3090 | 32768 | 2048 | 36 |
| tiiuae/Falcon3-1B-Instruct | 1670 | 8192 | 2048 | 18 |
| tiiuae/Falcon3-3B-Instruct | 3230 | 32768 | 3072 | 22 |
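Beyond the on-disk size, max_position_embeddings, hidden_size and num_hidden_layers bound how much memory the KV cache can consume at long contexts. A rough upper-bound estimate is sketched below; it assumes full multi-head attention (grouped-query attention, used by several of these models, shrinks the cache) and float32 values as on the CTranslate2 CPU backend:

```python
def kv_cache_bytes(num_layers: int, hidden_size: int,
                   context_tokens: int, bytes_per_value: int = 4) -> int:
    """Upper-bound KV-cache size: 2 tensors (K and V) per layer, each of
    shape (context_tokens, hidden_size). Assumes the key/value width
    equals hidden_size (i.e. no grouped-query attention) and float32
    values (bytes_per_value=4), as on the CTranslate2 CPU backend.
    """
    return 2 * num_layers * hidden_size * context_tokens * bytes_per_value

# meta-llama/Llama-3.2-1B-Instruct at a 4096-token context:
print(kv_cache_bytes(16, 2048, 4096) / 2**20, "MiB")  # → 1024.0 MiB
```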
| model | int8 (MB) | max_position_embeddings | hidden_size | num_hidden_layers |
|---|---|---|---|---|
| elyza/ELYZA-japanese-Llama-2-7b-fast-instruct | 6850 | 4096 | 4096 | 32 |
| Rakuten/RakutenAI-7B-instruct | 7380 | 32768 | 4096 | 32 |
| llm-jp/llm-jp-3-1.8b-instruct | 1870 | 4096 | 2048 | 24 |
| llm-jp/llm-jp-3-3.7b-instruct | 3790 | 4096 | 3072 | 28 |
| SakanaAI/TinySwallow-1.5B-Instruct | 1550 | 32768 | 1536 | 28 |
| sbintuitions/sarashina2.2-0.5b-instruct-v0.1 | 795 | 8192 | 1280 | 24 |
| sbintuitions/sarashina2.2-1b-instruct-v0.1 | 1410 | 8192 | 1792 | 24 |
| sbintuitions/sarashina2.2-3b-instruct-v0.1 | 3360 | 8192 | 2560 | 32 |
| google/gemma-2-2b-jpn-it | 2620 | 8192 | 2304 | 26 |
| tokyotech-llm/Gemma-2-Llama-Swallow-2b-it-v0.1 | 2620 | 8192 | 2304 | 26 |
| SakanaAI/Llama-3-Karamaru-v1 | 8040 | 8192 | 4096 | 32 |
| model | int8 (MB) | int8_float16 (MB) | float16 (MB) | max_position_embeddings | hidden_size | num_hidden_layers |
|---|---|---|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L6-v2 | 23 | 23 | 45 | 512 | 384 | 6 |
| cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 | 119 | 119 | 235 | 512 | 384 | 12 |
| BAAI/bge-reranker-v2-m3 | 594 | 577 | 1130 | 8192 | 1024 | 24 |
| BAAI/bge-reranker-base | 280 | 279 | 555 | 8192 | 768 | 12 |
| BAAI/bge-reranker-large | 563 | 561 | 1120 | 8192 | 1024 | 24 |
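A `/v1/rerank` request for one of these cross-encoders might be shaped as follows. The field names (`model`, `query`, `documents`) mirror the common rerank APIs (Cohere/Jina style) and are an assumption, not taken from this document; the model name is hypothetical:

```python
import json

# Hypothetical body for POST /v1/rerank; the schema is assumed, not documented here.
payload = {
    "model": "bge-reranker-v2-m3",  # hypothetical name of a converted model
    "query": "What is CTranslate2?",
    "documents": [
        "CTranslate2 is a fast inference engine for Transformer models.",
        "Bananas are rich in potassium.",
    ],
}
print(json.dumps(payload, indent=2))
```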
CTranslate2 can't use models that depend on external Python code for tokenisation:
- ruri-base-v2
- ruri-large-v2
CTranslate2 does not store or return the dense vector representation of the sentence, which is necessary for using decoder models as embedding models. That excludes:
- mixedbread-ai/mxbai-rerank-base-v2
- mixedbread-ai/mxbai-rerank-large-v2
- gte-Qwen2-1.5B-instruct
- gte-Qwen2-7B-instruct
- sarashina-embedding-v1-1b
- sarashina-embedding-v2-1b
As of February 2026, CTranslate2 does not support several notable model architectures:
- nomic-embed-text-v1
- nomic-embed-text-v1.5
- gte-base-en-v1.5
- gte-large-en-v1.5
- gte-multilingual-base
- embeddinggemma-300m
- ruri-v3-30m
- ruri-v3-70m
- ruri-v3-130m
- ruri-v3-310m
- modernbert-ja-30m
- modernbert-ja-70m
- modernbert-ja-130m
- modernbert-ja-310m
- gte-modernbert-base
- amber-base
- amber-large
- granite-embedding-small-english-r2
- granite-embedding-english-r2
Some models fail to convert with an unsupported RoPE scaling type:
`NotImplementedError: RoPE scaling type 'yarn' is not yet implemented. The following RoPE scaling types are currently supported: linear, su, llama3, longrope`
- MadeAgents/Hammer2.1-0.5b
- MadeAgents/Hammer2.1-1.5b
- ibm-granite/granite-4.0-350m
- ibm-granite/granite-4.0-1b
- mixedbread-ai/mxbai-rerank-base-v1
- mixedbread-ai/mxbai-rerank-large-v1
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-2B