```bash
python3 -m llama_cpp.server --model <model_path>
```
Server options
For a full list of options, run:

```bash
python3 -m llama_cpp.server --help
```
NOTE: All server options are also available as environment variables. For example, --model can be set by setting the MODEL environment variable.
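For illustration, a minimal sketch of launching the server with an option supplied as an environment variable instead of a CLI flag (the subprocess wrapper is only for demonstration; `<model_path>` is a placeholder):

```python
# Minimal sketch: start the server with options supplied as environment variables.
# MODEL mirrors the --model CLI flag; <model_path> is a placeholder.
import os
import subprocess

env = dict(os.environ, MODEL="<model_path>")
subprocess.run(["python3", "-m", "llama_cpp.server"], env=env, check=True)
```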
Check out the server options reference below for more information on the available settings. CLI arguments and environment variables are available for all of the fields defined in ServerSettings and ModelSettings.
Additionally, the server supports configuration via a JSON config file; check out the Configuration and Multi-Model Support section below for more information and examples.
Guides
Code Completion
llama-cpp-python supports code completion via GitHub Copilot.
NOTE: Without GPU acceleration this is unlikely to be fast enough to be usable.
You'll first need to download one of the available code completion models in GGUF format:
replit-code-v1_5-GGUF
Then you'll need to run the OpenAI compatible web server with a substantially increased context size for GitHub Copilot requests:

```bash
python3 -m llama_cpp.server --model <model_path> --n_ctx 16192
```
Then just update your settings in .vscode/settings.json to point to your code completion server:
```json
{
    // ...
    "github.copilot.advanced": {
        "debug.testOverrideProxyUrl": "http://<host>:<port>",
        "debug.overrideProxyUrl": "http://<host>:<port>"
    }
    // ...
}
```
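Outside of the Copilot extension, you can also hit the completion endpoint directly to sanity-check the server. A minimal sketch using the official OpenAI Python client (the model name is illustrative; with a single loaded model the server serves it regardless):

```python
# Sketch: querying the code-completion server directly with the OpenAI client.
# The model name is illustrative and not required to match anything specific
# when only one model is loaded.
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")
completion = client.completions.create(
    model="copilot-codex",
    prompt="def fibonacci(n):",
    max_tokens=64,
)
print(completion.choices[0].text)
```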
Function Calling
llama-cpp-python supports structured function calling based on a JSON schema.
Function calling is completely compatible with the OpenAI function calling API and can be used by connecting with the official OpenAI Python client.
You'll first need to download one of the available function calling models in GGUF format:
functionary
Then when you run the server you'll need to also specify either the functionary-v1 or functionary-v2 chat_format.
Note that since functionary requires an HF tokenizer, due to discrepancies between llama.cpp and HuggingFace's tokenizers as mentioned here, you will need to pass in the path to the tokenizer too. The tokenizer files are already included in the respective HF repositories hosting the gguf files.
```bash
python3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>
```
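As a quick illustration, a minimal sketch of calling the server through the official OpenAI Python client with a tool definition (the get_current_weather tool and the model name are hypothetical and only for demonstration):

```python
# Sketch: structured function calling via the OpenAI-compatible endpoint.
# The tool definition and model name below are illustrative, not part of functionary.
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="functionary",  # illustrative; the single loaded model handles the request
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```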
Check out this example notebook for a walkthrough of some interesting use cases for function calling.
Multimodal Models
llama-cpp-python supports the llava1.5 family of multi-modal models which allow the language model to read information from both text and images.
You'll first need to download one of the available multi-modal models in GGUF format:
llava-v1.5-7b
llava-v1.5-13b
bakllava-1-7b
llava-v1.6-34b
moondream2
Then when you run the server you'll need to also specify the path to the clip model used for image embedding and the llava-1-5 chat_format:
```bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```
Then you can just use the OpenAI API as normal:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "<image_url>"},
                },
                {"type": "text", "text": "What does the image say"},
            ],
        }
    ],
)
print(response)
```
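If the image lives on disk, it can typically be sent as a base64 data URI instead of a remote URL. A minimal sketch under that assumption (the helper below is illustrative, not part of the library, and assumes the llava chat handler accepts data: URLs):

```python
# Sketch: encode a local image as a base64 data URI for the "image_url" field.
# Assumes the server's llava chat handler accepts data: URLs.
import base64


def image_to_data_uri(path: str, mime: str = "image/png") -> str:
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


# Use the result in place of "<image_url>" above, e.g.:
# {"type": "image_url", "image_url": {"url": image_to_data_uri("cat.png")}}
```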
Configuration and Multi-Model Support
The server supports configuration via a JSON config file that can be passed using the --config_file parameter or the CONFIG_FILE environment variable.

```bash
python3 -m llama_cpp.server --config_file <config_file>
```
Config files support all of the server and model options supported by the CLI and environment variables; however, instead of only a single model, the config file can specify multiple models.
The server supports routing requests to multiple models based on the model parameter in the request, which is matched against the model_alias in the config file.
At the moment only a single model is loaded into memory at a time; the server will automatically load and unload models as needed.
"host": "0.0.0.0",
"port": 8080,
"models": [
"model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
"model_alias": "gpt-3.5-turbo",
"chat_format": "chatml",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
"model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
"model_alias": "gpt-4",
"chat_format": "chatml",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
"model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
"model_alias": "gpt-4-vision-preview",
"chat_format": "llava-1-5",
"clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
"model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
"model_alias": "text-davinci-003",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
"model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
"model_alias": "copilot-codex",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 1024,
"n_ctx": 9216
The config file format is defined by the ConfigFileSettings class.
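For illustration, a minimal sketch of routing requests against the config above by setting the OpenAI model parameter to one of the configured model_alias values (host is a placeholder):

```python
# Sketch: the "model" parameter selects which configured model_alias handles the request.
from openai import OpenAI

client = OpenAI(base_url="http://<host>:8080/v1", api_key="sk-xxx")

# Routed to the OpenHermes entry via its "gpt-3.5-turbo" alias.
chat = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Routed to the llava entry via its "gpt-4-vision-preview" alias; the server
# loads and unloads models as needed since only one is resident at a time.
vision = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{"role": "user", "content": "Describe a sunset."}],
)

print(chat.choices[0].message.content)
```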
Server Options Reference
```python
class ServerSettings(BaseSettings):
    """Server settings used to configure the FastAPI and Uvicorn server."""

    # Uvicorn Settings
    host: str = Field(default="localhost", description="Listen address")
    port: int = Field(default=8000, description="Listen port")
    ssl_keyfile: Optional[str] = Field(
        default=None, description="SSL key file for HTTPS"
    )
    ssl_certfile: Optional[str] = Field(
        default=None, description="SSL certificate file for HTTPS"
    )
    # FastAPI Settings
    api_key: Optional[str] = Field(
        default=None,
        description="API key for authentication. If set all requests need to be authenticated.",
    )
    interrupt_requests: bool = Field(
        default=True,
        description="Whether to interrupt requests when a new request is received.",
    )
    disable_ping_events: bool = Field(
        default=False,
        description="Disable EventSource pings (may be needed for some clients).",
    )
    root_path: str = Field(
        default="",
        description="The root path for the server. Useful when running behind a reverse proxy.",
    )
```
```python
class ModelSettings(BaseSettings):
    """Model settings used to load a Llama model."""

    model: str = Field(
        description="The path to the model to use for generating completions."
    )
    model_alias: Optional[str] = Field(
        default=None,
        description="The alias of the model to use for generating completions.",
    )
    # Model Params
    n_gpu_layers: int = Field(
        default=0,
        ge=-1,
        description="The number of layers to put on the GPU. The rest will be on the CPU. Set -1 to move all to GPU.",
    )
    split_mode: int = Field(
        default=llama_cpp.LLAMA_SPLIT_MODE_LAYER,
        description="The split mode to use.",
    )
    main_gpu: int = Field(
        default=0,
        ge=0,
        description="Main GPU to use.",
    )
    tensor_split: Optional[List[float]] = Field(
        default=None,
        description="Split layers across multiple GPUs in proportion.",
    )
    vocab_only: bool = Field(
        default=False, description="Whether to only return the vocabulary."
    )
    use_mmap: bool = Field(
        default=llama_cpp.llama_supports_mmap(),
        description="Use mmap.",
    )
    use_mlock: bool = Field(
        default=llama_cpp.llama_supports_mlock(),
        description="Use mlock.",
    )
    kv_overrides: Optional[List[str]] = Field(
        default=None,
        description="List of model kv overrides in the format key=type:value where type is one of (bool, int, float). Valid true values are (true, TRUE, 1), otherwise false.",
    )
    rpc_servers: Optional[str] = Field(
        default=None,
        description="comma seperated list of rpc servers for offloading",
    )
    # Context Params
    seed: int = Field(
        default=llama_cpp.LLAMA_DEFAULT_SEED, description="Random seed. -1 for random."
    )
    n_ctx: int = Field(default=2048, ge=0, description="The context size.")
    n_batch: int = Field(
        default=512, ge=1, description="The batch size to use per eval."
    )
    n_ubatch: int = Field(
        default=512, ge=1, description="The physical batch size used by llama.cpp"
    )
    n_threads: int = Field(
        default=max(multiprocessing.cpu_count() // 2, 1),
        ge=1,
        description="The number of threads to use. Use -1 for max cpu threads",
    )
    n_threads_batch: int = Field(
        default=max(multiprocessing.cpu_count(), 1),
        ge=0,
        description="The number of threads to use when batch processing. Use -1 for max cpu threads",
    )
    rope_scaling_type: int = Field(
        default=llama_cpp.LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED
    )
    rope_freq_base: float = Field(default=0.0, description="RoPE base frequency")
    rope_freq_scale: float = Field(
        default=0.0, description="RoPE frequency scaling factor"
    )
    yarn_ext_factor: float = Field(default=-1.0)
    yarn_attn_factor: float = Field(default=1.0)
    yarn_beta_fast: float = Field(default=32.0)
    yarn_beta_slow: float = Field(default=1.0)
    yarn_orig_ctx: int = Field(default=0)
    mul_mat_q: bool = Field(
        default=True, description="if true, use experimental mul_mat_q kernels"
    )
    logits_all: bool = Field(default=True, description="Whether to return logits.")
    embedding: bool = Field(default=False, description="Whether to use embeddings.")
    offload_kqv: bool = Field(
        default=True, description="Whether to offload kqv to the GPU."
    )
    flash_attn: bool = Field(
        default=False, description="Whether to use flash attention."
    )
    # Sampling Params
    last_n_tokens_size: int = Field(
        default=64,
        ge=0,
        description="Last n tokens to keep for repeat penalty calculation.",
    )
    # LoRA Params
    lora_base: Optional[str] = Field(
        default=None,
        description="Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.",
    )
    lora_path: Optional[str] = Field(
        default=None,
        description="Path to a LoRA file to apply to the model.",
    )
    # Backend Params
    numa: Union[bool, int] = Field(
        default=False,
        description="Enable NUMA support.",
    )
    # Chat Format Params
    chat_format: Optional[str] = Field(
        default=None,
        description="Chat format to use.",
    )
    clip_model_path: Optional[str] = Field(
        default=None,
        description="Path to a CLIP model to use for multi-modal chat completion.",
    )
    # Cache Params
    cache: bool = Field(
        default=False,
        description="Use a cache to reduce processing times for evaluated prompts.",
    )
    cache_type: Literal["ram", "disk"] = Field(
        default="ram",
        description="The type of cache to use. Only used if cache is True.",
    )
    cache_size: int = Field(
        default=2 << 30,
        description="The size of the cache in bytes. Only used if cache is True.",
    )
    # Tokenizer Options
    hf_tokenizer_config_path: Optional[str] = Field(
        default=None,
        description="The path to a HuggingFace tokenizer_config.json file.",
    )
    hf_pretrained_model_name_or_path: Optional[str] = Field(
        default=None,
        description="The model name or path to a pretrained HuggingFace tokenizer model. Same as you would pass to AutoTokenizer.from_pretrained().",
    )
    # Loading from HuggingFace Model Hub
    hf_model_repo_id: Optional[str] = Field(
        default=None,
        description="The model repo id to use for the HuggingFace tokenizer model.",
    )
    # Speculative Decoding
    draft_model: Optional[str] = Field(
        default=None,
        description="Method to use for speculative decoding. One of (prompt-lookup-decoding).",
    )
    draft_model_num_pred_tokens: int = Field(
        default=10,
        description="Number of tokens to predict using the draft model.",
    )
    # KV Cache Quantization
    type_k: Optional[int] = Field(
        default=None,
        description="Type of the key cache quantization.",
    )
    type_v: Optional[int] = Field(
        default=None,
        description="Type of the value cache quantization.",
    )
    # Misc
    verbose: bool = Field(
        default=True, description="Whether to print debug information."
    )

    @model_validator(
        mode="before"
    )  # pre=True to ensure this runs before any other validation
    def set_dynamic_defaults(self) -> Self:
        # If n_threads or n_threads_batch is -1, set it to multiprocessing.cpu_count()
        cpu_count = multiprocessing.cpu_count()
        values = cast(Dict[str, int], self)
        if values.get("n_threads", 0) == -1:
            values["n_threads"] = cpu_count
        if values.get("n_threads_batch", 0) == -1:
            values["n_threads_batch"] = cpu_count
        return self
```