NRP-Managed LLMs
The NRP provides several hosted open-weights LLMs for either API access or use with our hosted chat interfaces.
Chat Interfaces
LibreChat
If you want to chat with an LLM through an interface similar to ChatGPT, we host LibreChat, based on the LibreChat project. It is a simple chat interface to all of the NRP-hosted models, which you can use to chat with the models or to try them out.
Visit the LibreChat interface
On macOS with Safari, you can make it always available in the Dock for quick access: with LibreChat open in Safari, click File->Add to Dock.
Chatbox
You can install the standalone Chatbox app or use its web version.
Visit the Chatbox app website
Generate the config for it on the LLM token generation page and copy it to the clipboard - it will already contain your personal token. Please always leave Max Output Tokens empty and only fill in Context Window.
In the Chatbox app, go to Settings->Model Provider, scroll to the end of the providers list, and click Import from clipboard.
API Access to LLMs via the Envoy AI Gateway
To access our LLMs through the Envoy AI Gateway, you need to be a member of a group with the LLM flag. Your membership info can be found on the namespaces page.
Start by creating a token. You can use this token to query the OpenAI-compatible LLM endpoint with curl or any OpenAI-API-compatible tool:
curl -H "Authorization: Bearer <your_token>" https://ellm.nrp-nautilus.io/v1/modelsPlease always leave empty or do not specify Max Output Tokens/max_tokens/max_output_tokens.
Examples
Python Code
To access the NRP LLMs, you can use the OpenAI Python client, as in the example below.
```python
import os

from openai import OpenAI

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

completion = client.chat.completions.create(
    model="gemma3",
    messages=[
        {"role": "system", "content": "Talk like a pirate."},
        {
            "role": "user",
            "content": "How do I check if a Python object is an instance of a class?",
        },
    ],
)

print(completion.choices[0].message.content)
```
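You can also stream responses token by token. Below is a minimal sketch, assuming the gateway passes through OpenAI-style streaming (`stream=True`); the model and prompt are arbitrary examples:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

# Request a streamed completion instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="gemma3",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of the generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```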
curl -H "Authorization: Bearer <TOKEN>" https://ellm.nrp-nautilus.io/v1/modelscurl -H "Authorization: Bearer <TOKEN>" -X POST "https://ellm.nrp-nautilus.io/v1/chat/completions" \-H "Content-Type: application/json" \-d '{ "model": "meta-llama/Llama-3.2-90B-Vision-Instruct", "messages": [ {"role": "user", "content": "Hey!"} ] }'Available Models
Available Models

- main: The model is generally supported, and you can report issues with the service. However, if the model is outdated with no apparent usage purpose, it may be removed or switched to the deprecated state when there is no major group or user usage. This is to provide our users with the best models within our limited allocation of GPUs.
- dep: The LLM is deprecated and is likely to go away soon. Please do not start using this model; it remains only for existing user groups who have a specific need for it.
- eval: The LLM was added for testing and we're evaluating its capabilities. The model may be unavailable at times, and its configuration may change without notice.
You can follow all updates and participate in the discussions in our Nautilus Machine Learning/Data Science Matrix channel (NRP Matrix.to, Matrix.to). Suggestions and decisions about new models are also made there.
| API Title | Status | Model | Features |
|---|---|---|---|
| qwen3 | main | Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 | Multimodal (vision, video), 262,144 tokens, 235B parameters, FP8 quantization, tool calling, Claude/Gemini-level frontier multimodal performance |
| gpt-oss | eval | openai/gpt-oss-120b | 131,072 tokens, tool calling, official MXFP4 quantization, frontier agentic performance |
| glm-4.6 | eval | QuantTrio/GLM-4.6-GPTQ-Int4-Int8Mix | 204,800 tokens, tool calling, GPTQ quantization, frontier agentic coding performance |
| minimax-m2 | eval | MiniMaxAI/MiniMax-M2 | 262,144 tokens, tool calling, official FP8 quantization, frontier agentic coding performance |
| glm-v | main | cpatonn/GLM-4.5V-AWQ-8bit | Multimodal (vision, video), 65,536 tokens, tool calling, AWQ 8-bit quantization, GPT-4o-level multimodal performance |
| gemma3 | main | google/gemma-3-27b-it | Multimodal (vision), 131,072 tokens, tool calling |
| embed-mistral | main | intfloat/e5-mistral-7b-instruct | Embeddings |
| gorilla | eval | gorilla-llm/gorilla-openfunctions-v2 | Function calling |
| test-gaudi3 | eval | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 70B parameters, 131,072 tokens, running on Intel Gaudi3 |
| olmo | eval | allenai/OLMo-2-0325-32B-Instruct | Open source |
| watt | eval | watt-ai/watt-tool-8B | Function calling |
| llama3-sdsc | dep | meta-llama/Llama-3.3-70B-Instruct | 8 languages, 131,072 tokens, tool calling |
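Several of the models above list tool calling as a feature. Below is a minimal sketch of an OpenAI-style function call through the gateway; the `get_weather` tool is purely hypothetical, and we assume `gemma3` accepts the standard `tools` parameter:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

# A hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="gemma3",
    messages=[{"role": "user", "content": "What is the weather in San Diego?"}],
    tools=tools,
)

message = completion.choices[0].message
if message.tool_calls:
    # The model requested a tool call; inspect the name and JSON arguments.
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```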
How Models are Added and Removed
Added: New NRP-managed models are added by the administrators based on user feedback, assessments of various benchmarks, and the community's response to the models. We take quantitative benchmarks into account (such as https://artificialanalysis.ai), but the ultimate decision rests on qualitative evidence (such as https://www.reddit.com/r/LocalLLaMA/) and discussions between administrators and users.
Removed: We remove models that are deemed sufficiently obsolete - for instance, when smaller models perform better all-round, or when another model covers the same use case in a clearly better way.
Deprecated: An exception is when research groups need a model for reproducibility in their research. In that case, we deprecate the model first and keep it up until such research concludes. If a model has been deprecated or pulled down, please reach out through the Nautilus Machine Learning/Data Science channel below so we can track the need.
However, we still aim to remove deprecated models as soon as possible: our GPU allocation for deployed LLMs is limited, and those GPUs should be diverted to more recent, better-performing models for the benefit of the whole NRP community. Administrators and researchers use these models for AI-assisted code development, where rotating in new and better models as they are released vastly increases individual productivity.
Larger models that require many GPUs are removed earlier and more strictly if their relative performance falls behind, while smaller, more efficient, or quantized models that need fewer GPUs are judged more leniently by this criterion.
These discussions take place in the Nautilus Machine Learning/Data Science (NRP Matrix.to, Matrix.to) channel.
Changelogs
Week of November 9th, 2025:
Added/Changed:
- qwen3 (Qwen/Qwen3-235B-A22B-Thinking-2507-FP8) has been changed to Qwen/Qwen3-VL-235B-A22B-Thinking-FP8. It has very similar characteristics, such as number of parameters, context size, and benchmarks, but adds state-of-the-art vision and video multimodal capabilities.
- glm-4.6 (QuantTrio/GLM-4.6-GPTQ-Int4-Int8Mix) is a widely popular programming LLM and exhibits a level of performance similar to Claude Sonnet 4 or GPT-5 models.
- minimax-m2 (MiniMaxAI/MiniMax-M2) is a widely popular programming LLM and exhibits a level of performance similar to Claude Sonnet 4 or GPT-5 models, while fitting the official FP8 parameters in four A100 GPUs with ample context length.
- gpt-oss (openai/gpt-oss-120b) is a very capable agentic model, adequate for general-purpose usage, while requiring only one A100 GPU or two RTX A6000 GPUs for full context thanks to sliding window attention and official MXFP4 quantization - a fraction of what other frontier models need. This is our candidate for an "LTS" model used for reproducible research, superseding the deprecated or removed Llama3 models.
- gemma3 was moved to 2x RTX A6000 GPUs instead of 2x A100 GPUs to conserve the latter. The model's sliding window attention allows the full context to fit in this configuration.
Removed:
- llama3 (meta-llama/Llama-3.2-90B-Vision-Instruct) has been officially pulled down: it consumed 4 A100 GPUs that can be used for much more frontier models, such as MiniMax-M2 or GLM-4.6, while performing much worse than models that fit in one GPU.
- deepseek-r1 (QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Medium) has been officially pulled down: it consumed 8 GPUs while being very slow (5-6 tokens/s) at any larger context size. Many similar models work well, although not necessarily better in every way, and are faster. This is an example of the "larger models that require a lot of GPUs are likely to be removed earlier" note above.
