NRP-Managed LLMs
The NRP provides several hosted open-weights LLMs for either API access or use with our hosted chat interfaces.
Chat Interfaces
LibreChat
If you want to chat with an LLM through an interface similar to ChatGPT, we host LibreChat, based on the LibreChat project. It is a simple chat interface to all of the NRP-hosted models, which you can use to chat with the models or to try them out.
Visit the LibreChat interface
On macOS with Safari, you can make it always available in the Dock for quick access: with LibreChat open in Safari, click File->Add to Dock.
Chatbox
You can install the standalone Chatbox app or use its web version.
Visit the Chatbox app website
Generate the config for it on the LLM token generation page and copy it to the clipboard - it will already contain your personal token. Please always leave Max Output Tokens empty and only fill in Context Window.
In the Chatbox app, go to Settings->Model Provider, scroll to the end of the providers list, and click Import from clipboard.
API Access to LLMs via the Envoy AI Gateway
To access our LLMs through the Envoy AI Gateway, you need to be a member of a group with the LLM flag. Your membership info can be found on the namespaces page.
Start by creating a token. You can use this token to query the OpenAI-compatible LLM endpoint with curl or any OpenAI-API-compatible tool:
curl -H "Authorization: Bearer <your_token>" https://ellm.nrp-nautilus.io/v1/modelsPlease always leave empty or do not specify Max Output Tokens/max_tokens/max_output_tokens.
Examples
Python Code
To access the NRP LLMs, you can use the OpenAI Python client, as in the example below.
```python
import os

from openai import OpenAI

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

completion = client.chat.completions.create(
    model="gemma3",
    messages=[
        {"role": "system", "content": "Talk like a pirate."},
        {
            "role": "user",
            "content": "How do I check if a Python object is an instance of a class?",
        },
    ],
)

print(completion.choices[0].message.content)
```
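You can also stream responses token by token. Below is a minimal sketch, assuming the gateway passes through OpenAI-style streaming (`stream=True`); the model and prompt are arbitrary examples:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

# Request a streamed completion instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="gemma3",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of the generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```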
curl -H "Authorization: Bearer <TOKEN>" https://ellm.nrp-nautilus.io/v1/modelscurl -H "Authorization: Bearer <TOKEN>" -X POST "https://ellm.nrp-nautilus.io/v1/chat/completions" \-H "Content-Type: application/json" \-d '{ "model": "meta-llama/Llama-3.2-90B-Vision-Instruct", "messages": [ {"role": "user", "content": "Hey!"} ] }'Available Models
Available Models

- main: The model is generally supported, and you can report issues with the service. However, if the model is outdated with no apparent usage purpose, it may be removed or switched to the deprecated state when there is no major group or user usage. This is to provide our users with the best models within our limited allocation of GPUs.
- dep: The LLM is deprecated and is likely to go away soon. Please do not start using this model; it remains only for existing user groups who have a specific need for it.
- eval: The LLM was added for testing and we're evaluating its capabilities. The model may be unavailable at times, and its configuration may change without notice.
You can follow all updates and participate in the discussions in our Nautilus Machine Learning/Data Science Matrix channel (NRP Matrix.to, Matrix.to). Suggestions and decisions about new models are also made there.
| API Title | Status | Model | Features |
|---|---|---|---|
| qwen3 | main | Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 | Multimodal (vision, video), 262,144 tokens, 235B parameters, FP8 quantization, tool calling, Claude/Gemini-level frontier multimodal performance |
| gpt-oss | eval | openai/gpt-oss-120b | 131,072 tokens, tool calling, official MXFP4 quantization, frontier agentic performance |
| glm-4.6 | eval | QuantTrio/GLM-4.6-GPTQ-Int4-Int8Mix | 204,800 tokens, tool calling, GPTQ quantization, frontier agentic coding performance |
| minimax-m2 | eval | MiniMaxAI/MiniMax-M2 | 262,144 tokens, tool calling, official FP8 quantization, frontier agentic coding performance |
| glm-v | main | cpatonn/GLM-4.5V-AWQ-8bit | Multimodal (vision, video), 65,536 tokens, tool calling, AWQ 8-bit quantization, GPT-4o-level multimodal performance |
| gemma3 | main | google/gemma-3-27b-it | Multimodal (vision), 131,072 tokens, tool calling |
| embed-mistral | main | intfloat/e5-mistral-7b-instruct | Embeddings |
| gorilla | eval | gorilla-llm/gorilla-openfunctions-v2 | Function calling |
| test-gaudi3 | eval | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 70B parameters, 131,072 tokens, running on Intel Gaudi3 |
| olmo | eval | allenai/OLMo-2-0325-32B-Instruct | Open source |
| watt | eval | watt-ai/watt-tool-8B | Function calling |
| llama3-sdsc | dep | meta-llama/Llama-3.3-70B-Instruct | 8 languages, 131,072 tokens, tool calling |
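Several of the models above list tool calling as a feature. Below is a minimal sketch of an OpenAI-style function call through the gateway; the `get_weather` tool is purely hypothetical, and we assume `gemma3` accepts the standard `tools` parameter:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

# A hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="gemma3",
    messages=[{"role": "user", "content": "What is the weather in San Diego?"}],
    tools=tools,
)

message = completion.choices[0].message
if message.tool_calls:
    # The model requested a tool call; inspect the name and JSON arguments.
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```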
How Models are Added and Removed
Added: New NRP-managed models are added by the administrators based on user feedback, assessments of various benchmarks, and the community's response to the models. We take quantitative benchmarks into account (such as https://artificialanalysis.ai), but the ultimate decision rests on qualitative evidence (such as https://www.reddit.com/r/LocalLLaMA/) and discussions between administrators and users.
Removed: We remove models that are deemed sufficiently obsolete - for instance, when smaller models perform better all-round, or when another model covers the same use case in a clearly better way.
Deprecated: An exception is when research groups need a model for reproducibility in their research. In that case, we deprecate the model first and keep it up until such research concludes. If a model has been deprecated or pulled down, please reach out through the Nautilus Machine Learning/Data Science channel below so we can track the need.
However, we still aim to remove deprecated models as soon as possible: our GPU allocation for deployed LLMs is limited, and those GPUs should be diverted to more recent, better-performing models for the benefit of the whole NRP community. Administrators and researchers use these models for AI-assisted code development, where rotating in new and better models as they are released vastly increases individual productivity.
Larger models that require many GPUs are removed earlier and more strictly if their relative performance falls behind, while smaller, more efficient, or quantized models that need fewer GPUs are judged more leniently by this criterion.
These discussions take place in the Nautilus Machine Learning/Data Science (NRP Matrix.to, Matrix.to) channel.
Changelogs
Week of November 9th, 2025:
Added/Changed:
- qwen3 (Qwen/Qwen3-235B-A22B-Thinking-2507-FP8) has been changed to Qwen/Qwen3-VL-235B-A22B-Thinking-FP8. It has very similar characteristics, such as number of parameters, context size, and benchmarks, but adds state-of-the-art vision and video multimodal capabilities.
- glm-4.6 (QuantTrio/GLM-4.6-GPTQ-Int4-Int8Mix) is a widely popular programming LLM and exhibits a level of performance similar to Claude Sonnet 4 or GPT-5 models.
- minimax-m2 (MiniMaxAI/MiniMax-M2) is a widely popular programming LLM and exhibits a level of performance similar to Claude Sonnet 4 or GPT-5 models, while fitting the official FP8 parameters in four A100 GPUs with ample context length.
- gpt-oss (openai/gpt-oss-120b) is a very capable agentic model, adequate for general-purpose usage, while requiring only one A100 GPU or two RTX A6000 GPUs for full context thanks to sliding window attention and official MXFP4 quantization - a fraction of what other frontier models need. This is our candidate for an "LTS" model used for reproducible research, superseding the deprecated or removed Llama3 models.
- gemma3 was moved to 2x RTX A6000 GPUs instead of 2x A100 GPUs to conserve the latter. The model's sliding window attention allows the full context to fit in this configuration.
Removed:
- llama3 (meta-llama/Llama-3.2-90B-Vision-Instruct) has been officially pulled down: it consumed 4 A100 GPUs that can be used for much more frontier models, such as MiniMax-M2 or GLM-4.6, while performing much worse than models that fit in one GPU.
- deepseek-r1 (QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Medium) has been officially pulled down: it consumed 8 GPUs while being very slow (5-6 tokens/s) at any larger context size. Many similar models work well, although not necessarily better in every way, and are faster. This is an example of the "larger models that require a lot of GPUs are likely to be removed earlier" note above.
