NVIDIA Models

Explore the NVIDIA language and embedding models available through our OpenAI Assistants API-compatible service.


NVIDIA: Llama 3.3 Nemotron Super 49B V1.5

Context Length:
131,072 tokens
Architecture:
text->text

Pricing:

Prompt: $0.0000001
Completion: $0.0000004
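
At these per-token rates, the cost of a request is simple arithmetic. A minimal sketch (the token counts are illustrative, not from the source):

```python
# Per-token prices for Llama 3.3 Nemotron Super 49B V1.5, from the listing above.
PROMPT_PRICE = 0.0000001      # USD per prompt token
COMPLETION_PRICE = 0.0000004  # USD per completion token

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of one request at the listed per-token rates."""
    return prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE

# Example: a 1,000-token prompt with a 500-token completion.
cost = estimate_cost(1000, 500)
print(f"${cost:.4f}")  # → $0.0003
```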

Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context. It is post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and multi-turn chat, followed by multiple RL stages: Reward-aware Preference Optimization (RPO) for alignment, RL with Verifiable Rewards (RLVR) for step-wise reasoning, and iterative DPO to refine tool-use behavior. A distillation-driven Neural Architecture Search (“Puzzle”) replaces some attention blocks and varies FFN widths to shrink the memory footprint and improve throughput, enabling single-GPU (H100/H200) deployment while preserving instruction following and CoT quality.

In internal evaluations (NeMo-Skills, up to 16 runs, temp = 0.6, top_p = 0.95), the model reports strong reasoning/coding results, e.g., MATH500 pass@1 = 97.4, AIME-2024 = 87.5, AIME-2025 = 82.71, GPQA = 71.97, LiveCodeBench (24.10–25.02) = 73.58, and MMLU-Pro (CoT) = 79.53. The model targets practical inference efficiency (high tokens/s, reduced VRAM) with Transformers/vLLM support and explicit “reasoning on/off” modes (chat-first defaults, greedy recommended when disabled). Suitable for building agents, assistants, and long-context retrieval systems where balanced accuracy-to-cost and reliable tool use matter.
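
Because the service is OpenAI-compatible, a request to this model is an ordinary chat-completions payload. A minimal sketch: the model identifier is an assumption (check your provider's model list), and the sampling values simply mirror the evaluation settings quoted above (temp = 0.6, top_p = 0.95):

```python
import json

# Hypothetical model identifier -- consult your provider's model list.
MODEL = "nvidia/llama-3.3-nemotron-super-49b-v1.5"

payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Prove that the sum of two even numbers is even."},
    ],
    # Sampling settings mirroring the internal-evaluation setup quoted above.
    "temperature": 0.6,
    "top_p": 0.95,
}

# POST this JSON to your provider's /v1/chat/completions endpoint.
print(json.dumps(payload, indent=2))
```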

NVIDIA: Nemotron Nano 9B V2 (free)

Context Length:
128,000 tokens
Architecture:
text->text

Pricing:

Prompt: $0
Completion: $0
NVIDIA-Nemotron-Nano-9B-v2 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response.

The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so.
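
The toggle described above can be expressed as two otherwise-identical payloads that differ only in the system prompt. A minimal sketch: the model identifier and the exact control directives ("/think", "/no_think") are assumptions, so consult the model card for the supported syntax:

```python
# Hypothetical model identifier -- consult your provider's model list.
MODEL = "nvidia/nemotron-nano-9b-v2:free"

def build_payload(question: str, reasoning: bool) -> dict:
    """Chat-completions payload with an (assumed) reasoning-toggle directive."""
    directive = "/think" if reasoning else "/no_think"
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": directive},
            {"role": "user", "content": question},
        ],
    }

with_trace = build_payload("What is 17 * 24?", reasoning=True)
direct = build_payload("What is 17 * 24?", reasoning=False)
print(with_trace["messages"][0]["content"])  # /think
print(direct["messages"][0]["content"])      # /no_think
```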

NVIDIA: Nemotron Nano 9B V2

Context Length:
131,072 tokens
Architecture:
text->text

Pricing:

Prompt: $0.00000004
Completion: $0.00000016

NVIDIA-Nemotron-Nano-9B-v2 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response.

The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so.

NVIDIA: Llama 3.1 Nemotron Ultra 253B v1

Context Length:
131,072 tokens
Architecture:
text->text

Pricing:

Prompt: $0.0000006
Completion: $0.0000018

Llama-3.1-Nemotron-Ultra-253B-v1 is a large language model (LLM) optimized for advanced reasoning, human-interactive chat, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta’s Llama-3.1-405B-Instruct, it has been significantly customized using Neural Architecture Search (NAS), resulting in enhanced efficiency, reduced memory usage, and improved inference latency. The model supports a context length of up to 128K tokens and can operate efficiently on an 8x NVIDIA H100 node.

Note: you must include the string "detailed thinking on" in the system prompt to enable reasoning. See the Usage Recommendations for more.
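
The note above can be applied literally: the reasoning switch is the system-prompt string itself. A minimal sketch in which only "detailed thinking on" comes from the note; the model identifier and the "detailed thinking off" variant are assumptions:

```python
MODEL = "nvidia/llama-3.1-nemotron-ultra-253b-v1"  # hypothetical identifier

def build_payload(question: str, reasoning: bool) -> dict:
    """Reasoning is enabled by the literal system prompt 'detailed thinking on'."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    }

payload = build_payload("Summarize the key risks in this contract.", reasoning=True)
print(payload["messages"][0]["content"])  # detailed thinking on
```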

NVIDIA: Llama 3.1 Nemotron 70B Instruct

Context Length:
131,072 tokens
Architecture:
text->text
Max Output:
16,384 tokens

Pricing:

Prompt: $0.0000006
Completion: $0.0000006

NVIDIA's Llama 3.1 Nemotron 70B is a language model designed to generate precise and useful responses. Built on the Llama 3.1 70B architecture and trained with Reinforcement Learning from Human Feedback (RLHF), it excels on automatic alignment benchmarks. The model is tailored for applications requiring highly accurate, helpful response generation across diverse user queries and domains.

Usage of this model is subject to Meta's Acceptable Use Policy.

Ready to build with NVIDIA?

Start using these powerful models in your applications with our flexible pricing plans.