[RFC] TensorRT Model Optimizer - Product Roadmap #146

omrialmog commented Mar 6, 2025

TensorRT Model Optimizer - Product Roadmap

TensorRT Model Optimizer aims to provide a unified library that enables developers to easily apply state-of-the-art model optimizations and achieve the best inference speed-ups. Model Optimizer will continuously enhance its existing features and leverage advanced capabilities to introduce new cutting-edge techniques, staying at the forefront of AI model optimization.

In striving for this, our roadmap and development follow these product strategies:

  1. Provide a one-stop shop for SOTA optimization methods (quantization, distillation, sparsity, pruning, speculation, etc.) with easy-to-use APIs that let developers chain different methods with reproducibility (a minimal sketch follows this list).
  2. Provide transparency and extensibility, making it easy for developers and researchers to innovate and contribute.
  3. Provide the best easy-to-use recipes in the ecosystem through software-hardware co-design on NVIDIA platforms. Since Model-Optimizer's launch, we’ve been delivering 50% to ~5x speedup on top of existing runtime and compiler optimizations on NVIDIA GPUs with minimal impact on model accuracy (Latest News).
  4. Tightly integrate into the Deep Learning inference and training ecosystem, beyond NVIDIA’s in-house stacks. Offer many-to-many optimizations by supporting popular frameworks like vLLM, SGLang, TensorRT-LLM, TensorRT, NeMo, and Hugging Face.
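
As a rough, version-dependent illustration of strategy 1, the sketch below applies one optimization mode with the modelopt PyTorch API and then saves/restores the optimization state for reproducibility. The toy model, config choice, and file path are placeholders; other modes (pruning, distillation, speculation) attach through their own modelopt sub-APIs and stack on top of the saved state.

```python
# Minimal sketch: quantize a toy model, then persist the Model-Optimizer state
# so the same optimization can be restored (or further chained) later.
import torch
import torch.nn as nn
import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
calib_batches = [torch.randn(8, 64) for _ in range(16)]

def forward_loop(m):
    # Feed representative data so activation ranges can be calibrated.
    for x in calib_batches:
        m(x)

# Step 1: apply a quantization mode (config name is a placeholder choice).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Step 2: save weights plus the modelopt optimization state.
mto.save(model, "optimized_model.pth")

# Later: rebuild the base architecture and restore the chained optimizations.
fresh = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
fresh = mto.restore(fresh, "optimized_model.pth")
```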

In the following sections, we outline our key investment areas and upcoming features. All are subject to change, and we'll update this document regularly. Our goal in sharing this roadmap is to increase visibility into Model-Optimizer's direction and upcoming features.

Community contributions are highly encouraged. If you're interested in contributing to specific features, we welcome questions and feedback in this thread and feature requests in GitHub Issues 😊.

Roadmap:

We'll do our best to provide visibility into our upcoming releases. Details are subject to change and this table is not comprehensive.

[Image: release roadmap table]

High level goals:

Quantization

  • Optimized FP4 Post Training Quantization (PTQ).
  • Expand data-type availability.
  • Expand advanced quantization techniques.
  • Expand hosted pre-optimized checkpoints on HuggingFace.

Training for Inference

  • Optimized FP4 Quantization Aware Training (QAT).
  • Improved token-efficient pruning and distillation techniques.
  • Improved speculative-decoding module training techniques.
  • Simplified flows for speculative decoding module training, distillation, and QAT.

ONNX/TRT

Platform Support & Ecosystem

  • Open sourced for all developers with improved extensibility, transparency, debuggability and accessibility.
  • Ready-to-deploy optimized checkpoints, for ease of use and for resource-limited developers.
  • Expanded support for TRT-LLM, vLLM and SGLang.
  • In-framework deployment for quick prototyping.
  • Continuous support for new and upcoming models.

Expanded Details:

1. FP4 inference on NVIDIA Blackwell

The NVIDIA Blackwell platform powers a new era of computing with FP4 AI inference capabilities. Model-Optimizer has provided initial FP4 recipes and quantization techniques, and will continue to improve FP4 with advanced techniques:

  1. For the majority of developers, Model-Optimizer offers Post Training Quantization (PTQ) (weight-and-activation or weight-only) and our proprietary AutoQuantize for FP4 inference. AutoQuantize automates per-layer quantization format selection to minimize model accuracy loss.
  2. For developers who require lossless FP4 quantization, Model-Optimizer offers Quantization Aware Training (QAT), which makes the neural network more resilient to quantization. Model-Optimizer QAT already works with NVIDIA Megatron, NVIDIA NeMo, native PyTorch training, and the Hugging Face Trainer. A hedged sketch of both paths follows this list.
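
Below is a hedged sketch of the two paths using the modelopt PyTorch quantization API. The checkpoint name and calibration prompts are placeholders, and `NVFP4_DEFAULT_CFG` is an assumed FP4 config name from recent releases; the sketch falls back to `FP8_DEFAULT_CFG` when it is not available.

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

def calib_loop(m):
    # Use a few hundred representative prompts in practice; two placeholders here.
    for text in ["The quick brown fox", "TensorRT Model Optimizer roadmap"]:
        m(**tokenizer(text, return_tensors="pt").to(m.device))

# Path 1: FP4 PTQ (falls back to FP8 if the FP4 config is not exposed).
cfg = getattr(mtq, "NVFP4_DEFAULT_CFG", mtq.FP8_DEFAULT_CFG)
model = mtq.quantize(model, cfg, calib_loop)

# Path 2: QAT. The same quantize() call inserts the quantizers; the model is
# then fine-tuned as usual (native PyTorch loop, Hugging Face Trainer, NeMo,
# or Megatron) so the weights adapt to the quantization noise, e.g.:
#   trainer = Trainer(model=model, args=training_args, train_dataset=...)
#   trainer.train()
```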

2. Model optimization techniques

2.1 Model compression algorithms

Model-Optimizer collaborates with NVIDIA and external research labs to continuously develop and integrate state-of-the-art techniques into our library for faster inference. Our recent focus areas include:

  • Advanced PTQ methods (e.g., SVDQuant, QuaRot, SpinQuant)
  • QAT with distillation, a proven path for FP4 inference
  • Attention sparsity (e.g., SnapKV, DuoAttention)
  • AutoQuantize improvements (e.g., support for more fine-grained format selection and various weight and activation combinations)
  • New token-efficient pruning and distillation methods
  • Infrastructure to support general rotation and smoothing

2.2 Optimized techniques for LLM and VLM

Model-Optimizer works with TensorRT-LLM, vLLM, and SGLang to streamline optimized model deployment. This includes an expanded focus on model optimizations that require fine-tuning. To streamline that experience, Model-Optimizer is working with Hugging Face, NVIDIA NeMo, and Megatron-LM to deliver an exceptional end-to-end solution for these optimizations. Our focus areas include (a hedged speculative-decoding sketch follows the list):

  • (Speculation) Integrated draft models: Medusa, ReDrafter, MTP, and EAGLE.
  • (Speculation/Distillation) Standalone draft-model training through pruning and knowledge distillation.
  • (Distillation) Standalone model shrinking/compression through pruning and knowledge distillation (e.g., Llama-3.2 1B/3B).
  • (Quantization) Quantization-aware training with support for FP8 and FP4.
  • Out-of-the-box deployment with trtllm-serve, NVIDIA NIM, and vLLM serve.
  • Hosting pre-optimized checkpoints for popular models such as DeepSeek-R1, Llama-3.1, Llama-3.3 and Nemotron family on Hugging Face Model-Optimizer collection.
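
As a hedged illustration of the integrated-draft-model bullet above, the sketch below attaches Medusa heads to a Hugging Face model via modelopt's speculative-decoding module. The checkpoint name, mode string, and config keys are assumptions based on the public examples and may differ across versions.

```python
import modelopt.torch.speculative as mtsp
from transformers import AutoModelForCausalLM

# Placeholder base model; use the checkpoint you intend to fine-tune.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Attach Medusa draft heads (mode name and config keys are assumptions).
medusa_config = {"medusa_num_heads": 2, "medusa_num_layers": 1}
model = mtsp.convert(model, [("medusa", medusa_config)])

# The converted model is then trained or distilled (e.g., with the Hugging
# Face Trainer or NeMo) so the draft heads learn to propose future tokens,
# and can afterwards be exported for serving on TensorRT-LLM, vLLM, or SGLang.
```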

2.3 Optimized techniques for diffusers

Model-Optimizer will continue to accelerate image-generation inference by investing in these areas (a minimal quantization sketch follows the list):

  • Quantization: Expand model support for INT8/FP8/FP4 PTQ and QAT (e.g., the FLUX model series).
  • Caching: Add more training-free and lightweight finetuning-based caching techniques with user-friendly APIs (previous work: Cache Diffusion).
  • Improve ease of use of the deployment pipelines, including adding multi-GPU support.
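
For instance, FP8 PTQ of a diffusion backbone typically looks roughly like the sketch below: only the denoiser is quantized while the rest of the pipeline stays in high precision. The pipeline name, prompt, and step count are placeholders; FLUX models follow the same pattern with `pipe.transformer` as the backbone.

```python
import torch
import modelopt.torch.quantization as mtq
from diffusers import DiffusionPipeline

# Placeholder pipeline; FLUX works similarly but uses pipe.transformer.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
backbone = pipe.unet

def forward_loop(_):
    # Calibrate by running a few denoising passes through the full pipeline.
    for prompt in ["a photo of an astronaut riding a horse"]:
        pipe(prompt, num_inference_steps=4)

# Quantize only the denoiser backbone with a built-in FP8 PTQ config.
mtq.quantize(backbone, mtq.FP8_DEFAULT_CFG, forward_loop)
```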

3. Developer Productivity

3.1 Open-sourcing

To provide extensibility and transparency for everyone, Model-Optimizer is now open source! Paired with continued documentation and code additions that improve extensibility and usability, Model-Optimizer will keep a strong focus on enabling our community to extend and contribute for their own use cases. This enables developers, for example, to experiment with custom calibration algorithms or contribute the latest techniques. Users can also self-serve to add model support or non-standard data types, and benefit from improved debuggability and accessibility.

3.2 Ready-to-deploy optimized checkpoints

For developers who have limited GPU resources for optimizing large models, or who prefer to skip the optimization steps, we currently offer quantized checkpoints of popular models in the Hugging Face Model Optimizer collection. Developers can deploy these optimized checkpoints directly on TensorRT-LLM, vLLM, and SGLang (depending on the checkpoint). We have published FP8/FP4/Medusa checkpoints for the Llama model family and an FP4 checkpoint for DeepSeek-R1. In the near future we are working to add optimized FLUX and other diffusion checkpoints, Medusa-trained checkpoints, EAGLE-trained checkpoints, and more.
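
For example, serving one of these pre-quantized checkpoints with vLLM can look roughly like the sketch below; the repository name is illustrative, and the `quantization="modelopt"` argument assumes a vLLM build that recognizes ModelOpt-quantized checkpoints.

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint name; substitute the exact repository from the
# Hugging Face Model Optimizer collection that you intend to deploy.
llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",
    quantization="modelopt",  # assumes ModelOpt checkpoint support in your vLLM build
)

outputs = llm.generate(
    ["Summarize the TensorRT Model Optimizer roadmap in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```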

4. Choice of Deployment

4.1 Popular Community Frameworks

To offer greater flexibility, we’ve been investing in supporting popular inference and serving frameworks like vLLM and SGLang, in addition to seamless integration with the NVIDIA AI software ecosystem. We currently provide an initial workflow for vLLM deployment and an example for deploying the Unified Hugging Face Checkpoint format, with more model support planned.
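
As a rough sketch of that workflow, a model quantized in-framework can be exported in the unified checkpoint format and then pointed at by the serving framework of your choice. `export_hf_checkpoint` is the export entry point in recent modelopt releases, but the exact import path and serving commands below are assumptions to verify against the docs.

```python
from modelopt.torch.export import export_hf_checkpoint

# `model` is a Hugging Face causal LM already quantized with mtq.quantize(...)
# as in the earlier sketches.
export_hf_checkpoint(model, export_dir="./llama3-fp8-hf")

# The exported directory can then be served by frameworks that understand the
# unified checkpoint format, for example:
#   trtllm-serve ./llama3-fp8-hf
#   vllm serve ./llama3-fp8-hf --quantization modelopt
```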

4.2 In-Framework Deployment

We have enabled and released a path for deployment within native PyTorch. This decouples model build/compile from runtime and offers several benefits:

  1. When optimizing inference performance or exploring new model compression techniques, Model-Optimizer users can quickly prototype in the PyTorch runtime using native PyTorch APIs to evaluate performance gains. Once satisfied, they can transition to the TensorRT-LLM runtime as the final step to maximize performance.
  2. For models not yet supported by TensorRT-LLM or applications that do not need ultra-fast inference speeds, users can get out-of-the-box performance improvements within native PyTorch.

Developers can utilize AutoDeploy or Real Quantization for these in-framework deployments.
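
As a small illustration of point 1 above, a quantized model can be exercised directly in native PyTorch to sanity-check outputs before committing to a TensorRT-LLM build; the snippet below is illustrative and assumes `model` and `tokenizer` come from an in-framework quantization step like the earlier sketches.

```python
import torch

# `model` and `tokenizer` are a Hugging Face causal LM and its tokenizer, with
# the model already quantized in-framework via mtq.quantize(...).
model.eval()

prompt = "Explain speculative decoding in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Compare this output (and task metrics) against the unquantized baseline to
# decide whether to move to a TensorRT-LLM deployment for maximum speed.
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```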

5. Expand Support Matrix

5.1 Data types

Alongside our existing supported dtypes, we’ve recently added MXFP4 support and will soon expand to emerging popular dtypes like FP6 and sub-4-bit. Our focus is to further speed up GenAI inference with the least possible impact on model fidelity.

5.2 Model Support

We strive to streamline our techniques to minimize the time from a new model or feature to an optimized model, giving our community the shortest possible time to deploy. We’ll continue to expand LLM/Diffusion model support, invest more in LLMs with multi-modality (vision, video, audio, image generation, and action), and continuously expand our model support based on community interests.

5.3 Platform & Other Support

Model-Optimizer's explicit quantization will be part of the upcoming NVIDIA DriveOS releases. We recently added an end-to-end BEVFormer INT8 example in NVIDIA DL4AGX, with more model support coming soon for Automotive customers. Model-Optimizer also has planned support for ONNX FP4 for DRIVE Thor.

In Q4 2024, Model-Optimizer added formal support for Windows (see Model-Optimizer-Windows), targeting Windows RTX PC systems with tight integration with the Windows ecosystem, such as torch.onnx.export, HuggingFace-Optimum, GenAI, and Olive. It currently supports quantization methods such as INT4 AWQ, INT8, and FP8, and we’ll expand to more techniques suitable for Windows.
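
As a hedged example, ONNX PTQ on Windows is typically driven through modelopt's ONNX quantization entry point roughly as below; the parameter names are assumptions based on the public examples, so check the Model-Optimizer-Windows documentation for your installed version.

```python
# Hedged sketch of ONNX INT8 PTQ with modelopt; parameter names are assumptions
# and may differ across releases.
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model.onnx",          # model exported e.g. via torch.onnx.export
    quantize_mode="int8",            # or "fp8" / "int4_awq" where supported
    output_path="model.quant.onnx",  # deploy via ONNX Runtime or TensorRT
)
# Supply representative calibration data for activation quantization (see the
# Model-Optimizer-Windows docs); otherwise default calibration is used.
```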
