Optimization Guide¶
This guide collects runtime optimization options for ONNX-based models in Model Service. Use it after your model is already exported to ONNX, uploaded to MLflow, and wired into the Python entrypoint.
When to Use This Guide¶
Use these settings when you want to improve inference latency, throughput, or GPU utilization without changing the model architecture.
Typical cases:
- TensorRT acceleration on NVIDIA GPUs.
- CUDA or CPU execution provider selection.
- Mixed precision such as `float16`.
- Batch size and queue tuning.
- External weight or model loading behavior during startup.
TensorRT Execution Provider¶
TensorRT is usually the first optimization to try on NVIDIA hardware. It can reduce latency and improve throughput by building optimized inference kernels for the exported ONNX graph.
How It Works¶
- Model Compilation: On first run, TensorRT analyzes your ONNX graph and builds an optimized engine for your specific GPU model and batch size.
- Engine Caching: The built engine can be cached to disk, so subsequent restarts skip the compilation step.
- Inference: TensorRT uses highly tuned kernels for inference, often 2–5x faster than generic ONNX Runtime.
Setup in reconfigure()¶
Add TensorRT options to the session provider list in your model's reconfigure method:
```python
def reconfigure(self, config: Config) -> None:
    import os

    import onnxruntime as ort

    # Create the cache directory for TensorRT engines
    cache_path = config["trt_cache_path"]
    os.makedirs(cache_path, exist_ok=True)

    # Define batch profile shapes for TensorRT optimization
    min_shape = f"input:1x3x{self.tile_size}x{self.tile_size}"
    opt_shape = f"input:{config['max_batch_size']}x3x{self.tile_size}x{self.tile_size}"
    max_shape = f"input:{config['max_batch_size']}x3x{self.tile_size}x{self.tile_size}"

    # TensorRT options
    trt_options = {
        "device_id": 0,                          # GPU device index
        "trt_fp16_enable": True,                 # Use float16 for faster inference
        "trt_engine_cache_enable": True,         # Cache compiled engines to disk
        "trt_engine_cache_path": cache_path,     # Where to store cached engines
        "trt_timing_cache_enable": True,         # Cache kernel timing info for faster rebuilds
        "trt_max_workspace_size": 8 * 1024**3,   # 8 GB workspace for kernel search (the 1 GB default is often too small)
        "trt_builder_optimization_level": 1,     # 1 = fast build, 5 = slow but highly optimized
        "trt_profile_min_shapes": min_shape,     # Minimum input shape for batching
        "trt_profile_max_shapes": max_shape,     # Maximum input shape for batching
        "trt_profile_opt_shapes": opt_shape,     # Expected typical shape for optimization
    }

    # Create the ONNX Runtime session with TensorRT first
    self.session = ort.InferenceSession(
        provider_path,  # Path to the ONNX model from MLflow
        providers=[
            ("TensorrtExecutionProvider", trt_options),  # Try TensorRT first
            "CUDAExecutionProvider",                     # Fall back to CUDA
            "CPUExecutionProvider",                      # Final fallback to CPU
        ],
    )
```
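After the first request has been served, you can sanity-check that engines actually landed in the cache directory. A minimal sketch (the helper name, and the assumption that TensorRT writes its engine files directly into `cache_path`, are illustrative, not part of the service code):

```python
from pathlib import Path


def cached_engine_files(cache_path: str) -> list[str]:
    """List the files TensorRT has written into the engine cache directory."""
    root = Path(cache_path)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_file())
```

An empty result after a warm-up request usually means `trt_engine_cache_enable` is off or the cache path is not writable.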
Key Parameters¶
| Parameter | Purpose | Default | When to Adjust |
|---|---|---|---|
| `trt_fp16_enable` | Use float16 precision | `False` | Set to `True` for faster inference on Tensor Cores (if the model tolerates it) |
| `trt_engine_cache_enable` | Save compiled engines | `False` | Set to `True` to avoid rebuilding on container restart |
| `trt_engine_cache_path` | Directory for cached engines | N/A | Point to a persistent volume (e.g., an emptyDir or PVC in Helm) |
| `trt_max_workspace_size` | GPU memory for kernel search | 1 GB | Increase to 4–8 GB for high-res models to find better kernels |
| `trt_builder_optimization_level` | Build speed vs. engine quality | 1 | Use 1 for fast development builds, 3–5 for production |
| `trt_profile_*_shapes` | Batch size range | required | Must match your `max_batch_size` from config |
Where to Put the Cache Path in Helm¶
Store the cache path in a shared location that persists across pod restarts.
Option 1: Use an emptyDir volume (data lost on pod restart)
In your application YAML in helm/rayservice/applications/, set the environment variable:
```yaml
deployments:
  - name: BinaryClassifier
    ray_actor_options:
      runtime_env:
        env_vars:
          TRT_CACHE_PATH: /data/trt-cache
```
Then add volume mounting to your worker group in helm/rayservice/values.yaml:
```yaml
rayClusterConfig:
  workerGroupSpecs:
    - groupName: gpu-group
      rayStartParams:
        num-gpus: "1"
      template:
        spec:
          containers:
            - name: ray-worker
              volumeMounts:
                - name: trt-cache
                  mountPath: /data/trt-cache
          volumes:
            - name: trt-cache
              emptyDir: {}
```
Option 2: Use a PersistentVolumeClaim (data persists across restarts)
For production, use a PVC to ensure TensorRT engines persist:
```yaml
# In the worker group template
volumeMounts:
  - name: trt-cache
    mountPath: /data/trt-cache
volumes:
  - name: trt-cache
    persistentVolumeClaim:
      claimName: trt-cache-pvc
```
Create the PVC separately (or reference an existing one):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trt-cache-pvc
  namespace: rationai-jobs-ns
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```
Once a TensorRT engine is built and cached, subsequent pod restarts will reuse it, avoiding the 1–5 minute compilation overhead.
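Whichever option you choose, the Python entrypoint can read the mount path from the `TRT_CACHE_PATH` environment variable set above. A minimal sketch (the helper name and the fallback default `/tmp/trt-cache` are assumptions, not part of the chart):

```python
import os


def resolve_trt_cache_path(default: str = "/tmp/trt-cache") -> str:
    """Read the TensorRT cache directory from the TRT_CACHE_PATH env var.

    Falls back to `default` and ensures the directory exists.
    """
    cache_path = os.environ.get("TRT_CACHE_PATH", default)
    os.makedirs(cache_path, exist_ok=True)
    return cache_path
```

The returned path can then be passed to `trt_engine_cache_path` in the provider options.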
First Run Overhead¶
On the first deployment, or after the cache has been cleared, TensorRT takes 1–5 minutes to compile the engine, depending on model size and trt_builder_optimization_level. Plan for this during initial tests.
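To quantify this overhead in your own tests, you can time session construction. A generic sketch (the `build_session` callable is a stand-in for your actual `ort.InferenceSession(...)` call, which is not reproduced here):

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(build_session: Callable[[], T]) -> Tuple[T, float]:
    """Run `build_session` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    session = build_session()
    elapsed = time.perf_counter() - start
    return session, elapsed


# Usage (hypothetical):
# session, seconds = timed(lambda: ort.InferenceSession(model_path, providers=providers))
# print(f"Engine build/load took {seconds:.1f}s")
```

Comparing the first (cold) and second (cached) runs shows whether the engine cache is being hit.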
Fallback Order¶
Typical provider order:
```python
providers=[
    ("TensorrtExecutionProvider", trt_options),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
```
Notes:
- Put TensorRT first so ONNX Runtime prefers it when available.
- Keep CUDA and CPU providers as fallbacks.
- If TensorRT is not available (e.g., on CPU-only nodes), ONNX Runtime automatically falls back to CUDA, then CPU.
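After creating the session, `session.get_providers()` returns the providers that are actually active, so you can log which backend will serve inference. A sketch (the helper and its return values are illustrative):

```python
def active_backend(active_providers: list[str]) -> str:
    """Return a short label for the highest-priority active provider.

    `active_providers` is the list returned by session.get_providers().
    """
    if "TensorrtExecutionProvider" in active_providers:
        return "tensorrt"
    if "CUDAExecutionProvider" in active_providers:
        return "cuda"
    return "cpu"


# Usage (hypothetical):
# logger.info("Inference backend: %s", active_backend(self.session.get_providers()))
```

Logging this at startup makes silent fallbacks (e.g., a missing TensorRT library) easy to spot.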
Precision and Dtype¶
Many models can run faster with reduced precision, especially on GPUs.
Where Precision is Decided¶
Precision is determined in three places, in this order:
- ONNX Export (highest priority): If you export the model as `float16` or quantized, the artifact inherits that precision.
- Runtime Execution Provider (middle priority): TensorRT's `trt_fp16_enable: True` tells TensorRT to use float16 kernels for inference.
- Input Data (lowest priority): The dtype of the input tensor passed to `session.run()`.
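The last point matters in practice: the input dtype must match what the exported graph expects. ONNX Runtime reports the expected type as a string such as `tensor(float)` or `tensor(float16)` via `session.get_inputs()[0].type`; a small helper can map that to a NumPy dtype (a sketch covering only the common float types):

```python
import numpy as np

# Mapping of ONNX Runtime type strings to NumPy dtypes (common float types only)
_ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(double)": np.float64,
}


def expected_dtype(ort_type: str) -> np.dtype:
    """Map an ONNX Runtime input type string to a NumPy dtype."""
    try:
        return np.dtype(_ORT_TO_NUMPY[ort_type])
    except KeyError:
        raise ValueError(f"Unsupported input type: {ort_type}")


# Usage (hypothetical):
# dtype = expected_dtype(self.session.get_inputs()[0].type)
# batch = batch.astype(dtype, copy=False)
```

Casting with `copy=False` avoids an extra allocation when the batch already has the right dtype.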
Common Options¶
- float32 (default): Maximum compatibility, typically required for accuracy-sensitive models.
- float16: ~2x speedup on Tensor Cores, acceptable for most image models, often imperceptible loss of accuracy.
- int8: Quantized inference, requires calibration, best 3–4x speedup, but model must be trained/exported for it.
- Mixed precision: Export key layers as float16, others as float32. Requires custom export logic.
How to Use¶
For most models, the simplest route is to enable float16 at the TensorRT level with `trt_fp16_enable: True` in the provider options.
If the model was exported as float16 in ONNX:
```python
import onnx

model = onnx.load("model_fp16.onnx")
# The model graph already uses float16 initializers and ops
```
If you need float32 inference despite the export, ensure input is cast:
```python
@serve.batch
async def predict(self, images: list[NDArray]) -> list[float]:
    batch = np.stack(images, axis=0, dtype=np.float32)  # Force float32
    outputs = self.session.run([self.output_name], {self.input_name: batch})
    return outputs[0].flatten().tolist()
```
Testing Precision¶
Start with float32 to confirm the model runs correctly, then enable float16 and compare outputs:
- Benchmark inference time with float32.
- Enable `trt_fp16_enable: True` and re-benchmark.
- Measure output differences (should be <1% for most models).
- If the differences are acceptable, keep float16 enabled.
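The comparison step can be made concrete with a relative-difference check between the float32 and float16 outputs. A NumPy-only sketch (the 1% threshold mirrors the guideline above):

```python
import numpy as np


def relative_difference(ref: np.ndarray, test: np.ndarray) -> float:
    """Mean relative difference of `test` against reference outputs `ref`."""
    ref = ref.astype(np.float64)
    test = test.astype(np.float64)
    denom = np.maximum(np.abs(ref), 1e-12)  # avoid division by zero
    return float(np.mean(np.abs(ref - test) / denom))


# Usage (hypothetical):
# fp32_out = fp32_session.run([name], {inp: batch.astype(np.float32)})[0]
# fp16_out = fp16_session.run([name], {inp: batch})[0]
# assert relative_difference(fp32_out, fp16_out) < 0.01  # <1% target
```

For classification heads, also confirm that the argmax labels agree, not just the raw scores.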
Batch and Queue Tuning¶
Ray Serve batching is often the simplest throughput optimization. When many HTTP requests arrive, Ray Serve collects them into one batch and passes them together to the @serve.batch decorated method.
Where to Set It¶
In the model's reconfigure() method, right after creating the ONNX Runtime session:
```python
def reconfigure(self, config: Config) -> None:
    # ... load session ...
    self.predict.set_max_batch_size(config["max_batch_size"])
    self.predict.set_batch_wait_timeout_s(config["batch_wait_timeout_s"])
```
These values come from the Helm config.
Parameters¶
| Parameter | Meaning | Example | Trade-off |
|---|---|---|---|
| `max_batch_size` | Maximum requests to batch together before forcing inference | 16 | Larger = higher throughput, higher latency |
| `batch_wait_timeout_s` | How long to wait for more requests before running a smaller batch | 0.1 | Longer = better batching, higher latency |
Tuning Strategy¶
For low-latency workloads:
- max_batch_size: 2–4
- batch_wait_timeout_s: 0.01 (10ms)
- Trade: Each inference runs on fewer samples, but requests are processed faster.
For high-throughput workloads:
- max_batch_size: 32–64
- batch_wait_timeout_s: 0.2–0.5 (200–500ms)
- Trade: Larger batches maximize GPU utilization and throughput, but single requests may wait longer.
Rule of thumb:
- Monitor GPU utilization: if it stays below 50%, increase the batch size or timeout.
- If latency exceeds what is acceptable (e.g., >200 ms), reduce the batch size and timeout.
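These trade-offs can be roughed out before load testing with a back-of-envelope model: a request may wait the full batching timeout and then for the whole batch to be processed, while throughput is bounded by one full batch per inference call. A simplified sketch for a single replica (it ignores queueing beyond one batch):

```python
def batching_estimates(max_batch_size: int,
                       batch_wait_timeout_s: float,
                       batch_inference_s: float) -> dict[str, float]:
    """Back-of-envelope latency/throughput estimates for one replica."""
    return {
        # Worst case: wait the full timeout, then for the whole batch
        "worst_case_latency_s": batch_wait_timeout_s + batch_inference_s,
        # Steady-state upper bound: one full batch per inference call
        "max_throughput_rps": max_batch_size / batch_inference_s,
    }


# Example: max_batch_size=16, 100 ms timeout, 50 ms per batch
# -> ~150 ms worst-case latency, ~320 requests/s upper bound
```

Measured numbers under real load will be lower, but the estimate is useful for picking a starting configuration.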
Memory and Batch Size¶
Memory required ≈ (model size) + (batch size × input size × 2–3 for computations).
Example: a 500 MB model with a batch of 16 images at 512×512×3 (uint8):
- Model: 500 MB
- Input batch: 16 × 512 × 512 × 3 bytes ≈ 12 MB
- Activations and intermediates: ~50–100 MB
- Total: ~600 MB
Allocate accordingly in Helm ray_actor_options.memory.
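The formula above can be scripted when sizing `ray_actor_options.memory`. A sketch (the 2–3× activation multiplier is the heuristic from the formula, not a measured value, and the default arguments are assumptions):

```python
def estimate_memory_bytes(model_bytes: int,
                          batch_size: int,
                          height: int,
                          width: int,
                          channels: int = 3,
                          bytes_per_value: int = 1,
                          activation_factor: float = 3.0) -> int:
    """Rough memory footprint: model size plus the input batch scaled
    by a factor covering activations and intermediate buffers."""
    input_bytes = batch_size * height * width * channels * bytes_per_value
    return int(model_bytes + input_bytes * activation_factor)


# Example from above: 500 MB model, batch of 16 uint8 512x512x3 tiles
# input_bytes = 16 * 512 * 512 * 3 bytes (about 12 MB)
```

Add headroom on top of the estimate; allocators and the runtime itself also consume memory.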
Startup and Artifact Loading¶
If model startup is slow, check whether the artifact is large or split into multiple files.
For large exports, see Troubleshooting: Large ONNX Model Export.
If the model is loaded from MLflow, keep the artifact structure simple and ensure the runtime can access the full set of files during replica initialization.
Session Options¶
ONNX Runtime session options control graph optimization and threading behavior.
Key Options¶
In reconfigure(), configure session options before creating the session:
```python
import onnxruntime as ort

sess_options = ort.SessionOptions()

# Threading
sess_options.intra_op_num_threads = 2  # Threads per operation (e.g., matrix multiply)
sess_options.inter_op_num_threads = 1  # Threads for running different ops in parallel;
                                       # set to 1 when using external batching via @serve.batch

# Graph optimization
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Execution mode: sequential has less overhead for small batches
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

self.session = ort.InferenceSession(
    provider_path,
    sess_options=sess_options,
    providers=[...],
)
```
Threading Strategy¶
When using @serve.batch (external batching by Ray):
- `intra_op_num_threads = 2–4`: Allow individual ops to use multiple threads.
- `inter_op_num_threads = 1`: Avoid parallelizing different ops (Ray handles parallelism via replicas).
When using in-graph batching or single-request mode:
- `intra_op_num_threads = cpu_count / 2`: Let individual ops use more threads.
- `inter_op_num_threads = 1–2`: Limited inter-op parallelism is usually better than full parallelism.
Graph Optimization¶
Always use ORT_ENABLE_ALL to enable:
- Constant folding (pre-compute static values).
- Node fusion (combine multiple ops into one kernel).
- Layout optimization (reorder data for cache efficiency).
This has no runtime cost and can significantly speed up inference.