Troubleshooting¶
This page lists the most common issues when deploying and running models in Model Service (Ray Serve on KubeRay).
Quick Triage Checklist¶
Start here before digging deeper:
Use your RayService name from the Helm install (preferably a dedicated test release, e.g. rayservice-model-<test>) in the commands below.
kubectl get rayservice -n rationai-jobs-ns
kubectl describe rayservice <release-name> -n rationai-jobs-ns
kubectl get pods -n rationai-jobs-ns
Then inspect logs:
kubectl logs -n rationai-jobs-ns -l ray.io/node-type=head --tail=200
kubectl logs -n rationai-jobs-ns -l ray.io/node-type=worker --tail=200
Check Rancher (cluster events)¶
If a server request is timing out or you see resources exhausted, the Rancher cluster UI often contains useful cluster-level events and node/pod status that explain the root cause. Visit the RayService explorer for the rationai-jobs-ns namespace:
Rancher RayService Explorer — rationai-jobs-ns / rayservice-model
Look for events, node capacity, and pod scheduling failures — these are frequently the reason Serve endpoints drop or time out.
Example:
virchow2:
  serveDeploymentStatuses:
    Virchow2:
      message: >-
        Deployment 'Virchow2' in application 'virchow2' has 1 replicas
        that have taken more than 30s to be scheduled. This may be due to
        waiting for the cluster to auto-scale or for a runtime environment
        to be installed. Resources required for each replica: {"CPU": 4.0,
        "GPU": 1.0, "memory": 8589934592}, total resources available:
        {"memory": 4294967296.0}. Use `ray status` for more details.
      status: UPSCALING
  status: RUNNING
The application stays in UPSCALING until the required resources become available.
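The `ray status` output that the message refers to can be pulled directly from the head pod. A sketch using the same label selector as the triage checklist above:

```shell
# find the Ray head pod and run `ray status` inside it
HEAD_POD=$(kubectl get pods -n rationai-jobs-ns -l ray.io/node-type=head \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n rationai-jobs-ns "$HEAD_POD" -- ray status
```

The output lists per-resource demand vs. availability, which usually explains why a replica cannot be placed.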
RayService Shows DEPLOY_FAILED¶
What it usually means¶
Ray Serve could not start the application or deployment. The root cause is typically visible in the Ray Serve controller logs.
What to do¶
- Describe the RayService for events: `kubectl describe rayservice <release-name> -n rationai-jobs-ns`.
- Open the Ray dashboard (helps with Serve deployment errors) and visit http://localhost:8265.
- Look for Python import errors / missing dependencies in the Serve controller and replica logs.
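The dashboard is not exposed by default; one way to reach it is port-forwarding the head service. The `-head-svc` name below follows the usual KubeRay naming convention and is an assumption — verify the actual service name with `kubectl get svc -n rationai-jobs-ns`:

```shell
# forward the Ray dashboard (port 8265) to localhost
kubectl port-forward -n rationai-jobs-ns svc/<release-name>-head-svc 8265:8265
```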
ImportError / ModuleNotFoundError¶
Symptoms¶
- Serve deployment fails immediately.
- Logs show `ModuleNotFoundError: No module named ...`.
Causes¶
- Dependency not installed in the runtime environment.
- Wrong `import_path`.
- `working_dir` does not contain the expected code.
Fix¶
- Ensure `import_path` matches your file. Example: `models.binary_classifier:app` means there is `models/binary_classifier.py` defining `app = ...`.
- Add missing dependencies to `runtime_env.pip`.
In this repository, dependencies are typically installed per deployment:
deployments:
  - name: BinaryClassifier
    ray_actor_options:
      runtime_env:
        pip: ["onnxruntime>=1.23.2", "mlflow<3.0", "lz4>=4.4.5"]
Worker Crashes (OOMKilled)¶
Symptoms¶
- Pods in `kubectl get pods` show status `OOMKilled` or high restart counts.
- `kubectl describe pod ...` shows "Last State: Terminated (Reason: OOMKilled)".
- Ray Dashboard shows unexpected actor deaths.
Causes¶
- The model loaded into memory + the input batch size exceeds the container's memory limit.
- Physical vs Logical Mismatch: Ray was told the actor needs 2GB, so it scheduled it on a node, but the actual Python process used 4GB, causing Kubernetes to kill it.
Fix¶
You must increase both the Ray logical allocation and the Kubernetes physical limit.
- Increase `ray_actor_options.memory` (software limit).
- Increase Kubernetes container limits (hardware limit): ensure the worker configurations in your Helm values (`helm/rayservice/values.yaml` or relevant worker definitions) provide more memory than the sum of all actors on that node plus overhead (~30%).
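A sketch of both knobs side by side. The deployment name, worker group name, and exact Helm values layout are illustrative — check `helm/rayservice/values.yaml` for this repository's real structure:

```yaml
# Ray logical allocation (Serve config)
deployments:
  - name: BinaryClassifier        # illustrative deployment name
    ray_actor_options:
      memory: 8589934592          # 8 GiB; what Ray uses for scheduling

# Kubernetes physical limit (worker pod template in the Helm values)
workerGroupSpecs:
  - groupName: workers            # illustrative group name
    template:
      spec:
        containers:
          - name: ray-worker
            resources:
              limits:
                memory: 12Gi      # > sum of actor memory on the node + ~30% overhead
```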
Autoscaling Not Working (Replicas Don’t Change)¶
Serve replicas not scaling¶
Check that your deployment has autoscaling configured (an `autoscaling_config` block on the Serve deployment).
Also note:
- Scale up/down is not instantaneous (delays and smoothing apply).
- If traffic is low, you may stay at `min_replicas`.
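A minimal sketch of an autoscaling block in the Serve config. Field names follow the Ray Serve schema; the deployment name and targets are illustrative (older Ray versions call the target field `target_num_ongoing_requests_per_replica`):

```yaml
deployments:
  - name: BinaryClassifier            # illustrative
    autoscaling_config:
      min_replicas: 1
      max_replicas: 4
      target_ongoing_requests: 2      # scale up when replicas are busier than this
```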
Worker pods not scaling¶
Worker pod scaling requires cluster autoscaling to be enabled on the underlying RayCluster.
Also ensure workerGroupSpecs[*].minReplicas/maxReplicas allow scaling.
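A sketch of the RayCluster side of this, assuming KubeRay's in-tree autoscaler (the group name is illustrative):

```yaml
spec:
  rayClusterConfig:
    enableInTreeAutoscaling: true
    workerGroupSpecs:
      - groupName: workers    # illustrative
        minReplicas: 0
        maxReplicas: 4
```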
Not Enough CPU / Memory (Pods Pending)¶
Symptoms¶
- Pods stay in `Pending`.
- Events mention `Insufficient cpu` or `Insufficient memory`.
Fix¶
- Check physical vs logical:
  - Physical: Can K8s schedule the pod? `kubectl describe pod` will show if nodes are full.
  - Logical: Can Ray schedule the actor? Check `ray status` or the dashboard. Ray might say "0/X CPUs available" even if the pod exists, because other actors consumed the slots.
- Adjust resources:
  - Reduce per-replica requirements (`ray_actor_options.num_cpus`, `memory`).
  - Increase cluster capacity (`maxReplicas`) or per-worker limits.
Inspect pod scheduling events:
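For example (the pod name is illustrative):

```shell
# scheduling events for a specific pending pod
kubectl describe pod <pod-name> -n rationai-jobs-ns
# or recent events across the namespace, newest last
kubectl get events -n rationai-jobs-ns --sort-by=.lastTimestamp
```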
MLflow / Artifact Download Problems¶
Symptoms¶
- `mlflow.artifacts.download_artifacts` fails.
- Timeouts during replica initialization.
Fix¶
- Ensure `MLFLOW_TRACKING_URI` is set and reachable from the cluster.
- Ensure the cluster has network access (proxy settings if needed).
- Verify the `artifact_uri` exists and permissions are correct.
In your model's Helm YAML definition this is typically configured via env_vars:
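For example (the tracking URI below is a placeholder; substitute your actual MLflow endpoint):

```yaml
runtime_env:
  env_vars:
    MLFLOW_TRACKING_URI: "https://mlflow.example.org"   # placeholder URI
```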
Code Updates Not Applying (Working Dir Cache)¶
Symptoms¶
- You updated your model Python code, pushed it to GitHub, and ran
helm upgrade, but Ray keeps deploying the old logic or throws errors that were already fixed.
Cause¶
Ray downloads the source archive defined in `working_dir: https://github.com/.../main.zip` and caches it keyed on the URL string. If the URL hasn't changed, Ray Serve will NOT re-download the archive, so it keeps serving an old snapshot of your code.
Fix¶
Append a cache buster query parameter directly to your working_dir setup:
runtime_env:
  config:
    setup_timeout_seconds: 1800
  working_dir: https://github.com/RationAI/model-service/archive/refs/heads/main.zip?v=1
Whenever you push subsequent revisions, bump `v=1` to `v=2`. On the next Helm deployment, Ray treats the URL as new, retrieves the fresh archive, and deploys it.
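If you script your deploys, the bump can be automated. A small illustrative helper (not part of this repository) that increments the `v` parameter on a `working_dir` URL:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def bump_cache_buster(url: str, param: str = "v") -> str:
    """Increment (or add) the ?v=N cache-buster on a working_dir URL."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    # missing parameter counts as 0, so a bare URL gains ?v=1
    query[param] = str(int(query.get(param, "0")) + 1)
    return urlunsplit(parts._replace(query=urlencode(query)))

url = "https://github.com/RationAI/model-service/archive/refs/heads/main.zip?v=1"
print(bump_cache_buster(url))
# https://github.com/RationAI/model-service/archive/refs/heads/main.zip?v=2
```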
Large ONNX Model Export (>2GB)¶
Problem¶
When exporting a large PyTorch model (>2GB) to ONNX, the resulting file is split into multiple parts: a .onnx graph file plus external weight files (.bin or .pb). Deploying such multi-part exports can be tricky.
Two Approaches¶
Option 1: Merge into a single ONNX file
Load the model with external data and save it back as a single unified file:
import onnx

# load the graph together with its external weight files
model = onnx.load("model-with-external-data.onnx", load_external_data=True)

onnx.save_model(
    model,
    "merged-unified-model.onnx",
    save_as_external_data=False,
    size_threshold=0,  # force everything into one file
)
Then upload merged-unified-model.onnx to MLflow. This is the simplest approach if your server has enough RAM during the merge operation.
Option 2: Upload the entire export directory
If merging consumes too much memory, keep the multi-part structure and upload everything:
import mlflow

with mlflow.start_run():
    # upload the whole export folder with both .onnx and .bin files
    mlflow.log_artifacts("onnx_export_dir", artifact_path="model")
When Ray Serve initializes and downloads the artifact, it will fetch both the graph and weight files together.
Helpful Commands¶
# list Serve and RayService resources
kubectl get rayservice -n rationai-jobs-ns
kubectl get svc -n rationai-jobs-ns
# see all pods for a RayService
kubectl get pods -n rationai-jobs-ns -l ray.io/cluster=<release-name>