Deployment Guide

This guide focuses on one goal: get a model running on Kubernetes through the Helm chart, verify it, and update it safely.

What This Guide Covers

  • Deploying a model after it has been exported to ONNX, uploaded to MLflow, and wired into a Python entrypoint.
  • Deploying a Serve app from helm/rayservice/applications/.
  • Monitoring rollout health.
  • Running a quick endpoint smoke test.
  • Updating code/config/model artifact without breaking traffic.
  • Recovering from failed changes.

What Is Covered Elsewhere

To avoid duplication, deeper topics, such as model export and configuration field details, are documented in their dedicated guides (for example Adding New Models and Configuration Reference).

Prerequisites

Before deployment, confirm:

  • KubeRay operator is installed and healthy.
  • You can deploy into your target namespace (for example rationai-jobs-ns).
  • Your model entrypoint is importable as module.path:app.
  • Cluster can access remote dependencies (for example MLflow, object storage, or GitHub working_dir).

Step 1: Prepare the Python Entry Point

If you still need to export the model or upload it to MLflow, do that first by following Adding New Models. Once the artifact exists, create the model class in models/ and implement __init__, reconfigure, predict, and root as described there.
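As a minimal sketch, the class might take the shape below. This is structure only: the class name, methods' bodies, and the final `app` assignment are illustrative, Ray Serve decorators are omitted, and a real implementation would load the ONNX artifact from MLflow in `__init__`.

```python
# Hypothetical skeleton for models/my_model.py -- structure only.
# A real class would be wrapped by Ray Serve and load its artifact from MLflow.
import numpy as np


class MyModel:
    def __init__(self):
        # Load the model artifact here (e.g. an ONNX session); a stub
        # threshold is used so this sketch stays self-contained.
        self.threshold = 0.5

    def reconfigure(self, config: dict) -> None:
        # Called when user_config changes; pick up new settings without a restart.
        self.threshold = config.get("threshold", self.threshold)

    def predict(self, tile: np.ndarray) -> np.ndarray:
        # Placeholder inference: a real implementation would run the model.
        return (tile.mean(axis=-1) / 255.0 > self.threshold).astype(np.uint8)

    def root(self) -> str:
        # Simple liveness/info endpoint.
        return "my-model is running"


app = MyModel  # import_path in the app YAML points at models.my_model:app
```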

Step 2: Create or Update App Definition

Create a file in helm/rayservice/applications/ (for example my-model.yaml) with at least:

- name: my-model
  import_path: models.my_model:app
  route_prefix: /my-model
  runtime_env:
    working_dir: https://github.com/RationAI/model-service/archive/refs/heads/feature/my-model.zip
  deployments:
    - name: MyModel
      autoscaling_config:
        min_replicas: 0
        max_replicas: 4

Notes:

  • Use a dedicated branch in working_dir during development.
  • If code changed but URL did not, append a cache-busting suffix (for example ?v=2).
  • Keep advanced tuning in YAML, but use Configuration Reference as the source of truth for field meanings.

Step 3: Deploy With Helm

helm upgrade --install <release-name> helm/rayservice -n rationai-jobs-ns

Use a dedicated release name while testing (for example rayservice-model-my-model).

Step 4: Watch Rollout

kubectl get rayservice <release-name> -n rationai-jobs-ns -w
kubectl get pods -n rationai-jobs-ns -l ray.io/cluster=<release-name>

If rollout stalls, inspect head logs:

kubectl logs -n rationai-jobs-ns -l ray.io/node-type=head --tail=200
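Rollout health can also be checked programmatically. A small sketch, assuming the JSON output of `kubectl get rayservice -o json` exposes a `status.serviceStatus` field; that field name varies across KubeRay versions, so verify it against `kubectl get rayservice -o yaml` on your cluster first.

```python
import json
import subprocess


def rayservice_ready(name: str, namespace: str) -> bool:
    """Return True when the RayService reports a Running status.

    Assumes a status.serviceStatus field, which may differ across
    KubeRay versions -- check the -o yaml output on your cluster first.
    """
    raw = subprocess.check_output(
        ["kubectl", "get", "rayservice", name, "-n", namespace, "-o", "json"]
    )
    return parse_status(json.loads(raw))


def parse_status(obj: dict) -> bool:
    # Split out so the parsing logic can be exercised without a cluster.
    return obj.get("status", {}).get("serviceStatus") == "Running"
```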

Step 5: Smoke Test Endpoint

Port-forward Serve service:

kubectl port-forward -n rationai-jobs-ns svc/<release-name>-serve-svc 8000:8000

Send one compressed request:

import lz4.frame
import numpy as np
import requests

tile = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
payload = lz4.frame.compress(tile.tobytes())

resp = requests.post(
    "http://localhost:8000/my-model/",
    data=payload,
    headers={"Content-Type": "application/octet-stream"},
)
resp.raise_for_status()
print(resp.text)

Alternatively, you can use the SDK client to send requests; it uses the same payload format that your application expects.

Step 6: Update Safely

Update code

  1. Push code changes to the branch used in runtime_env.working_dir.
  2. Redeploy:

helm upgrade --install <release-name> helm/rayservice -n rationai-jobs-ns

  3. If old code is still being served, bump the working_dir cache key (?v=<n>) and deploy again.

Update config

  1. Edit application YAML (for example autoscaling, queue limits, user_config).
  2. Redeploy with the same Helm command.

Update model weights (MLflow)

Point artifact_uri in user_config to the new artifact version and redeploy.
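For example, the relevant fragment of the app definition might look like the following; the `models:/my-model/2` value is illustrative (MLflow also accepts `runs:/<run-id>/...` URIs), so substitute your actual artifact reference.

```
deployments:
  - name: MyModel
    user_config:
      artifact_uri: models:/my-model/2   # illustrative; point at the new version
```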

Rollback and Recovery

RayService is a CRD, so typical Kubernetes rollout commands are not the main recovery path. Use state inspection and re-apply a known-good spec.

kubectl get rayservice <release-name> -n rationai-jobs-ns -o yaml
kubectl describe rayservice <release-name> -n rationai-jobs-ns
kubectl logs -n rationai-jobs-ns -l ray.io/node-type=head --tail=200

Common Pitfalls

Pod resources vs replica resources

A deployment requesting num_cpus: 4 cannot schedule onto workers that expose fewer allocatable CPUs. If replicas remain pending, verify worker pod requests/limits and Ray actor reservations match.
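As a sketch, the actor reservation in the app definition must fit inside what the worker pods actually request; the values below are illustrative, and `ray_actor_options` is the Serve field for per-replica resource reservations.

```
deployments:
  - name: MyModel
    ray_actor_options:
      num_cpus: 4   # each replica reserves 4 Ray CPUs;
                    # worker pods must request/limit at least this many CPUs
```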

Runtime environment cache

Ray caches runtime_env.working_dir by its URL string. If the URL stays identical, the code may not be refreshed even after a push. Use a version suffix in the URL (for example ?v=2) to force a refresh.
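A tiny helper for bumping the cache key might look like this; it is purely illustrative (any mechanism that changes the URL string works), using only the standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def bump_cache_key(url: str, version: int) -> str:
    """Append or replace a ?v=<n> suffix so Ray treats the URL as new."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["v"] = str(version)
    return urlunparse(parts._replace(query=urlencode(query)))
```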

Multi-Model Deployment

You can deploy multiple models by adding multiple files in helm/rayservice/applications/. Helm renders them into one serveConfigV2.

Keep routes unique and validate each endpoint separately after deploy.
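For instance, a second file such as helm/rayservice/applications/other-model.yaml (all names illustrative) mirrors the single-app definition but with its own route:

```
- name: other-model
  import_path: models.other_model:app
  route_prefix: /other-model   # must not collide with /my-model
  runtime_env:
    working_dir: https://github.com/RationAI/model-service/archive/refs/heads/main.zip
  deployments:
    - name: OtherModel
      autoscaling_config:
        min_replicas: 0
        max_replicas: 2
```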