Model Service Documentation¶
Overview¶
Model Service is a Helm-based deployment framework for serving ML models on Kubernetes with Ray Serve.
It turns Python model code into stable HTTP endpoints that can batch requests, scale replicas, and run on CPU or GPU workers.
Use this documentation when:
- you want to deploy a model as an API endpoint,
- you need to tune throughput and latency,
- or you operate multi-model workloads on one Ray cluster.
What You Get¶
- Ray Serve applications managed by a Helm chart in helm/rayservice/.
- Per-model configuration in helm/rayservice/applications/.
- Default worker profiles in helm/rayservice/workers/.
- Operational guides for deployment, scaling, and troubleshooting.
Two Ways to Use the Service¶
Depending on your goal, you will either run a pre-built model or deploy a custom model:
- Run an existing model: If the model you need is already listed in Available Models, you only need to install the Helm chart. Go to Quick Start.
- Deploy a new model: If you are bringing a new ML architecture, follow the Deployment Guide, which walks you through the entire flow: Python implementation, Helm config, and deployment.
Start Here¶
I want to run an existing model¶
Use this path if the model is already implemented and you only need to start it on the cluster.
- Quick Start
- Deployment Guide (steps 2+: Helm config and deploy)
- Troubleshooting
I want to deploy a new model¶
Use this path if you are bringing a new ML architecture and need to write a custom Ray Serve application.
Start with the Deployment Guide:
- Step 1: Prepare Python Entrypoint → this will point you to Adding New Models
- Steps 2–6: Return to the Deployment Guide for Helm config and deployment
- Optimization Guide (optional, for tuning performance)
- Troubleshooting (if anything goes wrong)
Documentation Map¶
Getting Started¶
- Quick Start: first deployment using a dedicated test release name.
Guides¶
- Adding New Models: implement model code and bind routes.
- Deployment Guide: safe deployment workflow and production checklist.
- Configuration Reference: source of truth for Helm and Ray Serve settings.
- Optimization Guide: TensorRT, precision, batch tuning, and session options.
- Troubleshooting: diagnostics for deployment and runtime failures.
Models¶
- Available Models: list of pre-configured models with endpoints and SDK patterns.
Architecture¶
- Overview: system components and scaling model.
- Request Lifecycle: end-to-end request flow.
- Queues and Backpressure: overload behavior and queue tuning.
- Batching: how request coalescing works per replica.
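To make the queueing and batching topics above more concrete, here is a minimal plain-asyncio sketch of per-replica request coalescing with a bounded queue. It is not Ray Serve's implementation and does not use the Ray API. The names `MicroBatcher`, `Overloaded`, `max_queued`, and the default values are assumptions chosen for illustration only.

```python
import asyncio
from typing import Callable, List


class Overloaded(Exception):
    """Raised when the pending-request queue is full (hypothetical name)."""


class MicroBatcher:
    """Per-replica request coalescing with a bounded queue (illustrative only)."""

    def __init__(self, handler: Callable[[List[int]], List[int]],
                 max_batch_size: int = 4, batch_wait_timeout_s: float = 0.05,
                 max_queued: int = 100) -> None:
        self._handler = handler
        self._max_batch_size = max_batch_size
        self._timeout = batch_wait_timeout_s
        # Construct the batcher inside a running event loop (see main below).
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_queued)
        self._worker = None

    async def submit(self, item: int) -> int:
        """Enqueue one request and wait for its result from a batch."""
        if self._worker is None:
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        try:
            self._queue.put_nowait((item, future))
        except asyncio.QueueFull:
            # Backpressure: reject immediately instead of queueing without bound.
            raise Overloaded("queue is full") from None
        return await future

    async def _run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first request
            deadline = loop.time() + self._timeout
            # Collect more requests until the batch is full or the wait expires.
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                outputs = self._handler([item for item, _ in batch])  # one call per batch
            except Exception as exc:
                for _, future in batch:
                    future.set_exception(exc)
                continue
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def main():
    batch_sizes = []

    def double_all(xs: List[int]) -> List[int]:
        batch_sizes.append(len(xs))  # record how many requests were coalesced
        return [2 * x for x in xs]

    batcher = MicroBatcher(double_all, max_batch_size=4)
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    return results, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each caller submits a single item and receives a single result, while the handler sees lists. A larger batch size or longer wait tends to raise throughput at the cost of latency, and the bounded queue turns overload into fast rejections rather than unbounded waiting.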
Important Notes¶
- Use a dedicated test release name (for example rayservice-model-<test>) while experimenting.
- Keep default worker profiles unless you need specific hardware or scheduling behavior.
- Tune application/deployment settings first, worker templates second.
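As a rough illustration of the test-release note above, the sketch below builds (but does not run) a Helm command for a dedicated test release. It assumes the `<test>` placeholder is a suffix you choose, that the chart is installed from helm/rayservice/ as named in this documentation, and that no extra values files are required. The helper name, the `namespace` parameter, and the name check are hypothetical; the Quick Start remains the reference for the actual command.

```python
import re
import shlex
from typing import List

CHART_PATH = "helm/rayservice/"  # chart location named in this documentation


def helm_test_release_command(suffix: str, namespace: str = "default") -> List[str]:
    """Build the argv for installing a dedicated test release (hypothetical helper)."""
    # Conservative check: lowercase letters, digits, and inner hyphens only.
    if not re.fullmatch(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?", suffix):
        raise ValueError(f"invalid release suffix: {suffix!r}")
    release = f"rayservice-model-{suffix}"
    return ["helm", "upgrade", "--install", release, CHART_PATH,
            "--namespace", namespace]


# Print the command for review instead of executing it.
print(shlex.join(helm_test_release_command("alice")))
```

Printing the command first keeps the example side-effect free; pass the list to `subprocess.run` only after checking that the release name does not collide with a shared release.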
Glossary¶
- RayService: KubeRay custom resource managing a Ray cluster plus Serve applications.
- Deployment (Ray Serve): scalable unit running one part of model code.
- Replica: one running instance of a deployment.
- Worker group: pool of Ray worker pods with its own resource profile.