Model Service Documentation

Overview

Model Service is a Helm-based deployment framework for serving ML models on Kubernetes with Ray Serve.

Problem solved: turn Python model code into stable HTTP endpoints that can batch requests, scale replicas, and run on CPU or GPU workers.
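As a rough illustration of what "Python model code turned into an HTTP endpoint" can look like, here is a minimal Ray Serve sketch. The names `EchoModel`, `build_app`, and `handle_batch`, the JSON field `text`, and the replica, GPU, and batch values are assumptions made for this example; the entrypoint layout that Model Service actually expects is described in Adding New Models and may differ.

```python
class EchoModel:
    """Plain Python model code (hypothetical stand-in for a real model)."""

    def predict_batch(self, texts: list[str]) -> list[str]:
        # A real model would run inference on the whole batch here.
        return [text.upper() for text in texts]


def build_app(num_replicas: int = 2, num_gpus: float = 0):
    """Wrap EchoModel in a Ray Serve deployment and return the bound app."""
    # Imported lazily so the plain model class stays usable without Ray.
    from ray import serve
    from starlette.requests import Request

    @serve.deployment(
        num_replicas=num_replicas,                # scale out by running more replicas
        ray_actor_options={"num_gpus": num_gpus}, # 0 keeps replicas on CPU workers
    )
    class EchoDeployment:
        def __init__(self) -> None:
            self.model = EchoModel()

        # Serve collects concurrent calls into one list and runs them together.
        @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
        async def handle_batch(self, texts: list[str]) -> list[str]:
            return self.model.predict_batch(texts)

        async def __call__(self, request: Request) -> str:
            payload = await request.json()
            # Called with a single item; Serve hands back that item's result.
            return await self.handle_batch(payload["text"])

    return EchoDeployment.bind()
```

With Ray installed, `serve.run(build_app())` would start the application locally. In a cluster, the RayService resource created by the Helm chart is what runs the application.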

Depending on your goal, you will either run a pre-built model or deploy a custom model:

Run an existing model: if the model you need is already implemented and listed in Available Models, you only need to install the Helm chart to spin it up on the cluster. Action: go to Quick Start.
Deploy a new model: if you are bringing a new ML architecture and need to write a custom Ray Serve application, follow the Deployment Guide, which walks you through the entire flow: Python implementation, Helm config, and deployment.

For a new model, the reading order is:

Step 1: Prepare Python Entrypoint. This step points you to Adding New Models.
Steps 2–6: Return to the Deployment Guide for Helm config and deployment.
Optimization Guide (optional, for tuning performance).
Troubleshooting (if anything goes wrong).

Available Models: list of pre-configured models with endpoints and SDK patterns.

Use a dedicated test release name (for example rayservice-model-<test>) while experimenting.
Keep default worker profiles unless you need specific hardware or scheduling behavior.
Tune application/deployment settings first, worker templates second.
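The release-naming advice above can be sketched as a small helper that builds a test release name and the matching Helm command. The chart path, namespace, and values file in the usage note are hypothetical placeholders, and the function names are invented for this example. The name check follows Helm's rules as I understand them (lowercase alphanumerics and hyphens, at most 53 characters).

```python
import re
from typing import Optional

RELEASE_PREFIX = "rayservice-model-"  # prefix from the example release name above
MAX_RELEASE_LEN = 53                  # Helm's release-name length limit
NAME_PATTERN = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def make_test_release_name(suffix: str) -> str:
    """Return a dedicated test release name such as rayservice-model-alice."""
    name = f"{RELEASE_PREFIX}{suffix}"
    if len(name) > MAX_RELEASE_LEN or not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid Helm release name: {name!r}")
    return name


def helm_install_command(
    release: str, chart: str, namespace: str, values_file: Optional[str] = None
) -> list[str]:
    """Build (but do not run) a `helm upgrade --install` command."""
    cmd = ["helm", "upgrade", "--install", release, chart, "--namespace", namespace]
    if values_file:
        # Overrides live in a values file, so worker templates keep their defaults.
        cmd += ["--values", values_file]
    return cmd
```

For example, `helm_install_command(make_test_release_name("alice"), "./chart", "ml-test", "values-test.yaml")` returns an argument list that can be passed to `subprocess.run(cmd, check=True)`. Building the list first makes it easy to print and review the command before it touches the cluster.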

RayService: KubeRay custom resource managing a Ray cluster plus Serve applications.
Deployment (Ray Serve): scalable unit running one part of model code.
Replica: one running instance of a deployment.
Worker group: pool of Ray worker pods with its own resource profile.
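One way to keep these terms straight is to model how they nest. The sketch below is only a mental model, not the KubeRay schema or the chart's values layout; all class and field names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    """Ray Serve deployment: a scalable unit running one part of the model code."""
    name: str
    num_replicas: int = 1  # each replica is one running instance


@dataclass
class WorkerGroup:
    """Pool of Ray worker pods sharing one resource profile."""
    name: str
    pods: int
    gpus_per_pod: int = 0


@dataclass
class RayServiceModel:
    """A Ray cluster (worker groups) plus Serve applications (deployments)."""
    worker_groups: list[WorkerGroup] = field(default_factory=list)
    applications: dict[str, list[Deployment]] = field(default_factory=dict)

    def total_replicas(self) -> int:
        # Replicas from every deployment of every application.
        return sum(
            d.num_replicas for deps in self.applications.values() for d in deps
        )

    def total_gpus(self) -> int:
        # GPUs available across all worker groups.
        return sum(g.pods * g.gpus_per_pod for g in self.worker_groups)


service = RayServiceModel(
    worker_groups=[WorkerGroup("cpu", pods=2), WorkerGroup("gpu", pods=1, gpus_per_pod=1)],
    applications={"demo": [Deployment("preprocess", 2), Deployment("model", 1)]},
)
```

In this example, `service.total_replicas()` is 3 and `service.total_gpus()` is 1. Replicas belong to deployments, while hardware belongs to worker groups, which is why application and deployment settings can be tuned independently of worker templates.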