Model Service Documentation

Overview

Model Service is a Helm-based deployment framework for serving ML models on Kubernetes with Ray Serve.

It solves one problem: turning Python model code into stable HTTP endpoints that can batch requests, scale replicas, and run on CPU or GPU workers.

Use this documentation when:

  • you want to deploy a model as an API endpoint,
  • you need to tune throughput and latency,
  • or you operate multi-model workloads on one Ray cluster.

What You Get

  • Ray Serve applications managed by a Helm chart in helm/rayservice/.
  • Per-model configuration in helm/rayservice/applications/.
  • Default worker profiles in helm/rayservice/workers/.
  • Operational guides for deployment, scaling, and troubleshooting.

Two Ways to Use the Service

Depending on your goal, you will either run a pre-built model or deploy a custom model:

  1. Run an existing model. If the model you need is already listed in Available Models, you only need to install the Helm chart. Next step: go to Quick Start.

  2. Deploy a new model. If you are bringing a new ML architecture, follow the Deployment Guide, which walks you through the entire flow: Python implementation, Helm config, and deployment.

Start Here

I want to run an existing model

Use this path if the model is already implemented and you only need to start it on the cluster.

  1. Quick Start
  2. Deployment Guide (steps 2+: Helm config and deploy)
  3. Troubleshooting

I want to deploy a new model

Use this path if you are bringing a new ML architecture and need to write a custom Ray Serve application.

Start with the Deployment Guide:

  1. Step 1: Prepare Python Entrypoint → this will point you to Adding New Models
  2. Steps 2–6: Return to the Deployment Guide for Helm config and deployment
  3. Optimization Guide (optional, for tuning performance)
  4. Troubleshooting (if anything goes wrong)

Documentation Map

Getting Started

  • Quick Start: first deployment using a dedicated test release name.

Guides

Models

  • Available Models: list of pre-configured models with endpoints and SDK patterns.

Architecture

Important Notes

  • Use a dedicated test release name (for example rayservice-model-<test>) while experimenting.
  • Keep default worker profiles unless you need specific hardware or scheduling behavior.
  • Tune application/deployment settings first, worker templates second.
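
As a rough illustration of the release-name note above, the following sketch builds a dedicated test release name and the matching Helm command as an argument list. The helper names, the "ml-sandbox" namespace, and the use of "helm upgrade --install" are assumptions made for this example; the Quick Start remains the authoritative source for the actual command.

```python
import re
from typing import List

CHART_PATH = "helm/rayservice/"  # chart location named in this documentation


def make_release_name(suffix: str) -> str:
    """Build a test release name of the form rayservice-model-<test>."""
    name = f"rayservice-model-{suffix}".lower()
    # Helm limits release names to 53 characters; names must also be DNS-safe.
    if len(name) > 53 or not re.fullmatch(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?", name):
        raise ValueError(f"invalid Helm release name: {name!r}")
    return name


def helm_install_command(suffix: str, namespace: str = "ml-sandbox") -> List[str]:
    """Return the Helm invocation as a list, ready for subprocess.run.

    The default namespace is a hypothetical placeholder, not a project value.
    """
    release = make_release_name(suffix)
    return [
        "helm", "upgrade", "--install", release, CHART_PATH,
        "--namespace", namespace,
    ]
```

Returning a list rather than a shell string avoids quoting problems and lets you inspect the command before running it.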

Glossary

  • RayService: KubeRay custom resource managing a Ray cluster plus Serve applications.
  • Deployment (Ray Serve): scalable unit running one part of model code.
  • Replica: one running instance of a deployment.
  • Worker group: pool of Ray worker pods with its own resource profile.
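
To make the terms Deployment and Replica concrete, here is a minimal sketch of how plain Python model code might be wrapped as a Ray Serve deployment. EchoModel and build_app are hypothetical names invented for this example, and the replica count is arbitrary; in this project such settings would normally come from the per-model Helm configuration rather than being hard-coded.

```python
from typing import Any, Dict


class EchoModel:
    """Hypothetical stand-in for real model code; it only measures input length."""

    def predict(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        text = str(payload.get("text", ""))
        return {"length": len(text)}

    async def __call__(self, request) -> Dict[str, Any]:
        # For HTTP traffic, Ray Serve passes a Starlette Request to __call__.
        return self.predict(await request.json())


def build_app(num_replicas: int = 2):
    """Wrap EchoModel as a Ray Serve deployment running num_replicas replicas."""
    # Imported lazily so the class above can be tested without Ray installed.
    from ray import serve

    deployment = serve.deployment(num_replicas=num_replicas)(EchoModel)
    return deployment.bind()
```

Keeping the prediction logic in an ordinary method means it can be unit-tested locally, while the Serve wrapper decides how many replicas of it run on the cluster.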