py_py 10 hours ago

ML breaks some fundamental assumptions about development on Kubernetes: the inability to test code locally (no GPUs, no big datasets) means developing through deployment, with 15-30 minute iteration loops. Tools like devboxes and notebooks enable faster iteration, but they lack scale, hurt reproducibility, and introduce a multi-day process to "translate" work for production. Talk to any ML team and they'll complain about either DevEx or research-to-production.

We aim to fix that. With Kubetorch, commanding powerful compute is easy. Use simple, Pythonic APIs to specify the compute you need, and dispatch it (with `.to()`!) to Kubernetes in <2 seconds with our magic packaging and deployment system. Iteration is fast, but everything is perfectly reproducible and still captured in code.
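For a feel of what that looks like, here's a rough sketch (hypothetical: the `kt.Compute`, `kt.Image`, and `kt.fn` names and arguments are illustrative assumptions, not confirmed API — only the `.to()` dispatch is from the description above):

```python
import kubetorch as kt  # hypothetical import name

# Describe the compute you need, in code (names/args are assumptions)
gpus = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:24.08-py3"),
)

def train(epochs: int):
    ...  # ordinary PyTorch training code

# Dispatch to Kubernetes; subsequent code edits redeploy in seconds
remote_train = kt.fn(train).to(gpus)
remote_train(epochs=3)
```

The point is that the compute spec lives next to the code, so the same file that you iterate on is the one that's reproducible in CI and production.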

Looking forward, RL needs a system like Kubetorch. There's no simple way with existing Kubernetes primitives to say "launch a distributed training job, launch an inference service, launch 50 code sandboxes with different images, and then run this training loop." With Kubetorch, that's extremely easy.
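In the same hypothetical style as above (helper names like `kt.cls`, `train_step`, `PolicyServer`, and `run_episode` are placeholders I'm assuming, not confirmed API), that RL setup might read along these lines:

```python
import kubetorch as kt  # hypothetical import name

# One distributed trainer, one inference service, N sandboxes — all from Python
trainer = kt.fn(train_step).to(kt.Compute(gpus=8))      # distributed training
policy = kt.cls(PolicyServer).to(kt.Compute(gpus=1))    # inference service
sandboxes = [
    kt.fn(run_episode).to(kt.Compute(image=kt.Image(image_id=img)))
    for img in sandbox_images                           # 50 sandboxes, different images
]

# ...then the train loop itself is just plain Python orchestrating the three
```

Each `.to()` call stands in for what would otherwise be a separate set of Kubernetes manifests.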