py_py 10 hours ago

ML breaks some fundamental assumptions about development on Kubernetes: the inability to test code locally (no GPUs, no big datasets) means developing through deployment, with 15-30 minute iteration loops. Tools like devboxes and notebooks enable faster iteration, but they lack scale, hurt reproducibility, and introduce a multi-day process to "translate" work for production. Talk to any ML team and they'll complain about either DevEx or research-to-production.

We aim to fix that. With Kubetorch, commanding powerful compute is easy. Use simple, Pythonic APIs to specify the compute you need, and dispatch it (with `.to()`!) to Kubernetes in <2 seconds with our magic packaging and deployment system. Iteration is fast, but everything is perfectly reproducible and still captured in code.
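For a feel of what that looks like, here's a rough sketch (hypothetical: the `kt.Compute`, `kt.Image`, and `kt.fn` names and arguments are illustrative assumptions, not confirmed API — only the `.to()` dispatch is from the description above):

```python
import kubetorch as kt  # hypothetical import name

# Describe the compute you need, in code (names/args are assumptions)
gpus = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:24.08-py3"),
)

def train(epochs: int):
    ...  # ordinary PyTorch training code

# Dispatch to Kubernetes; subsequent code edits redeploy in seconds
remote_train = kt.fn(train).to(gpus)
remote_train(epochs=3)
```

The point is that the compute spec lives next to the code, so the same file that you iterate on is the one that's reproducible in CI and production.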

Looking forward, RL needs a system like Kubetorch. There's no simple way with existing Kubernetes primitives to say "launch a distributed training job, launch an inference service, launch 50 code sandboxes with different images, and then run this training loop." With Kubetorch, that's extremely easy.
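In the same hypothetical style as above (helper names like `kt.cls`, `train_step`, `PolicyServer`, and `run_episode` are placeholders I'm assuming, not confirmed API), that RL setup might read along these lines:

```python
import kubetorch as kt  # hypothetical import name

# One distributed trainer, one inference service, N sandboxes — all from Python
trainer = kt.fn(train_step).to(kt.Compute(gpus=8))      # distributed training
policy = kt.cls(PolicyServer).to(kt.Compute(gpus=1))    # inference service
sandboxes = [
    kt.fn(run_episode).to(kt.Compute(image=kt.Image(image_id=img)))
    for img in sandbox_images                           # 50 sandboxes, different images
]

# ...then the train loop itself is just plain Python orchestrating the three
```

Each `.to()` call stands in for what would otherwise be a separate set of Kubernetes manifests.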