Staff ML Platform Engineer - Large Scale Training (LLMOps/MLOps)
Company: Socotra, Inc.
Location: San Francisco
Posted on: May 24, 2025
Job Description:
Build the Future of Scalable AI at TrueFoundry
At TrueFoundry, we're redefining how ML teams train, deploy, and scale
their models. Our LLMOps and MLOps platform empowers organizations to
experiment faster, train large-scale models reliably, and deploy
them seamlessly on Kubernetes, with the same muscle as Big Tech.
We're looking for ML Systems Engineers who are passionate
about scaling deep learning workloads, optimizing multi-GPU
training, and shipping production-grade solutions. If you live and
breathe PyTorch and multi-node training, and love solving gnarly infra
challenges, this is your place.
What You'll Work On
- Write clean, modular, and scalable Python code, with a strong
emphasis on reliability and performance.
- Build a platform for training and fine-tuning large-scale ML
models across multi-GPU, multi-node clusters with PyTorch,
Kubeflow, and other orchestration tools.
- Own the infrastructure and code that enables high-throughput,
low-latency inference pipelines for state-of-the-art models.
- Build a platform for developing, deploying, and evaluating agentic
applications for our end customers.
- Help shape internal standards and best practices across the
engineering team for high-scale ML workloads.
What We're Looking For
- 5+ years of hands-on experience building and deploying ML
systems at scale.
- 5+ years of writing production-quality, high-performance
code.
- Deep experience with multi-GPU/multi-node training, ideally
with PyTorch as your primary framework.
- Experience working with PyTorch, high-level ML frameworks, and
inference engines (vLLM or TensorRT).
- Experience with Kubernetes is highly preferred; exposure to
Kubernetes-native tools is a huge plus.
- A pragmatic mindset: you know when to optimize and when to
ship.
- Bonus: Familiarity with open-source LLM
training/fine-tuning.
Why Join TrueFoundry?
- Work directly with ex-Facebook engineers and founders from IIT
Kharagpur, UC Berkeley, and Y Combinator alumni.
- First-hand exposure to building and scaling a deep-tech
startup, with insights you'll carry if you want to start your own one
day.
- Be part of a fearlessly experimental culture focused on
customer success and long-term impact.
- Flexible hours, learning credits, and the opportunity to work
shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).