Wekatest
Senior Team Lead - AI Inference
Engineering · Full-time · U.S. Remote
Salary: Not listed
Work type: Remote
Job type: Full-time
Industry: AI
About the role
What You'll Work On
- Lead & Own: Take end-to-end ownership of AMG's core inference infrastructure — from the NVMe Token Warehouse and GDS data paths to the vLLM/LMCache serving stack — driving technical decisions and delivery outcomes.
- Technical Direction: Guide a team of engineers through design, implementation, and delivery of high-throughput, low-latency LLM inference systems, setting high standards for code quality, architecture, and reliability.
- Build at Scale: Stay hands-on across the AMG stack (Python, C++, CUDA, vLLM, NIXL/Dynamo, Kubernetes), contributing directly to production systems while providing technical leadership to the team.
- Solve Hard Problems: Tackle the real frontier challenges of inference engineering — disaggregated prefill/decode, persistent off-HBM KV caching, RDMA-based transport, and multi-tier GPU memory hierarchies — that define what's possible at scale.
- Grow People & Teams: Mentor and coach engineers through regular 1:1s, career coaching, and sprint reviews. Foster a culture of ownership, collaboration, and technical excellence within the AMG team.
- Stay on the Frontier: Track the evolving inference ecosystem, benchmark new tools (SGLang, TRT-LLM, NVIDIA Dynamo), and help the team make timely decisions about when to adopt, build, or pivot.
What We're Looking For
- Experienced Engineering Leader: 5+ years of professional software engineering, with proven experience leading engineers and owning complex production systems — ideally in AI/ML infrastructure or high-performance computing.
- Deep AI Inference Background: Hands-on expertise with LLM serving systems — KV cache reuse, disaggregated prefill/decode, continuous batching, and multi-tier GPU memory hierarchies (HBM → NVMe). Strong familiarity with vLLM, LMCache, NIXL/NVIDIA Dynamo, or similar frameworks.
- Systems Engineering Depth: Strong Python and C++ skills (Rust a plus), with a solid grasp of CUDA, GPU memory management, and high-performance I/O — including GPUDirect Storage (GDS), RDMA, and NVMe data paths.
- Infrastructure Fluency: Experience deploying and scaling GPU workloads on Kubernetes, plus familiarity with RDMA networking, bare-metal GPU clusters (H100/A100), and high-throughput distributed storage.
- People Leadership: Demonstrated ability to mentor and develop engineers — running effective 1:1s, supporting career growth, and balancing technical execution with long-term team health.
- High Bar for Quality: A strong sense of engineering craftsmanship, with a track record of building reliable, high-throughput systems and continuously improving engineering practices.
The WEKA Way
- We are Accountable: We take full ownership, always, even when things don’t go as planned. We lead with integrity, show up with responsibility and ownership, and hold ourselves and each other to the highest standards.