Nebius

Senior ML Engineer (Token Factory)

Engineering | Full-time | Amsterdam, Netherlands; Berlin, Germany; Israel; London, United Kingdom; Prague, Czech Republic; Remote - Europe
Salary: Not listed
Work type: Remote
Job type: Full-time
Industry: AI

About the role

About Nebius

Nebius is leading a new era in cloud infrastructure for the global AI economy. We are building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment, without the cost and complexity of building large in-house AI/ML infrastructure.

Built by engineers, for engineers. From large-scale GPU orchestration to inference optimization, we own the hard problems across compute, storage, networking and applied AI.

Role

Token Factory is part of Nebius Cloud, one of the world's largest GPU clouds, running tens of thousands of GPUs. We are building a high-performance inference and fine-tuning platform designed to push foundation models to their hardware limits. Our mission is to maximise throughput, minimise latency and optimise cost per token across this fleet.

Directions We Are Working On

  • Inference Optimisation: Identify LLM inference bottlenecks and drive production speedups, squeezing maximum performance out of a wide range of LLM architectures at scale (e.g., GPT-OSS, Kimi K2.5, DeepSeek V3.1/V3.2, GLM-5).
  • Inference Engine Support: Implement novel speculative decoding architectures, optimise components of various LLM designs (dense/MoE, autoregressive/parallel), and contribute to open-source inference engines.
  • Low Precision Training & Inference: Design and productionise low-precision (FP8, NVFP4/MXFP4) training and inference pipelines with measurable gains in throughput and cost-efficiency.

We Expect You To Have

  • A profound understanding of the theoretical foundations of machine learning and of the transformer architecture
  • Experience profiling GPU workloads using Nsight, PyTorch profiler, or similar tools
  • Understanding of the GPU memory hierarchy and compute/memory trade-offs
  • Familiarity with key ideas in the LLM space, such as MHA, RoPE, KV cache, FlashAttention and quantisation
  • Understanding of the performance aspects of large-scale neural network training (sharding strategies, custom kernels, hardware features, etc.)
  • Strong software engineering skills (we mostly use Python)
  • Deep experience with modern deep learning frameworks
  • Proficiency in modern software engineering practices, including CI/CD, version control and unit testing
  • Strong communication and leadership abilities

Nice To Have

  • Experience working with open-source inference engines (vLLM, SGLang, TensorRT-LLM), including contributions
  • Experience with kernel languages or DSLs such as Triton, CuTe, CUTLASS or CUDA
  • A track record of building and delivering products (not necessarily ML-related) in a dynamic, startup-like environment
  • Strong engineering skills, including experience developing large distributed systems or high-load web services
  • Open-source projects that showcase your engineering prowess
  • Excellent command of English, with strong writing and communication skills

Benefits & Perks

  • Competitive compensation
  • Career growth and learning opportunities
  • Flexibility and work-life balance
  • Collaborative and innovative culture
  • Opportunity to work on impactful AI projects