ML Infrastructure Engineer
About Nebius
Nebius is leading a new era in cloud infrastructure for the global AI economy. We are building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment, without the cost and complexity of building large in-house AI/ML infrastructure.
Built by engineers, for engineers. From large-scale GPU orchestration to inference optimization, we own the hard problems across compute, storage, networking and applied AI.
Listed on Nasdaq (NBIS) and headquartered in Amsterdam, we have a global footprint with R&D hubs across Europe, the UK, North America and Israel. Our team of 1,500+ includes hundreds of engineers with deep expertise across hardware, software and AI R&D.
The role
We are seeking a highly skilled ML/AI Engineer to join our team and lead the benchmarking of GPU platforms for machine learning and AI workloads. You will play a critical role in evaluating the performance of GPU-based hardware across a range of deep learning and AI frameworks, enabling data-driven decisions for platform optimisation and next-generation hardware development.
Your responsibilities will include:
- Work closely with hardware and development teams to profile and analyse GPU performance at the system and kernel level.
- Evaluate and compare GPU performance across different platforms, architectures, and software stacks (e.g., CUDA, ROCm).
- Debug and optimise ML workloads to run efficiently on GPU hardware, identifying and resolving performance bottlenecks.
- Perform acceptance testing for new GPU clusters, ensuring hardware and software meet performance, stability, and compatibility requirements for AI workloads.
- Run experiments across diverse GPU system configurations to assess how different interconnect strategies and system-level optimisations affect performance and scalability.
- Develop tools and dashboards to visualise performance metrics, bottlenecks, and trends.
- Contribute to internal tooling, frameworks, and best practices.
We expect you to have:
- A profound understanding of the theoretical foundations of machine learning.
- Deep understanding of the performance aspects of large neural network training and inference (data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimisations, dynamic batching, etc.).
- Extensive experience with modern deep learning frameworks and libraries (PyTorch, JAX, Megatron-LM, TensorRT-LLM).
- Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries.
- Familiarity with containerized environments (e.g., Docker, Kubernetes).
- Strong communication skills and the ability to work independently.
Ways to stand out from the crowd:
- Familiarity with modern LLM inference frameworks (vLLM, SGLang, TensorRT-LLM).
- Experience with Python and performance profiling.