Staff Software Engineer in Hardware Infrastructure Observability
About the role
The team
The Hardware Automation team builds the internal platforms and tooling that power how Nebius operates its data center infrastructure at scale. Our mission is to eliminate manual effort, reduce human error, and give every team in the Hardware Infrastructure department real-time visibility and control over the systems they own. We operate as a product engineering team embedded within hardware infrastructure — meaning we don't just write requirements and hand them off. We own the full stack: from requirements gathering with data center operations and hardware engineering, through design and build, all the way to rollout and ongoing reliability.
The Role
Nebius is looking for a Staff Software Engineer to join the Hardware Infrastructure Observability team. You're welcome to work from our office in Amsterdam. We build and run low-level monitoring for servers and data center engineering systems to ensure reliability at scale. We also design and operate maintenance and remediation systems that enable safe, predictable fleet-wide changes and keep the infrastructure healthy.
Key Responsibilities
- Design and develop services and agents that provide deep visibility into a large server fleet and DC engineering systems
- Evolve our metrics, aggregation, and alerting pipelines and improve signal quality
- Build maintenance workflows and automation that keep fleets healthy
- Investigate incidents hands-on (including on-host debugging) and drive root-cause fixes
- Collaborate with hardware, networking, and DC operations to improve reliability
We expect you to have
- 5+ years of professional software engineering experience
- Excellent knowledge of Python and Golang, or readiness to pick up these languages quickly
- Strong Linux fundamentals
- Ability to write reliable code and dig into complex problems
- Working proficiency in English
It will be a bonus if you have
- Solid understanding of modern server architecture and its components
- Experience with Prometheus-compatible metrics, monitoring, and alerting stacks (such as VictoriaMetrics)
- Good knowledge of computer networks
- Experience designing, developing, and running high-load distributed systems
We expect Staff Engineers to
- Manage large-scale projects involving multiple stakeholders
- Break down complex tasks and guide both their own work and that of more junior colleagues
- Be experts in specific technologies and write high-quality code that can serve as a reference
- Assess task priority and focus on high-impact work, avoiding low-value efforts
- Have strong architectural thinking and contribute