Senior SRE / Platform Engineer (m/f/d)
The Role
We are looking for a Senior SRE / Platform Engineer (m/f/d) to own and improve the cloud infrastructure behind SimScale's browser-based simulation platform. The role spans AWS and EKS, observability, disaster recovery, security and compliance controls, multi-region architecture, elastic GPU/HPC capacity, and internal developer tooling.
SimScale's engineering teams run workloads directly on AWS; you will build the standards, guardrails, and self-service tooling that let them do so safely, raising reliability and security without slowing engineering velocity. You will join a small, tightly knit infrastructure team supporting 50+ engineers across the company. This is a hands-on senior individual contributor role; people management is not required, but there is a genuine path toward tech-lead ownership as the team grows.
Your Opportunity
- Evolve our Kubernetes platform: Evaluate and adopt technologies such as Kubernetes Gateway API and service mesh patterns, and coordinate platform evolution across 10+ engineering teams.
- Take observability to the next level: Drive organization-wide adoption of OpenTelemetry for distributed tracing and metrics, and help teams define meaningful SLOs.
- Shape multi-region architecture and data residency: Support our move from an EU-centered footprint toward a global, multi-cloud architecture that satisfies disaster-recovery and data-residency requirements.
- Own cloud cost and efficiency at scale: Keep petabyte-scale infrastructure cost-efficient, secure, and well-instrumented.
- Improve tooling: Build self-service AWS account provisioning, guardrails, and AI-assisted automation that help engineering teams manage infrastructure safely and efficiently at scale.
What We Expect from You
- 5+ years of professional experience in SRE, platform, or infrastructure engineering.
- Software development experience: Your background is rooted in software development, and you moved into SRE from there. You write production-quality software in at least one of Python, Go, Rust, or Java.
- Strong systems foundation: You understand Linux internals and distributed systems well enough to debug complex production behavior.
- Hands-on cloud and infrastructure experience: AWS (or GCP), declarative infrastructure (Terraform), GitOps workflows (Argo CD), and container orchestration (Kubernetes).
- Observability and reliability experience: You have worked with OpenTelemetry, Prometheus, distributed tracing, monitoring, and meaningful SLOs/SLIs.
- Production debugging depth: You can investigate complex failures, communicate clearly during incidents, and turn findings into durable improvements.
- Security and compliance awareness: You understand how infrastructure decisions affect access control, auditability, disaster recovery, logging, and standards such as SOC 2.
- Clear communication: You can explain trade-offs to engineering teams and help others adopt better platform practices without unnecessary friction.
Bonus Points
- An open source portfolio or contributions.
- Prior technical leadership experience, especially in infrastructure, reliability, or platform engineering.
Location: Remote (within CET ±5h)
What You Can Expect from Us
- Join a dedicated, supportive team with unlimited growth opportunities and leadership potential
- Make an impact quickly by sharing ideas and contributing to creative, goal-oriented projects
- Work in a diverse, inclusive environment with colleagues from over 35 countries
- Enjoy flexible working hours and remote work options