Senior Site Reliability Engineer
What this role is about:
Are you excited to work on systems where reliability directly impacts real-world outcomes? At RapidSOS, we build technology that powers emergency response, ensuring critical data gets to the right place at the right time. When these systems degrade or fail, the impact is real. Reliability isn’t a background function; it’s fundamental to how our product shows up in critical moments.
We’re seeking a Senior Site Reliability Engineer to own the performance and stability of services that operate at scale in real-world, high-stakes environments. You’ll go beyond surface-level fixes, digging into everything from service behavior in Kubernetes to application-level decisions that affect performance, cost, and reliability. You’ll collaborate closely with engineering teams to improve how our systems are built, observed, and operated. Along the way, you’ll help shape how we approach reliability as a discipline: closing visibility gaps, improving resilience, and ensuring our platform performs when it matters most.
What you’ll do:
- Own performance and reliability outcomes: Take ownership of how application-level decisions, including connection pooling, database architecture, traffic routing patterns, and memory allocation, create system-level impact. Partner directly with the engineering teams that own specific domains to improve reliability and performance across their systems.
- Design for system resilience: Strengthen reliability through proactive design decisions, including safer deployment patterns, failover strategies, and redundancy approaches that improve system behavior under stress.
- Build observability into system behavior: Proactively instrument services with structured logging, metrics, and alerting so systems are easier to understand and debug. Focus on creating clear signals from production behavior before issues escalate.
- Own incidents from signal to resolution: Take ownership of production issues from first signal through resolution. Investigate across infrastructure and application layers, identify root causes, and implement fixes that restore stability and strengthen system behavior over the long term.
- Work across the stack without a permission slip: Move freely across infrastructure-as-code, container orchestration, CI/CD pipelines, and service-level application code, resolving issues at their root cause and proactively shaping how systems are built so that reliability improves from the start.