Affirm
Manager, Software Engineering (Resilience Engineering)
engineering | full-time | Remote US
SALARY: Not listed
WORK TYPE: remote
JOB TYPE: full-time
INDUSTRY: fintech
About the role
What you'll do
Leadership & Strategy
- Define and drive the vision for resilience engineering at Affirm, with a focus on production load testing and chaos engineering as first-class engineering practices.
- Lead and mentor a team of engineers building platforms and tooling for safe production experimentation.
- Partner with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle.
- Establish best practices for safely testing system limits and failure scenarios in production.
Systems & Operations
- Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection.
- Ensure strong safeguards are in place, including isolation boundaries, approval workflows, and automated rollback mechanisms to protect real users.
- Build systems that provide end-to-end observability, traceability, and auditability for all resilience experiments.
- Drive reliability improvements by systematically identifying weaknesses through load testing and chaos experiments.
- Establish monitoring, alerting, and incident response practices tailored to proactive resilience validation.
Collaboration & Enablement
- Work closely with engineering teams to design and execute production load tests and chaos experiments safely.
- Partner with infrastructure teams to build guardrails around tests and experiments.
- Enable teams to adopt resilience practices by providing reusable tooling, frameworks, and standardized workflows.
- Identify systemic weaknesses and lead cross-functional efforts to improve reliability and fault tolerance.
- Evangelize a culture of “test failure before failure tests you” across the organization.
What we look for
- Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
- Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
- Experience using a chaos engineering platform such as Gremlin, Harness, or a similar tool.
- Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages.
- Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails, auditability).
- Familiarity with cloud-native environments (AWS, Kubernetes) and observability tooling.
- Strong programming background (e.g., Python).