Cloud Reliability & Recovery Engineer
Role Overview
We are seeking an experienced Cloud Engineer to design, implement, and continuously improve our Business Continuity Planning (BCP) and Disaster Recovery (DR) capabilities across AWS cloud environments.
This is a hands-on technical role requiring deep AWS expertise, strong scripting skills, and a passion for building highly available, fault-tolerant, and resilient cloud architectures using container orchestration with Kubernetes and infrastructure as code with Terraform. You should have a good understanding of CI/CD pipelines that enable rapid, reliable deployments and minimize downtime, and be adept at implementing DR strategies, including multi-region failover, backup and restore automation, and recovery testing aligned with industry BCP/DR standards. You will collaborate closely with security, infrastructure, and application teams to ensure our systems can withstand and rapidly recover from any disruption.
Reports To: Director of Event Response
Level: Senior Individual Contributor
Key Responsibilities
Cloud Resilience Architecture
- Design and implement multi-region, multi-AZ AWS architectures that meet RTO/RPO targets
- Engineer active-active and active-passive failover patterns using Route 53, Global Accelerator, and CloudFront
- Build automated DR runbooks and playbooks using AWS Systems Manager Automation and Step Functions
- Implement chaos engineering practices using AWS Fault Injection Service (FIS) to validate resilience
- Architect cross-region replication strategies for S3, DynamoDB Global Tables, RDS, and Aurora Global Database
- Review containerized workloads running on Kubernetes, ensuring resilience through self-healing, auto-scaling, and multi-cluster or multi-region deployments
Backup & Recovery Engineering
- Administer AWS Backup across supported services (EC2, EBS, RDS, EFS, FSx, DynamoDB, Aurora) with policy-based automation
- Design immutable backup vaults and cross-account/cross-region backup replication pipelines
- Develop and automate data recovery testing procedures, ensuring integrity and meeting defined SLAs
- Implement point-in-time recovery (PITR) for databases and storage; validate via regular restore drills
- Maintain Business Continuity Plans (BCP) and Disaster Recovery (DR) strategies, including tracking Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets
Infrastructure as Code & Automation
- Design and manage infrastructure as code using Terraform to ensure consistent, repeatable deployments
- Implement CI/CD pipelines for automated testing and deployment of infrastructure changes
- Develop automation scripts for routine operational tasks and incident response
- Monitor system health and performance, proactively identifying and resolving issues