Senior/Staff DevOps Engineer
About Ethos
Ethos is on a mission to bridge the human readiness gap by transforming how training is developed, consumed, and aligned with strategic business outcomes. As a well-funded Series A startup ($40M+ raised), we’re a trusted partner to 150+ enterprise customers across the U.S. military, life sciences, manufacturing, supply chain, and professional sports.
We’re expanding our engineering team to deliver a best-in-class learning platform that is smarter, faster, and better optimized. We’ve gone all-in on AI tooling in our development process, and we’re adopting and extending the best emerging practices for building software with it.
About the Role
You’ll lead the deployment and operationalization of our SaaS products across Commercial Cloud, government networks, and bespoke/air-gapped customer environments. As a Senior engineer, you’ll own end-to-end infrastructure delivery, elevate DevOps practices, and collaborate closely with Software and Product. As a Staff engineer, you’ll additionally shape platform engineering strategy, set technical direction for distributed systems at scale, and influence design patterns that enable AI workloads and complex data pipelines. You’ll treat AI tooling as core to your daily workflow — for IaC, pipelines, incident response, and toil reduction — and help shape the agentic operations patterns and AI workloads our platform runs.
If you love solving hard deployment problems, care deeply about security and reliability, can scale modern cloud platforms with rigor, and embrace AI-augmented operations as the way forward, this role is for you.
What You’ll Do
- Design & Operate the Platform: Architect, implement, and run secure, scalable, multi-tenant infrastructure using infrastructure as code, immutable artifacts, and GitOps.
- AI-Augmented Operations & Platform Work: Use AI coding and agentic tools (Claude Code, Cursor, Copilot, MCP-based ops agents) for IaC authoring, pipeline development, log/trace analysis, postmortem drafting, and toil reduction; build and improve agentic workflows for the team.
- CI/CD & Release Engineering: Build and harden pipelines (build, test, scan, sign, promote, deploy) for multi-environment delivery—including disconnected/air-gapped workflows.
- Observability & Reliability: Establish SLOs; instrument systems for metrics/logs/traces; drive incident response and postmortems; reduce MTTR and change failure rate.
- Security & Compliance by Design: Integrate supply-chain security (SBOMs, signing, provenance), secrets management, and baseline hardening (CIS/STIG-aligned).
- Cost & Performance: Optimize infrastructure spend and performance (capacity planning, autoscaling, right-sizing, storage/egress strategies).
- Technical Leadership: Lead design reviews, author RFCs, mentor engineers, and raise the quality bar for platform changes.
- Gov/Constrained Deployments: Support IL-4/IL-5-aligned patterns, RMF documentation, and offline artifact promotion processes where needed.
- (Staff) Strategy & Standards: Define platform roadmaps, establish consistent deployment and infrastructure patterns, and guide cross-team adoption of best practices.
Measures of Success (First 6–12 Months)
- Availability & Reliability: Meet or exceed service SLOs; reduce MTTR by ≥30%.
- Delivery Velocity: Reduce