Principal Site Reliability Engineer
ABOUT THE ROLE
As a Principal Site Reliability Engineer, you will serve as a technical leader responsible for the reliability, scalability, performance, and operational excellence of Accela's Civic Platform. You will partner closely with Engineering, DevOps, Database Engineering, Security, and Architecture teams to evolve our cloud platform, modernize infrastructure, and ensure our SaaS offerings remain highly available, secure, and cost-effective at scale.
This role combines deep technical expertise with strategic influence. You will drive reliability initiatives, define operational standards, mentor engineers, and lead complex technical efforts that improve the resiliency and efficiency of our platform. Your focus is simple: keep systems resilient, scalable, secure, and continuously improving.
SPECIFIC RESPONSIBILITIES
- Serve as a technical leader for reliability engineering, operational excellence, and platform modernization across the Civic Platform.
- Drive platform modernization initiatives, including the continued evolution from VM-based architectures toward containerized and cloud-native services, in partnership with DevOps Engineering, Database Engineering, Security, and Development teams.
- Lead efforts that improve and sustain the availability, performance, scalability, security, and cost efficiency of Accela's SaaS offerings.
- Define, implement, and operate service level objectives (SLOs), service level agreements (SLAs), and error budgets for critical platform services, using data to drive prioritization and risk-based decision-making.
- Lead observability initiatives across metrics, distributed tracing, logging, and monitoring platforms to improve system visibility and accelerate issue detection and resolution.
- Drive Root Cause Analysis (RCA) efforts for complex production incidents, facilitate blameless postmortems, and ensure corrective actions are implemented and tracked to completion.
- Design, develop, and maintain automation, tooling, and software solutions that improve reliability, operational efficiency, scalability, and developer productivity.
- Serve as a senior technical escalation point during production incidents and for platform changes that impact availability, performance, security, or compliance.
- Partner with Security and Compliance teams to ensure platform operations meet regulatory and compliance requirements, including SOC 2, HIPAA, FedRAMP, StateRAMP, and PCI-DSS.
- Translate operational metrics, reliability trends, and platform health data into actionable insights for engineering leadership and executive stakeholders.
- Mentor engineers across the Cloud Engineering organization and influence engineering best practices through technical leadership and collaboration.
REQUIRED QUALIFICATIONS
- 8+ years of experience in Site Reliability Engineering, Software Engineering, Cloud Infrastructure, or related disciplines within a SaaS environment, including experience leading complex technical initiatives.
- Demonstrated technical leadership driving platform modernization in containerized and orchestrated environments, including Kubernetes or equivalent technologies.
- Hands-on experience operating and supporting large-scale SaaS platforms on Microsoft Azure.
- Experience developing automation and operational tooling using Python, PowerShell, Bash, or similar scripting languages.
- Deep expertise designing, operating, analyzing, and troubleshooting complex distributed systems across the application, infrastructure, networking, and operating system layers.
- Strong experience with modern observability platforms, including monitoring, logging, metrics, and distributed tracing.
- Demonstrated success leading incident response, Root Cause Analysis, and continuous improvement initiatives.