SonicWall

Principal Site Reliability Engineer

engineering | full-time | USA-Remote

SALARY

Not listed

WORK TYPE

remote

JOB TYPE

full-time

INDUSTRY

general

About the role

Job Description

As a Principal Site Reliability Engineer, you will own the reliability, scalability, and operational excellence of our cloud-based services. You will define and enforce reliability standards, drive the adoption of SRE practices across engineering teams, and build the systems and tooling that keep our production infrastructure healthy. We follow a DevOps model: Development and Operations teams are integrated, and the SRE function acts as the reliability layer, setting Service Level Objectives, managing error budgets, and continuously reducing toil through engineering.

Key Responsibilities

Define, publish, and continuously refine Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for all critical services, partnering with product and engineering leadership.
Own the error budget framework: track consumption, enforce error budget policies, and drive reliability investments when budgets are at risk.
Lead the design and implementation of comprehensive observability platforms — metrics, structured logging, and distributed tracing — to ensure full visibility into production systems.
Drive toil reduction initiatives by identifying and automating repetitive, manual operational work, targeting a measurable reduction in operational burden across teams.
Design and execute chaos engineering programs to proactively uncover reliability weaknesses in our infrastructure and services before they impact customers.
Lead blameless postmortem culture: facilitate incident retrospectives, extract systemic learnings, and track corrective action items to completion.
Build and improve on-call incident response processes, runbooks, and escalation paths; manage and optimize on-call rotation health to prevent burnout.
Help design, build, and support infrastructure and security technologies within the cloud that offer resiliency, observability, and optimized cost.
Develop solutions for automated deployment of software and services on our production infrastructure hosted on AWS, applying reliability engineering principles throughout.
Shape how mission-critical enterprise software solutions are developed and deployed using optimized CI/CD pipelines that embed reliability and quality gates.
Develop management solutions for services across multiple cloud platforms and data centers, with a focus on fault tolerance and graceful degradation.
Collaborate with developers to bring new features and services into production using production-readiness reviews and launch checklists.
Champion reliability engineering best practices across the organization, embedding SRE principles into the software development lifecycle.
Mentor team members on SRE philosophy, technical decision-making, code reviews, and cloud engineering best practices.
Participate in roadmap planning, identify areas of improvement, and perform technology evaluation and selection.

Required Skills and Qualifications

Deep experience in site reliability engineering, platform engineering, or infrastructure roles.
Expertise in cloud platforms (AWS preferred) and container orchestration (e.g., Kubernetes).
Proficiency in programming/scripting (e.g., Python, Go, Bash).
Strong background in observability tools (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
Experience with CI/CD pipelines and infrastructure-as-code (e.g., Terraform, Ansible).
Excellent problem-solving and communication skills.
