Devsu
Senior Site Reliability Engineer (SRE) - (GCP)
Engineering | Full time | Guatemala
Salary: Not listed
Work type: Remote
Job type: Full time
Industry: General
About the role
Responsibilities
Monitoring & Observability (Core Focus)
- Own and operate the monitoring and observability stack across on-prem and GCP environments
- Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
- Define, tune, and maintain alerts to ensure a high signal-to-noise ratio
- Establish observability standards and best practices across teams
- Improve visibility into system health, performance, and reliability
Site Reliability Engineering
- Apply SRE principles to improve availability, performance, and resilience
- Define and track SLIs, SLOs, and error budgets
- Participate in on-call rotations and SEV incident response
- Lead or contribute to incident investigations and root cause analysis (RCA)
- Drive preventative actions to reduce repeat incidents
Kubernetes & Platform Reliability
- Support and monitor Kubernetes environments (GKE and on-prem clusters)
- Monitor cluster health, capacity, and resource utilization
- Troubleshoot platform-level issues impacting application reliability
- Collaborate with Platform and Engineering teams on reliability improvements
Secondary Responsibilities (Backup Application Support)
- Provide L2/L3 application support coverage during support team staffing shortages, high-severity incidents (SEVs), peak support periods, or escalations
- Triage and troubleshoot application issues using existing runbooks and dashboards
- Collaborate with Application Support and Engineering teams during incidents
- Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)