Site Reliability Engineer I
About Backblaze
Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we’re helping customers break free from the restrictive, overpriced legacy solutions that hold them back, and blaze forward with the full power of the open cloud in their hands.
Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we did a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $100M in revenue and is the leading specialized storage cloud, managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries, including businesses, developers, IT professionals, and individuals.
About the role
We are looking for a Site Reliability Engineer I to help support the stability, health, and day-to-day operations of Backblaze’s infrastructure. This role serves as a first line of response for customer-affecting issues and production alerts, helping drive timely incident resolution, maintain service reliability, and support operational readiness across our environments. You will work closely with TechOps, Data Center Technicians, and other cross-functional teams to troubleshoot issues, monitor system health, support deployments and migrations, and improve day-to-day operational processes through documentation and automation. The ideal candidate is technically curious, calm under pressure, eager to learn, and excited to grow in a hands-on infrastructure and reliability role.
What You’ll Do
- Act as the first point of contact for all customer-affecting issues.
- Be a key driver in resolving technical problems.
- Ensure that incident management processes are followed and that incident post-mortems are completed to capture process deviations and areas for improvement.
- Deliver consistent communication to management.
- Monitor Zabbix regularly and respond to Zabbix alerts, either by taking direct action or by escalating. Acknowledge every alert, noting either the direct action taken or the escalation point of contact.
- Make sure escalations are handed off successfully.
- Ensure the health of pods across all sites (define pod alerts in Zabbix).
- Work through daily filesystem checks for pods.
- Troubleshoot technical issues for DC Techs: advanced pod questions, deployment questions, migration troubleshooting, and Ansible playbook issues.
- Identify and escalate any potential network issues.
- Perform Vault pre-deployment configuration and testing.
- Start Vault migrations, monitor migration pods, and handle applicable migration pod health checks.
- Document and work on automating daily items.
- Document and provide network IPs for upcoming deployments.
- Monitor releases and updates to the server farm, and escalate issues as they arise.
- Participate in on-call rotation shifts.
- Assist fellow TechOps team members in handling tasks.
- Recommend improvements to organizational productivity.
- Be able to work outside of normal business hours (weekend shifts, holidays, and evenings) as needed.
The Right Fit
- 2-4 years of relevant experience.
- Knowledge of Linux and system administration.
- Desire to learn and develop all necessary technical skills.
- Strong analytical thinking.
- Strong communication skills and the ability to work across different teams.
- Knowledge of networking.