Strategic Operations Engineer III
About the role
About Backblaze
Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we’re helping customers break free from the restrictive, overpriced legacy solutions that hold them back, and blaze forward with the full power of the open cloud in their hands.
Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we completed a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $100M in revenue and is the leading specialized storage cloud, managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries, including businesses, developers, IT professionals, and individuals.
What You'll Do
Incident Management
- Be available to lead and govern the end-to-end incident management lifecycle, including detection, triage, escalation, and resolution.
- Drive major incident management (MIM) processes and communications.
- Improve MTTR (Mean Time to Resolution) through automation and process optimization.
- Establish and maintain incident response playbooks and runbooks.
Problem Management
- Maintain and improve AI/ML-driven heatmaps that identify recurring technical themes and prioritize long-term remediation.
- Implement trend analysis and proactive problem identification using observability data and AI.
- Track and manage problem records to closure.
Change Management
- Govern change management processes, including leading the Change Advisory Board (CAB), to ensure safe, compliant, and low-risk deployments.
- Define and enforce change policies, risk assessments, and approval workflows.
- Drive continuous improvement in release and deployment practices.
Observability & Service Reliability
- Maintain a strong understanding of system architecture and monitoring strategies, identifying gaps and opportunities for improvement.
- Partner with engineering teams to improve system resilience and performance.
- Reduce alert fatigue by improving signal-to-noise ratio in monitoring systems.
AI-Driven Operations (AIOps)
- Leverage AI/ML for anomaly detection, predictive alerting, and automated root cause analysis.
- Implement AI-driven solutions to optimize incident response and operational workflows.
- Analyze large-scale operational data to identify patterns and recommend improvements.
- Apply experience with AIOps platforms, or with building AI-driven operational solutions, to these workflows.
Required Qualifications
- 5+ years of experience in IT Operations, SRE, or similar roles.
- Strong expertise in Incident, Problem, and Change Management (ITIL or similar frameworks).
- Proven experience in governing and optimizing operational processes.
- AI & Data Expertise: Strong knowledge of AI/ML concepts, including anomaly detection, predictive analytics, and data modeling.
- AIOps Experience: Hands-on experience with AIOps platforms or building AI-driven operational solutions (event correlation, alert prioritization).
Preferred Qualifications
- ITIL certification (Foundation or higher).
- Proficiency with platforms such as Jira, ServiceNow (SNOW), FireHydrant, and Moogsoft.
- Experience working in high-availability, large-scale environments.