Backblaze

Site Reliability Engineer I

Engineering | Full-time | Remote - US

SALARY

Not listed

WORK TYPE

remote

JOB TYPE

full-time

INDUSTRY

general

About Backblaze

Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we’re helping customers break free from the restrictive, overpriced legacy solutions that hold them back, and blaze forward with the full power of the open cloud in their hands.

Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we completed a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $100 million in revenue and is the leading specialized storage cloud, managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries, including businesses, developers, IT professionals, and individuals.

About the role

We are looking for a Site Reliability Engineer I to help support the stability, health, and day-to-day operations of Backblaze’s infrastructure. This role serves as a first line of response for customer-affecting issues and production alerts, helping drive timely incident resolution, maintain service reliability, and support operational readiness across our environments. You will work closely with TechOps, Data Center Technicians, and other cross-functional teams to troubleshoot issues, monitor system health, support deployments and migrations, and improve day-to-day operational processes through documentation and automation. The ideal candidate is technically curious, calm under pressure, eager to learn, and excited to grow in a hands-on infrastructure and reliability role.

What You’ll Do

Act as the first point of contact for all customer-affecting issues.
Be a key driver in managing the resolution of technical problems.
Ensure that incident management processes are followed and that incident post-mortems are completed to capture process deviations and areas for improvement.
Deliver consistent communication to management.
Monitor Zabbix regularly and respond to alerts, either by taking direct action or by escalating. Acknowledge every alert, noting the direct action taken or the escalation point of contact.
Make sure escalations are handed off successfully.
Ensure the health of pods across all sites (define pod alerts in Zabbix).
Work through daily filesystem checks for pods.
Troubleshoot technical issues for DC Techs: advanced pod questions, deployment questions, migration troubleshooting, and Ansible playbook issues.
Identify and escalate any potential network issues.
Perform Vault pre-deployment configuration and testing.
Start Vault migrations, monitor migration pods, and handle applicable migration pod health checks.
Document and work on automating daily items.
Document and provide network IPs for upcoming deployments.
Monitor releases and updates to the server farm, and escalate issues as they arise.
Participate in on-call rotation shifts.
Assist fellow TechOps team members in handling tasks.
Make recommendations for improvements in organizational productivity.
Be able to work outside normal business hours (weekend shifts, holidays, and evenings) as needed.

The Right Fit

2-4 years of relevant experience.
Knowledge of Linux and system administration.
Desire to learn and develop all necessary technical skills.
Strong analytical thinking.
Strong communication skills and the ability to work with different teams.
Knowledge of networking.
