Senior Site Reliability Engineer, Wikimedia Enterprise
About the role
Summary
The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to join our team, reporting to the Sr. Engineering Manager. As a Senior Site Reliability Engineer, you will play a key role in designing, developing, and maintaining reliable, scalable, and highly available infrastructure for our API services. You will contribute heavily to the high-impact challenges of innovating, building, and maintaining Wikipedia’s data feeds for high-volume reusers. In this role, you will foster cross-department collaboration with the Wikimedia Foundation SRE teams. You will own reliability targets (SLOs) for critical APIs, balancing performance, cost, and availability through data-driven decisions.
You will be involved in designing and running the infrastructure and services that interact with the base of the Wikimedia Foundation’s projects, including, but not limited to: Kubernetes clusters, application servers, code collaboration infrastructure, and other developer-facing services. You will participate in incident response and be on call. This role requires frequent work with other members of the Enterprise and Foundation SRE teams to maintain and improve our systems, as well as with people outside SRE, such as Security, Release, and Software Engineers, all striving together to move our projects and technologies forward.
Wikimedia Enterprise is a new, revenue-generating product that provides fast, comprehensive, reliable, and secure data ingestion for organizations that wish to repurpose Wikimedia/Wikipedia content in third-party environments. Wikimedia Enterprise aims to improve the user experience for Wikimedia/Wikipedia readers beyond our own websites; increase the reach and discoverability of Wikimedia/Wikipedia content; and improve awareness and ease of attribution and verifiability of Wikimedia/Wikipedia content by the organizations that reuse our content the most.
We are a distributed and diverse team of engineers with a drive to explore, experiment, and embrace technologies. We operate much like a startup within the Wikimedia Foundation: we build quickly, deploy often, and our work has a very high impact on the global knowledge ecosystem. If you are up to the challenge of working on something fast-paced, want to create services that will revolutionize the systems distributing our knowledge to billions of people across the world, and enjoy the idea of working with a globally distributed team, you might be just the person we need.
In this role, you will:
- Define, track, and improve Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to ensure reliability targets are met
- Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
- Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
- Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
- Partner with engineering team members to embed reliability best practices early in the development lifecycle
- Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
- Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
- Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
- Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement