Principal Engineer II
About the role
Platform Infrastructure Engineering is responsible for building and operating Menlo Security's Infrastructure Platform. Together with the rest of our engineering teams, we enable our customers to connect to the Internet without compromise. Our environment provides services globally. We expect failure, build security in by design, create evolvable systems, and enable multi-tenancy across the infrastructure. Automation and the thoughtful use of Gemini and Claude AI tooling to accelerate our workflows are absolute musts for us. We are committed to getting it done properly the first time.
As a Principal II Platform Infrastructure Engineer, you'll join a group of experienced engineers who are part of a globally distributed team responsible for building and managing the company's core infrastructure services and maintaining our constantly growing platform. The team operates a sophisticated cloud-native infrastructure built on Google Kubernetes Engine and VMs spanning multiple environments globally from development to production.
Operating at the highest level of individual contribution, you will drive the technical vision for this environment. Crucially, you will draw on your expertise to guide the organization through complex architectural transformations, strategically decoupling legacy monolithic systems into scalable, highly resilient cloud-native microservices.
Responsibilities
- Architectural Leadership: Define the long-term architectural roadmap. Design, deploy, and maintain VM and Kubernetes infrastructure on GCP and AWS across dozens of clusters spanning development, staging, and production environments in multiple regions.
- Architectural Transformation: Lead the strategic modernization of our services, acting as the primary architectural guide for development teams navigating the complex transition from monolithic architectures to decoupled microservices.
- Strategic Infrastructure as Code (IaC): Build and maintain IaC using Terraform modules, managing resources through Spacelift or equivalent Terraform Automation and Collaboration Software (TACOS). Provision cloud infrastructure including networking, compute, storage, and security components primarily on GCP, with secondary AWS support. Implement and manage workflows with sophisticated multi-layer configuration management.
- Cross-Functional Leadership: Partner with Engineering, Product, Compliance, and Security teams to design resilient, scalable systems. Consult on capacity planning, disaster recovery, and architectural decisions for cloud-native applications.
- Next-Generation Observability: Build and maintain comprehensive observability solutions using Grafana Cloud, Prometheus/Mimir, and OTel collectors. Design Grafana dashboards, configure alerting rules, and ensure visibility across all platform components.
- Advanced Networking & Security: Manage certificate lifecycle, DNS automation, ingress controllers, and service mesh networking with Cilium.
- Engineering Excellence: Identify and eliminate toil through automation and the use of modern AI tools such as Gemini and Claude. Write scripts, develop tools, and build CI/CD pipelines to improve operational efficiency and reduce manual work.
- Operational Resilience: Participate in a 24x7 on-call rotation as part of a globally distributed team, responding to incidents and driving post-incident reviews.
Requirements
- Education & Experience: Bachelor's degree in Computer Science or a similar technical field, or equivalent practical experience.