Senior Site Reliability Engineer
About the role
THE COMPANY:
Juul Labs's mission is to transition the world’s billion adult smokers away from combustible cigarettes, eliminate their use, and combat underage usage of our products. We have the opportunity to address one of the world’s most intractable challenges through a commitment to exceptional quality, research, design, and innovation. Backed by leading technology investors, we are committed to the same excellence when it comes to hiring great talent.
We are a diverse team united by this common purpose, and we are hiring the world’s best engineers, scientists, designers, product managers, operations experts, and customer service and business professionals. If the opportunity to build your career here is compelling, read on for more details.
ROLE AND RESPONSIBILITIES:
A Senior Site Reliability Engineer (SRE) is expected to own the operational stability and performance of Juul’s hybrid cloud infrastructure (Nutanix, AWS/GCP). This involves leading automation efforts, architecting for reliability, and acting as the final escalation point for critical incidents to ensure the platform is scalable and efficient.
Nutanix Platform Management
- Design, deploy, and maintain enterprise-scale Nutanix AHV clusters and Prism Central for multi-cluster management
- Apply expert-level proficiency with the Nutanix CLIs (nCLI and aCLI) to advanced operations, troubleshooting, and automation
- Develop automation scripts using Nutanix REST APIs, Python SDK, PowerShell, and Terraform for infrastructure-as-code
- Create and manage VM templates, golden images, and standardized deployment catalogs for consistent provisioning
- Design disaster recovery solutions using Leap, Protection Domains, cross-cluster replication, and metro clustering
- Implement network micro-segmentation using Nutanix Flow and configure RBAC, encryption, and security hardening
- Lead L3 troubleshooting using advanced diagnostics, log analysis (CVM, Genesis), NCC health checks, and cluster service resolution
- Configure high availability, VM affinity rules, QoS policies, and optimize performance for mission-critical workloads
- Manage AHV networking with OVS bridges, VLANs, bonds, and LACP, and implement resource reservations and workload balancing
- Design, deploy, and maintain hybrid cloud infrastructure across Nutanix HCI, AWS, and GCP platforms
- Architect and implement multi-cloud solutions ensuring high availability, scalability, and disaster recovery
Cloud Platform Engineering
- Architect and deploy enterprise-scale, highly available multi-cloud solutions across AWS and GCP with multi-region/multi-account strategies
- Apply expert-level proficiency with the AWS CLI, gcloud CLI, cloud SDKs (including boto3), and Python to advanced automation and infrastructure orchestration
- Design AWS Organizations and GCP Organization hierarchies with consolidated billing, IAM policies, and centralized governance
- Configure and manage AWS Systems Manager (SSM) including Session Manager, Run Command, State Manager, and Automation for centralized fleet operations
- Implement centralized logging using CloudWatch/CloudTrail and GCP Cloud Logging with S3/Cloud Storage aggregation
- Integrate AWS and GCP with Splunk using HEC, CloudWatch subscriptions, Pub/Sub, Dataflow, and cloud-specific add-ons for SIEM correlation
- Design and deploy advanced load balancing solutions with AWS ALB/NLB/ELB and GCP Cloud Load Balancing including SSL termination and auto-scaling
- Develop infrastructure-as-code using Terraform, CloudFormation, and CDK for repeatable multi-cloud deployments and CI/CD pipelines
- Configure AWS IAM Identity Center (formerly AWS SSO), cross-account IAM roles, GCP Workload Identity, and federated access for centralized identity management