Runpod
Manager, HPC Storage Engineer
Engineering | Full-time | Remote, USA
Salary: Not listed
Work type: Remote
Job type: Full-time
Industry: AI
About the role
Responsibilities
- Own Distributed Storage Architecture: Define, evolve, and operate Runpod’s global storage platforms, supporting training, inference, checkpointing, and dataset access at scale.
- Build the Storage Engineering Team: Manage and grow a team of storage and systems engineers. Set clear ownership, technical direction, and operational standards across regions.
- High-Performance Shared Filesystems: Design and operate large-scale SAN and NFS deployments, including performance-sensitive shared storage for GPU clusters.
- Advanced Filesystems & Platforms: Lead the deployment and operation of VAST Data, and bring experience with Lustre or similar parallel filesystems used in HPC and AI environments.
- End-to-End Performance Ownership: Drive performance optimization from NAND and NVMe media through controllers, networking, and client access patterns.
- Next-Generation Storage Technologies: Evaluate and deploy cutting-edge capabilities such as NFS over RDMA, GPUDirect Storage (GDS), and low-latency data paths for accelerated workloads.
- Reliability & Scale: Establish best practices for replication, data tiering, data protection, failure recovery, capacity planning, and lifecycle management.
- Automation & Observability: Build automation for provisioning, expansion, upgrades, and monitoring. Ensure deep observability into throughput, latency, and error characteristics.
- Cross-Functional Collaboration: Partner with Datacenter Networking, GPU Platform, SRE, and Product teams to ensure storage systems meet evolving workload and customer needs.
- Vendor & Partner Management: Own technical relationships with storage vendors, hardware partners, and colocation providers; drive roadmap alignment and issue resolution.
Requirements
- Engineering Leadership Experience: 3+ years managing storage, systems, or infrastructure engineering teams in production environments.
- Distributed Storage Expertise: 8+ years designing and operating large-scale storage systems, including SAN and NFS architectures at multi-petabyte scale.
- VAST Data Experience: Hands-on experience deploying, operating, or deeply integrating VAST Data in production.