Infrastructure Engineer (Storage)
About the Role
What We Are Looking For
Lightning AI is seeking a Storage Infrastructure Engineer to join our Infrastructure Engineering team.
In this role, you will focus on building and operating the storage systems that power large-scale AI/ML training, inference, and HPC workloads. You will work at the intersection of software, hardware, and operations: developing automation, improving reliability, and scaling distributed storage systems across our bare-metal infrastructure.
You will help own the data plane of our storage infrastructure, supporting high-throughput, low-latency data access for some of the most demanding AI workloads. You’ll play a key role in managing and evolving our storage stack (including VAST and S3-compatible systems like Ceph), ensuring performance, reliability, and efficiency at scale.
What You'll Do
Storage Systems & Infrastructure
- Operate and scale distributed storage systems, including VAST and S3-compatible object storage (e.g., Ceph)
- Improve performance, reliability, and efficiency of storage systems supporting large-scale AI/ML workloads
- Troubleshoot complex storage and data path issues across hardware and software layers
- Optimize storage performance to support high-throughput, low-latency AI training and inference workloads
Automation & Tooling
- Build and maintain automation for provisioning, managing, and monitoring storage infrastructure
- Develop Python-based tools and workflows to reduce manual operational overhead
- Improve lifecycle management of storage clusters, from deployment through maintenance and scaling
Systems & Operations
- Manage and operate Linux-based systems in production, including bare-metal environments
- Partner with infrastructure and data center teams on hardware bring-up, upgrades, and issue resolution
- Support capacity planning, utilization tracking, and forecasting for storage systems
- Leverage monitoring and telemetry to diagnose issues and improve system performance and reliability
Cross-Functional Collaboration
- Work closely with Infrastructure Engineering, Network Engineering, and Platform teams to integrate storage into the broader platform
- Contribute to design discussions around new infrastructure deployments and scaling strategies
- Help define best practices for operating storage systems in high-performance computing environments
What You'll Need
Required Qualifications
- 5+ years