Staff Software Engineer, MDLC
About the role
Who we are
At Domino, we build software that helps the largest AI-driven organizations build and operate advanced data science and AI solutions at scale. Our platform integrates a streamlined model development environment, MLOps capabilities, and novel features for collaboration, reuse, and reproducibility, all of which make data science teams more productive, reduce time to value, and ensure compliance. Our customers, including Johnson & Johnson, GSK, Bristol Myers, UBS, FINRA, and the US Navy, use our software to solve some of the most important challenges in the world, such as developing new medicines, securing our financial markets, or protecting our country. Backed by Sequoia Capital, Coatue Management, NVIDIA, Snowflake, and other leading investors, we have been in business for a decade but are still a small team operating with the spirit of a startup. Especially in the world of AI today, we believe that the future is still being invented, and we want to be the ones building it.
What we are building
The Model Development Lifecycle Team is building a cutting-edge platform to simplify the entire machine learning journey. Our platform supports:
- Seamless API Integration: Deploy models as APIs for consistent use across applications
- Collaboration and Discoverability: Use our model registry to version, store, and easily find models across the organization
- Scalable Training Resources: Leverage GPUs and distributed frameworks like Ray and Spark to meet the needs of diverse AI projects
What your impact will be
In your first year, you will:
- Integrate model monitoring to provide a holistic view of deployment health and performance
- Enhance tagging capabilities across Domino entities to improve discoverability and tracking
- Expand LLM hosting capabilities to address customer needs for scale, performance, and logging
- Innovate within our Domino Apps offering by incorporating feature requests from major customers
What we look for in this role
- Building Scalable Systems: Hands-on experience developing and managing high-performance back-end systems in distributed computing environments
- Collaboration Across Teams: Working closely with cross-functional teams to integrate systems with front-end interfaces and third-party services
- API Development: Designing and implementing secure, scalable APIs (e.g., RESTful APIs, gRPC)
- Performance Optimization: Profiling and optimizing back-end performance, especially in cloud environments or with container technologies like Docker and Kubernetes
- Testing and CI/CD: Using robust testing frameworks (unit, integration, end-to-end) and setting up CI/CD pipelines
- ML Model Deployment: Familiarity with model registries, versioning, and lifecycle management tools like MLflow or Kubeflow is a big plus
- Distributed Computing: Experience with frameworks and managed ML platforms like Apache Spark, Azure ML, or SageMaker is a plus
- Cloud Platforms: Proficiency with cloud providers (AWS, Azure, GCP) and deploying services in these environments
- Back-End Development: Expertise in languages such as Python, Java, Scala, or Go