Senior Staff Machine Learning Engineer, Data & Eval
About the role
The Community You Will Join:
AI and ML are at the heart of the Airbnb product. From Trust to Payments, and from Customer Service to Marketing, we rely on ML to ensure that guests and hosts have the best possible experience with Airbnb.
The Core ML team is responsible for driving CSxAI (Customer Support x Artificial Intelligence) initiatives by adopting Generative AI technologies to enable an intelligent, scalable, and exceptional service experience. The team develops and enhances AI models, ML services, and tools including LLM fine-tuning and optimization, RAG/Search, LLM evaluation and testing automation, feedback-based learning, and guardrails for a wide range of applications at Airbnb.
The richness of Airbnb's data, the complexity of its marketplace, and the variety innate in our product mean that we need to operate at the state of the art of AI practice. We are committed to long-term innovation to solve complex problems, and to do that we need experienced ML
The Difference You Will Make:
In this Senior Staff role, you will set technical direction and lead execution for ML evaluation and the end-to-end data flywheel powering CSxAI products (e.g., assistive agents, issue resolution, and tooling). Your work will define how we measure quality, how we turn feedback into learning signals, and how we continuously improve models and products safely and efficiently. You will partner closely with product, engineering, design, and operations to build evaluation systems that are trusted, scalable, and actionable, connecting offline metrics to online outcomes.
A Typical Day:
- Define evaluation strategy and success metrics for GenAI systems, aligning offline evaluation with online business and customer experience outcomes.
- Build and scale evaluation frameworks (golden sets, synthetic data, automated regressions, rubric-based grading, LLM-as-judge where appropriate) with strong controls for bias, drift, and reliability.
- Design the data flywheel: instrumentation, feedback collection, data quality checks, labeling strategy, dataset versioning, and governance to support continuous improvement.
- Lead cross-functional quality initiatives across product, ops, and engineering, driving clarity on what “good” looks like and how teams act on evaluation results.
- Develop and productionize pipelines for dataset creation, model monitoring, evaluation-at-scale, and continuous testing (pre-deploy and post-deploy).
- Drive technical decisions and architecture for evaluation and data infrastructure, balancing speed, rigor, cost, and safety.
Minimum Qualifications:
- Educational Background: PhD in Computer Science, Mathematics, Statistics, or related technical field (or equivalent practical experience).
- Industry Experience: 10+ years building, testing, and shipping ML/AI systems end-to-end, including 2+ years of experience with GenAI/LLM systems in production.
- Leadership Experience: 5+ years leading large, ambiguous technical initiatives as a senior IC, influencing roadmap and engineering/science direction across teams.
- Technical Proficiency: Deep expertise in evaluation methodology