Sayari

Staff Applied Scientist - AI Evaluation & Trust

Data, full-time, Remote - US

SALARY

Not listed

WORK TYPE

remote

JOB TYPE

full-time

About the role

About Sayari

Sayari is the judgment infrastructure for trustworthy AI in economic security and commercial risk. The Sayari Commercial World Model resolves 11.7B+ primary-source records from 250+ jurisdictions, forming the ground truth of global commerce. A Judgment Ontology, encoding over a decade of investigative tradecraft, and Superconductor, an agentic orchestration platform, deliver AI that reasons like an expert analyst, shows its work, and traces every finding to its source. Trusted by U.S. Customs and Border Protection, HM Revenue & Customs, and Fortune 500 enterprises, Sayari is used by thousands of professionals across 35+ countries to secure supply chains and dismantle illicit networks. Sayari is headquartered in Washington, D.C., with offices in London, Singapore, Tokyo, and Tel Aviv.

Position Description

Sayari builds AI systems for high-consequence analytical work where being 'wrong' carries real-world weight. We are looking for a Staff or Principal Applied Scientist to join our AI Innovation Group as the trusted expert on AI Evaluation and Trust. You will own the 'Judgment Layer' of our system: building the specialized judge models, statistical benchmarks, and multi-turn frameworks that ensure our agents meet the high bar of trustworthiness required by our national security and enterprise customers.

Job Responsibilities

Lead the development of specialized 'judge models,' moving from general-purpose frontier models to architectures purpose-built for evaluation and failure mode detection.
Design and execute rigorous scoring pipelines and empirical threshold calibrations for agentic systems, including multi-turn conversation and Graph RAG reasoning.
Establish domain-specific evaluation frameworks that measure whether a system can perform the work of human experts rather than just passing general capability benchmarks.
Own the full lifecycle of evaluation data, from designing annotation infrastructure and protocols to deploying evaluation services into production.
Research and implement advanced techniques in Mixture-of-Experts (MoE) routing, expert specialization evaluation, and ensemble calibration.
Collaborate cross-functionally with Product, Data Engineering, and the SVP of AI to translate complex statistical uncertainty into clear, actionable product signals.
Act as a technical leader and 'Scientific Conscience' within the AI pod, ensuring every AI-driven risk signal is backed by an empirical derivation story.

Skills & Experience

Required:

10+ years of machine learning experience with a focus on deep neural networks and on evaluating model performance and trust.
1-2+ years of experience focused on post-training activities.
1+ years of experience creating benchmarks to evaluate LLMs.
Technical Mastery: Deep expertise in LLM-as-judge architectures, multi-turn evaluation, and Reinforcement Learning (RL/RLHF/RLAIF).
Statistical Rigor: Mastery of statistics and experimental design, including significance testing, distribution analysis, and inter-rater reliability.
Architectural Depth: Experience with Mixture-of-Experts (MoE) systems, routing behavior, and expert specialization.
Builder Mindset: Proven ability to own the path from data collection to production deployment; we are a small team and every role is 'hands-on.'
Domain Fluency: Understanding of Graph RAG and the unique challenges of evaluating non-deterministic, agentic workflows.

Preferred:

Judgment Task Models: Experience building specialized models for judgment tasks, such as factuality verification, safety classification, or reward modeling.
