← Back to jobsApply for this position
Reddit
Staff Research Engineer, Post-training & Evaluation
engineeringfull-timeRemote - United States
SALARY
Not listed
WORK TYPE
remote
JOB TYPE
full-time
INDUSTRY
ai
✦ AutoApply Let us apply to roles like this on your behalf.
Learn more
About the role
Responsibilities
- Define the "Reddit Benchmark" evaluation standard: Own the methodology — not just the harness — for rigorously measuring model quality across Safety, Reasoning, representation/retrieval, and Reddit-specific knowledge. Decide what "Reddit-native" means in measurable terms and set the bar the org trains against.
- Own evaluation reliability and statistical rigor: Establish the science behind trustworthy evals — judge variance, multi-sample scoring, inter-rater/inter-sample agreement, sampling and temperature effects, and calibration of automated judges. You are accountable for whether a benchmark delta is real or noise. Drive the practice of evaluation as a release gate — offline against frozen datasets, and pre-merge in CI/CD — so regressions are caught before endpoints ship.
- Design model-as-a-judge methodology: Own judge selection, prompt design, calibration, and reliability for automated evaluation using frontier external models, enabling rapid, trustworthy iteration cycles.
- Set post-training recipes and strategy: Design SFT recipes (data mixtures, curriculum, ablation strategy) that convert base models into helpful, well-aligned endpoints; partner with engineering to scale them.
- Evaluate base and CPT checkpoints, not just endpoints: Design checkpoint-selection methodology across CPT runs to understand model capability evolution and catch regressions early.
✦ Let us apply for you
We find roles like this and apply on your behalf. Cover letter written for each one. Plans from $14.99/mo. Cancel anytime.
Join waitlist