Mercury

Senior Software Engineer - Mercury Command

Engineering · Full-time · San Francisco, CA; New York, NY; Portland, OR; or Remote within Canada or the United States

Salary: Not listed
Work type: Remote
Job type: Full-time
Industry: Fintech

About the role

Ship new capabilities users love:

Design and ship new Command skills, the domain-specific instruction sets that teach the model how to handle workflows like sending money, managing invoices, and understanding cash flow
Design and build agentic workflows in Command, defining the architecture for how multi-step agent interactions should work as we extend what the product can do on a customer's behalf
Work with backend teams to define tool schemas for new capabilities, shaping the data contracts between Mercury's business logic and the model
Own new capabilities end to end, from the system prompt to the frontend component that renders the response

Own the LLM layer:

Maintain and evolve Command's prompt architecture: the system prompt, skill loading system, session context, and the policy and compliance layers underneath
Tune model behavior: reasoning effort, prompt caching strategy, fallback chains, and the streaming patterns that make the product feel fast
Stay current with how models are evolving and apply what you learn to how Command is built

Build quality in:

Write and expand Command's eval harness, adding cases that cover new capabilities and scoring rubrics that detect regressions before users do
Partner with product and compliance teams to define what "working correctly" means for each new capability, then build the tests that prove it
Own the reliability and quality of what you ship, from initial design through post-launch monitoring

The ideal candidate:

Has 7 or more years of software engineering experience, with deep technical expertise building and scaling LLM-powered applications in production
Has gone beyond shipping a first version: has scaled an LLM-powered product, dealt with the reliability and performance problems that come with real usage, and improved it over time
Has experience designing agentic systems and has opinions about how to architect multi-step workflows that are reliable, explainable, and safe to run on behalf of real users
Has built eval infrastructure and can write cases that actually measure whether the product works, not just whether the model outputs something plausible
Understands the real tradeoffs in LLM deployments: latency, cost, compliance, and what breaks in production that doesn't show up in demos
Has opinions about what makes an AI product trustworthy, not just impressive, and can build toward that bar
Is comfortable with TypeScript and willing to learn Haskell for