Application deadline: We accept submissions until 15 January 2026. We review applications on a rolling basis and encourage early submissions.

ABOUT THE OPPORTUNITY

We’re looking for Full-stack Software Engineers who are excited to build tools for frontier AGI safety research, such as evals libraries and tools for monitoring and controlling our own LLM traffic.

REPRESENTATIVE PROJECTS

Your main objective is to develop tooling for analyzing model evaluation results. Here is a list of features that you might build and ship in your first 6 months:

- LLM-powered search that finds interesting fragments in evaluation transcripts

- Comparison views that show how conversations and scores differ between two evaluation runs

- Ability to view and analyze conversations with coding agents (Cursor, Claude Code, etc.) in addition to evaluation transcripts

- Results streaming for evaluations that are currently being run

- Collaborative editing of evaluation logs that automatically updates metrics and other derived data. Think of this as developing an “IDE for evaluations”.

Beyond this, here are examples of auxiliary projects you might take on:

- Automated evaluation pipelines that minimize the time between getting access to a new model for pre-deployment testing and analyzing and sharing the most important results

- LLM agents and MCP tools to automate internal software engineering and research tasks, with sandboxes to prevent major failures

- Telemetry API and instrumentation of our existing tools, allowing us to monitor usage and improve reliability

- Upstream improvements to the Inspect framework and ecosystem, e.g. support for evaluating modern agentic scaffolds
