The intern will design, evaluate and push to the state of the art Windmill's internal agentic loop for generating scripts, flows and full-stack apps - and build the benchmarking system that measures its progress. The work tackles several open questions: how to objectively evaluate a generated workflow or app beyond "it compiles" (functional tests, end-to-end execution, UX quality, semantic correctness); how an agent should decompose a natural-language specification into coherent atomic steps; how to efficiently inject Windmill-specific context (hub, types, resource schemas) without saturating the context window; how to exploit execution feedback for self-correction; how to keep a dependency graph of scripts, flows and apps coherent across iterative multi-file edits; and how to detect hallucinations, silent regressions and "fake successes" where tests pass for the wrong reasons.
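To make the self-correction question concrete, below is a minimal sketch of an execution-feedback loop; every identifier in it (generateScript, runInSandbox, selfCorrect) is a hypothetical placeholder, not an existing Windmill API, and it only illustrates the idea of feeding sandbox results back into the next generation attempt.

// Minimal sketch of an execution-feedback self-correction loop (illustrative only;
// generateScript and runInSandbox are hypothetical placeholders, not Windmill APIs).

interface ExecutionResult {
  success: boolean;
  stderr: string;
  failingTests: string[];
}

interface Attempt {
  code: string;
  result: ExecutionResult;
}

// Hypothetical model call: produces candidate code from the spec plus prior failures.
declare function generateScript(spec: string, history: Attempt[]): Promise<string>;

// Hypothetical sandbox: runs the candidate against the task's tests.
declare function runInSandbox(code: string): Promise<ExecutionResult>;

async function selfCorrect(spec: string, maxIterations = 3): Promise<Attempt[]> {
  const history: Attempt[] = [];
  for (let i = 0; i < maxIterations; i++) {
    // Each retry sees the stderr and failing tests of previous attempts,
    // so the model repairs its own output instead of regenerating blindly.
    const code = await generateScript(spec, history);
    const result = await runInSandbox(code);
    history.push({ code, result });
    if (result.success) break;
  }
  return history;
}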
The mission runs over 5–6 months. Phase 1 maps the existing agentic loop, reviews the literature and reproduces 2–3 reference baselines. Phase 2 builds the benchmark: a task corpus covering isolated scripts, multi-step flows and full-stack apps - inspired by real Windmill workloads from the public hub and anonymized customer workspaces - with a sandboxed execution harness, multi-criteria scoring (correctness, quality, efficiency, readability) and continuous regression tracking re-run on every agent commit; an open-source release is envisioned. Phase 3 is the core experimental work: iterating on prompts, planning strategies, tool design, retrieval and execution-feedback loops; comparing frontier models (Claude, GPT, Gemini) with open-weights alternatives (Llama, Qwen, DeepSeek); and exploring, where ROI is demonstrated, supervised fine-tuning on execution traces or RL approaches. Measured improvements ship progressively to production. Phase 4 consolidates the work into a thesis, internal documentation and - depending on timing - a workshop or conference submission (NeurIPS, ICLR, ICML, COLM).
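As an illustration of what multi-criteria scoring could look like, the sketch below shows one possible shape for a task record and a weighted aggregate; the field names and weighting scheme are assumptions of this sketch, not a settled design.

// Illustrative shape for a benchmark task and its multi-criteria score.
// Field names and the weighting scheme are assumptions, not a fixed design.

interface BenchmarkTask {
  id: string;
  kind: "script" | "flow" | "app";
  spec: string;              // natural-language specification given to the agent
  referenceChecks: string[]; // executable checks run in the sandbox
}

interface TaskScore {
  correctness: number; // share of reference checks passing after sandboxed execution
  quality: number;     // lint / type-check / static-analysis signal
  efficiency: number;  // tokens, wall-clock time and iterations spent by the agent
  readability: number; // rubric- or model-graded score on the final artifact
}

// Weighted aggregate in [0, 1] so runs can be compared and regressions tracked per commit.
function aggregate(score: TaskScore, weights: TaskScore): number {
  const keys: (keyof TaskScore)[] = ["correctness", "quality", "efficiency", "readability"];
  const total = keys.reduce((sum, k) => sum + weights[k], 0);
  return keys.reduce((sum, k) => sum + score[k] * weights[k], 0) / total;
}

// e.g. aggregate({ correctness: 1, quality: 0.8, efficiency: 0.6, readability: 0.9 },
//                { correctness: 0.5, quality: 0.2, efficiency: 0.15, readability: 0.15 })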
Expected deliverables: the Windmill benchmark (corpus, harness, tracking dashboard); an improved agentic loop shipped to production with documented progression metrics; a weekly lab notebook; the final thesis report; and possibly a publication or open-source release. The intern works directly with Ruben Fiszel (co-founder & CEO) and the Windmill R&D / AI team, with daily interaction, weekly reviews and full access to the codebase, anonymized usage data, frontier-model API budgets and GPU infrastructure for fine-tuning experiments.
Code-generation agents:
Reference benchmarks:
Limitations of these benchmarks for our use case: none covers workflow generation (step composition, branching, parallelism, state management); none tests generation of full-stack apps with interactive UI; none integrates the specifics of Windmill (type system, resources, variables, hub, multi-language runtime).
Scientific and technical challenges:
Phase 1 - Mapping & state of the art (weeks 1–3): audit of Windmill's current agentic loop (architecture, prompts, tool use); systematic review of the existing literature and benchmarks; selection and reproduction of 2–3 reference baselines.
Phase 2 - Benchmark (weeks 3–8): design of the evaluation task corpus (isolated scripts, multi-step flows, full-stack apps); design of the evaluation harness (sandboxed execution, multi-criteria scoring); setup of continuous regression tracking; envisioned open-source release of the benchmark.
Phase 3 - Improvement of the agentic loop (weeks 8–20): iterative experimentation on prompts, planning strategies, tool design, retrieval and execution feedback; comparison of frontier models with open-weights alternatives; targeted exploration of supervised fine-tuning and RL approaches; progressive production deployment.
Phase 4 - Consolidation & deliverables (weeks 20–24): writing of the thesis / final-year report; internal technical documentation; possible paper submission.
M2 / final-year student in computer science or applied mathematics. Solid programming foundations (Python, TypeScript; Rust a plus), strong interest in LLMs, agents and evaluation methodology, and an empirical, rigorous approach.
Required skills: proficiency in Python and TypeScript; a concrete understanding of how LLMs work (tokenization, context window, prompting, tool use, function calling); hands-on experience with at least one agentic assistant; design of controlled experiments and reproducible metrics; Git, testing, code review, CI; fluent English.
Nice-to-have: Rust; Svelte / modern frontend; fine-tuning & RL experience (SFT, DPO, RLHF, RLAIF); agent/benchmark evaluation experience; prior publication or significant open-source contribution; Docker, PostgreSQL, sandboxing, observability.
Education: Master's student (M2) or final-year student (PFE) in computer science or applied mathematics: MPRI, École Polytechnique (X), École Normale Supérieure (Ulm, Paris-Saclay, Lyon), Télécom Paris, CentraleSupélec, Mines, ENSIMAG, EPITA, 42, EPFL, or equivalent.
This is a 6-month internship, with a permanent contract (CDI) offered upon successful completion.