Windmill

Agentic Code-Generation Loop Research Intern

Paris, France - Internship

The intern will design, evaluate and push to the state of the art Windmill's internal agentic loop for generating scripts, flows and full-stack apps - and build the benchmarking system that measures its progress. The work tackles several open questions: how to objectively evaluate a generated workflow or app beyond "it compiles" (functional tests, end-to-end execution, UX quality, semantic correctness); how an agent should decompose a natural-language specification into coherent atomic steps; how to efficiently inject Windmill-specific context (hub, types, resource schemas) without saturating the context window; how to exploit execution feedback for self-correction; how to keep a dependency graph of scripts, flows and apps coherent across iterative multi-file edits; and how to detect hallucinations, silent regressions and "fake successes" where tests pass for the wrong reasons.
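To fix ideas, here is a minimal sketch of such a generate-execute-correct loop; every name below is a hypothetical placeholder, not Windmill's actual implementation:

```python
# Minimal generate-execute-correct loop (illustrative sketch; all names
# are hypothetical placeholders, not Windmill's internal implementation).
from dataclasses import dataclass

@dataclass
class ExecutionResult:
    ok: bool
    stderr: str

def generate(spec: str, feedback: list[str]) -> str:
    """Call an LLM with the spec plus accumulated error feedback."""
    raise NotImplementedError  # wire to the model API of your choice

def run_sandboxed(code: str) -> ExecutionResult:
    """Execute the candidate script in an isolated sandbox."""
    raise NotImplementedError  # e.g. a container or a subprocess jail

def agentic_loop(spec: str, max_iters: int = 5) -> str | None:
    feedback: list[str] = []
    for _ in range(max_iters):
        candidate = generate(spec, feedback)
        result = run_sandboxed(candidate)
        if result.ok:
            return candidate  # a candidate that executed cleanly
        # Self-correction: feed the error trace into the next attempt.
        feedback.append(result.stderr)
    return None  # iteration budget exhausted without success
```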

The mission runs over 5–6 months. Phase 1 maps the existing agentic loop, reviews the literature and reproduces 2–3 reference baselines. Phase 2 builds the benchmark: a task corpus covering isolated scripts, multi-step flows and full-stack apps - inspired by real Windmill workloads from the public hub and anonymized customer workspaces - with a sandboxed execution harness, multi-criteria scoring (correctness, quality, efficiency, readability) and continuous regression tracking re-run on every agent commit; an open-source release is envisioned. Phase 3 is the core experimental work: iterating on prompts, planning strategies, tool design, retrieval and execution-feedback loops; comparing frontier models (Claude, GPT, Gemini) with open-weights alternatives (Llama, Qwen, DeepSeek); and exploring, where ROI is demonstrated, supervised fine-tuning on execution traces or RL approaches. Measured improvements ship progressively to production. Phase 4 consolidates the work into a thesis, internal documentation and - depending on timing - a workshop or conference submission (NeurIPS, ICLR, ICML, COLM).

Expected deliverables: the Windmill benchmark (corpus, harness, tracking dashboard); an improved agentic loop shipped to production with documented progression metrics; a weekly lab notebook; the final thesis report; and possibly a publication or open-source release. The intern works directly with Ruben Fiszel (co-founder & CEO) and the Windmill R&D / AI team, with daily interaction, weekly reviews and full access to the codebase, anonymized usage data, frontier-model API budgets and GPU infrastructure for fine-tuning experiments.

State of the art

Code-generation agents:

  • Inline assistants: Copilot, Cursor, Codeium - local completion and editing, short context
  • Autonomous agents: Claude Code, Aider, SWE-agent, OpenHands, Devin - planning, execution, self-correction
  • RL / fine-tuning approaches: AgentCoder, Reflexion, Self-Refine, agent tuning on execution traces
  • Retrieval methods: RAG over documentation, code embeddings, graph-RAG

Reference benchmarks:

  • SWE-bench / SWE-bench Verified - resolving GitHub issues (Python); now largely saturated by frontier models
  • HumanEval, MBPP, APPS, BigCodeBench - generation of isolated functions
  • LiveCodeBench - anti-contamination, temporally controlled tasks
  • WebArena, AppWorld - agents on simulated environments
  • TAU-bench, AgentBench - agent evaluation with tool use

Limitations of these benchmarks for our use case: none covers workflow generation (step composition, branching, parallelism, state management); none tests generation of full-stack apps with interactive UI; none integrates the specifics of Windmill (type system, resources, variables, hub, multi-language runtime).

Scientific and technical challenges:

  1. Evaluation: how to objectively measure the quality of a generated workflow or app, beyond mere "it compiles / it passes a unit test"? (a scoring sketch follows this list)
  2. Decomposition: how should an agent break a natural-language specification into coherent atomic scripts/steps?
  3. Contextualization: how to efficiently feed the agent with Windmill context without exploding the context window?
  4. Iteration loop: how to optimally exploit execution feedback for self-correction?
  5. Multi-file editing: coherent management of a dependency graph between scripts, flows and apps during iterative editing.
  6. Robustness: detection of hallucinations, silent regressions, and "fake successes."
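To make challenges 1 and 6 concrete, here is one possible shape for a multi-criteria score and a mutation-based "fake success" probe - a sketch under assumed weights and names, not a settled design:

```python
# Illustrative multi-criteria scoring plus a mutation-based probe for
# "fake successes" (weights, names and structure are all assumptions).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scores:
    correctness: float  # fraction of functional tests passed, in [0, 1]
    quality: float      # e.g. lint / static-analysis score, in [0, 1]
    efficiency: float   # e.g. normalized runtime or cost, in [0, 1]
    readability: float  # e.g. rubric-based or model-graded, in [0, 1]

def aggregate(s: Scores, w=(0.5, 0.2, 0.15, 0.15)) -> float:
    """Weighted aggregate; the weights are a starting point to calibrate."""
    parts = (s.correctness, s.quality, s.efficiency, s.readability)
    return sum(wi * pi for wi, pi in zip(w, parts))

def suite_detects_mutant(run_tests: Callable[[str], bool],
                         mutant: str) -> bool:
    """Probe for fake successes: if the test suite still passes on a
    deliberately broken mutant of the candidate, a green run proves
    little, so flag the suite (and the task) as too weak."""
    return not run_tests(mutant)
```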

Work plan (5–6 months)

Phase 1 - Mapping & state of the art (weeks 1–3): audit of Windmill's current agentic loop (architecture, prompts, tool use); systematic review of the existing literature and benchmarks; selection and reproduction of 2–3 reference baselines.

Phase 2 - Benchmark (weeks 3–8): design of the evaluation task corpus (isolated scripts, multi-step flows, full-stack apps); design of the evaluation harness (sandboxed execution, multi-criteria scoring); setup of continuous regression tracking; open-source release of the benchmark envisioned.
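As an illustration, a corpus entry and its harness contract could look roughly like the sketch below (all field and function names are Phase 2 assumptions, not a fixed schema):

```python
# Sketch of a benchmark task record and its harness contract (all field
# and function names are Phase 2 assumptions, not a fixed schema).
from dataclasses import dataclass, field

@dataclass
class BenchTask:
    task_id: str
    kind: str                  # "script" | "flow" | "app"
    spec: str                  # the natural-language specification
    language: str              # e.g. "python", "typescript"
    functional_tests: list[str] = field(default_factory=list)
    resources: dict[str, str] = field(default_factory=dict)  # mocked schemas

@dataclass
class TaskResult:
    task_id: str
    tests_passed: int
    tests_total: int
    wall_time_s: float

def evaluate(task: BenchTask, artifact: str) -> TaskResult:
    """Run the generated artifact against the task's tests in a sandbox
    and return the measurements fed to the tracking dashboard."""
    raise NotImplementedError
```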

Phase 3 - Improvement of the agentic loop (weeks 8–20): iterative experimentation on prompts, planning strategies, tool design, retrieval, execution feedback; comparison of frontier models vs open-weights; targeted exploration of supervised fine-tuning and RL approaches; progressive production deployment.
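One hedged way to drive the model comparison off the Phase 2 harness is a simple configuration sweep; the model names and the runner below are placeholders:

```python
# Illustrative experiment grid over models and prompting strategies;
# run_benchmark is assumed to wrap the Phase 2 evaluation harness.
from itertools import product

MODELS = ["frontier-a", "frontier-b", "open-weights-a"]  # placeholders
STRATEGIES = ["direct", "plan-then-code", "plan-plus-exec-feedback"]

def run_benchmark(model: str, strategy: str) -> dict:
    """Run the full task corpus once under this configuration."""
    raise NotImplementedError

def sweep() -> list[dict]:
    results = []
    for model, strategy in product(MODELS, STRATEGIES):
        summary = run_benchmark(model, strategy)
        results.append({"model": model, "strategy": strategy, **summary})
    return results  # feeds the continuous regression tracking
```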

Phase 4 - Consolidation & deliverables (weeks 20–24): writing of the thesis / final-year report; internal technical documentation; possible paper submission.

Who we're looking for

M2 / final-year student in computer science or applied mathematics. Solid programming foundations (Python, TypeScript; Rust a plus), strong interest in LLMs / agents / evaluation methodology, and an empirical, rigorous approach.

Required skills: proficiency in Python and TypeScript; concrete understanding of how LLMs work (tokenization, context window, prompting, tool use, function calling); hands-on experience with at least one agentic assistant; design of controlled experiments and reproducible metrics; Git, testing, code review, CI; fluent English.

Nice-to-have: Rust; Svelte / modern frontend; fine-tuning & RL experience (SFT, DPO, RLHF, RLAIF); agent/benchmark evaluation experience; prior publication or significant open-source contribution; Docker, PostgreSQL, sandboxing, observability.

Education: Master’s student (M2) or final-year student (PFE) in computer science or applied mathematics: MPRI, École Polytechnique (X), École Normale Supérieure (ENS) (Ulm / Paris-Saclay / Lyon), Télécom Paris, CentraleSupélec, Mines, ENSIMAG, EPITA, 42, EPFL, or equivalent.

🚀 Y Combinator Company Info

Y Combinator Batch: S22
Team Size: 7 employees
Industry: B2B Software and Services -> Infrastructure
Company Description: Open-source platform to turn scripts into internal apps & workflows

💰 Compensation

Salary Range: $2,000 - $3,000

📋 Job Details

Job Type: Internship
Engineering Type: Machine learning
Time to Hire: 5

🛠️ Required Skills

Prompt Engineering, Git, TypeScript, Deep Learning, Natural Language Processing

🎯 Interview Process

  1. Apply here or email [jobs@windmill.dev](mailto:jobs@windmill.dev)
  2. 30-minute interview with the founder
  3. 1-hour case study with a team member
  4. You're hired

This is a 6-month internship with a permanent contract (CDI) offered upon successful completion.