We are looking for a Sr. Engineer to design, build, and scale the infrastructure powering NVIDIA’s AI agent ecosystem. You will work at the intersection of distributed systems, developer platforms, and agentic AI — building the foundational services that enable teams across the company to develop, deploy, orchestrate, and operate autonomous AI agents at production scale.
What you will be doing:
Build platform services that own the full agent lifecycle, from registration through deployment, execution, and teardown
Architect Kubernetes-based execution environments with pod lifecycle management, namespace isolation, persistent storage, and identity propagation
Develop and maintain automated CI/CD pipelines using GitLab CI and ArgoCD, including reusable pipeline templates and deployment blueprints that standardize how agents are built across teams
Build framework-agnostic infrastructure supporting multiple agent SDKs (Claude Code, OpenAI Codex, LangGraph), working hands-on with harnesses, lifecycle hooks, configurable skills, observability (OTEL), and memory services
Build and operate Kafka-based message pipelines and real-time event streaming using Redis Pub/Sub and Server-Sent Events (SSE)
Develop data ingestion pipelines, access interfaces, and storage layers that power AI agent knowledge and context
Implement session management for state persistence, conversation history, and agent recovery across sessions
Develop multi-layer auth using OAuth 2.0, JWT validation, token exchange, and gateway integration, and manage secrets lifecycle with Vault (provisioning, rotation, container injection)
Partner with security teams on compliance, access controls, and approval workflows for agent operations
What we need to see:
Bachelor's or Master's degree in Computer Science, Engineering, or related field (or equivalent experience), with 8+ years in software engineering — ideally in platform engineering, infrastructure, or developer tools
Experience building and scaling AI agents in production using frameworks like Claude Code, Codex, or LangGraph
Deep Kubernetes expertise including pod orchestration, persistent storage, RBAC, and multi-cluster management
Strong Python skills with production API experience using FastAPI, Flask, or similar async frameworks
Proven track record designing distributed systems with Kafka, Redis, and MongoDB or PostgreSQL
Expertise building and managing robust CI/CD pipelines using GitLab CI and ArgoCD for continuous delivery to Kubernetes
Experience designing AI data platform components (ingestion pipelines, vector stores, retrieval APIs, data preprocessing workflows) and building developer-facing platform APIs consumed by multiple engineering teams
Solid grasp of auth and identity: OAuth 2.0, JWT, token exchange, and secrets management with Vault
History of leading complex technical projects such as migrations or greenfield platform builds, with the interpersonal skills to drive alignment across teams and the ability to write clear design documents
Ways to stand out from the crowd:
Experience building or operating AI agent platforms or agentic workflow systems, with hands-on expertise in agent protocols and frameworks like MCP, A2A, LangChain, or LangGraph
Hands-on experience with RAG architectures, embedding pipelines, and vector databases (Milvus, Pinecone, or Weaviate)
Full-stack skills with React or Vue for building developer portals and dashboards
Contributions to open-source infrastructure or platform tooling
You will also be eligible for equity and benefits.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.