NVIDIA

Senior Systems Software Engineer, Kubernetes Scale - DGX Cloud

US, CA, Santa Clara Full time

The DGX Cloud organization at NVIDIA brings together cutting-edge hardware and software innovation to deliver industry-leading accelerated computing for the world’s most ambitious AI workloads. We’re a team of innovative engineers dedicated to solving some of the world’s biggest challenges, constantly driving advancements, and impacting millions of lives worldwide!
 

We are looking for an outstanding Senior Systems Software Engineer with deep experience in distributed systems, open-source technologies such as Kubernetes and containers, and a strong background in systems performance and scalability. The ideal candidate brings broad, end-to-end experience across the stack, from containers and orchestration to cloud platforms, along with the technical depth to solve complex, real-world problems. In this pivotal role, you will take on the challenge of scaling AI infrastructure while optimizing total cost of ownership, driving down cost per token to unlock the next generation of AI innovation and AI factories!

What you'll be doing:

  • Drive deep, end-to-end performance and scale characterization across the DGX Cloud software stack, fearlessly chasing issues from high-level software all the way down to the metal.

  • Collaborate with AI researchers, developers, and customers to develop tests that simulate user workloads through comprehensive end-to-end automation, employing custom-built and open-source tools and frameworks.

  • Dig into performance and scale issues to discover their root causes in complex distributed systems.

  • Design and develop monitoring and reporting tools for performance and scale testing and analysis.

  • Actively engage with upstream communities to validate performance and scalability early, shaping design and development decisions from the outset.

  • Triage, debug, and root-cause issues related to operating Kubernetes clusters at ultra-large scale.

  • Build a high-velocity framework that enables continuous, always-on performance and scale testing through a modern CI/CD pipeline.

  • Present your work and findings at internal and external venues.

What we need to see:

  • At least 8 years of experience with a background in Computer Architecture, Networking, Storage Systems, and Accelerators

  • Bachelor's/Master's degree in Engineering or equivalent experience (preferably Electrical Engineering, Computer Engineering, or Computer Science)

  • Expertise in Kubernetes and familiarity with related CNCF projects

  • Expertise in working with large-scale parallel and distributed accelerator-based systems

  • Expertise in optimizing performance and AI workloads on large-scale systems

  • Experience with performance modeling and benchmarking at scale

  • Proficiency in Golang/Python

  • Expertise with at least one public CSP infrastructure (for example, GCP, AWS, Azure, or OCI)

Ways to stand out from the crowd:

  • Strong operational experience with any one of the Kubernetes distributions

  • Prior experience scaling Kubernetes clusters to ultra-large node and object counts

  • Demonstrated history of working in the open-source community

  • Excellent communication and interpersonal abilities

  • PhD in relevant areas

Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family at www.nvidiabenefits.com/

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until March 6, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.