NVIDIA is known as the "AI Computing Company." Our GPUs power modern Deep Learning software frameworks, accelerated analytics, data centers, and autonomous vehicles. We seek a Software Reliability Engineer - LPU Hardware DataFlow to join our company and concentrate on hardware reliability testing and driver reliability. You will develop and conduct reliability and qualification campaigns for NVIDIA hardware (accelerators, boards). You will also build and sustain automated test frameworks for driver stability and regression. Additionally, you will lead efforts to improve hardware and driver reliability to meet customer expectations.
In this role, you own the reliability of our hardware and driver stack. You will run and automate hardware stress tests, longevity and environmental tests, and failure analysis; you will also take responsibility for driver reliability testing: stability under load, regression suites, compatibility matrices, and crash/hang triage. Your work ensures that hardware and drivers ship with confidence and that field issues are understood and prevented through improved test coverage and monitoring. The ideal candidate has solid software and automation skills along with enthusiasm for hardware and low-level software. We seek engineers who think in terms of failure modes and stress scenarios, develop consistent reliability testing processes, and connect driver and hardware behavior to identify root causes of reliability problems.
What you'll be doing:
Fix logic bugs before they even happen by providing formal correctness proofs.
Develop and sustain driver reliability test frameworks: automated stability evaluations, regression test suites, and compatibility assessments across OS, driver versions, and hardware SKUs.
Diagnose and identify driver and hardware failures: investigate crashes, freezes, and errors; collaborate with driver and hardware groups to resolve problems and enhance test coverage.
Establish and track reliability metrics and SLOs for hardware and drivers; perform post-mortems and drive improvements in test automation and coverage.
Build, implement, and run hardware reliability and qualification tests: stress tests, longevity tests, thermal/power cycling, and environmental tests on GPUs and accelerators.
Automate test execution, result collection, and reporting; integrate reliability tests into CI and release workflows; manage lab or farm infrastructure for reliability testing across EMEA and worldwide.
What we need to see:
BS or higher degree (or equivalent experience) and 8+ years in reliability engineering, hardware testing, driver testing, or SRE with a focus on hardware/drivers.
Functional programming experience (Haskell, Nix).
Strong system-level programming experience (C++, Rust, Java).
Strong experience with Linux and scripting (Python, Shell) for test automation, result parsing, and tooling.
Proficiency in building automated test pipelines; experience with CI/CD and with running tests at scale (e.g. test farms, lab automation).
Ability to prioritize failures, examine logs and dumps, and collaborate with driver or hardware teams to identify root causes of issues.
Strong communication skills in English; capable of collaborating with distributed teams across EMEA and worldwide.
Ways to stand out from the crowd:
Experience with GPU or accelerator reliability testing; familiarity with NVIDIA or other GPU/driver ecosystems.
Experience with hardware durability or certification testing (stress, longevity, thermal, power) and/or driver consistency and regression testing.
Background in driver development, kernel debugging, or low-level software; ability to read driver code and correlate behavior with test failures.
Experience with hardware testing tools, lab automation, or DUT (device-under-test) management at scale.
Knowledge of reliability standards and methods (e.g. FIT rates, accelerated life testing, failure analysis).
Experience with firmware or BIOS reliability testing; understanding of hardware–software interaction and error reporting (e.g. AER, MCE).
Join our team of world-class engineers and be part of the groundbreaking work we do at NVIDIA. We are committed to encouraging a collaborative and inclusive environment, where every team member has the opportunity to thrive and make a significant impact!