At NVIDIA, we have been pioneers in innovation for over 25 years, transforming computer graphics, PC gaming, and accelerated computing. Our team is driven by powerful technology and outstanding people who expand the limits of what’s achievable. Now, we are unlocking the potential of AI to usher in the next era of computing.
As part of our engineering organization, you will play a key hands-on role in developing and executing software-driven characterization workflows on NVIDIA rack-scale systems. The role focuses on running AI workloads across the full stack to analyze, characterize, and optimize power, performance, and drive behavior at the system level. This is an opportunity to work at the intersection of software, infrastructure, silicon, and large-scale AI platforms, with direct impact on next-generation NVIDIA systems.
What you’ll be doing:
Develop and run software tools, automation, and workloads to characterize power, performance, and drive behavior across NVIDIA rack-scale systems.
Execute AI and system-level workloads to stress and evaluate behavior across the stack, including GPUs, CPUs, networking, storage, firmware, drivers, and system software.
Build automated frameworks for data collection, telemetry, validation, correlation, and analysis of characterization results.
Investigate system behavior under different workloads and operating conditions to identify bottlenecks, anomalies, and optimization opportunities.
Work closely with hardware, firmware, driver, system software, performance, and validation teams to define characterization methodologies and debug cross-stack issues.
Support bring-up, validation, and readiness activities for new rack-scale platforms and AI infrastructure.
Create clear documentation, test flows, and repeatable processes to improve coverage, efficiency, and reproducibility.
What we need to see:
B.Sc. or M.Sc. in Computer Science, Electrical Engineering, or a related field.
5+ years of software engineering experience, preferably in system software, infrastructure, validation, or performance-focused environments.
Strong programming skills in Python and at least one system-level language such as C/C++.
Experience developing automation and test infrastructure for complex hardware/software systems.
Hands-on experience running, debugging, or optimizing AI, HPC, or large-scale system workloads.
Good understanding of system-level architecture, including interactions across hardware, firmware, drivers, operating systems, and application layers.
Experience working in Linux environments and with scripting, telemetry, logging, and data analysis tools.
Strong debugging and problem-solving skills, with the ability to work across multiple engineering disciplines.
Good communication skills and the ability to drive technical work in a fast-paced, cross-functional environment.
Ways to stand out from the crowd:
Experience with NVIDIA platforms, GPU systems, or rack-scale AI infrastructure.
Background in power, thermal, performance, or storage/drive characterization.
Experience with workload automation, cluster orchestration, or lab infrastructure.
Familiarity with AI benchmarks, training/inference workloads, and system stress methodologies.
Experience in post-silicon validation, production testing, or system bring-up.