We are seeking a senior-level UNIX/Linux Systems Engineer to architect, own, and operate our next-generation on-premises private cloud and GPU compute infrastructure supporting global engineering teams in Hillsboro, OR. This is a strategic, deeply technical systems architect position responsible for designing, scaling, and optimizing a world-class HPC/GPU datacenter environment.
You will drive end-to-end systems design, oversee compute, network, and storage architecture, and take full ownership of reliability, automation, performance, and deep multi-layer troubleshooting across thousands of nodes running bare-metal and virtualized workloads.
What You’ll Lead
- Architect, scale, and optimize complex UNIX/Linux-based compute clusters, GPU farms, and high-density datacenter systems.
- Own the design and strategy for on-prem HPC/GPU compute environments including OS architecture, distributed storage, network tuning, and interconnects.
- Perform deep-dive troubleshooting across all layers — kernel, network stack, RPC/NFS, storage protocols, firmware, drivers, bootloaders, and orchestration systems.
- Lead automation efforts using Python, Bash, Ansible, and IaC to eliminate manual processes and improve system reliability.
- Drive configuration standards for compute, network, and storage layers across bare-metal systems.
- Collaborate with architects, system software teams, networking teams, and hardware engineering to ensure platform scalability.
- Own operational excellence: uptime, performance tuning, incident response processes, and long-term platform strategy.
- Mentor and technically lead junior engineers and datacenter technicians.
What We Need to See
- 8–15+ years in UNIX/Linux systems engineering, system administration, or HPC/compute infrastructure roles.
- Expert-level knowledge of Linux internals (kernel, storage subsystems, networking stack, cgroups, systemd, NUMA, etc.).
- Proven experience architecting and running large-scale compute clusters or farms (HPC, HCI, GPU clusters, or bare-metal automation environments).
- Deep understanding of compute, network, and storage architectures end-to-end.
- Demonstrated skill in root-cause analysis at multiple layers, including:
  - NFSv3/v4 deep troubleshooting
  - Packet-level analysis
  - Kernel performance tuning
  - Distributed storage (NetApp, Ceph, Lustre, BeeGFS, etc.)
- Strong networking fundamentals: TCP/IP, VLANs, BGP, LACP, RoCE/RDMA, NIC offloading.
- Strong automation skills: Python, Bash, Ansible, Terraform, or IaC tools.
- Experience with PXE provisioning, Kickstart, bare-metal deployments, and OS image pipelines.
- Certifications strongly preferred:
  - UNIX/Linux certs (RHCE, RHCSA, Linux Foundation)
  - Networking certs (CCNP, CCIE, JNCIP, etc.)
  - Storage certs (NetApp NCIE/NCDA or similar)
What Makes You Stand Out
- Experience designing GPU clusters or accelerator-dense environments.
- Deep experience with distributed filesystems, block storage tuning, or NFS debugging.
- Strong background in systems and platform performance engineering.
- Ability to continuously evaluate emerging technologies and build long-term architectural recommendations.
- Experience leading and mentoring infrastructure teams.
Sustainable Talent is a M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.