Join NVIDIA as a Senior Deep Learning Algorithms Engineer to optimize cutting-edge biology and structural biology models, including LLMs and VLMs, for maximum performance and efficiency on NVIDIA GPUs. You will focus on world-class inference for workloads such as protein structure prediction and design.
As part of BioNeMo, you will collaborate across teams to move next-gen AI models (e.g., Boltz1/2, OpenFold2/3) from research to production serving via TensorRT-LLM and related stacks, ensuring industry-leading, scalable performance for scientists and developers.
What you will be doing:
Integrate TensorRT-LLM with BioNeMo models (Boltz1/2, OpenFold2/3) and upcoming structural biology models (RFDiffusion, DiffDock, ProteinMPNN, Evo2, ESM3).
Optimize models for low-latency, high-throughput inference using parallelism, quantization (FP8/INT8), and sparsity/pruning.
Profile and debug deep learning workloads on GPUs, resolving kernel/graph bottlenecks in training/inference, including custom operators.
Develop and validate custom GPU kernels (CUDA, Triton) for hot paths, memory-bound ops, and non-standard blocks in structural biology models.
Collaborate with research teams to align model architecture and training with deployment constraints, so that models transition smoothly to production.
What we want to see:
MS/PhD in CS, EE, Comp. Eng., or equivalent practical experience.
5+ years of professional experience in deep learning/applied ML, with a track record of deploying optimized models/inference paths in production (not research prototypes).
Strong foundation in transformer/diffusion architectures; direct experience with LLMs, VLMs, or large biology models (e.g., structure prediction).
Proficient in PyTorch (and/or TensorFlow) for production-grade model building, debugging, and deployment.
Strong Python/C++; ability to read/modify performance-critical C++/CUDA code for inference stacks and custom ops.
Practical experience with TensorRT/TensorRT-LLM: model conversion, optimization, deployment, and performance measurement (latency/throughput) under realistic conditions.
Familiarity with GPU performance engineering: profiling (Nsight), roofline analysis, and optimization of kernels and memory access. Experience writing or extending custom GPU kernels for model hot paths is required.
Ways to stand out from the crowd:
Led or significantly contributed to large-scale LLM/VLM/biology model serving (strict SLOs, high QPS, multi-GPU/node inference, cost/perf ownership).
Deep customization of, or substantial contributions to, TensorRT-LLM, vLLM, SGLang, or comparable stacks, including debugging and extending for novel architectures.
End-to-end ownership of FP8/INT8 (or other formats), including calibration, regression testing, and documenting accuracy vs. speed tradeoffs on biology workloads.
Strong familiarity with protein structure, docking, or diffusion-based design and model families (e.g., OpenFold, Boltz, ESM, RFDiffusion, DiffDock), demonstrated by benchmarks, publications, or open-source work.
Repeated success taking non-text architectures (geometric, multimodal, structure-centric) from research/checkpoint to optimized, production-ready inference with clear metrics.
Examples of writing, maintaining, or upstreaming custom kernels or fused ops that produced measurable gains on real models or hardware.