Job Summary

At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our 39,000 employees work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We’re looking for people who are determined to make life better for people around the globe.

Competency Summary

We are seeking an HPC Performance Engineer intern to join our team of scientists and engineers passionate about building the next generation of scientific machine learning (ML) frameworks. The High Performance Compute intern will support and advance LillyPod, Lilly's internal GPU compute cluster for AI/ML workloads. LillyPod uses Run:ai as its orchestration layer and supports three main workload types: interactive workspaces for development, batch training jobs, and distributed training across multiple nodes.

Key Objectives/Deliverables

Manage and optimize GPU cluster resources using the Run:ai orchestration platform, including scheduling, quota management, and workload prioritization
Support users running workspace (interactive), training (batch), and distributed training workloads across multi-node GPU environments
Design and implement computationally performant features for large-scale, CUDA-backed ML training frameworks, using low-level acceleration and scaling strategies such as kernel design, GPU porting, data structure innovations, and distributed learning technologies
Optimize the computational performance of a wide range of business-critical ML models through accelerated hardware and software stacks, as well as algorithmic improvements
Design and maintain shared and project-specific filesystems for large-scale data and model storage
Build and maintain container images via an internal registry, ensuring reproducibility and security of ML environments
Develop and maintain CLI tooling and automation for workload submission, monitoring, and lifecycle management
Implement and tune preemptible workload strategies to maximize cluster utilization

Minimum Position Requirements

Education Requirements: Master's/PhD
Strong Linux systems administration skills (networking, storage, process management)
Experience with Kubernetes and container orchestration in a GPU-accelerated environment
Deep knowledge of high-performance computing infrastructure, including NVIDIA H200 and B200 GPU clusters
Hands-on experience with Run:ai or similar ML workload schedulers (SLURM, PBS, Volcano)
Proficiency building and managing Docker/OCI container images and private registries
Familiarity with distributed training frameworks (PyTorch DDP, DeepSpeed, FSDP, Horovod)
Understanding of shared filesystem architectures (NFS, Lustre, GPFS/Spectrum Scale, or similar)
Experience with CLI tool development (Python, Go, or Bash)
Comfort working with ML/DL practitioners and translating their compute needs into infrastructure solutions

Lilly is dedicated to helping individuals with disabilities to actively engage in the workforce, ensuring equal opportunities when vying for positions. If you require accommodation to submit a resume for a position at Lilly, please complete the accommodation request form (https://careers.lilly.com/us/en/workplace-accommodation) for further assistance. Please note this is for individuals to request an accommodation as part of the application process and any other correspondence will not receive a response.

Lilly does not discriminate on the basis of age, race, color, religion, gender, sexual orientation, gender identity, gender expression, national origin, protected veteran status, disability or any other legally protected status.

#WeAreLilly

High Performance Compute Intern
