Company overview:
TraceLink’s software solutions and Opus Platform help the pharmaceutical industry digitize its supply chain and enable greater compliance, visibility, and decision making. This reduces disruption to the supply of medicines to patients who need them, anywhere in the world.
Founded in 2009 with the simple mission of protecting patients, TraceLink today has 8 offices, over 800 employees, and more than 1,300 customers in over 60 countries around the world. Our expanding product suite continues to protect patients and now also enhances multi-enterprise collaboration through innovative new applications such as MINT.
TraceLink is recognized as an industry leader by Gartner and IDC, and by Comparably for its great company culture.
TraceLink is seeking a strategic and hands-on Senior Director of Cloud Engineering to lead a multi-disciplinary organization spanning Site Reliability Engineering (SRE), Performance & Tools Engineering, and Release Engineering. This role is critical to ensuring the scalability, reliability, and operational excellence of TraceLink’s cloud-native SaaS platform, while also owning the infrastructure behind both internal and customer-facing AI capabilities.
The Senior Director will be the single-threaded owner of our internal suite of AI-enabled tools for engineering productivity, and will be responsible for the DevOps and infrastructure support for external AI features integrated into the Opus platform, such as LLM-powered agentic functionality.
They will drive initiatives that enable AI-powered operational intelligence, cost-optimized infrastructure, and high-velocity product delivery across a globally distributed engineering team.
Responsibilities:
Act as the Single-Threaded Owner (STO) for infrastructure and operational excellence, and lead a global organization across three primary areas:
SRE, with an SRE Manager and team focused on reliability, observability, incident response, and cloud operations
Performance & Tools, building tooling for automated testing, test orchestration, system health monitoring, and integration testing
Release Engineering, responsible for CI/CD tooling, release orchestration, and deployment automation
Own and evolve TraceLink’s internal suite of AI-enabled tools designed to enhance developer productivity and platform insight
Play a leadership role in DevOps and infrastructure operations for AI capabilities integrated into TraceLink’s Opus platform, including support for LLM-based workflows, inference pipelines, and secure model interactions
Evaluate and adopt emerging technologies aligned with the company’s product vision and technical architecture
Partner with the CISO, architecture, and product teams to align cloud practices with security, compliance, and business goals
Drive maturity in infrastructure as code, observability (OpenTelemetry, Prometheus, Grafana, Jaeger), and release automation (Jenkins, Flux CD, Env0, CodeBuild)
Lead the design and rollout of AI-driven anomaly detection, telemetry pipelines, and proactive system health monitoring
Extend CI/CD and integration testing systems to support performance testing, distributed tracing, and alerting workflows
Contribute substantially to improving product quality through better automated testing
Champion cost optimization initiatives, including efficient AWS resource usage (Karpenter, Spot Instances, serverless), and align to target COGS metrics
Set high standards for reliability, latency, availability, and scalability of core systems
Oversee deployment health, platform smoke tests, and post-deployment validation strategies
Monitor and report on platform KPIs, system uptime, alerting noise ratios, and MTTR
Lead incident response strategy and reduce manual toil through automation and self-service tools
Hire, mentor, and grow high-performing engineering managers and technical leaders
Align team OKRs with broader engineering and company goals
Foster a culture of engineering rigor, continuous improvement, and cross-functional collaboration
Qualifications:
Required:
Bachelor’s degree in Computer Science, Engineering, or equivalent experience
5+ years in engineering leadership roles managing multiple cross-functional DevOps/SRE/tooling teams
Deep experience with cloud-native architecture, especially AWS services, infrastructure-as-code, CI/CD systems, and observability platforms
Proven success running SaaS at scale, including performance, reliability, and cost optimization
Hands-on experience with tools such as Terraform, Helm, Docker, Kubernetes, Prometheus, ELK, Redis, Kafka, Karpenter, Jenkins, OpenTelemetry, Grafana, Env0, CodeBuild
Experience with Amazon Bedrock or equivalent managed foundation model platforms
Experience supporting AI/ML-enabled applications, including inference pipelines and secure LLM integration
Experience with high-performance inference runtimes and serving infrastructure such as KServe, vLLM, TensorRT-LLM, TGI, or Envoy AI Gateway
Familiarity with techniques for optimizing inference performance and cost, including KV cache management, prompt caching, model quantization, and batching strategies
Clear understanding of security practices, DevSecOps, and compliance (e.g., SOC 2, ISO 27001)
Excellent communication and stakeholder management skills
Preferred:
Advanced degree in Engineering or related field
Experience with regulated industries (e.g., healthcare, pharma, or life sciences)
Familiarity with reactive frameworks and modern Java/JavaScript application stacks
Please see the TraceLink Privacy Policy for more information on how TraceLink processes your personal information during the recruitment process and, if applicable based on your location, how you can exercise your privacy rights. If you have questions about this privacy notice or need to contact us in connection with your personal data, including any requests to exercise your legal rights referred to at the end of this notice, please contact Candidate-Privacy@tracelink.com.