The Senior Manager of Site Reliability Engineering (SRE) provides technical and people leadership for the reliability, scalability, and performance of our mission‑critical systems and services. This role blends hands‑on SRE expertise with strong engineering management, driving the execution and maturation of reliability practices across supported platforms and teams.
The ideal candidate has a strong software engineering background, a passion for automation and operational excellence, and a proven ability to lead, mentor, and scale high‑performing SRE teams.
The Senior Manager partners closely with engineering, product, and operations leaders to design, deliver, and operate resilient, highly available systems that support customer needs and business objectives.
Responsibilities
- Lead the execution and continuous improvement of SRE practices across assigned platforms and services, reinforcing a culture of reliability, efficiency, and operational ownership.
- Manage and evolve automation strategies that reduce operational toil, improve system reliability, and increase engineering productivity.
- Design, implement, and operate observability, monitoring, and alerting solutions that provide actionable insight into system health, availability, and performance.
- Own and lead high‑severity incident response for supported services, ensuring effective triage, coordination, root cause analysis, and completion of corrective and preventative actions.
- Analyze reliability, performance, and capacity metrics to identify risks, drive proactive improvements, and support long‑term system resilience.
- Partner with software engineering, product, and infrastructure teams to embed SRE principles throughout the development lifecycle and influence architecture and design decisions.
- Build, coach, and develop SRE managers and engineers, fostering technical excellence, career growth, and strong on‑call and operational practices.
- Support capacity planning, scalability assessments, and demand forecasting for critical systems and services.
- Ensure SRE processes, standards, and best practices are well documented, understood, and consistently applied.
Key Decision Rights
- Own technical and operational decisions for SRE‑managed systems, including reliability improvements, automation priorities, and tooling choices within the team’s scope.
- Act as a senior incident leader during major outages, driving response execution and recommending remediation and long‑term reliability improvements.
- Provide input into staffing plans, hiring decisions, and budget requests for the SRE organization.
- Define and manage service level indicators (SLIs), service level objectives (SLOs), and error budgets in partnership with engineering and product teams.
- Recommend and help evaluate third‑party tools and platforms related to observability, incident management, reliability testing, and automation.
- Drive operational process improvements that enhance system reliability and team effectiveness.
Leadership & Interpersonal Skills
- Strong people leadership skills, with demonstrated experience managing and developing senior engineers and first‑line managers.
- Ability to lead calmly and decisively during high‑pressure incidents.
- Effective communicator who can translate complex technical topics into clear, actionable guidance for engineering leaders and stakeholders.
- Collaborative mindset with the ability to influence across engineering, product, and operations without direct authority.
- Proven ability to balance short‑term operational needs with long‑term reliability investments.
- Comfortable navigating ambiguity, resolving conflict, and fostering healthy, accountable team dynamics.
- Strong sense of ownership and accountability for service reliability and operational outcomes.
Required Skills and Competencies
- Strong experience with observability and monitoring platforms such as Datadog, Prometheus, Dynatrace, Grafana, ELK, or similar.
- Proficiency in at least one programming language such as Python, Go, or Java.
- Hands‑on experience with cloud platforms (AWS, Azure, or GCP) and container orchestration technologies (Docker, Kubernetes).
- Solid working knowledge of AWS services such as VPC, EC2, ELB, ECS, EKS, Lambda, IAM, CloudWatch, S3, SQS, SNS, Route 53, and WAF.
- Experience with infrastructure‑as‑code tools such as Terraform, Ansible, or equivalents.
- Strong troubleshooting and problem‑solving skills in distributed systems environments.
- Working knowledge of security best practices and operational risk management.
- Experience with resilience testing, chaos engineering, or failure‑injection techniques.
- Familiarity with applying AI/ML‑assisted approaches to observability, operations, or incident management is a plus.
Education and Experience
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- 12+ years of overall engineering experience, including 5+ years in Site Reliability Engineering, DevOps, or a similar role.
- 3+ years of experience leading engineering teams or managing senior technical contributors.
- Demonstrated experience operating and improving highly available, scalable, and fault‑tolerant systems.
NOTE: This position is eligible for hybrid working arrangements and requires on-site work from an Insulet office.
Additional Information:
Compensation & Benefits:
For U.S.-based positions only, the annual base salary range for this role is $178,700.00 to $268,025.00. Actual pay depends on skills, experience, and education.
This position may also be eligible for incentive compensation.
We offer a comprehensive benefits package, including:
• Medical, dental, and vision insurance
• 401(k) with company match
• Paid time off (PTO)
• Additional employee wellness programs
Application Details:
This job posting will remain open until the position is filled.
To apply, please visit the Insulet Careers site and submit your application online.
Insulet Corporation (NASDAQ: PODD), headquartered in Massachusetts, is an innovative medical device company dedicated to simplifying life for people with diabetes and other conditions through its Omnipod product platform. The Omnipod Insulin Management System provides a unique alternative to traditional insulin delivery methods. With its simple, wearable design, the tubeless disposable Pod provides up to three days of non-stop insulin delivery, without the need to see or handle a needle. Insulet's flagship innovation, the Omnipod 5 Automated Insulin Delivery System, integrates with a continuous glucose monitor to manage blood sugar with no multiple daily injections and zero fingersticks, and it can be controlled by a compatible personal smartphone in the U.S. or by the Omnipod 5 Controller. Insulet also leverages the unique design of its Pod by tailoring its Omnipod technology platform for the delivery of non-insulin subcutaneous drugs across other therapeutic areas. For more information, please visit insulet.com and omnipod.com.
We are looking for highly motivated, performance-driven individuals to join our expanding team. We grow by hiring amazing people, guided by shared values, who exceed customer expectations. Our continued success depends on it!
At Insulet Corporation all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.