At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our employees around the world work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We’re looking for people who are determined to make life better for people around the world.

About the Technology Organization

Like most prominent tech companies, Technology at Lilly builds and maintains capabilities using pioneering technologies. What differentiates Technology at Lilly is that we create new possibilities through tech, like data-driven drug discovery and connected clinical trials, to advance our purpose: creating medicines that make life better for people around the world. We hire the best technology professionals from a variety of backgrounds, so they can bring an assortment of knowledge, skills, and diverse thinking to deliver solutions in every area of our business.

About the Business Function

The Software Product Engineering (SPE) team is a specialized engineering group that delivers strategic solutions and differentiated capabilities. We take a forward-thinking approach, focusing on an enterprise platform and product mindset, ensuring that the solutions we build can be leveraged across Technology teams for broader impact and efficiency.

Role Summary

As a Principal Support Engineer – Operations (R3), you will be the senior technical authority for production support across a suite of products and services. You will lead complex incident resolution, drive systemic reliability improvements, and influence operational standards across teams. This role expands beyond advanced troubleshooting to include end-to-end ownership of major incidents, deep technical remediation, automation to reduce operational toil, and mentoring of support engineers. You will partner closely with Engineering, Product, QA, Security, and Platform teams to ensure resilient services, strong operational readiness, and measurable improvements in uptime, latency, and customer experience.

What You’ll Be Doing (Key Responsibilities)

1) Advanced Incident Leadership & Resolution

Act as the final escalation point for the most complex, high-impact production issues spanning frontend, backend, integrations, data stores, and cloud infrastructure.

Lead major incident response (swarming/war-room execution), including triage strategy, technical direction, and recovery coordination across multiple teams.

Drive consistent incident execution aligned with incident management expectations (escalation, outage/deviation considerations, and appropriate stakeholder visibility).

2) Problem Management, RCA, and Defect Elimination

Own and drive Root Cause Analysis (RCA) for recurring and severe incidents; identify systemic failure patterns and champion long-term fixes over workarounds.

Partner with engineering to translate RCA outcomes into durable changes (code, configuration, architecture, monitoring, or process), and track fixes to closure with measurable reliability impact.

3) Reliability Engineering & Operational Excellence

Lead initiatives to improve availability, performance, scalability, and operational resilience (e.g., reducing MTTR, improving detection, reducing repeat incidents).

Define and implement operational guardrails: readiness checks, runbooks, rollback patterns, post-release validation, and shift-left operational readiness with Dev/QE.

Contribute to or lead stabilization work consistent with engineering/SRE responsibilities (reliability improvements, defect elimination, major-incident swarming).

4) Observability, Monitoring & Automation

Design and evolve observability across logs/metrics/traces; improve signal quality (actionable alerts, noise reduction, meaningful dashboards).

Build automation for common operational tasks (triage, remediation, reporting), using scripting and tooling to reduce manual effort and improve consistency.

5) Deployment & Change Support

Provide senior support for deployments/releases: risk assessment, go/no-go input, rollback readiness, and rapid response for post-release issues.

Improve CI/CD operational safety through better validation, monitoring hooks, and release checklists in partnership with DevOps/Platform teams.

6) Compliance, Security & Regulated Environment Readiness

Ensure support processes and fixes align with internal standards and external regulations (e.g., GDPR, HIPAA where applicable).

Promote secure operational practices: least privilege, auditability, secure debugging, and appropriate handling of sensitive data during incident response.

7) Knowledge Leadership & Mentoring

Raise the operational bar by creating and governing high-quality runbooks, knowledge base articles, and operational standards; ensure reusability and adoption across teams.

Mentor L2/R2 engineers: technical coaching, incident handling patterns, RCA quality, and effective cross-team collaboration, acting as a role model for knowledge sharing.

How You Will Succeed (Success Profile)

At R3, success is measured not only by resolving incidents, but by preventing them, improving reliability at scale, and influencing standards across teams:

Be a recognized technical expert who solves complex problems and introduces improved methods/approaches for operations and reliability.

Lead technical decisions during incidents and influence operational standards, technical direction, and cross-team alignment.

Demonstrate strong systems thinking: understand failure modes across distributed services, data stores, networks, and cloud infrastructure.

Drive measurable outcomes (examples): reduced repeat incidents, improved alert quality, lower MTTR, improved SLO attainment, reduced manual toil.

Communicate crisply under pressure, facilitating fast alignment between engineering, product, and stakeholders during major incidents.

What You Should Bring (Qualifications)

Required

7–10 years of experience in application support, production engineering, SRE, or software engineering with strong operations ownership (including high-severity incident response).

Deep hands-on debugging across web applications (frontend + backend), integrations, and production environments.

Strong experience with incident management and ticketing workflows (e.g., ServiceNow, Jira), including major incident execution and RCA.

Strong knowledge of RESTful APIs, databases (e.g., PostgreSQL), caching/data stores (e.g., Redis), and cloud platforms (AWS/Azure/GCP).

Expertise in monitoring/logging/alerting stacks (e.g., CloudWatch, ELK, Datadog, Splunk/AppDynamics or equivalent) and the ability to build actionable observability.

Advanced scripting/automation capability (e.g., Bash, Python, JavaScript) to reduce toil and standardize response.

Experience supporting products in regulated industries; working knowledge of privacy/security expectations and secure handling.

Strong collaboration and communication skills across Dev, QA, Product, Security, and platform teams.

Preferred / Nice to Have

Experience defining and operationalizing SLIs/SLOs, error budgets, and reliability reporting (SRE ways of working).

Experience with containerization and deployment patterns (Docker/Kubernetes/ECS), CI/CD systems, and infrastructure-as-code concepts.

Demonstrated mentoring/leadership: raising the capability of teams through coaching and standards.

Additional Information

Availability to work flexible hours is/may be required. This team will support continuous operations across two shifts; therefore, this role will require non-standard work hours and some work on weekends and holidays. Appropriate adjustments in benefits will be provided for employees working non-standard hours, where applicable.

Lilly is dedicated to helping individuals with disabilities to actively engage in the workforce, ensuring equal opportunities when vying for positions. If you require accommodation to submit a resume for a position at Lilly, please complete the accommodation request form (https://careers.lilly.com/us/en/workplace-accommodation) for further assistance. Please note this is for individuals to request an accommodation as part of the application process and any other correspondence will not receive a response.

Lilly does not discriminate on the basis of age, race, color, religion, gender, sexual orientation, gender identity, gender expression, national origin, protected veteran status, disability or any other legally protected status.

#WeAreLilly
