Work Schedule
Standard (Mon-Fri)
Environmental Conditions
Office
Job Description
When you’re part of the team at Thermo Fisher Scientific, you’ll do important work. Surrounded by collaborative colleagues, you’ll have the support and opportunities that only a global leader can give you. Our respected, growing organization has an exceptional strategy for the near term and beyond. Take your place on our strong team and help us make significant contributions to the world.
Responsibilities
Serve as a senior technical SME for enterprise infrastructure and application observability, supporting proactive detection, incident response, and service reliability.
Design, implement, and operate observability platforms including Zabbix, Prometheus, Grafana, and BigPanda across on-premises and cloud environments.
Own and lead integrations between observability tools, event management platforms, and ITSM systems (ServiceNow) to enable automated incident and workflow management.
Define and maintain standardized alerting strategies, dashboards, service health models, SLIs/SLOs, and operational reporting.
Act as a senior escalation point for complex observability and event management issues, leading root cause analysis and corrective actions.
Develop and maintain automation and auto-remediation workflows using StackStorm and supporting scripts.
Maintain observability-related documentation, including runbooks, dashboards, and procedures.
Provide technical leadership and mentorship to junior engineers.
Participate in on-call rotations and provide off-hours support as required.
Qualifications
Bachelor’s degree in Information Technology, Computer Science, Engineering, or a related discipline (or equivalent practical experience).
7+ years of experience in enterprise infrastructure and application observability, monitoring platforms, and IT operations.
Strong hands-on experience with observability and monitoring tools such as Zabbix, Prometheus, Grafana, Playwright, and BigPanda, including metrics, alerting, dashboards, event correlation, and reporting.
Proven experience integrating observability platforms with ITSM systems (ServiceNow or equivalent) and event management workflows.
Hands-on experience building automation and auto-remediation workflows, including scripting with Python, Shell, and Playwright, and building API-based integrations (e.g., StackStorm or similar tools).
Solid understanding of infrastructure, cloud, and application architectures, with the ability to troubleshoot complex cross-domain issues and act as a senior technical escalation point.
Nice to Have
Experience with logging platforms and log aggregation pipelines (e.g., ELK, OpenSearch, Splunk).
Exposure to cloud-native observability, SRE practices, or reliability engineering concepts.