About this Role

The SRE Center of Excellence owns the enterprise reliability standard across all technology towers, and this role sits at the center of that model. The team is actively executing a phased Dynatrace full-stack implementation: Phase 1 is underway; Phase 2 is enterprise-wide. Whoever steps into this role owns what gets built and scales it forward.

Senior level and self-directed. Must have 1+ years of hands-on Dynatrace experience and the depth to hit the ground running. Equally critical: polished and credible in front of stakeholders, able to lead teams through SRE principles and Dynatrace implementation with confidence. This is not a heads-down technical role. It requires someone who can walk into a room, build credibility fast, and bring teams along.

Key Responsibilities

Dynatrace Platform Ownership: Configure and manage Dynatrace agents, APM, RUM/Synthetics, Log-in-Context, and Zenoss alerting integrations. Build dashboards aligned to the four Golden Signals for both engineering and executive audiences. Own SLO-based alerting targeting MTTD under 5 minutes.

SRE Embedded Engagement: Conduct 2-3 week embedded engagements with tower teams to assess observability maturity, identify toil, and stand up monitoring frameworks. Define and track SLIs/SLOs with program managers and team leads. Deliver runbooks and alert frameworks for Tier 1/2 incident execution. Champion blameless postmortems and RCA feedback loops.

Automation & Infrastructure: Implement and extend Ansible Automation Platform for infrastructure provisioning, configuration management, and event-driven workflows across Linux (600+ servers) and Windows (3,000-7,000 servers) environments. Contribute to automated remediation workflows targeting zero human intervention for known issues.

Standards & CoE Contribution: Establish and maintain enterprise-wide SRE baseline standards across all towers. Serve as a Domain Champion bridging the CoE to individual tower SRE teams. Contribute to OKR tracking across observability, MTTD/MTTI reduction, incident resolution, automation, and SRE readiness.

Required Qualifications

• 5+ years of experience in Site Reliability Engineering, IT operations, or related fields.

• Bachelor's degree in Computer Science, Engineering, or equivalent (2 additional years in lieu of degree).

• Dynatrace — required, 1+ year hands-on minimum (more preferred). Must be able to configure agents, APM instrumentation, dashboards, SLO alerting, and log integrations in a production environment with no ramp time needed. This is the single most critical qualifier — depth matters more than breadth across other tools.

• Other observability tools (helpful and valued, not blocking): Grafana, AppDynamics, Sumo Logic, New Relic, or ThousandEyes.

• Ansible Automation Platform: strong plus. Any automation or configuration management tool (Terraform, Chef, Puppet) will be considered. Demonstrated automation mindset is what matters.

• Demonstrated ability to define SLIs/SLOs in collaboration with product and engineering teams, not just consume them.

• Demonstrated ability to present and lead in front of stakeholders — must be able to walk into a room, command credibility, explain SRE principles clearly to both technical and non-technical audiences, and guide teams through implementation. This is a hard requirement, not a soft skill.

• Experience in enterprise environments with a mix of on-prem, cloud, and homegrown applications — not just single-product SRE.

• Must be authorized to work in the U.S. (W2 only; no sponsorship). West Coast / PST hours required.

• Must be located near an Alaska Airlines hub city. Remote candidates expected on-site approximately once per month.

Preferred Qualifications

• Experience with Dynatrace-to-Zenoss alerting integration or similar ITSM event management pipelines.

• Familiarity with Dynatrace RUM, Synthetics, and Log-in-Context modules specifically.

• Event-Driven Automation (EDA) experience with Ansible.

• Scripting proficiency in Python, Shell, or PowerShell.

• Experience supporting or extending a consulting-led platform deployment (ability to absorb and own what a third party built).

• ITIL familiarity — incident management, change management, RCA documentation.

• Splunk experience (in lieu of or alongside Sumo Logic).

• Kubernetes exposure in the context of Ansible or monitoring infrastructure.

Salary Range

$31.79 - $50.19 USD (Hourly)

Please note that the salary information provided herein is base pay only (gross); it does not include other forms of compensation which may or may not apply to this specific position, namely, performance-based bonuses, benefits-related payments, or other general incentives - none of which are guaranteed, may be subject to specific eligibility requirements, and are wholly within the discretion of Astreya to remit.
Further, the salary information noted above is a range that consists of a minimum and maximum rate of pay for this specific position. Where an applicant or employee is placed on this range will depend and be contingent on objective, documented work-related considerations like education, experience, certifications, licenses, preferred qualifications, among other factors.

Sr. Systems Reliability Engineer
