Company
Cox Automotive - USA
Job Family Group
Job Profile
Management Level
Flexible Work Option
Travel %
Work Shift
Compensation
Compensation includes a base salary of $99,000.00 - $165,000.00. The base salary may vary within the anticipated base pay range based on factors such as the ultimate location of the position and the selected candidate’s knowledge, skills, and abilities. Position may be eligible for additional compensation that may include an incentive program.
Job Description
The Site Reliability Engineer - Incident Response is a critical enterprise-level role responsible for accelerating incident resolution and enhancing the overall incident management process. This individual partners with engineering teams during active incidents to troubleshoot issues using monitoring and logging tools and, after each incident, delivers executive-level summaries that clearly communicate impact, root cause, and resolution. The SRE - Incident Response also plays a key role in analyzing incident response effectiveness and identifying opportunities for systemic improvements.
Engineering/Tooling: Demonstrates the ability to design, build, and maintain engineering solutions and tools that enhance reliability, automate incident response, and reduce operational toil.
Incident Troubleshooting: Skilled in interpreting logs, metrics, and traces to assist in identifying root causes during live incidents.
Monitoring & Observability: Proficient in tools such as Datadog, Splunk, New Relic, or similar platforms.
AI-Centric Engineering: Effectively leverages artificial intelligence (AI) and machine learning (ML) tools to automate, optimize, and enhance daily engineering and incident response tasks.
Executive Communication: Ability to distill complex technical issues into concise, business-relevant summaries for senior leadership.
Analytical Rigor: Strong attention to detail in validating incident data and identifying trends or gaps in response.
DevOps & Architecture Knowledge: Understanding of full-stack systems, CI/CD pipelines, caching, scaling, and cloud-native infrastructure.
Metrics & Reporting: Capable of calculating and interpreting key metrics like MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve).
Here’s what the role typically involves when not tied to active on-call:
Post-Incident Review Development
Draft and deliver executive summaries post-incident.
Develop and coach teams on blameless postmortems.
Create templates, train facilitators, and help guide root cause analysis (e.g., 5 Whys, fishbone diagrams).
Maintain a central library of learnings and cross-cutting themes.
Incident Process Improvement
Actively support engineering teams during incidents by helping diagnose and resolve issues quickly.
Navigate and analyze data from observability platforms to make informed inferences about root causes.
Analyze the effectiveness of incident response to identify systemic reliability gaps.
Standardize incident response workflows (incident roles, comms, escalation paths).
Create or refine runbooks, incident command frameworks, and severity classification guides.
Metrics and Insights
Build dashboards around incident frequency, MTTR, MTTA, and recurrence rates.
Use incident data to drive reliability OKRs or engineering investments.
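As a rough illustration of the metrics named above, MTTA and MTTR can be computed as simple averages over incident timestamps. The sketch below is an assumption-laden example, not a description of any tooling used in this role: the `Incident` record shape, its field names, and the sample data are hypothetical, and MTTR is measured here from the time the incident was opened (some teams measure from detection or acknowledgement instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; real incident tools expose different fields.
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean Time to Acknowledge: average of (acknowledged - opened)."""
    seconds = mean((i.acknowledged - i.opened).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Resolve: average of (resolved - opened), by assumption."""
    seconds = mean((i.resolved - i.opened).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Made-up sample data for illustration only.
incidents = [
    Incident(
        opened=datetime(2024, 1, 8, 10, 0),
        acknowledged=datetime(2024, 1, 8, 10, 4),
        resolved=datetime(2024, 1, 8, 11, 0),
    ),
    Incident(
        opened=datetime(2024, 1, 9, 14, 0),
        acknowledged=datetime(2024, 1, 9, 14, 10),
        resolved=datetime(2024, 1, 9, 14, 40),
    ),
]

print(mtta(incidents))  # 0:07:00
print(mttr(incidents))  # 0:50:00
```

In practice the timestamps would come from an incident management or observability platform, and the averages would typically be broken down by severity or service before going on a dashboard.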
Tooling & AI Solutions
Partner with engineering teams to identify repetitive or high-impact tasks suitable for automation.
Develop, implement, and continuously improve custom scripts, bots, and AI-driven workflows for monitoring, alerting, and incident triage.
Evaluate and integrate emerging AI/ML technologies to optimize detection, root cause analysis, and reporting.
Ensure all tools and automations are secure, maintainable, and aligned with organizational standards and SRE best practices.
Document and socialize new tools and AI solutions, enabling adoption and knowledge sharing across teams.
Cross-Team Collaboration
Collaborate with Engineering Managers and Incident Commanders to gather and validate incident data.
Partner with product teams, infra, and leadership to socialize reliability best practices.
Act as a reliability “consultant” to squads that have impactful incidents.
Recommend enhancements to monitoring, alerting, and response processes to reduce future incident impact.
Drug Testing
Benefits
About Us