About us:
DoubleVerify (DV) is the leader in digital performance solutions, improving the impression quality and audience impact of digital advertising. Built on best practices, DV solutions create value for media buyers and sellers by bringing transparency and accountability to the market, ensuring ad viewability, brand safety, fraud protection, accurate impression delivery, and audience quality across campaigns to drive performance. Since 2008, DV has helped hundreds of Fortune 500 companies get the most value from their media spend by delivering best-in-class solutions across the digital ecosystem that help build a better industry.
About the role:
The Senior Incident Manager serves as the strategic and operational leader for DoubleVerify's Major Incident Management program. This role owns the end-to-end incident lifecycle—from detection and technical mitigation through business impact assessment, stakeholder communication, and post-incident improvement. The position requires both deep technical acumen and strong business judgment to minimize customer impact, protect revenue, and ensure rapid, coordinated response across Engineering, Product, Commercial, and Executive teams. This is an individual contributor role focused on incident command and program excellence.
Responsibilities:
Incident Command: Lead technical and corporate response to Sev1/Sev2/Sev3 incidents, serving as the single point of accountability during major disruptions
Stakeholder Orchestration: Mobilize cross-functional teams (Engineering, Product, Commercial, QA, TechOps) based on incident severity and business impact using established RACI frameworks
Real-Time Decision Making: Make rapid severity classification decisions, approve escalations, and coordinate technical remediation efforts to restore service
Communication Leadership: Drive timely, accurate updates to internal stakeholders and executives; coordinate external client/partner communications when business impact warrants
Program Management: Own and evolve the Major Incident Management (MIM) process, ensuring adherence to defined standards, SLAs, and escalation procedures across all product lines
Metrics & Analytics: Track, analyze, and report on incident trends, MTTR, recurring issues, and process effectiveness; present quarterly insights to leadership
Retrospective Leadership: Facilitate post-incident reviews within 48 hours of resolution, driving accountability for action items and lessons learned
Documentation & Training: Maintain runbooks, response playbooks, and communication templates; train incident managers and technical leads on MIM best practices
Impact Analysis Oversight: Partner with TechOps, Product, and Commercial teams to assess customer, revenue, and reputational impact
Risk Assessment: Translate technical incidents into business language for executive stakeholders; recommend actions based on exposure, financial impact, and client churn risk
Client Communication Strategy: Determine when external communication is required; collaborate with SSO, Product Marketing, and Legal to draft and approve client-facing messaging
Billing & Revenue Protection: Work with Commercial and Billing leads to quantify financial impact and coordinate credit/refund decisions when revenue is affected
Automation & Tooling: Drive adoption of AI-powered incident automation, including auto-triage, impact analysis tools, and intelligent alerting
SLO Framework Integration: Partner with SRE teams to align incident response with Service Level Objectives and error budget policies
Problem Management Partnership: Collaborate with QA and Problem Management teams to identify systemic issues and drive preventative fixes
Vendor & Partner Coordination: Manage incident communication protocols with key partners (e.g., Amazon, TikTok, YouTube) to ensure rapid escalation and resolution
Qualifications:
7+ years in technical operations, site reliability engineering, DevOps, or incident management roles
3+ years in a program management or incident command capacity, with direct experience leading technical incident response
Proven track record of managing Sev1/Sev2 incidents in high-availability, customer-facing SaaS or AdTech environments
Experience coordinating cross-functional response teams during business-critical outages
Deep understanding of distributed systems, cloud infrastructure (GCP, AWS), and modern application architectures
Proficiency with monitoring and observability tools (Nagios, Prometheus, Grafana, Datadog, PagerDuty)
Familiarity with SQL, log analysis, and data integrity validation techniques
Knowledge of ITIL, SRE principles, SLO/SLI frameworks, and incident response best practices
Executive communication: Ability to brief C-level stakeholders during incidents, translating technical issues into business impact and risk exposure
Crisis management: Calm under pressure, with strong decision-making skills in ambiguous, high-stakes situations
Stakeholder management: Experience working with Product, Commercial, Legal, and Marketing teams to coordinate client/partner communications
Documentation & presentation: Strong written and verbal communication skills; ability to create clear, concise executive summaries and retrospectives
Cross-functional influence: Proven ability to drive alignment across Engineering, Product, and Business teams without direct reporting authority
Process optimization: Track record of improving incident response processes through automation, metrics, and continuous improvement
Cultural awareness: Ability to work effectively with globally distributed teams across multiple time zones
Executive presence: Serve as a trusted advisor to VP- and SVP-level stakeholders during critical incidents; translate technical complexity into actionable business insights
Preferred qualifications:
ITIL Foundation or Practitioner certification
Experience with AI-driven operations tools (Glean, ChatOps, automated runbook execution)
Background in AdTech, digital media, or data-intensive platforms
Familiarity with SOC 2, ISO 27001, or other compliance frameworks
Experience building and scaling incident management programs from the ground up