MX

Senior Director - Cloud Platform Engineering

Lehi, UT Full time

Life at MX

We are driven by our moral imperative to advance mankind - and it all starts with our people, product and purpose. We always carry a deep sense of drive and passion with us. If you thrive in a challenging work environment, surrounded by incredible team members who will help you grow, MX is the right place for you.

Come build with us and be part of an award-winning company that’s helping create meaningful and lasting change in the financial industry.

About the Role

We are looking for a US-based Senior Director to serve as the strategic, operational, execution, and escalation owner for all site, infrastructure, and cloud platform services.

This role is personally accountable for production reliability and stability, including owning US time-zone incidents and Sev 0/1 events, leading cutovers, and directly representing site, infrastructure, and platforms to executive leadership during high-impact events. The expectation is that this leader stands at the front of the line during critical incidents and events such as migration and stabilization, makes real-time decisions, and clearly articulates risk, impact, and trade-offs to executives under pressure. This front-line ownership is intentional but transitional. A core measure of success for this role is building the systems, operating model, delegation structure, and strong leadership bench such that sustained, high-quality operations do not depend on the continuous personal presence of a single leader. The leader is expected to design for leverage: establishing clear ownership, developing managers and leaders, and embedding practices that scale reliability beyond individual heroics.

In parallel, this leader is expected to lead the full lifecycle of our infrastructure transformation, from data center exit and AWS migration through steady-state cloud operations and platform maturity. Success is measured not just by completing the migration, but by leaving behind a durable operating model with clear delegation, on-call ownership, and predictable executive engagement. The ideal candidate will have personally led large-scale data center exits and cloud migrations, not just advised on or governed them.

What Success Looks Like in 12–24 Months

  • 100% exit from on-premise data centers, with all targeted workloads successfully migrated to AWS and on-prem dependencies fully decommissioned.

  • A clear, stable post-migration operating model in place, with unambiguous ownership across teams.

  • Consistent 99.99%+ availability for platform and infrastructure services, with active error budget management guiding operational and delivery decisions.

  • Fewer Sev 0 and Sev 1 incidents, with a measurable reduction in customer-impacting events and improved predictability of recovery.

  • Improved incident KPIs, including faster MTTR and reduced incident recurrence.

  • Declining operational toil through automation, standardization, and self-service platform capabilities.

  • Mature incident management practices, including blameless postmortems and systemic remediation of root causes.

  • A strong leadership bench providing resilient production coverage, confident incident leadership, and effective delegation.

  • Improved cost efficiency and visibility across cloud infrastructure post-migration through FinOps practices, capacity right-sizing, and platform standardization.

Job Duties

Lead Data Center Exit & AWS Migration

  • Personally own and execute the end-to-end data center exit and AWS migration, from discovery and planning through cutover, stabilization, and full decommissioning.

  • Define migration waves, readiness gates, and cutover plans with explicit transition into steady-state ownership, avoiding temporary or parallel operating models.

  • Own architectural decisions across AWS networking, compute, storage, security, and observability, ensuring designs are operable, supportable, and resilient post-migration.

  • Establish and own the post-migration operating model for cloud infrastructure and platforms, explicitly tied to outcomes:

    • Clearly defined SLIs, SLOs, and error budgets for all Tier-1 and Tier-2 services

    • Accountable owners for SLO attainment across SRE, platform, and product teams

    • On-call and escalation models that provide durable time-zone coverage

    • Ongoing cost efficiency and optimization

    • Incident response, change management, and release practices aligned to reliability targets

    • Post-migration roadmap for platforms

  • Hold teams accountable for post-migration reliability metrics, including:

    • SLO compliance and error budget burn,

    • Sev 0 / Sev 1 incident frequency and customer impact,

    • MTTR and incident recurrence rates.

  • Ensure migration execution does not introduce long-term operational debt, and that workloads transition cleanly into measured, observable, and well-owned cloud operations.

  • Lead physical and logical data center decommissioning only after post-migration SLOs are consistently met and incident KPIs have stabilized.

Build and Evolve the Cloud Platform

  • Own the vision, roadmap, and execution for the company’s cloud platform, ensuring it supports both migration needs and long-term, steady-state operations on AWS.

  • Own core platform capabilities and tooling strategy, such as Kubernetes (EKS), CI/CD pipelines, infrastructure-as-code, identity and access management, secrets management, observability, and disaster recovery.

  • Deliver self-service, opinionated platform services that improve developer productivity while meeting security and reliability standards.

  • Modernize legacy systems and architect for multi-tenant SaaS: enable secure and efficient scaling across tenants in AWS, with attention to cost, compliance, and observability.

  • Drive platform standardization to reduce fragmentation, operational toil, and cognitive load for product engineering teams.

  • Partner closely with application and product engineering to ensure the platform accelerates delivery while maintaining reliability and compliance.
     

Incident Management, SRE & Operational Resilience Leadership

  • Own and evolve the end-to-end incident management lifecycle for infrastructure and platform services, grounded in SRE principles of reliability, learning, and automation.

  • Define and enforce SLIs, SLOs, and error budgets for platform and infrastructure services, using them to guide operational decisions, release risk, and incident response.

  • Operate on a clear severity framework (Sev 0/1/2) with explicit ownership, escalation paths, and decision rights.

  • Lead the transition from incident response as heroics to incident prevention by design, embedding reliability, AI, capacity planning, and failure-mode analysis into platform roadmaps and change processes.

  • Serve as the executive escalation owner for Sev 0 and Sev 1 incidents, personally leading response, trade-off decisions, and executive communications when required, while delegating incident command to empowered leaders to ensure sustained coverage.

  • Hold clear decision authority under pressure, including the ability to unilaterally halt or roll back changes, trigger failovers, traffic shifts, and disaster recovery actions, reallocate engineering resources in demanding situations, and make go/no-go cutover decisions to protect customers and data, while escalating to executive leadership when actions materially impact regulatory posture, contractual commitments, or significant financial exposure.

  • Build and maintain a US-based SRE and incident leadership bench, with multiple leaders capable of acting as Incident Commander, owning executive updates, and coordinating cross-functional response.

  • Lead through error budgets and reliability signals to drive blameless postmortems, root-cause analysis, and prioritization of systemic fixes over short-term feature velocity.

  • Own the systematic reduction of operational toil and capacity tax across infrastructure and platform teams, with clear accountability for ensuring reactive work declines as systems mature. 

  • Hold teams accountable to measurable toil and resilience KPIs, such as percentage of engineer time spent on reactive work, on-call interrupt frequency, manual intervention rates, and incident recurrence.

  • Influence readiness through game days, chaos testing, and migration-specific drills, validating both technical resilience and delegation models under pressure.

  • Ensure incident management tooling, observability (metrics, logs, traces), and documentation are standardized, well-owned, and continuously improved.
     

Program, Stakeholder, and Executive Leadership

  • Partner with product, engineering, security, enterprise architecture, and finance to shape cloud migration and platform decisions that directly impact cost-to-serve, unit economics, and operational overhead, ensuring infrastructure choices scale sustainably with business growth.

  • Drive architectural and platform standards that reduce total cost of ownership, including infrastructure spend, support burden, reliability overhead, and on-call load.

  • Embed FinOps and reliability signals (utilization, reliability cost, incident-driven spend, operational toil) into platform roadmaps and migration sequencing, making trade-offs explicit between performance, resilience, speed, and cost.

  • Translate infrastructure and platform choices into clear business outcomes such as per-customer cost, per-transaction cost, and support effort, enabling executives to make informed investment and prioritization decisions.

  • Act as a trusted advisor on infrastructure and cloud strategy, challenging assumptions and translating complex technical risks into clear business impact, options, and trade-offs to enable informed decision-making under pressure.

  • Build and delegate clear ownership and accountability for cloud migration timelines, risks, and outcomes.

  • Establish clear governance, readiness reviews, and success metrics for migration and platform initiatives.

  • Partner with and guide steering committees, technical working groups, and cross-organizational readiness forums.

People and Organization Leadership

  • Own the design, scale, and effectiveness of the Cloud Platform Engineering organization, including SRE, cloud infrastructure, and platform engineering teams across geographies.

  • Build and lead a strong leadership bench, developing senior managers, principal engineers, and architects who can operate independently at scale.

  • Clearly define delegation, decision rights, and escalation paths so that critical incidents, migrations, and operational responsibilities are owned at the right level.

  • Drive organizational clarity across charters, roles, responsibilities, and decision rights to reduce friction and increase delivery velocity.

  • Actively recruit, retain, and develop top-tier infrastructure, SRE, and platform talent, including succession planning for critical roles.

  • Establish a culture of engineering excellence, reliability, and continuous improvement, grounded in data, post-incident learning, and blameless accountability.

  • Lead change management during periods of transformation, including data center exit, cloud migration, and operating model shifts.

  • Foster strong partnerships with product, application engineering, security, and business leaders, ensuring platform teams are seen as strategic enablers rather than service providers.

  • Champion diversity of thought, inclusive leadership, and high team engagement across a growing, global organization.
     

Role Requirements 

  • 15+ years of experience in infrastructure, cloud, SRE, or platform engineering.

  • 7+ years leading large engineering organizations (managers of managers or equivalent).

  • Direct, hands-on leadership of at least one full data center exit and AWS migration, including decommissioning of on-premise infrastructure.

  • Deep technical expertise in AWS, including VPC networking, EC2, EKS/Kubernetes, RDS/Aurora, S3, IAM, and observability tooling.

  • Strong experience operating highly available, distributed systems using SRE principles.

  • Proven ability to lead complex, high-risk infrastructure transformations in production environments.

  • Expertise in FinOps and cloud cost optimization practices.

  • Demonstrated ability to drive standards and adoption across distributed engineering teams without relying on reporting lines.

  • Skilled at operating as a front-line executive leader during critical situations, including migrations, upgrades, DR, incidents, and major production events.

 

At MX, we are a high-performance organization that thrives on trust and results. This role is based in Lehi, Utah, with flexibility for both in-office and remote work. We believe in empowering our team members to deliver exceptional outcomes while taking advantage of our incredible office space when it best supports their work. Our Utah office features onsite perks such as company-paid meals, massage therapists, a sports simulator, gym, mother’s lounge, and meditation room, as well as meaningful interactions with amazing people. We encourage team members to come together in the office to collaborate, kick off key projects, or strategize cross-functionally, fostering connection and innovation.

MX is proudly committed to recruiting and retaining a diverse and inclusive workforce. As an Equal Opportunity Employer, we never discriminate based on race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, age, military or veteran status, status as an individual with a disability, or other applicable legally protected characteristics. We particularly welcome applications from veterans and military spouses. All your information will be kept confidential according to EEO guidelines. You may request reasonable accommodations by sending an email to hr@mx.com.