Wells Fargo & Company

Principal Engineer - Site Reliability Engineering and AIOps

Hyderabad, India | Full time

About this role:

We are looking for a Principal Engineer to set the enterprise technical direction for Site Reliability Engineering and AIOps within Enterprise Functions Technology (EFT). This is a hands-on architecture and engineering leadership role responsible for defining the reliability strategy, reference architectures, and engineering standards across a large application portfolio and multiple lines of business. You will drive cross-organization adoption of SLOs/error budgets, full-stack observability, incident and problem management rigor, and automation-first operations, ensuring reliability is designed into the software delivery lifecycle and operating model. Success is defined by measurable outcomes at scale: improved availability and resiliency of critical journeys, fewer customer-impacting incidents, reduced operational toil, and faster, safer recovery, delivered through modern engineering practices, data-driven decision-making, and platform capabilities.


In this role, you will:

  • Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups

  • Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking

  • Translate advanced technology experience, an in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organizational structure, and strategic technological opportunities and requirements into technical engineering solutions

  • Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions

  • Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization

  • Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership


Required Qualifications:

  • 7+ years of engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education

Desired Qualifications:

  • 7+ years of engineering experience, including principal-level technical leadership on large-scale reliability, production operations, or platform programs across complex environments.
  • 7+ years of software engineering experience (e.g., Java, C#, Python) with demonstrated expertise in system design and distributed systems; track record of delivering reusable automation and platform capabilities adopted by multiple teams.
  • 5+ years operating Linux/Unix and Windows platforms in production, including performance tuning, capacity planning, and reliability hardening for mission-critical services.
  • 5+ years designing and operating cloud solutions (public and/or private cloud), including reliability and security architecture, infrastructure-as-code, and cost-aware engineering at scale.
  • 5+ years leading reliability and operations practices for enterprise-scale, highly available services, including major incident leadership, problem management, and establishing operational readiness mechanisms.
  • 5+ years architecting and scaling full-stack observability solutions, including instrumentation standards, alert strategy, service dashboards, and governance that improves signal quality and reduces noise.
  • 5+ years with automation and observability toolsets (e.g., Ansible, Grafana, Elastic, Splunk, Prometheus) and experience building reusable components, templates, and paved paths integrated with CI/CD.
  • Exceptional communication and influence skills, including the ability to align senior stakeholders, drive technical decisions across organizations, and clearly articulate risk, tradeoffs, and recommended paths forward.
  • Deep experience applying advanced analytics and/or AI/ML to production operations (AIOps), including model monitoring, drift/quality controls, explainability, and risk/compliance alignment.
  • Experience leading complex, cross-team delivery using Agile/Scrum and/or Kanban, including establishing operating mechanisms (reviews, readiness gates, metrics cadences) that scale.
  • Experience defining reliability and resiliency architecture for regulated environments, including alignment with information security, audit, and risk management partners.
  • Demonstrated technical thought leadership (e.g., internal engineering community leadership, reference implementations, publications/patents, or speaking engagements).

Job Expectations:

  • Set and evangelize the SRE and AIOps technical strategy for EFT, establishing reference architectures, standards, and guardrails (service tiering, onboarding criteria, SLO/error budget governance) and holding teams accountable through transparent executive-level reporting.
  • Act as a principal-level technical advisor and multiplier: mentor senior engineers, contribute to hiring and technical bar-raising, and define reliability patterns and guardrails across applications, networks, databases, operating systems, and web technologies.
  • Own the reliability and observability architecture across hybrid/multi-cloud, driving standardization of monitoring, logging, tracing, synthetics, and resilience/chaos testing; define platform patterns that teams can adopt with minimal friction.
  • Design and implement AIOps and automation platforms (event correlation, anomaly detection, runbook automation, self-healing) with strong engineering discipline (testability, auditability, change safety) and prioritize initiatives that materially reduce incident volume, toil, and MTTR.
  • Define the reliability measurement system (SLIs/SLOs, error budgets, customer impact, MTTR/MTBF, change failure rate) and build reusable dashboards and alerts that drive consistent prioritization, investment decisions, and engineering behavior across teams.
  • Provide technical leadership during major incidents for critical services, driving rapid triage, clear stakeholder communications, and cross-domain coordination; institutionalize blameless post-incident reviews and engineering mechanisms that eliminate systemic causes.
  • Partner with application, platform, and architecture leaders to embed reliability into planning and delivery (design and architecture reviews, operational readiness gates, non-functional requirements, capacity/performance engineering), influencing roadmaps based on quantified risk and customer impact.
  • Lead multi-quarter, cross-organization reliability transformations (e.g., platform modernization, resilience programs, observability convergence), delivering reusable capabilities and operating mechanisms that improve reliability posture and reduce operational risk at scale.
  • Ability to travel up to 10%, as needed for stakeholder engagement, program delivery, and operational reviews.
  • Hybrid work expectation: work from the office three days per week at one of the listed locations, aligned to team and business needs.

Posting End Date: 10 May 2026

*Job posting may come down early due to volume of applicants.

We Value Equal Opportunity

Wells Fargo is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other legally protected characteristic.

Employees support our focus on building strong customer relationships balanced with a strong risk-mitigating and compliance-driven culture, which firmly establishes those disciplines as critical to the success of our customers and company. They are accountable for execution of all applicable risk programs (Credit, Market, Financial Crimes, Operational, Regulatory Compliance), which includes effectively following and adhering to applicable Wells Fargo policies and procedures, appropriately fulfilling risk and compliance obligations, timely and effective escalation and remediation of issues, and making sound risk decisions. There is emphasis on proactive monitoring, governance, risk identification and escalation, as well as making sound risk decisions commensurate with the business unit’s risk appetite and all risk and compliance program requirements.

Candidates applying to job openings posted in Canada: Applications for employment are encouraged from all qualified candidates, including women, persons with disabilities, aboriginal peoples and visible minorities. Accommodation for applicants with disabilities is available upon request in connection with the recruitment process.

Applicants with Disabilities

To request a medical accommodation during the application or interview process, visit Disability Inclusion at Wells Fargo.

Drug and Alcohol Policy

 

Wells Fargo maintains a drug-free workplace. Please see our Drug and Alcohol Policy to learn more.

Wells Fargo Recruitment and Hiring Requirements:

a. Third-party recordings are prohibited unless authorized by Wells Fargo.

b. Wells Fargo requires you to directly represent your own experiences during the recruiting and hiring process.