About the Role
We're seeking an exceptional Site Reliability Engineering Architect to lead the technical vision and operational excellence of our enterprise GenAI platform, which serves 180,000+ Citi employees globally. This is a senior individual contributor role for someone who wants to architect intelligent, self-healing infrastructure at the intersection of AI and reliability engineering, without the overhead of people management.
You'll work with cutting-edge AI infrastructure including Claude, Gemini, and proprietary Citi models running on OpenShift/Kubernetes, building the next generation of AI-Ops capabilities that transform traditional operations into intelligent, autonomous systems.
About Our Team
Our team operates like a research-driven startup within Citi, rapidly innovating on AI operations while maintaining enterprise-grade reliability, security, and compliance. We build and operate Citi Stylus Workspaces and other mission-critical GenAI platforms that demand exceptional reliability, security, and performance at global scale.
What You'll Do
Platform Architecture & Reliability
- Design and architect highly available, GPU-accelerated OpenShift clusters optimized for GenAI workloads
- Build Model-as-a-Service platforms enabling seamless LLM hosting, inference, and lifecycle management
- Architect multi-cluster, multi-region infrastructure supporting global AI platform availability (99.9%+ SLA)
- Implement intelligent resource scheduling and optimization for GPU workloads and AI inference engines
AI-Ops & Intelligent Automation
- Design and implement agentic AI workflows for automated incident detection, diagnosis, and remediation
- Build Model Context Protocol (MCP) integrations enabling AI-driven operational decision-making
- Create self-healing systems leveraging log analysis, anomaly detection, and automated remediation pipelines
- Transform operational toil into intelligent automation that learns and adapts
Observability & Performance
- Design and implement comprehensive observability stacks built on Prometheus and Grafana that provide deep visibility into AI workloads
- Build custom metrics, exporters, and dashboards for LLM-specific monitoring (token throughput, inference latency, GPU utilization)
- Establish SLO/SLI frameworks and error budget management for AI services
- Drive performance optimization through data-driven analysis
Platform Engineering & GitOps
- Architect and deploy OpenShift operators for AI/ML workloads (OpenShift AI, NVIDIA GPU Operator, Knative)
- Design custom Kubernetes operators and controllers for platform-specific automation needs
- Architect and maintain GitOps-driven deployment pipelines for multi-cluster AI infrastructure
- Manage cluster lifecycle operations including upgrades, patching, and capacity expansion
Technical Leadership
- Define technical vision and roadmap for GenAI platform reliability and operational excellence
- Lead production incident response, root cause analysis, and blameless post-mortem processes
- Provide technical mentorship to SRE and DevOps teams on advanced automation and AI-Ops practices
- Partner with engineering, security, and business leaders to align infrastructure strategy with organizational objectives
What You Bring
Core Technical Expertise (Must-Have)
OpenShift & Kubernetes Mastery
- 5+ years of expert-level OpenShift 4.x administration and architecture experience
- 5+ years of deep Kubernetes expertise, including custom operators, controllers, and CRDs
- Hands-on experience with Red Hat Advanced Cluster Management (RHACM) and multi-cluster operations
- Experience designing and implementing Kubernetes operators using Operator SDK or similar frameworks
AI/ML Infrastructure & Operations
- Practical experience deploying and operating AI/ML platforms (OpenShift AI, Kubeflow, or similar)
- Knowledge of GPU cluster provisioning, NVIDIA GPU Operator, and accelerated computing workloads
- Understanding of LLM inference optimization and model serving frameworks (vLLM, TensorRT, ONNX)
- Experience with Model-as-a-Service architectures and MLOps lifecycle management
Automation & Infrastructure as Code
- 5+ years of expert-level experience with Terraform and Ansible for infrastructure provisioning and configuration management
- Strong scripting skills in Python, Bash, and PowerShell for automation and tooling
- Experience with GitOps workflows and declarative infrastructure management
- Proficiency with Helm charts and Kubernetes manifest templating
Observability & Reliability Engineering
- Deep expertise in Prometheus, Grafana, and metrics-driven reliability engineering
- Experience designing custom metrics, exporters, and dashboards for specialized workloads
- Knowledge of distributed tracing and log aggregation (Splunk or similar)
- Understanding of SLO/SLI frameworks and error budget management
Cloud & Hybrid Infrastructure
- Experience with AWS and Azure cloud platforms and hybrid cloud architectures
- Knowledge of GPU instance types and cost optimization strategies
- Understanding of cloud-native networking, storage, and security patterns
- Familiarity with vSphere and on-premises virtualization platforms
Emerging AI-Ops Capabilities (Highly Valued)
- Experience implementing agentic AI workflows and autonomous remediation systems
- Knowledge of Model Context Protocol (MCP) or similar AI orchestration frameworks
- Practical experience with AI-driven anomaly detection and predictive analytics
- Familiarity with serverless frameworks (Knative) and event-driven architectures
Professional Experience
- 15+ years of overall infrastructure, DevOps, or SRE experience
- 5+ years in senior SRE, DevOps Architect, or Platform Engineering leadership roles
- 5+ years of hands-on experience with OpenShift/Kubernetes in production environments
- 3+ years of practical experience with AI/ML infrastructure and operations
- Experience managing enterprise-scale platforms (100,000+ users, multi-region deployments)
- Track record of successfully delivering complex infrastructure modernization projects
- Experience operating in regulated industries (finance, healthcare, government)
Nice to Have
- Experience with Go programming language for building operators, controllers, or automation tools
- Familiarity with CI/CD and source-control tools (Jenkins, Bitbucket, Git)
- Experience with service mesh implementations (Istio)
- Understanding of enterprise security frameworks and compliance requirements (SOC 2, PCI DSS)
- Experience with secrets management (Vault or similar)
- Knowledge of policy-as-code frameworks (OPA, Kyverno)
Who You Are
Beyond technical skills, you are:
- An innovative problem solver who transforms complex operational challenges into scalable solutions
- Passionate about AI-Ops and leveraging AI to revolutionize traditional reliability engineering
- A hands-on technical leader comfortable diving deep into technical details while maintaining a strategic perspective
- Relentlessly focused on eliminating toil through intelligent automation
- Data-driven, with strong analytical skills and the ability to use metrics to drive improvements
- An excellent communicator able to articulate complex technical concepts to diverse audiences
- Collaborative, with experience working across teams (engineering, security, business)
- Curious about emerging technologies, with a commitment to staying current
- Pragmatic, with the ability to balance ideal solutions against practical constraints and timelines
- Calm under pressure with strong troubleshooting and crisis management skills
------------------------------------------------------
Job Family Group: Technology
------------------------------------------------------
Job Family: Architecture
------------------------------------------------------
Time Type: Full time
------------------------------------------------------
Most Relevant Skills
Please see the requirements listed above.
------------------------------------------------------
Other Relevant Skills
For complementary skills, please see above and/or contact the recruiter.
------------------------------------------------------
Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.
If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.
View Citi’s EEO Policy Statement and the Know Your Rights poster.