Who We Are Looking For
This is an L3 Support Engineer / Senior Platform Support Specialist position within the Generative AI Team. Our primary objective is to ensure the reliability, performance, and operational excellence of enterprise-grade Generative AI platforms spanning the Azure and AWS ecosystems, specifically Azure OpenAI, Azure AI Foundry, Databricks, and AWS Bedrock. As a member of the Global AI Team, you will serve as the highest tier of technical support and escalation. You will partner with development, infrastructure, and cloud engineering teams and with business stakeholders across regions to diagnose complex issues, drive root cause analysis, optimize platform performance, and ensure seamless availability of the Gen AI services that power critical business functions.
What You Will Be Responsible For
As an L3 Support Engineer, you will serve as the senior-most technical escalation point for all Gen AI platform issues, combining deep platform expertise with strong troubleshooting, communication, and cross-functional collaboration skills.
- Serve as the final escalation tier (L3) for complex production incidents related to Azure OpenAI, Azure AI Foundry, Databricks, and AWS Bedrock, driving resolution with urgency and precision.
- Perform advanced root cause analysis (RCA) on platform outages, performance degradation, model inference failures, token/rate-limiting issues, API errors, and integration breakdowns across multi-cloud Gen AI environments.
- Monitor, maintain, and optimize the health, performance, and cost-efficiency of Gen AI workloads across Azure and AWS, including model deployments, endpoints, fine-tuning pipelines, and orchestration workflows.
- Troubleshoot end-to-end architecture spanning API gateways, networking (VNets, VPCs, Private Endpoints, PrivateLink), IAM/RBAC policies, model serving infrastructure, vector databases, and data pipelines integrated with Gen AI services.
- Collaborate closely with L1/L2 support teams to develop runbooks, knowledge base articles, escalation procedures, and standard operating procedures (SOPs) to reduce mean time to resolution (MTTR) and enable shift-left support.
- Partner with Development, MLOps, and Infrastructure teams to identify recurring issues, recommend platform improvements, and contribute to the reliability and resilience of Gen AI solutions through proactive measures.
- Support deployment and release activities for Gen AI models and applications, including validation of model endpoints, API versioning, configuration changes, and rollback procedures in production environments.
- Manage and respond to incidents following ITIL-aligned incident and problem management processes, including SLA adherence, stakeholder communication, and post-incident reviews.
- Drive automation of support workflows by developing scripts, monitoring dashboards, alerting mechanisms, and self-healing capabilities to improve operational efficiency.
- Stay current with platform updates, deprecations, and new feature releases from Azure OpenAI, Azure AI Foundry, Databricks, and AWS Bedrock, assessing impact on existing deployments and advising teams accordingly.
- Work with Enterprise Architecture and Security teams to ensure Gen AI platform configurations conform to organizational standards, compliance requirements, and data governance policies.
- Provide technical guidance and mentorship to L1/L2 support engineers, fostering skill development and knowledge transfer across the support organization.
- Participate in on-call rotations and provide after-hours support for critical production issues as required.
These skills will help you succeed in this role:
- Deep hands-on expertise with Azure AI services, including Azure OpenAI Service (GPT model deployments, fine-tuning, content filtering, quota management), Azure AI Foundry (prompt flow, model catalog, evaluation tools), and Azure Cognitive Services.
- Deep hands-on expertise with AWS AI/ML services, including AWS Bedrock (foundation model access, custom model import, agents, knowledge bases, guardrails), and associated AWS services (Lambda, S3, IAM, CloudWatch, SageMaker).
- Strong working knowledge of Databricks, including Unity Catalog, MLflow, model serving endpoints, Delta Lake, Spark clusters, workflows/jobs, and integration with Gen AI pipelines.
- Proficiency in Python and scripting (Bash/PowerShell) for troubleshooting, log analysis, API testing, and automation of support tasks.
- Strong understanding of cloud networking, security, and identity management — VNets/VPCs, NSGs/Security Groups, Private Endpoints/PrivateLink, Azure AD/Entra ID, AWS IAM, RBAC, managed identities, and service principals.
- Experience with API troubleshooting — REST APIs, authentication/authorization flows (OAuth 2.0, API keys, SAS tokens), rate limiting, retry logic, and HTTP status code analysis.
- Familiarity with Infrastructure as Code (IaC) tools such as Terraform, ARM Templates, CloudFormation, or Bicep for understanding and troubleshooting deployed infrastructure.
- Experience with monitoring and observability tools — Azure Monitor, Application Insights, Log Analytics, AWS CloudWatch, CloudTrail, Databricks cluster logs, Grafana, or similar platforms.
- Working knowledge of containerization and orchestration — Docker, Kubernetes (AKS/EKS), and container-based model serving architectures.
- Understanding of LLM concepts and Gen AI architectures — prompt engineering, RAG (Retrieval Augmented Generation), embeddings, vector databases (e.g., Azure AI Search, Pinecone, FAISS), token management, and model lifecycle management.
- Experience with CI/CD pipelines and DevOps/MLOps practices — Git, Azure DevOps, GitHub Actions, Jenkins, and deployment automation relevant to AI/ML workloads.
- Strong knowledge of ITIL processes — Incident Management, Problem Management, Change Management, and Service Level Management.
- Excellent analytical skills, critical thinking, and structured problem-solving abilities with the capacity to work under pressure during high-severity incidents.
- Outstanding communication skills — ability to articulate complex technical issues clearly to both technical and non-technical stakeholders, and to produce high-quality incident reports and documentation.
- Strong organizational skills with the ability to manage multiple concurrent issues, prioritize effectively, and meet SLA commitments in a fast-paced, cross-functional environment.
Education & Qualifications
- Bachelor's Degree in Computer Science, Engineering, Information Technology, or a related field.
- 3+ years of experience in production support, platform engineering, cloud operations, or site reliability engineering (SRE), including at least 2 years focused on AI/ML or Gen AI platforms.
- Demonstrated experience supporting enterprise-scale cloud environments in financial services, consulting, or similarly regulated industries.
Certifications (Preferred — one or more of the following):
- Azure: AZ-104 (Azure Administrator), AZ-305 (Azure Solutions Architect), AI-102 (Azure AI Engineer), AZ-400 (Azure DevOps Engineer)
- AWS: AWS Solutions Architect Associate/Professional, AWS Machine Learning Specialty, AWS Cloud Practitioner
- Databricks: Databricks Certified Data Engineer, Databricks Certified Machine Learning Professional
- ITIL v4 Foundation or higher
Salary Range:
$70,000 - $118,750 Annual
The range quoted above applies to the role in the primary location specified. If the candidate would ultimately work outside of the primary location above, the applicable range could differ.
Employees are eligible to participate in State Street’s comprehensive benefits program, which includes: our 401(k) retirement savings plan with company match; insurance coverage including basic life, medical, dental, vision, long-term disability, and other optional additional coverages; paid time off including vacation, sick leave, short-term disability, and family care responsibilities; access to our Employee Assistance Program; incentive compensation including eligibility for annual performance-based awards (excluding certain sales roles subject to sales incentive plans); and eligibility for certain tax-advantaged savings plans.
For a full overview, visit https://hrportal.ehr.com/statestreet/Home.
About State Street
Across the globe, institutional investors rely on us to help them manage risk, respond to challenges, and drive performance and profitability. We keep our clients at the heart of everything we do, and smart, engaged employees are essential to our continued success.
We are committed to fostering an environment where every employee feels valued and empowered to reach their full potential. As an essential partner in our shared success, you’ll benefit from inclusive development opportunities, flexible work-life support, paid volunteer days, and vibrant employee networks that keep you connected to what matters most. Join us in shaping the future.
As an Equal Opportunity Employer, we consider all qualified applicants for all positions without regard to race, creed, color, religion, national origin, ancestry, ethnicity, age, disability, genetic information, sex, sexual orientation, gender identity or expression, citizenship, marital status, domestic partnership or civil union status, familial status, military and veteran status, and other characteristics protected by applicable law.
Discover more information on jobs at StateStreet.com/careers
Job Application Disclosure:
It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.