Design, architect, and implement scalable, secure, and highly available cloud infrastructure on AWS across multi-account, multi-region environments.
Define and enforce cloud architecture standards, best practices, and governance policies using AWS Organizations, Control Tower, and SCPs.
Build and maintain Infrastructure as Code (IaC) using Terraform and AWS CloudFormation, including reusable modules consumed across all product teams.
Optimize cloud environments for cost, performance, and reliability, and own FinOps practices including Savings Plans, Spot strategy, and Graviton adoption.
Collaborate with engineering, data, and security teams to build resilient distributed systems.
Drive innovation and continuous improvement initiatives across the platform.
Design, deploy, and manage multi-tenant production EKS clusters at financial-services scale.
Plan and execute cluster upgrades, patching, and Kubernetes version lifecycle management with zero customer impact.
Build and maintain internal Helm chart libraries and GitOps-driven cluster configuration using ArgoCD or Flux.
Implement zero-trust network principles and enforce IAM least-privilege across all AWS accounts.
Drive SRE practices: define and enforce SLOs for platform services such as EKS and API Gateway.
Lead incident response, postmortem analysis, and blameless RCA processes for platform-level outages.
Design and run chaos engineering exercises and disaster recovery tests across Availability Zones and Regions.
Partner with software engineering teams to deliver end-to-end solutions from design through production.
Evaluate new AWS services and open-source tooling to continuously improve infrastructure capabilities.
Required Qualifications
Strong, hands-on experience with AWS cloud services, including EC2, VPC, IAM, EKS, S3, CloudWatch, API Gateway, and Route 53.
Proven experience operating Amazon EKS in production: cluster lifecycle, RBAC, IRSA, node groups, and autoscaling.
Proficiency in Infrastructure as Code with Terraform and AWS CloudFormation.
Solid understanding of containerization: Docker, Kubernetes architecture, and container lifecycle management.
Experience with monitoring and logging tools: Prometheus, Grafana, Dynatrace, OpenSearch, ELK/Loki.
Strong Linux/Unix systems administration and scripting in Bash, Python, or similar.
Deep knowledge of cloud security best practices: IAM, RBAC, secrets management, and network security.