Job Description
Position Title: Site Reliability Engineer (SRE)
Department: Group ICT – Infrastructure
Division: AirAsia Aviation Group
Location: RedQ
About the Department
Group ICT – Infrastructure, AirAsia Aviation Group
We architect and govern the core technology framework that supports AirAsia's business and operational objectives. Our team delivers highly resilient, scalable infrastructure services, ensuring operational continuity and providing strategic support across the entire aviation group.
Key Responsibilities
Manage and maintain Kubernetes infrastructure (preferably Google Kubernetes Engine – GKE) to ensure system uptime, stability, and resilience.
Monitor, analyze, and manage system performance using Grafana and Prometheus.
Administer and manage GitLab, including version control, CI/CD pipelines, and integrations.
Implement automation and configuration management using scripting.
Develop and maintain automation scripts using Bash and PowerShell.
Manage and support cloud environments (preferably Google Cloud Platform – GCP).
Conduct system debugging, troubleshooting, and performance optimization.
Collaborate with internal teams to ensure service reliability, scalability, and operational efficiency.
Requirements
Must Have
Proven experience managing Kubernetes infrastructure (preferably GKE).
Experience managing GitLab and CI/CD pipelines.
Understanding of API Gateways (Apigee, Kong).
Proficiency in Bash, PowerShell, and Python scripting.
Practical experience with cloud platforms (GCP preferred).
Exposure to AI tools (Gemini, Cursor, GPT, etc.).
At least 2 years of relevant experience.
Good to Have
Familiarity with Cloudflare services.
Hands-on experience with monitoring tools such as Grafana and Prometheus.
Experience with Terraform for Infrastructure as Code (IaC).
Strong knowledge of Ansible for automation and configuration management.
Hands-on experience with Helm in Kubernetes environments.
Personal Attributes
Analytical and detail-oriented with strong problem-solving skills.
Proactive and self-driven with the ability to work under minimal supervision.
Strong sense of ownership and accountability.
Committed to continuous learning and process improvement.
Excellent debugging and troubleshooting skills.
Strong communication and teamwork abilities.