Continue to make an impact with a company that is pushing the boundaries of what is possible. At NTT DATA, we are renowned for our technical excellence, leading innovations, and making a difference for our clients and society. Our workplace embraces diversity and inclusion – it’s a place where you can continue to grow, belong, and thrive.

Your career here is about believing in yourself and seizing new opportunities and challenges. It’s about expanding your skills and expertise in your current role and preparing yourself for future advancements. That’s why we encourage you to take every opportunity to further your career within our great global team.

Your day at NTT DATA
The Site Reliability Engineer (SRE) is a seasoned subject matter expert, responsible for ensuring the reliability, availability, and performance of company systems and infrastructure.

This Site Reliability Engineer (SRE) works closely with development teams, operations teams, and other stakeholders to enhance system resiliency, automate processes, and improve overall system reliability.

Key responsibilities:

Monitors system health, performance metrics, and alerts to identify and respond to incidents promptly. Diagnoses issues, troubleshoots problems, and restores services in a timely manner.
Implements incident response processes to minimize downtime and improve system availability.
Designs, develops, and maintains automation tools, scripts, and processes to streamline system management tasks, deployments, and configuration changes.
Implements infrastructure-as-code principles to ensure consistency and repeatability.
Optimizes system resources, configurations, and processes to enhance performance, scalability, and efficiency.
Uses monitoring tools and performance testing to identify bottlenecks and implement optimizations.
Collaborates with teams to forecast system resource needs, plans for capacity growth, and ensures adequate scalability.
Leads incident response efforts, coordinates with cross-functional teams, and drives the resolution of system issues.
Performs thorough post-incident analysis to identify root causes and implements preventive measures to minimize future incidents.
Identifies opportunities for automation and drives the implementation of self-healing, monitoring, and deployment automation tools and frameworks.
Continuously improves operational efficiency, system reliability, and availability through process enhancements and automation.
Ensures consistency across environments, tracks changes, and enforces configuration standards.
Works closely with development teams, operations teams, and other stakeholders to ensure effective collaboration, knowledge sharing, and alignment on reliability goals.
Implements security best practices, works with security teams to assess and address vulnerabilities, and ensures compliance with security standards and regulations.
Performs any other related task as required.

To thrive in this role, you need to have:

Seasoned technical expertise in Linux/Unix systems, networking, and system administration.
Seasoned proficiency in scripting or programming languages, such as Python, Go, Java, or Ruby.
Seasoned knowledge of cloud platforms (such as AWS, Azure, or Google Cloud) and associated services.
Proven expertise in performance monitoring, optimization, and troubleshooting using tools such as Prometheus, Grafana, or New Relic.
Seasoned expertise in incident management, root cause analysis, and post-incident reviews.
Excellent problem-solving and analytical skills, with keen attention to detail.
Excellent communication, collaboration, and leadership skills.
Seasoned ability to optimize system performance, scalability, and reliability. Experience with performance monitoring and tuning tools (for example, Prometheus, Grafana, or New Relic) to identify bottlenecks, analyze performance data, and implement optimization strategies.
Seasoned understanding of security principles, best practices, and compliance requirements. Experience in designing and implementing security controls, performing security assessments, and ensuring compliance with industry standards.

Academic qualifications and certifications:

Bachelor's degree or equivalent in Computer Science, Information Technology, or a related field.
Relevant certifications, such as AWS Certified DevOps Engineer - Professional, Google Cloud Professional DevOps Engineer, or Certified Kubernetes Administrator (CKA) preferred.

Required experience:

Seasoned hands-on experience in a Site Reliability Engineering role or related roles, including experience in designing and maintaining highly available and scalable systems.
Seasoned hands-on experience with Linux/Unix systems, networking, and system administration is crucial. In-depth knowledge of cloud platforms (such as AWS, Azure, or Google Cloud) and associated services is essential.
Seasoned proficiency in multiple programming languages like Python, Java, Go, or Ruby is important for developing and maintaining automation tools, frameworks, and complex system integrations. Expertise in scripting languages like Bash or PowerShell is beneficial.
Seasoned understanding of complex infrastructure architectures, including scalable and fault-tolerant designs. Experience with infrastructure-as-code tools (such as Terraform or CloudFormation) and containerization technologies (such as Docker or Kubernetes) is essential.
Seasoned experience in designing and implementing robust automation frameworks, CI/CD pipelines, and deployment strategies. Proficiency in tools like Jenkins, GitLab CI/CD, or CircleCI to build, test, and deploy applications with a focus on reliability and scalability.
Seasoned experience in incident management, troubleshooting complex system issues, and conducting post-incident analysis. Advanced ability to lead incident response efforts, drive root cause analysis, and implement preventive measures.
Seasoned understanding of DevOps principles, Agile methodologies, and a strong commitment to continuous improvement and learning. Experience in promoting a DevOps culture and driving the adoption of best practices.

Workplace type:

On-site Working

Equal Opportunity Employer
NTT DATA is proud to be an Equal Opportunity Employer with a global culture that embraces diversity. We are committed to providing an environment free of unfair discrimination and harassment. We do not discriminate based on age, race, colour, gender, sexual orientation, religion, nationality, disability, pregnancy, marital status, veteran status, or any other protected category. Accelerate your career with us. Apply today.

