Job Summary
Synechron is seeking an experienced Site Reliability Engineer (SRE) to enhance the stability, resilience, and operational maturity of our critical Financial Crime and Transaction Monitoring platforms. This role is vital in embedding SRE best practices across observability, automation, incident management, and production support. The successful candidate will be responsible for proactively managing service health, reducing operational risks, and supporting regulatory-critical services, thereby enabling the organization to deliver reliable, scalable, and compliant solutions aligned with business objectives.

Software Requirements

Required:
- Strong understanding and hands-on experience managing production-grade systems with high reliability and availability requirements
- Expertise in SRE principles, including monitoring, logging, alerting, and defining and tuning SLOs/SLAs
- Proficiency with AWS services including EC2, S3, RDS, VPC, IAM, and CloudWatch (latest versions or equivalents)
- Linux system administration and troubleshooting skills for enterprise environments
- Experience with Oracle databases, including performance tuning, RAC, or RMAN in large data environments
- Automation scripting skills using Python and Shell (Bash/sh) for operational automation
- Experience with monitoring tools such as Prometheus, Grafana, ELK/EFK, and PagerDuty
- Familiarity with CI/CD tools like Jenkins, GitLab CI, or AWS CodePipeline

Preferred:
- Knowledge of OFSAA, Oracle Rules Engine, or ML-enabled platform support (e.g., TRACE)
- Infrastructure-as-Code tools such as CloudFormation or Terraform
- Experience with support for high-performance Oracle environments (performance tuning, RAC, RMAN)
- Exposure to cloud-native and containerized environments (Kubernetes, Docker)

Overall Responsibilities

Improve the reliability, availability, and recoverability of Financial Crime and Transaction Monitoring platforms.
Define, monitor, and manage SLIs/SLOs to proactively ensure service health and detect anomalies.
Provide Level 1 and Level 2 support for AWS and Oracle-based platforms, handling incident resolution and root cause analysis.
Build and sustain automation solutions for monitoring, logging, alerting, and operational workflows to reduce manual toil.
Lead incident response activities, conduct post-incident reviews, and implement preventative measures.
Develop, operate, and enhance CI/CD pipelines and infrastructure automation across environments.
Collaborate with engineering teams to design scalable, resilient, and secure systems; participate in capacity planning and performance tuning.
Support deployment, patching, and configuration changes, ensuring compliance with policies and standards.
Maintain comprehensive documentation of operational procedures, configurations, and incident resolutions.
Lead continuous process improvements to enhance system reliability, operational efficiency, and compliance adherence.

Technical Skills (By Category)

Systems & Support (Essential):
- Enterprise-level system operation and support for AWS and Oracle environments
- Linux system administration and troubleshooting
- Incident management and escalation procedures

Monitoring & Automation (Essential):
- Monitoring and alerting using Prometheus, Grafana, ELK/EFK, CloudWatch
- Automation scripting with Python and Shell for operational tasks and event handling

Cloud & Infrastructure (Preferred):
- Cloud deployment, scaling, and management (AWS, Azure, GCP)
- Infrastructure-as-Code (Terraform, CloudFormation)

Databases/Data Management (Essential):
- Oracle database management, performance tuning, and recovery
- Data extraction and validation for high-volume transactional data

Development Tools & Methodologies (Essential):
- Jenkins, GitLab CI, AWS CodePipeline for CI/CD pipelines
- Version control with Git

Experience Requirements

Minimum of 8 years supporting high-availability, mission-critical enterprise systems, particularly in financial services or comparable regulated environments.
Proven experience supporting Oracle databases, Oracle RAC, or RMAN in a high-volume context.
Strong background in enterprise support for Financial Crime and Transaction Monitoring platforms.
Demonstrated ability to lead operational support teams, manage incident escalations, and implement automation solutions.
Experience in cloud-native architecture, infrastructure automation, and observability tools.
Experience providing support under regulatory and audit constraints is preferred.

Day-to-Day Activities

Monitor platform dashboards, logs, and alerts to ensure system health and performance.
Proactively troubleshoot and resolve incidents related to operational, performance, or security issues.
Conduct root cause analysis, document incident reports, and lead corrective action plans.
Automate routine operational tasks, alerts, and workflows to improve efficiency.
Collaborate with platform engineers, developers, and security teams on change management and capacity planning.
Participate in on-call rotations, incident reviews, and readiness exercises.
Continuously evaluate and recommend tools, procedures, and automation that improve reliability and reduce manual intervention.
Maintain detailed documentation of configurations, procedures, and lessons learned.

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related discipline.
8+ years supporting enterprise-scale, high-availability systems with a focus on operational excellence.
Experience supporting regulatory-critical platforms in financial services, especially in Fraud, Risk, or Transaction Monitoring.
Certifications in cloud platforms (AWS Certified Solutions Architect, Azure) and SRE foundations (Google SRE or equivalent) are advantageous.
Proven track record of automation, incident management, and operational improvements.

Professional Competencies

Critical thinking and analytical skills to diagnose and resolve complex operational issues.
Leadership and team management skills to guide operational teams and support team development.
Effective communication for stakeholder reporting, incident updates, and cross-team collaboration.
Ability to work under pressure, prioritize multiple tasks, and meet strict SLAs.
Adaptability to evolving technology landscapes and regulatory requirements.
Focus on continuous improvement, automation, and operational excellence.

SYNECHRON’S DIVERSITY & INCLUSION STATEMENT

Diversity & Inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and an affirmative action employer. Our Diversity, Equity, and Inclusion (DEI) initiative ‘Same Difference’ is committed to fostering an inclusive culture – promoting equality, diversity and an environment that is respectful to all. As a global company, we strongly believe that a diverse workforce helps build stronger, more successful businesses. We encourage applicants from diverse backgrounds and of any race, ethnicity, religion, age, marital status, gender, sexual orientation, or disability to apply. We empower our global workforce by offering flexible workplace arrangements, mentoring, internal mobility, learning and development programs, and more.

All employment decisions at Synechron are based on business needs, job requirements and individual qualifications, without regard to the applicant’s gender, gender identity, sexual orientation, race, ethnicity, disabled or veteran status, or any other characteristic protected by law.

Candidate Application Notice

Site Reliability Engineer (SRE) with AWS, Oracle, and Automation Expertise
