Job Description Summary

The Site Reliability Engineer will be responsible for the Observability / Monitoring platforms of GE HealthCare that ensure the performance and availability of the Compute and Network infrastructure consumed by all business segments. The Site Reliability teams are composed of highly talented individuals obsessively focused on availability through operational excellence. The ideal individual is relentlessly technical, passionate about automating everything, and totally committed to delivering amazing customer experiences.

Job Description

Roles and Responsibilities

In this role, you will:

Own, manage and adapt effective monitoring and alerting systems for GEHC
Develop and manage a single pane of glass that provides a single view of GEHC ecosystem monitoring, including top critical business applications, sites, and critical network devices.
Develop automated solutions as needed, including leveraging or developing AI solutions and self-healing automations, to address potential problems before they result in a service interruption
Develop availability measures that align with consumer experience to accurately assess the usability of crucial services
Own, develop, and manage a world-class monitoring data platform that ingests all monitoring telemetry data across applications and infrastructure within GEHC and integrates with the AIOps platform
Develop and product-manage automated solutions / SaaS products to maintain and optimize the availability and performance of critical business processes and services, and to address potential problems in the infrastructure and application ecosystem before they result in a service interruption
Ensure top critical business applications and their ecosystems are effectively monitored, with appropriate alerting mechanisms integrated with event management systems for an effective “single pane of glass”
Deliver self-service tools that rely on the monitoring platform / SRE, for example log and statistics visualization and monitoring dashboards.
Collaborate closely with product teams, both internal GE product teams and monitoring/AIOps tool vendors, to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability. Contribute to SLI, SLO, and SLA definition, monitoring, alerting, and reporting efforts.

Partner with and support other operations teams in investigating the root cause of major P1 and escalated P2 incidents through a monitoring lens
Establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria
Continuously identify patterns that point to larger problems and solve them to avoid repeat issues. Leverage AI and partner to create appropriate solutions in the monitoring and event management space

Stay abreast of the latest trends in application and infrastructure monitoring, provisioning, maintenance, and uptime. Learn, prototype, and apply the newest tools and best practices in real-world settings to meet the goals of the SRE practice

Education Qualification

Bachelor's Degree in Computer Science or “STEM” Majors (Science, Technology, Engineering and Math) with advanced experience.
Bachelor's Degree in Computer Science or “STEM” Majors (Science, Technology, Engineering and Math) with a minimum of 6 years of experience

Desired Characteristics / Technical Expertise:

7+ years of relevant experience in the IT Operations / Site Reliability Engineering domain, with demonstrable expertise in architecting, designing, and implementing solutions for availability and/or performance
Comprehensive understanding of application performance monitoring and cloud technologies, and the ability to design and implement Dynatrace solutions in complex enterprise environments.
Solid expertise in designing and implementing Dynatrace and Dynatrace extensions, or in managing an APM / observability solution.
Proficiency in Dynatrace features and architecture design, along with installation, fine-tuning, and implementation experience across various environments (Production, Test, Development, and Disaster Recovery)
Expertise in Dynatrace platform configuration, including host grouping, auto-tagging, naming rules, management zones, RUM (Real User Monitoring), Synthetics, session properties, request attributes, user tags, log monitoring, alerting profiles, problem notifications, threshold tuning, and setting up integrations with other monitoring tools and ServiceNow.
Experience implementing and configuring Dynatrace tools, setting up synthetic and transaction monitoring, and ensuring comprehensive infrastructure and application monitoring
Experience creating custom extensions in Dynatrace using shell, Python, and batch scripts based on REST APIs and logs.
Experience setting up Dynatrace extension configurations, dashboards (including business dashboards), infrastructure, analytics, observability logs, and metrics data collection, and interpreting the results.
Proficiency in Dynatrace Query Language (DQL) and in creating custom dashboards as required
Establish and foster visible architectural principles and practices to build reusable designs and systems that promote reliability, velocity, scale, security, and efficiency
Understand and improve applications, and plan for faster MTTD and MTTR and for auto-healing
Understand reliability metrics and enhance automation solutions for auto-healing and incident resolution
Full-stack troubleshooting experience across network, application, hardware, management fabric, or distributed services layers.
Exposure to and familiarity with Agile and SRE principles, automated deployments, and build pipelines
Excellent knowledge of common operating systems (Unix/Linux, Windows)
Strong oral and written communication skills.
Demonstrated experience scripting or developing software and services for the cloud (Ruby, Python, Go, Java, Node.js, .NET, etc.)
Extensive knowledge of network protocols (TCP/IP, SNMP, FTP, syslog, TFTP, etc.)
Experience deploying and managing infrastructure on public clouds such as AWS or Azure
Experience using an automated configuration management system (Terraform, Chef, Puppet, Ansible, Salt, etc.)
Strong analytical and problem resolution skills
Excellence in written and verbal communication and presentation, and the ability to partner for success across all levels of the organization and of technical depth.
Enterprise logging/alerting implementations using Splunk and the ELK stack
Enterprise APM implementation using Dynatrace, AppDynamics, New Relic, etc.
Experience managing teams in highly matrixed organizations
Experience with web application development platforms, tools and utilities
Experience managing multiple hosting environments including public and private cloud solutions.
Experience with configuring, customizing, and extending monitoring tools (Sensu, Grafana, Prometheus, Graphite, Splunk, etc.)
Experience in developing / deploying agentic AI solutions in the monitoring / observability space will be an advantage.

Business Acumen:

Leadership:
• Proactively engages with cross-functional teams to resolve issues and design solutions using critical thinking and analytics skills and best practices by actively incorporating input from various sources
• Strong analytical and problem-solving skills: effectively evaluates information/data to make decisions; anticipates obstacles and develops plans to resolve them
• Continuous improvement oriented – actively generates process improvements; champions and drives change initiatives
• Ability to deliver results in a rapidly changing, dynamic environment

Personal Attributes:
• Emotional intelligence, the ability to influence up and out, and the ability to work independently
• Must be a team player with a strong desire to win
• Passionate about continuously learning and able to quickly adapt and pivot to win in a dynamic environment
• Highly organized and efficient; able to balance competing priorities and execute accordingly
• Strong oral and written communication skills

Inclusion and Diversity

GE HealthCare is an Equal Opportunity Employer where inclusion matters. Employment decisions are made without regard to race, color, religion, national or ethnic origin, sex, sexual orientation, gender identity or expression, age, disability, protected veteran status, or other characteristics protected by law.

We expect all employees to live and breathe our behaviors: act with humility and build trust; lead with transparency; deliver with focus, and drive ownership – always with unyielding integrity.

Our total rewards are designed to unlock your ambition by giving you the boost and flexibility you need to turn your ideas into world-changing realities. Our salary and benefits are everything you’d expect from an organization with global strength and scale, and you’ll be surrounded by career opportunities in a culture that fosters care, collaboration, and support.

Additional Information

Relocation Assistance Provided: No

Staff Site Reliability Engineer
