Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.

Join Our Cloud Infrastructure Engineering & Operations Division as a Senior Site Reliability Engineer

We are seeking a highly skilled Senior Site Reliability Engineer to elevate our Cloud Infrastructure Engineering & Operations team. Your primary mission will be to enhance the performance, reliability, and scalability of our platforms by spearheading the development of a world-class observability ecosystem that drives business success.

The Team:

The Logging, Metrics, and Monitoring (LMM) team is at the forefront of building and delivering observability services and tools for our engineering communities within the Cloud Engineering & Operations and Research & Development zones. Our solutions are critical: hundreds of developers use them daily to develop, monitor, troubleshoot, and optimize our web services. We manage large-scale, distributed, fault-tolerant systems that collect and host vast volumes of log and metric data, enabling data-driven decision-making across the organization.

Our work has a direct, measurable impact on the productivity of our engineering teams across athenaNation, empowering them to innovate faster and operate more reliably.

In this role, you will tackle a diverse set of challenges—from fine-tuning system performance and scaling services to debugging complex issues. You will partner closely with development teams to deliver new monitoring features, improve existing tools, and solve pressing engineering problems—all within an agile environment that leverages both private and public cloud platforms.

Job Responsibilities

Automate the deployment, configuration, and management of logging, metrics, and monitoring services leveraging Puppet and Infrastructure as Code best practices to ensure reliable and scalable operations.
Proactively troubleshoot and resolve complex production incidents, leveraging deep Linux system administration and engineering expertise to minimize downtime.
Lead cross-functional projects from conception through delivery, including designing scalable technical solutions, managing timelines, and ensuring successful implementation.
Architect and implement comprehensive monitoring strategies by developing metrics, dashboards, and alerting criteria to enable proactive service performance management and dynamic scaling.
Collaborate closely with engineering teams during weekly on-call rotations to swiftly diagnose and resolve high-impact issues, fostering a culture of reliability.
Partner with development teams to enhance their logging and telemetry capabilities, improving observability and operational efficiency.
Mentor and guide team members on best practices for incident response, system tuning, and service reliability.

Required Qualifications

5-8 years of hands-on experience managing mission-critical production environments with a focus on Linux system administration and DevOps practices.
Expertise in Amazon Web Services and cloud-native approaches.
Experience working with microservices and production-grade infrastructure.
Proven expertise in managing and optimizing large-scale logging and data platforms such as Kafka, OpenSearch/Elasticsearch, and log forwarding agents like Vector or Fluentd.
Extensive experience with configuration management tools such as Puppet or Ansible, automating deployment and operations at scale.
Scripting experience with Python or Bash.
Demonstrated success troubleshooting and resolving issues in Linux-based production services, including participating actively in on-call rotations.
Proficiency in scripting and programming languages including Bash, Python, and Golang for automation, tooling, and integrations.
Strong expertise in Infrastructure as Code using Terraform and AWS CloudFormation to build resilient, repeatable deployment workflows.
Ability to rapidly adapt to evolving technology environments and business priorities with a bias toward reliability and automation.

Additional Qualifications

Experience managing large-scale production server fleets (thousands of nodes) with high availability and fault tolerance.
Deep subject matter expertise in technologies such as Graphite, ClickHouse, Prometheus, Grafana, Docker, Jenkins, and Git.
Familiarity with AWS cloud architecture, deployment, and operational best practices, with hands-on experience deploying scalable cloud-native applications.
Proficiency with protocol analyzers like tcpdump and Wireshark for network troubleshooting and performance diagnostics.

About athenahealth

Our vision: In an industry that becomes more complex by the day, we stand for simplicity. We offer IT solutions and expert services that eliminate the daily hurdles preventing healthcare providers from focusing entirely on their patients — powered by our vision to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.

Our company culture: Our talented employees — or athenistas, as we call ourselves — spark the innovation and passion needed to accomplish our vision. We are a diverse group of dreamers and doers with unique knowledge, expertise, backgrounds, and perspectives. We unite as mission-driven problem-solvers with a deep desire to achieve our vision and make our time here count. Our award-winning culture is built around shared values of inclusiveness, accountability, and support.

Our DEI commitment: Our vision of accessible, high-quality, and sustainable healthcare for all requires addressing the inequities that stand in the way. That's one reason we prioritize diversity, equity, and inclusion in every aspect of our business, from attracting and sustaining a diverse workforce to maintaining an inclusive environment for athenistas, our partners, customers, and the communities where we work and serve.

What we can do for you:

Along with health and financial benefits, athenistas enjoy perks specific to each location, including commuter support, employee assistance programs, tuition assistance, employee resource groups, and collaborative workspaces — some offices even welcome dogs.

We also encourage a better work-life balance for athenistas with our flexibility. While we know in-office collaboration is critical to our vision, we recognize that not all work needs to be done within an office environment, full-time. With consistent communication and digital collaboration tools, athenahealth enables employees to find a balance that feels fulfilling and productive for each individual situation.

In addition to our traditional benefits and perks, we sponsor events throughout the year, including book clubs, external speakers, and hackathons. We provide athenistas with a company culture based on learning, the support of an engaged team, and an inclusive environment where all employees are valued.

Learn more about our culture and benefits here: athenahealth.com/careers

https://www.athenahealth.com/careers/equal-opportunity

Senior Member of Technical Staff - SMTS
