The Hartford

Staff Engineer, Reliability

India GCC-Puppalaguda Village Full time
IND - Staff Engineer, Reliability - GCC070

We’re determined to make a difference and are proud to be an insurance company that goes well beyond coverages and policies. Working here means having every opportunity to achieve your goals – and to help others accomplish theirs, too. Join our team as we help shape the future.

The Cloud Services Team is searching for a Reliability Engineer. The candidate must have hands-on experience operating and engineering services on Google Cloud Platform (GCP), including data, compute, and observability services. The team is accountable for the operations, engineering, and governance of 200+ cloud technologies across a multi-cloud environment. The role requires helping mature operational practices for GCP workloads as part of our multi-cloud strategy. This is an excellent opportunity for someone who is interested in a mix of strategy and hands-on work. The ideal candidate should feel comfortable working with teammates at all levels of the organization, including leadership.

Key Responsibilities 

  • Assist in the development, maintenance, and operation of 200+ infrastructure services across our cloud transformation landscape.

  • Develop solutions and drive adoption of enterprise solutions such as Cyber Protection, Disaster Recovery, and Security enhancements across line-of-business teams.

  • Use automation to improve the efficiency and simplicity of software delivered as a service.

  • Provide clear operational documents and construction/support specifications to the IT user base.

  • Provide insight into operational metrics across the entire cloud environment.

  • Consult with customers on new requirements, design questions, and functionality configurations for environments on and off premises.

  • Deliver the tooling and capabilities needed to enable the roadmap and strategy for cloud compliance, metrics and reporting, and cost management.

  • Participate in incident resolution and change implementation as necessary. This may occasionally include support during non-standard hours.

  • Operate and improve reliability for production workloads running on Google Cloud Platform (GCP), focusing on availability, scalability, and operational readiness rather than application development. 

  • Own day-to-day operational concerns for core GCP services including Compute Engine, GKE, Cloud Run, BigQuery, Cloud Storage, and supporting platform services.

  • Provide operational support for BigQuery platforms including job performance troubleshooting, capacity planning, quota management, dataset permissions, and cost optimization (slot usage, reservations, and quotas). 

  • Support Vertex AI platforms from an operations and reliability standpoint, including environment readiness, access controls, monitoring, pipeline execution health, and incident response (not model development). 

  • Build and maintain observability standards using Cloud Monitoring, Cloud Logging, Error Reporting, and custom SLI/SLO dashboards for GCP workloads. 

  • Implement alerting strategies aligned to error budgets and production reliability goals; reduce alert noise and prevent toil. 

  • Execute incident response, triage, and post-incident analysis for GCP services, contributing to PIRs and corrective actions.

  • Develop and maintain runbooks, operational playbooks, and escalation workflows for GCP services. 

  • Drive automation-first operations, including self-healing patterns using Cloud Functions, Cloud Run jobs, Cloud Scheduler, and event-driven remediation.

  • Enforce and operate GCP security and governance controls, including IAM, service accounts, Org Policies, VPC Service Controls, KMS, Secret Manager, and networking guardrails. 

  • Partner with engineering and data teams to review designs for operability, resiliency, and supportability, ensuring workloads meet production readiness standards before launch. 

 

Required Skills & Experience 

  • Expert understanding of how applications should be engineered, following best practices for fault tolerance, separation of duties, observability, and operator friendliness.

  • Self-motivated and results-oriented, with the ability to work both in a team environment and independently.

  • Strong hands-on experience with BigQuery, including performance tuning, cost management, and governance. 

  • Experience with Vertex AI, including pipelines, model deployment, model monitoring, and integration with BigQuery. 

  • Deep knowledge of Cloud IAM, service accounts, Workload Identity Federation, and principle-of-least-privilege controls. 

  • Experience with GKE operations (clusters, node pools, autoscaling, workload identity, Istio/Anthos optional). 

  • Understanding of Cloud Storage, Pub/Sub, Dataflow, Dataproc, and Cloud Composer for data/ML workflows. 

  • Experience building CI/CD pipelines targeting GCP using Cloud Build, Artifact Registry, and Terraform. 

  • Ability to troubleshoot GCP networking: VPCs, firewall rules, private service access, interconnects/VPN. 

 

Nice to Have  

  • Intermediate knowledge of Terraform and CloudFormation.

  • Intermediate Microsoft Office skills.

  • Hands-on experience with advanced GCP services such as Vertex AI, BigQuery, Dataflow, Pub/Sub, Cloud Run, and GKE. 

  • Experience creating org-level policies, security baselines, and automation patterns for GCP environments.

 

What We Offer 

  • Collaborative work environment with global teams. 

  • Competitive compensation and comprehensive benefits. 

  • Continuous learning and growth opportunities in geospatial and risk analytics technologies. 

 
