IND - Staff Engineer, Reliability - GCC070

We’re determined to make a difference and are proud to be an insurance company that goes well beyond coverages and policies. Working here means having every opportunity to achieve your goals – and to help others accomplish theirs, too. Join our team as we help shape the future.

The Cloud Services Team is searching for a Reliability Engineer. The candidate must have hands-on experience operating and engineering services on Google Cloud Platform (GCP), including data, compute, and observability services. The team is accountable for the operations, engineering, and governance of 200+ cloud technologies across a multi-cloud environment. The role involves helping mature operational practices for GCP workloads as part of our multi-cloud strategy. This is an excellent opportunity for someone who is interested in a mix of strategy and hands-on work. The ideal candidate should feel comfortable working with teammates at all levels of the organization, including leadership.

Key Responsibilities

Assist in the development, maintenance, and operation of 200+ infrastructure services across our Cloud transformation landscape.

Develop solutions and drive adoption of enterprise solutions such as Cyber Protection, Disaster Recovery, and Security enhancements across line-of-business teams.

Drive improvements, through automation, in the efficiency and simplicity of software delivered as a service.

Provide clear operational documents and construction/support specifications to the IT user base.

Provide insight into operational metrics across the entire Cloud environment.

Consult with customers on new requirements, design questions, and functionality configurations for environments on and off premises.

Deliver the tooling and capabilities needed to enable the cloud compliance, metrics and reporting, and cost management roadmap and strategy.

Participate in incident resolution and change implementation as necessary. This may occasionally include support during non-standard hours.

Operate and improve reliability for production workloads running on Google Cloud Platform (GCP), focusing on availability, scalability, and operational readiness rather than application development.

Own day‑to‑day operational concerns for core GCP services including Compute Engine, GKE, Cloud Run, BigQuery, Cloud Storage, and supporting platform services.

Provide operational support for BigQuery platforms including job performance troubleshooting, capacity planning, quota management, dataset permissions, and cost optimization (slot usage, reservations, and quotas).

Support Vertex AI platforms from an operations and reliability standpoint, including environment readiness, access controls, monitoring, pipeline execution health, and incident response (not model development).

Build and maintain observability standards using Cloud Monitoring, Cloud Logging, Error Reporting, and custom SLI/SLO dashboards for GCP workloads.

Implement alerting strategies aligned to error budgets and production reliability goals; reduce alert noise and prevent toil.

Execute incident response, triage, and post‑incident analysis for GCP services, contributing to PIRs and corrective actions.

Develop and maintain runbooks, operational playbooks, and escalation workflows for GCP services.

Drive automation-first operations, including self‑healing patterns using Cloud Functions, Cloud Run jobs, Scheduler, and event‑driven remediation.

Enforce and operate GCP security and governance controls, including IAM, service accounts, Org Policies, VPC Service Controls, KMS, Secret Manager, and networking guardrails.

Partner with engineering and data teams to review designs for operability, resiliency, and supportability, ensuring workloads meet production readiness standards before launch.

Required Skills & Experience

Expert understanding of how applications should be engineered: following fault-tolerance best practices, separation of duties, observability, and operator friendliness.

Self-motivated and results-oriented, with the ability to work both in a team environment and independently.

Strong hands-on experience with BigQuery, including performance tuning, cost management, and governance.

Experience with Vertex AI, including pipelines, model deployment, model monitoring, and integration with BigQuery.

Deep knowledge of Cloud IAM, service accounts, Workload Identity Federation, and principle-of-least-privilege controls.

Experience with GKE operations (clusters, node pools, autoscaling, Workload Identity; Istio/Anthos optional).

Understanding of Cloud Storage, Pub/Sub, Dataflow, Dataproc, and Cloud Composer for data/ML workflows.

Experience building CI/CD pipelines targeting GCP using Cloud Build, Artifact Registry, and Terraform.

Ability to troubleshoot GCP networking: VPCs, firewall rules, private service access, interconnects/VPN.

Nice to Have

Intermediate knowledge of Terraform and CloudFormation.

Intermediate Microsoft Office skills.

Hands-on experience with advanced GCP services such as Vertex AI, BigQuery, Dataflow, Pub/Sub, Cloud Run, and GKE.

Experience creating org-level policies, security baselines, and automation patterns for GCP environments.

What We Offer

Collaborative work environment with global teams.

Competitive compensation and comprehensive benefits.

Continuous learning and growth opportunities in geospatial and risk analytics technologies.
