Who We Are
Solera is a leading global technology and data solutions provider specializing in the automotive industry. We strive to transform every touchpoint of the vehicle lifecycle into a connected digital experience. In addition, we provide products and services to protect life’s other most important assets: our homes and digital identities. Today, Solera processes over 300 million digital transactions annually for approximately 235,000 partners and customers in more than 90 countries. Our 6,500 team members foster an uncommon, innovative culture and are dedicated to successfully bringing the future to bear today through cognitive answers, insights, algorithms and automation. For more information, please visit solera.com.
The Role
As a Senior Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, availability, performance, and security of on-prem infrastructure and .NET-based fleet management applications. This role blends operational excellence with strong automation, observability, and incident-response capabilities across high-scale telemetry and real-time data systems. As a core member of the development team, you will build and maintain robust, reliable infrastructure and automate operational tasks to reduce toil and improve efficiency. We’re seeking an experienced SRE to deliver insights from massive-scale data in real time.
ESSENTIAL RESPONSIBILITIES AND DUTIES:
- Ensure high availability, scalability, and resilience of production services, including APIs, .NET applications, telemetry ingestion pipelines, and on-prem infrastructure.
- Run and maintain production environments by continuously monitoring system health, availability, error rates, resource saturation, and end-to-end performance.
- Define, implement, and monitor SLIs/SLOs/SLAs for uptime, latency, throughput, error budgets, and system reliability.
- Build and maintain software systems to automate the management of platform infrastructure, deployments, and application operations.
- Measure, analyse, and optimise system performance, proactively identifying bottlenecks and driving architectural improvements.
- Own incident management, including detection, triaging, mitigation, communication, root cause analysis (RCA), and post-mortems.
- Design and maintain monitoring, logging, and observability frameworks (Prometheus, Grafana, Datadog, ELK, APM tools) for distributed services, microservices, telemetry workloads, and on-prem infrastructure.
- Develop and enhance automation and CI/CD pipelines to reduce manual toil and improve deployment reliability.
- Ensure reliability, performance, and best practices are integrated into the SDLC.
- Manage and operate on-prem infrastructure, including Rancher, OpenShift, Kubernetes, virtualisation, storage, networking, and security controls.
- Provision, configure, and maintain infrastructure resources using IaC tooling, automation scripts, and configuration management tools.
- Implement security and compliance best practices, especially around fleet data, driver information, telemetry, GPS, and regulatory requirements.
- Perform capacity planning and performance tuning for backend services, telemetry systems, and high-load ingestion pipelines.
- Provide primary operational support for large-scale distributed .NET applications and fleet-critical systems.
- Maintain detailed documentation on architecture, operational processes, incident playbooks, and system runbooks.
FLEET/TELEMATICS-SPECIFIC RESPONSIBILITIES:
- Support real-time data ingestion pipelines (vehicle telemetry, IoT/edge devices, GPS/GNSS streams), ensuring low-latency and reliable data delivery.
- Optimise backend systems for load spikes typical in fleet operations (e.g., start-of-day vehicle activations, peak trip windows).
- Monitor the health of vehicle-facing and driver-facing data flows, including connectivity, message delivery, and ingestion reliability.
- Enhance observability for mobile/embedded systems, considering intermittent connectivity, offline sync, and edge constraints.
QUALIFICATIONS:
EDUCATION: Bachelor’s degree in Computer Science or equivalent
EXPERIENCE: 6–8 years of relevant experience in DevOps or Site Reliability Engineering, with hands-on expertise in operating production systems, CI/CD pipelines, and distributed application platforms.
KNOWLEDGE/SKILLS/ABILITIES:
- Strong expertise in on-prem infrastructure & container orchestration — Rancher, Kubernetes/OpenShift, Docker, virtualisation, networking, storage, IP routing, firewalls, and security controls.
- Deep observability and monitoring skills using Prometheus, Grafana, Datadog, ELK, APM tools, log pipelines, distributed tracing, and alerting systems like PagerDuty, with the ability to build end-to-end monitoring for APIs, .NET and Java apps, and telemetry pipelines.
- Advanced reliability engineering capabilities — defining/operationalising SLIs, SLOs, SLAs, error budgets, availability models, and capacity/performance planning for large-scale distributed systems.
- Strong automation and CI/CD experience with GitHub, Octopus, Jenkins/Azure DevOps, IaC (Terraform/Helm/Kustomize), and scripting (PowerShell, Bash, Python) to reduce manual toil and improve deployment reliability.
- Production operations mastery — incident management (detection → triage → mitigation → RCA/post-mortem), system health monitoring, performance analysis, scalability improvements, and maintaining high uptime SLAs.
- Backend performance & systems engineering skills: thread/memory profiling for .NET apps, SQL Server, NoSQL, and Redis tuning, telemetry ingestion optimisation, and handling high-load fleet/telematics workloads.
- Experience supporting real-time data flows & IoT/telemetry systems, including GPS/GNSS streams, vehicle connectivity, ingestion reliability, offline/edge constraints, and mobility-driven scaling patterns.
- Security and compliance knowledge: secrets management, least-privilege access, vulnerability scanning, and data protection practices for fleet data, driver information, and regulated telemetry workloads.
- Experience with cloud platforms such as AWS (EKS, EC2, RDS, S3, VPC, IAM) is a plus, especially in hybrid on-prem + cloud environments.