What you'll be doing
- Improving the reliability of our main product, Suite One, as we scale to regions across the globe
- Creating and maintaining innovative, automated solutions, tooling and alerting frameworks to improve the reliability of our production systems
- Proactively addressing application, platform and database performance and reliability issues
- Automating our infrastructure, testing, failover solutions, failure mitigation, and much more
- Maintaining documentation and "runbooks" to assist with operational management
- Conducting post-incident reviews and implementing their findings to drive continual improvement
- Assisting with on-call rotations and processes
- Educating and training our engineering teams, and promoting our culture of ownership, to help them better understand the production impact of their changes
- Assisting with the development and maintenance of our software delivery frameworks
- Working closely with internal partners and teams to ensure that we ship software that meets security, SLA, and performance requirements
Our tech stack
- Azure
- Kubernetes (AKS), Docker
- Linux, Terraform, Ansible, Helm
- GitLab, GitLab CI/CD pipelines
- Prometheus, Grafana, ELK
- Jira, Confluence
Product development stack
- .NET Core, Angular, Python
- MSSQL Server, Postgres
- Redis, Cosmos DB, RabbitMQ
Your skills and experience
- Senior-level engineer with 4+ years of experience across both Development and Operations
- A proven track record of improving application stability and performance
- Experience designing, building, and operating large-scale production systems
- A proven track record of working with cloud computing platforms such as Amazon Web Services, Microsoft Azure, or Google Cloud Platform
- Proven experience with, and appreciation of, Infrastructure-as-Code (IaC) practices
- Experience automating infrastructure, testing, and deployments using tools like Ansible, Chef, or Terraform
- Experience working with container technologies such as Docker or Kubernetes
- Experience working across DevSecOps pipelines and tooling
- Experience with monitoring and observability solutions such as New Relic or OpenTelemetry
- Experience debugging complex problems
- Desirable: Experience working with large production data sets
- Desirable: A good understanding of ASP.NET
Your personal attributes
- Great communication and collaboration skills when working with other engineers, product managers, and business stakeholders
- Independent, proactive, and able to deliver production-ready solutions with minimal guidance
- Empathetic and authentic
- Inquisitive and interested
- Driven
- Self-motivated and diligent
- Optimistic and courageous