Kyndryl

AgentOps Engineer - Observability

Madrid, Spain Full time

Who We Are

At Kyndryl, we design, build, manage and modernize the mission-critical technology systems that the world depends on every day. So why work at Kyndryl? We are always moving forward – always pushing ourselves to go further in our efforts to build a more equitable, inclusive world for our employees, our customers and our communities.


The Role

We’re looking for exceptional talent to join our AI Agentic Innovation Hub at Kyndryl!

The AI Agentic Innovation Hub stands as Kyndryl’s center of excellence for advanced and agentic artificial intelligence. Our mission is to lead the design and deployment of transformative AI solutions that bridge frontier research with real-world impact — scalable, secure, and driven by measurable value. 
Built upon a team of exceptional talent and cutting-edge technology, the Hub embodies a spirit of bold innovation and disciplined execution — an elite unit within one of the world’s leading technology companies. With national reach and global ambition, we partner with major organizations to tackle their most complex challenges, pioneering the next generation of intelligent, autonomous, and trusted systems that redefine what AI can achieve. 

 

Job Description 
 

As a Senior Observability Engineer at Kyndryl’s AI Agentic Innovation Hub, you’ll be at the core of operational excellence for next-generation intelligent and agentic systems.
Your mission will be to design, implement, and maintain advanced observability and monitoring capabilities that ensure the reliability, traceability, and performance of AI agents and models in production. 
You’ll help build the observability architecture for agentic intelligence — integrating tracing, logging, monitoring, and governance tools that provide a deep understanding of how agents perceive, reason, and act in complex environments. 
Your work will enable early detection of anomalies, data drift, performance degradation, bias, or undesired agent behavior, ensuring compliance with the EU AI Act and Responsible AI principles. 
If you’re passionate about bridging AI systems with operational intelligence, and about creating frameworks that make AI transparent, accountable, and trustworthy, this role offers a unique opportunity to shape the future of intelligent observability. 


Your Mission 

  • Design and implement the observability architecture for AI and agentic systems, enabling end-to-end visibility across models, agents, and data pipelines.

  • Develop instrumentation frameworks to collect and analyze technical, behavioral, and cognitive metrics for deployed AI systems.

  • Integrate and configure monitoring, tracing, and logging tools (Prometheus, Grafana, OpenTelemetry, ELK Stack, Datadog, etc.) to ensure full operational insight.

  • Build dashboards and alerting mechanisms to detect data drift, performance issues, hallucinations, or reasoning inconsistencies in LLMs and agents.

  • Collaborate with MLOps, Data, and Architecture teams to establish model lineage, drift detection, and governance pipelines.

  • Design and maintain custom metrics for model and agent reliability: precision, latency, cost, reasoning depth, autonomy, and consistency.

  • Contribute to the Responsible AI framework, ensuring transparency, fairness, and auditability in AI decision-making.

  • Continuously research and experiment with new observability tools and practices (AgentOps, LLMOps, RAG Observability). 


Who You Are

Essential Qualifications 

  • 4+ years of professional experience, including at least 2 years in AI, MLOps, or distributed systems projects. 

  • Proven experience designing and implementing monitoring, logging, and performance metrics for production systems. 

  • Hands-on expertise with observability tools such as Prometheus, Grafana, OpenTelemetry, ELK Stack, Loki, Jaeger, or Datadog. 

  • Experience instrumenting AI and ML pipelines, tracking inference latency, throughput, and cost metrics. 

  • Familiarity with MLOps and LLMOps frameworks, including model traceability, drift detection, and prompt or reasoning tracing. 

  • Knowledge of agentic frameworks (LangGraph, AutoGen, CrewAI, OpenDevin, Google ADK) and their monitoring needs. 

  • Experience designing custom metrics for precision, reliability, error rate, and cognitive consistency. 

  • Strong understanding of cloud-native architectures, containers, and IaC tools (Kubernetes, Docker, Helm, Terraform). 

  • Awareness of AI compliance and governance requirements (EU AI Act, Responsible AI, decision traceability). 

 

Education & Certifications 

  • Bachelor’s degree in Computer Engineering, Software Engineering, Data Science, or related field. 

  • Postgraduate or specialized training in MLOps, DevOps, Observability, or Artificial Intelligence is highly valued. 

  • Certifications in Cloud Architecture, Monitoring, or AI Governance are a plus. 

  • Continuous learning mindset and commitment to staying current with emerging AI observability frameworks. 


Preferred Skills 

  • Experience with model observability and data lineage systems. 

  • Understanding of cognitive observability, including reasoning-chain or decision-path tracing in agents. 

  • Familiarity with event-driven architectures and telemetry for real-time AI services. 

  • Knowledge of FinOps metrics and cost optimization for AI workloads. 

  • Experience developing custom dashboards or visualization plugins for monitoring complex systems. 

  • Comfort working in hybrid or multi-cloud environments (Azure, AWS, GCP). 

  • Strong interest in AI reliability engineering and the convergence of AI and DevOps practices. 

 

Soft Skills 

  • Analytical and systemic thinker, understanding the interplay between data, systems, and agent behavior. 

  • Clear communicator, able to convey complex insights and performance findings to both technical and business audiences. 

  • Quality- and reliability-driven, with a preventive mindset focused on operational resilience. 

  • Collaborative and cross-functional, working seamlessly with AI, data, and compliance teams. 

  • Curious and proactive, exploring emerging technologies and methods in AI observability and AgentOps. 

  • Ethical and responsible, aware of the implications and accountability of automated decisions in production AI. 

#AgenticAI


Being You

Diversity is a whole lot more than what we look like or where we come from; it’s how we think and who we are. We welcome people of all cultures, backgrounds, and experiences. But we’re not doing it single-handedly: our Kyndryl Inclusion Networks are only one of many ways we create a workplace where all Kyndryls can find and provide support and advice. This dedication to welcoming everyone into our company means that Kyndryl gives you – and everyone next to you – the ability to bring your whole self to work, individually and collectively, and support the activation of our equitable culture. That’s the Kyndryl Way.


What You Can Expect

With state-of-the-art resources and Fortune 100 clients, every day is an opportunity to innovate and to build new capabilities, new relationships, new processes, and new value. Kyndryl cares about your well-being and prides itself on offering benefits that give you choice, reflect the diversity of our employees, and support you and your family through the moments that matter – wherever you are in your life journey. Our employee learning programs give you access to the best learning in the industry to receive certifications, including Microsoft, Google, Amazon, Skillsoft, and many more. Through our company-wide volunteering and giving platform, you can donate, start fundraisers, volunteer, and search over 2 million non-profit organizations. At Kyndryl, we invest heavily in you; we want you to succeed so that together, we will all succeed.

Get Referred!

If you know someone who works at Kyndryl, when asked ‘How Did You Hear About Us’ during the application process, select ‘Employee Referral’ and enter your contact’s Kyndryl email address.