RYZ Labs

AI Evaluation Engineer

Argentina | Full Time
RYZ Labs is looking for an experienced AI Evaluation Engineer to join one of our clients’ teams.

Responsibilities

  • Design and implement evaluation pipelines to measure the performance and reliability of AI models.

  • Develop automated testing frameworks to assess model outputs at scale.

  • Analyze model performance using both traditional statistical metrics and AI-specific evaluation methods.

  • Evaluate AI systems built on modern architectures such as LLM-based applications and Retrieval-Augmented Generation (RAG).

  • Identify potential issues related to accuracy, hallucinations, bias, safety, and model drift.

  • Conduct adversarial testing to uncover vulnerabilities and ensure safe model behavior.

  • Collaborate with engineering and AI teams to improve prompt design, model outputs, and system performance.

  • Monitor model performance in production and help define best practices for AI evaluation and observability.

Requirements

  • Proficiency in Python and experience building scripts or pipelines to evaluate model outputs.

  • Experience working with AI/ML systems, particularly large language models (LLMs) or generative AI applications.

  • Familiarity with concepts such as prompt engineering, prompt optimization, and LLM evaluation.

  • Understanding of evaluation metrics such as precision, recall, and F1-score, as well as AI-specific metrics for model quality and safety.

  • Experience evaluating RAG systems or knowledge retrieval pipelines is a plus.

  • Experience with modern AI evaluation or observability tools is a plus (e.g., DeepEval, Promptfoo, RAGAS, LangSmith, Arize, Weights & Biases).

  • Strong analytical mindset with the ability to interpret model behavior and propose improvements.

Nice to Have

  • Experience performing adversarial testing or red-teaming of AI systems.

  • Familiarity with AI safety, bias detection, and model alignment practices.

  • Experience deploying or monitoring AI systems in production environments.