Design and implement evaluation pipelines to measure the performance and reliability of AI models (a minimal sketch follows this list).
Develop automated testing frameworks to assess model outputs at scale.
Analyze model performance using both traditional statistical metrics and AI-specific evaluation methods.
Evaluate AI systems built on modern architectures, such as LLM-based applications and Retrieval-Augmented Generation (RAG) pipelines.
Identify potential issues related to accuracy, hallucinations, bias, safety, and model drift.
Conduct adversarial testing to uncover vulnerabilities and ensure safe model behavior.
Collaborate with engineering and AI teams to improve prompt design, model outputs, and system performance.
Monitor model performance in production and help define best practices for AI evaluation and observability.
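
The following is a minimal sketch of the kind of evaluation pipeline these responsibilities describe, assuming a hypothetical call_model function and an exact-match scorer; the names and scoring logic are illustrative, not a prescribed implementation:

    # Minimal evaluation-pipeline sketch. call_model is a hypothetical stand-in
    # for the model or API endpoint under test; the test cases and exact-match
    # scorer are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class TestCase:
        prompt: str
        expected: str

    def call_model(prompt: str) -> str:
        # Hypothetical: replace with the real model or API call.
        raise NotImplementedError

    def exact_match(output: str, expected: str) -> bool:
        # Simplest possible scorer; open-ended outputs would need semantic
        # or LLM-as-judge scoring instead.
        return output.strip().lower() == expected.strip().lower()

    def run_eval(cases: list[TestCase]) -> float:
        # Run every case through the model and return the overall pass rate.
        passed = sum(exact_match(call_model(c.prompt), c.expected) for c in cases)
        return passed / len(cases)
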
Proficiency in Python and experience building scripts or pipelines to evaluate model outputs.
Experience working with AI/ML systems, particularly large language models (LLMs) or generative AI applications.
Familiarity with concepts such as prompt engineering, prompt optimization, and LLM evaluation.
Understanding of evaluation metrics such as precision, recall, and F1 score, as well as AI-specific metrics related to model quality and safety (a worked example follows this list).
Experience evaluating RAG systems or knowledge retrieval pipelines is a plus (a retrieval hit-rate sketch also follows this list).
Experience with modern AI evaluation or observability tools (e.g., DeepEval, Promptfoo, RAGAS, LangSmith, Arize, Weights & Biases) is a plus.
Strong analytical mindset with the ability to interpret model behavior and propose improvements.
Experience performing adversarial testing or red-teaming of AI systems.
Familiarity with AI safety, bias detection, and model alignment practices.
Experience deploying or monitoring AI systems in production environments.
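
As a worked example of the classification-style metrics named above, assuming a binary task where 1 marks an output flagged as a hallucination (the labels below are illustrative):

    # Worked example of precision, recall, and F1 for a binary evaluation
    # task (e.g., flagging hallucinated outputs). Labels are illustrative.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth: 1 = hallucination
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # classifier predictions

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp)  # 3 / 4 = 0.75
    recall = tp / (tp + fn)     # 3 / 4 = 0.75
    f1 = 2 * precision * recall / (precision + recall)  # 0.75

    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")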
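
And a sketch of one common RAG retrieval signal, hit rate at k, assuming a hypothetical retrieve function and a set of gold document IDs per query; real evaluations would pair this with generation-side metrics such as faithfulness:

    # Retrieval hit-rate@k sketch for a RAG pipeline. retrieve is a
    # hypothetical stand-in for the retriever under test; gold document
    # IDs are illustrative.
    def retrieve(query: str, k: int) -> list[str]:
        # Hypothetical: replace with the real retriever; returns top-k doc IDs.
        raise NotImplementedError

    def hit_rate_at_k(queries: dict[str, set[str]], k: int = 5) -> float:
        # queries maps each query to its set of gold (relevant) document IDs.
        hits = 0
        for query, gold_ids in queries.items():
            retrieved = retrieve(query, k)
            if gold_ids & set(retrieved):  # at least one gold doc in the top k
                hits += 1
        return hits / len(queries)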