Quantiphi

Technical Architect - Machine Learning

USA - Remote Full time

While technology is the heart of our business, a global and diverse culture is the heart of our success. We love our people and take pride in offering them a culture built on transparency, diversity, integrity, learning and growth.


If working in an environment that encourages you to innovate and excel, not just in your professional life but in your personal life too, interests you, you would enjoy your career with Quantiphi!

About Quantiphi:

Quantiphi is an award-winning, AI-First digital engineering and consulting company focused on delivering high-impact Services and Solutions that help organizations solve what truly matters. We partner with enterprises to reimagine their businesses through intelligent, scalable, and transformative AI, driving measurable outcomes at the very core of their operations.

Since our founding in 2013, Quantiphi has tackled some of the world’s most complex business challenges by combining deep industry expertise, disciplined cloud and data engineering practices, and cutting-edge applied AI research. Our work is rooted in delivering accelerated, quantifiable business value, not just technology for technology’s sake.

Headquartered in Boston, Quantiphi is a global organization with 4,000+ professionals serving clients across key industry verticals, including BFSI, Healthcare & Life Sciences, CPG, MFG, TME etc. As an Elite and Premier partner to leading cloud and AI platforms such as NVIDIA, Google Cloud, AWS, and Snowflake, we build and deliver enterprise-grade AI services and solutions that create real-world impact. 

We’ve been recognized with:

  • 17x Google Cloud Partner of the Year awards in the last 8 years.

  • 3x AWS AI/ML award wins.

  • 3x NVIDIA Partner of the Year titles.

  • 2x Snowflake Partner of the Year awards.

  • We have also garnered top analyst recognitions from Gartner, ISG, and Everest Group.

  • We offer first-in-class industry solutions across Healthcare, Financial Services, Consumer Goods, Manufacturing, and more, powered by cutting-edge Generative AI and Agentic AI accelerators.

  • We have been certified as a Great Place to Work for the third year in a row: 2021, 2022, and 2023.

Be part of a trailblazing team that’s shaping the future of AI, ML, and cloud innovation. 

Your next big opportunity starts here!

For more details, visit: Website or LinkedIn Page.

Role: Technical Architect Machine Learning Engineer - Agentic AI & Multi-Agent Systems
Experience Level: 8-12 years
Location: US / Canada

Job Summary:

We are seeking an experienced Senior Machine Learning Engineer to architect, build, and deploy production-grade agentic AI systems and multi-agent workflows from the ground up. The ideal candidate will have deep expertise in designing autonomous AI systems that can collaborate, reason, and execute complex tasks with minimal human intervention. You will be responsible for creating scalable, robust agentic workflows using cutting-edge frameworks like CrewAI and LangGraph, while ensuring enterprise-grade deployment on major cloud platforms.

Roles & Responsibilities:

Agentic System Architecture & Development:

  • Architect & Build Agentic Systems: Design and develop end-to-end multi-agent systems from scratch. You will create the foundational agent harnesses, define communication protocols, and build orchestration layers using frameworks like CrewAI, LangGraph, and AutoGen. Your architectural decisions should ensure:

    • Hierarchical and collaborative multi-agent structures with well-defined agent roles, responsibilities, and communication protocols

    • Dynamic task decomposition, sophisticated tool integration, planning mechanisms (ReAct), and self-correction loops

    • State management systems and memory mechanisms for persistent agent interactions

  • Engineer Advanced Agent Capabilities: Develop custom agent-tools and define specialized agent-skills that empower agents to perform complex, domain-specific tasks.

  • Pioneer Context Engineering: Implement advanced context engineering and memory systems to ensure agents maintain state, learn from interactions, and make informed decisions in dynamic environments.

  • Deploy Production-Grade Solutions: Own the deployment, scaling, and maintenance of robust, low-latency agentic systems on major cloud platforms (GCP, AWS, or Azure). You will implement best-in-class MLOps practices for monitoring, continuous integration/continuous deployment (CI/CD), and system reliability.

  • Integrate and Optimize LLMs: Integrate LLMs to serve as the core reasoning engines for autonomous agents. You will apply advanced techniques like RAG and PEFT to optimize performance.

Tool Development & RAG Integration:

  • Create and maintain comprehensive tool libraries for agents including API integrations, database queries, and external service connections

  • Design and implement RAG systems using vector databases (Pinecone, Weaviate, ChromaDB)

  • Develop custom tools and plugins that enable agents to interact with various enterprise systems and APIs

  • Ensure tool reliability, error handling, and seamless integration within agentic workflows

Observability, Monitoring & Evaluation:

  • Implement comprehensive monitoring and tracing systems for agent behavior, performance, cost optimization, and latency analysis

  • Design novel evaluation frameworks to assess multi-step agentic task success, reliability, and accuracy

  • Utilize advanced observability tools (LangSmith, Arize AI, or custom solutions) to trace agent decision-making processes

  • Establish metrics and KPIs for measuring agentic system performance in production environments

Required Skills & Qualifications:

Experience:

  • 6-8 years of hands-on experience in machine learning and AI engineering with a proven track record of taking ML systems to production

  • Demonstrated expertise in building multi-agent systems and agentic workflows, preferably with LangGraph/CrewAI

Technical Skills - Must Have:

  • Programming & ML: Expert-level Python proficiency with ML frameworks (TensorFlow, PyTorch, Transformers). Experience with FastAPI, async programming, and microservices architecture

  • Data & Vector Systems: Hands-on experience with vector databases (Pinecone, Weaviate, ChromaDB) and building scalable RAG systems

  • Monitoring & Observability: Experience with LLM application monitoring tools (LangSmith, Weights & Biases, custom telemetry solutions)

  • Proven ability to architect and implement complex AI systems from scratch in production environments

  • Cloud Platform Expertise: Production-level experience with at least one major cloud platform (AWS, GCP, or Azure), including:

    • Compute services (EC2, GCE, Azure VMs)

    • Serverless functions (Lambda, Cloud Functions, Azure Functions)

    • Container orchestration (EKS, GKE, AKS)

    • Managed AI/ML services (SageMaker, Vertex AI, Azure ML)

  • Production & DevOps: Strong skills in Infrastructure as Code (Terraform, CloudFormation), CI/CD pipelines (GitHub Actions, Jenkins), and containerization (Docker, Kubernetes)

Technical Skills - Good to have:

  • Experience with prompt engineering techniques, fine-tuning SLMs (PEFT, SFT, RLHF), and model optimization

  • Knowledge of distributed systems, message queues, and event-driven architectures for agent coordination

  • Familiarity with SDLC best practices, version control (Git), and agile development methodologies

  • Experience with tool-calling agents, multi-step workflows, and stateful orchestration (e.g. graphs, planners, routers).

  • Hands-on evals for agents: trajectory / tool-use checks, golden traces, LLM-as-judge with fixed rubrics, regression suites.

  • Online evals, drift thinking, and clear quality gates before or after deploy (thresholds, alerts, rollback criteria).

  • Safety and abuse: prompt injection via tools, untrusted retrieval, PII handling in prompts and logs, allowlists and guardrails.

  • Cost and latency discipline: budgets per run, timeouts, caps on turns and tool calls.

  • Model lifecycle: routing / gateway patterns, version pinning, fallbacks, and which model for which step.

  • Memory and state: what is persisted, retention, redaction, and what must never be stored

Soft Skills:

  • Exceptional problem-solving and analytical thinking with the ability to tackle complex, ambiguous challenges

  • Strong communication skills to explain complex agentic concepts to both technical and non-technical stakeholders

  • Proven ability to work independently and drive large-scale projects to completion with minimal supervision

  • Leadership mindset with experience mentoring team members and driving technical excellence

If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!