MARCO: Multi-Agent Real-time Chat Orchestration

Anubhav Shrimal1, Stanley Kanagaraj1, Kriti Biswas1, Swarnalatha Raghuraman1, Anish Nediyanchath1, Yi Zhang2 and Promod Yenigalla1 1Retail Business Services, Amazon 2AWS Bedrock, Amazon {shrimaa, kstanly, kritibw, rswarnal, anishned, yizhngn, promy}@amazon.com

arXiv:2410.21784v1 [cs.AI] 29 Oct 2024

Abstract

Large language model advancements have enabled the development of multi-agent frameworks to tackle complex, real-world problems such as automating tasks that require interactions with diverse tools, reasoning, and human collaboration. We present MARCO, a Multi-Agent Real-time Chat Orchestration framework for automating tasks using LLMs. MARCO addresses key challenges in utilizing LLMs for complex, multi-step task execution. It incorporates robust guardrails to steer LLM behavior, validate outputs, and recover from errors that stem from inconsistent output formatting, function and parameter hallucination, and lack of domain knowledge. Through extensive experiments we demonstrate MARCO's superior performance, with 94.48% and 92.74% accuracy on task execution for the Digital Restaurant Service Platform conversations and Retail conversations datasets respectively, along with 44.91% improved latency and 33.71% cost reduction. We also report the effects of guardrails on performance gains, along with comparisons of various LLM models, both open-source and proprietary. The modular and generic design of MARCO allows it to be adapted for automating tasks across domains and for executing complex usecases through multi-turn interactions.

1 Introduction

Advancements in LLM technology have led to significant interest in applying agent frameworks to realise solutions that require complex interactions with the environment, including planning, tool usage, reasoning, and interaction with humans. Recent works (Wang et al., 2024; Huang et al., 2024) demonstrate the potential of LLMs for creating autonomous agents, while numerous challenges remain in providing a seamless experience for end users who interact with the system on a daily basis. LLMs are probabilistic next-token prediction systems and, by design, non-deterministic, which can introduce inconsistencies in output generation that prove challenging for features like function calling, parameter value grounding, etc. There are also challenges around domain-specific knowledge, which can be an advantage and a disadvantage at the same time: LLMs have inherent biases that can lead to hallucinations, and at the same time they may lack the right internal domain-specific context, which must be provided to get the expected results from an LLM.
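The inconsistency problem above is concrete in function calling: a model may emit malformed JSON, a non-existent function name, or a hallucinated parameter. A minimal sketch of an output guardrail with a reflection-style retry loop is shown below; the function schema, `call_llm` client, and all names are our own illustrative placeholders, not MARCO's actual API.

```python
import json

# Hypothetical function schema the LLM is allowed to call.
KNOWN_FUNCTIONS = {"get_sales_report": {"item_id", "date_range"}}

def validate_call(raw: str) -> dict:
    """Raise ValueError if the response is not a well-formed, known call."""
    call = json.loads(raw)  # raises on malformed JSON
    name, params = call["function"], call.get("parameters", {})
    if name not in KNOWN_FUNCTIONS:
        raise ValueError(f"hallucinated function: {name}")
    unknown = set(params) - KNOWN_FUNCTIONS[name]
    if unknown:
        raise ValueError(f"hallucinated parameters: {unknown}")
    return call

def guarded_call(prompt: str, call_llm, max_retries: int = 2) -> dict:
    """Validate LLM output; on failure, feed the error back (reflection)."""
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            return validate_call(raw)
        except (ValueError, KeyError) as err:
            # Append the error so the model can self-correct on the next try.
            prompt += f"\nPrevious output was invalid ({err}). Reply with valid JSON only."
    raise RuntimeError("guardrail: retries exhausted")
```

The same pattern extends to grounding parameter values against conversation context before a tool is actually invoked.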

We present our work on building a real-time conversational task-automation assistant framework with the following emphases: (1) Multi-turn interface for (a) user conversation to execute tasks and (b) executing tools with deterministic graphs that provide status updates, intermediate results, and requests to fetch additional inputs or clarifications from the user. (2) Controllable agents using a symbolic plan, expressed as a natural-language task execution procedure (TEP), to guide the agents through the conversation and the steps required to solve the task. (3) Shared hybrid memory structure, with long-term memory shared across agents that stores complete context information, including agent TEPs, tool updates, dynamic information, and conversation turns. (4) Guardrails for ensuring the correctness of tool invocations, recovering from common LLM error conditions using reflection, and ensuring the general safety of the system. (5) Evaluation mechanism for different aspects and tasks of a multi-agent system.
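The shared hybrid memory of point (3) can be pictured as a single long-term store that every agent reads when assembling its prompt context. The sketch below is our own illustration under that reading; the class and field names are invented, not MARCO's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SharedMemory:
    """Long-term memory shared across agents (illustrative sketch)."""
    teps: dict = field(default_factory=dict)          # usecase -> TEP text
    tool_updates: list = field(default_factory=list)  # status / intermediate results
    facts: dict = field(default_factory=dict)         # dynamic information
    turns: list = field(default_factory=list)         # (speaker, utterance) pairs

    def add_turn(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))

    def context_for(self, usecase: str, last_n: int = 10) -> str:
        """Assemble the prompt context an agent would see for a usecase."""
        history = "\n".join(f"{s}: {t}" for s, t in self.turns[-last_n:])
        updates = "\n".join(self.tool_updates[-last_n:])
        return (f"TEP:\n{self.teps.get(usecase, '')}\n"
                f"Tool updates:\n{updates}\nConversation:\n{history}")
```

Keeping TEPs, tool updates, and conversation turns in one structure means every agent sees the same complete context rather than a private, divergent view.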

This is demonstrated in the context of a task automation assistant that supports adding usecase tasks to provide users a conversational interface where they can perform their intended actions, making it easier for them to refer to informational documents, interact with multiple tools, and perform actions on them while unifying their interfaces. We provide a detailed comparison across multiple foundational LLMs as the backbone for our assistant tasks, including Claude family models (Anthropic, 2024), Mistral family models (Jiang et al., 2023, 2024), and Llama-3-8B (AI@Meta, 2024), on the Digital Restaurant Service Platform (DRSP) conversations and Retail conversations (Retail-Conv) datasets.

2 Related Work

Improvements to LLM technology through the release of foundational LLMs like GPT-4 (OpenAI et al., 2024), Claude (Anthropic, 2024) and Mixtral (Jiang et al., 2024) have led to a flurry of research around autonomous agents and frameworks (Wang et al., 2024; Huang et al., 2024). Zero-shot Chain-of-Thought (CoT) reasoning (Kojima et al., 2023) allows LLMs to perform task reasoning by making the model think step by step. LLMs can invoke external tools based on natural language instructions: HuggingGPT (Shen et al., 2023b) can compose a series of model invocations to achieve complex tasks specified by the user, and Toolformer (Schick et al., 2023) demonstrates how LLMs can use external tools through API invocations, selecting the right arguments to be passed from a few examples and textual instructions. The Agents framework (Zhou et al., 2023) discusses using natural-language symbolic plans called Standard Operating Procedures (SOPs), which define transition rules between states as the agent encounters different situations, to provide more control over agent behavior, along with memory to store relevant state information within the prompt (Fischer, 2023; Rana et al., 2023) or long-term context externally (Zhu et al., 2023; Park et al., 2023). Amazon Bedrock Agents1 provide an interface to quickly build, configure, and deploy autonomous agents into business applications leveraging the strength of foundational models, while the framework abstracts the agent prompt, memory, security, and API invocations. LangGraph2 is an extension of LangChain that facilitates the creation of stateful, multi-actor applications using large language models (LLMs) by adding cycles and persistence to LLM applications, thus enhancing their agentic behavior. It coordinates and checkpoints multiple chains (or actors) across cyclic computational steps.
While these frameworks present novel ways to steer LLMs toward desired behaviour, they often face an accuracy-latency trade-off, where improving accuracy increases system latency due to multi-step planning and thinking (Yao et al., 2023; Wei et al., 2023). Our proposed solution,

1 Amazon Bedrock Agents User Guide
2 LangGraph library

MARCO, not only interacts with the user in a multi-turn fashion but also holds multi-turn conversations with deterministic multi-step functions, which comprise pre-determined business logic or task execution procedure (TEP) and require agents only at steps that call for intelligent intervention. With the usecase TEPs, multi-step functions, and robust guardrails to steer LLM behaviour, MARCO is able to perform complex tasks with high accuracy in less time, as detailed in subsequent sections.
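The division of labour described above, with deterministic steps running as plain business logic and the agent consulted only at flagged steps, can be sketched as follows. This is our own reading of the design; the step representation and names are hypothetical, not MARCO's code.

```python
def run_multi_step(steps, agent, emit_status):
    """Execute a deterministic multi-step function.

    `steps` is a list of (name, fn, needs_agent) tuples. Each `fn` is
    pre-determined business logic; the LLM `agent` is invoked only for
    steps flagged as needing intelligent intervention (e.g. clarifying
    a missing input with the user). `emit_status` streams intermediate
    status updates back to the chat, keeping the interaction multi-turn.
    """
    state = {}
    for name, fn, needs_agent in steps:
        emit_status(f"running {name}")        # intermediate status update
        if needs_agent:
            state.update(agent(name, state))  # LLM fills gaps / clarifies
        state = fn(state)                     # deterministic business logic
    return state
```

Because most steps never touch the LLM, this layout is one plausible source of the latency and cost savings the paper reports relative to plan-at-every-step agents.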

3 MARCO: Multi-Agent Real-time Chat Orchestration

In this section, we discuss our approach for MARCO. Section 3.1 formulates the problem statement in terms of task automation via real-time chat, followed by the components of MARCO in Section 3.2 and the evaluation methods for the performance and latency of MARCO in Section 3.3.

3.1 Problem Statement

Given a user (actor) who wishes to perform a task with intent $I \in \{OOD, Info, Action\}$: the Out-Of-Domain (OOD) intent is defined as any user query that is not in scope of the system, such as a malicious query to jailbreak the system (Shen et al., 2023a; Rao et al., 2024), foul language, or unsupported requests; the "Info" intent is defined as getting information from predefined data sources and indexed documents ($D_{index}$); and the "Action" intent is defined as performing a usecase-related task ($u_x$) that involves following a series of instructions/steps (Task Execution Procedure, $TEP_x$) defined for the usecase and accordingly invoking the right set of tools/functions ($F^x_* = \{F^x_1, F^x_2, \dots, F^x_n\}$) with the identified required parameters ($P^x_* = \{P_{F^x_1}, P_{F^x_2}, \dots, P_{F^x_n}\}$, where $P_{F^x_i}$ is the parameter set of function $F^x_i$). The objective for a task automation system is to: (1) interpret the user intent $I$ for each query, (2) identify the relevant usecase $u_x$, (3) understand the steps mentioned in its $TEP_x$, (4) accordingly call the right sequence of tools $F^x_*$ with required parameters $P^x_*$, (5) correlate $TEP_x$, tool responses and requirements, and conversation context to communicate back with the user, and (6) be fast and responsive for real-time chat.
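The three-way intent split above amounts to a routing decision at the top of each turn. A minimal sketch, assuming placeholder classifier and executor callables (none of these names come from the paper):

```python
def handle_query(query, classify_intent, retrieve, run_action):
    """Route a user query by intent: OOD refusal, retrieval, or action.

    classify_intent : returns one of "OOD", "Info", "Action"
    retrieve        : answers "Info" queries from indexed documents (D_index)
    run_action      : resolves the usecase u_x, follows TEP_x, and invokes
                      the tool sequence F^x_* with parameters P^x_*
    """
    intent = classify_intent(query)
    if intent == "OOD":
        # Guardrail: refuse jailbreaks, foul language, unsupported requests.
        return "Sorry, that request is out of scope."
    if intent == "Info":
        return retrieve(query)
    return run_action(query)
```

Objectives (2)-(5), resolving the usecase, following its TEP, and correlating tool responses with conversation context, all live behind the `run_action` callable in this sketch.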

An example scenario is shown in Figure 1, where the user first asks "The sale of a certain item is going down in my restaurant. Can you please help me find out why?", i.e.