ICDE_vision_paper (1)
- Source PDF: 2512.11001v1.pdf
- Extracted: 2026-05-01T06:11:35-04:00
- Extraction method: pdftotext
Extracted content
Query Optimization Beyond Data Systems: The Case for Multi-Agent Systems
Zoi Kaoudi, IT University of Copenhagen, Copenhagen, Denmark, zoka@itu.dk
Ioana Giurgiu, IBM Research Europe, Zürich, Switzerland, igi@zurich.ibm.com
Abstract—The proliferation of large language models (LLMs) has accelerated the adoption of agent-based workflows, where multiple autonomous agents reason, invoke functions, and collaborate to compose complex data pipelines. However, current approaches to building such agentic architectures remain largely ad hoc, lacking generality, scalability, and systematic optimization. Existing systems often rely on fixed models and single execution engines and are unable to efficiently optimize multiple agents operating over heterogeneous data sources and query engines. This paper presents a vision for a next-generation query optimization framework tailored to multi-agent workflows. We argue that optimizing these workflows can benefit from redesigning query optimization principles to account for new challenges: orchestration of diverse agents, cost efficiency under expensive LLM calls and across heterogeneous engines, and redundancy across tasks. Led by a real-world example and building on an analysis of multi-agent workflows, we outline our envisioned architecture and the main research challenges of building a multi-agent query optimization framework, which aims at enabling automated model selection, workflow composition, and execution across heterogeneous engines. This vision establishes the groundwork for query optimization in emerging multi-agent architectures and opens up a set of future research directions.
I. INTRODUCTION The rapid advancement of Large Language Models (LLMs) has led to a growing interest among organizations and individuals in developing and utilizing agent-based workflows for their data-related applications. The agents used are capable not only of generating syntactically coherent text but also invoking functions, collaborating with one another, and making autonomous decisions. This has given rise to the emergence of multi-agent architectures: systems composed of multiple agents that work together to accomplish complex data tasks. Existing approaches to developing multi-agent architectures are largely ad hoc and fail to generalize across tasks and domains. This leads to fragmented code, suboptimal workflows, redundant development efforts, and costly executions. Moreover, the inherent difficulty of manually identifying and coordinating appropriate agents for complex tasks means that current solutions are typically limited to a single agent or a small set of agents using pre-selected LLMs or Small Language Models (SLMs). This fundamentally limits the potential of leveraging a diverse set of agents which collaborate in an efficient manner to solve highly complex data tasks. At the same time, current multi-agent architectures are further constrained by relying on tools that can invoke a single
execution engine (e.g., a database). This dependency limits the system’s ability to combine heterogeneous data sources and leverage computationally diverse execution engines to gain performance. As a result, such architectures are often restricted to narrowly defined, suboptimal workflows without being able to leverage multiple, specialized execution engines.
Drawing on these observations and our own experience in building data systems, it is evident that there is a pressing need for a systematic framework capable of optimizing multi-agent workflows along three key dimensions: automated agent selection, effective agent orchestration, and automated execution engine selection. By enabling these, such a framework would address performance inefficiencies, reduce development efforts, and unlock the full potential of multi-agent architectures to tackle complex data tasks at scale.
We envision a next-generation class of query optimizers specifically designed to optimize multi-agent workflows. Although we can draw inspiration from query optimizers built for traditional data systems, multi-agent workflows introduce fundamentally new challenges. First, the operations one can perform with agents are theoretically infinite and cannot be predetermined as in relational algebra. This requires a more open and extensible design for query optimizers. Second, besides latency, both monetary cost and accuracy become significant concerns, as LLMs can incur substantial cost with each execution and yield non-deterministic, approximate answers. Therefore, multi-objective query optimization becomes more appropriate in this setting. Third, such workflows often exhibit high redundancy, with many operations shared across large portions of the workload [1]. To reduce redundant computations, caching should be a main component of query optimizers for multi-agent systems. Addressing these challenges requires rethinking both the architecture and methodologies of traditional query optimization.
Although a few works touch on the subject of multi-agent workflows, they primarily explore how to generate, adapt, and score them dynamically. Frameworks such as AdaptFlow [2], ScoreFlow [3], and ADAS [4] propose meta-learning or preference-optimization techniques to refine task allocation and collaboration strategies within multi-agent ecosystems. Similarly, AutoFlow [5], AutoAgent [6], and AgentVerse [7] automate the composition and coordination of agent workflows, emphasizing emergent reasoning behaviors and modular orchestration. While these systems advance the behavioral optimization of agents, they largely operate at the control-logic level, focusing on how agents reason and communicate, without addressing the data-centric aspects of workflow optimization, such as cost modeling, caching, or execution reuse across heterogeneous processing engines.
Complementary lines of research have addressed the multimodal data and query processing dimension. Systems such as Palimpzest [8], CAESURA [9], LOTUS [10], and ELEET [11] extend LLM-based query processing to unify structured, unstructured, and vector data through semantic operators and learned cost models. These frameworks advance cross-modal execution and semantic query planning, yet they focus on optimizing single-agent query pipelines rather than the coordination and optimization of multi-agent workflows.
Our vision bridges the gap between these two directions by focusing on optimizing the workflow structure and the selection of agents and execution engines through a multi-objective optimization approach. We first motivate the need for such a framework with a real-world example (Section II) and analyze a large set of multi-agent workflows (Section III) to understand the user patterns and requirements. Based on our findings, we propose the architecture of our envisioned query optimizer (Section IV) and lay out the main challenges that need to be tackled (Section V).
[Figure 1 panels, each drawn over agents A1 to A10: (a) Chain, (b) Orchestrated DAG, (c) Classical DAG, (d) Feedback graph.]
Fig. 1: Examples of different workflow structures.
II. MOTIVATING EXAMPLE
Let us consider the following real example, where a user provides the following request in a declarative interface.
Real-time Customer Support Reporting: Find all urgent cases reported by customers from live chats in the last 24 hours. Generate a detailed report and flag cases that need priority handling.
Conceptually, the workflow can be decomposed into several tasks, each tackled by an AI agent. Table I describes each of these agents, their input and output, and whether the results are
deterministic or not. This workflow is one of many we have seen from customers requiring multiple disparate agents and combining multiple data modalities, namely structured (extract customer profiles from a data lake), unstructured (sentiment analysis and detection of urgent cases), and streaming (realtime monitoring of chat messages). The potential for workflow automation and optimization for this example is substantial. We identify three dimensions for optimization: (i) how the workflow should be structured, (ii) which models should be used for the tasks on the unstructured data, (iii) which engines should be used for executing each task. We detail these three dimensions next. Choice of workflow structure. First, users typically construct workflows manually by adhering to rigid, experience-driven rules. The simplest workflow is a chain connecting agents sequentially (see Figure 1(a)). Although it is simple to implement and debug, it can be slow (due to no parallelism of tasks) and it does not easily allow rerunning just one branch (e.g., contract checks). A second option is to use an orchestrated DAG, where A1 manages all other agents (see Figure 1(b)). The benefit of such a structure is that it is easy to monitor and retry steps, as well as to parallelize tasks. However, it does impose a more complex orchestration logic and the existence of a centralized agent can incur unnecessary communication. A third option is to use a classical DAG structure (Figure 1(c)). Such a structure is relatively easy to implement and maintain, and it allows for parallelism, where independent branches run concurrently and merge their outputs later in the workflow. While realistic and frequently used in practice, it cannot incorporate the functionality of a feedback loop easily. 
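To make the latency argument concrete, the following minimal sketch compares a chain with a classical DAG over the same agents by computing the critical path, i.e., the longest dependency path, which bounds end-to-end latency when independent branches run concurrently. The per-agent latencies and both edge lists are made-up illustrative values, not measurements from the example workflow, and the helper name `critical_path` is hypothetical.

```python
from functools import lru_cache

# Hypothetical per-agent latencies in seconds (illustrative values only).
LATENCY = {"A2": 2.0, "A3": 1.0, "A4": 3.0, "A5": 2.0, "A9": 1.0}


def critical_path(edges, latency):
    """Return the longest-path latency of a DAG given as (src, dst) edges."""
    preds = {node: [] for node in latency}
    for src, dst in edges:
        preds[dst].append(src)

    @lru_cache(maxsize=None)
    def finish(node):
        # An agent starts only once all of its predecessors have finished.
        start = max((finish(p) for p in preds[node]), default=0.0)
        return start + latency[node]

    return max(finish(node) for node in latency)


# Chain: every agent waits for the previous one.
chain = [("A2", "A3"), ("A3", "A4"), ("A4", "A5"), ("A5", "A9")]
# DAG: A4 and A5 both depend only on A3 and run concurrently before A9.
dag = [("A2", "A3"), ("A3", "A4"), ("A3", "A5"), ("A4", "A9"), ("A5", "A9")]

chain_latency = critical_path(chain, LATENCY)
dag_latency = critical_path(dag, LATENCY)
print(chain_latency, dag_latency)
```

With these assumed numbers the chain pays for every agent in sequence, while the DAG pays only for the slower of the two parallel branches. A feedback edge would make the graph cyclic, so this helper applies only to acyclic structures.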
Perhaps the best option, albeit the most complex, is a feedback graph structure (Figure 1(d)), which combines the simplicity of chain flow with parallelism and, most importantly, feedback, thus allowing continuous improvement. As we will see later, this structure is rarely used, as most workflows are built without a continuous optimization aspect, supported by a feedback loop, in mind. The structure alternatives listed above are by no means exhaustive. Manually exploring all possible choices is impractical and complex, which calls for a query optimizer that explores them automatically. The complexity of the task is further exacerbated by the choice of models and engines per individual agent, as we discuss in the following.
Choice of models. Of the ten agents involved in solving the example problem, seven require an LLM/SLM or other model to execute their given task. In particular, agents A3, A4, A6, A7, and A9 rely on core NLP reasoning capabilities to interpret, classify, or summarize text data. The pool of LLMs and SLMs to choose from is enormous. Still, despite the rich choice, users simply use models they are familiar with. They are concerned about cost aspects (e.g., #tokens used, inference time, resource usage) only when their model becomes extremely expensive (i.e., a reactive decision rather than a cost-driven strategy). This behavior highlights the need for a systematic, cost-aware model selection strategy
for multi-agent workflows.
TABLE I: Decomposition of the example workflow into specialized AI agents.
ID | Agent Name | Description / Role | Input | Output | Agent Type
A1 | Orchestrator agent | Interprets user input, controls workflow, dispatches tasks to other agents, ensures completion. | User request | Final report | Stochastic
A2 | Stream retrieval agent | Pulls chat transcripts from the last 24h. | Time window | Raw chat logs | Deterministic
A3 | Filtering agent | Identifies only case-related chats. | Raw transcripts | Case transcripts | Deterministic
A4 | Urgency detection agent | Classifies urgency (critical/high/normal). | Case transcripts | Annotated cases | Stochastic
A5 | Customers retrieval agent | Extracts customer data from data lake. | Case transcripts | Enriched records | Deterministic
A6 | Policy check agent | Retrieves contract policies and SLA terms per customer. | Enriched records | Policy-matched records | Stochastic
A7 | Priority analysis agent | Combines urgency, customer tier, and policy to flag priority. | Policy-matched records | Priority labels | Stochastic
A8 | KPI agent | Computes metrics (e.g., “5 complaints in 24h”). | Priority labels | Enriched labeled cases | Deterministic
A9 | Reporting agent | Produces final summary or report. | Enriched labeled cases | PDF/CSV/Dashboard | Stochastic
A10 | Feedback agent | Reviews outcomes, compares predictions to follow-up results, and refines thresholds and decision rules. | Feedback logs | Model updates, metadata | Stochastic
Choice of engines. Each agent in the workflow is a runtime unit with its own task type, resource demands, and execution constraints. An engine (database, inference server, etc.) is the environment that must meet those needs. In our example, the deterministic agents performing tasks on structured data require dedicated query engines: a streaming engine (e.g., StreamSets) for A2, a relational database (e.g., DuckDB) for A5, a vector engine (e.g., Pinecone) for A6, and a scalable analytics engine (e.g., Spark) for A8. The choice of engine for each task is not obvious and depends heavily on the metrics the user wants to optimize. For instance, A8’s task of computing KPI metrics may benefit from an in-memory single-node engine for low latency, yet a distributed system (such as Spark) may be preferable when a large number of cases are classified as urgent. Similarly, stochastic agents relying on models also require an inference engine, either a local inference server or a cloud-based API. Although the trade-offs between local infrastructure and the cloud are well understood, it is not always clear which choice is best, and the answer can depend on many factors, including the volume of tokens used.
Why should choosing workflow structure, models, and engines be an automated and optimized process? Manual selection of workflow structure, models, and execution engines over different objective criteria constitutes a complex multi-objective optimization problem that is difficult for users to handle. Each agent is subject to multiple, often conflicting factors, such as latency, cost, and accuracy, making it challenging to reason about trade-offs or model the Pareto frontier in real time. Conditions in the system are also dynamic: traffic spikes, shifting business priorities (e.g., reducing costs by a fixed percentage), or the availability of new models can change which configurations are optimal. Moreover, choices can be interdependent, leading to cascading effects across other agents. For example, using Pinecone for A6 may make LlamaIndex preferable for A7 due to integration consistency. These trade-offs are further complicated by the fact that engines performing
well in testing may degrade under live load or when applied to new data domains. These challenges highlight the need for a new query optimization framework tailored to determining workflow structure, models, and engines in multi-agent workflows.
III. ANALYSIS OF REAL MULTI-AGENT WORKFLOWS
To understand user needs, we analyzed multi-agent workflows generated from real and typical examples deployed in industry. The workload we used consisted of over 9000 multi-agent workflows (generated based on tens of concrete workflows), each comprising at least three tasks (workflows comprising only two tasks were excluded from the analysis due to their simplicity). The tasks ranged from ingesting and integrating data, to schema mapping / linking, feature engineering, performing analytical and reasoning tasks (summarization, sentiment analysis, anomaly detection), reporting, and visualization. Figure 2a shows the distribution of the number of tasks per workflow. We observe a higher density of workflows composed of 5-7 tasks, but it is not uncommon to find workflows with more than 10 tasks. As technology advances and dashboards, applications, and tools emerge that create such workflows automatically, we expect the distribution to shift to larger numbers. This is similar to what we observe with SQL queries and query plans, which are more complex when created automatically by applications [12].
We further analyze the workload along the following dimensions: (i) task distribution, (ii) engine execution, (iii) usage of LLMs/SLMs, and (iv) workflow structure. These dimensions are crucial in setting the ground towards building a multi-agent workflow optimizer. We detail our findings next.
[Figure 2 panels: (a) Distribution of #tasks per workflow. (b) Task distribution across workloads. (c) Engine selection of deterministic tasks. (d) Most popular LLM families used. (e) Workflow structures used.]
Fig. 2: Analysis of multi-agent workflows.
Task distribution. We measure the percentage of tasks appearing in more than one workflow. Figure 2b shows the results. The most common tasks across the workload are: 1) data ingestion (e.g., schema, documents, chat messages, images): 100%; 2) connecting to various data sources (e.g., databases, storage): 100%; 3) entity extraction: 77%; 4) report generation: 70%; 5) report sending: 61%. Notably, analysis of unstructured text occurs in more than 50% of workloads. In 44% of workloads, entity linkage as a task occurs at least once.
Engine execution. We found that around 43% of the tasks involve a deterministic agent and need to be executed on a query engine. The rest of the tasks involve stochastic agents, which execute various LLMs, SLMs, and ML models. Of the 43% of tasks that require an engine, most execute on a structured database such as Postgres, Presto, or DuckDB (60%), while the remaining 40% are distributed among analytics engines such as Spark (23%), vector databases such as Milvus (9%), and streaming engines (8%). We show this in Figure 2c.
LLM/SLM usage. Of the 57% of tasks that are stochastic, 21% use an LLM for sentiment analysis, summarization, entity linkage, classification, anomaly detection, etc. However, when it comes to the actual LLMs used across these tasks, we observe that the same limited set repeats frequently, even though many LLMs are available (e.g., HuggingFace already hosts hundreds of thousands of LLMs and fine-tuned versions). For example, for summarization and sentiment analysis tasks, the predominant LLM families used are LLaMa3 and Qwen2. For code generation tasks, Claude 3.5 Sonnet and Qwen2.5-Coder are the most popular. For image analysis tasks, LLaMa3 and Granite Vision (SLM) are favorites. We show the most popular families of open LLMs used in our set in Figure 2d (i.e., the pool of LLMs provided is limited to open models, no closed models like Gemini or ChatGPT are used). The size of each bubble corresponds to how many LLMs from that family were used for the tasks. It
is worth mentioning that, in general, users prefer to use models they know and have used before rather than experiment with other models. We also notice a tendency to use models with fewer parameters rather than large models.
Workflow structures. When building multi-agent systems, apart from assigning tasks to an agent, we are also concerned with how agents communicate and in what order. Figure 2e shows the usage frequency of different structural patterns. Unsurprisingly, a simple chain is used most frequently, in 34% of the cases, followed by DAG (25%), Tree (24%), and Branching Chain (6%). More complex structures, such as Hybrid (4%), Pub-Sub (3%), Cyclic Graph (2%), and Mesh (<1%), are rarely used. However, it is important to note that choosing a simpler structure does not necessarily translate to a more efficient, cost-effective, high-accuracy workflow. Thus, there is a clear need for automated optimization of multi-agent workflows.
IV. OUR VISION
Our vision is to build a query optimization framework which, given an abstract multi-agent workflow (i.e., without concrete models and execution engines), determines an optimized multi-agent workflow (i.e., it finds an optimal workflow structure and selects the “best” models and execution engines) according to the user’s optimization criteria. Figure 3 illustrates how an abstract workflow can be transformed into an optimized executable workflow in our motivating example. Our envisioned framework builds upon ideas from cross-engine query optimization [13], [14], but extends them to a new domain where multi-agent workflows involve stochastic components and expensive reasoning costs that fall outside the scope of traditional dataflow systems.
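As a rough illustration of the distinction between an abstract and an executable workflow, the sketch below binds abstract agents to engines and models while keeping the communication edges unchanged. All class, function, and field names are hypothetical, and the `"inference-server"` and `"example-slm"` values are placeholders rather than choices made in the paper; only the A5/A6 roles and the DuckDB example come from the motivating example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AbstractAgent:
    agent_id: str
    capability: str   # what the agent does, e.g. "policy check"
    stochastic: bool  # True if the task needs a model


@dataclass(frozen=True)
class ExecutableAgent:
    agent: AbstractAgent
    engine: str                  # execution environment chosen by the optimizer
    model: Optional[str] = None  # set only for stochastic agents


def instantiate(agents, edges, binding):
    """Bind each abstract agent to (engine, model) and carry the edges over."""
    bound = {}
    for agent in agents:
        engine, model = binding[agent.agent_id]
        if agent.stochastic != (model is not None):
            raise ValueError(f"{agent.agent_id}: model must be set iff stochastic")
        bound[agent.agent_id] = ExecutableAgent(agent, engine, model)
    return list(bound.values()), [(bound[s], bound[d]) for s, d in edges]


# Two agents from the motivating example; the binding is an assumed choice.
agents = [
    AbstractAgent("A5", "customers retrieval", stochastic=False),
    AbstractAgent("A6", "policy check", stochastic=True),
]
edges = [("A5", "A6")]
binding = {"A5": ("DuckDB", None), "A6": ("inference-server", "example-slm")}
exec_agents, exec_edges = instantiate(agents, edges, binding)
```

Keeping the binding as a separate argument means the same abstract workflow can be instantiated many times with different engine and model choices, which is the space an optimizer would search.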
[Figure 3 diagram. Left panel, "Abstract Multi-Agent Workflow": A1: Orchestrating; A2: Stream ingestion; A3: Filtering; A4: Sentiment analysis; A5: Customer extraction; A6: Policy rules check; A7: Entities integration & merge; A8: KPIs computation; A9: Report generation; A10: Feedback. Right panel, "Optimized Executable Multi-Agent Workflow": the same agents with A2 and A3 merged into "Stream ingestion + Filtering" and with agents annotated by execution targets such as @Inference server and @HuggingFace API.]
Fig. 3: Example of abstract multi-agent workflow (left) and optimized executable multi-agent workflow (right).
A. Problem Statement
We first provide some useful definitions of workflows that lead to the problem statement.
Abstract multi-agent workflow. At a conceptual level, a multi-agent workflow represents a logical composition of agents collaborating to achieve a common goal. Each agent is characterized by a specific capability, such as retrieval, sentiment analysis, or summarization, and interacts with other agents. In contrast to relational databases, where the set of operations one can use is fixed, here the set of tasks one can achieve with agents is significantly larger and undefined. While the space of possible tasks that can be described is infinite in theory, the subset a system can execute in practice should be finite. It is thus necessary to assume the existence of a global agent registry R = {a1, a2, ..., an}, which defines the available agents together with their functional descriptions and expected inputs/outputs. Formally, we define an abstract multi-agent workflow as a directed graph W = (A, E), where A ⊆ R denotes the set of agents and E ⊆ A × A represents the communication between them. Each directed edge (ai, aj) ∈ E specifies a dependency in which the output of agent ai serves as input to agent aj. The workflow defines what tasks are to be performed and how information (i.e., control and data) flows between agents, but abstracts away from how each agent is implemented or executed. We distinguish between two types of agent nodes: stochastic, which provide a possibly approximate answer that can differ from run to run (e.g., summarization), and deterministic, which consistently return the same output.
Executable multi-agent workflow. An executable multi-agent workflow instantiates the abstract representation by mapping each logical agent to a concrete implementation and execution environment. In practice, an executable workflow specifies (i) the model to be used for a stochastic agent, (ii) the engine on which each agent’s task has to be executed (e.g., databases, large-scale analytics engines, API endpoints, inference servers, etc.), and (iii) the orchestration plan that dictates the execution order, parallelism, and communication. We define an executable multi-agent workflow as a directed graph W∗ = (A∗, E∗), where A∗ is a set of instantiated executable agents and E∗ is a set of directed edges showing the communication between two instantiated agents.
Problem statement. Current multi-agent architectures express both abstract and executable workflows within one implementation. For instance, a summarization agent may be hard-coded to always use a specific model (e.g., Mistral 3B) from a specific API rather than referencing an abstract “summarization” capability. This tight coupling of logic and execution introduces code redundancy, as similar tasks of different workflows must be reimplemented whenever a different model, cost constraint, or execution environment is desired. Moreover, it leads to missed optimization opportunities, as the system cannot systematically explore alternative model choices, execution engines, or shared computation paths. Developers are forced to manually build pipelines, select models, and specify orchestration plans without considering dependencies, data reuse, or cost-efficiency. We define the multi-agent workflow optimization problem as follows: Given an abstract workflow W = (A, E) and a set of optimization objectives O = {o1, o2, ..., om} (e.g., #tokens, monetary cost, accuracy, runtime, energy consumption, etc.), the goal is to find one or more executable workflows W∗ = (A∗, E∗) that represent a Pareto-optimal solution over the objectives O. The challenge lies in automatically identifying optimal mappings from abstract to executable workflows while accounting for heterogeneous data modalities, execution platforms, and inter-agent communication overheads.
B. Envisioned Query Optimizer Architecture
Figure 4 shows our envisioned architecture for a multi-agent query optimization framework. The input to the optimizer is (i) an abstract workflow, which is either directly created by an expert user or is automatically built by an NLP component, and (ii) a set of objective criteria that the user wants to optimize for (e.g., monetary cost and accuracy). The multi-agent workflow optimizer consists of the following main components: (i) a multi-layer multi-purpose cache for storing intermediate results, plans, and state, (ii) a multi-objective planner for navigating the search space of different workflow implementations, (iii) a set of unified cost models for estimating costs along the different objective criteria, and (iv) a search space manager that keeps the knowledge required by the rest of the components, such as the agent registry and the available models and engines. In addition to the four main components, the optimizer comes with a monitor, which logs execution and telemetry to be used for on-the-fly workflow re-optimization, and with statistics, which are required by the cost models. Although the architecture may resemble the traditional relational query optimizer architecture [15], our framework departs from it in fundamental ways, as we detail in the following when describing each component.
[Figure 4 diagram labels: Application user, NLP Query, NL Parser, Expert user, Abstract multi-agent workflow, Criteria, Multi-agent Workflow Optimizer (Multi-layer Multi-purpose Cache, Search Space Manager, Multi-objective Planner, Unified Cost Models, Statistics), Monitor, Pareto-optimal executable multi-agent workflows.]
Fig. 4: Multi-agent query optimizer architecture.
Multi-Objective Planner. The multi-objective planner serves as the core component of the optimizer. Its role is to explore alternative executable multi-agent workflows for a given abstract workflow and identify those that optimally balance the input objective criteria given by the user, such as latency, monetary cost, and accuracy. Similar to a traditional query planner that generates and evaluates alternative join orders or access paths, the multi-objective planner enumerates possible mappings from abstract agents to concrete implementations, along with alternative orchestration strategies and workflow structures. Each candidate executable workflow (or part of a workflow) is evaluated according to the cost models. Instead of producing a single optimal plan, the planner maintains a Pareto frontier of executable workflows, allowing downstream components or user policies to select the most appropriate trade-off for execution. The multi-objective planner is also able to re-optimize a whole workflow or parts of it on the fly, leveraging up-to-date information from the cost models and the cache.
Unified Cost Models and Statistics. A unified cost model is essential for estimating the cost of candidate executable workflows in terms of each optimization objective. To support cross-engine optimization, the cost model needs to integrate
statistics from structured systems (e.g., relational engines), reasoning modules (e.g., LLMs), and hardware metadata (e.g., GPU availability, API throughput, or local inference performance). By unifying these disparate dimensions under a single predictive model, the optimizer is able to make more informed decisions about which platforms and models best satisfy the given multi-objective criteria.
Search Space Manager. The search space manager serves as the knowledge backbone of the query optimizer. It maintains the agent registry, which describes the tasks supported by the system; the pool of models and the pool of execution engines, which specify possible instantiations for each agent; and the set of optimization objectives that can be considered. It provides a view of the system’s capabilities and thus implicitly defines the search space that the multi-objective planner explores for executable workflows.
Multi-layer Multi-purpose Cache. Unlike in traditional query optimizers, the multi-layer multi-purpose cache (MMCache) has to be a core component, due to task redundancy and the cost of optimizing and executing multi-agent workflows. The goal of the MMCache is threefold. First, it aims at eliminating redundant computation for tasks that are either identical or semantically equivalent. To achieve this, MMCache integrates two complementary mechanisms: an exact-match symbolic tier and a similarity-based semantic tier. The symbolic cache handles repeated deterministic requests with O(1) lookup complexity. The semantic cache is organized into short-to-medium and long-term tiers and enables approximate nearest-neighbor retrieval for semantically similar queries, capturing reuse patterns without costly recomputation.
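The first goal, an exact-match tier backed by a similarity tier, can be sketched as follows. This is a minimal illustration under loud assumptions rather than the MMCache design: the class name `TwoTierCache` is hypothetical, the bag-of-words `embed` function stands in for a learned encoder, a brute-force scan stands in for approximate nearest-neighbor retrieval, and the short-to-medium and long-term tiers are not modeled.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, threshold=0.8):
        self.exact = {}      # symbolic tier: hash lookup, O(1) on average
        self.semantic = []   # semantic tier: (embedding, result) pairs
        self.threshold = threshold

    def put(self, request, result):
        self.exact[request] = result
        self.semantic.append((embed(request), result))

    def get(self, request):
        # Tier 1: exact match on the request string.
        if request in self.exact:
            return self.exact[request], "exact"
        # Tier 2: brute-force similarity scan (stand-in for ANN retrieval).
        query = embed(request)
        best_score, best_result = 0.0, None
        for emb, result in self.semantic:
            score = cosine(query, emb)
            if score > best_score:
                best_score, best_result = score, result
        if best_score >= self.threshold:
            return best_result, "semantic"
        return None, "miss"


cache = TwoTierCache()
cache.put("urgent cases in the last 24 hours", "report-1")
hit_exact = cache.get("urgent cases in the last 24 hours")
hit_semantic = cache.get("urgent cases in last 24 hours")
miss = cache.get("customer tier statistics")
```

The similarity threshold is the key tuning knob in such a design: set too low, the cache returns results for requests that are not actually equivalent, which matters most for stochastic agents whose outputs are already approximate.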
Second, MMCache stores optimized execution plans or plan fragments (e.g., operator trees, cost estimates, and engine mappings) that can be reused or adapted for structurally similar workflows, reducing optimization overhead and accelerating costly planning decisions. Third, MMCache also keeps coordination strategies, decision traces, and policies that have proven effective for orchestrating multiple agents. The optimizer can use this cache to quickly reproduce prior effective behaviors when faced with similar contexts and thus, adapt faster and incur less training overhead. Moreover, MMCache can serve learned policies and best practices for distinct classes of multi-agent workflows. Monitor. The monitor component maintains runtime performance metrics that inform the optimizer. After each execution, the system logs observed latencies, throughput, token usage, memory consumption, and monetary cost associated with each agent and engine. These statistics feed adaptive cost models that allow the system to make data-driven scheduling and optimization decisions. For example, if an LLM agent becomes slower or more expensive, the optimizer can route tasks to an alternative engine. V. OPEN CHALLENGES We outline several key research challenges that must be addressed to realize our envisioned query optimization framework. We organize the challenges according to the main
architecture components, while noting that the list is by no means exhaustive.
A. Unified Cost Models
While traditional cost models have served as the foundation of query optimizers for decades, their assumptions no longer hold in multi-agent systems where heterogeneous engines, stochastic reasoning modules, and multi-objective trade-offs coexist. The unified cost models envisioned in our architecture must integrate cost signals from structured databases, vector stores, streaming engines, and inference APIs, while accounting for uncertainty, semantic reuse, and feedback. Developing such a model raises a number of open challenges.
Challenge 1: How can cost models be unified across engines and modalities?
Analytical cost models in database systems rely on deterministic operator algebra and uniform metrics (CPU, I/O, cardinality). In contrast, multi-agent workflows execute across heterogeneous environments, each exposing incompatible performance metrics and pricing schemes. A token-based LLM API call, for example, cannot be directly compared with a GPU-bound inference engine or a CPU-bound relational query. A unified abstraction must normalize these metrics into a common representation that enables fair cross-engine comparison. Moreover, the cost model must remain portable as hardware and pricing evolve, which precludes extensive per-engine calibration.
Possible direction: Learned cost models [16]–[18] can capture multi-dimensional per-engine cost vectors, combining traditional runtime predictors with learned estimators for latency, token usage, and monetary cost across heterogeneous execution environments. Hardware-aware latency predictors from compiler auto-scheduling systems, such as AutoTVM [19] or Ansor [20], could supply GPU and accelerator priors for inference tasks, while maintenance cost formulations from vector database systems [21] can parameterize the cost of semantic caching tiers.
Challenge 2: How can non-determinism and probabilistic accuracy be modeled?
Classical cost models in databases and workflow systems assume deterministic outputs and monotonic latency behavior. However, LLM-based agents violate both assumptions. Execution time, token usage, and even output accuracy vary across runs, despite identical prompts and configurations [22]. Moreover, inference mechanisms, such as batching and prefill/decode disaggregation, create non-monotonic latency distributions, where increased load or request length can unexpectedly reduce tail latency due to scheduler decisions [23]–[26]. These stochastic and dynamic properties render traditional cost models inadequate for optimizing multi-agent pipelines.

Possible direction: A unified model must capture not only expected cost and utility but also their variances and correlations. Inspired by robust learned query optimization in databases [27], cost models must account for distributions and not just deterministic scalars. Probabilistic and uncertainty-aware optimization frameworks [28], [29] estimate and propagate cost distributions rather than single values, while conformal prediction [30] provides calibrated uncertainty bounds for metrics like latency or accuracy [31]. Expected-utility optimization [32] further extends these ideas by balancing stochastic cost and utility, generalizing deterministic query optimization to probabilistic settings.

B. Multi-objective Planner

While traditional query optimizers are most often focused on optimizing latency, in multi-agent systems users are concerned with multiple metrics, such as latency, monetary cost, and accuracy. Thus, the planner must simultaneously reason over multiple objectives and be able to return solutions in the Pareto front. At the same time, the problems of model and engine selection have been viewed and solved independently so far. However, optimization over one layer in isolation can yield globally suboptimal plans. For example, selecting a smaller LLM may lower token cost but increase downstream verification latency; offloading a deterministic agent to a slower CPU node may save GPU budget but degrade system-level accuracy due to delayed feedback. These interactions and multiple dimensions introduce several open challenges.

Challenge 3: How to efficiently navigate the large search space introduced by both layers and multiple objectives?

Multi-agent workflows introduce a combinatorial explosion of possible executable workflows due to the joint selection of models and engines and multiple objective criteria. Exploring this space exhaustively is computationally expensive, and naïve heuristics can easily miss globally optimal or near-Pareto-optimal solutions. Efficiently pruning and guiding the search while balancing conflicting objectives, accounting for stochastic outputs, and preserving flexibility for runtime adaptation remains an open research problem.

Possible direction: One promising direction is to leverage generative models to produce candidate solutions directly in the multi-dimensional optimization space, rather than relying on exhaustive search.
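Whichever method generates the candidates, the planner ultimately has to reduce them to the non-dominated set over latency, monetary cost, and accuracy. A minimal sketch of that filtering step follows; the `Plan` type, the plan names, and all metric values are hypothetical:

```python
from typing import NamedTuple

class Plan(NamedTuple):
    name: str
    latency_s: float  # lower is better
    cost_usd: float   # lower is better
    accuracy: float   # higher is better

def dominates(a: Plan, b: Plan) -> bool:
    """True if `a` is no worse than `b` on every metric and strictly better on one."""
    no_worse = (a.latency_s <= b.latency_s and a.cost_usd <= b.cost_usd
                and a.accuracy >= b.accuracy)
    strictly_better = (a.latency_s < b.latency_s or a.cost_usd < b.cost_usd
                       or a.accuracy > b.accuracy)
    return no_worse and strictly_better

def pareto_front(plans: list[Plan]) -> list[Plan]:
    """Keep plans that no other plan dominates (quadratic scan, fine for small sets)."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]

# Hypothetical candidates: model/engine pairings with made-up metric values.
candidates = [
    Plan("large-llm@gpu", latency_s=4.0, cost_usd=0.50, accuracy=0.95),
    Plan("small-llm@gpu", latency_s=1.5, cost_usd=0.08, accuracy=0.88),
    Plan("small-llm@cpu", latency_s=6.0, cost_usd=0.10, accuracy=0.88),
    Plan("large-llm@cpu", latency_s=9.0, cost_usd=0.45, accuracy=0.95),
]

front = pareto_front(candidates)
print([p.name for p in front])
```

Here `small-llm@cpu` is dropped because `small-llm@gpu` is no worse on every metric and strictly better on two; the remaining three plans represent genuine trade-offs.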
Historical workflows could be leveraged to build such generative models, which could then be iteratively refined by exploring the latent space via Bayesian optimization to find better solutions. While similar approaches have been explored in learned offline query optimization and superoptimization [33], [34], it is not straightforward to adapt them to the multi-agent setting and to multiple optimization objectives. Other proposals from multi-dimensional, multi-objective optimization could also be explored, such as genetic algorithms for Pareto-front learning across layers [35], [36], hypernetwork-based scalarization [37], bilevel Pareto optimization, or contextual multi-objective reinforcement learning [38], [39].

C. Semantic Cache

A major challenge in our envisioned architecture comes from the semantic layer of the MMCache. Recent efforts [40]–[43] that attempt to integrate semantic caching into LLM-based data systems have revealed that the main performance bottlenecks arise not from similarity search itself, but from the cost of embedding computation and index maintenance. While
tools such as FAISS [44] have become the de facto standard for vector similarity search, their computational model and update semantics are ill-suited to the highly dynamic, fine-grained caching workloads typical of multi-agent architectures.

Challenge 4: How can embedding computation cost be reduced?

The cost of semantic caching comprises two components: (a) the computation of embeddings and (b) the insertion and update operations on the vector index. For example, generating embeddings using LLMs such as OpenAI’s text-embedding-3-large incurs roughly $0.13 per million tokens (2024 pricing) [45], which translates to approximately $66 to embed one million 512-token entries once (one million entries × 512 tokens = 512 million tokens, and 512 × $0.13 ≈ $66.56). The latency overhead is equally significant: SLMs yield per-sample times of 5–50 ms, while typical LLMs can exceed hundreds of milliseconds per sample [46]. In highly dynamic multi-agent systems, where thousands of tasks are created and executed in real time, such costs render continuous embedding recomputation impractical.

Possible direction: (1) Incremental embedding computation, i.e., deriving updated embeddings from previous model activations, can significantly reduce the cost of building embeddings. This requires understanding how semantic representations evolve across model layers and tasks. (2) Hierarchical or multi-resolution embeddings [47] can compute lightweight coarse vectors first and refine only frequently reused items. (3) Embedding distillation and compression, using smaller student models or quantized representations, can approximate large embedding models at a fraction of the cost [48], [49]. (4) Lazy or on-demand index maintenance defers expensive index reorganization using log-structured or background merge techniques [50]. (5) Selective or representative embedding can embed only high-utility or clustered items to minimize redundancy [51].

Challenge 5: How to measure query equivalence?
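One way to frame the question is as a threshold test on embedding similarity. The sketch below is a toy illustration under stated assumptions: the `SemanticCache` class is hypothetical, and the three-dimensional vectors are hand-written stand-ins rather than outputs of any real embedding model. It shows how the reuse threshold trades recall against precision:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

class SemanticCache:
    """Toy cache: reuse an answer when a stored query is similar enough."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    def get(self, embedding):
        """Return the closest entry's answer if it clears the threshold, else None."""
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

# Hand-written stand-in vectors (assumption: not from a real embedding model).
q_customers = [0.90, 0.40, 0.10]   # "show me top 10 customers by revenue"
q_paraphrase = [0.88, 0.42, 0.12]  # "ten highest-revenue customers"
q_products = [0.90, 0.10, 0.40]    # "show me top 10 products by revenue"

loose, strict = SemanticCache(0.90), SemanticCache(0.95)
for cache in (loose, strict):
    cache.put(q_customers, "customers result")

print("loose/products:   ", loose.get(q_products))
print("strict/products:  ", strict.get(q_products))
print("strict/paraphrase:", strict.get(q_paraphrase))
```

With the looser threshold the products query is served the customers answer (a false positive); the stricter threshold rejects it while still accepting the paraphrase.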
Unlike in traditional data systems, where query equivalence is both syntactic and logical, queries in agent-based systems are often expressed in natural language, can be open-ended, and are executed by non-deterministic components (e.g., LLMs). This means that exact equivalence is undefined, as outputs are probabilistic and/or context-dependent. Thus, equivalence reduces to semantic similarity, i.e., whether two queries are “similar enough” for a cached answer to be reused. This requires carefully balancing precision (avoiding incorrect reuse), recall (maximizing reuse opportunities), and cost (minimizing embedding and verification overhead).

Possible direction: Current techniques for measuring query similarity include (i) embedding-based similarity (vector distance, e.g., cosine or Euclidean), (ii) combining structural normalization (parsing the query structure) with semantic embedding (embedding only predicates and intent phrases) to obtain a hybrid signature, (iii) learned equivalence classifiers, and (iv) output-based similarity (i.e., comparing queries’ outputs and caching a semantic equivalence relation). However, using embedding models to compute similarity can lead to false positives, as they may conflate semantically distinct queries (e.g., “show me top 10 customers by revenue” vs. “show me top 10 products by revenue” have similar
embeddings but different semantics). Learned classifiers are potentially expensive to train and can be unreliable. Output-based similarity is the most reliable approach, but also the most expensive. A hybrid method would theoretically be able to reduce, but not entirely eliminate, false positives and seems to be the most practical one for cross-engine caching.

VI. CONCLUSION

Current approaches to building multi-agent architectures are largely ad hoc, lacking generality, scalability, and systematic optimization. Many systems rely on a single execution engine and cannot efficiently coordinate multiple agents operating over heterogeneous data sources, models, and engines. In this paper, we have presented a vision for a next-generation query optimization framework tailored to multi-agent workflows. We have argued that optimizing these workflows requires rethinking traditional query optimization to address new challenges, including coordination across diverse agents, cost efficiency in the presence of expensive LLM calls and heterogeneous execution engines, and redundancy across tasks. Building on an analysis of multi-agent workflows, we outlined an architecture and key research challenges of such a framework. This vision lays the groundwork for systematic, scalable, and cost-aware optimization in emerging multi-agent systems and opens avenues for future research and practical deployment.

REFERENCES

[1] S. Liu, S. Ponnapalli, S. Shankar, S. Zeighami, A. Zhu, S. Agarwal, R. Chen, S. Suwito, S. Yuan, I. Stoica, M. Zaharia, A. Cheung, N. Crooks, J. E. Gonzalez, and A. G. Parameswaran, “Supporting our AI overlords: Redesigning data systems to be agent-first,” 10.48550/arXiv.2509.00997, 2025. [Online]. Available: http://arxiv.org/abs/2509.00997
[2] T. Zhu, B. Li, C. Binnig et al., “AdaptFlow: Adaptive workflow optimization via meta-learning,” arXiv preprint arXiv:2508.08053, 2025.
[3] H. Wang, X. Lin, S. Sun et al., “ScoreFlow: Mastering LLM agent workflows via score-based preference optimization,” arXiv preprint arXiv:2502.04306, 2025.
[4] J. Hu, X. Dong et al., “Automated design of agentic systems (ADAS),” arXiv preprint arXiv:2408.08435, 2025.
[5] B. Li, Y. Chen, C. Binnig et al., “AutoFlow: Automated workflow generation for large language model agents,” arXiv preprint arXiv:2407.12821, 2024.
[6] J. Tang, H. Li et al., “AutoAgent: A fully-automated and zero-code framework for LLM agents,” arXiv preprint arXiv:2502.05957, 2025.
[7] Z. Chen, Y. Huang et al., “AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviours,” arXiv preprint arXiv:2308.10848, 2023.
[8] Z. Liu, L. Patel, C. Binnig et al., “Palimpzest: A declarative framework for LLM-powered data analytics,” arXiv preprint arXiv:2502.xxxxx, 2025.
[9] M. Urban, C. Binnig et al., “Caesura: Language models as multi-modal query planners,” CIDR Conference, 2024.
[10] L. Patel, C. Binnig et al., “Lotus: Enabling semantic queries with LLMs over tables of unstructured and structured data,” PVLDB, 2025.
[11] M. Urban and C. Binnig, “Eleet: Efficient learned query execution over text and tables,” PVLDB, vol. 17, no. 13, pp. 4867–4880, 2024.
[12] T. Schmidt, V. Leis, P. Boncz, and T. Neumann, “SQLStorm: Taking database benchmarking into the LLM era,” Proc. VLDB Endow., vol. 18, no. 11, pp. 4144–4157, 2025.
[13] K. Beedkar, B. Contreras-Rojas, H. Gavriilidis, Z. Kaoudi, V. Markl, R. Pardo-Meza, and J. Quiané-Ruiz, “Apache Wayang: A unified data analytics framework,” SIGMOD Rec., vol. 52, no. 3, pp. 30–35, 2023.
[14] D. Agrawal, S. Chawla, B. Contreras-Rojas, A. K. Elmagarmid, Y. Idris, Z. Kaoudi, S. Kruse, J. Lucas, E. Mansour, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, S. Thirumuruganathan, and A. Troudi, “RHEEM: enabling cross-platform data processing - may the big data be with you! -,” Proc. VLDB Endow., vol. 11, no. 11, pp. 1414–1427, 2018.
[15] Y. E. Ioannidis, “Query optimization,” ACM Comput. Surv., vol. 28, no. 1, pp. 121–123, 1996.
[16] R. Heinrich, X. Li, M. Luthra, and Z. Kaoudi, “Learned cost models for query optimization: From batch to streaming systems,” Proc. VLDB Endow., vol. 18, no. 12, 2025.
[17] R. Heinrich, M. Luthra, J. Wehrstein, H. Kornmayer, and C. Binnig, “How good are learned cost models, really? insights from query optimization tasks,” Proc. ACM Manag. Data, vol. 3, no. 3, pp. 172:1–172:27, 2025.
[18] A. Strausz, N. Pardon, and I. Giurgiu, “A learned cost model-based cross-engine optimizer for SQL workloads,” Composable Data Management Systems Workshop (co-located with VLDB), 2025. [Online]. Available: http://arxiv.org/abs/2506.02802
[19] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Q. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: an automated end-to-end optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10, 2018, A. C. Arpaci-Dusseau and G. Voelker, Eds. USENIX Association, 2018, pp. 578–594.
[20] L. Zheng, C. Jia, M. Sun, Z. Wu, C. H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, J. E. Gonzalez, and I. Stoica, “Ansor: Generating high-performance tensor programs for deep learning,” in 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4-6, 2020. USENIX Association, 2020, pp. 863–879.
[21] J. Mohoney, D. Sarda, M. Tang, S. R. Chowdhury, A. Pacaci, I. F. Ilyas, T. Rekatsinas, and S. Venkataraman, “Quake: Adaptive indexing for vector search,” in 19th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2025, Boston, MA, USA, July 7-9, 2025, L. Zhou and Y. Zhou, Eds. USENIX Association, 2025, pp. 153–169.
[22] W. Jeong, D. Kim, and T. K. Whangbo, “SCOPE: stochastic and counterbiased option placement for evaluating large language models,” CoRR, vol. abs/2507.18182, 2025.
[23] H. Xia, Z. Yang, Q. Dong, P. Wang, Y. Li, T. Ge, T. Liu, W. Li, and Z. Sui, “Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding,” in Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar, Eds. Association for Computational Linguistics, 2024, pp. 7655–7671.
[24] M. Mohanty, G. Bolar, P. Patil, U. Devi, F. George, P. Moogi, and P. Parag, “Deferred prefill for throughput maximization in LLM inference,” in Proceedings of the 5th Workshop on Machine Learning and Systems, EuroMLSys 2025, World Trade Center, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, E. Yoneki and A. H. Payberah, Eds. ACM, 2025, pp. 100–106.
[25] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, “DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving,” in 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, A. Gavrilovska and D. B. Terry, Eds. USENIX Association, 2024, pp. 193–210.
[26] C. Wang, P. Zuo, Z. Chen, Y. Liang, Z. Yu, and M. Yang, “Prefill-decode aggregation or disaggregation? unifying both for goodput-optimized LLM serving,” CoRR, vol. abs/2508.01989, 2025.
[27] B. Chang, A. Kamali, and V. Kantere, “Reqo: A robust and explainable query optimization cost model,” CoRR, vol. abs/2501.17414, 2025.
[28] A. Deshpande, L. Getoor, and L. Mihalkova, “Uncertainty management in data and information systems: A survey of probabilistic database systems,” Information Systems Frontiers, vol. 9, no. 6, pp. 531–552, 2007.
[29] N. N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” VLDB J., vol. 16, no. 4, pp. 523–544, 2007.
[30] A. N. Angelopoulos and S. Bates, “Conformal prediction: A unified review of theory and new challenges,” Foundations and Trends in Machine Learning, vol. 16, no. 3, pp. 494–640, 2023. [Online]. Available: https://arxiv.org/abs/2107.07511
[31] H. Liu, S. Giridhara, and I. Sabek, “Conformal prediction for verifiable learned query optimization,” Proc. VLDB Endow., vol. 18, no. 8, pp. 2653–2666, 2025. [Online]. Available: https://www.vldb.org/pvldb/vol18/p2653-liu.pdf
[32] N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” in Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), 2007, pp. 864–875.
[33] J. Tao, N. Maus, H. T. Jones, Y. Zeng, J. R. Gardner, and R. Marcus, “Learned offline query planning via Bayesian optimization,” Proc. ACM Manag. Data, vol. 3, no. 3, pp. 179:1–179:29, 2025. [Online]. Available: https://doi.org/10.1145/3725316
[34] R. Marcus, “Learned query superoptimization,” in Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada, August 28 - September 1, 2023, ser. CEUR Workshop Proceedings, R. Bordawekar, C. Cappiello, V. Efthymiou, L. Ehrlinger, V. Gadepally, S. Galhotra, S. Geisler, S. Groppe, L. Gruenwald, A. Y. Halevy, H. Harmouch, O. Hassanzadeh, I. F. Ilyas, E. Jiménez-Ruiz, S. Krishnan, T. Lahiri, G. Li, J. Lu, W. Mauerer, U. F. Minhas, F. Naumann, M. T. Özsu, E. K. Rezig, K. Srinivas, M. Stonebraker, S. R. Valluri, M. Vidal, H. Wang, J. Wang, Y. Wu, X. Xue, M. Zaït, and K. Zeng, Eds., vol. 3462. CEUR-WS.org, 2023.
[35] K. Deb and H. Jain, “An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints,” IEEE Trans. Evol. Comput., vol. 18, no. 4, pp. 577–601, 2014.
[36] J. D. Knowles and W. Zheng, “Evolutionary multiobjective optimization (EMO),” in Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO 2024, Melbourne, VIC, Australia, July 14-18, 2024, X. Li and J. Handl, Eds. ACM, 2024, pp. 1432–1459.
[37] A. Navon, A. Shamsian, E. Fetaya, and G. Chechik, “Learning the Pareto front with hypernetworks,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[38] L. Kong, C. Yang, S. Neufang, O. D. Beyan, and Z. Boukhers, “EMORL: ensemble multi-objective reinforcement learning for efficient and flexible LLM fine-tuning,” CoRR, vol. abs/2505.02579, 2025.
[39] L. Kong, C. Yang, O. D. Beyan, and Z. Boukhers, “Multi-objective reinforcement learning for large language model optimization: Visionary perspective,” CoRR, vol. abs/2509.21613, 2025.
[40] S. Regmi and C. P. Pun, “GPT semantic cache: Reducing LLM costs and latency via semantic embedding caching,” CoRR, vol. abs/2411.05276, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2411.05276
[41] J. J. Pan, J. Wang, and G. Li, “Survey of vector database management systems,” VLDB J., vol. 33, no. 5, pp. 1591–1615, 2024. [Online]. Available: https://doi.org/10.1007/s00778-024-00864-x
[42] S. Zhong, D. Mo, and S. Luo, “LSM-VEC: A large-scale disk-based system for dynamic vector search,” CoRR, vol. abs/2505.17152, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2505.17152
[43] J. Mohoney, D. Sarda, M. Tang, S. R. Chowdhury, A. Pacaci, I. F. Ilyas, T. Rekatsinas, and S. Venkataraman, “Quake: Adaptive indexing for vector search,” in 19th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2025, Boston, MA, USA, July 7-9, 2025, L. Zhou and Y. Zhou, Eds. USENIX Association, 2025, pp. 153–169. [Online]. Available: https://www.usenix.org/conference/osdi25/presentation/mohoney
[44] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019.
[45] OpenAI. (2025) OpenAI platform: Pricing. Accessed: 2025-11-06. [Online]. Available: https://platform.openai.com/docs/pricing
[46] N. Bansal. (2025, Jun) Best open-source embedding models benchmarked and ranked. Accessed: 2025-11-06. [Online]. Available: https://supermemory.ai/blog/best-open-source-embedding-modelsbenchmarked-and-ranked/
[47] B. Wu, Y. Kang, D. Zan, B. Guan, and Y. Wang, “Hierarchical and contrastive representation learning for knowledge-aware recommendation,” in IEEE International Conference on Multimedia and Expo, ICME 2023, Brisbane, Australia, July 10-14, 2023. IEEE, 2023, pp. 1050–1055.
[48] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, 2011.
[49] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” CoRR, vol. abs/1910.01108, 2019.
[50] S. Zhong, D. Mo, and S. Luo, “LSM-VEC: A large-scale disk-based system for dynamic vector search,” CoRR, vol. abs/2505.17152, 2025.
[51] O. Sener and S. Savarese, “Active learning for convolutional neural networks: A core-set approach,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.