Learning to Compose for Cross-domain Agentic Workflow Generation
arXiv:2602.11114v1 [cs.MA] 11 Feb 2026
Jialiang Wang (1), Shengxiang Xu (3), Hanmo Liu (1,2), Jiachuan Wang (4), Yuyu Luo (2), Shimin Di (3), Min-Ling Zhang (3), Lei Chen (1,2)
jwangic@connect.ust.hk, xushx@seu.edu.cn, hliubm@connect.ust.hk, wangjc@slis.tsukuba.ac.jp, yuyuluo@hkustgz.edu.cn, shimin.di@seu.edu.cn, zhangml@seu.edu.cn, leichen@cse.ust.hk
(1) Hong Kong University of Science and Technology, Hong Kong SAR, China
(2) Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
(3) Southeast University, Nanjing, China
(4) University of Tsukuba, Tsukuba, Japan
Abstract
Automatically generating agentic workflows (executable operator graphs or code that orchestrate reasoning, verification, and repair) has become a practical way to solve complex tasks beyond what single-pass LLM generation can reliably handle. Yet what constitutes a good workflow depends heavily on the task distribution and the available operators. Under domain shift, current systems typically rely on iterative workflow refinement to discover a feasible workflow from a large workflow space, incurring high iteration costs and yielding unstable, domain-specific behavior. In response, we internalize a decompose-recompose-decide mechanism into an open-source LLM for cross-domain workflow generation. To decompose, we learn a compact set of reusable workflow capabilities across diverse domains. To recompose, we map each input task to a sparse composition over these bases to generate a task-specific workflow in a single pass. To decide, we attribute the success or failure of workflow generation to counterfactual contributions from learned capabilities, thereby capturing which capabilities actually drive success by their marginal effects. Across stringent multi-domain, cross-domain, and unseen-domain evaluations, our 1-pass generator surpasses SOTA refinement baselines that consume 20 iterations, while substantially reducing generation latency and cost.
Keywords: Agentic Workflows, Large Language Models, Capability Learning, Compositional Generalization
1 Introduction
Large language models (LLMs) have demonstrated strong zero-shot capabilities in open-domain question answering and code generation [5, 7, 28]. Yet these capabilities are typically realized through single-pass generation: the model produces a final answer or program in one shot. For complex tasks, single-pass generation often hits a structural ceiling [9, 17]. Beyond being correct, solutions must satisfy constraints, integrate external tools, support error correction, and remain reliable under tight latency and cost budgets. To push beyond this ceiling, agentic workflows [23, 45, 52] have emerged as a practical approach following Chain-of-Thought [41]. By making the solving procedure more explicit, an agentic workflow decomposes the task into an ordered, executable composition of operators and executes them under a control structure. Concretely, workflow generation for complex tasks requires deciding which
operators to invoke and how to compose them into a topology (e.g., a sequence, a branching graph) that determines how intermediate states are produced and consumed [46]. For example, open-domain reasoning benefits from operators that retrieve evidence and compare multiple aspects [38]; mathematical problems require verification and counterexample-checking operators [10]; and code tasks demand the topology of a test-and-repair loop [24]. Recent years have seen rapid advances in building agentic systems with multi-agent workflows. Pioneering works [9, 17] use manually designed operator pipelines or collaboration structures, which are effective in specific settings but typically fixed regardless of the input task, thereby limiting adaptivity and generality. Motivated by the cost and brittleness of manual design, recent work has moved toward automating workflow generation, aiming to reduce human engineering and tailor workflows to the task. For example, systems such as AFlow [47] and related automated frameworks [18, 39, 43] seek to generate or refine workflows with minimal manual specification, improving scalability across diverse domains. A common strategy behind these automated systems is to place workflow generation inside the inference loop (shown in Fig. 1(a)). Given a task, the model first generates one or more candidate workflows, and then samples and improves them via search or iterative refinement: through best-of-N sampling [20], self-reflection rewriting [39], heuristic structure edits [47], or evolutionary procedures [27] that repeatedly select, mutate, and re-evaluate workflows. Overall, this workflow refinement paradigm treats workflow generation as trial-and-error inference over a large workflow space, trading high iteration costs for effectiveness and generality. However, the fact that workflows can raise the ceiling does not imply that an LLM can naturally generate effective, task-specific workflows for diverse domains.
In fact, the past paradigm resembles attaching an external optimizer at inference time rather than endowing the model with transferable workflow generation capability. Fig. 1(a) exposes two limitations in handling domain shift. First, workflow generation is frequently driven by LLM-only heuristics [18] or stochastic refinement [47]. This limited stability and controllability offers no guarantee about workflow quality when the task distribution shifts. Second, good workflow criteria and strategies are difficult to standardize across domains. Workflow criteria and heuristics that work in one domain may fail or even backfire in another [36, 48], yielding pronounced generalization variance. Therefore, workflow generation often amounts to within-task trial-and-error to approach a feasible workflow at inference time, rather than learning a transferable mapping from task semantics to workflow structure decisions.
Conference acronym 'XX, February 12, 2026, Hong Kong, CN
Figure 1: Two workflow generation paradigms. Left: inference-time refinement ($0.96/iter for AFlow) resorts to trial-and-error in a large workflow space. Right: CapFlow internalizes "decompose-recompose-decide" into LLMs, enabling single-pass generation across domains.
We further characterize this missing mechanism through two coupled gaps: (1) a capability decomposition gap: LLMs often represent tasks at the content level but lack a representation that directly exposes which workflow-relevant capabilities are needed; and (2) a capability recomposition gap: even when useful workflow patterns are known, the model lacks a controllable way to select and combine the right capabilities for a new task. Notably, these gaps do not imply that workflow generation is wholly domain-specific. Our study in Sec. 3 shows that, despite task-surface differences across domains, many successful workflows repeatedly instantiate similar underlying capability factors (e.g., multifaceted analysis, verification/repair, and aggregation). As illustrated in Fig. 1(b), if a model could represent these capability factors in a parametric, reusable form and compose them on demand for new tasks, cross-domain workflow generation could shift from trial-and-error inference to a single-pass structural decision. To bridge the decomposition and recomposition gaps, we internalize a "decompose-recompose-decide" mechanism into the open-source LLM so that workflow generation does not rely on heuristic trial-and-error inference. To decompose, we learn a compact set of reusable capability bases across diverse domains, capturing recurring workflow factors that generalize in the latent space. To recompose, we map an input task to a sparse composition over these bases, providing a controllable way to reuse capabilities for new tasks.
To decide, we attribute the success or failure of workflow generation to counterfactual contributions from learned capabilities, capturing which capability bases truly contribute to workflow generation through their marginal effects under domain shifts. The resulting model generates the executable, task-specific agentic workflow in a single pass, thereby avoiding costly refinement during inference. Our contributions are:
• We propose Workflow Capability Basis Learning, reframing cross-domain workflow generation from trial-and-error inference into a learnable problem of capability decomposition and recomposition, enabling single-pass, task-specific workflow generation.
• We present a workflow generalization framework, CapFlow, centered on shared capability bases and task-conditioned composition, achieving structured decision-making and compositional transfer without replacing the underlying base model.
• We curate multi-domain successful and failed workflow data to construct counterfactual contribution attribution and couple it with preference-driven supervision, aligning basis learning with factors that drive success and improving generalization.
• Empirical results across multi-domain, cross-domain, and unseen-domain settings show that our 1-pass generation method can exceed inference-time refinement baselines that consume 20 iterations, substantially reducing workflow generation cost.
2 Related Work
2.1 Automated Agentic Workflows
Agentic workflows push beyond single-pass answer generation of LLMs by making the solution procedure explicit and executable [9, 17]. A typical workflow maps a user query/task $q$ to an executable composition of operators (e.g., multifaceted analysis, verification, aggregation) with a specific topology (e.g., sequences, branching graphs, or repair loops), often maintaining intermediate states and tool invocations. In practice, workflows are instantiated as plans [30], graphs [48], neural networks [22], or programs [18]. However, early approaches often rely on either (i) a manually curated library of workflow templates [9, 17] or (ii) a generic planning strategy reused across tasks [24, 38, 41]. These methods quickly face scalability limits as the space of operator choices, topologies, and prompt strategies grows combinatorially, motivating two lines of research on automated workflow generation.
Automated Workflow Refinement. A prevalent paradigm is to construct an external optimizer over the workflow space $\mathcal{W}$ and treat
(a) Our 6 most successful agentic workflows for coding, math, and reasoning domains. (b) Workflow-oriented task distribution (t-SNE).
Figure 2: Cross-domain agentic workflow analysis: (left: structural analysis) highest-success workflows per domain; (right: latent analysis) t-SNE visualization of tasks embedded by learned workflow capabilities.
workflow generation as an inference-time optimization problem:
$$W^* = \arg\max_{W \in \mathcal{W}} G(W; q) \quad (1)$$
In this paradigm, the model samples candidate workflows $W$ and then improves them via search or iterative refinement to maximize the evaluation function $G$ for the given task $q$. For example, AFlow [47] performs search over code-represented workflows with LLM-in-the-loop refinement and Monte Carlo tree search, while GPTSwarm [52] optimizes graph-structured multi-agent topologies by editing nodes/edges. ADAS [18] and MASS [51] also advocate treating agentic system construction as search over modular building blocks and compositions. Despite strong performance, this trial-and-error inference can be costly and domain-specific: (i) it incurs substantial and hard-to-budget workflow execution cost at each iteration with diminishing returns [47]; (ii) workflow generation is often driven by LLM-only heuristics or stochastic edits, yielding limited controllability and unstable performance outcomes [43]; and (iii) effective workflow criteria and editing heuristics are difficult to transfer across different domains, leading to pronounced generalization variance under domain shift [36, 48].
Learning to Generate Workflows. Beyond heuristic refinement, several works learn decision policies for agentic systems from data $\mathcal{D}$ by optimizing a workflow generator $\pi_\theta(W \mid q)$, viewing workflows as trajectories or code whose quality is judged by task success:
$$\theta^* = \arg\max_{\theta}\; \mathbb{E}_{q \sim \mathcal{D}}\, \mathbb{E}_{W \sim \pi_\theta(\cdot \mid q)} \big[ G(W; q) \big] \quad (2)$$
This learning-based paradigm covers supervised fine-tuning or behavioral cloning to imitate high-quality workflows [34], preference-based learning that exploits relative judgments among candidate workflows [32], and reinforcement-style optimization when feedback is available. For example, ScoreFlow [39] leverages direct preference optimization to exploit quantitative feedback. FlowReasoner [15] trains a query-level meta-agent with reinforcement learning to generate personalized multi-agent systems. MAS-GPT [46] builds query-workflow pairs via a consistency-oriented data construction pipeline to train workflow generation. This learning-based perspective offers a principled alternative to handcrafted heuristics
by grounding workflow generation in data. However, two limitations remain salient for cross-domain transfer: (i) naive imitation can be brittle under task distribution shift [13], since it does not isolate which workflow factors causally drive success; (ii) learned behaviors are often entangled in monolithic parameters, lacking a controllable mechanism for compositional transfer [29] when new tasks require recombining familiar factors in novel ways. Our work aligns with this learning-based paradigm, while explicitly targeting controllable recomposition for transferable workflow generation under domain shift.
2.2 Parameter-efficient and Modular Adaptation
Parallel to workflow-centric systems, substantial progress has been made in parameter-efficient LLM adaptation, including low-rank parameterizations [34] that enable specialization without full fine-tuning. Relatedly, modular and conditional-computation approaches route inputs to a subset of parameters, often framed as mixtures of experts [6] or mixtures of adapters [42, 50], with connections to low-dimensional predictive subspaces [1, 3]. Crucially, however, most prior work on routing and modular adaptation targets efficient computation, capacity scaling, or multi-task specialization, and many multi-domain methods use explicit domain identifiers to drive discrete specialization [11, 49]. In contrast, our problem requires learning a structured decision mechanism aligned with executable workflow success under domain shift. We interpret modularity not as an engineering device for compute allocation, but as a vehicle for capability recomposition: the routed components correspond to reusable workflow capability factors, and the routing decision is trained to drive single-pass workflow generation. This distinction is central to our work: we learn a "decompose-recompose-decide" policy that captures transferable structural decision rules, rather than performing predictive fusion across experts.
3 Preliminary
Motivational Study. To validate our motivation for reusable workflow capabilities, we conduct both structural and latent analyses of successful workflows across domains. We visualize the most
successful workflows from each domain in our dataset in Fig. 2a and analyze their operators and topologies. We observed recurring workflow patterns such as multifaceted analysis (blue nodes) and ensemble aggregation (purple nodes) across domains (B, D, E). In contrast, domain-specific behaviors mainly arise from specialized operators, e.g., code testing for coding tasks (red nodes in A, B) and multi-perspective task solving for QA tasks (orange nodes in E). To further examine the significance of aligning the task semantics to workflow behaviors, we embed each task using the learned workflow capabilities and visualize the resulting workflow-oriented task distribution via t-SNE in Fig. 2b. Without using any domain identifier in learning, tasks from the same domain naturally cluster together, suggesting that (1) same-domain tasks often require similar workflow capabilities. Meanwhile, tasks from different domains can overlap in the workflow space, indicating that (2) successful workflows often share latent factors despite surface task differences. These observations motivate explicitly modeling reusable workflow capabilities and recomposing them for new tasks.
Problem Setup and Data. Each open-domain task contains a question text $q$ and a formatted prompt (including instruction and operator descriptions). The target output is an executable workflow program $W$ that contains both (i) workflow topology (control flow) and (ii) customized operator prompts. Since existing benchmarks rarely provide executable workflow programs together with both success and failure outcomes, we curate a multi-domain dataset containing 180+ unique workflows with performance records over 6 datasets from 3 distinct domains (coding, math, and reasoning):
$$\mathcal{D} = \big\{ \big(q_i,\ \{(W_{i,j},\, s_{i,j})\}_{j=1}^{n_i}\big) \big\}_{i=1}^{N}, \qquad s_{i,j} \in [0, 1] \quad (3)$$
where $s_{i,j}$ indicates whether workflow $W_{i,j}$ solves task $q_i$. We formulate workflow generation as conditional code generation:
$$W \sim p_{\Theta(q)}(W \mid x(q)) \quad (4)$$
where $x(q)$ is the formatted prompt instruction, and $\Theta(q)$ denotes the task-conditioned model parameters induced by our capability composition mechanism.
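For concreteness, the record structure implied by Eq. (3) can be sketched as follows. This is a minimal Python sketch; the class and field names (`WorkflowRecord`, `TaskEntry`, `program`, `score`) are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class WorkflowRecord:
    # One executed workflow W with its outcome s in [0, 1] (Eq. 3).
    program: str    # executable workflow code: topology + operator prompts
    score: float    # 1.0 if the workflow solved the task, 0.0 otherwise

@dataclass
class TaskEntry:
    # One task q_i together with every workflow executed for it.
    question: str
    workflows: list

    def successes(self):
        return [w for w in self.workflows if w.score >= 0.5]

    def failures(self):
        return [w for w in self.workflows if w.score < 0.5]

# A toy entry with one successful and one failed workflow for the same task.
entry = TaskEntry(
    question="Sort a list without using built-ins.",
    workflows=[
        WorkflowRecord("generate -> test -> repair", 1.0),
        WorkflowRecord("generate only", 0.0),
    ],
)
```

Keeping successes and failures side by side for the same task is what later enables the contrastive and counterfactual objectives of Sec. 4.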
4 Methodology
The existing trial-and-error paradigm does not produce transferable decision rules, hence the cost explosion, instability, and cross-domain variance discussed in Sec. 2.1. We instead ask a different question: Can workflow generation be turned into a learned decision mechanism that maps task semantics to a customized workflow structure in a single pass? Two key challenges lie behind this question: (i) a capability decomposition gap: tasks are not naturally represented in a space aligned with effective workflow factors; and (ii) a capability recomposition gap: even if reusable patterns exist, the model lacks a controllable way to recombine them under domain shift. To bridge these gaps, we internalize a "decompose-recompose-decide" mechanism into the model by (1) introducing a small set of reusable capability bases over latent workflow factors and (2) training a task-conditioned capability composer that selects a sparse combination of these bases for each new task. CapFlow is a single-pass generator that produces executable, task-specific agentic workflow programs without iterative refinement at inference time.
4.1 Workflow Capability Bases
We operationalize the findings in Sec. 3 (that many good workflows share latent capability factors despite surface-level task differences) by parameterizing these factors as a compact set of reusable capability bases. Concretely, starting from a frozen-weight LLM $\pi_{\theta_0}$, we introduce a small set of $K$ lightweight adaptation bases into a chosen subset of linear transformations (e.g., attention and MLP projections). Each basis serves as a reusable latent workflow factor that can bias the model toward certain operator choices, control-flow patterns, and prompt-writing behaviors that repeatedly emerge in successful workflows. Formally, consider a linear transformation with weight matrix $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$. We augment it with $K$ rank-$r$ basis updates:
$$\Delta B_k = s_k\, U_k V_k^\top, \qquad U_k \in \mathbb{R}^{d_{\text{out}} \times r},\ V_k \in \mathbb{R}^{d_{\text{in}} \times r},\ s_k \in \mathbb{R}_{+}, \quad (5)$$
where $U_k$ and $V_k$ parameterize a capability basis and $s_k$ is a learnable per-basis scale. These parameters are module-specific but shared across all tasks. Intuitively, task-specific workflow generation emerges from selectively composing these bases.
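A minimal NumPy sketch of these rank-$r$ basis updates; the dimensions, factor names (`U`, `V`, `s`), and initialization scale are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

d_out, d_in, r, K = 16, 12, 2, 4           # toy sizes, not the paper's
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d_out, d_in))        # frozen pretrained weight matrix
U = rng.normal(size=(K, d_out, r)) * 0.01  # per-basis factors U_k
V = rng.normal(size=(K, d_in, r)) * 0.01   # per-basis factors V_k
s = np.ones(K)                             # learnable per-basis scales s_k

def basis_update(k):
    # Delta B_k = s_k * U_k V_k^T: a scaled rank-r update (Eq. 5).
    return s[k] * U[k] @ V[k].T
```

Each `basis_update(k)` has the same shape as `W0` but rank at most $r$, so $K$ bases add only $K \cdot r \cdot (d_{\text{out}} + d_{\text{in}})$ parameters per adapted layer, in the spirit of low-rank adaptation.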
4.2 Task-Conditioned Capability Composer
Capability bases provide a workflow-aligned representation space, but we still need a mechanism that maps task semantics into that space and selects how to recombine factors under domain shift. We model this as a task-conditioned capability composer:
$$\alpha(q) = g_\phi\big(z(q)\big), \quad (6)$$
where $z(q)$ is a task embedding and $g_\phi$ outputs mixture weights over bases. This design makes the composer depend on the pure task rather than on accidental artifacts of the LLM heuristic, which empirically stabilizes training and mitigates hallucinations.
Adaptive Composition. A core difficulty in composition is trading off specialization and reuse. If composition collapses too early, a few bases dominate and the mechanism stops being compositional; if composition is too diffuse, the model behaves like a weak ensemble and loses controllability [21]. To address this, we design a composer that predicts (a) composition logits and (b) a per-task temperature adjustment with coupled multi-layer perceptrons $h_\phi(\cdot)$ and $\Delta t_\phi(\cdot)$:
$$u(q) = h_\phi\big(z(q)\big) \in \mathbb{R}^{K}, \qquad \tau(q) = \exp\big(\log \tau_0 + \Delta t_\phi(z(q))\big), \quad (7)$$
and produces probabilities via
$$\alpha(q) = \mathrm{Softmax}\big(u(q) / \tau(q)\big) \in \Delta^{K-1}. \quad (8)$$
Intuitively, $\tau(q)$ controls how decisive the composer is: tasks that admit multiple workflow capabilities can retain higher entropy, while tasks with clear patterns yield sharper compositions.
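The logit-plus-temperature design above can be sketched as follows; this NumPy illustration uses single linear maps as stand-ins for the MLPs $h_\phi$ and $\Delta t_\phi$, and all shapes and weights are toy assumptions:

```python
import numpy as np

K, d_z = 4, 8                           # number of bases, task-embedding size (toy)
rng = np.random.default_rng(1)
Wh = rng.normal(size=(K, d_z)) * 0.1    # stand-in for the logit head h_phi
wt = rng.normal(size=d_z) * 0.1         # stand-in for the temperature head Delta t_phi
tau0 = 1.0                              # base temperature

def compose(z):
    u = Wh @ z                               # composition logits u(q) in R^K (Eq. 7)
    tau = np.exp(np.log(tau0) + wt @ z)      # per-task temperature, always positive
    p = np.exp((u - u.max()) / tau)          # numerically stable softmax at temp tau
    return p / p.sum()                       # alpha(q) on the simplex (Eq. 8)

alpha = compose(rng.normal(size=d_z))
```

Parameterizing the temperature as `exp(log(tau0) + delta)` guarantees positivity, so a large predicted adjustment can only flatten or sharpen the distribution, never invert it.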
4.3 Workflow Capability Decomposition
Combining the capability bases (Sec. 4.1) and the task-conditioned composer (Sec. 4.2), we induce a task-specific parameterization of the agentic workflow generator. Given a task $q$, the weight of each adapted linear transformation is composed as:
$$W(q) = W + \sum_{k=1}^{K} \alpha_k(q)\, \Delta B_k. \quad (9)$$
Figure 3: Overview of workflow capability composition (CapFlow): the task-conditioned composer decomposes each query into a sparse mixture of reusable capability bases, steering the LLM toward successful workflows and away from failures.
Here $\alpha(q) \in \Delta^{K-1}$ is a probability simplex vector produced by the task-conditioned composer. Crucially, the same $\alpha(q)$ is broadcast to all adapted layers, so that a global task-level capability decision coherently shapes the entire workflow generator (operators and topologies) rather than producing layer-wise, uncoordinated gating. To encourage recompositional behavior and keep inference efficient, we use a sparse instantiation of Eq. (9) by activating only the top-$m$ bases per task and renormalizing their weights, which provides a controlled inductive bias toward few-factor workflow generations (consistent with the empirical patterns in Fig. 2a). Overall, this mechanism addresses the capability decomposition gap: instead of forcing the model to costly refine an entire workflow structure from scratch for each task, we represent tasks in a space where the relevant capability factors are explicitly addressable and composable.
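The sparse top-$m$ instantiation of Eq. (9) amounts to keeping the $m$ largest composition weights, renormalizing them, and summing the corresponding basis updates. A small NumPy sketch with toy shapes and precomputed $\Delta B_k$ tensors:

```python
import numpy as np

def sparse_compose(W0, deltas, alpha, m):
    # W(q) = W0 + sum of renormalized alpha_k * Delta B_k over the top-m bases.
    top = np.argsort(alpha)[-m:]           # indices of the m largest weights
    w = alpha[top] / alpha[top].sum()      # renormalize kept weights to sum to 1
    return W0 + sum(wk * deltas[k] for wk, k in zip(w, top))

rng = np.random.default_rng(2)
W0 = rng.normal(size=(6, 5))                   # toy frozen weight
deltas = rng.normal(size=(4, 6, 5)) * 0.01     # K=4 precomputed Delta B_k updates
alpha = np.array([0.05, 0.50, 0.30, 0.15])     # composer output alpha(q)
Wq = sparse_compose(W0, deltas, alpha, m=2)    # only bases 1 and 2 stay active
```

With `m=2`, only the bases with weights 0.50 and 0.30 survive; after renormalization they contribute with weights 0.625 and 0.375, and the other two bases are dropped entirely, which is what keeps inference cheap.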
4.4 Learning Basic Workflow Capabilities
Beyond Sec. 2.1, training purely by imitation is ill-suited for workflow generation because (i) each task often admits multiple correct workflows, and (ii) what matters is not stylistic similarity to a reference but whether the workflow succeeds. Our dataset fills this missing supervision with both successful and failed workflows for each task. We leverage it to learn (a) capability bases that increase the likelihood of successful workflow code and (b) a composer whose task decomposition aligns with success causally rather than correlationally (Sec. 4.5). Let $W$ denote the workflow program (code) and $x(q)$ the formatted input. We use the length-normalized causal LM log-likelihood over target tokens $w_t$ to avoid bias toward shorter references:
$$\ell(W, x(q)) = \frac{1}{|W|} \sum_{t} \log p_{\Theta(q)}(w_t \mid x, w_{<t}). \quad (10)$$
Multi-reference Success Likelihood. For a task $q_i$, let $\mathcal{P}_i$ denote the set of successful workflows and $\mathcal{N}_i$ the set of failures. Since multiple workflows may succeed, we do not force the model to
imitate a single reference. Instead, we maximize the total probability mass assigned to the successful set via a multi-reference objective:
$$\mathcal{L}_{\mathrm{MR}} = -\sum_{i} \log \sum_{W \in \mathcal{P}_i} \exp\big(\ell(W, x(q_i))\big). \quad (11)$$
This objective encourages the generator to cover diverse successful patterns, which is essential for learning transferable capability factors rather than overfitting to a narrow template.
Within-task Contrastive Separation. To explicitly suppress known failure modes for the same task, we add a group-wise contrastive loss:
$$\mathcal{L}_{\mathrm{NCE}} = -\sum_{i} \frac{1}{|\mathcal{P}_i|} \sum_{W^+ \in \mathcal{P}_i} \log \frac{\exp\big(\ell(W^+, x(q_i))/\tau\big)}{\sum_{W \in \mathcal{P}_i \cup \mathcal{N}_i} \exp\big(\ell(W, x(q_i))/\tau\big)}, \quad (12)$$
where $\tau$ is a temperature. Unlike pairwise preference objectives that operate on isolated (prompt, chosen, rejected) comparisons, Eq. (12) performs task-local, listwise discrimination. It contrasts multiple successful workflows against task-matched failures under the same semantics and constraints, yielding a stable listwise training signal.
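Given precomputed length-normalized log-likelihoods from Eq. (10), both objectives for a single task reduce to a few lines. The sketch below is a NumPy illustration with made-up likelihood values, not trained scores:

```python
import numpy as np

def multi_reference_loss(ll_success):
    # Eq. (11): maximize total probability mass over all successful workflows.
    return -np.log(np.sum(np.exp(ll_success)))

def listwise_nce_loss(ll_success, ll_fail, tau=1.0):
    # Eq. (12): contrast each success against all task-matched candidates.
    ll_all = np.concatenate([ll_success, ll_fail])
    log_z = np.log(np.sum(np.exp(ll_all / tau)))      # listwise normalizer
    return -np.mean(ll_success / tau - log_z)

# Hypothetical length-normalized log-likelihoods (Eq. 10) for one task:
ll_success = np.array([-0.9, -1.1])   # two successful workflows
ll_fail = np.array([-1.0, -2.5])      # two failed workflows
l_mr = multi_reference_loss(ll_success)
l_nce = listwise_nce_loss(ll_success, ll_fail)
```

Note that `multi_reference_loss` decreases when probability mass shifts to *any* successful workflow, while `listwise_nce_loss` additionally pushes mass away from the task-matched failures in the same candidate list.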
4.5 Counterfactual Capability Attribution for Controllable Recomposition
The objectives above train the generator to model successful workflows, but they do not by themselves guarantee that the composer learns a structured and transferable decision rule. In particular, composition can become a black box [12]: it may correlate with superficial features in $q$ without selecting capability factors that actually contribute to success. To address this, we introduce a counterfactual attribution mechanism that turns which capability bases matter into a trainable credit assignment signal based on marginal effects.
Counterfactual Capability Attribution. For a sample $(q, W, s)$, let $C = \alpha(q)$ be the composer output and let $\ell_{\mathrm{main}}$ denote the length-normalized log-likelihood under $C$. For each basis $\Delta B_k$, we form a
Table 1: Comparison of workflow Solve rate and Executability (%) across datasets under multi-, cross-, & unseen-domain settings. Reasoning: HotpotQA, DROP; Coding: HumanEval, MBPP; Math: GSM8K, MATH; Science (unseen): SciBench, GPQA; Overall: Solve, Exec.

| Setting | Method | HotpotQA | DROP | HumanEval | MBPP | GSM8K | MATH | SciBench | GPQA | Solve | Exec. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Manual | GPT-4o-mini | 81.54 | 81.37 | 90.07 | 70.67 | 90.14 | 51.64 | 28.40 | 34.16 | 66.00 | – |
| Manual | CoT | 81.75 | 83.25 | 91.60 | 70.96 | 89.95 | 53.29 | 24.83 | 35.73 | 66.42 | – |
| Manual | CoT-SC | 82.07 | 83.28 | 92.36 | 72.72 | 90.23 | 53.70 | 27.27 | 35.95 | 67.20 | – |
| Manual | Self-Refine | 81.18 | 82.50 | 90.83 | 69.50 | 88.53 | 50.20 | 27.03 | 30.25 | 65.00 | – |
| Manual | SPP | 81.62 | 81.62 | 92.36 | 72.72 | 91.18 | 52.88 | 28.22 | 33.72 | 66.79 | – |
| Refinement (single-domain) | ADAS (20 iter) | 82.84 | 82.23 | 90.83 | 70.08 | 92.03 | 52.67 | 30.36 | 36.12 | 67.14 | 79.87 |
| Refinement (single-domain) | AFlow (20 iter) | 85.87 | 85.75 | 93.89 | 82.99 | 93.93 | 57.40 | 41.40 | 42.18 | 72.93 | 88.13 |
| Learning (single-domain) | ScoreFlow (5 iter) | 86.00 | 86.14 | 95.41 | 82.69 | 94.21 | 59.25 | 34.20 | 38.69 | 72.07 | 97.36 |
| Learning (multi-domain) | ScoreFlow (5 iter) | 85.37 | 85.16 | 93.89 | 81.23 | 93.45 | 57.81 | 34.20 | 38.69 | 71.22 | 95.89 |
| Learning (multi-domain) | CapFlow (1-pass) | 88.12 | 87.12 | 96.18 | 83.28 | 94.97 | 59.87 | 41.60 | 42.41 | 74.19 | 98.03 |
| Learning (cross-domain) | CapFlow (1-pass) | 86.37 | 86.25 | 93.89 | – | – | – | – | – | 72.43 | 96.87 |

Cross-domain cells use leave-one-domain-out transfer (Code,Math->Reason for the reasoning columns; Reason,Math->Code for coding).
counterfactual composition vector by masking $\alpha_k$ and renormalizing:
$$C^{(-k)} \propto C \odot (\mathbf{1} - \mathbf{e}_k), \quad (13)$$
and compute the counterfactual score $\ell^{(-k)}$ under $C^{(-k)}$. We define the marginal contribution of basis $\Delta B_k$ as
$$\Delta_k = \ell_{\mathrm{main}} - \ell^{(-k)}. \quad (14)$$
A large positive $\Delta_k$ indicates that removing basis $\Delta B_k$ hurts the probability of generating this workflow, suggesting that $\Delta B_k$ encodes a capability factor relevant to the success of the workflow.
Preference-weighted Policy Gradient for Composition. We use the workflow outcome $s \in [0, 1]$ to assign a signed supervision signal $\mathrm{sgn}(s) = 2s - 1 \in [-1, 1]$. We then update the composer by encouraging high composition probability for bases with positive marginal contributions on successful workflows, and discouraging such composition on failed ones. Concretely, for a set of bases $\mathcal{K}(q)$, we optimize:
$$\mathcal{L}_{\mathrm{CCA}} = -\mathbb{E}\Big[ \mathrm{sgn}(s) \cdot \sum_{\Delta B_k \in \mathcal{K}(q)} \Delta_k \cdot \log \alpha_k(q) \Big] + \lambda_{\mathrm{dead}} \mathcal{L}_{\mathrm{dead}}, \quad (15)$$
where $\mathcal{L}_{\mathrm{dead}}$ penalizes allocating probability mass to bases whose contributions remain near zero (preventing dead or redundant bases). Eq. (15) instantiates a structured, counterfactual form of credit assignment: instead of relying on indirect likelihood gradients, the composer is trained to place mass on capability factors whose removal would counterfactually reduce success.
Full Objective. Putting it together, at training time we optimize:
$$\min_{\Phi, \phi}\; \lambda_{\mathrm{MR}} \mathcal{L}_{\mathrm{MR}} + \lambda_{\mathrm{NCE}} \mathcal{L}_{\mathrm{NCE}} + \lambda_{\mathrm{CCA}} \mathcal{L}_{\mathrm{CCA}} + \mathcal{L}_{\mathrm{reg}}, \quad (16)$$
where $\Phi$ denotes all capability basis parameters and $\phi$ the composer parameters. The regularization term $\mathcal{L}_{\mathrm{reg}}$ includes orthogonality, entropy, and temperature-deviation terms (detailed in Appx. A.1.2). At inference time, CapFlow performs no iterative refinement. As shown in Fig. 3, given a new task $q$, we compute $\alpha(q)$ once, activate the top-$m$ capability bases across adapted layers, and decode the workflow program $W$ in a single generation pass as in Eq. (4). This yields an executable workflow program with both control-flow topology and customized
Table 1 (cont.), remaining CapFlow (1-pass) cross-domain cells: MBPP 82.11 (Reason,Math->Code); GSM8K 92.51 and MATH 54.32 (Reason,Code->Math); SciBench 41.60 and GPQA 42.41 (All->Science).
operator prompts, directly addressing the cost explosion, instability, and cross-domain variance of workflow refinement.
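The attribution step of Eqs. (13)-(14) can be illustrated with a toy surrogate for the log-likelihood; in a real run the score would come from decoding the workflow under the composed weights $W(q)$, so the linear `score` function and the `EFFECT` vector below are purely stand-ins:

```python
import numpy as np

def mask_and_renormalize(C, k):
    # Eq. (13): zero out basis k's weight and renormalize the rest.
    Cm = C.copy()
    Cm[k] = 0.0
    return Cm / Cm.sum()

# Toy surrogate: each basis linearly shifts the (negative) log-likelihood.
EFFECT = np.array([-0.2, -1.0, -0.1, -0.4])

def score(C):
    return float(C @ EFFECT)

C = np.array([0.4, 0.1, 0.3, 0.2])        # composer output alpha(q)
l_main = score(C)                          # ell_main under the full composition
# Eq. (14): marginal contribution of each basis via leave-one-out masking.
delta = np.array([l_main - score(mask_and_renormalize(C, k)) for k in range(len(C))])
helpful = np.argsort(delta)[::-1]          # bases ranked by marginal contribution
```

In this toy setting, basis 2 carries the mildest likelihood penalty, so masking it shifts mass onto worse bases and its marginal contribution comes out positive, while masking the harmful basis 1 actually improves the surrogate score, giving it a negative contribution, exactly the signal Eq. (15) exploits.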
5 Experiments
5.1 Experimental Settings
Tasks and Datasets. We evaluate on 8 benchmarks spanning 4 domains: reasoning [14, 44], coding [4, 8], math [10, 16], and science [33, 37], using established data-splitting practices [46, 47]. Science is the held-out domain for the unseen-domain study. For the 3 training domains, the single-domain setting trains a separate model for each dataset and then tests it; the multi-domain setting trains a universal model on all reasoning, coding, and math datasets; and the cross-domain setting evaluates transfer via leave-one-dataset-out (LODO): for each held-out dataset, we train on the remaining datasets and test exclusively on the held-out one.
Baselines. We compare against three families of methods. (1) Manual prompting methods are designed by human experts, including direct invocation, Chain-of-Thought [41], CoT with Self-Consistency [38], Self-Refine [24], and SPP [40]. (2) Workflow refinement methods iteratively generate and improve workflows at inference time with a maximum budget of 20 iterations, including ADAS [18] and AFlow [47]. One iteration corresponds to executing/evaluating candidate workflows on the validation set once and applying the refinement rule to produce the next candidate. (3) Learning-based workflow generators are trained from workflow data and generate a final workflow directly. We consider both the single-domain and multi-domain model settings for ScoreFlow [39].
Implementation Details. We employ Llama-3.2-3B-Instruct [26, 35] as the finetuning model, with Llama-3.1-8B-Instruct [25] and Qwen2.5-Coder-3B-Instruct [19, 31] for comparison. The main result is reported with 8 bases and top-$m$=3, trained on 1 A100 GPU for 35 epochs at a learning rate of 2e-4 (bases) and 3e-4 (composer). For ADAS and AFlow, we use Claude-3.5-Sonnet [2] as the optimizer.
Metrics. All generated workflows are executed by GPT-4o-mini with temperature 0 for a fair comparison. All results are averaged
(a) Iteration Round vs. Solve Rate on HumanEval. (b) Inference Time vs. Solve Rate on HumanEval.
Figure 4: Workflow generation trade-offs on HumanEval. Left: refinement baselines improve with additional rounds but exhibit diminishing returns, whereas CapFlow achieves strong solve rates in a single generation pass. Right: strong baselines rely on stochastic refinement that incurs substantially higher evaluation cost for comparable gains.
Figure 5: Capability basis usage across domains. CapFlow maintains non-collapsed basis selection and exhibits meaningful cross-dataset overlap, supporting the intended "reusable bases" behavior.
over 5 runs. We report the Solve rate (%) for solving tasks and the Executability rate (%), i.e., no runtime errors during workflow execution, on each test dataset to quantify effectiveness and stability. For efficiency, we additionally report the solve rate as a function of iteration and inference time to characterize the trade-off.
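The two headline metrics reduce to simple per-task averages; a minimal sketch (the function name and input format are illustrative, not from the paper's evaluation harness):

```python
def solve_and_exec_rates(results):
    # results: one (solved, executed_without_error) pair per test task.
    n = len(results)
    solve = 100.0 * sum(s for s, _ in results) / n     # Solve rate (%)
    execu = 100.0 * sum(e for _, e in results) / n     # Executability rate (%)
    return solve, execu

# Toy run over 4 test tasks: 2 solved, 3 executed without runtime errors.
runs = [(True, True), (False, True), (True, True), (False, False)]
solve, execu = solve_and_exec_rates(runs)   # -> (50.0, 75.0)
```

Since a workflow can execute cleanly yet still produce a wrong answer, Executability upper-bounds neither metric; it is tracked separately as a stability signal.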
5.2 Main Results
Effectiveness. Tab. 1 summarizes test-set results across multi-domain, cross-domain, and unseen-domain settings. Overall, CapFlow achieves the best solve rate across datasets while maintaining strong executability for reliable workflow generation. Our 1-pass workflow generation consistently outperforms workflow refinement baselines, even when they consume a 20-iteration budget per dataset. Notably, in the cross-domain setting, CapFlow's performance on held-out datasets can still match, and even surpass, baselines that are trained or refined on those datasets. This supports our claim that learning a transferable decompose-recompose-decide mechanism yields more effective, task-specific workflow generation than heuristic trial-and-error refinement at inference time or pure imitation learning at training time.

Budget-Quality Trade-off. Workflow refinement is typically a stochastic process that is sensitive to iteration budgets. Fig. 4a plots the solve rate as a function of refinement iterations. We observe diminishing returns beyond moderate budgets on AFlow: additional iterations improve performance only marginally while increasing
evaluation cost at least linearly. In contrast, CapFlow achieves competitive performance without additional evaluation cost.

Inference-time Cost and Latency. Although refinement baselines are budgeted by iterations, those iterations translate into substantial inference-time overhead due to repeated workflow executions and re-generation, operations that LLM API providers commonly price at different levels. Beyond pricing, Fig. 4b and Tab. 2 report cost proxies, showing that CapFlow significantly reduces the inference-time latency of workflow generation compared with refinement baselines, while achieving a higher solve rate across domains.
5.3 Capability Usage and Attribution
We inspect whether CapFlow learns reusable capability bases and structured recomposition under domain shift. Concretely, we count how often each basis is selected by the Top-m composer on each dataset, and visualize the resulting bar chart and heatmaps in Fig. 5.

Reusable Bases without Collapse. Fig. 5 shows clear transferable usage patterns: the composer activates multiple bases rather than collapsing to a single one, while several bases (e.g., B0, B2) are repeatedly used across different datasets. This indicates that CapFlow captures transferable workflow factors that can be reused under domain shift, instead of learning purely dataset-specific adapters.

Meaningful Cross-Dataset Overlap. To interpret overlap, we compute cosine similarity between datasets based on their basis usage frequencies (Fig. 5). The similarity matrix is non-uniform:
Table 3: Recompose ablation. Sparse composition improves controllability in multi-domain and unseen-domain settings.
Figure 6: Basis co-activation network. Nodes are bases (b0-b7). Edges show positive PMI between Top-3 basis-selection events; probabilities are estimated from empirical co-activation counts over 3706 tasks, and we visualize the top-20 pairs.

Table 2: Decompose ablation and cost proxies across domains. Para.: additional trainable parameter storage (MB). Train/epoch and Infer (on HumanEval) are runtimes in minutes. K and Top-m are different configurations of CapFlow.

Variant       | Para. (MB) | Train/epoch | Infer | Reas. | Code  | Math  | Sci.
AFlow         | 0.00       | 0.00        | 73.23 | 85.81 | 88.44 | 75.67 | 41.79
ScoreFlow     | 188.40     | 13.02       | 47.45 | 86.07 | 89.05 | 76.73 | 36.45
K=1, LoRA     | 19.95      | 15.28       | 14.17 | 87.08 | 87.95 | 76.41 | 37.47
K=8, LoRAs    | 20.13      | 96.44       | 17.25 | 86.74 | 87.95 | 76.21 | 36.65
K=4, Top-m=3  | 28.95      | 50.07       | 14.93 | 88.22 | 89.67 | 75.03 | 38.15
K=8, Top-m=3  | 34.03      | 96.45       | 16.13 | 87.62 | 89.73 | 77.42 | 42.00
K=16, Top-m=6 | 45.75      | 189.20      | 20.60 | 87.72 | 87.72 | 75.54 | 37.66
K=32, Top-m=9 | 76.13      | 374.73      | 20.92 | 87.16 | 87.91 | 76.03 | 37.88
datasets within the same domain are generally more similar (darker diagonal), yet there remains noticeable cross-domain overlap, consistent with the intended "reusable bases" behavior rather than isolated per-domain routing. This claim is consistent with our motivational study in Sec. 3 (Fig. 2).

Structured Recomposition and Attribution. To move beyond marginal usage, we analyze synergistic basis interactions through a positive-PMI co-activation network in Fig. 6. PMI explicitly controls for each basis's marginal selection rate, highlighting basis pairs that are activated together more often than expected under independence. The resulting network is sparse and non-uniform, with a small number of consistently high-PMI edges, indicating that CapFlow's composer recombines capability bases in a coordinated rather than independently toggled manner. Importantly, these PMI-identified pairs provide an interpretable view of which capability combinations the model relies on to construct effective workflows, supporting our claim that CapFlow learns structured recomposition rather than heuristic trial-and-error routing.
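The usage statistics behind Fig. 5 and the positive-PMI edges behind Fig. 6 can be sketched as follows; the selection logs here are randomly generated stand-ins, and the dataset names are only for illustration (they are not the paper's logged data):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_selections(n_tasks, K=8, m=3):
    # Stand-in for composer logs: each task selects m distinct bases (Top-m).
    return np.stack([rng.choice(K, size=m, replace=False) for _ in range(n_tasks)])

logs = {name: sample_selections(50) for name in ("HumanEval", "MBPP", "GSM8K")}

def usage_freq(sel, K=8):
    # Fraction of tasks on which each basis is selected (Fig. 5 bar chart).
    return np.bincount(sel.ravel(), minlength=K) / len(sel)

freqs = {name: usage_freq(sel) for name, sel in logs.items()}

def cosine(u, v):
    # Cross-dataset similarity of basis-usage frequencies (Fig. 5 heatmap).
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(freqs["HumanEval"], freqs["MBPP"])

def ppmi_edges(sel, K=8):
    # Positive PMI between basis co-selection events (Fig. 6 edges), with
    # probabilities estimated from empirical co-activation counts.
    p = usage_freq(sel, K)
    edges = {}
    for i in range(K):
        for j in range(i + 1, K):
            co = float(np.mean([(i in row) and (j in row) for row in sel]))
            if co > 0:
                pmi = float(np.log(co / (p[i] * p[j])))
                if pmi > 0:
                    edges[(i, j)] = pmi
    return edges
```

Keeping only the 20 highest positive-PMI pairs, as in Fig. 6, is then a simple sort over `edges`.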
5.4 Ablations: Decompose-Recompose-Decide
We conduct ablations aligned with our decompose-recompose-decide mechanism to isolate which components drive performance.

Decompose: number of capability bases. Tab. 2 varies the number of capability bases K to test whether a compact library of reusable capability factors is necessary. We find that too few bases
Table 3 (Recompose ablation):

Variant | m | Adaptive | Reas. | Code  | Math  | Sci.
Single  | 1 | ✓        | 86.61 | 84.70 | 74.31 | 39.68
Sparse  | 2 | ✓        | 87.12 | 88.08 | 75.82 | 39.13
Sparse  | 3 | ✓        | 87.62 | 89.73 | 77.42 | 42.00
Sparse  | 3 | ✗        | 86.47 | 88.49 | 77.78 | 36.27
Dense   | 8 | ✓        | 86.71 | 87.28 | 77.58 | 33.48
Table 4: Decide ablation. Counterfactual attribution improves transfer and stabilizes recomposition under domain shift.

Variant        | Decide Signal    | Reas. | Code  | Math  | Sci.
CapFlow (Full) | CCA + preference | 87.62 | 89.73 | 77.42 | 42.00
w/o NCE        | CCA + imitation  | 86.54 | 86.94 | 75.90 | 40.44
w/o CCA        | preference       | 85.71 | 86.34 | 74.56 | 36.44
SFT            | imitation        | 86.97 | 85.52 | 76.63 | 35.65
Table 5: Backbone LLM ablation.

Backbone                  | Reas. | Code  | Math  | Sci.
Llama-3.2-3B-Instruct     | 87.62 | 89.73 | 77.42 | 42.00
Llama-3.1-8B-Instruct     | 86.65 | 88.70 | 76.13 | 40.55
Qwen2.5-Coder-3B-Instruct | 87.90 | 87.29 | 76.84 | 41.70
underfit diverse workflow factors and degrade transfer, while overly expanding K leads to diminishing returns (and potential attribution difficulty) under domain shift. We adopt K = 8 and Top-m = 3.

Recompose: sparse vs. dense composition. Tab. 3 compares sparse top-m composition with dense mixtures and tests sensitivity to m. Sparse composition yields better controllability and transfer, while dense mixing tends to blur capability factors and increases variance under shift. Moreover, adaptive temperature with per-task adjustment (Sec. 4.2) is better than standard temperature scheduling, as it captures task-specific variance across domains.

Decide: counterfactual contribution attribution. Tab. 4 evaluates the impact of the decide component by removing counterfactual capability attribution (CCA) or the group-wise contrastive loss (NCE). CCA significantly improves the solve rate across all datasets, indicating that structured credit assignment is crucial for learning transferable capability recomposition.

Backbone LLM. As a sanity check, we compare several small-scale, instruction-tuned backbones on our evaluation suite (Tab. 5). Llama-3.1-8B-Instruct and Qwen2.5-Coder-3B-Instruct are also competitive choices for LLM finetuning across domains. We select Llama-3.2-3B-Instruct for efficiency reasons.
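A minimal numerical sketch of the sparse composition step (temperature-scaled softmax, Top-m truncation, renormalization); the logits stand in for the composer's output on one task embedding and are purely illustrative:

```python
import numpy as np

def sparse_compose(logits, m=3, tau=1.0):
    """Top-m sparse composition over K capability bases.

    Softmax with temperature tau, keep the m largest weights, zero the
    rest, renormalize. The composer architecture producing `logits` is
    not reproduced here.
    """
    z = logits / tau
    z = z - z.max()                      # numerical stability
    alpha = np.exp(z) / np.exp(z).sum()  # dense routing weights
    keep = np.argsort(alpha)[-m:]        # indices of the top-m bases
    sparse = np.zeros_like(alpha)
    sparse[keep] = alpha[keep]
    return sparse / sparse.sum()         # renormalized sparse weights

# K = 8 bases with Top-m = 3, matching the adopted configuration.
alpha = sparse_compose(np.array([2.0, 0.5, 1.5, -1.0, 0.0, 0.3, 1.0, -0.5]), m=3)
```

Lowering `tau` sharpens the dense weights before truncation, which is where the per-task adaptive temperature of Sec. 4.2 would plug in.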
6 Conclusion
This work addresses a central efficiency and transferability bottleneck in agentic workflow generation under domain shift. We present Workflow Capability Basis Learning, which internalizes a decompose-recompose-decide mechanism into an open-source LLM to generate executable, task-specific workflows in a single pass. Our CapFlow decomposes workflow patterns into a compact set of reusable capability bases, recomposes them via a task-conditioned
sparse composer for new tasks, and decides with counterfactual attribution that aligns basis selection with capabilities that truly drive workflow success, leveraging successful and failed workflows as preference supervision beyond imitation. Across multi-domain, cross-domain, and unseen-domain evaluations, our 1-pass generator surpasses strong 20-iteration refinement baselines while substantially reducing inference-time cost and maintaining high reliability. Future work includes extending to evolving operator libraries and tool availability, incorporating continual adaptation from execution feedback, and developing stronger diagnostics and theory for generalizing the learned capability factors to longer-horizon, more complex workflow topologies.
References
[1] Rie Kubota Ando, Tong Zhang, and Peter Bartlett. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6, 11 (2005).
[2] Anthropic. 2024. Claude 3.5 Sonnet Model Card Addendum. https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf Model card addendum, accessed 2026-02-09.
[3] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. 2006. Multi-task feature learning. Advances in Neural Information Processing Systems 19 (2006).
[4] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[6] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2025. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering (2025).
[7] Mark Chen. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:2107.03374 [cs.LG]
[9] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. 2024. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. In ICLR.
[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).
[11] Yongxing Dai, Xiaotong Li, Jun Liu, Zekun Tong, and Ling-Yu Duan. 2021. Generalizable person re-identification with relevance-aware mixture of experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16145–16154.
[12] Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. 2022. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research 23, 226 (2022), 1–61.
[13] Pim De Haan, Dinesh Jayaraman, and Sergey Levine. 2019. Causal confusion in imitation learning. Advances in Neural Information Processing Systems 32 (2019).
[14] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2368–2378.
[15] Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. 2025. FlowReasoner: Reinforcing query-level meta-agents. arXiv preprint arXiv:2504.15257 (2025).
[16] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. [n. d.]. Measuring Mathematical Problem Solving With the MATH Dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[17] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. 2023. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations.
[18] Shengran Hu, Cong Lu, and Jeff Clune. [n. d.]. Automated Design of Agentic Systems. In The Thirteenth International Conference on Learning Representations.
[19] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. CoRR abs/2409.12186 (2024). arXiv:2409.12186 doi:10.48550/ARXIV.2409.12186
[20] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, Heather Miller, et al. 2024. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations.
[21] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. [n. d.]. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations.
[22] Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2023. Dynamic LLM-agent network: An LLM-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170 (2023).
[23] Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2024. A dynamic LLM-powered agent network for task-oriented agent collaboration. In First Conference on Language Modeling.
[24] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36 (2023), 46534–46594.
[25] Meta. 2024. Llama-3.1-8B-Instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct Model card on Hugging Face, accessed 2026-02-09.
[26] Meta. 2024. Llama-3.2-3B-Instruct. https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct Model card on Hugging Face, accessed 2026-02-09.
[27] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. 2025. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131 (2025).
[28] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[29] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2021. AdapterFusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 487–503.
[30] Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. [n. d.]. Benchmarking Agentic Workflow Generation. In The Thirteenth International Conference on Learning Representations.
[31] Qwen Team. 2024. Qwen2.5-Coder-3B-Instruct. https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct Model card on Hugging Face, accessed 2026-02-09.
[32] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741.
[33] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. [n. d.]. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling.
[34] Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, et al. [n. d.]. LoRA: Low-rank adaptation of large language models. ([n. d.]).
[35] Llama Team. 2024. The Llama 3 Herd of Models. CoRR abs/2407.21783 (2024). arXiv:2407.21783 doi:10.48550/ARXIV.2407.21783
[36] Patara Trirat, Wonyong Jeong, and Sung Ju Hwang. 2025. Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding. arXiv preprint arXiv:2505.19764 (2025).
[37] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. 2024. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. In International Conference on Machine Learning. PMLR, 50622–50649.
[38] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. [n. d.]. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations.
[39] Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, and Bryon Aragam. 2025. ScoreFlow: Mastering LLM agent workflows via score-based preference optimization. arXiv preprint arXiv:2502.04306 (2025).
[40] Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2024. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 257–279.
[41] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
[42] Xun Wu, Shaohan Huang, and Furu Wei. 2023. MoLE: Mixture of LoRA Experts. In The Twelfth International Conference on Learning Representations.
[43] Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, and Min-Ling Zhang. 2025. RobustFlow: Towards Robust Agentic Workflow Generation. arXiv preprint arXiv:2509.21834 (2025).
[44] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2369–2380.
[45] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
[46] Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Siheng Chen, and Jing Shao. [n. d.]. MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems. In Forty-second International Conference on Machine Learning.
[47] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. [n. d.]. AFlow: Automating Agentic Workflow Generation. In The Thirteenth International Conference on Learning Representations.
[48] Yuanshuo Zhang, Yuchen Hou, Bohan Tang, Shuo Chen, Muhan Zhang, Xiaowen Dong, and Siheng Chen. 2025. GNNs as predictors of agentic workflow performances. arXiv preprint arXiv:2503.11301 (2025).
[49] Zijian Zhang, Shuchang Liu, Jiaao Yu, Qingpeng Cai, Xiangyu Zhao, Chunxu Zhang, Ziru Liu, Qidong Liu, Hongwei Zhao, Lantao Hu, et al. 2024. M3oE: Multi-domain multi-task mixture-of-experts recommendation framework. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 893–902.
[50] Ming Zhong, Yelong Shen, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, and Weizhu Chen. 2024. Multi-LoRA Composition for Image Generation. Transactions on Machine Learning Research 2024 (2024).
[51] Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, and Sercan Ö Arık. 2025. Multi-agent design: Optimizing agents with better prompts and topologies. arXiv preprint arXiv:2502.02533 (2025).
[52] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. 2024. GPTSwarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning.
A Appendix

A.1 Training and Regularization Details
Algorithm 1 Workflow Capability Basis Learning (Training)
This appendix provides implementation-level details for the regularization terms in $\mathcal{L}_{\mathrm{reg}}$ and the overall training procedure.

A.1.1 Two-timescale optimization via detached routing. A practical stability detail is to decouple how the generator (capability bases) and the composer (routing) receive gradients. During workflow generation, we inject a detached snapshot of routing weights into all adapted layers, i.e., $\tilde{\alpha}(t) = \mathrm{stopgrad}(\alpha(t))$. This prevents the composer from receiving high-variance token-level gradients through every adapted module. Instead, the composer is primarily optimized by counterfactual attribution (Eq. (15)) plus lightweight routing regularizers, while the capability bases are optimized by $\mathcal{L}_{\mathrm{MR}}$ and $\mathcal{L}_{\mathrm{NCE}}$.

A.1.2 Regularization. We add lightweight regularizers to prevent collapse and encourage a compositional capability library. Consistent with the main paper, $\mathcal{L}_{\mathrm{reg}}$ includes (i) basis orthogonality, (ii) routing entropy, and (iii) basis dropout.

(i) Basis orthogonality (diversity across bases). For each adapted layer $\ell$, we encourage different bases to encode distinct directions by penalizing correlations between basis-specific parameter vectors. Let $\mathbf{B}_A^{(\ell)} \in \mathbb{R}^{K \times D_A}$ and $\mathbf{B}_B^{(\ell)} \in \mathbb{R}^{K \times D_B}$ denote the stacked-and-flattened LoRA factors across bases, where the $k$-th row is the flattened parameter vector of basis $k$ at layer $\ell$. We impose:

$$\mathcal{L}_{\mathrm{ortho}} = \sum_{\ell \in L} \left( \left\| \mathbf{B}_A^{(\ell)} \mathbf{B}_A^{(\ell)\top} - I \right\|_F^2 + \left\| \mathbf{B}_B^{(\ell)} \mathbf{B}_B^{(\ell)\top} - I \right\|_F^2 \right), \quad (17)$$
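A per-layer instance of the diversity term in Eq. (17) can be sketched with NumPy; `B` stands in for one stacked-and-flattened LoRA factor (K rows, one per basis), and summing the penalty over layers and over both factors gives the full loss:

```python
import numpy as np

def ortho_penalty(B):
    """||B B^T - I||_F^2 for a stacked-and-flattened basis factor B (K x D)."""
    K = B.shape[0]
    G = B @ B.T                        # K x K Gram matrix of basis vectors
    return float(np.sum((G - np.eye(K)) ** 2))

rng = np.random.default_rng(0)
B = rng.normal(size=(8, 64))
B = B / np.linalg.norm(B, axis=1, keepdims=True)  # unit-norm rows
loss = ortho_penalty(B)  # near 0 only when rows are mutually orthogonal
```

The penalty vanishes exactly when the K basis vectors form an orthonormal set, which is what drives the bases toward distinct directions.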
where $I$ is the $K \times K$ identity.

(ii) Routing entropy (avoid premature collapse). We regularize routing entropy to encourage exploration early in training:

$$\mathcal{L}_{\mathrm{ent}} = -\lambda_{\mathrm{ent}} \, \mathbb{E}_t\!\left[ H(\alpha(t)) \right], \qquad H(\alpha) = -\sum_{k=1}^{K} \alpha_k \log \alpha_k. \quad (18)$$
Minimizing $\mathcal{L}_{\mathrm{ent}}$ increases entropy and prevents the composer from collapsing to a single basis too early. In practice, $\lambda_{\mathrm{ent}}$ is scheduled to transition from exploration to specialization.

(iii) Basis dropout (compositional robustness). We apply basis dropout to $\alpha(t)$ during training by randomly masking routing weights while always keeping the argmax basis active, followed by renormalization:

$$\tilde{\alpha}(t) \leftarrow \alpha(t) \odot m(t), \qquad m_k(t) \sim \mathrm{Bernoulli}(1 - p_{\mathrm{drop}}), \quad (19)$$

$m_{\arg\max_k \alpha_k(t)}(t) = 1$, and $\tilde{\alpha}(t)$ is renormalized to sum to 1. This forces the generator to remain functional under missing capability factors and reduces single-basis dominance.

Optional stabilizers. Our implementation additionally includes two stabilizers that can be enabled during early exploration without changing the core method. First, we encourage global utilization balance by matching the batch-mean routing distribution to uniform using a Jensen-Shannon penalty:

$$\mathcal{L}_{\mathrm{bal}} = \lambda_{\mathrm{bal}} \, \mathrm{JS}\!\left( \mathbb{E}_{\mathrm{batch}}[\alpha(t)] \,\big\|\, \mathbf{u} \right), \quad (20)$$
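The routing regularizers in Eqs. (18)-(20) each reduce to a few lines; this NumPy sketch treats `alpha` as a routing distribution over K = 8 bases, with illustrative hyperparameters (`p_drop`, the RNG seed) that are not the paper's settings:

```python
import numpy as np

def entropy(alpha):
    # H(alpha) = -sum_k alpha_k log alpha_k as in Eq. (18); 0 log 0 := 0.
    a = alpha[alpha > 0]
    return float(-(a * np.log(a)).sum())

def basis_dropout(alpha, p_drop=0.2, rng=np.random.default_rng(0)):
    # Eq. (19): mask routing weights, always keep the argmax basis, renormalize.
    mask = rng.random(alpha.shape) > p_drop
    mask[np.argmax(alpha)] = True
    kept = alpha * mask
    return kept / kept.sum()

def js_balance(batch_alpha):
    # Eq. (20): Jensen-Shannon divergence between the batch-mean routing
    # distribution and the uniform distribution over K bases.
    p = batch_alpha.mean(axis=0)
    u = np.full_like(p, 1.0 / len(p))
    m = 0.5 * (p + u)
    kl = lambda a, b: float(np.sum(np.where(a > 0, a * np.log(a / b), 0.0)))
    return 0.5 * kl(p, m) + 0.5 * kl(u, m)
```

Both `entropy` (to be maximized early) and `js_balance` (to be minimized) are zero-gradient-free only away from collapse, which is exactly the regime the regularizers are meant to avoid.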
where $\mathbf{u}$ is the uniform distribution over $K$ bases. Second, when using adaptive temperature (Sec. 4.2), we regularize extreme temperature variation.

Require: Dataset $\mathcal{D} = \{(q_i, \mathcal{P}_i, \mathcal{N}_i)\}$; frozen base LLM $\pi_{\theta_0}$; capability bases $\Phi$; composer $\pi_\psi$; sparsity top-$m$; attribution interval $E$.
Ensure: Trained $\Phi, \psi$.
1: Initialize $\Phi, \psi$; freeze $\theta_0$.
2: for each training step do
3:   Sample a task-local mini-batch from one task id, containing both successes $\mathcal{P}_i$ and failures $\mathcal{N}_i$.
4:   Compute task embedding $z(t)$ and routing weights $\alpha(t) = \pi_\psi(z(t))$.
5:   Apply basis dropout (Eq. (19)); set $\tilde{\alpha}(t) = \mathrm{stopgrad}(\alpha(t))$ for adapter injection.
6:   Select top-$m$ bases from $\tilde{\alpha}(t)$ and broadcast to all adapted layers.
7:   Update capability bases $\Phi$: compute $\mathcal{L}_{\mathrm{MR}}$ and $\mathcal{L}_{\mathrm{NCE}}$, add $\mathcal{L}_{\mathrm{ortho}}$ (Eq. (17)), and take a gradient step on $\Phi$.
8:   if step mod $E$ = 0 then
9:     Counterfactual attribution: for bases $k \in \mathcal{K}(t)$, compute $\Delta_k$ via Eq. (14) using counterfactual routings $\alpha^{(-k)}$.
10:    Update composer $\psi$: minimize $\mathcal{L}_{\mathrm{CCA}}$ (Eq. (15)) plus routing regularizers (Eq. (18), Eq. (20), and temperature regularization).
11:  end if
12: end for

A.1.3 Counterfactual attribution: implementation details. We instantiate Eq. (15) using counterfactual routing ablations. For each basis $k$, we form $\alpha^{(-k)}$ by setting $\alpha_k$ to 0 and renormalizing, then recompute the log-likelihood $\ell^{(-k)}$ and the marginal effect $\Delta_k = \ell_{\mathrm{main}} - \ell^{(-k)}$ (Eq. (14)). To reduce variance, we apply per-example centering and clipping to $\{\Delta_k\}$ before using them as credit signals.

Dead-basis penalty. To operationalize $\mathcal{L}_{\mathrm{dead}}$ in Eq. (15), we penalize allocating probability mass to bases whose marginal effects remain near zero:

$$\mathcal{L}_{\mathrm{dead}} = \mathbb{E}\left[ \sum_{k \in \mathcal{K}(t)} \alpha_k(t) \cdot \frac{\max(0, \gamma - |\Delta_k|)}{\gamma} \right], \quad (21)$$

where $\gamma$ is a margin threshold. Intuitively, if removing a basis does not change the likelihood ($|\Delta_k| \approx 0$), then allocating routing mass to it is discouraged.

Computational knobs. Since counterfactual attribution requires additional forward passes, we compute it every $E$ steps and optionally on a subsampled set of examples in a batch. In our implementation, we can attribute over all $K$ bases, while each counterfactual forward still activates only the top-$m$ bases to control cost.
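The routing-ablation step can be sketched as follows; the log-likelihood values and the clipping range are illustrative placeholders, not the paper's numbers (recomputing $\ell^{(-k)}$ requires the actual model forward pass, which is elided here):

```python
import numpy as np

def counterfactual_routings(alpha):
    """Build alpha^(-k) for each selected basis k: zero it out, renormalize."""
    routings = {}
    for k in np.nonzero(alpha)[0]:
        ablated = alpha.copy()
        ablated[k] = 0.0
        routings[int(k)] = ablated / ablated.sum()
    return routings

def marginal_effects(ll_main, ll_ablated, clip=1.0):
    """Delta_k = ll_main - ll^(-k) as in Eq. (14), then per-example centering
    and clipping to reduce variance; the clip range is an assumed knob."""
    deltas = {k: ll_main - v for k, v in ll_ablated.items()}
    center = np.mean(list(deltas.values()))
    return {k: float(np.clip(d - center, -clip, clip)) for k, d in deltas.items()}

alpha = np.array([0.5, 0.0, 0.3, 0.0, 0.2, 0.0, 0.0, 0.0])  # top-3 routing
cf = counterfactual_routings(alpha)
# Placeholder log-likelihoods: removing basis 0 hurts most.
deltas = marginal_effects(-1.2, {0: -2.0, 2: -1.3, 4: -1.25})
```

The centered effects then act as credit signals: bases whose removal degrades likelihood most receive the largest positive credit, while near-zero effects feed the dead-basis penalty of Eq. (21).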
A.2 Training Procedure
Algorithm 1 summarizes the overall training procedure.

Received 12 February 2026; revised 12 February 2026; accepted 12 February 2026