A New Agentic Framework

Heterogeneous Scientific Foundation Model Collaboration

Fiction becomes reality in the digital world.
Eywa coordinates fundamentally different models in a unified framework.

Zihao Li, Jiaru Zou, Feihao Fang, Xuying Ning, Mengting Ai, Tianxin Wei,
Sirui Chen, Xiyuan Yang, Jingrui He
University of Illinois Urbana-Champaign
+6.6%
EywaAgent utility gain over the single-LLM baseline
+4.9%
EywaMAS utility gain over the homogeneous MAS baseline
-29.8%
Average token reduction from specialist delegation
-10.6%
Average running time reduction with foundation models
Overview

Abstract

Agentic large language model systems have demonstrated strong capabilities, but their reliance on language as the universal interface fundamentally limits their applicability to scientific problems with structured, non-linguistic inputs. Eywa addresses this gap by augmenting domain-specific foundation models with a language-model-based reasoning interface, enabling LLMs to guide specialized inference over heterogeneous scientific modalities.

The framework supports three levels of integration: EywaAgent as a drop-in replacement for a single-agent pipeline, EywaMAS as a plug-and-play extension of multi-agent systems, and EywaOrchestra as a planner that dynamically coordinates traditional agents and Eywa agents across tasks.

Across physical, life, and social science tasks, Eywa improves the quality-cost trade-off by combining generalized reasoning with specialized acting, reducing the language overhead of scientific problem solving while improving overall utility.

Overall Eywa results across scientific domains

Project snapshot. Eywa improves the utility-cost frontier across physical, life, and social science tasks. The left side shows the overall Pareto trade-off, and the right-side panels show consistent gains in utility, token efficiency, and execution time across domains.

Framework

Eywa at a Glance

Eywa starts from the FM-LLM "Tsaheylu" interface, then scales into multi-agent systems and adaptive orchestration.

Eywa framework overview

Framework overview. The Pandora analogy introduces the three-step progression: a reasoning-augmented specialist, a plug-and-play heterogeneous MAS, and a conductor that orchestrates experts dynamically.

EywaAgent

Why LLM-only reasoning is not enough

Scientific tasks often depend on structured data like time series and tables. A pure language agent has to serialize those inputs into text, reason over them token by token, and hope that the language interface preserves the task-relevant signal. The paper's core argument is that this creates an information bottleneck.

  • Problem: the FM may not natively understand natural language, while the LLM lacks the specialist inductive bias.
  • Tsaheylu interface: a query compiler turns task state into a structured invocation, and a response adapter turns FM output into planner-consumable context.
  • Why tokens drop: modality-specific computation happens inside the specialist, so the LLM no longer spends long token traces simulating the prediction itself.
Takeaway. EywaAgent turns a domain-specific foundation model into a reasoning-participating agent with a stable interface.
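The Tsaheylu pattern above (query compiler in, response adapter out) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the schema, field names, and the `tabpfn` specialist label are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class FMInvocation:
    """Structured call compiled from the planner's task state (hypothetical schema)."""
    model_name: str                          # e.g. a tabular or time-series FM
    inputs: Any                              # structured data, never serialized to tokens
    config: Dict[str, Any] = field(default_factory=dict)  # specialist settings

def compile_query(task_state: Dict[str, Any]) -> FMInvocation:
    """Query compiler: map the LLM's task state to a structured FM invocation."""
    return FMInvocation(
        model_name=task_state["specialist"],
        inputs=task_state["structured_data"],
        config=task_state.get("fm_config", {}),
    )

def adapt_response(fm_output: Any) -> str:
    """Response adapter: condense FM output into planner-consumable context."""
    return f"Specialist prediction: {fm_output}"

# The planner only ever sees the short adapted summary, so the heavy
# modality-specific computation stays inside the foundation model.
state = {"specialist": "tabpfn", "structured_data": [[1.0, 2.0], [3.0, 4.0]]}
call = compile_query(state)
context = adapt_response("class=1 (p=0.92)")
```

Because the adapter returns a compact summary rather than a serialized trace of the raw data, the LLM's token budget is spent on reasoning instead of re-simulating the prediction.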
EywaAgent figure

EywaAgent. The LLM parses the task, configures the specialist through Tsaheylu, delegates the core domain computation, then checks and realizes the final response in the required output format.

EywaMAS figure

EywaMAS. Sequential, looped, hierarchical, and heterogeneous agent systems can all be upgraded by replacing selected language-only workers with EywaAgents.

EywaMAS & EywaOrchestra

How collaboration scales beyond one specialist

EywaMAS preserves the topology of an existing multi-agent system while replacing selected workers with specialist-backed agents. EywaOrchestra goes one step further and asks which configuration should be used for this task at all.

  • Plug-and-play replacement: a planner, summarizer, or worker can stay in place while only the task-relevant workers become EywaAgents.
  • Heterogeneous topologies: the framework supports sequential, looped, hierarchical, and mixed-agent collaboration rather than one fixed structure.
  • Planner necessity: not every domain benefits equally from heavy multi-agent computation, so EywaOrchestra chooses model mix and topology based on the sample.
Takeaway. EywaMAS gives you a modular upgrade path; EywaOrchestra adds task adaptivity on top of that.
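The plug-and-play idea can be made concrete with a tiny sketch: keep the pipeline's topology and swap only selected workers for specialist-backed ones. All names and the worker signature here are illustrative assumptions, not the paper's code.

```python
from typing import Callable, Dict, List

# A worker maps a task dict to a text contribution (simplified interface).
Worker = Callable[[dict], str]

def llm_worker(task: dict) -> str:
    # Language-only worker: reasons over the task in text.
    return f"LLM reasoning over: {task['question']}"

def eywa_worker(task: dict) -> str:
    # Specialist-backed agent: delegates structured inputs to an FM and
    # returns only the adapted summary (hypothetical behavior).
    return f"FM prediction for {task['modality']} input"

def upgrade_mas(pipeline: List[Worker],
                replacements: Dict[int, Worker]) -> List[Worker]:
    """Keep the topology intact; swap only the task-relevant workers."""
    return [replacements.get(i, w) for i, w in enumerate(pipeline)]

# A sequential three-worker system where only the middle worker touches
# structured data, so only that slot becomes an EywaAgent.
pipeline = [llm_worker, llm_worker, llm_worker]
upgraded = upgrade_mas(pipeline, {1: eywa_worker})
```

The same replacement map works for looped or hierarchical topologies, since nothing about the surrounding planner or summarizer has to change.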
Benchmark

EywaBench Covers Real Scientific Structure

EywaBench is built to test heterogeneous scientific reasoning across domains, modalities, and task sources instead of flattening everything into one generic QA setting.

200
Released EywaBench-V1 tasks

The current split is a representative slice sampled from a larger, fully extensible benchmark construction pipeline.

9
Scientific sub-domains

Material, energy, space, biology, clinic, drug, economy, business, and infrastructure are all populated.

27
Domain-modality cells

All 27 combinations of sub-domain and modality are covered, avoiding the usual domain-collapse problem.

EywaBench source diversity

Figure 8. EywaBench mixes natural-language, time-series, and tabular sources with a long-tailed source distribution rather than one dominant dataset.

EywaBench hierarchical view

Figure 9. The benchmark is explicitly organized as parent domain → sub-domain → modality, which is exactly the kind of structure Eywa is meant to exploit.
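The parent domain → sub-domain → modality hierarchy can be enumerated directly; doing so confirms the 27 domain-modality cells. The sub-domain and modality names come from this page, but the exact cell assignment is assumed for illustration.

```python
from itertools import product

# EywaBench hierarchy as described on this page: three parent domains,
# nine sub-domains, and three input modalities.
DOMAINS = {
    "physical": ["material", "energy", "space"],
    "life": ["biology", "clinic", "drug"],
    "social": ["economy", "business", "infrastructure"],
}
MODALITIES = ["natural-language", "time-series", "tabular"]

# Every (parent, sub-domain, modality) cell in the benchmark grid.
cells = [
    (parent, sub, mod)
    for parent, subs in DOMAINS.items()
    for sub, mod in product(subs, MODALITIES)
]
assert len(cells) == 27  # all domain-modality cells are populated
```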

Main Results

Whole-System Performance

The table below reproduces the main EywaBench comparison, and the follow-up panels summarize what those numbers actually mean.

Columns group as Physical Science (Material, Energy, Space), Life Science (Biology, Clinic, Drug), and Social Science (Economy, Business, Infrastructure).

| Method | Metric | Material | Energy | Space | Biology | Clinic | Drug | Economy | Business | Infrastructure | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Single-Agent Setting** |||||||||||
| Single-LLM-Agent | Utility (↑) | 0.5616 | 0.8202 | 0.5235 | 0.3402 | 0.4582 | 0.6004 | 0.7689 | 0.6528 | 0.6758 | 0.6154 |
| | Time (↓) | 34.48 | 27.01 | 26.00 | 34.68 | 22.37 | 21.13 | 22.67 | 22.28 | 18.42 | 25.22 |
| | Tokens (↓) | 6367 | 4854 | 4512 | 6164 | 3618 | 3571 | 4097 | 3915 | 3327 | 4469 |
| EywaAgent (Ours) | Utility (↑) | 0.5871 | 0.8390 | 0.6123 | 0.3718 | 0.5085 | 0.6199 | 0.8048 | 0.7371 | 0.7060 | 0.6558 |
| | Time (↓) | 34.88 | 24.42 | 23.12 | 30.84 | 20.32 | 15.84 | 19.71 | 20.98 | 15.99 | 22.78 |
| | Tokens (↓) | 5040 | 3167 | 3329 | 4858 | 2333 | 2210 | 2791 | 2444 | 2248 | 3137 |
| **Multi-Agent Setting** |||||||||||
| Refine MAS [2023] | Utility (↑) | 0.5687 | 0.8667 | 0.6244 | 0.3623 | 0.4504 | 0.6215 | 0.7523 | 0.6880 | 0.6362 | 0.6294 |
| | Time (↓) | 72.76 | 64.22 | 79.65 | 75.21 | 51.89 | 50.63 | 62.33 | 48.54 | 47.49 | 60.59 |
| | Tokens (↓) | 11013 | 9009 | 10043 | 10497 | 7029 | 7498 | 8924 | 6997 | 7438 | 8673 |
| Debate MAS [2024] | Utility (↑) | 0.5602 | 0.8656 | 0.6543 | 0.3438 | 0.4738 | 0.6198 | 0.7729 | 0.6907 | 0.7237 | 0.6460 |
| | Time (↓) | 82.06 | 79.46 | 74.75 | 101.64 | 78.19 | 63.98 | 92.72 | 72.46 | 60.73 | 78.22 |
| | Tokens (↓) | 16652 | 14278 | 13614 | 17007 | 11159 | 10447 | 14694 | 10953 | 10311 | 13216 |
| MoA [2025] | Utility (↑) | 0.5909 | 0.8069 | 0.5863 | 0.3580 | 0.4722 | 0.5686 | 0.7499 | 0.7004 | 0.6938 | 0.6273 |
| | Time (↓) | 90.15 | 56.95 | 69.32 | 59.10 | 46.53 | 44.31 | 57.35 | 48.29 | 47.34 | 57.75 |
| | Tokens (↓) | 25327 | 16453 | 17332 | 15980 | 11014 | 10344 | 16114 | 11690 | 12365 | 15317 |
| X-MAS [2025] | Utility (↑) | 0.5831 | 0.8057 | 0.5723 | 0.3737 | 0.4490 | 0.6211 | 0.6923 | 0.6390 | 0.7180 | 0.6188 |
| | Time (↓) | 104.48 | 86.63 | 79.06 | 88.20 | 67.94 | 59.76 | 75.50 | 72.82 | 62.95 | 77.42 |
| | Tokens (↓) | 24149 | 19808 | 16584 | 18451 | 12549 | 11907 | 16499 | 14007 | 14056 | 16537 |
| EywaMAS (Ours) | Utility (↑) | 0.6381 | 0.8742 | 0.6899 | 0.3798 | 0.5086 | 0.6248 | 0.7959 | 0.7284 | 0.7406 | 0.6761 |
| | Time (↓) | 77.25 | 75.96 | 72.51 | 111.92 | 59.97 | 59.23 | 68.40 | 58.11 | 46.49 | 72.11 |
| | Tokens (↓) | 14529 | 11709 | 11787 | 16502 | 9407 | 8078 | 11044 | 9470 | 8912 | 11214 |
| **Dynamic Orchestration** |||||||||||
| EywaOrchestra (Ours) | Utility (↑) | 0.6249 | 0.8711 | 0.7187 | 0.3682 | 0.5159 | 0.6319 | 0.7830 | 0.7388 | 0.7298 | 0.6746 |
| | Time (↓) | 61.78 | 39.92 | 75.47 | 67.88 | 45.38 | 45.94 | 49.13 | 34.18 | 28.80 | 48.16 |
| | Tokens (↓) | 11535 | 7723 | 10810 | 11315 | 7050 | 6495 | 7117 | 7264 | 6892 | 8335 |

Full EywaBench comparison across utility, time, and token usage over the nine scientific sub-domains; the paper's table additionally bolds best and underlines second-best values.

01

EywaAgent improves both quality and efficiency

Under the same backbone, EywaAgent raises utility while cutting latency and reducing tokens by nearly 30% through specialist delegation.

02

EywaMAS beats homogeneous MAS baselines

EywaMAS achieves the best overall fixed-system utility and outperforms Refine and Debate in scientific settings.

03

LLM-only heterogeneity is not enough

Methods that only combine multiple language models do not consistently beat strong homogeneous MAS baselines on EywaBench.

04

Heavier MAS is not always necessary

On some domains such as economy and business, a single EywaAgent is already highly competitive, which motivates adaptive orchestration.

05

EywaOrchestra gets close at lower cost

The planner reaches utility close to expert-designed EywaMAS while lowering token and latency cost and removing expert configuration overhead.

Utility and token trade-off for Eywa methods

Figure 5. On the global utility-token plot, Eywa methods move the frontier upward and left: higher utility with fewer tokens than language-only baselines.

Per-domain trade-offs on EywaBench

Figure 13. The same trade-off pattern largely persists when broken out across the nine scientific sub-domains.

Further Analysis

Robustness and Backbone Effects

The paper's ablations show that Eywa's gains are not tied to one narrow prompt or hyperparameter setting.

Stable
Across sampling temperatures

Performance remains broadly steady as LLM temperature varies, suggesting the framework is not brittle to one decoding setting.

Robust
Across FM calibration

Changing the TabPFN softmax temperature does not erase the benefit of specialist delegation.

Structured
Prompts help a bit more

Detailed, chain-of-thought, and ReAct-style prompting all work, with more structured designs usually helping slightly.

Cite

BibTeX

@misc{li2026heterogeneous,
  title   = {Heterogeneous Scientific Foundation Model Collaboration},
  author  = {Zihao Li and Jiaru Zou and Feihao Fang and Xuying Ning and
             Mengting Ai and Tianxin Wei and Sirui Chen and Xiyuan Yang and
             Jingrui He},
  year    = {2026},
  note    = {Preprint},
  url     = {https://github.com/Violet24K/Eywa},
}