A New Agentic Framework

Heterogeneous Scientific Foundation Model Collaboration

Fiction becomes reality in the digital world.
Eywa coordinates fundamentally different models in a unified framework.

Zihao Li, Jiaru Zou, Feihao Fang, Xuying Ning, Mengting Ai, Tianxin Wei,
Sirui Chen, Xiyuan Yang, Jingrui He

University of Illinois Urbana-Champaign


+6.6%

EywaAgent utility gain over the single-LLM baseline

+4.9%

EywaMAS utility gain over the homogeneous MAS baseline

-29.8%

Average token reduction from specialist delegation

-10.6%

Average running time reduction with foundation models

Overview

Abstract

Agentic large language model systems have demonstrated strong capabilities, but their reliance on language as the universal interface fundamentally limits their applicability to scientific problems with structured, non-linguistic inputs. Eywa addresses this gap by augmenting domain-specific foundation models with a language-model-based reasoning interface, enabling LLMs to guide specialized inference over heterogeneous scientific modalities.

The framework supports three levels of integration: EywaAgent as a drop-in replacement for a single-agent pipeline, EywaMAS as a plug-and-play extension of multi-agent systems, and EywaOrchestra as a planner that dynamically coordinates traditional agents and Eywa agents across tasks.

Across physical, life, and social science tasks, Eywa improves the quality-cost trade-off by combining generalized reasoning with specialized acting, reducing the language overhead of scientific problem solving while improving overall utility.


Project snapshot. Eywa improves the utility-cost frontier across physical, life, and social science tasks. The left side shows the overall Pareto trade-off, and the right-side panels show consistent gains in utility, token efficiency, and execution time across domains.

Framework

Eywa at a Glance

Eywa starts from the FM-LLM "Tsaheylu" interface, then scales to multi-agent systems and adaptive orchestration.

Framework overview. The Pandora analogy introduces the three-step progression: a reasoning-augmented specialist, a plug-and-play heterogeneous MAS, and a conductor that orchestrates experts dynamically.

EywaAgent

Why LLM-only reasoning is not enough

Scientific tasks often depend on structured data like time series and tables. A pure language agent has to serialize those inputs into text, reason over them token by token, and hope that the language interface preserves the task-relevant signal. The paper's core argument is that this creates an information bottleneck.

Problem: the FM may not natively understand natural language, while the LLM lacks the specialist inductive bias.
Tsaheylu interface: a query compiler turns task state into a structured invocation, and a response adapter turns FM output into planner-consumable context.
Why tokens drop: modality-specific computation happens inside the specialist, so the LLM no longer spends long token traces simulating the prediction itself.

Takeaway. EywaAgent turns a domain-specific foundation model into a reasoning-participating agent with a stable interface.
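As a rough illustration of the interface described above, the sketch below wires a query compiler and a response adapter around a stand-in specialist. Everything named here (`Invocation`, `compile_query`, `adapt_response`, `naive_forecaster`, `eywa_agent_step`, and the task-state keys) is hypothetical and chosen for illustration; it is not the paper's implementation, and the "specialist" is a trivial last-value forecaster rather than a real foundation model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Invocation:
    """Structured call produced by the query compiler (hypothetical schema)."""
    series: Sequence[float]
    horizon: int


def compile_query(task_state: dict) -> Invocation:
    """Query compiler: turn task state into a structured invocation."""
    return Invocation(series=task_state["series"], horizon=task_state["horizon"])


def naive_forecaster(call: Invocation) -> List[float]:
    """Stand-in specialist that repeats the last observed value.

    A real domain-specific foundation model would replace this function.
    """
    return [call.series[-1]] * call.horizon


def adapt_response(forecast: Sequence[float]) -> str:
    """Response adapter: turn raw specialist output into planner-readable text."""
    values = ", ".join(f"{v:.2f}" for v in forecast)
    return f"Specialist forecast for the next {len(forecast)} steps: {values}"


def eywa_agent_step(task_state: dict,
                    specialist: Callable[[Invocation], List[float]]) -> str:
    """One delegation round: compile the query, call the specialist, adapt the result."""
    call = compile_query(task_state)
    raw = specialist(call)
    return adapt_response(raw)


context = eywa_agent_step({"series": [1.0, 2.0, 3.5], "horizon": 2}, naive_forecaster)
print(context)  # Specialist forecast for the next 2 steps: 3.50, 3.50
```

The point of the split is that the numeric series never has to be spelled out token by token for the LLM; only the short adapted summary re-enters the language context.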

EywaAgent. The LLM parses the task, configures the specialist through Tsaheylu, delegates the core domain computation, then checks and realizes the final response in the required output format.

EywaMAS. Sequential, looped, hierarchical, and heterogeneous agent systems can all be upgraded by replacing selected language-only workers with EywaAgents.

EywaMAS & EywaOrchestra

How collaboration scales beyond one specialist

EywaMAS preserves the topology of an existing multi-agent system while replacing selected workers with specialist-backed agents. EywaOrchestra goes one step further and asks which configuration should be used for this task at all.

Plug-and-play replacement: a planner, summarizer, or worker can stay in place while only the task-relevant workers become EywaAgents.
Heterogeneous topologies: the framework supports sequential, looped, hierarchical, and mixed-agent collaboration rather than one fixed structure.
Planner necessity: not every domain benefits equally from heavy multi-agent computation, so EywaOrchestra chooses model mix and topology based on the sample.
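The plug-and-play idea can be sketched in a few lines: keep the role order fixed and swap only the agent bound to one role. This is a toy sketch under stated assumptions, with agents modeled as plain string-to-string functions and all names (`run_sequential`, `llm_worker`, `eywa_worker`, and so on) invented for illustration.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]  # each agent maps incoming context to outgoing context


def llm_planner(ctx: str) -> str:
    return ctx + " | plan"


def llm_worker(ctx: str) -> str:
    return ctx + " | llm-worker"


def eywa_worker(ctx: str) -> str:
    # Placeholder for an EywaAgent that would delegate to a specialist model.
    return ctx + " | eywa-worker"


def llm_summarizer(ctx: str) -> str:
    return ctx + " | summary"


def run_sequential(roles: List[str], agents: Dict[str, Agent], task: str) -> str:
    """Run agents in a fixed order; the topology is just the ordered role list."""
    ctx = task
    for role in roles:
        ctx = agents[role](ctx)
    return ctx


topology = ["planner", "worker", "summarizer"]
baseline = {"planner": llm_planner, "worker": llm_worker, "summarizer": llm_summarizer}
upgraded = {**baseline, "worker": eywa_worker}  # swap only the worker role

print(run_sequential(topology, baseline, "task"))  # task | plan | llm-worker | summary
print(run_sequential(topology, upgraded, "task"))  # task | plan | eywa-worker | summary
```

The topology list is untouched between the two runs, which is the sense in which the upgrade is modular.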

Takeaway. EywaMAS gives you a modular upgrade path; EywaOrchestra adds task adaptivity on top of that.
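To make the task-adaptivity idea concrete, here is a minimal sketch of a per-sample configuration choice. It uses hand-written rules as a stand-in for the planner; the `Config` type, the `choose_config` function, the `needs_multi_step` field, and the rules themselves are assumptions made for illustration and do not describe how EywaOrchestra actually decides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Config:
    topology: str              # e.g. "single" or "sequential"
    agents: Tuple[str, ...]    # agent types used, in execution order


def choose_config(sample: dict) -> Config:
    """Rule-based stand-in for a planner (hypothetical rules)."""
    # Assumption: structured modalities are routed to a specialist-backed agent.
    structured = sample["modality"] in {"time-series", "tabular"}
    worker = "eywa_agent" if structured else "llm_agent"
    # Assumption: a hypothetical flag marks samples that need a heavier pipeline.
    if sample.get("needs_multi_step", False):
        return Config("sequential", ("llm_planner", worker, "llm_summarizer"))
    return Config("single", (worker,))


print(choose_config({"modality": "tabular"}))
print(choose_config({"modality": "natural-language", "needs_multi_step": True}))
```

A planner of this shape returns both a model mix and a topology, so cheap samples can stay on a single agent while harder ones get a multi-agent pipeline.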

Benchmark

EywaBench Covers Real Scientific Structure

EywaBench is built to test heterogeneous scientific reasoning across domains, modalities, and task sources instead of flattening everything into one generic QA setting.

200

Released EywaBench-V1 tasks

The current split is a representative slice sampled from a larger, fully extensible benchmark construction pipeline.

9

Scientific sub-domains

Material, energy, space, biology, clinic, drug, economy, business, and infrastructure are all populated.

27

Domain-modality cells

All 27 combinations of sub-domain and modality are covered, avoiding the usual domain-collapse problem.

Figure 8. EywaBench mixes natural-language, time-series, and tabular sources with a long-tailed source distribution rather than one dominant dataset.

Figure 9. The benchmark is explicitly organized as parent domain → sub-domain → modality, which is exactly the kind of structure Eywa is meant to exploit.
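The parent domain → sub-domain → modality layout can be enumerated directly. The sketch below takes the domain grouping from the results table header and the three modalities from the Figure 8 caption; the variable names are illustrative only.

```python
from itertools import product

DOMAINS = {
    "Physical Science": ["Material", "Energy", "Space"],
    "Life Science": ["Biology", "Clinic", "Drug"],
    "Social Science": ["Economy", "Business", "Infrastructure"],
}
MODALITIES = ["natural-language", "time-series", "tabular"]

# One cell per (parent domain, sub-domain, modality) combination.
cells = [
    (parent, sub, modality)
    for parent, subs in DOMAINS.items()
    for sub, modality in product(subs, MODALITIES)
]

print(len(cells))  # 27, i.e. 9 sub-domains x 3 modalities
print(cells[0])    # ('Physical Science', 'Material', 'natural-language')
```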

Main Results

Whole-System Performance

The table below reproduces the main EywaBench comparison, and the follow-up panels summarize what those numbers actually mean.

Method	Metrics	Physical Science			Life Science			Social Science			Overall
Method	Metrics	Material	Energy	Space	Biology	Clinic	Drug	Economy	Business	Infrastructure	Overall
Single-Agent Setting
Single-LLM-Agent	Utility (↑)	0.5616	0.8202	0.5235	0.3402	0.4582	0.6004	0.7689	0.6528	0.6758	0.6154
	Time (↓)	34.48	27.01	26.00	34.68	22.37	21.13	22.67	22.28	18.42	25.22
	Tokens (↓)	6367	4854	4512	6164	3618	3571	4097	3915	3327	4469
EywaAgent (Ours)	Utility (↑)	0.5871	0.8390	0.6123	0.3718	0.5085	0.6199	0.8048	0.7371	0.7060	0.6558
	Time (↓)	34.88	24.42	23.12	30.84	20.32	15.84	19.71	20.98	15.99	22.78
	Tokens (↓)	5040	3167	3329	4858	2333	2210	2791	2444	2248	3137
Multi-Agent Setting
Refine MAS [2023]	Utility (↑)	0.5687	0.8667	0.6244	0.3623	0.4504	0.6215	0.7523	0.6880	0.6362	0.6294
	Time (↓)	72.76	64.22	79.65	75.21	51.89	50.63	62.33	48.54	47.49	60.59
	Tokens (↓)	11013	9009	10043	10497	7029	7498	8924	6997	7438	8673
Debate MAS [2024]	Utility (↑)	0.5602	0.8656	0.6543	0.3438	0.4738	0.6198	0.7729	0.6907	0.7237	0.6460
	Time (↓)	82.06	79.46	74.75	101.64	78.19	63.98	92.72	72.46	60.73	78.22
	Tokens (↓)	16652	14278	13614	17007	11159	10447	14694	10953	10311	13216
MoA [2025]	Utility (↑)	0.5909	0.8069	0.5863	0.3580	0.4722	0.5686	0.7499	0.7004	0.6938	0.6273
	Time (↓)	90.15	56.95	69.32	59.10	46.53	44.31	57.35	48.29	47.34	57.75
	Tokens (↓)	25327	16453	17332	15980	11014	10344	16114	11690	12365	15317
X-MAS [2025]	Utility (↑)	0.5831	0.8057	0.5723	0.3737	0.4490	0.6211	0.6923	0.6390	0.7180	0.6188
	Time (↓)	104.48	86.63	79.06	88.20	67.94	59.76	75.50	72.82	62.95	77.42
	Tokens (↓)	24149	19808	16584	18451	12549	11907	16499	14007	14056	16537
EywaMAS (Ours)	Utility (↑)	0.6381	0.8742	0.6899	0.3798	0.5086	0.6248	0.7959	0.7284	0.7406	0.6761
	Time (↓)	77.25	75.96	72.51	111.92	59.97	59.23	68.40	58.11	46.49	72.11
	Tokens (↓)	14529	11709	11787	16502	9407	8078	11044	9470	8912	11214
Dynamic Orchestration
EywaOrchestra (Ours)	Utility (↑)	0.6249	0.8711	0.7187	0.3682	0.5159	0.6319	0.7830	0.7388	0.7298	0.6746
	Time (↓)	61.78	39.92	75.47	67.88	45.38	45.94	49.13	34.18	28.80	48.16
	Tokens (↓)	11535	7723	10810	11315	7050	6495	7117	7264	6892	8335

Full EywaBench comparison across utility, time, and token usage. The paper's table bolds the best values and underlines the second-best values.

EywaAgent improves both quality and efficiency

Under the same backbone, EywaAgent raises overall utility from 0.6154 to 0.6558 while cutting running time and reducing tokens by nearly 30% (4469 to 3137) through specialist delegation.
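For readers who want to check the arithmetic, the relative changes can be recomputed from the Overall column of the single-agent rows. This is a small worked calculation, not code from the project; the helper name `relative_change` is made up here. Headline figures elsewhere on the page may be averaged differently (for example per sub-domain), so the time figure computed this way need not match them exactly.

```python
def relative_change(baseline: float, new: float) -> float:
    """Signed percentage change from baseline to new."""
    return (new - baseline) / baseline * 100.0


# Overall column, single-agent setting.
single_llm = {"utility": 0.6154, "time": 25.22, "tokens": 4469}
eywa_agent = {"utility": 0.6558, "time": 22.78, "tokens": 3137}

changes = {
    metric: round(relative_change(single_llm[metric], eywa_agent[metric]), 1)
    for metric in single_llm
}
print(changes["utility"])  # 6.6
print(changes["tokens"])   # -29.8
print(changes["time"])
```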

EywaMAS beats homogeneous MAS baselines

EywaMAS achieves the best overall utility among fixed systems (0.6761), ahead of Refine MAS (0.6294) and Debate MAS (0.6460).

LLM-only heterogeneity is not enough

Methods that only combine multiple language models do not consistently beat strong homogeneous MAS baselines on EywaBench.

Heavier MAS is not always necessary

On some domains such as economy and business, a single EywaAgent is already highly competitive, which motivates adaptive orchestration.
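A quick per-sub-domain comparison of the utility rows for EywaAgent and EywaMAS shows where the single agent already matches or beats the multi-agent system. The values are copied from the results table; the list names are illustrative.

```python
SUBDOMAINS = ["Material", "Energy", "Space", "Biology", "Clinic", "Drug",
              "Economy", "Business", "Infrastructure"]
# Utility rows from the results table.
eywa_agent_utility = [0.5871, 0.8390, 0.6123, 0.3718, 0.5085, 0.6199, 0.8048, 0.7371, 0.7060]
eywa_mas_utility = [0.6381, 0.8742, 0.6899, 0.3798, 0.5086, 0.6248, 0.7959, 0.7284, 0.7406]

# Sub-domains where the single EywaAgent is at least as good as EywaMAS.
agent_wins = [
    name
    for name, agent, mas in zip(SUBDOMAINS, eywa_agent_utility, eywa_mas_utility)
    if agent >= mas
]
print(agent_wins)  # ['Economy', 'Business']
```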

EywaOrchestra gets close at lower cost

The planner reaches overall utility close to expert-designed EywaMAS (0.6746 vs. 0.6761) while lowering tokens (8335 vs. 11214) and running time (48.16 vs. 72.11), and it removes expert configuration overhead.


Figure 5. On the global utility-token plot, Eywa methods move the frontier upward and left: higher utility with fewer tokens than language-only baselines.
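The frontier claim can be checked with a simple dominance test over the Overall token and utility values from the results table. This is a sketch of a standard Pareto filter (fewer tokens and higher utility are both better), not the plotting code behind the figure; `RESULTS` and `dominates` are names chosen here.

```python
# (method, overall tokens, overall utility) from the results table.
RESULTS = [
    ("Single-LLM-Agent", 4469, 0.6154),
    ("EywaAgent", 3137, 0.6558),
    ("Refine MAS", 8673, 0.6294),
    ("Debate MAS", 13216, 0.6460),
    ("MoA", 15317, 0.6273),
    ("X-MAS", 16537, 0.6188),
    ("EywaMAS", 11214, 0.6761),
    ("EywaOrchestra", 8335, 0.6746),
]


def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on one."""
    _, tok_a, util_a = a
    _, tok_b, util_b = b
    return tok_a <= tok_b and util_a >= util_b and (tok_a < tok_b or util_a > util_b)


# Keep every method that no other method dominates.
frontier = [r[0] for r in RESULTS if not any(dominates(o, r) for o in RESULTS)]
print(frontier)  # ['EywaAgent', 'EywaMAS', 'EywaOrchestra']
```

On these overall numbers, every language-only baseline is dominated by EywaAgent, which uses fewer tokens and reaches higher utility.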

Figure 13. The same trade-off pattern largely persists when broken out across the nine scientific sub-domains.

Further Analysis

Robustness and Backbone Effects

The paper's ablations show that Eywa's gains are not tied to one narrow prompt or hyperparameter setting.

Stable

Across sampling temperatures

Performance remains broadly steady as LLM temperature varies, suggesting the framework is not brittle to one decoding setting.

Robust

Across FM calibration

Changing the TabPFN softmax temperature does not erase the benefit of specialist delegation.

Structured

Prompts help a bit more

Detailed, chain-of-thought, and ReAct-style prompting all work, with more structured designs usually helping slightly.
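For context on the FM calibration ablation, the sketch below shows generic temperature scaling of a softmax: logits are divided by a temperature before normalization, which sharpens or flattens the probabilities without changing their ranking. It is a pure-Python illustration of the general technique and does not call TabPFN; the function name and the example logits are made up.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax: divide logits by T, then normalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # lower T: more peaked
flat = softmax_with_temperature(logits, 2.0)   # higher T: more uniform

# The top class is the same at both temperatures; only confidence changes.
print(sharp.index(max(sharp)) == flat.index(max(flat)))  # True
```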

Cite

BibTeX

@misc{li2026heterogeneous,
  title   = {Heterogeneous Scientific Foundation Model Collaboration},
  author  = {Zihao Li and Jiaru Zou and Feihao Fang and Xuying Ning and
             Mengting Ai and Tianxin Wei and Sirui Chen and Xiyuan Yang and
             Jingrui He},
  year    = {2026},
  note    = {Preprint},
  url     = {https://github.com/Violet24K/Eywa},
}