The Agent Reality Gap

Agentic AI systems for drug discovery have demonstrated autonomous synthesis planning, literature mining, and molecular design. But how well do they generalize beyond small-molecule workflows at well-resourced pharmaceutical companies?

This paper evaluates six agentic frameworks — ChemCrow, Coscientist, PharmAgents, ChatInvent, MADD, and DiscoVerse — against 15 task classes drawn from peptide therapeutics, in vivo pharmacology, and resource-constrained settings. The result: no framework fully supports any of the 15 task classes. Coverage is sparse, with the highest framework achieving just 16.7% weighted coverage. Five critical capability gaps emerge from the analysis.

A paired knowledge-probing experiment reveals the bottleneck is architectural, not epistemic: four frontier LLMs reason about peptides at levels comparable to small molecules. The knowledge exists — no current agent exposes it.

At a glance: 0% full task coverage; 16.7% best framework coverage; 5 capability gaps identified; −0.115 small-molecule vs. peptide knowledge gap.

Key Findings

Gap 1
SMALL-MOLECULE BIAS
All six frameworks assume SMILES strings and molecular fingerprints. Zero coverage for peptide-specific tasks (generative design, PLM fine-tuning, enzyme stability). Protein language models are absent as first-class components.
Gap 2
NO IN VIVO–IN SILICO BRIDGE
No agent integrates longitudinal in vivo data (behavioral scores, histology, RNA-seq, clinical notes). Six task classes spanning recovery modeling, imaging, transcriptomics, and behavioral phenotyping receive zero coverage.
Gap 3
LLM-ONLY PARADIGM
Agents rely on LLM inference with no pathway to ML training, reinforcement learning, or simulation. No framework supports training task-specific classifiers on proprietary data or running RL optimization loops.
Gap 4
BIG-PHARMA ASSUMPTIONS
Frameworks assume large datasets, compute clusters, and dedicated teams. Small biotechs with 50–500 sequences, single GPUs, and one-person teams have no pathway for transfer learning or few-shot adaptation.
Gap 5
SINGLE-OBJECTIVE OPTIMIZATION
Real drug discovery navigates safety-efficacy-stability trade-offs. Current agents optimize single metrics or weighted sums, ignoring Pareto frontiers, uncertainty quantification, and constraint satisfaction.
The "Stranded Knowledge" Problem
[Diagram: frontier LLMs hold peptide knowledge (✓), but agent architectures expose only SMILES-based tools (✗), leaving peptide, in vivo, and multi-objective drug discovery blocked.] "The peptide expertise is present in foundation models but stranded behind small-molecule-only agent architectures."
[Charts: framework coverage comparison (%); task coverage by gap category]

Capability Matrix

Each framework–task pair was assessed at three levels: full support (✓), partial support (◦), and not supported (×). Partial support received a weight of 0.5 for coverage scores. No framework achieves full support on any task.
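The scoring rule above reduces to a few lines of Python; the marks below are illustrative, not the paper's actual matrix:

```python
# Coverage scoring as described above: full support = 1.0,
# partial support = 0.5, not supported = 0.0; weighted coverage
# is the mean support weight over all assessed task classes.
WEIGHTS = {"full": 1.0, "partial": 0.5, "none": 0.0}

def weighted_coverage(marks):
    """Mean support weight over all task classes."""
    return sum(WEIGHTS[m] for m in marks) / len(marks)

# Illustrative: 5 partially supported tasks out of 15 yields
# (5 * 0.5) / 15 = 16.7%, the best coverage reported above.
marks = ["partial"] * 5 + ["none"] * 10
print(f"{100 * weighted_coverage(marks):.1f}%")  # → 16.7%
```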

[Capability matrix: each of the 15 task classes rated ✓/◦/× per framework (CC = ChemCrow, CS = Coscientist, PA = PharmAgents, CI = ChatInvent, MA = MADD, DV = DiscoVerse), with the gap category for each task.]

Evaluation Dimensions

ID | Dimension | Description
D1 | Molecular representation | Support for peptides, proteins, and biologics beyond SMILES strings and molecular fingerprints
D2 | Computational paradigm | Capacity for ML training, RL, simulation, and constrained optimization beyond LLM inference
D3 | Data modality integration | Handling of in vivo longitudinal data, imaging, transcriptomics, and behavioral data
D4 | Resource assumptions | Alignment with varying data volumes, compute budgets, and team sizes
D5 | Optimization framework | Multi-objective optimization, uncertainty quantification, and constraint satisfaction
[Chart: partial vs. no support by framework]

Five Capability Gaps

These gaps are analytically distinct but practically intertwined: multi-paradigm orchestration (Gap 3) is a prerequisite for peptide-aware workflows (Gap 1), and multi-objective reasoning (Gap 5) is needed for both in vivo translation (Gap 2) and resource-constrained decision-making (Gap 4).

1
Small-Molecule Representation Bias
Tasks T2, T3, T5, T6, T7 — zero coverage across all frameworks

All six frameworks are architected around SMILES strings, molecular fingerprints, and docking scores. This suits small-molecule medicinal chemistry but breaks down for peptide therapeutics (5–50 amino acids), which require conformational sampling, aggregation prediction, and protease-resistance modeling.

No framework supports protein language models (ESM-2, ProtBERT, ProGen) as first-class components. The diagnostic experiment shows this is not a knowledge limitation — LLMs reason about peptides at levels comparable to small molecules. The peptide expertise is stranded behind small-molecule-only architectures.

Required capabilities: PLM fine-tuning pipelines, structural biology integration (AlphaFold, flexible docking, MD), sequence-aware data management, diversity-aware generation with mode-collapse prevention.

Affected tasks: T2 generative peptide design; T3 peptide-receptor binding; T5 enzyme stability; T6 PLM receptor prediction; T7 Monte Carlo optimization; T14 RL peptide generation.
2
Absence of In Vivo–In Silico Integration
Tasks T4, T8, T9, T10, T11, T12, T13 — zero or minimal coverage

Current agents excel at in vitro automation, but the critical validation happens in vivo, where candidates confront pharmacokinetics, biodistribution, metabolism, and long-term efficacy. Animal studies generate longitudinal, multi-modal data: behavioral scores over weeks, tissue histology, RNA sequencing, and clinical notes.

No agent integrates these temporal data streams. In neurological injury models, efficacy manifests through staged recovery endpoints — early motor improvements, subsequent neuroinflammation reduction, and longer-term neurogenesis. This gap between in vitro promise and in vivo reality is where most development cost and risk actually sit.

Required capabilities: Temporal data pipelines, multi-modal integration (imaging + omics + behavioral), predictive bridging models between in vitro and in vivo endpoints.
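As a sketch of what such a temporal pipeline must carry, one subject's record could key each modality by study day so behavioral, histological, and omics readouts share a timeline. The field names and values below are illustrative, not from the paper:

```python
# Minimal longitudinal record for one animal: each modality is a
# mapping from study day to an observation, so all streams can be
# queried on a common time axis.
from dataclasses import dataclass, field

@dataclass
class InVivoRecord:
    subject_id: str
    behavior: dict = field(default_factory=dict)   # day -> motor score
    histology: dict = field(default_factory=dict)  # day -> lesion volume (mm^3)
    rnaseq: dict = field(default_factory=dict)     # day -> path to counts matrix

    def timeline(self):
        """All study days with at least one observation, sorted."""
        days = set(self.behavior) | set(self.histology) | set(self.rnaseq)
        return sorted(days)

rec = InVivoRecord("rat_017")
rec.behavior = {3: 12.0, 7: 15.5, 14: 18.0}   # staged motor recovery
rec.histology = {14: 4.2}
rec.rnaseq = {7: "counts/rat_017_d7.tsv"}
print(rec.timeline())  # → [3, 7, 14]
```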

Affected tasks: T4 in vivo recovery modeling; T8 RNA-seq/scRNA-seq; T9 image quantification; T10 immune profiling; T11 functional annotation; T12 behavioral phenotyping; T13 in vivo/in vitro bridging.
3
Limited Computational Paradigm Support
Tasks T1, T7, T14 — LLM inference only, no ML training or RL

The architectural pattern is consistent: an LLM orchestrates tool calls, synthesizes results, and generates explanations. This works for text-based reasoning but excludes gradient-based ML training, reinforcement learning, Monte Carlo simulation, and constrained optimization — all fundamental to drug discovery.

Developing a receptor binding classifier involves curating training sets, extracting ESM-2 embeddings, training supervised classifiers, validating performance, and iterating hyperparameters. This is gradient-based ML, not API calls. Current agents treat models as inference-only endpoints.
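A minimal sketch of that missing training step, in plain Python: a logistic-regression binder classifier fit by stochastic gradient descent. The 8-dimensional vectors stand in for real ESM-2 embeddings (which run hundreds to thousands of dimensions), and both data and labels are synthetic:

```python
# Gradient-based training, not an API call: fit a binder/non-binder
# logistic regression on toy "embeddings" via SGD on the log-loss.
import math, random

random.seed(0)
DIM = 8  # stand-in for the real ESM-2 embedding dimension

def make_example(label):
    # Synthetic embedding: binders (label 1) shifted positively per axis.
    shift = 0.5 if label == 1 else -0.5
    return [random.gauss(shift, 1.0) for _ in range(DIM)], label

data = [make_example(i % 2) for i in range(200)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.1, epochs=50):
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

w, b = train(data)
acc = sum(
    (sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5) == (y == 1)
    for x, y in data
) / len(data)
print(f"train accuracy: {acc:.2f}")
```

In a real pipeline the embedding step would call a pre-trained PLM and the classifier would be cross-validated; the point here is only that the loop involves gradients, checkpoints, and iteration, none of which current agents orchestrate.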

Required capabilities: Orchestration of training workflows (data → model → evaluation), RL loops with reward shaping, hyperparameter optimization, model checkpointing, and compute resource management.

Affected tasks: T1 ML bioactivity prediction; T7 Monte Carlo optimization; T14 RL peptide generation.
4
Misalignment with Small-Biotech Constraints
Cross-cutting — affects all tasks in resource-constrained settings

Small biotechs face fundamentally different constraints than AstraZeneca: 50–500 proprietary sequences vs. millions, single GPU vs. clusters, one person handling design, modeling, and analysis. Transfer learning and few-shot adaptation are prerequisites, not luxuries.

Current agents assume abundant resources and long, interactive cycles. There is no pathway for efficient adaptation from pre-trained models to small proprietary datasets, no compute-aware workflow planning, and no support for the multi-role practitioner who is simultaneously drug designer and computational biologist.

Required capabilities: Transfer learning from pre-trained PLMs, few-shot adaptation, compute-aware scheduling, batch pipeline execution, and single-operator workflow patterns.

Cross-cutting: all 15 tasks
5
Single-Objective Optimization Assumptions
Task T15 — partial coverage only, no true MOO

A peptide with tenfold higher bioactivity may have narrower safety margins or reduced stability. Real drug discovery navigates multi-objective trade-offs under uncertainty. Current agents optimize single metrics or weighted sums, ignoring Pareto frontiers and uncertainty quantification.

Practitioners manually reason about these trade-offs, which slows iteration and increases decision risk. Curriculum learning — initially rewarding bioactivity, then progressively adding stability and toxicity constraints — prevents local optima and maintains diversity. No current agent supports this.

Required capabilities: Pareto frontier computation, uncertainty-aware selection, curriculum-based optimization, constraint satisfaction with hard bounds, and interactive trade-off visualization for human-in-the-loop decisions.
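The first of those capabilities, Pareto frontier computation, reduces to a non-domination filter. Candidate names and scores below are illustrative; note that a weighted sum would discard pep_c, which the frontier keeps:

```python
# Keep only non-dominated candidates across several objectives
# (all treated as higher-is-better here).
def dominates(a, b):
    """a dominates b: >= on every objective, > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for o, other in candidates.items() if o != name)
    }

# Objectives: (bioactivity, stability, safety) — illustrative values.
candidates = {
    "pep_a": (0.9, 0.3, 0.5),
    "pep_b": (0.6, 0.8, 0.7),
    "pep_c": (0.4, 0.9, 0.9),
    "pep_d": (0.5, 0.2, 0.4),  # dominated by pep_b on all three axes
}
print(sorted(pareto_front(candidates)))  # → ['pep_a', 'pep_b', 'pep_c']
```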

Affected tasks: T15 safety/toxicology MOO; implicit in T2, T5, and T14.
[Chart: gap interconnection (dimension coverage by gap)]

LLM Knowledge Probing

Do frontier LLMs actually lack peptide knowledge, or is the bottleneck in agent architectures? A paired diagnostic experiment reveals the answer.

Four frontier LLMs from independent training pipelines were tested on 50 matched question pairs spanning five pharmaceutical knowledge categories. Each pair tested the same cognitive skill in two modality contexts — one small-molecule, one peptide — controlling for question difficulty and format.

At a glance: 200 paired observations; p = 1.0 after Bonferroni adjustment; 4 of 5 categories favor peptides; judge-model stability κ = 0.78.

Per-Model Comparison

Model | Small Mol. | Peptide
KIMI K2.5 (Moonshot) | 2.18 | 2.06
DEEPSEEK V3.2 | 2.14 | 2.10
QWEN 3 NEXT 80B (Alibaba) | 2.22 | 2.16
GEMINI 3 FLASH (Google) | 2.08 | 2.12

Each model was also scored per category (SAR, ADMET, generative design, optimization, assay interpretation).

Per-Category Analysis

[Charts: small-molecule vs. peptide scores by category; score gap (peptide − small molecule)]
Methodology Notes

Scoring: 0–3 rubric (wrong → expert-level). Primary scoring by Claude Sonnet 4.5 as LLM-as-judge; domain expert validated stratified 20% subset (N=80) under blind protocol. Sensitivity check with Claude Opus confirmed stability (quadratic-weighted κ=0.78).

Statistics: Paired Wilcoxon signed-rank tests per model (Bonferroni α=0.0125), matched-pairs rank-biserial correlation for effect size, Friedman test for cross-model consistency (χ²=0.454, p=0.929), bootstrap 95% CIs.
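The per-model test can be sketched with a hand-rolled normal-approximation Wilcoxon signed-rank statistic (in practice one would call `scipy.stats.wilcoxon`); the paired scores below are synthetic, not the study's data:

```python
# Two-sided paired Wilcoxon signed-rank test, normal approximation,
# with a Bonferroni-corrected threshold across four models.
import math

def wilcoxon_signed_rank(xs, ys):
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to ties in |diff|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Synthetic matched scores for one model: small-molecule vs. peptide answers.
sm  = [2.2, 2.0, 2.4, 2.1, 2.3, 1.9, 2.5, 2.0, 2.2, 2.1]
pep = [2.1, 2.1, 2.3, 2.2, 2.2, 2.0, 2.4, 2.1, 2.1, 2.2]
p = wilcoxon_signed_rank(sm, pep)
alpha = 0.05 / 4  # Bonferroni correction across the four models
print(p > alpha)  # fails to reject: no modality gap in this toy data
```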

Key finding: Expert validation revealed models produce directionally correct but quantitatively overconfident reasoning in both domains equally — a depth limitation, not a modality-specific knowledge gap.

Framework Profiles

Six frameworks representing distinct design paradigms were evaluated: single-agent, multi-agent, and tool-augmented systems. Selection required published documentation, demonstrated drug discovery application, and distinct architectural paradigm.

ChemCrow
Tool-Augmented Single Agent
Orchestrates 18 chemistry tools via LLM routing. Routes requests to RDKit, PubChem, and reaction prediction APIs. Effective for small-molecule property calculation, retrosynthesis, and safety assessment.
Weighted coverage: 6.7%. Tools: RDKit, PubChem, reaction prediction API, safety lookup.
Coscientist
Lab-Automation Agent
Autonomously plans chemical syntheses and interfaces with laboratory automation. Strong at protocol generation and experimental execution in automated lab environments. Narrowly scoped to synthesis workflows.
Weighted coverage: 0.0%. Tools: lab hardware interface, protocol generation, synthesis planning.
PharmAgents
Knowledge Graph Multi-Agent
Integrates knowledge graphs for target identification and drug repurposing. Multi-agent collaboration with specialized agents for different pipeline stages. Best coverage (16.7%) due to broader tool integration.
Weighted coverage: 16.7%. Tools: knowledge graph, target identification, drug repurposing, ADMET.
ChatInvent
Molecular Design Agent (AstraZeneca)
Deployed at AstraZeneca for molecular design and synthesis planning. Generates molecular designs informed by literature. Optimized for large-pharma workflows with extensive proprietary data and compute resources.
Weighted coverage: 10.0%. Tools: literature mining, molecular design, synthesis routes.
MADD
Multi-Agent Drug Discovery
Multi-agent collaboration for drug discovery with specialized agents for different tasks. Promises end-to-end pipeline coverage through agent coordination. Limited by small-molecule-centric tool integrations.
Weighted coverage: 10.0%. Tools: multi-agent pipeline coordination.
DiscoVerse
Collaborative Discovery Platform
Multi-agent collaboration platform for scientific discovery. Broader scope than pure drug discovery, with agents for literature synthesis, hypothesis generation, and experimental design. Second-best coverage (13.3%).
Weighted coverage: 13.3%. Tools: literature synthesis, hypothesis generation, experiment design, collaboration.

Workflow Complexity: Small Molecules vs. Peptides

Small Molecule Workflow (Linear)
SMILES representation → RDKit property calculation → docking score → retrosynthesis → candidate selection

Peptide Workflow (Branching)
FASTA sequence → parallel branches: structure prediction (AlphaFold), aggregation propensity, protease stability, immunogenicity (MHC binding), membrane permeability, conformational sampling → convergence → multi-objective Pareto frontier
[Chart: framework architecture comparison]

Design Requirements for Next-Generation Frameworks

From the five capability gaps, the paper derives concrete design requirements for agentic systems that function as computational partners under realistic constraints — not just demo-ware optimized for small-molecule benchmarks.

  1. Protein Language Model Integration Accept FASTA files, extract PLM embeddings (ESM-2, ProtBERT), fine-tune on proprietary data (≥50 sequences), return calibrated classifiers with per-class AUC-ROC and uncertainty estimates. Support generative models (ProtGPT2, PepTune) with diversity metrics and mode-collapse prevention.
  2. Structural Biology Pipeline Chain AlphaFold structure prediction → flexible docking → molecular dynamics → binding analysis into multi-step workflows. Support peptide conformational sampling, not just rigid-body docking. Feed structural results back into sequence optimization loops.
  3. In Vivo Data Integration Ingest longitudinal multi-modal data: behavioral scores, tissue histology, RNA-seq, and clinical notes. Build temporal models for staged recovery endpoints. Bridge in vitro predictions to in vivo outcomes through predictive transfer models.
  4. Multi-Paradigm Orchestration Go beyond LLM inference to orchestrate ML training (supervised, self-supervised), reinforcement learning (reward shaping, KL regularization), Monte Carlo sampling, and constrained optimization. Manage compute resources, checkpoints, and hyperparameter search.
  5. Resource-Adaptive Workflows Provide pathways for small biotechs: transfer learning from pre-trained models to 50–500 sequence datasets, single-GPU training pipelines, batch execution modes, and compute-aware scheduling. Support the multi-role practitioner workflow.
  6. Multi-Objective Optimization Implement Pareto frontier computation, uncertainty-aware selection, and curriculum-based optimization (progressive constraint addition). Support interactive trade-off visualization for human-in-the-loop decisions with hard constraint satisfaction.
  7. Active Learning Loops Integrate acquisition functions (expected improvement, uncertainty sampling) for peptide selection under limited experimental budgets. Support iterative design-test cycles with closed-loop feedback from experimental results to model updates.
  8. Sequence-Aware Data Management Track peptide modifications, non-natural amino acids, synthesis constraints, and assay provenance across iterative design cycles. Maintain versioned datasets with lineage tracking, not flat SMILES databases.
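Requirement 7's acquisition step can be sketched as expected improvement (EI) under a Gaussian surrogate prediction. The peptide IDs, predicted means, and uncertainties below are illustrative; in a real loop they would come from a trained surrogate model:

```python
# Rank unlabeled peptides by expected improvement over the best
# assay result observed so far (maximization, Gaussian predictions).
import math

def expected_improvement(mu, sigma, best):
    """EI for a prediction N(mu, sigma^2) against the current best."""
    if sigma == 0:
        return max(0.0, mu - best)
    z = (mu - best) / sigma
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    return (mu - best) * cdf + sigma * pdf

best_observed = 0.80  # best assay result so far
pool = {  # peptide id -> (predicted activity, predictive std)
    "pep_01": (0.78, 0.02),  # good mean, low uncertainty
    "pep_02": (0.70, 0.15),  # worse mean, high uncertainty
    "pep_03": (0.82, 0.01),  # slightly above best, confident
}
ranked = sorted(pool, key=lambda p: expected_improvement(*pool[p], best_observed),
                reverse=True)
print(ranked[0])  # top candidate goes into the next assay batch
```

EI trades off exploitation against exploration: here the high-uncertainty candidate can outrank the confidently-slightly-better one, which is exactly the behavior a budget-limited design-test loop needs.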
[Chart: gap-to-requirement mapping]

References

  1. Boiko, D. A. et al. "Autonomous chemical research with large language models." Nature 624, 570–578 (2023). doi:10.1038/s41586-023-06792-0
  2. Bran, A. M. et al. "ChemCrow: Augmenting large-language models with chemistry tools." Nature Machine Intelligence 6, 525–535 (2024). doi:10.1038/s42256-024-00832-8
  3. Landreth, A. et al. "ChatInvent: molecular design and synthesis planning at AstraZeneca." ChemRxiv (2025).
  4. Zhang, H. et al. "PharmAgents: AI agents for drug target identification and knowledge graphs." arXiv (2025).
  5. Lee, J. et al. "TxGemma: therapeutics-focused language understanding." Google Research (2025).
  6. Wang, Y. et al. "MADD: Multi-Agent Drug Discovery." arXiv (2025).
  7. Chen, L. et al. "DiscoVerse: Collaborative multi-agent scientific discovery." arXiv (2025).
  8. Zheng, S. et al. "Large language models for drug discovery." Drug Discovery Today 29(4), 103938 (2024).
  9. Savage, J. et al. "Drug discovery and development with AI agents." Nature Reviews Drug Discovery (2025).
  10. Lin, Z. et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379, 1123–1130 (2023). doi:10.1126/science.ade2574
  11. Elnaggar, A. et al. "ProtTrans: Toward understanding the language of life through self-supervised learning." IEEE TPAMI 44(10), 7112–7127 (2022).
  12. Wei, J. et al. "BioPlanner: Automatic evaluation of LLMs on protocol planning in biology." arXiv (2023).
  13. Liu, F. et al. "ChemToolAgent: A comprehensive benchmark for LLM-based chemistry tools." arXiv (2024).
  14. Park, S. et al. "Agentomics: Autonomous ML for omics data." arXiv (2025).
  15. Kim, J. et al. "ML-Agent: Automated machine learning pipelines." arXiv (2025).
  16. Schneider, P. et al. "Rethinking drug design in the artificial intelligence era." Nature Reviews Drug Discovery 19, 353–364 (2020). doi:10.1038/s41573-019-0050-3
  17. Vamathevan, J. et al. "Applications of machine learning in drug discovery and development." Nature Reviews Drug Discovery 18, 463–477 (2019).
  18. Zheng, L. et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." NeurIPS (2023).
  19. Madani, A. et al. "Large language models generate functional protein sequences across diverse families." Nature Biotechnology 41, 1099–1106 (2023).
  20. Jain, N. et al. "PepTune: De novo generation of therapeutic peptides with multi-objective optimization." NeurIPS (2025).
  21. Guntuboina, C. et al. "PepMLM: Target sequence-conditioned generation of peptide binders via masked language modeling." arXiv (2024).
  22. Ferruz, N. et al. "ProtGPT2 is a deep unsupervised language model for protein design." Nature Communications 13, 4348 (2022).
  23. Jumper, J. et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596, 583–589 (2021). doi:10.1038/s41586-021-03819-2
  24. Wijaya, E. "Beyond SMILES: Evaluating Agentic Systems for Drug Discovery." arXiv:2602.10163 (2026).