Research on the Next Generation of Agentic Infrastructure

Benchmarking, evaluating, and optimizing real-world AI systems across quality, latency, and cost

Product-driven Research. Research-driven Product.

We believe the next wave of AI progress comes not just from stronger models, but from better systems around them: how workflows are routed, how performance is measured in production, and how that performance improves over time. OpenMesh treats product and research as one loop. Real-world usage exposes the failure modes and cost-quality tradeoffs that isolated benchmarks miss, and research turns them into better routing and more reliable systems.

OpenMesh: A Lab for Production AI Systems

May 17, 2026

Distribution-Free Uncertainty for Continuous Agent Evaluation

Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation

Brings conformal prediction to live agent scoring, giving forecasted quality scores honest, distribution-free confidence intervals that widen automatically when an agent ships a release.

Accepted to International Conference on Machine Learning (ICML) 2026 AgenticUQ Workshop

May 7, 2026

DecisionBench: Benchmarking Skill-Aware Emergent Orchestration in Long-Horizon Agentic Workflows

A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

A benchmark for emergent delegation that measures not just task quality but how an orchestrator routes each step, finding that delivery channel beats description content and that perfect delegation sits 15 to 31 points above current performance.

In Submission to NeurIPS 2026 D&B Track

May 1, 2026

Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

Measures inference at the endpoint rather than the model, scoring joules and dollars per correct answer, and shows that the same model varies up to 6x in energy and that rankings reorder entirely once you price them by real workload.

April 28, 2026

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Scores deployed agents from 18 live signals across benchmark, adoption, sentiment, and ecosystem health, showing that task-completion capability barely predicts which tools developers actually adopt in practice.

Accepted to Conference on AI and Agentic Systems (CAIS) 2026 AID-Wild Workshop

April 22, 2026

IntelligenceArena: A Quantitative Framework for Real-Time Scoring of Frontier LLMs and AI Agents

Continuously ranks agents on capability, adoption, sentiment, and ecosystem health, revealing where benchmark-only leaderboards diverge from the tools developers actually choose and trust in production.

Accepted to Conference on AI and Agentic Systems (CAIS) 2026 AgenticSE Workshop

Interested in researching with us?

We are interested in speaking with researchers, engineers, and early partners who care about the future of agentic infrastructure systems.