Auditable LLM Arbiter for DeFi Security
The gap between what you tell an agent and what it actually executes onchain is real, exploitable, and unsolved. Nava's Arbiter combines deterministic rules with semantic reasoning to verify intent-to-transaction alignment before funds move. Peer-reviewed at the LAST-X Workshop at NDSS 2026.
A Hybrid Graph-of-Thoughts Approach to Intent–Transaction Alignment
We recently published the research behind Nava's core verification engine. The paper, accepted at the LAST-X Workshop at NDSS 2026, introduces two things: a new benchmark dataset for evaluating intent-transaction alignment in DeFi, and Arbiter, our hybrid validation framework that sits between what an agent is told to do and what it actually executes onchain.
The problem we set out to solve
When an AI agent receives an instruction like "swap 1,500 USDC to WETH, maximum 0.5% slippage, keep gas under 20 gwei," something has to verify that the transaction it constructs actually honors those constraints before funds move. Today, nothing does this reliably.
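To make this concrete, here is a minimal sketch of the deterministic side of that verification: parsing the example instruction into structured constraints and checking a proposed transaction against them with exact integer arithmetic. All field names, the intent schema, and the transaction layout are illustrative assumptions, not Arbiter's actual data model.

```python
# Hypothetical sketch: the deterministic checks a validator must run before
# funds move. Field names and schema are illustrative, not Arbiter's own.

USDC_DECIMALS = 6   # USDC uses 6 decimals onchain
GWEI = 10**9        # 1 gwei in wei

# "swap 1,500 USDC to WETH, maximum 0.5% slippage, keep gas under 20 gwei"
intent = {
    "action": "swap",
    "token_in": "USDC",
    "amount_in": 1500,           # human-readable units
    "token_out": "WETH",
    "max_slippage_bps": 50,      # 0.5% expressed in basis points
    "max_gas_price_gwei": 20,
}

proposed_tx = {
    "token_in": "USDC",
    "token_out": "WETH",
    "amount_in_raw": 1500 * 10**USDC_DECIMALS,   # decimal-scaled amount
    "quoted_amount_out": 10**18,                 # quote: 1.0 WETH (18 decimals)
    "min_amount_out": 995 * 10**15,              # accepts down to 0.995 WETH
    "gas_price_wei": 18 * GWEI,
}

def check_amount(intent, tx):
    # Exact match after decimal scaling; a silent 10**18 scaling bug fails here.
    return tx["amount_in_raw"] == intent["amount_in"] * 10**USDC_DECIMALS

def check_slippage(intent, tx):
    # Integer basis-point math avoids float rounding on 18-decimal amounts.
    floor = tx["quoted_amount_out"] * (10_000 - intent["max_slippage_bps"]) // 10_000
    return tx["min_amount_out"] >= floor

def check_gas(intent, tx):
    return tx["gas_price_wei"] <= intent["max_gas_price_gwei"] * GWEI

for name, check in [("amount", check_amount),
                    ("slippage", check_slippage),
                    ("gas", check_gas)]:
    print(name, "PASS" if check(intent, proposed_tx) else "FAIL")
```

Note that these checks are exactly the ones a rule engine handles well; what no amount of rules can tell you is whether "swap to WETH" was what the user actually meant.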
Existing approaches each fail in a distinct way. Rule-based validators are precise about protocol-level checks: address format, decimal scaling, deadline expiry. But they cannot reason about whether a transaction actually matches what a user meant. A transaction can pass every local rule and still silently violate the user's intent. LLM-based validators can reason about natural language goals, but they hallucinate technical facts, mishandle numerical precision, and have no grounding in protocol invariants. Worse, chaining multiple LLM reasoning steps together compounds these errors rather than correcting them.
Neither approach is enough on its own. We needed something that combined both.
INTENT-TX-18K: the benchmark
Before we could evaluate any solution, we needed a rigorous dataset. No large-scale benchmark existed for intent-transaction alignment in DeFi, so we built one.
INTENT-TX-18K contains 18,000 intent-transaction pairs drawn from real onchain activity on Ethereum mainnet over a representative month, June 2025. Each example includes a natural language intent, a structured proposed transaction with resolved contract addresses and token decimals, and a ground-truth label: ACCEPT or REJECT, with a failure reason when applicable.
The dataset spans five categories: aligned transactions, intent misalignment, technical violations, adversarial manipulation, and legal violations. We deliberately constructed tone-diverse intents, spanning power-user shorthand, cautious retail phrasing, multilingual inputs, and bullet-style checklists, to ensure the benchmark reflects how real users actually express what they want.
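For readers who want a feel for the data, the sketch below shows what a single record might look like. The field names are our illustrative guesses from the description above, not the dataset's actual schema; consult the repository for the real format.

```python
# Illustrative INTENT-TX-18K-style record. Schema and field names are guesses
# for exposition only; the real format lives in the public repository.
record = {
    "intent": ("swap 1,500 USDC to WETH, maximum 0.5% slippage, "
               "keep gas under 20 gwei"),
    "transaction": {
        "to": "0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D",  # example router
        "token_in": {"symbol": "USDC", "decimals": 6},
        "token_out": {"symbol": "WETH", "decimals": 18},
        # 1500 scaled with 18 decimals instead of USDC's 6:
        "amount_in_raw": "1500000000000000000000",
    },
    "label": "REJECT",
    "category": "technical_violation",
    "failure_reason": "amount scaled with 18 decimals instead of USDC's 6",
}

# A technical violation like this one passes a naive reading of the intent
# but moves 10**12 times the amount the user asked for.
scaled_ok = int(record["transaction"]["amount_in_raw"]) == 1500 * 10**6
print(record["label"], record["category"], "scaling_correct:", scaled_ok)
```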
INTENT-TX-18K is publicly available here: https://github.com/duanyiyao/intent-tx-18k
Arbiter: a hybrid Graph-of-Thoughts validator
Arbiter decomposes intent-transaction validation into a directed acyclic graph of atomic checks. Some nodes are deterministic rule-based functions. Others are LLM-based semantic checks. The graph structure captures their dependencies: low-level checks on address correctness, decimal scaling, and protocol structure must pass before higher-level checks on intent alignment and adversarial detection can proceed.
This matters because it preserves the strengths of both approaches where they apply. Deterministic nodes handle exactly the checks where precision is critical and hallucination is intolerable. Semantic nodes handle the checks that require understanding natural language goals, cross-field consistency, and manipulative intent patterns. Neither operates alone, and neither undermines the other.
When any safety-critical node fails, such as a sanctions screening check or a token mismatch, Arbiter immediately returns a REJECT verdict and stops. It does not continue running unnecessary checks. The result is both a binary decision and a structured explanation: the exact nodes that failed, with human-readable reasons for each.
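The behavior described above, dependency-ordered checks with an immediate stop on safety-critical failure, can be sketched as follows. The node names, the `semantic_check` stub standing in for an LLM call, and the fail-fast policy details are all illustrative assumptions, not Arbiter's actual implementation.

```python
# Sketch of a Graph-of-Thoughts style validator: checks form a DAG,
# low-level deterministic nodes run before the semantic nodes that depend
# on them, and a safety-critical failure short-circuits to REJECT.
# Node names and policies are illustrative, not Arbiter's real graph.
from graphlib import TopologicalSorter

SANCTIONED = {"0xbad"}  # stand-in sanctions list

def semantic_check(question: str) -> tuple[bool, str]:
    # Stand-in for an LLM-based semantic node; a real system queries a model.
    return True, "ok"

# node -> (check function, prerequisite nodes, safety_critical?)
GRAPH = {
    "address_format":   (lambda tx: (tx["to"].startswith("0x"),
                                     "malformed address"), set(), True),
    "decimal_scaling":  (lambda tx: (isinstance(tx["amount_raw"], int)
                                     and tx["amount_raw"] > 0,
                                     "bad decimal scaling"),
                         {"address_format"}, True),
    "sanctions_screen": (lambda tx: (tx["to"] not in SANCTIONED,
                                     "sanctioned counterparty"),
                         {"address_format"}, True),
    "intent_alignment": (lambda tx: semantic_check("does tx match intent?"),
                         {"decimal_scaling", "sanctions_screen"}, False),
}

def validate(tx):
    failed = {}
    order = TopologicalSorter({n: deps for n, (_, deps, _) in GRAPH.items()})
    for node in order.static_order():
        fn, deps, critical = GRAPH[node]
        if any(d in failed for d in deps):
            continue  # prerequisites failed; this check is meaningless
        ok, reason = fn(tx)
        if not ok:
            failed[node] = reason
            if critical:
                return "REJECT", failed  # fail fast: stop all further checks
    return ("REJECT", failed) if failed else ("ACCEPT", failed)

verdict, reasons = validate({"to": "0xabc", "amount_raw": 1_500_000_000})
print(verdict, reasons)
```

The topological ordering is what keeps the semantic nodes honest: by the time an LLM-based check runs, every address, decimal, and sanctions fact it depends on has already been verified deterministically, and the returned `failed` map is exactly the localized, human-readable explanation the verdict carries.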
What the results show
In our research experiments, Arbiter substantially outperformed both pure rule-based and pure LLM-based baselines across all five violation categories. Pure LLM approaches collapsed on technical violations and adversarial manipulations. Rule-only approaches missed intent misalignment and adversarial cases entirely. Arbiter kept all five violation categories above 90% recall, something neither approach achieves alone.
Beyond classification accuracy, Arbiter correctly identifies where a transaction fails, not just that it fails. For auditors and operators, this means the system surfaces actionable, localized failure reasons rather than opaque verdicts.
On latency, Arbiter completes verification faster than multi-step LLM approaches while delivering substantially higher accuracy in our benchmark evaluations.
What this means for Nava
This research is the foundation of what we are building. Production performance and protocol coverage will evolve as we expand beyond our initial testnet, and we'll publish results as we go. What the research establishes is the core principle: closing the intent-to-execution gap requires something that neither pure rules nor pure LLMs can provide on their own.
Nava's Arbiter is our answer, and this paper is the evidence behind the approach.
Read the full paper here, and join the private testnet waitlist to build with Nava.
Read the paper → https://www.ndss-symposium.org/wp-content/uploads/lastx2026-46.pdf
Join the waitlist → https://navalabs.ai/
