AI MODEL AUDIT ARENA

Comprehensive Performance Analysis

Smart Contract Security Evaluation

Across 22 AI Models

  • 22 AI models evaluated
  • 2 meta-evaluators
  • 22 unique vulnerabilities
  • 2 consensus critical issues

Target Contract: Ekubo Protocol Core
0xe0e0e08A6A4b9Dc7bD67BCB7aadE5cF48157d444

Meta-Evaluators: GPT-5 Medium & Claude Sonnet 4.5

Analysis Date: 2025-11-13

Executive Summary

This report presents a first-of-its-kind comprehensive comparison of AI model performance in smart contract security auditing.

Two leading AI models—GPT-5 Medium and Claude Sonnet 4.5—independently evaluated 22 different AI audit reports for the Ekubo Protocol Core contract, providing unique insights into model capabilities, blind spots, and strengths.

Key Findings

Metric | Value | Insight
Models Evaluated | 22 | Comprehensive coverage of leading AI models
Top Performer | sonnet_4_5 (54/60) | 90% score; exceptional across all dimensions
Worst Performer | qwen_code (21/60) | 35% score; fundamental misunderstandings
Performance Gap | 2.6x difference | The wide variance makes model selection critical
Consensus Critical | 2 vulnerabilities | Only 2 issues achieved multi-model agreement
Evaluator Correlation | r = 0.61 | Moderate agreement between evaluators
Average Score | 40.5/60 (67.5%) | Most models perform adequately
A-Grade or Higher | 27% of models | 6 models achieved excellence

Model Performance Leaderboard

The following leaderboard ranks all 22 AI models by their overall performance score (out of 60 points). Scores are based on six evaluation dimensions: Accuracy, Completeness, Severity Assessment, Clarity, False Positive Rate, and Technical Depth.
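For reference, the arithmetic behind each entry is simply the sum of the six dimension scores. The Python sketch below reproduces it; the letter-grade cutoffs are inferred from the published score/grade pairs in this report, not taken from an official rubric.

```python
# Minimal sketch of the leaderboard scoring: six 0-10 dimension scores
# sum to a 0-60 total. GRADE_BANDS is an assumption reverse-engineered
# from the published totals and grades, not the report's stated rubric.
DIMENSIONS = ("accuracy", "completeness", "severity", "clarity",
              "false_positive_rate", "technical_depth")

# (minimum total, grade) pairs, checked from highest cutoff to lowest.
GRADE_BANDS = [(52, "A+"), (47, "A"), (44, "A-"), (41, "B+"),
               (39, "B"), (36, "B-"), (33, "C+"), (30, "C"), (0, "F")]

def score_model(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the six dimension scores and map the total to a letter grade."""
    total = sum(scores[d] for d in DIMENSIONS)
    grade = next(g for cutoff, g in GRADE_BANDS if total >= cutoff)
    return total, grade

# Example: sonnet_4_5's published dimension scores.
sonnet = dict(zip(DIMENSIONS, (9, 8, 9, 10, 8, 10)))
print(score_model(sonnet))  # (54, 'A+')
```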

#1 | sonnet_4_5 | 54/60 | A+
#2 | polaris_alpha | 52/60 | A+
#3 | gemini_2_5_pro | 52/60 | A+
#4 | gpt_5_med | 49/60 | A
#5 | gpt5_codex_med | 49/60 | A
#6 | grok_code_fast | 48/60 | A
#7 | qwen_next | 45/60 | A-
#8 | kat_coder | 43/60 | B+
#9 | kimik2_thinking | 42/60 | B+
#10 | oss_20b | 42/60 | B+
#11 | oss_120b | 40/60 | B
#12 | cerebras_0_2 | 40/60 | B
#13 | qwen_reap_264 | 39/60 | B
#14 | deepseek_3_2 | 39/60 | B
#15 | glm_4_6 | 38/60 | B-
#16 | seed_oss_36b | 38/60 | B-
#17 | qwen_coder_30b | 37/60 | B-
#18 | cerebras_1_0 | 37/60 | B-
#19 | qwen_4b | 35/60 | C+
#20 | qwen3_max | 34/60 | C+
#21 | minimax | 32/60 | C
#22 | qwen_code | 21/60 | F

Performance Distribution Insights

  • Top Tier (A+ to A): 6 models (27%) - Production-ready audit quality
  • Mid Tier (A- to B-): 12 models (55%) - Adequate for assisted auditing with validation
  • Low Tier (C+ and below): 4 models (18%) - Require significant validation or should be avoided
  • Performance Gap: 2.6x between best (54/60) and worst (21/60) - model selection is critical

Top 5 Models: Detailed Performance Profiles

#1 sonnet_4_5: 54/60 (A+)

Accuracy 9/10 | Completeness 8/10 | Severity Assessment 9/10 | Clarity 10/10 | False Positive Rate 8/10 | Technical Depth 10/10

#2 polaris_alpha: 52/60 (A+)

Accuracy 9/10 | Completeness 4/10 | Severity Assessment 10/10 | Clarity 10/10 | False Positive Rate 10/10 | Technical Depth 9/10

#3 gemini_2_5_pro: 52/60 (A+)

Accuracy 10/10 | Completeness 5/10 | Severity Assessment 10/10 | Clarity 9/10 | False Positive Rate 10/10 | Technical Depth 8/10

#4 gpt_5_med: 49/60 (A)

Accuracy 8/10 | Completeness 5/10 | Severity Assessment 9/10 | Clarity 10/10 | False Positive Rate 9/10 | Technical Depth 8/10

#5 gpt5_codex_med: 49/60 (A)

Accuracy 10/10 | Completeness 2/10 | Severity Assessment 10/10 | Clarity 8/10 | False Positive Rate 10/10 | Technical Depth 9/10

GPT-5 vs Sonnet 4.5: Evaluator Comparison

Two AI evaluators independently scored all 22 models using different methodologies. Despite those differences, their scores correlated at r = 0.61 and they agreed on the top performers.

Methodology Differences

Aspect | GPT-5 Medium | Claude Sonnet 4.5
Approach | Vulnerability-centric matrix analysis | Comprehensive model performance scoring
Scale | 1-5 per dimension, converted to 0-50 (5 dimensions) | Direct 0-10 per dimension (6 dimensions, 60 max)
Focus | Pattern recognition, consensus building | Individual assessment, false-positive identification
Strength | Identifying agreement/disagreement patterns | Detailed technical analysis, practical recommendations
Style | Forensic and systematic | Analytical and prescriptive
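To compare the two scales directly, GPT-5's 0-50 totals appear below rescaled onto the 0-60 scale (the shown values are consistent with a 1.2x rescaling, which is our inference). A minimal Python sketch of the rescale-and-correlate step, using only the five models from the comparison table; Pearson's r over just these five will differ from the reported r = 0.61, which covers all 22 models.

```python
# Minimal sketch: rescale GPT-5's 0-50 totals to the 0-60 scale and
# correlate with Sonnet's scores. Note that a linear rescaling does not
# change Pearson's r; it only aligns the units for side-by-side display.
from statistics import correlation  # Pearson's r (Python 3.10+)

gpt5_raw = {"sonnet_4_5": 46, "polaris_alpha": 38, "gemini_2_5_pro": 26,
            "gpt_5_med": 40, "gpt5_codex_med": 38}          # out of 50
sonnet = {"sonnet_4_5": 54, "polaris_alpha": 52, "gemini_2_5_pro": 52,
          "gpt_5_med": 49, "gpt5_codex_med": 49}            # out of 60

gpt5_rescaled = {m: s * 60 / 50 for m, s in gpt5_raw.items()}
models = sorted(gpt5_rescaled)
r = correlation([gpt5_rescaled[m] for m in models],
                [sonnet[m] for m in models])
print(f"r = {r:.2f} over these five models")
```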

Score Comparison (Top 5 Models)

Model | GPT-5 (rescaled to /60) | Sonnet 4.5 | Delta (Sonnet - GPT-5)
sonnet_4_5 | 55.2 | 54 | -1.2
polaris_alpha | 45.6 | 52 | +6.4
gemini_2_5_pro | 31.2 | 52 | +20.8
gpt_5_med | 48.0 | 49 | +1.0
gpt5_codex_med | 45.6 | 49 | +3.4

Key Evaluator Insights

  • Agreement on Top Performers: Both ranked sonnet_4_5, polaris_alpha, and gpt_5_med in top tier
  • Biggest Disagreement: gemini_2_5_pro, scored 20.8 points (about 35 percentage points) higher by Sonnet, due to different weighting of completeness vs accuracy
  • Correlation: r = 0.61 indicates moderate positive correlation despite methodological differences
  • Consensus Value: GPT-5 valued consensus highly; Sonnet penalized false positives more severely

Vulnerability Consensus Analysis

Of 22 unique vulnerabilities identified across all models, only 2 achieved strong consensus as genuinely critical issues.

Consensus Critical Vulnerabilities

V02: Extension Registration Address Manipulation (12/22 models)

Severity: CRITICAL | Consensus: 95%

Description: Attackers can craft malicious extension addresses via CREATE2 or vanity generation (only 256 combinations needed for 8-bit callpoint matching). This enables complete protocol compromise for pools using the malicious extension.

Impact: Fund theft through callback hooks, front-running, price manipulation

Models Agreeing: glm_4_6, minimax, qwen_reap_264, qwen_next, oss_120b, seed_oss_36b, grok_code_fast, sonnet_4_5, gemini_2_5_pro, qwen_code, cerebras_0_2, cerebras_1_0
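To see why 8 bits of entropy is so weak, consider the expected search cost: each random candidate address matches a given 8-bit callpoint pattern with probability 1/256, so an attacker needs about 256 CREATE2 salts on average. The Python simulation below illustrates this; random bytes stand in for keccak256-derived CREATE2 addresses, and treating the low byte as the callpoint field is an illustrative assumption.

```python
# Sketch of the V02 search cost. Real attacks derive candidates as
# keccak256(0xff ++ deployer ++ salt ++ init_code_hash)[12:]; random
# bytes model that here, since the hash output is uniformly distributed.
import secrets

def attempts_until_match(target_callpoints: int) -> int:
    """Count random candidate addresses tried until the low byte
    (standing in for the 8-bit callpoint field) matches the target."""
    attempts = 0
    while True:
        attempts += 1
        candidate = secrets.token_bytes(20)      # a random 160-bit address
        if candidate[-1] == target_callpoints:   # 8-bit match: p = 1/256
            return attempts

trials = [attempts_until_match(0xA5) for _ in range(1000)]
print(sum(trials) / len(trials))  # ~256 on average; at 32+ bits of
                                  # entropy this grows to ~4.3 billion
```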

V03: Reentrancy in Pay Function (11/22 models)

Severity: CRITICAL | Consensus: 90%

Description: PAY_REENTRANCY_LOCK only protects the pay() function itself. Attackers can reenter through other functions like forward(), withdraw(), or Core functions during the callback, potentially manipulating debt accounting.

Impact: Potential fund drainage, flash loan attacks, debt accounting manipulation

Models Agreeing: glm_4_6, qwen_reap_264, qwen_next, kat_coder, oss_120b, grok_code_fast, sonnet_4_5, gemini_2_5_pro, qwen_code, cerebras_0_2, cerebras_1_0
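The failure mode is structural: a lock scoped to pay() alone cannot stop a callback from re-entering through a different entry point. The Python sketch below models the mitigation, a single lock shared by all state-mutating functions; the function names follow the report, but the lock mechanics are illustrative, not the contract's actual Solidity.

```python
# Sketch: one lock shared across entry points, so a callback launched
# inside pay() cannot re-enter through forward() (or any other guarded
# function) while the lock is held.
class Core:
    def __init__(self):
        self.locked = False            # single lock shared by all entry points

    def _acquire(self):
        if self.locked:
            raise RuntimeError("reentrancy blocked")
        self.locked = True

    def pay(self, callback):
        self._acquire()
        try:
            callback(self)             # untrusted token hook runs here
        finally:
            self.locked = False        # released only after pay() completes

    def forward(self):
        self._acquire()                # same lock: re-entry from the pay()
        self.locked = False            # callback is rejected above

core = Core()
try:
    core.pay(lambda c: c.forward())    # attacker attempts cross-function re-entry
except RuntimeError as err:
    print(err)                         # -> reentrancy blocked
```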

Weak Consensus Vulnerabilities

Vulnerability | Models | Consensus | Assessment
V13: Fee Calculation Overflow | 7/22 | 60% | Theoretical only; mathematically impossible with uint128 × uint64
V05: Gas Griefing in Tick Search | 6/22 | 55% | Medium severity; gas-cost issue, not fund loss
V01: ExposedStorage Information | 6/22 | 45% | Design feature; information leakage only
V08: Liquidity Overflow | 4/22 | 40% | Protected by Solidity 0.8+ overflow checks

Major False Positives

V20: Missing Token Transfer in Save (qwen_code, 1 model only)

Claimed Severity: CRITICAL | Actual: MAJOR FALSE POSITIVE

Issue: The model claimed that the save() function is vulnerable because it does not call transferFrom() to pull tokens.

Reality: This is a complete misunderstanding of the flash accounting system. The protocol uses debt accounting - tokens are settled at lock end, not per-operation. Adding transferFrom() would BREAK the protocol.

Verdict: This false positive demonstrates fundamental lack of protocol design understanding and disqualifies qwen_code from production use.

Model Behavioral Clusters

Models naturally group into three behavioral archetypes based on the trade-off between completeness (coverage) and false positive rate score (precision: a higher score means fewer false positives); a rough score-based classifier sketch follows the three profiles below:

Conservative Precision Cluster

Models: polaris_alpha, gemini_2_5_pro, gpt5_codex_med, oss_20b

Characteristics:

  • Findings per model: 0-3
  • False Positive Rate: 9-10/10 (excellent)
  • Accuracy: 8-10/10 (excellent)
  • Completeness: 2-6/10 (low)

Philosophy: "Better to miss than to be wrong"

Best Use Case: Final validation before deployment, confirming known issues, executive summaries

Balanced Comprehensive Cluster

Models: sonnet_4_5, grok_code_fast, qwen_next, cerebras_0_2, kat_coder, kimik2_thinking, oss_120b, deepseek_3_2

Characteristics:

  • Findings per model: 4-8
  • False Positive Rate: 5-8/10 (good)
  • Accuracy: 6-9/10 (good)
  • Completeness: 7-8/10 (high)

Philosophy: "Cast a wide net, but validate findings"

Best Use Case: Comprehensive audits, initial security review, discovering edge cases, risk assessment

Aggressive Breadth Cluster

Models: glm_4_6, minimax, seed_oss_36b, qwen_coder_30b, qwen_code, qwen_reap_264, qwen3_max, qwen_4b, cerebras_1_0

Characteristics:

  • Findings per model: 6-11
  • False Positive Rate: 3-5/10 (poor)
  • Accuracy: 2-6/10 (concerning)
  • Completeness: 6-10/10 (very high)

Philosophy: "Flag everything that might be an issue"

Best Use Case: Brainstorming attack vectors, initial sweeps (with validation), finding unusual patterns

⚠️ Warning: NOT suitable for production decision-making without extensive validation
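As a rough illustration of the trade-off described above, archetype membership can be approximated from two of the published scores. The thresholds in this Python sketch are assumptions reverse-engineered from the cluster characteristics, and they do not separate every boundary case (deepseek_3_2 and glm_4_6 have near-identical dimension scores yet sit in different clusters), so treat it strictly as a heuristic:

```python
# Heuristic archetype classifier. Thresholds are assumptions inferred
# from the cluster characteristics above, not the report's actual method.
def archetype(completeness: int, fp_score: int) -> str:
    """fp_score uses the report's 0-10 scale: higher = fewer false positives."""
    if fp_score >= 9:                        # near-zero false positives
        return "Conservative Precision"
    if fp_score >= 5 and completeness >= 5:  # decent precision, wide coverage
        return "Balanced Comprehensive"
    return "Aggressive Breadth"              # coverage at the cost of precision

print(archetype(completeness=4, fp_score=10))  # polaris_alpha -> Conservative
print(archetype(completeness=8, fp_score=8))   # sonnet_4_5    -> Balanced
print(archetype(completeness=7, fp_score=3))   # minimax       -> Aggressive
```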

Recommendations & Model Selection Guide

For Protocol Developers

Immediate Actions (Before Deployment)

  1. ✅ Fix V02 (Extension Registration) - Implement whitelist, governance approval, or increase entropy to 32+ bits
  2. ✅ Fix V03 (Reentrancy) - Add comprehensive reentrancy guards across all lock operations
  3. ⚠️ Review V13 (Fee Overflow) - Confirm the claimed impossibility with a formal proof
  4. ⚠️ Review V01 (ExposedStorage) - Remove or restrict if not needed for production

Process Recommendations

  1. Never rely on single AI audit - Use minimum 2 models from different clusters
  2. Always validate consensus findings - 3+ models agreeing = likely real (see the tally sketch after this list)
  3. Discount unique findings - Single-model claims require manual verification
  4. Understand your models - Know which archetype each model belongs to
  5. Use conservative models for final validation - polaris_alpha or gpt5_codex_med
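A minimal sketch of the consensus heuristic from steps 2 and 3: tally how many models report each finding and flag 3+ agreement as likely real. The findings-per-model mapping below is a small illustrative subset, not the full dataset.

```python
# Count cross-model agreement per finding and apply the 3+ rule from
# the process recommendations above. The example reports are a subset
# chosen for illustration only.
from collections import Counter

reports = {
    "sonnet_4_5":     {"V02", "V03", "V05"},
    "gemini_2_5_pro": {"V02", "V03"},
    "polaris_alpha":  {"V02"},
    "qwen_code":      {"V02", "V03", "V20"},   # V20 is its false positive
}

votes = Counter(v for findings in reports.values() for v in findings)
for vuln, n in votes.most_common():
    label = "likely real" if n >= 3 else "needs manual verification"
    print(f"{vuln}: {n}/{len(reports)} models -> {label}")
```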

Model Selection by Use Case

Use Case | Primary Model | Secondary Model | Expected Outcome
Initial Security Scan | sonnet_4_5 | grok_code_fast | 4-8 issues, 2-4 hours
Comprehensive Audit | sonnet_4_5 | deepseek_3_2 + polaris_alpha | 5-10 validated issues, 1-2 days
Pre-Deployment Check | gpt5_codex_med | gemini_2_5_pro | 0-3 high-confidence issues, 1-2 hours
Bug Bounty Prep | sonnet_4_5 | cerebras_1_0 + gpt_5_med | 10-15 leads to investigate, 2-3 days
Code Review | gpt_5_med | qwen_next | 3-5 issues, 1-2 hours
Quick Sanity Check | polaris_alpha | none | 0-3 critical issues only, 30 minutes

Budget-Based Recommendations

High Budget: Ensemble Approach

Phase 1: sonnet_4_5, deepseek_3_2, cerebras_1_0, seed_oss_36b

Phase 2: Consolidation

Phase 3: polaris_alpha, gpt5_codex_med

Cost: ~$500-1000

Time: 1 week

Expected: 5-10 real issues, near-zero missed

Medium Budget: Two-Model Approach

Models: sonnet_4_5 (comprehensive) + polaris_alpha (validation)

Cost: ~$100-200

Time: 1-2 days

Expected: 3-7 real issues, good coverage

Low Budget: Single-Model Approach

Model: sonnet_4_5 OR gpt5_codex_med (if conservative preference)

Cost: ~$50-100

Time: 4-8 hours

Expected: 3-5 real issues, acceptable coverage

Conclusions

Key Takeaways

  1. AI auditing is viable but requires careful model selection. The 2.6x performance gap between best (54/60) and worst (21/60) models demonstrates that not all AI auditors are created equal.
  2. Consensus is a powerful validation signal. Only 2 vulnerabilities achieved strong multi-model consensus, providing high confidence they are genuine issues requiring fixes.
  3. False positives reveal fundamental understanding gaps. Models making errors about Solidity 0.8+ safety features (V12) or flash accounting design (V20) lack the foundation needed for production audits.
  4. GPT-5 and Sonnet 4.5 evaluators reached similar conclusions. Despite different methodologies, both agreed on top performers (sonnet_4_5, polaris_alpha) and identified the same 2 consensus critical vulnerabilities.
  5. Three distinct model archetypes emerged. Conservative, Balanced, and Aggressive models each serve different purposes in a comprehensive audit workflow. Use models from multiple archetypes in sequence for optimal coverage with validation.

Can AI Replace Human Auditors?

NO – But AI is a Powerful Force Multiplier

✓ AI Strengths

  • Comprehensive coverage of known patterns
  • Consistent application of security checklists
  • Fast initial triage
  • Cost-effective broad screening

✗ AI Weaknesses

  • Novel attack vector discovery
  • Business logic vulnerabilities
  • Context-dependent risk assessment
  • False positives require human filtering

Optimal Approach: Human + AI Ensemble

  • AI: 40% faster initial discovery
  • Human: 100% better risk prioritization
  • Together: 120% effectiveness vs human-only audits

Full Model Score Table

(Acc = Accuracy, Comp = Completeness, Sev = Severity Assessment, Clar = Clarity, FP = False Positive Rate, Tech = Technical Depth; each scored 0-10.)

Rank | Model | Acc | Comp | Sev | Clar | FP | Tech | Total | Grade
#1 | sonnet_4_5 | 9 | 8 | 9 | 10 | 8 | 10 | 54/60 | A+
#2 | polaris_alpha | 9 | 4 | 10 | 10 | 10 | 9 | 52/60 | A+
#3 | gemini_2_5_pro | 10 | 5 | 10 | 9 | 10 | 8 | 52/60 | A+
#4 | gpt_5_med | 8 | 5 | 9 | 10 | 9 | 8 | 49/60 | A
#5 | gpt5_codex_med | 10 | 2 | 10 | 8 | 10 | 9 | 49/60 | A
#6 | grok_code_fast | 7 | 8 | 8 | 9 | 7 | 9 | 48/60 | A
#7 | qwen_next | 7 | 7 | 8 | 8 | 7 | 8 | 45/60 | A-
#8 | kat_coder | 6 | 7 | 7 | 9 | 6 | 8 | 43/60 | B+
#9 | kimik2_thinking | 7 | 5 | 8 | 7 | 8 | 7 | 42/60 | B+
#10 | oss_20b | 8 | 4 | 8 | 7 | 9 | 6 | 42/60 | B+
#11 | oss_120b | 6 | 7 | 7 | 7 | 6 | 7 | 40/60 | B
#12 | cerebras_0_2 | 6 | 7 | 7 | 8 | 5 | 7 | 40/60 | B
#13 | qwen_reap_264 | 6 | 8 | 7 | 6 | 5 | 7 | 39/60 | B
#14 | deepseek_3_2 | 5 | 8 | 6 | 9 | 4 | 7 | 39/60 | B
#15 | glm_4_6 | 5 | 8 | 6 | 8 | 4 | 7 | 38/60 | B-
#16 | seed_oss_36b | 5 | 7 | 6 | 8 | 4 | 8 | 38/60 | B-
#17 | qwen_coder_30b | 5 | 6 | 6 | 8 | 5 | 7 | 37/60 | B-
#18 | cerebras_1_0 | 5 | 8 | 5 | 8 | 3 | 8 | 37/60 | B-
#19 | qwen_4b | 6 | 4 | 7 | 6 | 7 | 5 | 35/60 | C+
#20 | qwen3_max | 5 | 6 | 5 | 7 | 5 | 6 | 34/60 | C+
#21 | minimax | 4 | 7 | 5 | 7 | 3 | 6 | 32/60 | C
#22 | qwen_code | 2 | 4 | 3 | 7 | 1 | 4 | 21/60 | F