Google Cloud · MongoDB Atlas · HotpotQA

Quality-Gated Multi-Hop Retrieval with Episodic Memory

GREM recovers BM25 hard failures using multi-agent reasoning, verification-aware aggregation, and distilled cross-encoder reranking — backed by MongoDB Atlas vector memory.

Performance Results

HotpotQA Bridge Failures (n=228)

Key Metrics

Hits@1: 80.26%
Hits@2: 92.54%
Recall@2: 70.61%
MRR: 0.8851
nDCG@2: 0.7280
nDCG@5: 0.8475
Ground Rate: 100.0%
Lucky Rate: 0.0%
Adaptive Atlas: 4.8% (share of queries that invoke Atlas Vector Search)
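
For readers who want to reproduce the ranking metrics above, the following is a minimal sketch of the standard definitions of Hits@k, Recall@k, reciprocal rank (averaged over queries to give MRR), and nDCG@k with binary relevance. It assumes each query has a set of gold supporting document IDs and that a "hit" means any gold document appears in the top k; the page does not state the exact definitions GREM uses, so treat these as the conventional ones rather than the authors' evaluation code. Ground Rate and Lucky Rate are not defined on this page and are left out. The function names and the example document IDs are hypothetical.

```python
import math

def hits_at_k(ranked, gold, k):
    """1.0 if any gold document appears in the top k, else 0.0 (assumed definition)."""
    return 1.0 if any(doc in gold for doc in ranked[:k]) else 0.0

def recall_at_k(ranked, gold, k):
    """Fraction of gold documents that appear in the top k."""
    return len(set(ranked[:k]) & gold) / len(gold)

def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold document; 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gold, k):
    """nDCG@k with binary relevance (gold = 1, everything else = 0)."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc in enumerate(ranked[:k], start=1)
        if doc in gold
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical single query: two gold documents, the first one found at rank 2.
ranked = ["d3", "d1", "d7", "d2"]
gold = {"d1", "d2"}
print(hits_at_k(ranked, gold, 1), hits_at_k(ranked, gold, 2))
print(recall_at_k(ranked, gold, 2), reciprocal_rank(ranked, gold))
print(round(ndcg_at_k(ranked, gold, 2), 4))
```

Averaging each function over all 228 queries would give the dataset-level figures.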

Failure Mode Recovery

Chain break: 84.2% (16/19)
Distractor confusion: 81.0% (64/79)
Entity drift: 79.2% (103/130)
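
The per-mode counts can be checked against the headline numbers with a few lines of arithmetic. The sketch below only re-derives figures already on this page: the three failure modes cover all 228 queries, and the recovered counts sum to 183, which matches the reported Hits@1 of 80.26%. Reading "recovered" as a top-1 hit is an inference from that match, not something the page states.

```python
# (recovered, total) per failure mode, copied from the figures above
recovered = {
    "Chain break": (16, 19),
    "Distractor confusion": (64, 79),
    "Entity drift": (103, 130),
}

for mode, (hit, total) in recovered.items():
    print(f"{mode}: {hit}/{total} = {hit / total:.1%}")

total_hits = sum(hit for hit, _ in recovered.values())
total_queries = sum(total for _, total in recovered.values())
print(f"Overall: {total_hits}/{total_queries} = {total_hits / total_queries:.2%}")
# Overall: 183/228 = 80.26%
```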

System Comparison

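| Metric | BM25 Baseline | LLM Re-ranking | GREM (Distilled) | GREM (Adaptive Atlas) |
|---|---|---|---|---|
| Hits@1 | 0.000 | ~0.85 | 0.8026 | 0.8026 |
| Hits@2 | | ~0.93 | 0.9254 | 0.9254 |
| Recall@2 | | ~0.72 | 0.7061 | 0.7061 |
| MRR | | ~0.88 | 0.8851 | 0.8851 |
| Latency | 5 ms | ~2 s | 2 ms | 2-50 ms |
| Cost per query | $0 | $0.003 | $0.000003 | $0.000003 |
| API calls | 0 | 1 LLM | 0 | 0-1 vector |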

MongoDB Atlas Vector Search is invoked on 4.8% of queries (11 out of 228), exactly when the re-ranker is uncertain. The other 95.2% of queries complete in 2 ms with no database round-trip.
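
One way to picture this kind of gating is a score-margin check: if the re-ranker's top two scores are close, the query is treated as uncertain and a memory lookup is made; otherwise the local ranking is returned immediately. The sketch below is an illustration under stated assumptions, not GREM's implementation. The page does not say how uncertainty is measured, so the margin rule, the `margin_threshold` value, the function name `gated_rerank`, and the `memory_lookup` callable (a stand-in for whatever performs the Atlas Vector Search query) are all hypothetical.

```python
from typing import Callable, List, Tuple

Scored = List[Tuple[str, float]]  # (doc_id, score) pairs, in any order

def gated_rerank(
    candidates: Scored,
    memory_lookup: Callable[[], Scored],
    margin_threshold: float = 0.1,  # hypothetical value, would need tuning
) -> Tuple[Scored, bool]:
    """Return (ranking, used_memory).

    The memory lookup is called only when the gap between the two best
    re-ranker scores is below margin_threshold (assumed uncertainty test).
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < 2:
        return ranked, False

    margin = ranked[0][1] - ranked[1][1]
    if margin >= margin_threshold:
        # Confident: answer locally, no database round-trip.
        return ranked, False

    # Uncertain: merge in memory results, keeping the best score per document.
    best = dict(ranked)
    for doc_id, score in memory_lookup():
        best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    merged = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return merged, True

# Hypothetical usage: a confident query and an uncertain one.
confident, used = gated_rerank([("a", 0.9), ("b", 0.2)], lambda: [])
print(confident, used)
uncertain, used = gated_rerank(
    [("a", 0.52), ("b", 0.50)], lambda: [("b", 0.8), ("c", 0.6)]
)
print(uncertain, used)
```

With a rule of this shape, the fraction of queries that fall below the threshold determines how often the vector store is contacted, which is the quantity the 4.8% figure reports.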

See It In Action

Replay cached BM25 → GREM inference traces with verified reasoning chains.

Live From MongoDB Atlas