Google Cloud · MongoDB Atlas · HotpotQA
Quality-Gated Multi-Hop Retrieval with Episodic Memory
GREM recovers BM25 hard failures using multi-agent reasoning, verification-aware aggregation, and distilled cross-encoder reranking — backed by MongoDB Atlas vector memory.
Performance Results
HotpotQA Bridge Failures (n=228)
Key Metrics
| Metric | Value |
|---|---|
| Hits@1 | 80.26% |
| Hits@2 | 92.54% |
| Recall@2 | 70.61% |
| MRR | 0.8851 |
| nDCG@2 | 0.7280 |
| nDCG@5 | 0.8475 |
| Ground Rate | 100.0% |
| Lucky Rate | 0.0% |
| Adaptive Atlas | 4.8% |
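
For readers who want to reproduce the ranking metrics above, here is a minimal sketch of how they are commonly computed per query. It assumes binary relevance (a passage is either a gold supporting passage or not); the function names and the sample IDs are illustrative, not taken from the GREM codebase.

```python
import math

def hits_at_k(ranked, gold, k):
    """1.0 if at least one gold passage appears in the top k, else 0.0."""
    return 1.0 if any(doc in gold for doc in ranked[:k]) else 0.0

def recall_at_k(ranked, gold, k):
    """Fraction of gold passages that appear in the top k."""
    return sum(1 for doc in ranked[:k] if doc in gold) / len(gold)

def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold passage, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gold, k):
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in gold)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical query with two gold passages, as in a bridge question.
ranked = ["d3", "d1", "d7", "d2"]
gold = {"d1", "d2"}
print(hits_at_k(ranked, gold, 1), hits_at_k(ranked, gold, 2))
print(recall_at_k(ranked, gold, 2), reciprocal_rank(ranked, gold))
```

Averaging each function over all queries gives the corpus-level figure; MRR is the mean of `reciprocal_rank`.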
Failure Mode Recovery
| Failure mode | Recovery rate | Recovered / total |
|---|---|---|
| Chain break | 84.2% | 16/19 |
| Distractor confusion | 81.0% | 64/79 |
| Entity drift | 79.2% | 103/130 |
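
As a quick consistency check, the per-mode counts above can be summed and compared with the headline numbers. The sketch below uses only the figures shown on this page; reading "recovered" as a Hits@1 success is an inference from the totals matching, not something the page states.

```python
# (recovered, total) per failure mode, copied from the figures above.
recovered = {
    "chain_break": (16, 19),
    "distractor_confusion": (64, 79),
    "entity_drift": (103, 130),
}

for mode, (hits, n) in recovered.items():
    print(f"{mode}: {hits / n:.1%}")

total_hits = sum(hits for hits, _ in recovered.values())
total_n = sum(n for _, n in recovered.values())
overall = total_hits / total_n

# The three modes cover all 228 queries, and 183 / 228 rounds to 80.26%.
print(total_hits, total_n, f"{overall:.2%}")
```

The totals (183 of 228) reproduce the reported Hits@1 of 80.26%, so the three failure modes appear to partition the evaluation set.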
System Comparison
| Metric | BM25 Baseline | LLM Re-ranking | GREM (Distilled) | GREM (Adaptive Atlas) |
|---|---|---|---|---|
| Hits@1 | 0.000 | ~0.85 | 0.8026 | 0.8026 |
| Hits@2 | — | ~0.93 | 0.9254 | 0.9254 |
| Recall@2 | — | ~0.72 | 0.7061 | 0.7061 |
| MRR | — | ~0.88 | 0.8851 | 0.8851 |
| Latency | 5 ms | ~2 s | 2 ms | 2-50 ms |
| Cost per query | $0 | $0.003 | $0.000003 | $0.000003 |
| API calls | 0 | 1 LLM | 0 | 0-1 vector |
MongoDB Atlas Vector Search is invoked on 4.8% of queries (11 of 228), exactly when the re-ranker is uncertain. The remaining 95.2% of queries complete in 2 ms with no database round-trip.
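
The adaptive behaviour can be pictured as a confidence gate in front of the memory lookup. The sketch below is one plausible shape for such a gate, not GREM's actual rule: the top-1 versus top-2 score margin, the `margin` default, and the `fetch_memory` callable (standing in for an Atlas Vector Search query) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

def rerank_with_gate(
    scores: Sequence[float],
    fetch_memory: Callable[[], Sequence[float]],
    margin: float = 0.1,  # hypothetical uncertainty threshold
) -> Tuple[List[float], bool]:
    """Return (final_scores, used_memory).

    If the re-ranker's best score beats the runner-up by at least `margin`,
    the scores are returned unchanged and no lookup happens. Otherwise
    `fetch_memory` is called once and its per-candidate bonuses are added.
    """
    ordered = sorted(scores, reverse=True)
    confident = len(ordered) < 2 or (ordered[0] - ordered[1]) >= margin
    if confident:
        return list(scores), False
    bonus = fetch_memory()  # the only point where a database round-trip occurs
    return [s + b for s, b in zip(scores, bonus)], True

# Stub standing in for the vector-memory lookup; counts how often it runs.
calls = []
def fake_fetch() -> List[float]:
    calls.append(1)
    return [0.0, 0.2, 0.0]

print(rerank_with_gate([0.9, 0.2, 0.1], fake_fetch))    # clear winner, no fetch
print(rerank_with_gate([0.51, 0.50, 0.1], fake_fetch))  # near tie, fetch runs
```

With a gate like this, latency stays at the re-ranker's cost for confident queries and only the uncertain minority pays for the round-trip.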
See It In Action
Replay cached BM25 → GREM inference traces with verified reasoning chains.
Live From MongoDB Atlas
Pipeline 01
Training
Walk through BM25 mining, the 3-agent reasoning layer, verification gate, and how high-confidence traces become Atlas episodic memory.
Explore training →
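
To make the verification gate concrete, here is a minimal sketch of how reasoning traces might be filtered before being written to episodic memory. The `Trace` fields, the `min_confidence` threshold, and the function name are hypothetical; the page only states that high-confidence, verified traces are kept.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trace:
    query: str
    chain: List[str]    # reasoning hops produced by the agents
    verified: bool      # did the verification step accept the chain?
    confidence: float   # assumed to lie in [0, 1]

def select_for_memory(traces: List[Trace], min_confidence: float = 0.8) -> List[Trace]:
    """Keep only traces that passed verification and meet the confidence bar."""
    return [t for t in traces if t.verified and t.confidence >= min_confidence]

traces = [
    Trace("q1", ["hop a", "hop b"], verified=True, confidence=0.93),
    Trace("q2", ["hop a"], verified=False, confidence=0.97),  # fails verification
    Trace("q3", ["hop a", "hop b"], verified=True, confidence=0.55),  # too uncertain
]
kept = select_for_memory(traces)
print([t.query for t in kept])
```

Requiring both conditions means a confident but unverified chain never reaches memory.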
Pipeline 02
Inference
Run a query through hybrid retrieval, Atlas memory fetch, and the distilled cross-encoder reranker. Watch metrics update live.
Try inference →