Google Cloud · MongoDB Atlas · HotpotQA

Quality-Gated Multi-Hop Retrieval with Episodic Memory

GREM recovers BM25 hard failures using multi-agent reasoning, verification-aware aggregation, and distilled cross-encoder reranking — backed by MongoDB Atlas vector memory.

Performance Results

HotpotQA Bridge Failures (n=228)

Key Metrics

Metric	Value
Hits@1	80.26%
Hits@2	92.54%
Recall@2	70.61%
MRR	0.8851
nDCG@2	0.7280
nDCG@5	0.8475
Ground Rate	100.0%
Lucky Rate	0.0%
Adaptive Atlas	4.8%
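For readers unfamiliar with the ranking metrics above, here is a minimal sketch of their textbook definitions for a single query. The page does not say which variants GREM's evaluation uses, so the binary relevance gain, the set of gold documents, and all function names are assumptions for illustration.

```python
import math
from typing import List, Set


def hits_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold document appears in the top k, else 0.0."""
    return 1.0 if any(doc in gold for doc in ranked[:k]) else 0.0


def recall_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Fraction of gold documents that appear in the top k."""
    return len(set(ranked[:k]) & gold) / len(gold)


def reciprocal_rank(ranked: List[str], gold: Set[str]) -> float:
    """1 / rank of the first gold document (0.0 if none is retrieved)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in gold
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0


# Hypothetical query with two gold paragraphs, as in a two-hop question.
ranked = ["d3", "d1", "d7", "d2"]
gold = {"d1", "d2"}
print(hits_at_k(ranked, gold, 1), hits_at_k(ranked, gold, 2))  # 0.0 1.0
print(recall_at_k(ranked, gold, 2), reciprocal_rank(ranked, gold))  # 0.5 0.5
```

Corpus-level figures such as MRR are the mean of these per-query values. Under these definitions Hits@2 can exceed Recall@2, because one gold paragraph in the top two counts as a hit but only as half the recall.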

Failure Mode Recovery

Failure mode	Recovered
Chain break	84.2% (16/19)
Distractor confusion	81.0% (64/79)
Entity drift	79.2% (103/130)
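As a quick consistency check, the per-mode counts can be aggregated with a few lines of arithmetic. The counts are taken from the figures above. Treating "recovered" as a Hits@1 success is an assumption: the page does not state it, but the totals are consistent with it.

```python
# (recovered, total) per failure mode, copied from the reported figures.
modes = {
    "Chain break": (16, 19),
    "Distractor confusion": (64, 79),
    "Entity drift": (103, 130),
}

total_recovered = sum(hit for hit, _ in modes.values())
total_queries = sum(n for _, n in modes.values())
overall = total_recovered / total_queries

for name, (hit, n) in modes.items():
    print(f"{name}: {100 * hit / n:.1f}%")

print(total_recovered, total_queries)  # 183 228
print(f"{100 * overall:.2f}%")         # 80.26%
```

The three modes account for all 228 bridge failures, and their pooled recovery rate matches the reported Hits@1 of 80.26%.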

System Comparison

Metric	BM25 Baseline	LLM Re-ranking	GREM (Distilled)	GREM (Adaptive Atlas)
Hits@1	0.000	~0.85	0.8026	0.8026
Hits@2	—	~0.93	0.9254	0.9254
Recall@2	—	~0.72	0.7061	0.7061
MRR	—	~0.88	0.8851	0.8851
Latency	5 ms	~2 s	2 ms	2-50 ms
Cost per query	$0	$0.003	$0.000003	$0.000003
API calls	0	1 LLM	0	0-1 vector

MongoDB Atlas Vector Search is invoked on 4.8% of queries (11 of 228), exactly when the re-ranker is uncertain. The remaining 95.2% of queries complete in 2 ms with no database round-trip.
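The uncertainty-gated fallback can be sketched as follows. This is not GREM's implementation: the page does not say how re-ranker uncertainty is measured, so the top-1 versus top-2 score margin, the threshold, the memory weight, and the `fetch_memory` callable are all hypothetical. In a real deployment that callable might wrap an Atlas vector search query; here it is a plain function, so the sketch runs without a database.

```python
from typing import Callable, Dict, List, Tuple


def gated_rerank(
    query: str,
    scored: List[Tuple[str, float]],
    fetch_memory: Callable[[str], Dict[str, float]],
    margin_threshold: float = 0.10,  # assumed uncertainty cutoff
    memory_weight: float = 0.5,      # assumed weight of the memory signal
) -> Tuple[List[str], bool]:
    """Return (ranked doc ids, whether the memory lookup was used)."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)

    # Confident case: a clear winner, so skip the round-trip entirely.
    if len(ranked) < 2 or ranked[0][1] - ranked[1][1] >= margin_threshold:
        return [doc for doc, _ in ranked], False

    # Uncertain case: fetch per-document bonuses from memory and re-sort.
    bonus = fetch_memory(query)
    adjusted = [(doc, s + memory_weight * bonus.get(doc, 0.0)) for doc, s in ranked]
    adjusted.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in adjusted], True


def fake_memory(query: str) -> Dict[str, float]:
    # Stand-in for a vector-memory lookup; returns a bonus for one document.
    return {"b": 0.4}


print(gated_rerank("q", [("a", 0.90), ("b", 0.30)], fake_memory))
# (['a', 'b'], False)
print(gated_rerank("q", [("a", 0.52), ("b", 0.50), ("c", 0.10)], fake_memory))
# (['b', 'a', 'c'], True)
```

The design choice is that the slow path is paid only on close calls, which is how a small share of queries can use the database while the rest stay at re-ranker latency.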

See It In Action

Replay cached BM25 → GREM inference traces with verified reasoning chains.

Live From MongoDB Atlas

Pipeline 01

Training

Walk through BM25 mining, the 3-agent reasoning layer, verification gate, and how high-confidence traces become Atlas episodic memory.

Explore training →
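The step where verified, high-confidence traces become episodic memory might look roughly like the filter below. The `Trace` fields, the `min_confidence` value, and the function name are hypothetical. The page does not describe the trace schema or the gate's criteria, and the write to Atlas is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    query: str
    hops: List[str]    # reasoning chain, one entry per hop
    verified: bool     # outcome of the verification gate
    confidence: float  # assumed score in [0, 1]


def select_for_memory(traces: List[Trace], min_confidence: float = 0.8) -> List[Trace]:
    """Keep only traces that passed verification with high confidence."""
    return [t for t in traces if t.verified and t.confidence >= min_confidence]


traces = [
    Trace("q1", ["hop a", "hop b"], verified=True, confidence=0.93),
    Trace("q2", ["hop a"], verified=True, confidence=0.41),
    Trace("q3", ["hop a", "hop b"], verified=False, confidence=0.97),
]
kept = select_for_memory(traces)
print([t.query for t in kept])  # ['q1']
```

Requiring both conditions means a confident but unverified trace is never stored, which keeps unverified chains out of memory.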

Pipeline 02

Inference

Run a query through hybrid retrieval, Atlas memory fetch, and the distilled cross-encoder reranker. Watch metrics update live.

Try inference →