GREM Architecture

This page explains the production-ready pipeline used by the demo: BM25 retrieval, multi-agent reasoning, verification gating, and MongoDB Atlas episodic memory.

1. Retrieval + Failure Detection

BM25 retrieves candidate passages. If the gold answer does not appear near the top of the ranking (within the top-k cutoff), the system labels the query with a failure mode and triggers the GREM reasoning workflow.

BM25 top-k baseline
Failure modes: chain_break, entity_drift, distractor_confusion
Cached initial ranking state stored in demo_traces
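
The retrieval and failure check above can be sketched as follows. This is a minimal illustration, not the demo's actual code: the function names, the whitespace tokenizer, the BM25 parameters (k1=1.5, b=0.75), and the cutoff k are all assumptions. It only decides whether a query counts as a failure; how a failure is assigned to chain_break, entity_drift, or distractor_confusion is not covered here.

```python
import math
from collections import Counter


def bm25_rank(query, passages, k1=1.5, b=0.75):
    """Return passage indices sorted by Okapi BM25 score, best first.

    Assumes a non-empty passage list and naive whitespace tokenization.
    """
    docs = [p.lower().split() for p in passages]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def first_gold_rank(ranking, gold_ids):
    """1-based rank of the first gold passage, or None if none is ranked."""
    for rank, idx in enumerate(ranking, start=1):
        if idx in gold_ids:
            return rank
    return None


def is_retrieval_failure(ranking, gold_ids, k=3):
    """True when no gold passage appears within the top k (k is hypothetical)."""
    rank = first_gold_rank(ranking, gold_ids)
    return rank is None or rank > k


passages = [
    "the eiffel tower is in paris",
    "paris is the capital of france",
    "gustave eiffel designed the tower",
]
ranking = bm25_rank("who designed the eiffel tower", passages)
```

With these toy passages, a query whose gold passage is index 1 would be flagged as a failure at k=2, while one whose gold passage is index 2 would not.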

2. Multi-Agent Reasoning

GREM runs multiple reasoning agents over the BM25 candidates. Their outputs are aggregated into a verified reasoning chain, which is then used with a distilled cross-encoder to re-rank the retrieval results.

Agents inspect BM25 candidate evidence
Verified chain is built as an aggregate decision
Gold elevation is measured by reranker rank improvement
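
One plausible reading of "rank improvement" is the difference between the first gold rank before and after re-ranking. The sketch below assumes that definition; the function names and the passage IDs are hypothetical, and the demo may compute the metric differently.

```python
def first_gold_rank(ranking, gold_ids):
    """1-based rank of the first gold passage, or None if none is ranked."""
    for rank, passage_id in enumerate(ranking, start=1):
        if passage_id in gold_ids:
            return rank
    return None


def gold_elevation(initial_ranking, reranked, gold_ids):
    """Positions gained by the first gold passage after re-ranking.

    Positive values mean the gold passage moved up. Returns None when
    the gold passage is missing from either ranking.
    """
    before = first_gold_rank(initial_ranking, gold_ids)
    after = first_gold_rank(reranked, gold_ids)
    if before is None or after is None:
        return None
    return before - after


# Gold passage "p7" moves from rank 3 (BM25) to rank 1 (re-ranked).
elevation = gold_elevation(["p3", "p1", "p7"], ["p7", "p3", "p1"], {"p7"})
```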

3. Atlas Episodic Memory

High-confidence verification traces are persisted to the episodic_memory collection in MongoDB Atlas. The live feed on the home page renders those verified chains.

Collection: episodic_memory
Fields: query, failure_mode, q_final, aggregator_chain, first_gold_rank
Filtered for quality with q_final >= 0.7
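
The document shape and quality gate listed above might look like the following sketch. The field names and the 0.7 threshold come from this page; the value types (for example, aggregator_chain as a list of strings), the helper names, and the sample values are assumptions. Whether the threshold is applied at write time, at read time, or both is not specified here, so the sketch shows both forms.

```python
Q_FINAL_THRESHOLD = 0.7

# MongoDB query filter equivalent to q_final >= 0.7, suitable for passing
# to a driver call such as pymongo's Collection.find().
QUALITY_FILTER = {"q_final": {"$gte": Q_FINAL_THRESHOLD}}


def build_episode(query, failure_mode, q_final, aggregator_chain, first_gold_rank):
    """Assemble one episodic_memory document (value types are assumed)."""
    return {
        "query": query,
        "failure_mode": failure_mode,
        "q_final": q_final,
        "aggregator_chain": aggregator_chain,
        "first_gold_rank": first_gold_rank,
    }


def should_persist(episode, threshold=Q_FINAL_THRESHOLD):
    """Write-time gate: keep only high-confidence traces."""
    return episode["q_final"] >= threshold


# Illustrative values only, not real demo data.
episode = build_episode(
    query="example multi-hop question",
    failure_mode="entity_drift",
    q_final=0.82,
    aggregator_chain=["step one", "step two"],
    first_gold_rank=1,
)
```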

4. Metrics + Monitoring

Final evaluation metrics are stored in final_metrics and reused across homepage cards and the Results section.

Single-document collection: final_metrics
Metrics loaded once via a shared hook
Performance displayed in both hero cards and results dashboard

Production Notes

Deploying to Vercel requires a valid MONGO_URI environment variable. The Atlas network access list must allow the deployment IP range, or 0.0.0.0/0 (all addresses) during development only. The frontend consumes cache collections directly from Atlas for the live demo.
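
A startup check along these lines can surface a missing or malformed MONGO_URI before the first query fails. This is a hedged sketch: the helper name load_mongo_uri is hypothetical, and the check only validates the connection-string scheme, not whether Atlas actually accepts connections from the deployment's IP range.

```python
import os


def load_mongo_uri(env=None):
    """Return MONGO_URI from the environment, or raise if it is unusable."""
    env = os.environ if env is None else env
    uri = env.get("MONGO_URI", "").strip()
    if not uri:
        raise RuntimeError("MONGO_URI is not set")
    # MongoDB connection strings use one of these two schemes.
    if not uri.startswith(("mongodb://", "mongodb+srv://")):
        raise RuntimeError("MONGO_URI must start with mongodb:// or mongodb+srv://")
    return uri
```

Passing a plain dictionary as env makes the check easy to exercise without touching the real environment.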