GREM Architecture

This page explains the production-ready pipeline used by the demo: BM25 retrieval, multi-agent reasoning, verification gating, and MongoDB Atlas episodic memory.

1. Retrieval + Failure Detection

BM25 retrieves candidate passages. If the gold passage does not appear near the top of that ranking, the system tags the query with a failure mode and triggers the GREM reasoning workflow.

  • BM25 top-k baseline
  • Failure modes: chain_break, entity_drift, distractor_confusion
  • Cached initial ranking state stored in demo_traces
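
The retrieval and failure check described above can be sketched as follows. This is a minimal illustration, not the demo's actual code: the whitespace tokenizer, the k1 and b defaults, the function names, and the cutoff parameter are all assumptions, since the page does not state how "early" is defined. It also only detects that a failure occurred; assigning one of the three failure-mode labels is not shown.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by descending Okapi BM25 score."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def first_gold_rank(ranking, gold_ids):
    """1-based rank of the first gold document, or None if it is absent."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in gold_ids:
            return rank
    return None


def needs_grem(ranking, gold_ids, cutoff=3):
    """True when the gold document is missing or ranked below the cutoff.

    The cutoff is a hypothetical parameter chosen for illustration.
    """
    rank = first_gold_rank(ranking, gold_ids)
    return rank is None or rank > cutoff


docs = [
    "paris is the capital of france",
    "the eiffel tower is in paris",
    "berlin is the capital of germany",
]
ranking = bm25_rank("capital of france", docs)
```

With these toy passages, a query whose gold passage is document 1 would be flagged at cutoff=2, because BM25 ranks it last.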

2. Multi-Agent Reasoning

GREM runs multiple reasoning agents to produce a verified reasoning chain. That chain is then used with a distilled cross-encoder to re-rank the retrieval results.

  • Agents inspect BM25 candidate evidence
  • The verified chain is built by aggregating the agents' outputs into a single decision
  • Gold elevation is measured as the improvement in the gold passage's rank after re-ranking
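
As a rough sketch of the last bullet, gold elevation can be computed as the difference between the gold passage's rank before and after re-ranking. The function names and the scores below are hypothetical; the scores merely stand in for cross-encoder outputs, and the real reranker is not shown.

```python
def rerank(candidates, scores):
    """Order candidate ids by descending reranker score."""
    paired = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [candidate for candidate, _ in paired]


def gold_elevation(initial_rank, reranked_rank):
    """Positive when re-ranking moved the gold passage closer to the top."""
    return initial_rank - reranked_rank


# BM25 order of candidate passage ids; "p3" is assumed to be the gold passage.
bm25_order = ["p1", "p2", "p3", "p4"]
# Hypothetical reranker scores, aligned with bm25_order.
reranker_scores = [0.2, 0.1, 0.9, 0.4]

reranked = rerank(bm25_order, reranker_scores)
initial_rank = bm25_order.index("p3") + 1
final_rank = reranked.index("p3") + 1
elevation = gold_elevation(initial_rank, final_rank)
```

Here the gold passage moves from rank 3 to rank 1, an elevation of 2. A negative value would mean re-ranking pushed the gold passage down.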

3. Atlas Episodic Memory

High-confidence verification traces are persisted to the episodic_memory collection in MongoDB Atlas. The live feed on the home page renders those verified chains.

  • Collection: episodic_memory
  • Fields: query, failure_mode, q_final, aggregator_chain, first_gold_rank
  • Filtered for quality with q_final >= 0.7
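
The quality filter above maps naturally onto a MongoDB query using the standard $gte operator. The sketch below shows that filter alongside a pure-Python equivalent so it can run without a database. The sample documents are invented for illustration, and the shape of aggregator_chain (a list of strings) is an assumption.

```python
QUALITY_THRESHOLD = 0.7

# MongoDB query filter equivalent to "q_final >= 0.7".
HIGH_CONFIDENCE_FILTER = {"q_final": {"$gte": QUALITY_THRESHOLD}}


def is_high_confidence(trace, threshold=QUALITY_THRESHOLD):
    """Pure-Python equivalent of HIGH_CONFIDENCE_FILTER for one document."""
    return trace.get("q_final", 0.0) >= threshold


# Illustrative documents only; the field values are made up.
traces = [
    {
        "query": "example question A",
        "failure_mode": "chain_break",
        "q_final": 0.82,
        "aggregator_chain": ["step 1", "step 2"],
        "first_gold_rank": 1,
    },
    {
        "query": "example question B",
        "failure_mode": "entity_drift",
        "q_final": 0.55,
        "aggregator_chain": ["step 1"],
        "first_gold_rank": 4,
    },
]

kept = [t for t in traces if is_high_confidence(t)]
```

With a PyMongo collection handle, the same dictionary could be passed to find(), for example collection.find(HIGH_CONFIDENCE_FILTER). Whether the demo filters at write time or at read time is not stated on this page.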

4. Metrics + Monitoring

Final evaluation metrics are stored in final_metrics and reused across homepage cards and the Results section.

  • Single-document collection: final_metrics
  • Metrics loaded once via a shared hook
  • Performance displayed in both hero cards and results dashboard

Production Notes

Deploying to Vercel requires a working MONGO_URI environment variable. Atlas network access must allow the deployment's IP range (or 0.0.0.0/0 during development). The frontend consumes cache collections directly from Atlas for the live demo.
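
A missing or malformed MONGO_URI is easier to diagnose when it fails fast at startup. The following is a minimal sketch of such a check, assuming a hypothetical helper named get_mongo_uri; it validates only the connection-string scheme and does not attempt a connection. The demo's own startup code may differ.

```python
import os

# MongoDB connection strings use one of these two schemes.
VALID_SCHEMES = ("mongodb://", "mongodb+srv://")


def get_mongo_uri(env=None):
    """Return MONGO_URI from the environment, or raise if it is unusable."""
    env = os.environ if env is None else env
    uri = env.get("MONGO_URI", "").strip()
    if not uri:
        raise RuntimeError("MONGO_URI is not set")
    if not uri.startswith(VALID_SCHEMES):
        raise RuntimeError("MONGO_URI must start with mongodb:// or mongodb+srv://")
    return uri


# Example with an explicit mapping instead of the real environment.
example_uri = get_mongo_uri({"MONGO_URI": "mongodb+srv://user:pass@cluster.example.net/demo"})
```

Passing a mapping explicitly keeps the check testable without touching the real environment. A valid URI can still be rejected by Atlas if the caller's IP is not on the access list, so this check complements the network rule above rather than replacing it.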