GREM Architecture
This page explains the production-ready pipeline used by the demo: BM25 retrieval, multi-agent reasoning, verification gating, and MongoDB Atlas episodic memory.
1. Retrieval + Failure Detection
BM25 retrieves candidate passages. If the gold answer does not appear near the top of the ranking, the system labels the query with a failure mode and triggers the GREM reasoning workflow.
- BM25 top-k baseline
- Failure modes: chain_break, entity_drift, distractor_confusion
- Cached initial ranking state stored in demo_traces
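The retrieval check above can be sketched as follows. This is a minimal illustration, not the demo's actual code: the whitespace tokenizer, the k1 and b defaults, the top_k cutoff, and the helper names bm25_rank, first_gold_rank, and needs_grem are all assumptions. It does not attempt to decide which failure mode applies, since the page does not describe how the modes are distinguished.

```python
import math
from collections import Counter


def bm25_rank(query, passages, k1=1.5, b=0.75):
    """Return passage indices sorted by descending Okapi BM25 score."""
    docs = [p.lower().split() for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def first_gold_rank(ranking, gold_ids):
    """1-based rank of the first gold passage, or None if none is ranked."""
    for rank, idx in enumerate(ranking, start=1):
        if idx in gold_ids:
            return rank
    return None


def needs_grem(ranking, gold_ids, top_k=3):
    """True when no gold passage sits in the top_k positions (top_k is an assumed cutoff)."""
    rank = first_gold_rank(ranking, gold_ids)
    return rank is None or rank > top_k
```

A query for which needs_grem returns True would be handed to the reasoning workflow; the rank returned by first_gold_rank is the kind of value the first_gold_rank field could hold.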
2. Multi-Agent Reasoning
GREM runs multiple reasoning agents to produce a verified reasoning chain. The chain is then used to re-rank the retrieval results with a distilled cross-encoder.
- Agents inspect BM25 candidate evidence
- The verified chain is built as an aggregate decision across agents
- Gold elevation is measured by reranker rank improvement
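One way to read "rank improvement" is the number of positions the first gold passage gains between the BM25 ordering and the re-ranked ordering. The sketch below assumes that reading; the function name gold_elevation and the passage identifiers are hypothetical.

```python
def gold_elevation(bm25_ranking, reranked, gold_ids):
    """Positions gained by the first gold passage after re-ranking.

    Positive values mean the gold passage moved up. Returns None when
    no gold passage appears in one of the two rankings.
    """
    def first_rank(ranking):
        # 1-based position of the first gold passage in this ranking.
        for rank, passage_id in enumerate(ranking, start=1):
            if passage_id in gold_ids:
                return rank
        return None

    before = first_rank(bm25_ranking)
    after = first_rank(reranked)
    if before is None or after is None:
        return None
    return before - after
```

For example, a gold passage that moves from rank 4 under BM25 to rank 1 after re-ranking has an elevation of 3.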
3. Atlas Episodic Memory
High-confidence verification traces are persisted to the MongoDB Atlas episodic_memory collection. The live feed on the home page renders those verified chains.
- Collection: episodic_memory
- Fields: query, failure_mode, q_final, aggregator_chain, first_gold_rank
- Filtered for quality with q_final >= 0.7
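The quality gate maps directly onto a MongoDB query filter using the $gte operator. The sketch below is an assumption about how such a read could look, not the demo's code: fetch_verified_traces, passes_quality_gate, and the limit default are hypothetical, and the collection argument is expected to behave like a PyMongo collection (find returning a cursor that supports limit).

```python
Q_FINAL_MIN = 0.7  # threshold stated on this page

# Filter document equivalent to q_final >= 0.7.
QUALITY_FILTER = {"q_final": {"$gte": Q_FINAL_MIN}}

# Project only the fields listed for episodic_memory.
FIELDS = ["query", "failure_mode", "q_final", "aggregator_chain", "first_gold_rank"]
PROJECTION = {name: 1 for name in FIELDS}


def passes_quality_gate(trace):
    """Apply the same threshold in Python, e.g. before persisting a trace."""
    return trace.get("q_final", 0.0) >= Q_FINAL_MIN


def fetch_verified_traces(collection, limit=20):
    """Read up to `limit` traces that pass the gate from a PyMongo-style collection."""
    return list(collection.find(QUALITY_FILTER, PROJECTION).limit(limit))
```

With PyMongo, the collection argument would typically be something like client[db_name]["episodic_memory"], where db_name is whatever database the deployment uses.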
4. Metrics + Monitoring
Final evaluation metrics are stored in final_metrics and reused across homepage cards and the Results section.
- Single document collection: final_metrics
- Metrics loaded once via a shared hook
- Performance displayed in both hero cards and results dashboard
Production Notes
Deploying to Vercel requires a working MONGO_URI environment variable. The Atlas network access list must allow the deployment IP range; during development, 0.0.0.0/0 can be used instead. The frontend consumes cache collections directly from Atlas for the live demo.
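A missing or malformed MONGO_URI is easier to diagnose if it fails at startup instead of on the first query. The sketch below shows one way to do that check; get_mongo_uri is a hypothetical helper, and the only facts it relies on are the variable name from this page and the two standard MongoDB connection string schemes.

```python
import os


def get_mongo_uri(env=None):
    """Return MONGO_URI from the environment, raising early if it is unusable."""
    env = os.environ if env is None else env
    uri = env.get("MONGO_URI", "").strip()
    if not uri:
        raise RuntimeError("MONGO_URI is not set")
    # MongoDB connection strings start with one of these two schemes.
    if not uri.startswith(("mongodb://", "mongodb+srv://")):
        raise RuntimeError("MONGO_URI must start with mongodb:// or mongodb+srv://")
    return uri
```

Passing a plain dict as env makes the check easy to exercise without touching the real environment.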