The benchmark runner evaluates Reflect in three phases:
  1. Pre-training test without memory
  2. Training with judged reviews that create memories
  3. Post-training test with memory enabled
This lets you measure whether the training phase improves accuracy on held-out questions. Reflect does well on benchmarks covering both agentic behaviour, such as LongBench, and multi-hop question answering, such as HotpotQA.
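The three phases can be sketched as a small loop. This is a minimal illustration and not the actual benchmark runner or the Reflect API: `ToyAgent`, `Question`, `accuracy`, and `run_benchmark` are hypothetical names, the "judge" is a plain equality check, and memory is a dictionary keyed by topic. It only shows how the pre-training and post-training scores on the same held-out questions are compared.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Question:
    """Hypothetical question record: a topic key, a prompt, and the expected answer."""
    key: str
    prompt: str
    expected: str


@dataclass
class ToyAgent:
    """Hypothetical stand-in for an agent with memory; not the Reflect API."""
    memory: Dict[str, str] = field(default_factory=dict)

    def answer(self, question: Question, use_memory: bool) -> str:
        # With memory enabled, recall a stored fact for this topic if one exists.
        if use_memory and question.key in self.memory:
            return self.memory[question.key]
        return "unknown"

    def review(self, question: Question, given: str) -> None:
        # Toy "judged review": a wrong answer creates a memory for that topic.
        if given != question.expected:
            self.memory[question.key] = question.expected


def accuracy(agent: ToyAgent, questions: List[Question], use_memory: bool) -> float:
    """Fraction of questions answered exactly as expected."""
    correct = sum(agent.answer(q, use_memory) == q.expected for q in questions)
    return correct / len(questions)


def run_benchmark(agent: ToyAgent, train: List[Question], test: List[Question]) -> Dict[str, float]:
    # Phase 1: pre-training test without memory.
    pre = accuracy(agent, test, use_memory=False)
    # Phase 2: training with judged reviews that create memories.
    for q in train:
        agent.review(q, agent.answer(q, use_memory=True))
    # Phase 3: post-training test with memory enabled.
    post = accuracy(agent, test, use_memory=True)
    return {"pre": pre, "post": post, "delta": post - pre}


train = [
    Question("france", "What is the capital of France?", "Paris"),
    Question("japan", "What is the capital of Japan?", "Tokyo"),
]
# Held-out questions: different prompts, two of which share a topic with training.
test = [
    Question("france", "Which city is France's capital?", "Paris"),
    Question("japan", "Which city is Japan's capital?", "Tokyo"),
    Question("peru", "Which city is Peru's capital?", "Lima"),
]
result = run_benchmark(ToyAgent(), train, test)
print(result)
```

The `delta` value is the quantity of interest: a positive difference between the post-training and pre-training accuracy suggests the memories created during training transferred to the held-out questions.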

Coming Soon