Run the HotPotQA benchmark to compare Reflect before and after learning.
The benchmark runner evaluates Reflect in three phases:
1. Pre-training test without memory
2. Training with judged reviews that create memories
3. Post-training test with memory enabled
This lets you measure whether the training phase improves accuracy on held-out questions.
Reflect does well on benchmarks of both agentic behaviour, such as LongBench, and multi-hop question answering, such as HotPotQA.
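The three phases can be illustrated with a small, self-contained sketch. It is not the actual benchmark runner or the Reflect API: `ToyAgent`, `review`, `run_benchmark`, the keyword-based memory, and the sample questions are all hypothetical stand-ins chosen only to show the shape of the evaluation loop.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

QA = Tuple[str, str]  # (question, gold answer)


def exact_match(prediction: str, gold: str) -> bool:
    """Judge: case-insensitive exact match (an assumed, simplified metric)."""
    return prediction.strip().lower() == gold.strip().lower()


def key_of(question: str) -> str:
    """Toy memory key: the last word of the question, lowercased."""
    return question.rstrip("?").split()[-1].lower()


@dataclass
class ToyAgent:
    """Hypothetical stand-in for Reflect; not the real API."""
    memory: Dict[str, str] = field(default_factory=dict)

    def answer(self, question: str, use_memory: bool) -> str:
        # Without memory the toy agent cannot answer anything.
        if use_memory:
            return self.memory.get(key_of(question), "unknown")
        return "unknown"

    def review(self, question: str, gold: str, correct: bool) -> None:
        # A judged review of a wrong answer creates a memory.
        if not correct:
            self.memory[key_of(question)] = gold


def accuracy(agent: ToyAgent, examples: List[QA], use_memory: bool) -> float:
    hits = sum(exact_match(agent.answer(q, use_memory), a) for q, a in examples)
    return hits / len(examples)


def run_benchmark(agent: ToyAgent, train: List[QA], held_out: List[QA]) -> Tuple[float, float]:
    # Phase 1: pre-training test without memory.
    pre = accuracy(agent, held_out, use_memory=False)
    # Phase 2: training with judged reviews that create memories.
    for question, gold in train:
        prediction = agent.answer(question, use_memory=True)
        agent.review(question, gold, exact_match(prediction, gold))
    # Phase 3: post-training test with memory enabled.
    post = accuracy(agent, held_out, use_memory=True)
    return pre, post


# Made-up examples; the held-out questions never appear in the training set.
TRAIN: List[QA] = [
    ("Who directed the film Alpha?", "Ann"),
    ("Who directed the film Beta?", "Bob"),
]
HELD_OUT: List[QA] = [
    ("Which person was the director of Alpha?", "Ann"),
    ("Which person was the director of Gamma?", "Cy"),
]

if __name__ == "__main__":
    pre, post = run_benchmark(ToyAgent(), TRAIN, HELD_OUT)
    print(f"pre={pre:.2f} post={post:.2f}")
```

The same held-out set is scored in phases 1 and 3, so any difference between the two accuracies comes from the memories created in phase 2. In this toy setup only the first held-out question benefits from a stored memory.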