Benchmark

The benchmark runner evaluates Reflect in three phases:

1. Pre-training test without memory
2. Training with judged reviews that create memories
3. Post-training test with memory enabled

This lets you measure whether the training phase improves accuracy on held-out questions. Reflect does well on both agentic-behaviour benchmarks such as LongBench and multi-hop question-answering benchmarks such as HotpotQA.
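The three phases can be sketched roughly as follows. This is a minimal illustration, not the Reflect API: `ToyAgent`, `Example`, `accuracy`, and `run_benchmark` are hypothetical names, the "judge" is a plain exact-match check standing in for an LLM judge or human reviewer, and memories are keyed by a made-up `topic` field so that lessons from training can carry over to held-out questions.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    topic: str      # hypothetical grouping key shared by related questions
    question: str
    answer: str


@dataclass
class ToyAgent:
    """Hypothetical stand-in for a memory-backed agent (not the Reflect API)."""
    memories: dict = field(default_factory=dict)

    def answer(self, ex: Example, use_memory: bool) -> str:
        # With memory enabled, reuse a lesson stored for this topic, if any.
        if use_memory and ex.topic in self.memories:
            return self.memories[ex.topic]
        return "unknown"

    def review(self, ex: Example, predicted: str) -> None:
        # Judged review: exact match stands in for an LLM or human judge.
        # A failed judgement creates a memory for the topic.
        if predicted != ex.answer:
            self.memories[ex.topic] = ex.answer


def accuracy(agent: ToyAgent, examples: list, use_memory: bool) -> float:
    correct = sum(agent.answer(ex, use_memory) == ex.answer for ex in examples)
    return correct / len(examples)


def run_benchmark(agent: ToyAgent, train: list, test: list) -> dict:
    # Phase 1: pre-training test without memory.
    pre = accuracy(agent, test, use_memory=False)
    # Phase 2: training with judged reviews that create memories.
    for ex in train:
        agent.review(ex, agent.answer(ex, use_memory=True))
    # Phase 3: post-training test with memory enabled.
    post = accuracy(agent, test, use_memory=True)
    return {"pre": pre, "post": post, "delta": post - pre}


train = [
    Example("units", "Which unit do timeouts use?", "ms"),
    Example("auth", "How do clients authenticate?", "token"),
]
test = [
    Example("units", "Which unit does the retry delay use?", "ms"),
    Example("auth", "What credential does the API expect?", "token"),
    Example("paging", "How are results paginated?", "cursor"),
]

result = run_benchmark(ToyAgent(), train, test)
print(result)
```

The `delta` value is the quantity of interest: a positive difference between the post-training and pre-training accuracy on the same held-out set suggests that the memories created during training helped. Keeping the training and test sets disjoint is what makes that comparison meaningful.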

