Reflect does not require a single evaluation pipeline. You can score runs automatically with an LLM judge, or record the trace first and submit a human review later. In both cases, the SDK ultimately stores the same review signal:
  • review_result="pass" or review_result="fail"
  • Optional feedback_text explaining why the run passed or failed

Option 1: LLM as judge

Use this pattern when you already have a rubric the model can apply consistently, such as exact-answer checks, formatting checks, or lightweight factual grading.

Flow

  1. Retrieve memories and run your agent.
  2. Ask a second model to judge the result.
  3. Save the trace with the judge reflection inline.
import json

from openai import OpenAI
from reflect_sdk import ReflectClient

reflect = ReflectClient(
    base_url="http://localhost:8000",
    api_key="rf_live_...",
    project_id="my-project",
)

judge = OpenAI()
task = "Answer the support question using the refund policy."

augmented = reflect.augment_with_memories(task=task, limit=3)

trajectory = [
    {"role": "user", "content": augmented.augmented_task},
]

# Replace this with your own generation step.
final_response = "The customer is eligible for a refund within 30 days of purchase."
trajectory.append({"role": "assistant", "content": final_response})

judge_prompt = f"""
You are grading an assistant response.

Task:
{task}

Candidate response:
{final_response}

Return JSON with this schema:
{{"passed": true, "feedback": "short reason"}}
"""

judge_response = judge.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[{"role": "user", "content": judge_prompt}],
    # Ask for a JSON object so json.loads below does not have to
    # strip prose or Markdown fences from the reply.
    response_format={"type": "json_object"},
)

reflection = json.loads(judge_response.choices[0].message.content)

trace = reflect.create_trace_and_wait(
    task=task,
    trajectory=trajectory,
    final_response=final_response,
    retrieved_memory_ids=[memory.id for memory in augmented.memories],
    model="gpt-5.4-mini",
    metadata={"source": "llm-judge"},
    review_result="pass" if reflection["passed"] else "fail",
    feedback_text=reflection["feedback"],
)

print(trace.review.result)
print(trace.created_memory_id)
This works well when you want evaluation to happen in the same job that generated the answer.
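Judge replies are not always clean JSON, even when the prompt asks for it: models sometimes wrap the object in prose or a Markdown fence. A defensive parsing sketch, in which the helper name and fallback policy (treat an unparseable reply as a failed review) are our own choices:

```python
import json

def parse_judge_reply(raw: str) -> dict:
    """Parse a judge reply into {"passed": bool, "feedback": str}.

    Hypothetical helper: strips a surrounding Markdown fence if present,
    and treats anything unparseable as a failed review instead of crashing.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Remove the backtick fences, then drop the opening line
        # (which may carry a language tag such as "json").
        text = text.strip("`")
        text = text.split("\n", 1)[1] if "\n" in text else text
    try:
        parsed = json.loads(text)
        return {
            "passed": bool(parsed.get("passed")),
            "feedback": str(parsed.get("feedback", "")),
        }
    except (json.JSONDecodeError, AttributeError):
        return {"passed": False, "feedback": "Judge reply was not valid JSON."}
```

Swapping `reflection = json.loads(...)` for a guard like this keeps one malformed judge reply from aborting the whole evaluation job.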

Option 2: Human review

Use this pattern when correctness depends on domain expertise, the task is subjective, or you want reviewers to inspect the answer before creating a memory.

Flow

  1. Store the trace immediately after the run finishes.
  2. Show the result in your own review queue or internal tool.
  3. Submit the human reflection later with review_trace(...).
from reflect_sdk import ReflectClient

client = ReflectClient(
    base_url="http://localhost:8000",
    api_key="rf_live_...",
    project_id="my-project",
)

task = "Write release notes for the latest API changes."
trajectory = [
    {"role": "user", "content": task},
    {"role": "assistant", "content": "Here are draft release notes..."},
]
final_response = "Here are draft release notes..."

submission = client.create_trace(
    task=task,
    trajectory=trajectory,
    final_response=final_response,
    retrieved_memory_ids=[],
    model="gpt-5.4-mini",
    metadata={"source": "human-review-queue"},
)

print(submission.id)

# Later, after a reviewer checks the result:
trace = client.review_trace(
    trace_id=submission.id,
    result="fail",
    feedback_text="Missed the breaking authentication change and included an unsupported endpoint.",
)

print(trace.review.result)
print(trace.created_memory_id)
This is the best fit when a reviewer needs to approve or reject work after the original run has completed.

Choosing a workflow

Workflow        Best for
LLM judge       Fast automated eval loops, regression checks, rubric-based scoring
Human review    Subjective tasks, high-stakes outputs, expert verification
You can also mix both patterns: run an LLM judge first, then send borderline or failed cases to a human reviewer before submitting the final review outcome to Reflect.
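That mixed pattern can be sketched as a small routing function. The review_trace call mirrors the Option 2 example above; enqueue_for_human_review is a hypothetical hook into your own review tooling, and the pass/fail routing rule is only one reasonable policy:

```python
def route_review(client, trace_id: str, judge_verdict: dict,
                 enqueue_for_human_review) -> str:
    """Auto-review judge passes; send failures to a human queue.

    Sketch only: client is a ReflectClient as used elsewhere on this page,
    and enqueue_for_human_review is a hypothetical callback into your tooling.
    """
    if judge_verdict["passed"]:
        # Clear pass: record the judge's verdict immediately.
        client.review_trace(
            trace_id=trace_id,
            result="pass",
            feedback_text=judge_verdict.get("feedback", ""),
        )
        return "auto-passed"
    # Failed (or borderline) case: defer to a human reviewer, who will
    # call review_trace(...) later from the queue.
    enqueue_for_human_review(trace_id, judge_verdict)
    return "queued"
```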