Reflect does not require a single evaluation pipeline. You can score runs automatically with an LLM judge, or record the trace first and submit a human review later. In both cases, the SDK ultimately stores the same review signal:
  • review_result="pass" or review_result="fail"
  • Optional feedback_text explaining why the run passed or failed

Option 1: LLM as judge

Use this pattern when you already have a rubric the model can apply consistently, such as exact-answer checks, formatting checks, or lightweight factual grading.

Flow

  1. Retrieve memories and run your agent.
  2. Ask a second model to judge the result.
  3. Save the trace with the judge reflection inline.
import json

from openai import OpenAI
from reflect_sdk import ReflectClient

reflect = ReflectClient(
    base_url="http://localhost:8000",
    api_key="rf_live_...",
    project_id="my-project",
)

judge = OpenAI()
task = "Answer the support question using the refund policy."

augmented = reflect.augment_with_memories(task=task, limit=3)

trajectory = [
    {"role": "user", "content": augmented.augmented_task},
]

# Replace this with your own generation step.
final_response = "The customer is eligible for a refund within 30 days of purchase."
trajectory.append({"role": "assistant", "content": final_response})

judge_prompt = f"""
You are grading an assistant response.

Task:
{task}

Candidate response:
{final_response}

Return JSON with this schema:
{{"passed": true, "feedback": "short reason"}}
"""

judge_response = judge.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[{"role": "user", "content": judge_prompt}],
    # Ask for a JSON object so json.loads below does not have to
    # strip prose or Markdown fences from the reply.
    response_format={"type": "json_object"},
)

reflection = json.loads(judge_response.choices[0].message.content)

trace = reflect.create_trace_and_wait(
    task=task,
    trajectory=trajectory,
    final_response=final_response,
    retrieved_memory_ids=[memory.id for memory in augmented.memories],
    model="gpt-5.4-mini",
    metadata={"source": "llm-judge"},
    review_result="pass" if reflection["passed"] else "fail",
    feedback_text=reflection["feedback"],
)

print(trace.review.result)
print(trace.created_memory_id)
This works well when you want evaluation to happen in the same job that generated the answer.
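Judge replies are not always clean JSON, even when the prompt asks for it: models sometimes wrap the object in prose or a Markdown fence. A defensive parsing sketch, in which the helper name and fallback policy (treat an unparseable reply as a failed review) are our own choices:

```python
import json

def parse_judge_reply(raw: str) -> dict:
    """Parse a judge reply into {"passed": bool, "feedback": str}.

    Hypothetical helper: strips a surrounding Markdown fence if present,
    and treats anything unparseable as a failed review instead of crashing.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Remove the backtick fences, then drop the opening line
        # (which may carry a language tag such as "json").
        text = text.strip("`")
        text = text.split("\n", 1)[1] if "\n" in text else text
    try:
        parsed = json.loads(text)
        return {
            "passed": bool(parsed.get("passed")),
            "feedback": str(parsed.get("feedback", "")),
        }
    except (json.JSONDecodeError, AttributeError):
        return {"passed": False, "feedback": "Judge reply was not valid JSON."}
```

Swapping `reflection = json.loads(...)` for a guard like this keeps one malformed judge reply from aborting the whole evaluation job.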

Option 2: Human review

Use this pattern when correctness depends on domain expertise, the task is subjective, or you want reviewers to inspect the answer before creating a memory.

Flow

  1. Store the trace immediately after the run finishes.
  2. Show the result in your own review queue or internal tool.
  3. Submit the human reflection later with review_trace(...).
from reflect_sdk import ReflectClient

client = ReflectClient(
    base_url="http://localhost:8000",
    api_key="rf_live_...",
    project_id="my-project",
)

task = "Write release notes for the latest API changes."
trajectory = [
    {"role": "user", "content": task},
    {"role": "assistant", "content": "Here are draft release notes..."},
]
final_response = "Here are draft release notes..."

submission = client.create_trace(
    task=task,
    trajectory=trajectory,
    final_response=final_response,
    retrieved_memory_ids=[],
    model="gpt-5.4-mini",
    metadata={"source": "human-review-queue"},
)

print(submission.id)

# Later, after a reviewer checks the result:
trace = client.review_trace(
    trace_id=submission.id,
    result="fail",
    feedback_text="Missed the breaking authentication change and included an unsupported endpoint.",
)

print(trace.review.result)
print(trace.created_memory_id)
This is the best fit when a reviewer needs to approve or reject work after the original run has completed.

Choosing a workflow

Workflow        Best for
LLM judge       Fast automated eval loops, regression checks, rubric-based scoring
Human review    Subjective tasks, high-stakes outputs, expert verification
You can also mix both patterns: run an LLM judge first, then send borderline or failed cases to a human reviewer before submitting the final review outcome to Reflect.
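That mixed pattern can be sketched as a small routing function. The review_trace call mirrors the Option 2 example above; enqueue_for_human_review is a hypothetical hook into your own review tooling, and the pass/fail routing rule is only one reasonable policy:

```python
def route_review(client, trace_id: str, judge_verdict: dict,
                 enqueue_for_human_review) -> str:
    """Auto-review judge passes; send failures to a human queue.

    Sketch only: client is a ReflectClient as used elsewhere on this page,
    and enqueue_for_human_review is a hypothetical callback into your tooling.
    """
    if judge_verdict["passed"]:
        # Clear pass: record the judge's verdict immediately.
        client.review_trace(
            trace_id=trace_id,
            result="pass",
            feedback_text=judge_verdict.get("feedback", ""),
        )
        return "auto-passed"
    # Failed (or borderline) case: defer to a human reviewer, who will
    # call review_trace(...) later from the queue.
    enqueue_for_human_review(trace_id, judge_verdict)
    return "queued"
```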