Overview

A memory is a concise, LLM-generated reflection distilled from a past agent run. It captures what the agent did, what worked, what went wrong, and what to do differently next time - then stores that knowledge so future runs can benefit from it.

Memories exist because LLMs are stateless: each call starts from scratch with no awareness of what happened last time. Reflect solves this by maintaining a project-level memory bank that accumulates experience across runs, users, and sessions. Before each task, your agent queries this bank and receives the most relevant past reflections, ranked by both semantic similarity (is this about the same kind of problem?) and utility (did this advice actually lead to good outcomes?).

This dual ranking is the core design choice behind Reflect’s memory system. Pure semantic search returns relevant results, but it can’t distinguish between a reflection that led to a correct answer and one that didn’t. Utility scores add a learned quality signal that improves over time as more traces are reviewed.

How memories are created

Memories are never written directly. They are always generated from a reviewed trace:
  1. Your agent completes a task and you submit the trace with a review ("pass" or "fail")
  2. An LLM reads the trace (task, trajectory, outcome, feedback) and generates a reflection
  3. The task is embedded and stored in the memory bank alongside the reflection, with an initial q_value of 0.5
  4. On future runs, the memory is retrieved when the query task is semantically similar and its utility score keeps it ranked high (the blend is controlled by lambda_, described below)
This means you can’t manually insert arbitrary text into memory - every memory has a traceable origin, and its quality is tracked over time.
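For example, step 1 is a single call (a minimal sketch - client is assumed to be an already-initialized Reflect client, and create_trace is shown in full under Best practices):

# Submitting a reviewed trace is what triggers reflection generation -
# steps 2 and 3 happen on the Reflect side, not in your code.
trace = client.create_trace(
    task="Parse the uploaded CSV and return the first 5 rows",
    trajectory=[...],         # your agent's steps for this run
    review_result="pass",     # "pass" or "fail"
)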

How utility scores evolve

Utility scores are updated every time a memory is retrieved and the run that used it is reviewed. A memory starts at q_value = 0.5. If it’s retrieved in a run that passes, its utility nudges upward. If retrieved in a run that fails, it nudges downward. Over many reviews, useful memories converge toward 1.0 and unhelpful ones toward 0.0.
Utility scores only update for memories that were retrieved and used in a run. If a memory exists but wasn’t retrieved for a particular trace, its utility is unaffected by that trace’s review.
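The exact update rule isn’t documented here; the sketch below uses an exponential-moving-average update, which is an assumption for illustration only, but it produces the described behavior - nudging toward 1.0 on passes and toward 0.0 on fails:

# ASSUMPTION: an illustrative update rule, not Reflect's actual implementation.
def updated_q_value(q: float, passed: bool, alpha: float = 0.1) -> float:
    reward = 1.0 if passed else 0.0
    return q + alpha * (reward - q)   # move a fraction alpha toward the outcome

q = 0.5                               # initial q_value for a new memory
for _ in range(20):                   # twenty consecutive passing reviews
    q = updated_q_value(q, passed=True)
print(round(q, 3))                    # 0.939 - converging toward 1.0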

Retrieving memories

query_memories - raw retrieval

Returns a ranked list of Memory objects without modifying the task text. Use this when you want full control over how memories are injected into your prompt.
memories = client.query_memories(
    task="How do I handle rate limits in an API client?",
    limit=10,
    lambda_=0.5,
)

for m in memories:
    print(f"[q={m.q_value:.2f}] {m.task}")
    print(f"  {m.reflection[:100]}...")

augment_with_memories - retrieval + formatting

Queries memories and appends them to the task as a structured text block. This is the most common method - it returns a ready-to-use prompt that you pass directly to the LLM.
augmented = client.augment_with_memories(
    task="Implement exponential backoff for retries",
    limit=5,
    lambda_=0.6,
)

# Pass this to your LLM - it includes the task + relevant memories
prompt_for_llm = augmented.augmented_task

# The Memory objects are also available if you need their IDs later
retrieved = augmented.memories
The formatted output groups memories into three sections based on their review outcomes: Successful, Failed, and Other (unreviewed). For example:
Implement exponential backoff for retries

Relevant memories:

Successful memories:

--- Memory 1 ---
Past task:
Handle transient API failures

Reflection:
Use a base delay with exponential increase and random jitter...

Failed memories:

--- Memory 2 ---
Past task:
Retry failed HTTP requests

Reflection:
Fixed delays without jitter caused thundering herd issues...
If no memories are found, augmented_task returns the original task unchanged - so you can always use it safely without checking.

Parameters

Parameter   Type    Default    Description
task        str     required   The task text to search against. The API embeds this and finds semantically similar memories.
limit       int     10         Maximum number of memories to return.
lambda_     float   0.5        Blend weight between similarity and utility (see below).

The Memory object

Field        Type          Description
id           str           Unique identifier - pass these as retrieved_memory_ids when creating traces
task         str           The past task this memory was generated from
reflection   str           LLM-generated reflection text
q_value      float         Learned quality score (0-1, higher = better track record)
similarity   float         Cosine similarity to the query task
score        float         Final ranking score: (1 - lambda_) * similarity + lambda_ * q_value
success      bool | None   Whether the source trace passed review (None if unreviewed)

Tuning retrieval with lambda_

The lambda_ parameter controls the balance between semantic relevance and learned quality when ranking memories:
score = (1 - lambda_) * similarity + lambda_ * q_value
Value     Behavior                   When to use
0.0       Pure semantic similarity   Early in a project when you have few reviewed traces and utility scores haven’t differentiated yet
0.5       Equal weight (default)     General-purpose starting point - works well for most projects
0.7–0.9   Favor utility              Mature projects with many reviewed traces - surface memories with the best track records
1.0       Pure utility               Only retrieve the historically most successful memories, regardless of semantic match
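To see how lambda_ changes ranking, compare two hypothetical memories (numbers made up for illustration): A is more similar to the query, while B has the better track record.

# Hypothetical memories: A is topically closer, B has proven more useful.
candidates = {
    "A": {"similarity": 0.90, "q_value": 0.30},
    "B": {"similarity": 0.60, "q_value": 0.95},
}

def blended_score(m: dict, lambda_: float) -> float:
    return (1 - lambda_) * m["similarity"] + lambda_ * m["q_value"]

for lam in (0.0, 0.5, 0.9):
    a = blended_score(candidates["A"], lam)
    b = blended_score(candidates["B"], lam)
    print(f"lambda_={lam}: A={a:.3f}  B={b:.3f}")
# lambda_=0.0: A=0.900  B=0.600  (pure similarity - A wins)
# lambda_=0.5: A=0.600  B=0.775  (B's track record flips the ranking)
# lambda_=0.9: A=0.360  B=0.915  (utility dominates)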

When to adjust

  • Increase lambda_ if your agent keeps retrieving relevant-sounding memories that lead to bad outcomes. The memories are topically similar but not actually helpful - utility scores will down-rank them.
  • Decrease lambda_ if your agent needs broader context from different past tasks. Strict utility ranking can narrow retrieval too much, especially when the most successful memories are about a different subtopic.
  • Keep 0.5 if you’re unsure. The default works well until you have enough reviewed traces to notice a pattern.

Best practices

Always pass retrieved_memory_ids when creating traces - this is what closes the learning loop. When you create a trace, pass the IDs of the memories that were retrieved for that run. Without them, Reflect can’t update utility scores when the trace is reviewed, and the memory ranking won’t improve. The context manager (client.trace()) and decorator (@reflect_trace) handle this automatically; if you use create_trace directly, you must pass the IDs yourself.
trace = client.create_trace(
    task="...",
    trajectory=[...],
    retrieved_memory_ids=[m.id for m in augmented.memories],  # don't forget this
    review_result="pass",
)
The task string is what Reflect embeds and matches against when retrieving memories. Vague tasks like "do the thing" will match poorly; descriptive tasks like "Parse the uploaded CSV, validate column types, and return the first 5 rows" will retrieve more relevant memories. The task is also included in the reflection prompt - a clear task helps the LLM generate better reflections.
More memories means more context in the prompt, which costs tokens and can dilute the signal. Start with a limit of 3-5 and increase it if the agent seems to be missing relevant context.
Memories start with q_value=0.5 and only differentiate through reviews. An unreviewed project has flat utility scores - every memory is ranked equally. Reviews are what make the system learn. Even a few dozen reviews can significantly improve retrieval quality.
augment_with_memories already groups memories into “Successful”, “Failed”, and “Other” sections. The LLM sees the distinction naturally. You don’t need to filter out failed memories - they contain valuable “what not to do” context.