Every RAG system I’ve seen – including the one I’ve written a handbook about on this site – has the same basic problem.
It does not learn.
You ingest 500 documents. You ask a question. The system retrieves the three most similar chunks and passes them to the LLM. Repeat for the next query.
The system knows exactly as much as it did the day before. It is a library that never builds a card catalog, never cross-references its shelves, never notices that three of its books contradict each other.
This is what I set out to fix with the knowledge reflection layer. After each ingest, the system searches for semantically related documents already in the index and asks an LLM to synthesize what is new, how it connects, and what gaps remain. That synthesis is then embedded, stored, and boosted in search results.
As you add more documents, the knowledge base gets better – not just bigger.
This tutorial shows you how to build it.
What will you build?
In this tutorial, you will build a post-ingest reflection pipeline that:
Fires automatically after each document ingest.
Finds the most semantically related documents already in the index.
Asks Kimi K2.5 to synthesize a three-sentence insight linking the new document to existing knowledge.
Stores the reflection with doc_type=reflection and a 1.5× ranking boost in search results.
Consolidates reflections into summaries every three entries.
By the end, searching your knowledge base will return both raw document chunks and reflection notes that the system wrote for itself during ingest.
Prerequisites
You will need:
A Cloudflare account – works with the free tier.
Node.js v18+ and the Wrangler CLI installed (npm install -g wrangler)
Basic TypeScript familiarity
There are no external API keys. Everything runs on Cloudflare’s infrastructure.
Setting Up the Base System
If you’ve already built a RAG system from my freeCodeCamp handbook, skip this section – your system is ready for the reflection layer.
If you’re just starting out, this section gets you to a working base in about 15 minutes.
Scaffold the project.
npm create cloudflare@latest rag-reflection-system
cd rag-reflection-system
Select: Hello World Example → TypeScript → No deployment yet.
Create the Vectorize index and D1 database
npx wrangler vectorize create rag-index --dimensions=384 --metric=cosine
npx wrangler d1 create rag-db
Configure wrangler.toml.
name = "rag-reflection-system"
main = "src/index.ts"
compatibility_date = "2026-01-01"
[[vectorize]]
binding = "VECTORIZE"
index_name = "rag-index"
[[d1_databases]]
binding = "DB"
database_name = "rag-db"
database_id = "YOUR_DB_ID"
[ai]
binding = "AI"
Create the documents table
-- migrations/001_init.sql
CREATE TABLE IF NOT EXISTS documents (
id TEXT PRIMARY KEY,
content TEXT NOT NULL,
source TEXT,
date_created TEXT DEFAULT (datetime('now'))
);
npx wrangler d1 execute rag-db --remote --file=./migrations/001_init.sql
Add ingest and search endpoints
Replace src/index.ts with this minimal working system:
export interface Env {
  VECTORIZE: VectorizeIndex;
  DB: D1Database;
  AI: Ai;
}
export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const url = new URL(request.url);
    if (url.pathname === '/ingest' && request.method === 'POST') {
      const { id, content, source } = await request.json() as any;
      const embResult = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
        text: [content],
      }) as any;
      const vector = embResult.data[0];
      await env.VECTORIZE.upsert([{
        id,
        values: vector,
        metadata: { content, source: source ?? '', doc_type: 'raw' },
      }]);
      await env.DB.prepare(
        'INSERT OR REPLACE INTO documents (id, content, source) VALUES (?, ?, ?)'
      ).bind(id, content, source ?? '').run();
      return Response.json({ success: true, id });
    }
    if (url.pathname === '/search' && request.method === 'POST') {
      const { query } = await request.json() as any;
      const embResult = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
        text: [query],
      }) as any;
      const vector = embResult.data[0];
      const results = await env.VECTORIZE.query(vector, {
        topK: 5,
        returnMetadata: 'all',
      });
      const context = results.matches
        .map(m => m.metadata?.content as string)
        .filter(Boolean)
        .join('\n\n');
      const answer = await env.AI.run('@cf/moonshotai/kimi-k2.5', {
        messages: [
          { role: 'system', content: 'Answer using only the context provided.' },
          { role: 'user', content: `Context:\n${context}\n\nQuestion: ${query}` },
        ],
        max_tokens: 256,
      }) as any;
      return Response.json({ answer: answer.response, sources: results.matches.map(m => m.id) });
    }
    return new Response('RAG system running', { status: 200 });
  },
};
Deploy and verify.
npx wrangler deploy
Try this:
# Ingest a document (replace <your-worker-url> with your deployed Worker URL)
curl -X POST https://<your-worker-url>/ingest \
  -H "Content-Type: application/json" \
  -d '{"id": "doc-001", "content": "Cursor pagination beats offset pagination for live-updating datasets because offset becomes unreliable when rows are inserted or deleted during pagination."}'
# Search
curl -X POST https://<your-worker-url>/search \
  -H "Content-Type: application/json" \
  -d '{"query": "what pagination approach should I use?"}'
If you get a grounded answer back, the base system is working. The next sections add the reflection layer on top of this foundation.
Why does standard RAG have a memory problem?
Standard RAG retrieval is stateless. Every query starts cold. The system has no memory of what it has found before, no synthesis of what it has learned across documents, and no accumulated sense of which questions remain unanswered.
Imagine you have ingested 200 documents about your product. Twelve of them touch on pricing decisions made over the previous year. No single document has the complete picture – it is scattered across quarterly reports, meeting notes, internal Slack exports, a few concept pages.
A customer asks: “Why did we change our pricing structure?”
Standard RAG retrieves the three most similar chunks. If those three chunks collectively contain the answer, great. If they don’t – if the real answer requires synthesis across those twelve documents – the system has no mechanism for that. It returns fragments. The LLM takes its best guess.
The reflection layer addresses this directly. When the twelfth pricing document is ingested, the system finds the eleven related documents, synthesizes across them, and stores the synthesis as a retrievable artifact. The answer to “why we changed our pricing structure” is in the index before anyone ever asks.
Not smart retrieval – smart indexing.
Step 1: Schema Update
The reflection layer requires three new fields in your D1 documents table. Run this migration:
-- migrations/003_add_reflection_fields.sql
ALTER TABLE documents ADD COLUMN doc_type TEXT DEFAULT 'raw';
ALTER TABLE documents ADD COLUMN reflection_score REAL DEFAULT 0;
ALTER TABLE documents ADD COLUMN parent_id TEXT;
Apply this:
# use your existing database name (e.g. mcp-knowledge-db from the handbook system) instead of rag-db if you already have one
npx wrangler d1 execute rag-db --remote --file=./migrations/003_add_reflection_fields.sql
doc_type separates raw documents (raw), single-document reflections (reflection), and multi-reflection summaries (summary). You’ll use this field to filter – returning only reflections for callers who want the distilled view, or excluding them for callers who want raw source chunks.
Step 2: Reflection Engine
Create src/engines/reflection.ts. This is the core of the layer.
import { Env } from '../types/env';
import { resolveEmbeddingModel, resolveReflectionModel } from '../config/models';
const REFLECTION_BOOST = 1.5;
const CONSOLIDATION_THRESHOLD = 3; // consolidate every N new reflections
export async function reflect(
newDocId: string,
newDocContent: string,
env: Env
): Promise<void> {
// 1. Find semantically related documents already in the index
const embModel = resolveEmbeddingModel(env.EMBEDDING_MODEL);
const embResult = await env.AI.run(embModel.id as any, {
text: [newDocContent.slice(0, 512)],
});
const queryVector = (embResult as any).data?.[0];
if (!queryVector) return;
const related = await env.VECTORIZE.query(queryVector, {
topK: 5,
filter: { doc_type: { $eq: 'raw' } },
returnMetadata: 'all',
});
const relatedDocs = (related.matches ?? []).filter(
m => m.id !== newDocId && (m.score ?? 0) > 0.65
);
if (relatedDocs.length === 0) return; // nothing related yet — skip
// 2. Build synthesis prompt
const relatedSummaries = relatedDocs
.slice(0, 3)
.map((m, i) => `Document ${i + 1}: ${String(m.metadata?.content ?? '').slice(0, 300)}`)
.join('\n\n');
const prompt = `You are synthesising knowledge across documents in a knowledge base.
New document:
${newDocContent.slice(0, 600)}
Related existing documents:
${relatedSummaries}
Write exactly three sentences:
1. What the new document adds that the existing documents don't already cover
2. How the new document connects to or extends the existing documents
3. What gap or question remains unanswered across all these documents
Be specific. Reference actual content. Do not summarise — synthesise.`;
// 3. Call the reflection model
const reflModel = resolveReflectionModel(env.REFLECTION_MODEL);
const llmResp = await env.AI.run(reflModel.id as any, {
messages: [{ role: 'user', content: prompt }],
max_tokens: 180,
});
const reflectionText = (llmResp as any)?.response?.trim();
if (!reflectionText || reflectionText.length < 40) return;
// 4. Embed and store the reflection
const reflEmbResult = await env.AI.run(embModel.id as any, {
text: [reflectionText],
});
const reflVector = (reflEmbResult as any).data?.[0];
if (!reflVector) return;
const reflectionId = `refl_${newDocId}_${Date.now()}`;
await env.VECTORIZE.upsert([
{
id: reflectionId,
values: reflVector,
metadata: {
content: reflectionText,
doc_type: 'reflection',
parent_id: newDocId,
reflection_score: REFLECTION_BOOST,
source_doc_ids: relatedDocs.map(m => m.id).join(','),
date_created: new Date().toISOString(),
},
},
]);
await env.DB.prepare(
`INSERT INTO documents
(id, content, doc_type, reflection_score, parent_id, date_created)
VALUES (?, ?, 'reflection', ?, ?, ?)`
)
.bind(reflectionId, reflectionText, REFLECTION_BOOST, newDocId, new Date().toISOString())
.run();
// 5. Check if consolidation is due
const recentCount = await env.DB
.prepare(`SELECT COUNT(*) as cnt FROM documents WHERE doc_type = 'reflection' AND date_created > datetime('now', '-1 hour')`)
.first<{ cnt: number }>();
if ((recentCount?.cnt ?? 0) >= CONSOLIDATION_THRESHOLD) {
await consolidate(env);
}
}
Two things are worth noting here.
First, the semantic threshold (score > 0.65) matters. Too low and you’re synthesizing against irrelevant documents. Too high and you rarely find connections. 0.65 works well with bge-small. You can bump it to around 0.72 with qwen3-0.6b (1024d), where scores cluster higher.
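If you switch embedding models often, it may be worth making that threshold configurable instead of hardcoding 0.65. Here’s a minimal sketch – the SIMILARITY_THRESHOLD variable and the per-model defaults are my own illustration, not something the repo ships:
// Sketch only: per-model threshold defaults with an optional env override.
// SIMILARITY_THRESHOLD is a hypothetical var you would add to wrangler.toml yourself.
const DEFAULT_THRESHOLDS: Record<string, number> = {
  '@cf/baai/bge-small-en-v1.5': 0.65, // 384d, looser score clustering
  // add an entry (e.g. ~0.72) for a higher-dimensional model if you swap embeddings
};
function similarityThreshold(env: Env, embeddingModelId: string): number {
  const override = Number((env as any).SIMILARITY_THRESHOLD ?? NaN);
  if (Number.isFinite(override)) return override;
  return DEFAULT_THRESHOLDS[embeddingModelId] ?? 0.65;
}
// In reflect(), replace the hardcoded 0.65 with:
//   (m.score ?? 0) > similarityThreshold(env, embModel.id)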
The three-sentence structure is intentional. Each sentence does a specific job: what’s new, how it connects, what’s left open. This keeps reflections useful for retrieval. A freeform synthesis prompt produces beautiful prose that doesn’t retrieve well. This structure produces retrievable artifacts.
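If you want to enforce that shape rather than trust the model, a small guard before embedding the reflection is enough. This is a sketch I’m adding for illustration – the code above relies on the length check alone:
// Sketch: only store reflections that roughly match the three-sentence shape.
function isWellFormedReflection(text: string): boolean {
  const sentences = text
    .split(/(?<=[.!?])\s+/) // naive sentence split, good enough for a guard
    .filter(s => s.trim().length > 0);
  return sentences.length >= 3 && sentences.length <= 4 && text.length >= 40;
}
// In reflect(), alongside the existing check:
//   if (!reflectionText || !isWellFormedReflection(reflectionText)) return;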
Step 3: Consolidation
As reflections accumulate, they need their own synthesis layer – otherwise you’re just adding noise at a higher abstraction level.
Add to src/engines/reflection.ts:
export async function consolidate(env: Env): Promise<void> {
// Fetch recent reflections not yet consolidated
const recent = await env.DB
.prepare(
`SELECT id, content FROM documents
WHERE doc_type = 'reflection'
AND id NOT IN (
SELECT DISTINCT parent_id FROM documents
WHERE doc_type = 'summary' AND parent_id IS NOT NULL
)
ORDER BY date_created DESC
LIMIT 6`
)
.all<{ id: string; content: string }>();
if (!recent.results || recent.results.length < CONSOLIDATION_THRESHOLD) return;
const reflectionTexts = recent.results.map((r, i) => `Reflection ${i + 1}: ${r.content}`).join('\n\n');
const prompt = `You are consolidating multiple knowledge reflections into a single compressed insight.
${reflectionTexts}
Write two to three sentences that capture the most important cross-cutting pattern or tension across these reflections. What does the knowledge base now understand that it didn't before these documents were added? What's the most important open question?
Be precise. No preamble.`;
const reflModel = resolveReflectionModel(env.REFLECTION_MODEL);
const llmResp = await env.AI.run(reflModel.id as any, {
messages: [{ role: 'user', content: prompt }],
max_tokens: 320,
});
const summaryText = (llmResp as any)?.response?.trim();
if (!summaryText || summaryText.length < 40) return;
const embModel = resolveEmbeddingModel(env.EMBEDDING_MODEL);
const embResult = await env.AI.run(embModel.id as any, { text: [summaryText] });
const summaryVector = (embResult as any).data?.[0];
if (!summaryVector) return;
const summaryId = `summary_${Date.now()}`;
await env.VECTORIZE.upsert([
{
id: summaryId,
values: summaryVector,
metadata: {
content: summaryText,
doc_type: 'summary',
reflection_score: REFLECTION_BOOST * 1.2,
source_reflection_ids: recent.results.map(r => r.id).join(','),
date_created: new Date().toISOString(),
},
},
]);
await env.DB.prepare(
`INSERT INTO documents (id, content, doc_type, reflection_score, date_created)
VALUES (?, ?, 'summary', ?, ?)`
)
.bind(summaryId, summaryText, REFLECTION_BOOST * 1.2, new Date().toISOString())
.run();
}
Summaries get a 1.2× multiplier on top of the base reflection boost. In search results, a summary synthesized from twelve related documents should rank above a chunk from a single document on broad conceptual questions. On specific factual questions, raw chunks will still score higher. The ranking is self-organizing.
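To make the stacking concrete, here’s how the multipliers play out on some made-up raw similarity scores (illustrative numbers, not from a real run):
// Illustrative only: the raw similarity scores are invented.
const rawChunk   = 0.62;             // no boost applied
const reflection = 0.55 * 1.5;       // 0.825 (reflection_score = 1.5)
const summary    = 0.50 * 1.5 * 1.2; // 0.90  (summaries store reflection_score = 1.8)
// A summary with modest raw similarity can outrank a more-similar raw chunk,
// which is exactly the behaviour you want on broad questions.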
Step 4: Wire it into your ingest handler.
Reflection runs as a background task. It does not block the ingest response – that would add 2–3 seconds to each ingest call.
In your src/handlers/ingest.ts, after storing the document:
import { reflect } from '../engines/reflection';
// ... existing ingest logic ...
// After VECTORIZE.upsert() and DB insert succeed:
ctx.waitUntil(
reflect(documentId, content, env).catch(err => {
console.warn('(reflection) failed for', documentId, err.message);
})
);
return new Response(JSON.stringify({
success: true,
documentId,
chunks: chunkCount,
// ... rest of response
}), { headers: { 'Content-Type': 'application/json' } });
ctx.waitUntil() is Cloudflare Workers’ primitive for background work. The response returns immediately, the reflection runs afterwards, and the ingest API stays fast.
The .catch() is important. A failed reflection should never fail an ingest. Raw documents are the source of truth. Reflections are derived values – useful, but not on the critical path.
Step 5: Boost reflections in search.
Apply the reflection boost in your hybrid search logic, src/engines/hybrid.ts, after RRF fusion and before the results are returned:
// Apply reflection boost
const boosted = results.map(r => ({
...r,
score: r.doc_type === 'reflection' || r.doc_type === 'summary'
? r.score * (r.reflection_score ?? 1.5)
: r.score,
}));
return boosted.sort((a, b) => b.score - a.score);
This is a post-fusion boost, not a pre-fusion re-rank. Rationale: RRF runs over all results first, so raw relevance is established before reflections get favoured. A reflection that wouldn’t rank in the top 20 on raw similarity shouldn’t surface just because it carries a boost multiplier.
Step 6: Filtering by doc_type
Your search endpoint should accept a doc_type filter so callers can control what they see:
// In your search request handler:
const docTypeFilter = body.filters?.doc_type;
// Pass to Vectorize query:
const vectorFilter: Record<string, unknown> = {};
if (docTypeFilter) {
vectorFilter.doc_type = docTypeFilter;
}
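To complete the picture, here’s a minimal sketch of forwarding that filter to Vectorize, assuming the query flow from the base /search handler. One caveat I’ll flag as an assumption: depending on your Vectorize index configuration, filtering on a metadata field may require a metadata index for that field, so check the Wrangler docs if filtered queries come back empty.
// Sketch: forward the caller's doc_type filter to Vectorize, omitting it when empty.
const results = await env.VECTORIZE.query(vector, {
  topK: 10,
  returnMetadata: 'all',
  ...(Object.keys(vectorFilter).length > 0 ? { filter: vectorFilter } : {}),
});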
The filter gives callers three modes:
# Only reflections and summaries
POST /search
{ "query": "pricing decisions", "filters": { "doc_type": { "$in": ("reflection", "summary") } } }
# Only source documents
POST /search
{ "query": "pricing decisions", "filters": { "doc_type": { "$eq": "raw" } } }
# Default: all types, reflections boosted
POST /search
{ "query": "pricing decisions" }
The default (no filter) is the most useful. Let the boost do its work. When you need source citations, restrict to raw. When you want the synthesized view, restrict to reflections.
What changes once it’s built?
At 200 documents, the difference becomes noticeable. Queries that previously returned five disconnected fragments now surface a reflection that has already synthesized them. Broad conceptual questions – “What do we know about X?” – start returning genuinely useful summaries instead of just the most similar individual paragraphs.
At 2,000 documents, the reflection layer is the most valuable part of the system. Raw chunks answer specific factual questions. Reflections and summaries answer conceptual questions that no single document can. The system has learned things that no individual document contains.
One failure mode is worth knowing: if your embedding model clusters your domain poorly – say, bge-small at 384d with mixed-domain documents – the related-document retrieval step will surface weaker connections and produce fewer reflections. The 0.65 threshold filters out most of this, but if you’re seeing reflections that seem off-topic, your embeddings are the first thing to check.
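A quick way to check is to log the scores the related-document lookup actually sees. A throwaway sketch you could drop into reflect() temporarily, right after queryVector is computed:
// Temporary debugging: log what the related-document query returns for a new document.
const probe = await env.VECTORIZE.query(queryVector, { topK: 5 });
console.log(
  'related scores:',
  (probe.matches ?? []).map(m => ({ id: m.id, score: Number((m.score ?? 0).toFixed(3)) }))
);
// Healthy clustering: related documents land comfortably above 0.65, unrelated ones well below.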
Deployment
npx wrangler d1 execute rag-db --remote --file=./migrations/003_add_reflection_fields.sql
wrangler deploy
Then ingest some documents and watch what happens:
# Ingest document 1 (replace <your-worker-url> with your deployed Worker URL)
curl -X POST https://<your-worker-url>/ingest \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"id": "doc-001", "content": "Your document text here..."}'
# After a few seconds, check whether a reflection was created
curl -X POST https://<your-worker-url>/search \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "your topic", "filters": {"doc_type": {"$eq": "reflection"}}}'
Reflections will not appear unless there are related documents to synthesize against. Ingest at least three documents on similar topics before you expect to see them.
What to do next?
The reflection layer as described here fires after every ingest. That gets expensive at high ingest volumes: if you’re importing 10,000 documents in a batch, you don’t want 10,000 individual reflect() calls.
For bulk ingestion, gate it: call reflect() only when the related-document search returns a match above 0.8, or run reflection as a single batch pass after the bulk import completes. The POST /ingest/batch endpoint in the full repo does the latter. A sketch of the first approach follows below.
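Here’s a hedged sketch of that gating idea. The maybeReflect helper is mine for illustration – it is not the repo’s /ingest/batch implementation – and it assumes the Env type and the reflect() function from Step 2:
import { reflect } from '../engines/reflection';
// Sketch: during bulk ingest, only reflect when a strongly related raw document
// already exists (similarity above 0.8). Otherwise skip and keep the import fast.
async function maybeReflect(docId: string, content: string, env: Env, ctx: ExecutionContext) {
  const emb = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
    text: [content.slice(0, 512)],
  }) as any;
  const vector = emb.data?.[0];
  if (!vector) return;
  const probe = await env.VECTORIZE.query(vector, {
    topK: 1,
    filter: { doc_type: { $eq: 'raw' } },
  });
  const top = probe.matches?.[0];
  if (top && top.id !== docId && (top.score ?? 0) > 0.8) {
    ctx.waitUntil(reflect(docId, content, env).catch(() => {}));
  }
}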
Another thing worth building: surface reflections in your UI with a visual distinction. A search result that is a reflection should look different from a raw chunk. In the dashboard included in the repo, reflections render with a 💡 badge and a “Synthesized from N documents” note.
The full source is at github.com/dannwaneri/vectorize-mcp-worker – reflection engine, consolidation, batch ingest, dashboard, OpenAPI spec.
The codebase is TypeScript, deployed with a single wrangler deploy. It runs at roughly $1–5/month at 10,000 queries/day.
Standard RAG retrieves. This one learns.