Concepts
How ToolUp.RAG works under the hood.
Pipeline overview
End-to-end flow for retrieval:
user query
│
▼
IEmbeddingProvider.GenerateEmbedding ──┐
│                                      │
▼                                      ▼
IVectorStore.Search                    IEmbeddingCache (LRU; SHA256 keys)
│
▼
scope-access filter (AccessContext.TeamId)
│
▼
MergeStrategy (DenseOnly | SparseOnly | DenseSparseHybrid | DenseSparseRerank)
│
▼
truncate to SnippetCharLimit
│
▼
RAGPromptBuilder.withRetrieval ──→ injected into AI system prompt
│
▼
IRetrievalTracer.Trace (records the retrieval to IEventStore)
End-to-end flow for ingestion:
document upload ──→ IBlobStorage.Save
│
▼ (post-save hook)
VectorisationHandler.Vectorise
│
▼ IngestionQueue.Enqueue
▼
IngestionBackgroundService dequeues
│
▼ IEmbeddingProvider.GenerateEmbedding
▼ IVectorStore.Index (stamps EmbeddingVersion)
▼ emits KnowledgeChunkIndexed event
End-to-end flow for re-embedding (model swap):
operator swaps IEmbeddingProvider ──→ ReembeddingQueue.Enqueue scope
│
▼
ReembeddingBackgroundService
│
▼ IVectorStore.ListChunks scope
▼ filter where EmbeddingVersion ≠ current
▼ re-embed + Index
▼ emits KnowledgeChunkReembedded event
Vector store
IVectorStore is the storage interface:
type IVectorStore =
    abstract Index: VectorScope -> ChunkVector -> Async<unit>
    abstract Search: VectorScope list -> queryVec: float32[] -> topK: int -> minScore: float -> Async<VectorMatch list>
    abstract DeleteChunk: VectorScope -> chunkId: Guid -> Async<unit>
    abstract DeleteByScope: VectorScope -> Async<unit>
    abstract Vacuum: VectorScope -> retainTombstones: TimeSpan -> Async<int>
    abstract ListChunks: VectorScope -> Async<ChunkVector list>
    abstract ListScopes: unit -> Async<VectorScope list>
and ChunkVector = {
    Id: Guid
    Text: string
    Vector: float32[]
    Metadata: Map<string, string> // includes _embedProvider / _embedModel / _embedDim
    Origin: ChunkOrigin
}
and VectorMatch = {
    Chunk: ChunkVector
    Score: float
}
and VectorScope =
    | Platform // universally readable for authenticated callers (when PlatformKnowledgeBase enabled)
    | Deployment // universally readable across teams; rare
    | Team of teamId: string
    | User of userId: string
and ChunkOrigin = UserContent | Narrative | Note | Synthetic
Default InMemoryVectorStore
- Pre-normalised vectors (cosine similarity = dot product).
- Debounced IBlobStorage persistence: Index writes to memory immediately, persists after a 5-second debounce.
- IDisposable lifecycle: final flush on shutdown so no chunks are lost.
- Soft-delete via _deletedAt tombstones: DeleteChunk writes a tombstone; subsequent Search filters them out.
- Vacuum hard-removes tombstones past the retention window.
- DeleteByScope is a config-grade reset: it bypasses tombstone semantics for "delete everything in scope X" (e.g., crypto-shred companion's scope-key destruction).
Ceiling: ~50,000 chunks before query latency becomes noticeable (1M+ comparisons per search). For larger corpora, swap to ToolUp.VectorStores.Hnsw.
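To illustrate why pre-normalised vectors reduce cosine similarity to a dot product, and why a linear scan grows with corpus size, here is a minimal sketch. The function names and the tuple-based store are assumptions for illustration; they are not the InMemoryVectorStore internals.

```fsharp
// Hypothetical sketch: names and data layout are illustrative, not ToolUp.RAG internals.

/// Scale a vector to unit length; a zero vector is returned unchanged.
let normalise (v: float32[]) : float32[] =
    let norm = v |> Array.sumBy (fun x -> x * x) |> sqrt
    if norm = 0.0f then v else v |> Array.map (fun x -> x / norm)

/// Dot product; equals cosine similarity when both inputs are unit length.
let dot (a: float32[]) (b: float32[]) : float32 =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0f a b

/// Linear scan: score every stored vector, drop scores below minScore, keep the best topK.
let search (stored: (string * float32[]) list) (query: float32[]) (topK: int) (minScore: float32) =
    let q = normalise query
    stored
    |> List.map (fun (chunkId, vec) -> chunkId, dot vec q)
    |> List.filter (fun (_, score) -> score >= minScore)
    |> List.sortByDescending snd
    |> List.truncate topK

// Vectors are normalised once at index time, so search only pays for the dot products.
let store =
    [ "a", normalise [| 1.0f; 0.0f |]
      "b", normalise [| 1.0f; 1.0f |]
      "c", normalise [| 0.0f; 1.0f |] ]

let hits = search store [| 2.0f; 0.0f |] 2 0.5f
```

Every query touches every stored vector, which is the cost an approximate index is meant to avoid.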
ToolUp.VectorStores.Hnsw
Hierarchical Navigable Small World index. Approximate nearest-neighbour; trades ~5% recall accuracy for orders-of-magnitude speedup. Persists the index to IBlobStorage for warm restart. Suitable for low-millions of chunks per scope.
For multi-million-chunk or horizontal-scale deployments, a distributed vector store (Qdrant, Pinecone, Weaviate, pgvector) is the right target — future companion work. The IVectorStore contract is portable; the IVectorStoreContract test pack ensures any impl is drop-in.
Retrieval pipeline
IRetrievalPipeline is the high-level facade:
type IRetrievalPipeline =
    abstract Retrieve: RetrievalRequest -> AccessContext -> Async<VectorMatch list>
    abstract Index: VectorScope -> TextChunk -> Async<unit>
and RetrievalRequest = {
    Query: string
    RequestedScopes: VectorScope list // caller's intent
    TopK: int
    MinScore: float
    MergeStrategy: MergeStrategy
    OriginFilter: ChunkOrigin list option
}
and MergeStrategy =
    | DenseOnly // cosine similarity only
    | SparseOnly // BM25 only (term overlap)
    | DenseSparseHybrid // weighted combination
    | DenseSparseRerank // hybrid + cross-encoder reranker (when IReranker is registered)
Stages per Retrieve call
- Scope-access validation: authorisedScopes filters RequestedScopes against AccessContext.TeamId. A mismatched Team teamId is dropped (not errored). Platform and Deployment scopes survive when enabled.
- Embedding generation: IEmbeddingProvider.GenerateEmbedding produces the query vector. It goes through the CachingEmbeddingProvider decorator (LRU, keyed by SHA256 of text; the raw query never lands in the cache key).
- Dense search: IVectorStore.Search against the authorised scopes, top-K with MinScore floor.
- Sparse search (if MergeStrategy includes it): BM25 against the same scopes, top-K.
- Merge: weighted combination per strategy. Hybrid default: 70% dense + 30% sparse.
- Rerank (if IReranker registered + strategy uses it): cross-encoder rescore on the merged candidate pool, top-K from reranked.
- Origin filter: drop chunks whose Origin isn't in OriginFilter.
- Snippet truncation: chunk text truncated to SnippetCharLimit before return.
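The merge stage with the 70/30 hybrid default can be sketched as a weighted sum per chunk. This is a minimal illustration that assumes dense and sparse scores have already been normalised to a comparable [0, 1] range (raw BM25 scores are not); the function name and the tuple shapes are hypothetical.

```fsharp
// Hypothetical sketch: assumes both score lists are already normalised to [0, 1].
let denseWeight = 0.7
let sparseWeight = 0.3

/// Combine per-chunk scores; a chunk missing from one list contributes 0 for that side.
let mergeHybrid (dense: (string * float) list) (sparse: (string * float) list) (topK: int) =
    let denseMap = Map.ofList dense
    let sparseMap = Map.ofList sparse
    let score (m: Map<string, float>) (chunkId: string) =
        m |> Map.tryFind chunkId |> Option.defaultValue 0.0
    (dense @ sparse)
    |> List.map fst
    |> List.distinct
    |> List.map (fun chunkId ->
        chunkId, denseWeight * score denseMap chunkId + sparseWeight * score sparseMap chunkId)
    |> List.sortByDescending snd
    |> List.truncate topK

// c2 appears in both lists, so it overtakes c1 despite a lower dense score.
let merged =
    mergeHybrid [ "c1", 0.9; "c2", 0.5 ] [ "c2", 1.0; "c3", 0.8 ] 2
```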
Scope isolation guarantees
The scope-access filter at stage 1 is the only choke-point. There's no API path that bypasses it — IRetrievalPipeline.Retrieve is the single entry. A misuse pattern would be a custom retrieval impl that doesn't filter, but the SDK's default does, and the contract test pack enforces it.
Result: Team A's vector store entries are never readable by Team B's caller, even if Team B's caller knows the document IDs. The chunk-level isolation is structural.
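A minimal sketch of the drop-don't-error filtering described above. The VectorScope cases mirror the type shown earlier; the AccessContext shape, the platformEnabled flag, and the rule applied to User scopes are assumptions for illustration only.

```fsharp
// VectorScope mirrors the document; AccessContext fields beyond TeamId are assumptions.
type VectorScope =
    | Platform
    | Deployment
    | Team of teamId: string
    | User of userId: string

type AccessContext = { TeamId: string; UserId: string }

/// Keep only the scopes the caller may read; mismatches are dropped silently, not raised.
let authorisedScopes (platformEnabled: bool) (ctx: AccessContext) (requested: VectorScope list) =
    requested
    |> List.filter (fun scope ->
        match scope with
        | Platform -> platformEnabled
        | Deployment -> true
        | Team teamId -> teamId = ctx.TeamId
        | User userId -> userId = ctx.UserId) // assumed rule: a user reads only their own scope

// A Team B caller asking for Team A's scope simply loses that scope from the request.
let permitted =
    authorisedScopes true { TeamId = "team-b"; UserId = "u1" } [ Platform; Team "team-a"; Team "team-b" ]
```

Because the filter runs before any search, a dropped scope never reaches the vector store at all.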
Chunking
Chunking.fs handles the text → chunk transformation:
type ChunkingConfig = {
    MaxTokens: int // default 500
    OverlapTokens: int // default 50
    MinTokens: int // default 100
}
module Chunking =
    val splitBySentence: string -> string list
    val splitByTokens: ChunkingConfig -> ITokenCounter -> string -> string list
    val chunkSpreadsheet: SheetData -> ChunkingConfig -> ITokenCounter -> string list
    val formatRow: string list -> string -> string
    val withContextualHeader: ITextSummariser option -> string -> string -> string
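How MaxTokens and OverlapTokens interact can be sketched with a sliding window. This is not the splitByTokens implementation: it treats whitespace-separated words as tokens and ignores sentence boundaries and MinTokens, purely to show the overlap arithmetic.

```fsharp
// Hypothetical sketch: words stand in for tokens; MinTokens is carried but not applied here.
type ChunkingConfig = { MaxTokens: int; OverlapTokens: int; MinTokens: int }

/// Slide a MaxTokens-wide window forward by (MaxTokens - OverlapTokens) words each step.
let splitWithOverlap (cfg: ChunkingConfig) (text: string) : string list =
    let words = text.Split([| ' '; '\n'; '\t' |], System.StringSplitOptions.RemoveEmptyEntries)
    let step = max 1 (cfg.MaxTokens - cfg.OverlapTokens)
    [ for start in 0 .. step .. words.Length - 1 do
        let window = words.[start .. min (start + cfg.MaxTokens) words.Length - 1]
        // Skip a trailing fragment already fully covered by the previous window's overlap.
        if start = 0 || window.Length > cfg.OverlapTokens then
            yield String.concat " " window ]

let chunks =
    splitWithOverlap { MaxTokens = 4; OverlapTokens = 1; MinTokens = 1 } "a b c d e f g"
```

Here the second chunk starts with "d", the last word of the first chunk, which is the one-token overlap.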
Token-aware splitting
ITokenCounter abstracts token counting:
type ITokenCounter =
    abstract CountTokens: string -> int
Default HeuristicTokenCounter uses ~4 chars/token (good enough for English; less accurate for code or non-Latin scripts).
For accurate counts, a Microsoft.ML.Tokenizers companion can drop in as a one-line wrapper. The contract test pack ensures any counter is drop-in.
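A counter along the lines of the ~4 chars/token heuristic could look like the sketch below. The class name and the round-up behaviour are assumptions; the shipped HeuristicTokenCounter may round or special-case differently.

```fsharp
type ITokenCounter =
    abstract CountTokens: string -> int

/// Hypothetical stand-in for a heuristic counter: one token per four characters, rounded up.
type FourCharsPerTokenCounter() =
    interface ITokenCounter with
        member _.CountTokens(text: string) =
            if System.String.IsNullOrEmpty text then 0
            else (text.Length + 3) / 4

let counter = FourCharsPerTokenCounter() :> ITokenCounter

// 30 characters, so the estimate is 8 tokens.
let estimate = counter.CountTokens "How does scope isolation work?"
```

Any tokenizer-backed implementation only needs to satisfy the same one-method interface.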
Spreadsheet chunking
Sheets need special treatment — preserving headers and row context across chunks. chunkSpreadsheet repeats column headers per chunk and emits a Sheet "X", rows N–M of T header line so the model knows which slice of the sheet it's reading. Maintains 1-based row indices in the chunked output.
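The header-repetition idea can be sketched as follows. This is an illustration only: it chunks by a fixed row count instead of a token budget, uses a tab-separated row format, and the function name and parameters are hypothetical rather than the chunkSpreadsheet signature.

```fsharp
// Hypothetical sketch: fixed rows per chunk (must be > 0) instead of a token budget.
let chunkRows (sheetName: string) (headers: string list) (rows: string list list) (rowsPerChunk: int) : string list =
    let total = List.length rows
    rows
    |> List.chunkBySize rowsPerChunk
    |> List.mapi (fun i block ->
        let first = i * rowsPerChunk + 1 // 1-based index of the block's first row
        let last = first + List.length block - 1
        let header = sprintf "Sheet \"%s\", rows %d–%d of %d" sheetName first last total
        // Column headers are repeated at the top of every chunk.
        let body = (headers :: block) |> List.map (String.concat "\t")
        String.concat "\n" (header :: body))

let sheetChunks =
    chunkRows "Q1" [ "name"; "amt" ] [ [ "a"; "1" ]; [ "b"; "2" ]; [ "c"; "3" ] ] 2
```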
Contextual headers (deferred / optional)
ITextSummariser is an optional extension point: given the whole document, it produces a one-sentence summary that is prepended to every chunk. This helps the model when individual chunks lose document-level context ("...as discussed above"). The default impl is a no-op; a real impl would invoke an LLM to summarise, which costs tokens. Wire it only when retrieval quality justifies the cost.
Embedding providers
IEmbeddingProvider is the boundary:
type IEmbeddingProvider =
    abstract GenerateEmbedding: text: string -> Async<float32[]>
    abstract ProviderId: string
    abstract ModelId: string
    abstract Dimensions: int
EmbeddingVersion is the (ProviderId, ModelId, Dimensions) triple. Stamped onto every chunk's metadata at upsert time so a future model swap is detectable post-hoc.
Caching layer
CachingEmbeddingProvider decorator wraps any provider with an IEmbeddingCache:
type IEmbeddingCache =
    abstract TryGet: providerId: string -> modelId: string -> dimensions: int -> textHash: string -> float32[] option
    abstract Set: providerId: string -> modelId: string -> dimensions: int -> textHash: string -> float32[] -> unit
    abstract HitRate: float
Default InMemoryEmbeddingCache is a coarse-locked LRU (capacity 10000). Cache keys are (ProviderId, ModelId, Dimensions, SHA256(text)) — raw text never lands in keys. Hit rate exposed for observability.
The cache helps most when the same query or chunk text recurs (e.g., re-embedding a document that was previously embedded). A distributed Redis-backed cache is a possible companion; the IEmbeddingCache contract is portable.
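The key scheme can be illustrated with the standard .NET SHA256 API. The helper names and the provider and model identifiers below are hypothetical; only the (ProviderId, ModelId, Dimensions, SHA256(text)) shape comes from the description above.

```fsharp
open System.Security.Cryptography
open System.Text

/// Hex-encoded SHA256 of the UTF-8 bytes of the text.
let sha256Hex (text: string) : string =
    use sha = SHA256.Create()
    sha.ComputeHash(Encoding.UTF8.GetBytes(text))
    |> Array.map (fun (b: byte) -> b.ToString("x2"))
    |> String.concat ""

/// Hypothetical key layout: the raw text is hashed and never stored in the key itself.
let cacheKey (providerId: string) (modelId: string) (dimensions: int) (text: string) =
    providerId, modelId, dimensions, sha256Hex text

// Same text under a different model yields a different key, so a model swap cannot serve stale vectors.
let k1 = cacheKey "acme" "embed-v1" 768 "hello"
let k2 = cacheKey "acme" "embed-v2" 768 "hello"
```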
Background services
Two background services run when RAG is enabled:
IngestionBackgroundService
Drains the IngestionQueue, indexes each job via the pipeline. Concurrency-capped via SemaphoreSlim (default 2; withIngestionConcurrency to tune). Emits:
- KnowledgeChunkIndexed: successful index.
- KnowledgeChunkFailed: embedding or storage failure; retried per policy.
Queue capacity: the IngestionQueue is always bounded (default capacity 5000; withIngestionQueueCapacity N to tune). The channel uses FullMode = Wait, so the non-blocking Enqueue returns false when the queue is full — it never silently drops the incoming job. Upload sites surface that rejection: KB's UploadDocument / AddNote / IngestNarrative mark the affected document Failed, IRagTelemetry records the rejection, and /health/rag exposes live queue depth — so a burst ingestion (e.g. a 10K-document bulk upload) applies visible back-pressure instead of OOMing or losing documents silently.
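The rejection behaviour follows from how System.Threading.Channels treats a bounded channel in Wait mode: TryWrite never blocks and reports false once capacity is reached. A minimal sketch, with a capacity of 2 and a hypothetical tryEnqueue wrapper standing in for the real queue:

```fsharp
open System.Threading.Channels

// Hypothetical sketch: a tiny bounded channel standing in for IngestionQueue.
let options = BoundedChannelOptions(2, FullMode = BoundedChannelFullMode.Wait)
let channel = Channel.CreateBounded<string>(options)

/// Non-blocking enqueue: false means the queue is full and the caller must handle the rejection.
let tryEnqueue (job: string) : bool =
    channel.Writer.TryWrite(job)

// With nothing draining the channel, the third job is rejected rather than dropped silently.
let results = [ "doc-1"; "doc-2"; "doc-3" ] |> List.map tryEnqueue
```

The caller sees the false result immediately, which is what lets upload sites mark the document Failed.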
ReembeddingBackgroundService
Drains the ReembeddingQueue (a Channel<VectorScope>). For each enqueued scope:
- List all chunks via IVectorStore.ListChunks scope.
- Filter chunks whose EmbeddingVersion (from chunk metadata) doesn't match the current provider's.
- For each mismatched chunk: re-embed via the current provider, re-index (overwrites the old vector).
Emits KnowledgeChunkReembedded / KnowledgeChunkReembedFailed.
Trigger on operator action (model swap, scope reset, periodic refresh). The default isn't to re-embed on every chunk access — too expensive. The pattern is: operator decides "I'm changing embedding model"; calls reembeddingQueue.Enqueue(Team teamId) for each affected scope; background service drains over the next minutes/hours.
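The version-mismatch filter at the heart of the re-embedding pass can be sketched like this. The record shape and helper names are assumptions; only the metadata keys (_embedProvider, _embedModel, _embedDim) come from the ChunkVector definition, and treating a missing stamp as stale is a choice made for this sketch.

```fsharp
// Hypothetical sketch: chunks are reduced to (id, metadata) pairs for illustration.
type EmbeddingVersion = { ProviderId: string; ModelId: string; Dimensions: int }

let tryInt (s: string) : int option =
    match System.Int32.TryParse(s) with
    | true, n -> Some n
    | _ -> None

/// Read the version stamped in chunk metadata; None when a key is missing or malformed.
let versionOf (metadata: Map<string, string>) : EmbeddingVersion option =
    let provider = Map.tryFind "_embedProvider" metadata
    let model = Map.tryFind "_embedModel" metadata
    let dim = Map.tryFind "_embedDim" metadata |> Option.bind tryInt
    match provider, model, dim with
    | Some p, Some m, Some d -> Some { ProviderId = p; ModelId = m; Dimensions = d }
    | _ -> None

/// Chunks needing re-embedding: the stamped version differs from the current one or is absent.
let staleChunks (current: EmbeddingVersion) (chunks: (string * Map<string, string>) list) =
    chunks |> List.filter (fun (_, meta) -> versionOf meta <> Some current)

let current = { ProviderId = "acme"; ModelId = "embed-v2"; Dimensions = 768 }

let stamp (v: EmbeddingVersion) =
    Map.ofList [ "_embedProvider", v.ProviderId; "_embedModel", v.ModelId; "_embedDim", string v.Dimensions ]

let stale =
    staleChunks current [ "c1", stamp { current with ModelId = "embed-v1" }; "c2", stamp current; "c3", Map.empty ]
```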
RAG prompt builder
RAGPromptBuilder.withRetrieval is a SystemPromptBuilder that injects retrieved chunks:
let ragBuilder = RAGPromptBuilder.withRetrieval pipeline {
    TopK = 5
    MinScore = 0.3
    GroundingMode = Preferred
    ScopeStrategy = ActiveTeamPlusPlatform
}
AIServerApp.create (aiProviderFactory, aiConfigStore)
|> ...
|> AIServerApp.withAIConfig {
    AIAssistantServerConfig.defaults with
        SystemPrompt = Some ragBuilder
}
|> AIServerApp.run
The builder runs per chat request. It:
- Reads the user's latest message as the query.
- Calls pipeline.Retrieve for the appropriate scopes.
- Truncates each chunk to SnippetCharLimit.
- Formats them into a "Retrieved context" section in the system prompt.
- Cites each chunk with a short identifier ([Doc abc, Chunk 1 of 5]).
- Returns the formatted string.
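The truncate, cite, and format steps above can be sketched as a pure function. The record fields, the function names, and the exact section wording are assumptions for illustration; the real builder's output format is not shown in this document beyond the citation style.

```fsharp
// Hypothetical sketch: record fields and layout are illustrative only.
type RetrievedChunk = { DocId: string; Index: int; Total: int; Text: string }

/// Cut text to at most `limit` characters.
let truncateTo (limit: int) (text: string) : string =
    if text.Length <= limit then text else text.Substring(0, limit)

/// Render chunks as a "Retrieved context" section with a short citation before each snippet.
let formatContext (snippetCharLimit: int) (chunks: RetrievedChunk list) : string =
    let lines =
        chunks
        |> List.map (fun c ->
            sprintf "[Doc %s, Chunk %d of %d] %s" c.DocId c.Index c.Total (truncateTo snippetCharLimit c.Text))
    String.concat "\n" ("Retrieved context:" :: lines)

let section =
    formatContext 5 [ { DocId = "abc"; Index = 1; Total = 5; Text = "hello world" } ]
```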
When RAGCompose.composeWithRAG is in the pipeline, this builder is automatically composed with the default system-prompt builders (platform + active module). Apps that want full control over composition can wire it via withAIConfig manually.
Grounding modes
type GroundingMode =
    | Permissive // retrieve, inject if found, assistant can answer with or without
    | Preferred // retrieve, inject if found, system prompt nudges the model to prefer retrieved context
    | StrictlyGrounded // retrieve, refuse to answer if nothing retrieved
StrictlyGrounded is the strongest signal — useful when hallucinations are unacceptable. The system prompt explicitly tells the model "answer only from retrieved context; refuse otherwise". Permissive is the loosest; useful when retrieval is one of many input sources.
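One way to picture the three modes is as a mapping from mode and retrieval outcome to an extra system-prompt instruction. The instruction strings below are invented for illustration; the wording the SDK actually injects is not shown in this document.

```fsharp
type GroundingMode =
    | Permissive
    | Preferred
    | StrictlyGrounded

/// Hypothetical wording: returns the extra instruction to add to the system prompt, if any.
let groundingInstruction (mode: GroundingMode) (retrieved: string list) : string option =
    match mode, retrieved with
    | StrictlyGrounded, [] -> Some "No context was retrieved. Decline to answer."
    | StrictlyGrounded, _ -> Some "Answer only from the retrieved context; refuse otherwise."
    | Preferred, _ :: _ -> Some "Prefer the retrieved context when it is relevant."
    | Preferred, [] -> None
    | Permissive, _ -> None

let strictEmpty = groundingInstruction StrictlyGrounded []
let permissive = groundingInstruction Permissive [ "some chunk" ]
```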
Retrieval miss
When fewer than 2 matches survive the MinScore gate, the prompt builder emits a KnowledgeRetrievalMiss event via IRetrievalTracer.Miss. Operators surface this in the HealthMonitor admin module to detect deployments where the knowledge base is empty or the embedding model is mismatched.
Observability
The IRetrievalTracer interface:
type IRetrievalTracer =
    abstract Trace: RetrievalTrace -> AccessContext -> Async<unit>
    abstract Miss: scope: VectorScope -> queryHash: string -> Async<unit>
and RetrievalTrace = {
    QueryHash: string // SHA256; never plaintext
    QueryLength: int
    RequestedScopes: VectorScope list
    PermittedScopes: VectorScope list
    TopK: int
    CandidatePoolSize: int
    TopScore: float
    Dense: bool
    Sparse: bool
    Reranked: bool
    LatencyMs: int
    Stages: string list
    ResultCount: int
}
Two shipped tracers:
- NoOpRetrievalTracer: zero-cost opt-out.
- EventStoreRetrievalTracer (default): writes KnowledgeRetrieved events to IEventStore under _platform.retrieval.
The KnowledgeRetrieved event type is a wire-format contract. Use it for retrieval-quality monitoring, replay debugging (with the QueryHash as the dedup key), and slow-query investigation.
/health/rag exposes a 60-second rolling-window snapshot from IRagTelemetry: embedding latency, ingestion queue depth, flush latency, retrieval hit / miss / empty counts. The HealthMonitorUI admin module surfaces it for operators.
Multi-modal extensions (deferred)
The IOcrProvider, ITableExtractor, and IImageEmbedder extension points reserve the surface for future multimodal work:
- IOcrProvider: scanned-document OCR. Default no-op (assumes documents are text-extractable).
- ITableExtractor: table extraction with structure preserved. Default no-op.
- IImageEmbedder: CLIP-style image vectors in a shared modality space. No default registered (no honest no-op).
When IAIProvider.AIProviderMessage.Content is widened to support multimodal content blocks (future SDK version), the image-embedding path will plug in alongside text embedding for cross-modal retrieval.