Concepts
How ToolUp.KnowledgeBase works under the hood. The companion is a thin glue layer between ToolUp.RAG's ingestion pipeline and the user-facing module surface.
Pipeline overview
user drops file into Documents page UI
│
▼ POST /api/IKnowledgeApi/UploadDocument
▼ KnowledgeBase.Server.knowledgeApi handler
▼ IBlobStorage.Save → team-{teamId}/kb-documents/{docId}.{ext}
│
▼ (post-save hook fires)
▼ KnowledgeBase.Server.kbVectorisationHandler.Vectorise
│ │
│ ▼ multi-format extraction (PdfPig / OpenXml / ...)
│ ▼ Chunking.splitByTokens (token-aware)
│ ▼ returns TextChunk list
│
▼ IngestionQueue.Enqueue { DocumentId; Scope; Chunks }
│
▼ IngestionBackgroundService dequeues
│ │
│ ▼ IEmbeddingProvider.GenerateEmbedding per chunk
│ ▼ IVectorStore.Index per chunk
│ ▼ emits KnowledgeChunkIndexed event
│ │
│ ▼ IIngestionStatusObserver.OnChunkIndexed
│ ▼ Notification.Publish (NotificationKind = "KnowledgeBase.IngestionStatus")
│
▼ SSE event reaches client; KB Documents page updates per-document status pill
End-to-end latency for a typical 10-page PDF: ~5-15 seconds.
Document upload + extraction
KnowledgeBase.Server.knowledgeApi.UploadDocument accepts a byte[] + filename + content-type. The handler:
- Validates the file (size cap, content-type allowlist).
- Generates a DocumentId: Guid.
- Writes to IBlobStorage: container team-{teamId}, key kb-documents/{documentId}.{ext}.
- Writes the document metadata blob: kb-documents/{documentId}.meta.json, containing { DocumentId; FileName; ContentType; UploadedBy; UploadedAt; Source = UploadedFile; Status = Pending }.
- Emits a DocumentUploaded event under _platform.kb.
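The validation and key-layout steps of the upload handler can be sketched as below. This is a minimal illustration, not the shipped handler: the size cap, the allowlist contents, and the helper names (validate, containerFor, blobKey, metaKey) are assumptions; only the container and key patterns come from the description above.

```fsharp
open System
open System.IO

// Assumed limits for illustration; the real size cap and allowlist are not stated here.
let maxBytes = 50 * 1024 * 1024
let allowedContentTypes = set [ "application/pdf"; "text/plain"; "text/csv" ]

// Reject oversized or non-allowlisted uploads before anything is written.
let validate (bytes: byte[]) (contentType: string) : Result<unit, string> =
    if bytes.Length > maxBytes then Error "file exceeds size cap"
    elif not (allowedContentTypes.Contains contentType) then Error "content-type not allowed"
    else Ok ()

// Container and key layout as described in the list above.
let containerFor (teamId: string) = sprintf "team-%s" teamId

let blobKey (documentId: Guid) (fileName: string) =
    let ext = Path.GetExtension(fileName).TrimStart('.')
    sprintf "kb-documents/%O.%s" documentId ext

let metaKey (documentId: Guid) =
    sprintf "kb-documents/%O.meta.json" documentId
```

Keeping key construction in pure functions like these makes the storage layout easy to unit-test independently of IBlobStorage.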
The post-save hook in IDataObjectStore (when registered with composeWithRAG) triggers kbVectorisationHandler.Vectorise. The handler:
- Reads the document bytes from IBlobStorage.
- Routes by content-type to the matching extractor.
- Runs the extractor to get raw text per page / per sheet / per slide.
- Calls Chunking.splitByTokens with the configured ChunkingConfig.
- Returns TextChunk list to the ingestion queue.
If extraction fails (corrupted PDF, password-protected DOCX, etc.), the handler emits a DocumentExtractionFailed event with the reason; the document's status becomes Failed. The user sees a one-line error in the Documents page.
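The route-then-extract flow, including the failure path, might look roughly like the following. The Extractor signature, the ExtractionOutcome type, and the stand-in extractors are hypothetical and exist only for this sketch; the real handler's types are not shown in this page.

```fsharp
// Hypothetical extractor signature: bytes in, one text segment per page/sheet/slide out.
type Extractor = byte[] -> Result<string list, string>

let extractPlainText : Extractor =
    fun (bytes: byte[]) -> Ok [ System.Text.Encoding.UTF8.GetString(bytes) ]

// Stand-in for a real PDF extractor; it always fails here to exercise the failure path.
let extractPdf : Extractor =
    fun _ -> Error "PDF extraction not implemented in this sketch"

// Route by content-type to the matching extractor, if any.
let routeExtractor (contentType: string) : Extractor option =
    match contentType.ToLowerInvariant() with
    | "text/plain"
    | "text/markdown" -> Some extractPlainText
    | "application/pdf" -> Some extractPdf
    | _ -> None

type ExtractionOutcome =
    | Extracted of string list
    | ExtractionFailed of reason: string

// A result with no non-blank text is treated as a failure, like an extractor error.
let extract (contentType: string) (bytes: byte[]) : ExtractionOutcome =
    match routeExtractor contentType with
    | None -> ExtractionFailed (sprintf "no extractor for %s" contentType)
    | Some extractor ->
        match extractor bytes with
        | Ok segments when segments |> List.exists (fun (s: string) -> s.Trim() <> "") ->
            Extracted segments
        | Ok _ -> ExtractionFailed "no text extracted"
        | Error reason -> ExtractionFailed reason
```

In this shape, the ExtractionFailed reason is what would travel in the DocumentExtractionFailed event.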
Multi-format extractors
Shipped extractors:
| Format | Library | Output |
|---|---|---|
| PDF | UglyToad.PdfPig | Per-page text, page-numbered. Scanned PDFs → empty text (no OCR by default; pair with IOcrProvider). |
| PPTX | DocumentFormat.OpenXml | Per-slide text + speaker notes; preserves slide order. |
| DOCX | DocumentFormat.OpenXml | Paragraph-by-paragraph; tables flattened to comma-separated rows. |
| XLSX | DocumentFormat.OpenXml | Per-sheet; passed to Chunking.chunkSpreadsheet for header-aware chunking. |
| CSV / TSV | Built-in | Header-aware row-wise chunking. Auto-detects delimiter. |
| TXT | Built-in | Passed directly to Chunking.splitByTokens. |
| MD | Built-in | Same as TXT; markdown structure isn't preserved as metadata (chunks are plain text). |
Extraction is best-effort. The extractor's job is to produce text; if a file arrives malformed (truncated PDF, broken zip in PPTX), the extractor emits what it could parse and the handler logs a warning. An empty-text result is the failure mode: the document is marked Failed and is not retried automatically.
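To make "header-aware row-wise chunking" and delimiter auto-detection concrete, here is one possible approach. It is an assumption about the general technique, not the built-in CSV / TSV extractor: detectDelimiter and chunkRows are hypothetical names, the detection heuristic (most frequent candidate in the header line) is a guess, and quoted fields are ignored for brevity.

```fsharp
// Pick the candidate delimiter that occurs most often in the header line.
let detectDelimiter (headerLine: string) : char =
    [ ','; '\t'; ';' ]
    |> List.maxBy (fun d -> headerLine |> Seq.filter (fun c -> c = d) |> Seq.length)

// Group data rows and repeat the header at the top of every chunk,
// so each chunk stays interpretable on its own after retrieval.
let chunkRows (rowsPerChunk: int) (text: string) : string list =
    let lines =
        text.Split([| '\n' |], System.StringSplitOptions.RemoveEmptyEntries)
        |> Array.map (fun (l: string) -> l.TrimEnd('\r'))
        |> Array.toList
    match lines with
    | [] -> []
    | header :: rows ->
        rows
        |> List.chunkBySize rowsPerChunk
        |> List.map (fun group -> String.concat "\n" (header :: group))
```

Repeating the header costs a few tokens per chunk but means a retrieved chunk carries its own column names.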
Adding a new extractor
A custom extractor lives in your own module's VectorisationHandler for now. The shipped KB extractor list isn't extensible from outside the package — replace the KB module entirely (see extending.md) if you need to add formats.
An extension point (an IDocumentExtractor interface registered by content-type) is planned.
Ingestion-status surfacing
IIngestionStatusObserver is the contract:
type IIngestionStatusObserver =
    abstract OnJobAccepted: IngestionJob -> Async<unit>
    abstract OnChunkIndexed: jobId: Guid -> chunkId: Guid -> Async<unit>
    abstract OnJobCompleted: jobId: Guid -> chunkCount: int -> Async<unit>
    abstract OnJobFailed: jobId: Guid -> reason: string -> Async<unit>
KnowledgeBase.Server.makeIngestionStatusObserver wires an implementation that:
- Looks up the DocumentId from the IngestionJob's metadata.
- Updates the document's .meta.json status: Pending → Extracting → Chunking → Embedding → Indexed | Failed.
- Emits a Notification.SystemMessage with NotificationKind = "KnowledgeBase.IngestionStatus" to the scope.
The SSE wire format for the notification is IngestionStatusUpdate:
type IngestionStatusUpdate = {
    DocumentId: Guid
    Status: IngestionStatus
    Progress: float option    // 0.0 - 1.0
    ChunksIndexed: int
    ChunksTotal: int option
    Reason: string option     // populated when Failed
}
The KB Documents page subscribes to this notification key and updates the per-document status pill live. The AI assistant side panel also subscribes — when documents are still indexing, it can show "indexing in progress" before answering.
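As an illustration of how a client or observer could populate the optional fields, the sketch below derives Progress from ChunksIndexed and ChunksTotal. The IngestionStatus cases are taken from the status transitions listed earlier; progressOf and embeddingUpdate are hypothetical helpers, and the real observer may compute progress differently.

```fsharp
open System

// Cases taken from the status transitions described above.
type IngestionStatus =
    | Pending
    | Extracting
    | Chunking
    | Embedding
    | Indexed
    | Failed

// Mirrors the wire-format record shown above.
type IngestionStatusUpdate =
    { DocumentId: Guid
      Status: IngestionStatus
      Progress: float option
      ChunksIndexed: int
      ChunksTotal: int option
      Reason: string option }

// Hypothetical helper: progress is only known when the total is known and positive.
let progressOf (indexed: int) (total: int option) : float option =
    match total with
    | Some t when t > 0 -> Some (min 1.0 (float indexed / float t))
    | _ -> None

// Build the update sent while chunks are still being embedded and indexed.
let embeddingUpdate (documentId: Guid) (indexed: int) (total: int option) =
    { DocumentId = documentId
      Status = Embedding
      Progress = progressOf indexed total
      ChunksIndexed = indexed
      ChunksTotal = total
      Reason = None }
```

Leaving Progress as None when the total is unknown lets the status pill fall back to an indeterminate indicator rather than showing a misleading percentage.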
Notification-key contract
The literal "KnowledgeBase.IngestionStatus" is a published wire-format contract. External KB replacements should match this string exactly — otherwise the AI side panel won't surface ingestion progress.
Narrative-commit
Other modules can deposit content into KB via narrative-commit:
Html.button [
    prop.text "Save to Knowledge Base"
    prop.onClick (fun _ ->
        Toolup.NarrativeCommit.submit {
            Title = "Sales Q3 Analysis"
            Body = analysisBody
            SourceModule = "SalesAnalysis"
        })
]
Toolup.NarrativeCommit.submit is a global function with no compile-time dependency on KB. The handler is installed by KnowledgeBaseView.installNarrativeCommit () at app boot.
The handler:
- Sends the narrative + title via KnowledgeApi.IngestNarrative (Fable.Remoting).
- Server-side: persists a document blob with Source = FromNarrative { SourceModule; Title }.
- Runs through the same vectorisation handler + ingestion pipeline as a file upload.
Narratives appear in the Documents list with a different icon and origin label. Users can delete them like any other document.
This gives modules a clean way to push their content into the KB without coupling to KB internals — the Toolup.NarrativeCommit shim is the only contract.
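The decoupling described here, a global submit function whose handler is installed at boot, can be sketched with a mutable handler slot. This is a hypothetical re-creation of the pattern, not the actual Toolup.NarrativeCommit source; the module name NarrativeCommitSketch and the boolean return value are assumptions made for the example.

```fsharp
// Record shape taken from the button example above.
type Narrative =
    { Title: string
      Body: string
      SourceModule: string }

module NarrativeCommitSketch =
    // Empty until some module (KB, in the real app) installs a handler at boot.
    let mutable handler : (Narrative -> unit) option = None

    let install (h: Narrative -> unit) : unit =
        handler <- Some h

    // Callers depend only on this function, never on the installing module.
    // Returns false when nothing is installed (an assumption of this sketch).
    let submit (narrative: Narrative) : bool =
        match handler with
        | Some h ->
            h narrative
            true
        | None -> false
```

A caller module references only submit; whichever module installs the handler decides what happens to the narrative.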
Notes
Notes are a separate persistence path with the same vector-indexing target. A note is just a single chunk (or a small chunk-set if it's long enough to split). Notes are stored alongside documents (same blob container) but with Source = Note and a different metadata shape.
The Notes page UI is a simple text-input → list-of-notes paradigm. No upload, no extraction — the user types, the SDK indexes.
Use case: capturing one-off observations the assistant should know ("we measure sales in units, not value"; "Q3 analysis should exclude store 47"; "client confirmed the launch is week of Oct 10").
AI Context (standing instructions)
The AI Context page is a different shape — entries are NOT indexed for retrieval. They're injected directly into the system prompt every turn:
let standingContextBuilder = KnowledgeBase.Server.standingContextBuilder blobStorage (Some logger)
The builder reads _platform/kb-ai-context/{teamId}/entries.json per request (a cheap, sub-millisecond blob read), formats the entries as a system-prompt section, and returns the string.
Use cases:
- Brand-name conventions ("brand names are case-sensitive").
- Date conventions ("the current quarter is Q3 2026").
- Default settings ("default analysis window is 4 weeks").
- High-level domain context ("the team analyses pharma marketing data").
Limit ~20 entries / ~2000 tokens per team. Beyond that, push into notes (indexed) so retrieval picks them up only when relevant.
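The formatting step of a standing-context builder could look like the sketch below. The entry shape, the section heading text, and the hard truncation at 20 entries are all assumptions; the real entries.json schema and prompt layout are not shown in this page, and token counting is omitted.

```fsharp
// Assumed entry shape; the real entries.json schema may carry more fields.
type ContextEntry = { Text: string }

// Mirrors the ~20-entry guideline above; enforcing it by truncation is an assumption.
let maxEntries = 20

// Render entries as one system-prompt section, or an empty string when there are none.
let formatStandingContext (entries: ContextEntry list) : string =
    match entries |> List.truncate maxEntries with
    | [] -> ""
    | kept ->
        let bullets = kept |> List.map (fun e -> "- " + e.Text)
        String.concat "\n" ("Standing instructions:" :: bullets)
```

Returning an empty string for an empty list lets the composition root append the section unconditionally without adding a dangling heading to the prompt.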
The standing-context builder is opt-in — deployments wire it explicitly into their withAIConfig system-prompt list. KB doesn't auto-inject because AI doesn't depend on KB (one-way layering); the deployment composition root is the only place that sees both.
Reset / dedup
The KB Documents page has a "Reset KB" button (admin-only). When invoked:
- Confirmation dialog ("This will delete N documents and Y chunks. Proceed?").
- KnowledgeApi.ResetKnowledgeBase is called.
- Server-side: deletes all KB document blobs in the scope, calls IVectorStore.DeleteByScope, clears the AI context.
- Emits a KnowledgeBaseReset event under _platform.kb.
DeleteByScope is a config-grade reset (bypasses tombstone soft-delete semantics). The chunks are physically gone, not soft-deleted; there's no recovery path.
Dedup: when a file with the same content-hash is uploaded again, the SDK detects it via IDataObjectStore's content-addressable layer and re-uses the existing storage blob. The document entry is new (different DocumentId), but the underlying bytes don't duplicate.
Chunk-level dedup is the vector store's responsibility — duplicate chunk text within the same scope is left for the vector store to filter (or not, depending on impl).
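The file-level dedup behaviour (new DocumentId, shared bytes) can be illustrated with a small in-memory content-addressable store. ContentStore is a hypothetical stand-in, not IDataObjectStore, and the choice of SHA-256 as the content hash is an assumption; the page does not say which hash the SDK uses.

```fsharp
open System
open System.Collections.Generic
open System.Security.Cryptography

// Hypothetical in-memory stand-in for a content-addressable blob layer.
type ContentStore() =
    let blobs = Dictionary<string, byte[]>()

    // Number of distinct byte payloads actually stored.
    member _.BlobCount = blobs.Count

    // Every call yields a fresh DocumentId; bytes are stored once per distinct hash.
    member _.Save(bytes: byte[]) : Guid * string =
        use sha = SHA256.Create()
        let hash =
            sha.ComputeHash(bytes)
            |> Array.map (fun (b: byte) -> b.ToString("x2"))
            |> String.concat ""
        if not (blobs.ContainsKey hash) then
            blobs.[hash] <- bytes
        Guid.NewGuid(), hash
```

Uploading the same bytes twice through this sketch produces two document IDs that point at one stored payload, which matches the behaviour described above.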
Scope isolation
Every KB document lives in team-{teamId}/kb-documents/. The scope filter in IRetrievalPipeline.Retrieve ensures Team A's caller cannot retrieve Team B's chunks even if they know the document IDs.
On team switch (in MultiTeam mode), the KB module's Init re-fires; the Documents / Notes / AI Context pages refetch against the new team scope. Documents from the prior team disappear; documents from the new team appear.
The Reset KB button is scoped to the active team — no path resets across teams.
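The isolation guarantee amounts to applying the scope filter before any caller-supplied restriction. A minimal sketch, assuming a hypothetical ScopedChunk record and retrieveForScope function (the real IRetrievalPipeline.Retrieve signature is not shown in this page):

```fsharp
open System

// Hypothetical chunk shape; the real vector-store record will differ.
type ScopedChunk =
    { Scope: string
      DocumentId: Guid
      Text: string }

// Filter by the caller's scope first, then by any requested document IDs.
// An empty ID list means "no document restriction".
let retrieveForScope (callerScope: string) (documentIds: Guid list) (index: ScopedChunk list) =
    index
    |> List.filter (fun c -> c.Scope = callerScope)
    |> List.filter (fun c -> List.isEmpty documentIds || List.contains c.DocumentId documentIds)
```

Because the scope comes from the caller's session rather than from the request, supplying another team's document ID simply yields an empty result.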
Performance
- Upload throughput: bounded by extraction speed. PDFs are slowest (~1s per 10 pages with PdfPig). PPTX / DOCX / XLSX faster. CSV / TXT near-instant.
- Indexing throughput: bounded by embedding-provider rate limits. Default ingestion concurrency is 2; raise via withIngestionConcurrency if your provider supports higher concurrency.
- Retrieval latency: usually 50-200ms per query for in-memory vector store + cached embedding. Sparse-only or hybrid adds 30-100ms for BM25 over the same scope.
- Scaling ceiling: ~50K chunks per scope before in-memory vector store slows. For larger, switch to ToolUp.VectorStores.Hnsw or a future distributed store.
What KB does NOT cover
- Versioning of documents: uploading the same file twice creates two DocumentIds. There's no "this is v2 of doc X" semantics. For versioning, use IDataObjectStore directly with VersioningPolicy = Versioned.
- Document-level metadata extraction: extractors extract text, not structured metadata (authors, dates, tags). For metadata-rich knowledge graphs, build a custom module.
- Multimodal content: images / audio / video are not extracted. Future multimodal extension via IImageEmbedder (deferred).
- Cross-document linking: chunks don't know about each other beyond the document boundary. For cross-document reasoning, the assistant uses retrieval + agent loop; the SDK has no native graph layer.
For any of these, the right shape is a custom module on top of ToolUp.RAG, replacing KB or running alongside it.