A starting point for the conversation

This is a working demo, not a finished product, just something concrete to base our discussion on. Each capability has a deep link to the live feature behind it and an honest read of where it stands today: shipped and demoable, designed but not yet wired, or a clear next step. Nothing here is overclaimed; the gaps are the roadmap.

12 shipped · demoable2 designed · not yet wired3 next step · on the roadmap

How to read this page

Most of this is demonstrable from the running app right now; click any live link to see the claim proven, not asserted. The rest is framed as the next step, so we have a shared, honest picture to plan from.

RAG: chunking, embeddings, vector search

SHIPPED · demoable

Grounded RAG runs end-to-end in this app, my own code.

A question retrieves top-k chunks by cosine similarity; each chunk carries a [doc:id] citation marker passed to the model. Open any document to ask and see cited answers.

Open a document workspace

Grounding outputs in trusted sources (anti-hallucination)

SHIPPED · demoable

Answers are cited or refused: exact NOT_FOUND, never invented.

The system prompt forbids answering outside the retrieved context and requires the literal token NOT_FOUND when the answer isn't present. The eval set includes a negative case that must refuse.

See the NOT_FOUND refusal case

Evaluation harness: accuracy, faithfulness, eval-first mindset

SHIPPED · demoable

Every answer is provably scored: coverage, faithfulness, recall plus a pass-gate.

The golden set runs through grounded QA and is scored on citation coverage, faithfulness, and keyword recall, with a persisted run history and a hard pass-gate. Engine pass_rate is 1.00.

Run the eval dashboard

Structured outputs (function calling / Pydantic)

SHIPPED · demoable

Typed tool/function specs back the extraction and corroboration paths.

Tool inputs/outputs are Pydantic-typed; the API returns validated structured findings, not free text. Visible as persisted, schema-shaped insights in a document workspace.

See structured findings on a document

Stateful LLM workflows (LangGraph)

SHIPPED · demoable

Real LangGraph StateGraph: planner, extractor, critic, live in the app.

Extraction runs on an actual LangGraph StateGraph. The critic node provably drops a poison-pill finding. The agent-trace panel shows one row per node with the model it used. (Named twice in the JD, and built, not bluffed.)

Open a document, then the agent-trace panel

Multi-agent / sub-agent orchestration

SHIPPED · demoable

A coordinated planner/extractor/critic graph that merges results.

The orchestration decomposes extraction into nodes that run and report into shared state, with the critic gating low-evidence findings out before they persist.

See the agent trace

Observability & tracing (token / cost / spans)

SHIPPED · demoable

OpenTelemetry-GenAI tracer: per-answer spans with model, tokens, cost.

Every LLM call is a span following OTel GenAI semantic conventions (gen_ai.request.model, token counts, USD cost), surfaced in an inline trace panel and totalled on the eval dashboard. Honesty: it follows OTel conventions; Langfuse is NOT wired.

See token/cost totals

Computer-use / Playwright (evidence capture)

SHIPPED · demoable

A headed Playwright evidence agent corroborates findings against the web.

Corroborate a finding and the agent drives a real browser to capture a screenshot + source URL + timestamp as an evidence artifact attached to the insight.

Open a document, then corroborate a finding

Model routing across providers

SHIPPED · demoable

Task-to-tier routing policy plus a live model picker over a real gateway.

A routing policy maps task to cheap/mid/strong tiers; the picker (GLM-5.2 / Kimi-K2.7-Code / DeepSeek-V4-Pro via the CometAPI gateway) flows the chosen model into /ask and into the trace span.

Pick a model in a workspace

Governance / tool exposure (MCP)

SHIPPED · demoable

An MCP server exposes grounded QA as a governed, audited tool.

A stdlib JSON-RPC MCP server exposes attest_grounded_qa with fail-closed auth and an audit log: the governance surface for letting other agents call this capability safely.

Python + TypeScript hybrid

SHIPPED · demoable

FastAPI engine plus Next.js/TypeScript UI, both running, both deployed.

The engine and API are Python (FastAPI); this UI is Next.js + TypeScript with strict typing. Deployed full-stack: Vercel frontend + AWS ECS Fargate backend over HTTPS.

You're looking at the TS half

Vector DB: pgvector / PostgreSQL

DESIGNED · not yet wired

pgvector is the production target behind the same retrieve() contract.

agents/rag/pgvector.py implements the same retriever contract as the in-memory default, so the swap is config, not a rewrite. Honesty: the running demo uses in-memory cosine; pgvector-in-Fargate is a documented fast-follow, not yet wired.

Azure (OpenAI, AI Search, Blob, Key Vault, App Insights)

DESIGNED · not yet wired

Every dependency is mapped to its Azure equivalent in the README.

Portability is documented dependency-by-dependency (OpenAI to Azure OpenAI, retriever to AI Search, etc.). Honesty: documented only; I make no Azure OpenAI hands-on claim.

Hybrid + re-ranked retrieval

NEXT STEP · on the roadmap

Not built yet; the retriever is pure cosine today.

I can explain the design (BM25 + dense fusion, then a cross-encoder rerank) and where it slots into retrieve(), but it isn't implemented. Spoken to as a concept, not demoed.

Context compaction + tool-call repair

NEXT STEP · on the roadmap

Not built yet; understood as a concept, not in the codebase.

Compaction (summarising older turns to fit the window) and tool-call repair (re-prompting on malformed tool JSON) are day-to-day JD items I can describe precisely but have not shipped here.

Langfuse-wired observability

NEXT STEP · on the roadmap

Tracer follows OTel conventions; Langfuse itself is not wired.

The shipped tracer emits OTel-GenAI-shaped spans (token/cost/model), which is the hard part. A Langfuse exporter is the next step, not a current claim.

Coaching / mentoring juniors (Manager signal)

SHIPPED · demoable

The teaching layer (course/, prep/) is itself the proof.

Modules 0 to 3 plus capstone and the prep layer are written to bring someone from zero to this build: the artifact a Manager produces, not just code.