A starting point for the conversation
This is a working demo, not a finished product: something concrete to anchor our discussion. Each capability has a deep link to the live feature behind it and an honest read of where it stands today: shipped and demoable, designed but not yet wired, or a clear next step. Nothing here is overclaimed; the gaps are the roadmap.
RAG: chunking, embeddings, vector search
SHIPPED · demoable
Grounded RAG runs end-to-end in this app, my own code.
A question retrieves top-k chunks by cosine similarity; each chunk carries a [doc:id] citation marker passed to the model. Open any document to ask and see cited answers.
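A minimal sketch of the retrieval step described above, not the app's actual code. The `Chunk` shape, the `retrieve` and `build_context` names, and the two-dimensional toy embeddings are assumptions made for illustration; real embeddings would come from an embedding model.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], chunks: list[Chunk], k: int = 3) -> list[Chunk]:
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return ranked[:k]


def build_context(chunks: list[Chunk]) -> str:
    """Prefix each chunk with its [doc:id] marker so the model can cite it."""
    return "\n".join(f"[doc:{c.doc_id}] {c.text}" for c in chunks)


# Toy data: hypothetical chunks with hand-written embeddings.
chunks = [
    Chunk("a1", "Invoices are due in 30 days.", [1.0, 0.0]),
    Chunk("b2", "Refunds take 5 business days.", [0.0, 1.0]),
    Chunk("c3", "Late invoices incur a fee.", [0.9, 0.1]),
]
top = retrieve([1.0, 0.0], chunks, k=2)
print(build_context(top))
```

The context string is what would be passed to the model alongside the question, so every chunk the model sees already carries the marker it is expected to cite.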
Open a document workspace
Grounding outputs in trusted sources (anti-hallucination)
SHIPPED · demoable
Answers are cited or refused: exact NOT_FOUND, never invented.
The system prompt forbids answering outside the retrieved context and requires the literal token NOT_FOUND when the answer isn't present. The eval set includes a negative case that must refuse.
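The refusal contract can be sketched roughly as follows. The prompt wording and the `classify_answer` helper are hypothetical, written only to show how an exact-token refusal is easy to check mechanically; the app's real prompt is not reproduced here.

```python
import re

NOT_FOUND = "NOT_FOUND"

# Hypothetical prompt wording; only the NOT_FOUND token comes from the description above.
SYSTEM_PROMPT = (
    "Answer only from the context below. Cite every claim with its [doc:id] marker. "
    f"If the context does not contain the answer, reply with exactly {NOT_FOUND}."
)

CITATION = re.compile(r"\[doc:[^\]]+\]")


def classify_answer(answer: str) -> str:
    """Label a model reply as 'refused', 'cited', or 'ungrounded'."""
    text = answer.strip()
    if text == NOT_FOUND:
        return "refused"
    if CITATION.search(text):
        return "cited"
    return "ungrounded"


print(classify_answer("NOT_FOUND"))
print(classify_answer("Invoices are due in 30 days [doc:a1]."))
```

A negative eval case then reduces to asserting that the reply classifies as refused.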
See the NOT_FOUND refusal case
Evaluation harness: accuracy, faithfulness, eval-first mindset
SHIPPED · demoable
Every answer is scored on coverage, faithfulness, and recall, then checked against a pass-gate.
The golden set runs through grounded QA and is scored on citation coverage, faithfulness, and keyword recall, with a persisted run history and a hard pass-gate. Engine pass_rate is 1.00.
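One way such scoring could look, as a sketch under stated assumptions. The metric definitions here are illustrative guesses, not the harness's own: coverage is the share of sentences with a citation, faithfulness is approximated by whether cited ids were actually retrieved, and the 0.8 threshold is invented.

```python
import re
from dataclasses import dataclass

CITATION = re.compile(r"\[doc:([^\]]+)\]")


@dataclass
class Case:
    answer: str
    retrieved_ids: set[str]
    keywords: list[str]


def citation_coverage(answer: str) -> float:
    """Share of sentences that carry at least one [doc:id] marker."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    return sum(bool(CITATION.search(s)) for s in sentences) / len(sentences)


def faithfulness(answer: str, retrieved_ids: set[str]) -> float:
    """Crude proxy: share of cited ids that were actually in the retrieved set."""
    cited = CITATION.findall(answer)
    if not cited:
        return 0.0
    return sum(c in retrieved_ids for c in cited) / len(cited)


def keyword_recall(answer: str, keywords: list[str]) -> float:
    """Share of expected keywords found in the answer (case-insensitive)."""
    if not keywords:
        return 1.0
    lowered = answer.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def passes(case: Case, threshold: float = 0.8) -> bool:
    """Hard gate: every metric must reach the threshold."""
    scores = (
        citation_coverage(case.answer),
        faithfulness(case.answer, case.retrieved_ids),
        keyword_recall(case.answer, case.keywords),
    )
    return min(scores) >= threshold


def pass_rate(cases: list[Case]) -> float:
    return sum(passes(c) for c in cases) / len(cases)


good = Case("Invoices are due in 30 days [doc:a1]. Late ones incur a fee [doc:c3].",
            {"a1", "c3"}, ["30 days", "fee"])
bad = Case("Invoices are due in 60 days [doc:z9].", {"a1"}, ["30 days"])
print(pass_rate([good, bad]))
```

Gating on the minimum score rather than the average means one weak metric cannot be hidden by two strong ones.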
Run the eval dashboard
Structured outputs (function calling / Pydantic)
SHIPPED · demoable
Typed tool/function specs back the extraction and corroboration paths.
Tool inputs/outputs are Pydantic-typed; the API returns validated structured findings, not free text. Visible as persisted, schema-shaped insights in a document workspace.
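To illustrate the idea of validated findings rather than free text, here is a standard-library stand-in. It deliberately uses a dataclass instead of Pydantic so it runs without dependencies; the `Finding` fields and the `parse_finding` helper are hypothetical and do not mirror the app's real schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    claim: str
    doc_id: str
    confidence: float

    def __post_init__(self) -> None:
        # Reject values that parse as JSON but violate the schema.
        if not self.claim.strip():
            raise ValueError("claim must be non-empty")
        if not self.doc_id:
            raise ValueError("doc_id must be non-empty")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def parse_finding(raw: str) -> Finding:
    """Parse model output into a Finding; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
        return Finding(
            claim=str(data["claim"]),
            doc_id=str(data["doc_id"]),
            confidence=float(data["confidence"]),
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"malformed finding: {exc!r}") from exc


finding = parse_finding('{"claim": "Due in 30 days", "doc_id": "a1", "confidence": 0.9}')
print(finding)
```

With Pydantic the same checks would live in field constraints on a model class; the point of the sketch is only that the API boundary rejects anything that does not fit the schema.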
See structured findings on a document
Stateful LLM workflows (LangGraph)
SHIPPED · demoable
Real LangGraph StateGraph: planner, extractor, critic, live in the app.
Extraction runs on an actual LangGraph StateGraph. The critic node provably drops a poison-pill finding. The agent-trace panel shows one row per node with the model it used. (Named twice in the JD, and built, not bluffed.)
Open a document, then the agent-trace panel
Multi-agent / sub-agent orchestration
SHIPPED · demoable
A coordinated planner/extractor/critic graph that merges results.
The orchestration decomposes extraction into nodes that run and report into shared state, with the critic gating low-evidence findings out before they persist.
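The shape of that flow can be sketched in plain Python. This is a stand-in that does not use LangGraph; the node bodies, the state keys, and the evidence rule (a finding survives only if its quote appears verbatim in the document) are all assumptions made for illustration.

```python
from typing import TypedDict


class State(TypedDict):
    document: str
    plan: list[str]
    findings: list[dict]
    trace: list[str]


def planner(state: State) -> State:
    """Decide which fields to look for (a fixed list in this sketch)."""
    return {**state, "plan": ["due_date", "penalty"], "trace": state["trace"] + ["planner"]}


def extractor(state: State) -> State:
    """Stand-in extractor: emits one supported finding and one poison pill."""
    findings = [
        {"field": "due_date", "value": "30 days", "evidence": "Invoices are due in 30 days."},
        {"field": "penalty", "value": "10% fee", "evidence": ""},  # no evidence
    ]
    return {**state, "findings": findings, "trace": state["trace"] + ["extractor"]}


def critic(state: State) -> State:
    """Keep only findings whose evidence quote appears verbatim in the document."""
    kept = [
        f for f in state["findings"]
        if f["evidence"] and f["evidence"] in state["document"]
    ]
    return {**state, "findings": kept, "trace": state["trace"] + ["critic"]}


def run(document: str) -> State:
    """Run the nodes in order over one shared state dict."""
    state: State = {"document": document, "plan": [], "findings": [], "trace": []}
    for node in (planner, extractor, critic):
        state = node(state)
    return state


result = run("Invoices are due in 30 days. Refunds take 5 business days.")
print(result["trace"], result["findings"])
```

Because the critic runs last and rewrites the findings list, nothing it rejects reaches whatever persistence step follows.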
See the agent trace
Observability & tracing (token / cost / spans)
SHIPPED · demoable
OpenTelemetry-GenAI tracer: per-answer spans with model, tokens, cost.
Every LLM call is a span following OTel GenAI semantic conventions (gen_ai.request.model, token counts, USD cost), surfaced in an inline trace panel and totalled on the eval dashboard. Honesty: it follows OTel conventions; Langfuse is NOT wired.
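A rough sketch of what one such span record might contain, built as a plain dict rather than with the OpenTelemetry SDK. The `gen_ai.*` attribute names follow the OTel GenAI semantic conventions; the `app.cost_usd` attribute name, the price table, and the model id are assumptions, since cost is not part of those conventions.

```python
# Hypothetical USD prices per 1,000 tokens; real prices depend on the provider.
PRICES = {"demo-model": {"input": 0.001, "output": 0.002}}


def llm_span(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Build a span-shaped record for one LLM call."""
    price = PRICES[model]
    cost = input_tokens / 1000 * price["input"] + output_tokens / 1000 * price["output"]
    return {
        "name": f"chat {model}",
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
            "app.cost_usd": round(cost, 6),  # custom attribute, an assumption
        },
    }


def total_cost(spans: list[dict]) -> float:
    """Sum the cost attribute across spans, as a dashboard total would."""
    return sum(s["attributes"]["app.cost_usd"] for s in spans)


spans = [llm_span("demo-model", 1000, 500), llm_span("demo-model", 2000, 0)]
print(total_cost(spans))
```

Keeping cost as a separate, clearly namespaced attribute leaves the standard attributes untouched for any exporter added later.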
See token/cost totals
Computer-use / Playwright (evidence capture)
SHIPPED · demoable
A headed Playwright evidence agent corroborates findings against the web.
Corroborate a finding and the agent drives a real browser to capture a screenshot + source URL + timestamp as an evidence artifact attached to the insight.
Open a document, then corroborate a finding
Model routing across providers
SHIPPED · demoable
Task-to-tier routing policy plus a live model picker over a real gateway.
A routing policy maps task to cheap/mid/strong tiers; the picker (GLM-5.2 / Kimi-K2.7-Code / DeepSeek-V4-Pro via the CometAPI gateway) flows the chosen model into /ask and into the trace span.
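A minimal sketch of a task-to-tier policy with a picker override. The task names, the assignment of each model to a tier, and the fallback to the strong tier are all invented for illustration; only the model names come from the description above.

```python
from typing import Optional

# Hypothetical task names and tier assignments.
TIER_BY_TASK = {"classify": "cheap", "extract": "mid", "critique": "strong"}
MODEL_BY_TIER = {
    "cheap": "GLM-5.2",
    "mid": "Kimi-K2.7-Code",
    "strong": "DeepSeek-V4-Pro",
}


def route(task: str, picked_model: Optional[str] = None) -> str:
    """Return the model for a task; an explicit picker choice wins over policy."""
    if picked_model is not None:
        if picked_model not in MODEL_BY_TIER.values():
            raise ValueError(f"unknown model: {picked_model}")
        return picked_model
    # Assumption: unknown tasks fall back to the strong tier.
    tier = TIER_BY_TASK.get(task, "strong")
    return MODEL_BY_TIER[tier]


print(route("classify"))
print(route("classify", picked_model="DeepSeek-V4-Pro"))
```

The returned model id is the single value that would flow into both the request and the trace span, so the two cannot disagree.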
Pick a model in a workspace
Governance / tool exposure (MCP)
SHIPPED · demoable
An MCP server exposes grounded QA as a governed, audited tool.
A stdlib JSON-RPC MCP server exposes attest_grounded_qa with fail-closed auth and an audit log: the governance surface for letting other agents call this capability safely.
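The fail-closed pattern can be sketched as a single request handler. This is not the server's code: the token check, the `question` argument name, the in-memory audit list, and the -32001 error code (chosen from JSON-RPC's implementation-defined range) are assumptions. The `tools/call` method and the text content result follow the shape MCP uses for tool calls.

```python
import hmac
import json
from typing import Callable, Optional

AUDIT_LOG: list[dict] = []
EXPECTED_TOKEN = "demo-token"  # hypothetical; a real server would load a secret


def handle(raw: str, token: Optional[str], answer_fn: Callable[[str], str]) -> dict:
    """Handle one JSON-RPC request; deny unless the token is present and correct."""
    request = json.loads(raw)
    req_id = request.get("id")
    authorized = token is not None and hmac.compare_digest(token, EXPECTED_TOKEN)
    # Audit every attempt, including denied ones.
    AUDIT_LOG.append({"id": req_id, "method": request.get("method"), "authorized": authorized})
    if not authorized:
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32001, "message": "unauthorized"}}
    params = request.get("params", {})
    if request.get("method") != "tools/call" or params.get("name") != "attest_grounded_qa":
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32601, "message": "method not found"}}
    text = answer_fn(params.get("arguments", {}).get("question", ""))
    return {"jsonrpc": "2.0", "id": req_id,
            "result": {"content": [{"type": "text", "text": text}]}}


call = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "attest_grounded_qa", "arguments": {"question": "When are invoices due?"}},
})
print(handle(call, "demo-token", lambda q: "30 days [doc:a1]"))
print(handle(call, None, lambda q: "30 days [doc:a1]"))
```

Authorization is decided before the method is even inspected, so a missing or wrong token never reaches the tool.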
Python + TypeScript hybrid
SHIPPED · demoable
FastAPI engine plus Next.js/TypeScript UI, both running, both deployed.
The engine and API are Python (FastAPI); this UI is Next.js + TypeScript with strict typing. Deployed full-stack: Vercel frontend + AWS ECS Fargate backend over HTTPS.
You're looking at the TS half
Vector DB: pgvector / PostgreSQL
DESIGNED · not yet wired
pgvector is the production target behind the same retrieve() contract.
agents/rag/pgvector.py implements the same retriever contract as the in-memory default, so the swap is config, not a rewrite. Honesty: the running demo uses in-memory cosine; pgvector-in-Fargate is a documented fast-follow, not yet wired.
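A sketch of how a shared retriever contract makes the swap a configuration choice. The class names, the `chunks` table, the `embedding` column, and the injected `run_query` callable are hypothetical and do not reproduce agents/rag/pgvector.py; the `<=>` operator is pgvector's cosine-distance operator.

```python
import math
from typing import Callable, Protocol, Sequence


class Retriever(Protocol):
    def retrieve(self, query_vec: Sequence[float], k: int) -> list[str]: ...


class InMemoryRetriever:
    """Brute-force cosine similarity over vectors held in a dict."""

    def __init__(self, vectors: dict[str, list[float]]) -> None:
        self._vectors = vectors

    def retrieve(self, query_vec: Sequence[float], k: int) -> list[str]:
        def cosine(a: Sequence[float], b: Sequence[float]) -> float:
            norm = math.hypot(*a) * math.hypot(*b)
            return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

        ranked = sorted(self._vectors,
                        key=lambda i: cosine(query_vec, self._vectors[i]),
                        reverse=True)
        return ranked[:k]


class PgVectorRetriever:
    """Delegates ranking to PostgreSQL; run_query is an injected DB call."""

    # Hypothetical table and column names.
    SQL = "SELECT id FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s"

    def __init__(self, run_query: Callable[[str, tuple], list[tuple]]) -> None:
        self._run_query = run_query

    def retrieve(self, query_vec: Sequence[float], k: int) -> list[str]:
        literal = "[" + ",".join(str(float(x)) for x in query_vec) + "]"
        return [row[0] for row in self._run_query(self.SQL, (literal, k))]


def make_retriever(backend: str, **deps) -> Retriever:
    """Pick the implementation from config; callers only see retrieve()."""
    if backend == "memory":
        return InMemoryRetriever(deps["vectors"])
    if backend == "pgvector":
        return PgVectorRetriever(deps["run_query"])
    raise ValueError(f"unknown retriever backend: {backend}")


retriever = make_retriever("memory", vectors={"a1": [1.0, 0.0], "b2": [0.0, 1.0]})
print(retriever.retrieve([1.0, 0.0], k=1))
```

Injecting the query function keeps the sketch runnable without a database and makes the pgvector path testable with a fake.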
Azure (OpenAI, AI Search, Blob, Key Vault, App Insights)
DESIGNED · not yet wired
Every dependency is mapped to its Azure equivalent in the README.
Portability is documented dependency-by-dependency (OpenAI to Azure OpenAI, retriever to AI Search, etc.). Honesty: documented only; I make no Azure OpenAI hands-on claim.
Hybrid + re-ranked retrieval
NEXT STEP · on the roadmap
Not built yet; the retriever is pure cosine today.
I can explain the design (BM25 + dense fusion, then a cross-encoder rerank) and where it slots into retrieve(), but it isn't implemented. I can speak to it as a concept; it is not demoed.
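As a concept sketch only, since none of this is in the codebase: one common way to fuse a BM25 ranking with a dense ranking is reciprocal rank fusion, followed by a rerank over the fused shortlist. The choice of RRF, the constant 60, and the `score_fn` stand-in for a cross-encoder are assumptions; the design described above does not name a fusion method.

```python
from typing import Callable


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query: str, doc_ids: list[str],
           score_fn: Callable[[str, str], float], top_n: int) -> list[str]:
    """Re-order the fused shortlist with a pairwise scorer (cross-encoder stand-in)."""
    return sorted(doc_ids, key=lambda d: score_fn(query, d), reverse=True)[:top_n]


bm25_ranking = ["a", "b", "c"]   # hypothetical keyword ranking
dense_ranking = ["b", "c", "a"]  # hypothetical cosine ranking
fused = rrf([bm25_ranking, dense_ranking])
print(fused)
```

RRF needs only ranks, not comparable scores, which is why it is a convenient way to combine BM25 with cosine similarity.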
Context compaction + tool-call repair
NEXT STEP · on the roadmap
Not built yet; understood as a concept, not in the codebase.
Compaction (summarising older turns to fit the window) and tool-call repair (re-prompting on malformed tool JSON) are day-to-day JD items I can describe precisely but have not shipped here.
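Both ideas can be sketched in a few lines, as a concept illustration only since neither is shipped here. The `model` and `summarise` callables are hypothetical stand-ins for LLM calls, and the expected tool-call shape (`name` plus `arguments`) is an assumption.

```python
import json
from typing import Callable


def call_tool_with_repair(model: Callable[[str], str], prompt: str,
                          max_attempts: int = 3) -> dict:
    """Ask for a tool call; on malformed JSON, re-prompt with the parse error."""
    message = prompt
    for _ in range(max_attempts):
        raw = model(message)
        try:
            call = json.loads(raw)
            if (isinstance(call, dict) and "name" in call
                    and isinstance(call.get("arguments"), dict)):
                return call
            error = "expected an object with 'name' and 'arguments'"
        except json.JSONDecodeError as exc:
            error = str(exc)
        message = (f"{prompt}\n\nYour last tool call was invalid ({error}). "
                   "Reply with valid JSON only.")
    raise ValueError("model did not produce a valid tool call")


def compact(turns: list[str], keep_last: int,
            summarise: Callable[[list[str]], str]) -> list[str]:
    """Replace all but the last keep_last turns with a single summary turn."""
    if len(turns) <= keep_last:
        return turns
    cut = len(turns) - keep_last
    older, recent = turns[:cut], turns[cut:]
    return [f"Summary of earlier turns: {summarise(older)}"] + recent


# Fake model: first reply is truncated JSON, second is valid.
replies = iter(['{"name": "search", "arguments": ',
                '{"name": "search", "arguments": {"q": "late fee"}}'])
call = call_tool_with_repair(lambda _msg: next(replies), "Find the late fee.")
print(call)
```

Feeding the parser's error message back into the re-prompt gives the model something specific to fix rather than a bare retry.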
Langfuse-wired observability
NEXT STEP · on the roadmap
Tracer follows OTel conventions; Langfuse itself is not wired.
The shipped tracer emits OTel-GenAI-shaped spans (token/cost/model), which is the hard part. A Langfuse exporter is the next step, not a current claim.
Coaching / mentoring juniors (Manager signal)
SHIPPED · demoable
The teaching layer (course/, prep/) is itself the proof.
Modules 0 to 3 plus capstone and the prep layer are written to bring someone from zero to this build: the artifact a Manager produces, not just code.