
Real-world AI patterns powering cognitive support systems


The four technical patterns central to an AI-assisted memory support system — multi-format RAG, multi-agent orchestration, human-in-the-loop gates, and adaptive routing — are all shipping in production today. None is theoretical. Three findings stand out: the ingestion layer, not the vector store, is the true differentiator in production RAG; deterministic backbones, not autonomous agents, define real multi-agent systems; and user-state-aware orchestration remains the least mature of the four, confined mostly to the education and mental-health domains.


Multi-format RAG is a solved ingestion problem

Three shipping systems demonstrate the full spectrum from raw document processing to domain-specific medical RAG to personal knowledge management.

Unstructured.io is the de facto ingestion layer for enterprise RAG pipelines. It handles 45+ file formats — PDFs, images via OCR, emails, presentations, spreadsheets, HTML — and transforms them into AI-ready chunks. The key innovation is element-based document classification: a computer-vision and NLP pipeline identifies structural elements (titles, tables, headers, images) before chunking, preserving semantic relationships that naive token-based splitting destroys. It is SOC 2 Type 2 audited, HIPAA compliant, and ISO 27001 certified; IBM's watsonx.data integration uses it in production, and it processes 15M+ pages per hour per workflow. One gap: no native audio processing — voice memos require a separate ASR step (e.g., Whisper) upstream.

Abridge is the most directly relevant example for cognitive/medical support. It captures doctor-patient conversations via proprietary medical ASR, transcribes them, then uses RAG to pull in prior patient notes, lab trends, and health system guidelines to generate structured clinical documentation inside Epic EHR. Every AI-generated sentence links back to its source audio via a "Linked Evidence" feature — a built-in provenance chain. Deployed in 100+ health systems including Johns Hopkins and Mayo Clinic. Raised $250M Series D in February 2025 at a $2.75B valuation. The architecture: medical ASR → clinical NLP entity extraction → semantic retrieval over prior notes → contextual generation → structured Epic output.

Khoj is an open-source personal AI assistant that ingests PDFs, Markdown, images, Notion pages, GitHub repos, and WhatsApp voice notes into a searchable conversational interface. It is fully self-hostable with local LLMs via Ollama for complete data sovereignty, and uses open-source Hugging Face sentence-transformers for embedding plus cross-encoder reranking for retrieval quality. Y Combinator-backed, it has thousands of users across browser, Obsidian, Emacs, desktop, and mobile clients.

Architecture pattern: Ingest → Parse/Classify Elements → Chunk (element-aware) → Embed → Vector Store → Retrieve + Rerank → Augment with Context → Generate. The differentiator across all three systems is the parse/classify step — not the vector store, which is now commodity infrastructure.
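The retrieve → rerank → augment tail of that pipeline can be sketched end to end. Everything here is a deliberately toy stand-in: the bag-of-words `embed` replaces a real sentence-transformer, and `rerank` replaces a cross-encoder — only the shape of the pipeline is the point.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call a
    sentence-transformer model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """Commodity piece of the stack: store embeddings, return nearest chunks."""
    def __init__(self):
        self.docs: list[tuple[Counter, str]] = []

    def add(self, chunk: str):
        self.docs.append((embed(chunk), chunk))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Stub for a cross-encoder: order candidates by exact term overlap."""
    q_terms = set(query.lower().split())
    return sorted(candidates, key=lambda c: -len(q_terms & set(c.lower().split())))

def augment(query: str, context: list[str]) -> str:
    """Build the grounded prompt; generation itself is left to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

store = VectorStore()
for chunk in ["Glucose was 98 mg/dL last visit.",
              "Patient reports improved sleep.",
              "Prescribed metformin 500 mg daily."]:
    store.add(chunk)

query = "what was the glucose level?"
prompt = augment(query, rerank(query, store.retrieve(query, k=2)))
```

Note where the real differentiation lives: `VectorStore` is the commodity part, while the quality of what goes *into* it (the parse/classify/chunk stage upstream) decides what retrieval can ever surface.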


Multi-agent orchestration runs on deterministic backbones

Production multi-agent systems universally reject the "autonomous agents chatting freely" model. Instead, they use deterministic control flow with specialized agents invoked at defined points.

LinkedIn's Hiring Assistant uses LangGraph with a planner-executor split. A Planner agent interprets recruiter intent and decomposes it into subtasks. An Executor dispatches to specialized sub-agents (candidate sourcing, ATS querying, outreach writing) via gRPC. LinkedIn explicitly chose this over ReAct because "LLMs become unreliable when juggling too many things simultaneously." Sub-agents are registered in a Skill Registry — an internal API marketplace with metadata on version, owner, latency, and input schema. Each recruiter gets an async agent instance with memory that persists across sessions, combining experiential preferences with LinkedIn's 1B+ member Talent Graph. Every action is tagged with OpenTelemetry trace IDs for full reasoning reconstruction.

DocuSign built sales outreach automation on CrewAI Flows — a deterministic backbone with specialized agent crews at specific pipeline steps. A Researcher agent pulls data from Salesforce and Snowflake, a Composer drafts outreach, a Validator checks quality. The critical design choice: "They're not free-roaming — they're specialized intelligence invoked at specific steps, doing specific jobs, then returning control." The Flow manages state, branching, and validation. A/B tested against human reps, the agents matched or beat engagement metrics while cutting research time from hours to minutes.

Dagger's Container Use solves multi-tenant isolation. Each AI coding agent gets its own isolated environment: a Dagger container plus a Git worktree on its own branch. No shared filesystem. Standard git checkout to review any agent's work. MCP (Model Context Protocol) provides the communication layer. While designed for developer workflows, the pattern maps directly to multi-tenant agent systems — container-per-tenant for compute isolation, worktree-per-tenant for state isolation, with full audit trails via Git commit history.

Architecture pattern: Supervisor/Planner → Deterministic State Graph (LangGraph/CrewAI Flows) → Specialized Worker Agents → Shared State + Memory → Human Checkpoints. The state graph is the backbone — control always returns to it between agent invocations. LangGraph reports 400+ companies in production; CrewAI claims 60% of Fortune 500.
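The backbone idea — workers invoked at fixed points, control always returning to an explicit graph — fits in a short sketch. This is plain Python under stated assumptions, not the LangGraph or CrewAI API; the `researcher`/`composer`/`validator` names echo DocuSign's crew roles purely for illustration.

```python
from typing import Callable, Optional

State = dict

def researcher(state: State) -> State:
    # Worker 1: gather context for the task (stubbed).
    state["research"] = f"notes on {state['task']}"
    return state

def composer(state: State) -> State:
    # Worker 2: draft output from the research.
    state["draft"] = f"draft using {state['research']}"
    return state

def validator(state: State) -> State:
    # Worker 3: quality check before anything ships.
    state["approved"] = "draft" in state
    return state

# The deterministic backbone: an explicit node graph mapping each node to
# (worker function, next node). Control always returns here between
# invocations -- agents never hand off to each other directly.
GRAPH: dict[str, tuple[Callable[[State], State], Optional[str]]] = {
    "researcher": (researcher, "composer"),
    "composer":   (composer, "validator"),
    "validator":  (validator, None),
}

def run(task: str, entry: str = "researcher") -> State:
    state: State = {"task": task}
    node: Optional[str] = entry
    while node is not None:
        fn, node = GRAPH[node]
        state = fn(state)        # invoke the specialist, then take control back
    return state

result = run("outreach email")
```

Branching, retries, and human checkpoints would be additional edges and nodes in `GRAPH` — the point is that the topology is declared up front, not negotiated by the agents at runtime.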


Human-in-the-loop gates are architecturally enforced, not bolted on

The strongest HITL implementations make human review structurally unavoidable rather than optional.

Viz.ai (stroke triage, 1,700+ hospitals) demonstrates the purest evaluator gate. When a CT angiogram arrives, the AI performs vessel segmentation and applies a fixed threshold. If exceeded, it sends a push notification to the stroke team. The critical design: the AI produces no diagnostic output whatsoever — only a notification recommending review. The human gate is architecturally enforced because there is literally nothing to act on without a clinician's review. FDA-cleared via De Novo pathway. Reduces CTA-to-notification time from 58 minutes to 7 minutes — a life-saving acceleration that still requires human judgment at every decision point.

LangGraph's interrupt() primitive provides general-purpose evaluator-gate infrastructure. At designated graph nodes, execution pauses and state is checkpointed to persistent storage (PostgreSQL). A human reviewer receives the interrupt, then issues a Command(resume=...) to approve, edit, or reject. The newer HumanInTheLoopMiddleware enables per-tool policies — write_file allows all three decisions (approve/edit/reject), execute_sql allows only approve/reject, read_data auto-approves. Because state is checkpointed, an interruption can last seconds or months. Monday.com built a full eval-driven CI/CD loop on this: prompts, datasets, and evaluators are synced on every PR merge. Auth0 published a production pattern using LangGraph interrupts with CIBA (Client-Initiated Backchannel Authentication) for financial authorizations requiring supervisor approval.

Meta's content moderation routes billions of posts daily through confidence-threshold gates. High-severity + high-confidence content (CSAM, terrorism) auto-removes. Medium-confidence content routes to human reviewer queues. Low-confidence content passes. The thresholds are actively tuned as policy levers — Meta lowered them during the Israel-Gaza conflict (aggressive removal, more false positives) and raised them in January 2025 (fewer false positives, less coverage). A Dynamic Multi Review system adjusts the number of required human reviewers based on virality and harm potential. In 2025, Meta added an LLM "second opinion" layer to screen clearly non-violating posts out of human queues, freeing reviewers for genuinely borderline content.

Architecture pattern: AI Inference → Confidence/Risk Scoring → Threshold Check → Route (auto-act | human queue | pass-through) → Human Decision (approve/edit/reject) → Feedback Loop. The key variable is threshold tunability: healthcare uses fixed thresholds (regulatory mandate), content moderation uses dynamic thresholds (policy lever), and agent infrastructure uses configurable per-action thresholds (developer choice).
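The threshold-check routing step reduces to a small pure function. The threshold values below are illustrative defaults, not any vendor's real settings — the point is that they are ordinary parameters, which is exactly what makes them tunable policy levers.

```python
def route(confidence: float, severity: str,
          auto_act: float = 0.95, needs_review: float = 0.60) -> str:
    """Confidence/risk gate: map a scored item to one of three routes.
    Thresholds are the policy lever -- fixed by regulators in healthcare,
    tuned dynamically in content moderation, set per-action by developers."""
    if severity == "high" and confidence >= auto_act:
        return "auto_act"          # e.g. auto-remove high-severity content
    if confidence >= needs_review:
        return "human_queue"       # borderline: send to a reviewer
    return "pass_through"          # low confidence: no action

decisions = [route(0.98, "high"), route(0.75, "medium"), route(0.30, "low")]
```

Lowering `needs_review` widens the human queue (more false positives reach reviewers); raising it narrows coverage — the same trade-off Meta tuned in opposite directions in 2023 and 2025.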


Adaptive orchestration is shipping for tasks, still emerging for users

This is the most cutting-edge area. Task-level routing (which model for which query?) is mature and shipping. User-level adaptation (how does the system change for this specific person?) remains domain-specific and early.

Not Diamond is a shipping model router that analyzes prompt complexity and routes each query to the optimal LLM. A pre-trained router uses prompt embeddings fed through random forest classifiers (open-sourced as RoRF) to predict which model will perform best. Users can submit evaluation data to train custom routers optimized for their specific use case — training takes under 60 minutes and routers can be incrementally updated. Three tradeoff modes: quality, cost, and latency via Pareto optimization. Integrated into OpenRouter's Auto Router powering model selection for its entire user base. Free tier handles 100K monthly routing requests. The routing decision completes "in less time than it takes to stream a single token."

Martian takes a different approach, using mechanistic interpretability to predict model performance without running the model. Their proprietary "Model Mapping" technology unpacks LLM internals to understand what makes models succeed or fail on specific inputs — described as the first commercial application of mechanistic interpretability. Used by developers at 300+ companies including Amazon and Zapier. Integrated into Accenture's Switchboard Service for enterprise LLM governance. Claims up to 98% cost savings by routing simple queries to cheaper models.

Khanmigo (Khan Academy) is the strongest example of user-state-aware adaptation at scale. Serving 700,000+ K-12 students, it adapts tutoring in real-time based on student responses — adjusting hint specificity, question difficulty, language complexity, and pedagogical approach. It never gives direct answers, instead using Socratic questioning that targets exactly where a student got stuck. Built on GPT-4 with heavy prompt engineering integrated into Khan Academy's content graph. The adaptation loop: student response → correctness/effort assessment → adjusted hints → progress tracking → teacher dashboard aggregation. Projected to surpass 1 million students in 2025-26.
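A user-state adaptation loop of this shape can be sketched concretely. This is an illustrative model of the pattern, not Khanmigo's implementation: the three hint levels and the "two recent misses" rule are invented, but the loop — response, assessment, adjusted scaffolding — mirrors the one described above.

```python
from dataclasses import dataclass, field

@dataclass
class StudentState:
    hint_level: int = 1                       # 1 = vague nudge ... 3 = very specific
    history: list = field(default_factory=list)

    def record(self, correct: bool):
        """Assessment step: update scaffolding from recent correctness."""
        self.history.append(correct)
        recent = self.history[-3:]
        if not correct and recent.count(False) >= 2:
            self.hint_level = min(3, self.hint_level + 1)   # struggling: be more specific
        elif correct and all(recent):
            self.hint_level = max(1, self.hint_level - 1)   # cruising: back off

    def next_hint(self, topic: str) -> str:
        """Socratic prompt at the current specificity -- never a direct answer."""
        prompts = {1: f"What do you notice about {topic}?",
                   2: f"Which rule for {topic} applies here?",
                   3: f"Try applying the first step of {topic} directly."}
        return prompts[self.hint_level]

s = StudentState()
s.record(False)
s.record(False)            # two misses in a row -> escalate specificity
hint = s.next_hint("fractions")
```

In a production tutor, `record` would be driven by an LLM's assessment of the student's free-text response, and the aggregated `StudentState` is what feeds the teacher dashboard.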

Architecture pattern: Input Analysis (embed/classify) → Routing Policy (trained classifier | interpretability model | LLM-as-router) → Agent/Model Selection → Execution → Feedback Loop (eval scores → retrain router). The maturity gap is clear: Not Diamond and Martian handle task routing excellently but don't learn individual user patterns over time. Khanmigo adapts to individual users but through domain-specific pedagogical scaffolding rather than general-purpose meta-orchestration. A system combining both — task-aware model routing and user-state-aware behavioral adaptation — does not yet exist as a general framework.
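The input-analysis → routing-policy → feedback loop can be sketched with a toy classifier. The keyword/length heuristic below is a hypothetical stand-in for Not Diamond's trained router, and the prices are invented — the interesting part is that the feedback step mutates the policy, the analogue of retraining a router on fresh eval data.

```python
# Hypothetical model tiers and costs, for illustration only.
MODELS = {"small": 0.001, "large": 0.03}     # $/1K tokens (invented)

def classify_complexity(prompt: str) -> str:
    """Input analysis: a toy stand-in for an embedding-based classifier."""
    hard_markers = {"prove", "analyze", "multi-step", "derive"}
    words = prompt.lower().split()
    return "hard" if len(words) > 40 or hard_markers & set(words) else "easy"

# Routing policy: mutable state, so the feedback loop can update it.
ROUTING_POLICY = {"easy": "small", "hard": "large"}

def route(prompt: str) -> str:
    return ROUTING_POLICY[classify_complexity(prompt)]

def feedback(label: str, model: str, score: float, threshold: float = 0.7):
    """Feedback loop: if eval scores for the cheap tier drop below the
    threshold, promote that class of inputs to the stronger model."""
    if score < threshold and model == "small":
        ROUTING_POLICY[label] = "large"

choice = route("summarize this note")        # classified easy -> cheap model
feedback("easy", choice, score=0.4)          # evals say the cheap model failed
rerouted = route("summarize this note")      # policy has been updated
```

What this sketch deliberately lacks is any notion of *who* is asking — per-user state never enters the routing decision, which is precisely the maturity gap the paragraph above describes.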


Conclusion

The technical infrastructure for an AI-assisted cognitive support system exists today across shipping products, but no single product combines all four patterns. The clearest path forward draws from specific implementations: Unstructured.io's element-aware parsing for multi-format ingestion, LangGraph's deterministic state graph with specialist agents for orchestration, Viz.ai's architecturally enforced human gates (where the AI literally cannot bypass review) for safety, and Not Diamond's trainable routing combined with Khanmigo's user-state adaptation for personalization. The gap worth noting: user-level meta-orchestration that learns how a specific person's cognitive needs change over time and adapts agent behavior accordingly has no general-purpose implementation yet — it exists only in narrow domains. That's the novel engineering challenge at the center of this story.