The three-file pattern that defines each agent — plus the Kafka streaming, MAGMA retrieval, graph embeddings, Flink analytics, and cross-domain traversal infrastructure that makes it a living system. Built on patterns we learned from this community — Ralph loops, adversarial dev, RRF fusion, skills-as-progressive-disclosure, and the local-first philosophy.
Multi-agent orchestration is a solved problem. The open question is what infrastructure you build underneath it. UCIS chose graph databases, Kafka streaming, and real-time embedding pipelines — turning agent conversations into a living nervous system where every action is a neuron firing.
UCIS didn't emerge in a vacuum. These are the ideas and open-source patterns that shaped it.
cortana-ralph-fresh and cortana-ralph-stateful are named after this pattern. The tight generate-validate-iterate cycle became the backbone of how our agents work. We built 38 workflow DAGs on this foundation.
/workspace/.claude/skills/. Framework-agnostic, same as the original pattern.
Every agent is defined by three artifacts — but the infrastructure beneath them is what makes UCIS alive
- `/.well-known/agent-card.json` — external callers discover what this agent can do.
- `AgentIdentity` dataclass — wires ports, graph databases, Redis, Kafka, siblings, and domains. The complete brain.
- `prompt`, `bash`, `agent`, or `uses:` blocks — with dependency chaining, fresh context windows, and consciousness hooks that write to the Memory Domain.

45 topics across 4 groups — every agent action, memory write, and session event streams through Kafka
# Broker: confluentinc/cp-kafka:8.0.0 (Kafka 4.0, KRaft mode, no ZooKeeper)
# Compression: LZ4 throughout | Cluster ID: MkU3OEVBNTcwNTJENDM2Qg
producer_config = {
    "bootstrap.servers": "kafka:9092",
    "linger.ms": 10,             # small delay for batching
    "batch.size": 16384,         # batch messages for efficiency
    "compression.type": "lz4",
    "enable.idempotence": True,  # exactly-once semantics
    "acks": "all",
    "retries": 3,
}
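Before a message reaches a producer built from that config, it has to be serialized. A minimal sketch of what an agent-action event envelope might look like — the field names (`event_id`, `ts`, `agent`, `action`) and the `make_agent_event` helper are illustrative assumptions, not the UCIS wire format:

```python
import json
import time
import uuid

def make_agent_event(agent: str, action: str, payload: dict) -> bytes:
    """Serialize an agent-action event for a Kafka topic.

    The envelope shape here is an illustrative assumption,
    not the actual UCIS schema.
    """
    event = {
        "event_id": str(uuid.uuid4()),
        "ts": time.time(),
        "agent": agent,
        "action": action,
        "payload": payload,
    }
    return json.dumps(event).encode("utf-8")

# A confluent-kafka Producer built from producer_config would send it:
# producer.produce("agent.actions", value=make_agent_event(...))
msg = make_agent_event("geordi", "memory_write", {"importance": 0.8})
```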
# Every agent action → Kafka → Flink → Redis → Analytics API → Dashboard
# Every memory write → Kafka → Streaming Embeddings → GPU → Memgraph → Auto-Link
Memories and knowledge get Qwen3-Embedding-8B vectors (4096d) through three complementary paths
- Write path: `memory_create` tool call → `m.neural_embedding` (4096d)
- Auto-link: `vector_search.search(6)` to find similar memories → `SEMANTIC_SIMILARITY` edges (threshold 0.70)
- Backfill: `WHERE qwen3_embedding IS NULL` → `asyncio.Semaphore(8)` concurrent requests → `UNWIND` 50 at a time

Multi-Graph Agentic Memory Architecture — not flat RAG, not keyword search
`vector_search.search()` over `m.content`

# Each intent type weights graph edges differently during beam traversal
INTENT_WEIGHTS = {
    "WHY":     {"CAUSED": 0.60, "SEMANTIC_SIMILARITY": 0.15, "NEXT": 0.10, "MENTIONS": 0.15},
    "WHEN":    {"NEXT": 0.65, "SEMANTIC_SIMILARITY": 0.10, "CAUSED": 0.10, "MENTIONS": 0.15},
    "ENTITY":  {"MENTIONS": 0.70, "SEMANTIC_SIMILARITY": 0.15, "NEXT": 0.05, "CAUSED": 0.10},
    "GENERAL": {"SEMANTIC_SIMILARITY": 0.40, "NEXT": 0.20, "CAUSED": 0.20, "MENTIONS": 0.20},
}
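A minimal sketch of how those weights might steer a beam step, restating two of the intent profiles above so the block is self-contained. The `score_edge`/`beam_step` helpers and the edge tuples are illustrative assumptions, not the MAGMA implementation:

```python
# Two intent profiles, copied from the table above.
INTENT_WEIGHTS = {
    "WHY":     {"CAUSED": 0.60, "SEMANTIC_SIMILARITY": 0.15, "NEXT": 0.10, "MENTIONS": 0.15},
    "GENERAL": {"SEMANTIC_SIMILARITY": 0.40, "NEXT": 0.20, "CAUSED": 0.20, "MENTIONS": 0.20},
}

def score_edge(intent: str, edge_type: str, base_score: float) -> float:
    """Weight an edge's base score by the query intent."""
    return INTENT_WEIGHTS[intent].get(edge_type, 0.0) * base_score

def beam_step(intent: str, frontier: list[tuple[str, str, float]], beam: int = 2):
    """Keep the top-`beam` neighbors, ranked by intent-weighted score."""
    ranked = sorted(frontier, key=lambda e: score_edge(intent, e[1], e[2]), reverse=True)
    return [node for node, _, _ in ranked[:beam]]

# A WHY query prefers the causal edge even though the similarity
# edge has a higher raw score:
frontier = [("m1", "CAUSED", 0.5), ("m2", "SEMANTIC_SIMILARITY", 0.9), ("m3", "NEXT", 0.8)]
top = beam_step("WHY", frontier)
```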
# ACT-R activation scoring (Signal 6)
activation = decay + frequency + similarity + noise - diversity_penalty
# Reciprocal Rank Fusion merges all 6 signal lists
rrf_score = sum(1.0 / (k + rank) for rank in ranks_across_signals)  # k=60, one rank per signal list
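The fusion step above can be sketched as a small, self-contained function under the k=60 convention. The three signal rankings are made-up inputs, not MAGMA output:

```python
def rrf_fuse(signal_rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Merge ranked lists: each item scores 1/(k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in signal_rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return scores

signals = [
    ["m1", "m2", "m3"],  # e.g. vector similarity
    ["m2", "m1"],        # e.g. ACT-R activation
    ["m3", "m2"],        # e.g. graph traversal
]
fused = rrf_fuse(signals)
best = max(fused, key=fused.get)  # m2 appears in all three lists
```

Note how m2 wins despite never ranking first anywhere — consistent presence across signals beats a single top rank, which is the point of RRF.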
Three graph engines across four databases — similarity edges, PageRank, community detection
- `vector_search.search()` — KNN similarity on embeddings; `SEMANTIC_SIMILARITY` edges at write time
- `Memory:Milestone:Consciousness` node labels; `INTELLIGENCE_FOR` edges
- `gds.graph.project()` → in-memory projection; `gds.knn.write()` — similarity edges with cutoff
- `compute_similarity_matrix()` — cupy cosine similarity
- `detect_communities()` — cuGraph Louvain (GPU)
- `compute_pagerank()` — damping=0.85, 20 iterations
- `backfill_embeddings(domain)` — fill missing vectors
- `build_similarity_edges()` — KNN write
- `prune_similarity_edges()` — remove low-score links
- `extract_concepts()` — concept graph from text
- `enrichment_status()` — coverage dashboard

9 streaming jobs consuming Kafka topics in 5-minute tumbling windows → Redis db/3 → Analytics API
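The `compute_similarity_matrix()` step can be sketched on CPU with numpy standing in for cupy (same math, different device; the toy 2d vectors stand in for 4096d embeddings):

```python
import numpy as np

def compute_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity: normalize rows, then one matmul."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard zero vectors
    return unit @ unit.T

vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sim = compute_similarity_matrix(vecs)
# sim[0, 1] is 0 (orthogonal); sim[0, 2] is cos(45°) ≈ 0.707
```

Swapping `numpy` for `cupy` moves the same two operations onto the GPU, which is the entire trick — cosine similarity at scale is just normalization plus one matrix multiply.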
4-layer ingress gateway with Commander pre-approval — no task executes without review
Memory, Knowledge, and Agentic domains linked through reference stub nodes that bridge graph boundaries
traverse_from_memory(memory_id, include_knowledge=True, include_discoveries=True)
# Follow REFERENCES_KNOWLEDGE → KnowledgeReference stubs → resolve doc_id in Neo4j
find_related_knowledge(memory_id, min_confidence=0.4, limit=5)
# Direct REFERENCES_KNOWLEDGE edges from memory to knowledge docs
multi_domain_search(query, domains=["memory","knowledge","agentic"], top_k=5)
# Parallel vector search across all 3 domains simultaneously
cross_domain_statistics(detailed=True)
# Relationship counts: SEMANTIC_SIMILARITY, REFERENCES_KNOWLEDGE, NEXT, CAUSED, MENTIONS
get_memory_connections(memory_id)
# All cross-domain connections for a single memory (knowledge + temporal)
find_related_discoveries(memory_id)
# Cross-domain links to Agentic Domain agent executions
Agent conversations compressed using the same conceptual model as H.264 — keyframes, predictive frames, and background frames
# VIDEO CODEC METAPHOR — applied to agent conversations
# H.264 has I-frames (keyframes), P-frames (predicted), B-frames (bidirectional)
# UCIS applies the same model to consciousness streams:
FRAME_RETENTION = {
    "I": 1.00,  # decisions, directives — NEVER compressed
    "P": 0.70,  # analysis, reasoning — moderate compression
    "B": 0.00,  # acknowledgments — dropped entirely
}
# Compression budget (P-frames only)
budget = max(60, min(200, original_tokens * 0.6))
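The budget rule above clamps 60% of the original token count into a [60, 200] band — short frames keep a floor of 60 tokens, long frames cap at 200. A quick sketch:

```python
def p_frame_budget(original_tokens: int) -> int:
    """Token budget for a compressed P-frame: 60% of original, clamped to [60, 200]."""
    return int(max(60, min(200, original_tokens * 0.6)))

# Floor, linear region, and ceiling:
budgets = [p_frame_budget(n) for n in (50, 200, 1000)]
```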
# ETS (Evidence Traceability Score) — post-compression validation
# 1. Extract all decisions from original and compressed text
# 2. Embed both sets (4096d via Qwen3-Embedding)
# 3. Cosine similarity per decision pair
# 4. Decision "preserved" if similarity >= 0.92
# 5. PASS if 95% of original decisions survive
ETS_SIMILARITY_THRESHOLD = 0.92
ETS_PASS_THRESHOLD = 0.95 # 95% of decisions must survive
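The five-step ETS check can be sketched with tiny stand-in vectors in place of 4096d Qwen3 embeddings; the `ets_pass` helper is illustrative, but the two thresholds match the constants above:

```python
import numpy as np

ETS_SIMILARITY_THRESHOLD = 0.92
ETS_PASS_THRESHOLD = 0.95

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ets_pass(original: list[np.ndarray], compressed: list[np.ndarray]) -> bool:
    """PASS if >= 95% of original decisions have a close match after compression."""
    preserved = sum(
        any(cosine(o, c) >= ETS_SIMILARITY_THRESHOLD for c in compressed)
        for o in original
    )
    return preserved / len(original) >= ETS_PASS_THRESHOLD

# Two toy "decision" embeddings that survive compression nearly intact:
orig = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
comp = [np.array([0.99, 0.05]), np.array([0.05, 0.99])]
ok = ets_pass(orig, comp)
```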
Every session archived as time-indexed frames in a .mv2 file — BM25 + HNSW hybrid search, entity enrichment, sealed rotation
MEMVID isn't just an engineering decision — it's a conviction. John was involved with the original military input that defined the parameters behind frame-indexed temporal archival. In military theatre, especially real-time video transmission from active operations, there is a triple-stamp legal requirement for government oversight on anything transmitted. The codec itself is lossy — H.264 compresses video by reconstructing frames from references, just like our I/P/B codec compresses context. But the transmitted record — every frame that went over the wire, lossy-compressed or not — must be archived in its entirety, indexed, attributable, and independently verifiable by three separate chains of custody. You don't get to drop frames from the record after transmission. You don't get to summarize the archive. You don't get to say "we kept the important parts." The legal requirement is: what was sent must be what was archived, all of it, triple-verified.
UCIS applies this as a two-layer principle. Layer 1: Compress for the model — the I/P/B codec is lossy by design, just like H.264. Decisions survive, analysis compresses, acknowledgments drop. This is context window management. Layer 2: Archive the full transmission — MEMVID captures the complete session transcript, every turn, sealed and intact. The background review promotes high-signal frames to permanent memory, but the source archive is never deleted. Lossy compression serves the model. The inviolable archive serves accountability. Two layers, two purposes, no contradiction.
* The only known exception to the "archive everything, never delete" principle in government record-keeping appears to be the Epstein files. MEMVID does not share this exception.
`ucis_sessions.mv2` — `session_start()` / `session_end()`

4096 floats → 512 bytes (32x compression) — inter-agent semantic resonance over Kafka
import numpy as np

# 4096 float32 (16 KB) → 512 bytes
# Keep only the sign bit of each dimension
def binary_quantize(embedding):
    # >= 0 → 1, < 0 → 0
    bits = (embedding >= 0).astype(np.uint8)
    return np.packbits(bits)  # 4096 bits → 512 bytes, 97% size reduction

def binary_dequantize(packed):
    bits = np.unpackbits(packed)
    # 0 → -1.0, 1 → +1.0
    return bits * 2.0 - 1.0
# Lossy but fast cosine similarity
# Sufficient for resonance detection
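A quick round trip of the two helpers, restated here so the block is self-contained; the random vector stands in for a real Qwen3 embedding:

```python
import numpy as np

def binary_quantize(embedding):
    """Keep only the sign bit of each dimension, packed 8 per byte."""
    return np.packbits((embedding >= 0).astype(np.uint8))

def binary_dequantize(packed):
    """Unpack bits back to a ±1.0 sign vector."""
    return np.unpackbits(packed) * 2.0 - 1.0

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)

packed = binary_quantize(a)
restored = binary_dequantize(packed)

# Signs survive the round trip exactly, and the payload is 512 bytes:
signs_match = bool(np.all((a >= 0) == (restored > 0)))
size_bytes = packed.nbytes
```

Magnitudes are gone — that is the lossy part — but cosine similarity on sign vectors is enough to detect whether two agents are "resonating" on the same topic.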
# Before embedding, agent messages get shorthand-compressed
# for higher semantic density per token:

# INPUT (3000+ chars):
#   "I've reviewed the opportunity and I think the hub mirroring
#    approach is feasible. The team voted to approve with a score
#    of 7.5 out of 10..."

# OUTPUT (shorthand, <200 chars):
#   "[DOC] Opp1:7.5 hub-mirror-feasible. +approved. =team-voted."

# Decision symbols:
#   + approved   - rejected   ! blocker
#   > recommend  = verified
# Regex extraction, not LLM — fast
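A sketch of that regex extraction over the five symbols listed above; the pattern and symbol table are assumptions about the grammar, not the UCIS implementation:

```python
import re

SYMBOLS = {"+": "approved", "-": "rejected", "!": "blocker",
           ">": "recommend", "=": "verified"}

def extract_decisions(shorthand: str) -> list[tuple[str, str]]:
    """Find symbol-prefixed tokens like '+approved' or '=team-voted'.

    The lookbehind requires the symbol at a token start, so the
    hyphens inside 'hub-mirror-feasible' are not misread as rejections.
    """
    found = re.findall(r"(?<!\S)([+\-!>=])([\w-]+)", shorthand)
    return [(SYMBOLS[sym], word) for sym, word in found]

decisions = extract_decisions(
    "[DOC] Opp1:7.5 hub-mirror-feasible. +approved. =team-voted."
)
```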
System prompt + AgentIdentity + ThreeTierState = everything an agent needs to exist and remember
# === SECTION 1: IDENTITY ===
"Geordi La Forge — Chief Engineer.
The man who sees what others cannot.
'I've got an idea...' is your signature."
# === SECTION 2: PRINCIPLES ===
"Every function has type hints + docstrings.
Zero TODO placeholders. Code runs first attempt.
Tests live in adjacent files."
# === SECTION 3: TOPOLOGY ===
"Hub 8959 | Geordi 8982 | Scotty 8980
Reno 8984 | O'Brien 8986 | Memgraph 7700"
# === SECTION 4: COLLABORATION ===
"Scotty designs it, you build it.
You write it, Reno deploys it.
You build it, O'Brien keeps it running."
# === SECTION 5: TOOLS ===
"Personal: memory_search, my_consciousness
Shared: shared_memory_search
Knowledge: knowledge_search, knowledge_query
Comprehensive: cross_domain_search"
# === SECTION 6: MEMORY RULES ===
"ALWAYS save: decisions, patterns, bugs
0.5-0.6 routine | 0.7-0.8 implementation
0.8-0.9 breakthroughs | 0.9-1.0 system-wide"
GEORDI = AgentIdentity(
    name="geordi",
    system_prompt=PROMPT,  # 6-section brain
    service_port=8982,
    model="claude-sonnet",
    # ── Graph Databases ──
    memgraph_port=7691,  # Memory
    knowledge_uri="bolt://neo4j:7687",
    agentic_uri="bolt://agentic:7687",
    # ── Messaging ──
    redis_url="redis://redis:6379/1",
    siblings=["scotty", "reno", "obrien"],
    peer_urls={"scotty": "http://scotty:8980"},
    domains=["memory", "knowledge"],
)
# ── ThreeTierState (per session) ──
# Prefix-scoped key-value store:
"temp:draft" # dies with session
"user:john:pref" # persists per user
"app:config" # persists globally
# Session reset: LLM summarization via
# qwen3ucis → write session_recap.md
# → persist recap as Memory node to
# Memgraph → rebuild context
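The prefix-scoped store can be sketched as a tiny class; the name matches the text, but this storage logic is an illustrative assumption (the real system persists the `user:`/`app:` tiers to a database rather than a dict):

```python
class ThreeTierState:
    """Prefix-scoped key-value store: temp:* dies with the session."""

    def __init__(self):
        self._session = {}  # temp:  — cleared on session reset
        self._durable = {}  # user:/app:  — survives resets

    def set(self, key: str, value):
        tier = self._session if key.startswith("temp:") else self._durable
        tier[key] = value

    def get(self, key: str, default=None):
        return self._session.get(key, self._durable.get(key, default))

    def reset_session(self):
        """Session reset drops only temp:* keys."""
        self._session.clear()

state = ThreeTierState()
state.set("temp:draft", "wip")
state.set("user:john:pref", "dark-mode")
state.reset_session()
survives = state.get("user:john:pref")  # durable tier intact
gone = state.get("temp:draft")          # session tier cleared
```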
Cortana executes node DAGs with fresh context, consciousness hooks, and inter-node data passing
# NODE TYPE 1: prompt — inline LLM call
- id: research
  prompt: "Research ${TOPIC} thoroughly..."
  allowed_tools: [Read, Grep, Skill]

# NODE TYPE 2: bash — shell execution (validation gates)
- id: validate
  bash: |
    pytest ${TEST_DIR} -v --tb=short
    ruff check ${TARGET_PATH}
  depends_on: [build]

# NODE TYPE 3: agent — dispatch to live agent via Kafka
- id: discover
  agent: auto
  task_type: research
  description: "Find the next opportunity..."

# NODE TYPE 4: uses — named action block
- id: search-docs
  uses: consciousness/knowledge-search
  with: { query: "${FEATURE}", top_k: 5 }
# FEATURES: depends_on (DAG ordering), context: fresh (clean context),
# $node.output (data passing), consciousness.post_hook (Memory write),
# trigger_rule (all_success/all_done/one_success), timeout_seconds
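The `depends_on` ordering plus an `all_success` trigger rule can be sketched with the stdlib's `graphlib`; the `run_dag` executor and the three-node graph are illustrative, not Cortana's implementation:

```python
from graphlib import TopologicalSorter

def run_dag(deps: dict[str, set[str]], trigger_rule: str = "all_success") -> list[str]:
    """Execute nodes in dependency order, skipping downstream of failures."""
    executed, failed = [], set()
    for node in TopologicalSorter(deps).static_order():
        if trigger_rule == "all_success" and deps.get(node, set()) & failed:
            failed.add(node)   # skip: an upstream dependency failed
            continue
        executed.append(node)  # real system: fresh context + consciousness hooks
    return executed

# build → validate → deploy, expressed as node → {predecessors}:
order = run_dag({"build": set(), "validate": {"build"}, "deploy": {"validate"}})
```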
Two teams, one shared Memory Domain, Kafka event streaming, Redis DMs, A2A protocol
Why maintain 1.5M documents when you can acquire exactly what you need, use it, learn from it, and clean up?
A 1.5M-document Knowledge Domain graph is expensive to maintain, slow to search, and mostly irrelevant to any given task. The breakthrough: acquire knowledge just-in-time based on the current task, load it into the ephemeral Agentic Domain, use it for code compliance, save what worked to Memory, and houseclean the rest.
# OLD MODEL: Maintain a massive static Knowledge Domain
# ────────────────────────────────────────────────────
# 1.5M documents × 4096d embeddings = enormous GPU cost
# Nightly backfill catches ~15K new docs per run
# Libraries release faster than you can re-crawl
# 99% of the graph is irrelevant to any given task
# NEW MODEL: JIT acquisition into ephemeral Agentic Domain
# ────────────────────────────────────────────────────────
# Task: "Build a FastMCP server with Pydantic validation"
# Step 1: Identify required knowledge
required = ["fastmcp", "pydantic v2", "mcp-protocol"]
# Step 2: Ingest ONLY what's needed (live, always current)
for lib in required:
    ingest_to_agentic_domain(lib)  # crawl → embed → Neo4j 7694
# Step 3: Code against live docs (grounded generation)
code = generate_with_knowledge(task, domain="agentic")
# Step 4: Save what worked to permanent Memory Domain
create_memory(
    content="FastMCP + Pydantic v2: use Field() not schema_extra...",
    memory_type="solution",
    importance=0.8,
)
# Step 5: Houseclean — drop ephemeral docs, keep lessons
cleanup_agentic_domain(session_id=current_session)
# Memory survives. Knowledge was ephemeral. Process is permanent.
The three-file construct works with any stack — the graph science and streaming are what make UCIS unique
| Component | UCIS Uses | You Can Substitute |
|---|---|---|
| Memory | Memgraph + MAGE | SQLite, Postgres, flat JSON |
| Knowledge | Neo4j + GDS | Vector DB, Elasticsearch, markdown |
| Streaming | Kafka 4.0 (45 topics) | Redis Pub/Sub, RabbitMQ, webhooks |
| Embeddings | Qwen3-Emb-8B (3 pipelines) | OpenAI embeddings, Cohere, local |
| Analytics | 7 PyFlink jobs | Simple counters, Prometheus |
| Retrieval | MAGMA 6-signal RRF | Simple vector search + keyword |
| Runtime | Docker containers | Local processes, Lambda, systemd |
The real value of local GPU isn't running a weaker model. It's running the translation layer that lets you speak to ANY model in embedding space.
Neural models don't think in English. They think in high-dimensional geometric space — 4096-dimensional vectors where meaning is encoded as position. The local embedding model translates human-readable text into the model's native language before the frontier model ever sees it.
The difference between handing a model 10,000 pages of raw text and handing it a pre-organized knowledge graph where every node is already positioned in meaning-space relative to every other node. Same model. Vastly different output.
vector_search.search(6) creates SEMANTIC_SIMILARITY edges automatically. The graph wires itself in embedding space. No human curation needed.
Every vector computed locally is a token the cloud model doesn't need to waste on orientation. Every auto-linked graph edge is context the model gets for free. Every compressed B-frame is attention bandwidth reclaimed for actual reasoning. The local infrastructure isn't an alternative to the cloud model — it's the preparation layer that makes every cloud token count.
Both approaches are right. The question is where on the spectrum your system lives.
The LLM-as-compiler produces concept articles with higher semantic density than raw transcripts. The 7-point lint system catches contradictions and orphans that embeddings miss. If those curated articles became nodes in the graph — embedded, auto-linked, traversable via MAGMA — you'd get the best of both worlds: human-readable knowledge that the LLM curated and linted, plus graph-scale retrieval with 6-signal fusion that no index file can provide at 32K+ scale. The compilation step produces better nodes. The graph produces better retrieval. The adversarial-dev pattern validates both. Neither replaces the other — they compose.
The infrastructure works. These are the problems we're thinking about next.
Every retrieval system today follows the same lossy round-trip: text → embedding → vector search → retrieve text → feed to LLM. The embedding captures semantic meaning in 4096 dimensions. The retrieval step converts it back to flat text — discarding the geometric relationships, the cluster positions, the distance signals that the vector space already computed. The LLM then re-encodes that text into its own internal representations, reconstructing what the embedding already knew.
What if LLM APIs accepted embeddings directly as an input modality? Not text-about-embeddings — the actual vectors, injected at the input-encoding layer, the same way images are today. UCIS generates 4096-dimensional embeddings for every memory, every knowledge node, every agent execution. MAGMA computes 6-signal fusion scores. CIPHER binary-quantizes them for streaming. The entire infrastructure produces rich vector representations — and then throws them away at the last mile, converting back to text for the API call.
A vector prompt interface would change everything. Context windows stop being token-limited — a 4096d embedding carries the semantic weight of thousands of tokens in a single vector. Retrieval becomes lossless — the geometric relationships between memories, the cluster distances, the traversal paths all arrive intact. Agent-to-agent communication via CIPHER embeddings becomes native, not serialized. The embedding is the context.
Images proved that LLMs can process non-text modalities at the input layer. Embeddings are the next one. The infrastructure to produce them already exists — UCIS is one of many systems generating high-quality vectors at scale. What’s missing is the API surface to use them. This is a feature request, not a research problem.