Feature Proposal

Stop Translating.
Start Transmitting.

LLMs already think in embeddings. Every other input — text, images, audio — gets converted to vectors internally. Why are we converting vectors to text just to have the model convert them back?


The Problem

The Lossy Round-Trip

Every retrieval-augmented system follows the same pipeline: embed the content, search in vector space, then serialize the matches back to text for the prompt. The embedding captures semantic meaning in high-dimensional space. Then we throw it away, flatten everything back to text, and the model re-derives what the vector already encoded.

Today: text → embedding (4096d) → back to text → LLM re-encodes to vectors (lossy, expensive)

Proposed: text → embedding (4096d) → embedding block → LLM processes directly (lossless, efficient)

The proposed path skips the lossy middle step entirely. The model receives the geometry it would have computed anyway — directly.
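The round trip above can be sketched in a few lines. This is a toy illustration, not any real SDK: `embed`, `retrieve`, and the in-memory store are hypothetical stand-ins for an embedding model, a vector search, and the API call.

```python
def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (which would return e.g. a 4096-d vector).
    return [float(ord(c) % 7) for c in text[:8]]

def retrieve(query_vec: list[float], store: list[dict]) -> dict:
    # Nearest-neighbour search happens in vector space...
    return max(store, key=lambda item: sum(a * b for a, b in zip(item["vec"], query_vec)))

store = [{"text": "user prefers dark mode", "vec": embed("user prefers dark mode")}]
hit = retrieve(embed("ui preferences"), store)

# ...but the API boundary forces us back to text: the vector is discarded here,
# and the model re-embeds the string on arrival.
prompt_context = hit["text"]
```

The lossy step is the last line: everything the vector encoded about the memory's position in semantic space is thrown away before the model ever sees it.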


Evidence

We Tried Every Workaround

We spent weeks building text-based compression workarounds — 16 test cases, 16+ prompt variants, automated Q&A scoring. Here's what we found:

FACT RETENTION vs COMPRESSION (16 test variants)

Approach     Retention
Baseline     85%
Variant 1    75%
Variant 2    71%
Variant 5    68%
Variant 12   72%
Variant 16   70%
Embedding    100%

Compression and retention are in fundamental tension

Rules protecting numeric data break conversational compression. Rules protecting ratios break summaries. The test categories have mutually exclusive optimal compressions. No text variant can win everywhere.

Text compression is non-deterministic

The same prompt scored 85% one evening and 65% the next morning. Temperature sampling, model state, and serialization variations mean identical content produces different compressions each run. We're optimizing against a moving target.

Embeddings don't have this problem

The same text always produces the same embedding. Geometric relationships are preserved exactly. Cluster distances, similarity scores, and traversal paths arrive intact. The information is already in its most compact, lossless form.
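The determinism claim is easy to demonstrate. The embedder below is a toy hash-based stand-in (a real model's forward pass is likewise deterministic for fixed weights); the point is the contrast with sampled text compression, where identical input yields different output run to run.

```python
import hashlib

def toy_embed(text: str, dims: int = 8) -> list[float]:
    # Deterministic stand-in for an embedding model: identical input
    # always yields a byte-for-byte identical vector.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dims]]

a = toy_embed("the quarterly ratio was 3:1")
b = toy_embed("the quarterly ratio was 3:1")
assert a == b  # same text, same vector, every run
```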


Precedent

This Already Exists — Sort Of

Three independent projects have built embedding input. None are production-ready.

Platform      Feature                    Production-Ready   Status
vLLM          prompt_embeds              No                 V0 only; security CVE; crashes on bad input
NVIDIA NIM    Prompt embeddings          Experimental       "Subject to change"; V0 backend only
Protopia AI   Stained Glass Transforms   Niche              Privacy use case; requires vLLM backend
Anthropic     -                          Not yet            This is what we're requesting

The concept is proven. The demand is real enough that three projects built it. But enterprise-scale deployments need the safety, validation, and architectural rigor that a production API provider brings.


Scale Context

Where This Comes From

We operate a multi-agent system with persistent memory that hits this wall every day.

33K+ memory nodes · 12 autonomous agents · 4096-dimensional embeddings · 99% auto-enriched

The entire graph is self-maintaining — new memories flow in, get embedded, get connected via similarity edges, get ranked by PageRank, get clustered by community detection. All autonomously, via streaming consumers and nightly cron jobs. No human touches the graph.
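The similarity-edge step of that pipeline looks roughly like this. A minimal sketch with assumed names (`add_memory`, a dict-based graph, a 0.8 threshold), standing in for our actual streaming consumers:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def add_memory(graph: dict, node_id: str, vec: list[float], threshold: float = 0.8) -> None:
    # Connect the new node to every existing node whose embedding is
    # similar enough; downstream jobs then rank and cluster the graph.
    graph.setdefault("edges", [])
    for other_id, other_vec in graph.get("nodes", {}).items():
        if cosine(vec, other_vec) >= threshold:
            graph["edges"].append((node_id, other_id))
    graph.setdefault("nodes", {})[node_id] = vec

g = {}
add_memory(g, "m1", [1.0, 0.0])
add_memory(g, "m2", [0.9, 0.1])  # similar to m1, so an edge is created
add_memory(g, "m3", [0.0, 1.0])  # dissimilar, so no edge
```

Every operation here (edge creation, ranking, clustering) consumes vectors, never text.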

Then, at retrieval time, we serialize 10-30 memories back to text (~4,000 tokens), send them through the Messages API, and Claude re-encodes them internally. The entire autonomous pipeline produces geometric representations — and the API boundary forces us to flatten them.
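The flattening step described above, as a sketch. The per-memory strings and the 4-characters-per-token heuristic are illustrative assumptions, not measurements; at our real memory sizes this lands around the ~4,000 tokens mentioned.

```python
# Retrieval returns 10-30 memories; here, 20 placeholder strings.
memories = [f"memory {i}: some retrieved fact about the user" for i in range(20)]

serialized = "\n".join(memories)      # the geometry is discarded here
approx_tokens = len(serialized) // 4  # crude 4-chars-per-token heuristic

# All of these tokens are then re-embedded by the model on arrival.
```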

Every system building persistent agent memory will hit this same wall. The only question is when.

The Proposal

Same Pattern as Image Blocks

Images proved that LLMs can process non-text modalities at the input-encoding layer. Embeddings are semantically closer to the model's internal representations than pixels are.

// What exists today (experimental, V0 only, security concerns):
{"prompt_embeds": "<base64-encoded tensor>"}

// What we're requesting (production-grade, Messages API):
{
  "type": "embedding",
  "data": [0.0234, -0.0891, ...],
  "dimensions": 4096
}
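Building and sanity-checking the proposed block shape in Python. To be clear, an "embedding" content type does not exist in the Messages API today; this constructs the shape we are requesting, not a working call.

```python
def embedding_block(vector: list[float]) -> dict:
    # Hypothetical content block in the requested shape.
    block = {"type": "embedding", "data": vector, "dimensions": len(vector)}
    # The one validation the server would need: dims must match the payload.
    assert block["dimensions"] == len(block["data"])
    return block

block = embedding_block([0.0234, -0.0891, 0.0112])
```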

We train and fine-tune models almost daily. The one thing every training run reinforces is that embeddings are the language frontier models actually operate in. No model "reads" text — it converts text to vectors at the input layer and operates entirely in embedding space from that point forward. Every attention head, every layer, every gradient update happens on vectors.

Text is a translation layer for humans. Embeddings are the native representation.


Support This Feature

If you've ever thought "why am I converting embeddings to text just to feed them to a model that's going to convert them back" — you're not alone. Drop a thumbs-up on the issue. It takes 10 seconds.

Support on GitHub