LLMs already think in embeddings. Every other input — text, images, audio — gets converted to vectors internally. Why are we converting vectors to text just to have the model convert them back?
Every retrieval-augmented system follows the same pipeline: embed the documents, search the vector store by similarity, then serialize the matches back to plain text for the prompt. The embedding captures semantic meaning in high-dimensional space. Then we throw it away, convert back to flat text, and the model re-derives what the vector already knew.
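That round-trip can be sketched in a few lines. The `embed()` function here is a toy, hash-seeded stand-in for a real embedding model, and the corpus and dimensions are illustrative, not from any real system:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for an embedding model: deterministic, unit-norm vectors."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Step 1: index documents as vectors.
docs = ["Paris is the capital of France.",
        "The Eiffel Tower is 330 m tall.",
        "Tokyo is the capital of Japan."]
index = np.stack([embed(d) for d in docs])

# Step 2: retrieve by cosine similarity -- pure geometry, dot products of unit vectors.
query_vec = embed("Paris is the capital of France.")
scores = index @ query_vec
top = np.argsort(scores)[::-1][:2]

# Step 3: the lossy part -- discard the vectors and rebuild a text prompt,
# which the model will immediately re-encode into vectors of its own.
prompt = "Context:\n" + "\n".join(docs[i] for i in top) + "\n\nQuestion: ..."
```

Steps 1 and 2 never leave embedding space; only step 3, imposed by the text-only API boundary, does.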
The proposed path skips the lossy middle step entirely. The model receives the geometry it would have computed anyway — directly.
We spent weeks building text-based compression workarounds — 16 test cases, 16+ prompt variants, automated Q&A scoring. Here's what we found:
**Fact retention vs. compression (16 test variants)**
Rules protecting numeric data break conversational compression. Rules protecting ratios break summaries. The test categories have mutually exclusive optimal compressions. No text variant can win everywhere.
The same prompt scored 85% one evening and 65% the next morning. Temperature sampling, model state, and serialization variations mean identical content produces different compressions each run. We're optimizing against a moving target.
The same text always produces the same embedding. Geometric relationships are preserved exactly. Cluster distances, similarity scores, and traversal paths arrive intact. The information is already in its most compact, lossless form.
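Both properties can be demonstrated in a few lines, again using a toy hash-seeded `embed()` as a stand-in for a real embedding model (names and dimensions are illustrative):

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for an embedding model: deterministic, unit-norm vectors."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

a1 = embed("alpha")
a2 = embed("alpha")
b = embed("beta")

# Determinism: identical text, identical vector -- bit for bit, on every run.
assert np.array_equal(a1, a2)

# Geometry travels with the vectors: cosine similarity is a dot product of
# unit vectors, so cluster distances and similarity scores arrive intact.
cos_ab = float(a1 @ b)
```

Contrast this with text compression, where the score of the same prompt drifted 20 points between runs.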
Three independent projects have built embedding input. None are production-ready.
| Platform | Feature | Production Ready | Status |
|---|---|---|---|
| vLLM | `prompt_embeds` | No | V0 only, security CVE, crashes on bad input |
| NVIDIA NIM | Prompt embeddings | Experimental | "Subject to change," V0 backend only |
| Protopia AI | Stained Glass Transforms | Niche | Privacy use case, requires vLLM backend |
| Anthropic | — | Not yet | This is what we're requesting |
The concept is proven. The demand is real enough that three projects built it. But enterprise-scale deployments need the safety, validation, and architectural rigor that a production API provider brings.
We operate a multi-agent system with persistent memory that hits this wall every day.
The entire graph is self-maintaining — new memories flow in, get embedded, get connected via similarity edges, get ranked by PageRank, get clustered by community detection. All autonomously, via streaming consumers and nightly cron jobs. No human touches the graph.
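One loop of that maintenance cycle can be sketched as follows, assuming cosine-threshold similarity edges and power-iteration PageRank. The vectors, threshold, damping factor, and graph size here are all illustrative, not the production values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory store: 6 unit vectors standing in for memory embeddings.
E = rng.standard_normal((6, 16))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# Similarity edges: connect memories whose cosine similarity clears a threshold.
sim = E @ E.T
np.fill_diagonal(sim, 0.0)
adj = (sim > 0.1).astype(float)

# PageRank by power iteration over the row-normalized adjacency matrix.
d = 0.85                          # damping factor
n = len(adj)
out = adj.sum(axis=1, keepdims=True)
out[out == 0] = 1.0               # dangling nodes: avoid divide-by-zero
P = adj / out
rank = np.full(n, 1.0 / n)
for _ in range(50):
    rank = (1 - d) / n + d * (P.T @ rank)

# Highest-ranked memories are the ones retrieval surfaces first.
top_memories = np.argsort(rank)[::-1]
```

Every quantity in this loop is geometric; nothing here ever needs text form until the API boundary forces it.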
Then, at retrieval time, we serialize 10-30 memories back to text (~4,000 tokens), send them through the Messages API, and Claude re-encodes them internally. The entire autonomous pipeline produces geometric representations — and the API boundary forces us to flatten them.
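The flattening step looks roughly like this. The record shape, the 1024-dim embeddings, and the 4-characters-per-token estimate are assumptions for illustration, not the real schema:

```python
# Hypothetical memory records as they exist in the store: each one already
# carries the embedding the retrieval pipeline used to find it.
memories = [
    {"id": i, "text": f"memory {i}: some retrieved fact", "embedding": [0.0] * 1024}
    for i in range(30)
]

# The API boundary only accepts text, so we flatten: drop every vector and
# serialize the records into a prompt block.
context = "\n".join(f"[{m['id']}] {m['text']}" for m in memories)

# Rough token estimate (~4 characters per token is a common rule of thumb).
# The vectors that encoded the same content geometrically are simply discarded.
approx_tokens = len(context) // 4
```

Accepting `context` as embeddings instead of text would remove the discard step entirely.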
Every system building persistent agent memory will hit this same wall. The only question is when.
Images proved that LLMs can process non-text modalities at the input-encoding layer. Embeddings are semantically closer to the model's internal representations than pixels are.
We train and fine-tune models almost daily. The one thing every training run reinforces is that embeddings are the language frontier models actually operate in. No model "reads" text — it converts text to vectors at the input layer and operates entirely in embedding space from that point forward. Every attention head, every layer, every gradient update happens on vectors.
Text is a translation layer for humans. Embeddings are the native representation.
If you've ever thought "why am I converting embeddings to text just to feed them to a model that's going to convert them back" — you're not alone. Drop a thumbs-up on the issue. It takes 10 seconds.
Support on GitHub