LLMs already think in embeddings. Every other input — text, images, audio — gets converted to vectors internally. Why are we converting vectors to text just to have the model convert them back?
Every retrieval-augmented system follows the same pipeline: embed the documents, search the vector store by similarity, then serialize the matches back to plain text for the prompt. The embedding captures semantic meaning in high-dimensional space. Then we throw it away, convert back to flat text, and the model re-derives what the vector already knew.
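That round-trip can be sketched in a few lines. The `embed()` function here is a toy, hash-seeded stand-in for a real embedding model, and the corpus and dimensions are illustrative, not from any real system:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for an embedding model: deterministic, unit-norm vectors."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Step 1: index documents as vectors.
docs = ["Paris is the capital of France.",
        "The Eiffel Tower is 330 m tall.",
        "Tokyo is the capital of Japan."]
index = np.stack([embed(d) for d in docs])

# Step 2: retrieve by cosine similarity -- pure geometry, dot products of unit vectors.
query_vec = embed("Paris is the capital of France.")
scores = index @ query_vec
top = np.argsort(scores)[::-1][:2]

# Step 3: the lossy part -- discard the vectors and rebuild a text prompt,
# which the model will immediately re-encode into vectors of its own.
prompt = "Context:\n" + "\n".join(docs[i] for i in top) + "\n\nQuestion: ..."
```

Steps 1 and 2 never leave embedding space; only step 3, imposed by the text-only API boundary, does.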
The proposed path skips the lossy middle step entirely. The model receives the geometry it would have computed anyway — directly.
We spent weeks building text-based compression workarounds — 16 test cases, 16+ prompt variants, automated Q&A scoring. Here's what we found:
**Fact retention vs. compression (16 test variants)**
Rules protecting numeric data break conversational compression. Rules protecting ratios break summaries. The test categories have mutually exclusive optimal compressions. No text variant can win everywhere.
The same prompt scored 85% one evening and 65% the next morning. Temperature sampling, model state, and serialization variations mean identical content produces different compressions each run. We're optimizing against a moving target.
The same text always produces the same embedding. Geometric relationships are preserved exactly. Cluster distances, similarity scores, and traversal paths arrive intact. The information is already in its most compact, lossless form.
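Both properties can be demonstrated in a few lines, again using a toy hash-seeded `embed()` as a stand-in for a real embedding model (names and dimensions are illustrative):

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for an embedding model: deterministic, unit-norm vectors."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

a1 = embed("alpha")
a2 = embed("alpha")
b = embed("beta")

# Determinism: identical text, identical vector -- bit for bit, on every run.
assert np.array_equal(a1, a2)

# Geometry travels with the vectors: cosine similarity is a dot product of
# unit vectors, so cluster distances and similarity scores arrive intact.
cos_ab = float(a1 @ b)
```

Contrast this with text compression, where the score of the same prompt drifted 20 points between runs.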
Three independent projects have built embedding input. None are production-ready.
| Platform | Feature | Production Ready | Status |
|---|---|---|---|
| vLLM | `prompt_embeds` | No | V0 only, security CVE, crashes on bad input |
| NVIDIA NIM | Prompt embeddings | Experimental | "Subject to change," V0 backend only |
| Protopia AI | Stained Glass Transforms | Niche | Privacy use case, requires vLLM backend |
| Anthropic | — | Not yet | This is what we're requesting |
The concept is proven. The demand is real enough that three projects built it. But enterprise-scale deployments need the safety, validation, and architectural rigor that a production API provider brings.
We operate a multi-agent system with persistent memory that hits this wall every day.
The entire graph is self-maintaining — new memories flow in, get embedded, get connected via similarity edges, get ranked by PageRank, get clustered by community detection. All autonomously, via streaming consumers and nightly cron jobs. No human touches the graph.
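One loop of that maintenance cycle can be sketched as follows, assuming cosine-threshold similarity edges and power-iteration PageRank. The vectors, threshold, damping factor, and graph size here are all illustrative, not the production values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory store: 6 unit vectors standing in for memory embeddings.
E = rng.standard_normal((6, 16))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# Similarity edges: connect memories whose cosine similarity clears a threshold.
sim = E @ E.T
np.fill_diagonal(sim, 0.0)
adj = (sim > 0.1).astype(float)

# PageRank by power iteration over the row-normalized adjacency matrix.
d = 0.85                          # damping factor
n = len(adj)
out = adj.sum(axis=1, keepdims=True)
out[out == 0] = 1.0               # dangling nodes: avoid divide-by-zero
P = adj / out
rank = np.full(n, 1.0 / n)
for _ in range(50):
    rank = (1 - d) / n + d * (P.T @ rank)

# Highest-ranked memories are the ones retrieval surfaces first.
top_memories = np.argsort(rank)[::-1]
```

Every quantity in this loop is geometric; nothing here ever needs text form until the API boundary forces it.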
Then, at retrieval time, we serialize 10-30 memories back to text (~4,000 tokens), send them through the Messages API, and Claude re-encodes them internally. The entire autonomous pipeline produces geometric representations — and the API boundary forces us to flatten them.
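The flattening step looks roughly like this. The record shape, the 1024-dim embeddings, and the 4-characters-per-token estimate are assumptions for illustration, not the real schema:

```python
# Hypothetical memory records as they exist in the store: each one already
# carries the embedding the retrieval pipeline used to find it.
memories = [
    {"id": i, "text": f"memory {i}: some retrieved fact", "embedding": [0.0] * 1024}
    for i in range(30)
]

# The API boundary only accepts text, so we flatten: drop every vector and
# serialize the records into a prompt block.
context = "\n".join(f"[{m['id']}] {m['text']}" for m in memories)

# Rough token estimate (~4 characters per token is a common rule of thumb).
# The vectors that encoded the same content geometrically are simply discarded.
approx_tokens = len(context) // 4
```

Accepting `context` as embeddings instead of text would remove the discard step entirely.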
Every system building persistent agent memory will hit this same wall. The only question is when.
Images proved that LLMs can process non-text modalities at the input-encoding layer. Embeddings are semantically closer to the model's internal representations than pixels are.
We train and fine-tune models almost daily. The one thing every training run reinforces is that embeddings are the language frontier models actually operate in. No model "reads" text — it converts text to vectors at the input layer and operates entirely in embedding space from that point forward. Every attention head, every layer, every gradient update happens on vectors.
Text is a translation layer for humans. Embeddings are the native representation.
If you've ever thought "why am I converting embeddings to text just to feed them to a model that's going to convert them back" — you're not alone. Drop a thumbs-up on the issue. It takes 10 seconds.
Support on GitHub