How Embeddings Work
This page is the "why" companion to Embeddings — Getting Started. It explains what a BertRuntime.forward(…) actually returns, why the public API performs mean pooling and L2 normalization on your behalf, and why two embeddings of the same dimension can still be incompatible.
From Tokens to a Single Vector
A BERT-style encoder produces one vector per token. For an input of length N tokens with hidden size H, the encoder output is an N × H matrix.
For most downstream uses (semantic search, similarity, RAG retrieval) you want a single vector per input — independent of length. Two ways to collapse the N × H matrix into a 1D vector dominate in practice:
- [CLS] pooling: take the embedding of the first token (the special [CLS] token BERT prepends). Original BERT was trained to use this for classification.
- Mean pooling: average across the token dimension. Sentence-transformers models (all-MiniLM-L6-v2, all-mpnet-base-v2, etc.) are fine-tuned for this; they outperform [CLS] pooling on similarity tasks, often by a wide margin.
The BertRuntime implements mean pooling because that's what most modern sentence-encoder GGUFs in the wild expect. The EmbeddingModel.embed(…) methods return the pooled vector directly.
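The two pooling strategies can be sketched in a few lines of plain Java. This is an illustration only, not the BertRuntime source: the class name PoolingSketch and both method names are hypothetical, and the sketch ignores padding and attention masks, which a real batched implementation would have to respect.

```java
/** Hypothetical helper for illustration; not part of the library API. */
final class PoolingSketch {
    private PoolingSketch() {}

    /** Mean pooling: average the N token vectors of an N x H matrix into one H-vector. */
    static float[] meanPool(float[][] tokens) {
        if (tokens.length == 0) {
            throw new IllegalArgumentException("need at least one token");
        }
        int hidden = tokens[0].length;
        float[] pooled = new float[hidden];
        // Sum each hidden dimension across all tokens.
        for (float[] token : tokens) {
            for (int h = 0; h < hidden; h++) {
                pooled[h] += token[h];
            }
        }
        // Divide by the token count to turn the sums into averages.
        for (int h = 0; h < hidden; h++) {
            pooled[h] /= tokens.length;
        }
        return pooled;
    }

    /** [CLS] pooling: return a copy of the first token's vector. */
    static float[] clsPool(float[][] tokens) {
        if (tokens.length == 0) {
            throw new IllegalArgumentException("need at least one token");
        }
        return tokens[0].clone();
    }
}
```

Either way the result has length H regardless of how many tokens the input had, which is what makes inputs of different lengths comparable.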
Why L2 Normalize
After mean pooling you get a vector whose magnitude depends on the input — longer inputs, different vocab distributions, layer-norm idiosyncrasies all push the norm around. For similarity comparisons you almost always want the direction of the vector, not its length:
- cosine similarity is the inner product divided by the product of the norms;
- if both vectors are L2-normalized to length 1, that denominator is 1, and cosine similarity reduces to a plain dot product;
- many vector stores assume normalized inputs and compute cosine as a dot product internally, so passing un-normalized vectors silently gives the wrong rankings.
BertRuntime therefore L2-normalizes the pooled vector before returning it. The Java/Kotlin embed(…) calls always return unit-length vectors.
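A minimal sketch of why normalization lets a dot product stand in for cosine similarity. The class NormalizeSketch and its methods are hypothetical names used for illustration; they are not the library's implementation.

```java
/** Hypothetical helper for illustration; not part of the library API. */
final class NormalizeSketch {
    private NormalizeSketch() {}

    /** Returns a copy of v scaled to unit L2 norm (a zero vector is returned unchanged). */
    static float[] l2Normalize(float[] v) {
        double sumSquares = 0.0;
        for (float x : v) {
            sumSquares += (double) x * x;
        }
        double norm = Math.sqrt(sumSquares);
        if (norm == 0.0) {
            return v.clone(); // a zero vector has no direction to preserve
        }
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    /** Plain inner product, accumulated in double precision. */
    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    /** Full cosine similarity: inner product divided by the product of the norms. */
    static double cosine(float[] a, float[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }
}
```

For any two non-zero vectors a and b, dot(l2Normalize(a), l2Normalize(b)) agrees with cosine(a, b) up to floating-point rounding, so a store that only computes dot products still ranks by cosine as long as its inputs are unit length.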
Where Does the Dimension Come From?
The dimension of the output (EmbeddingModel.dimensions) equals the encoder’s hidden size H:
| Model | H | Notes |
|---|---|---|
| | 384 | 6 layers, fast; popular default |
| | 768 | 12 layers; higher quality |
| | 768 | original BERT; uses |
| | 1024 | 24 layers; SOTA for English retrieval |
The dimension is fixed at training time. Matching dimensions do not imply compatibility: each model learns its own vector space, so vectors from different models are almost never comparable, even when their dimensions match by accident.
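One way to guard against mixing vectors from different encoders is to carry the model identity alongside each vector and refuse to compare across models. The sketch below is an assumption about how an application might do this; TaggedEmbedding and its modelId field are hypothetical and not part of the library API.

```java
/** Hypothetical wrapper for illustration; not part of the library API. */
final class TaggedEmbedding {
    final String modelId;  // assumed identifier chosen by the application, e.g. the model file name
    final float[] vector;  // assumed to be L2-normalized already

    TaggedEmbedding(String modelId, float[] vector) {
        this.modelId = modelId;
        this.vector = vector.clone();
    }

    /** Dot product of two unit-length vectors, allowed only when both come from the same model. */
    double similarity(TaggedEmbedding other) {
        if (!modelId.equals(other.modelId)) {
            // Equal dimensions are not enough: the vector spaces differ between models.
            throw new IllegalArgumentException(
                "cannot compare embeddings from " + modelId + " and " + other.modelId);
        }
        if (vector.length != other.vector.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < vector.length; i++) {
            sum += (double) vector[i] * other.vector[i];
        }
        return sum;
    }
}
```

A dimension check alone would let two unrelated 768-dimensional models through and produce a meaningless score; comparing the model identity first turns that silent error into an exception.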
Numerical Caveats
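The encoder runs in FP32 internally; output components are typically in [-0.5, 0.5] after pooling (before normalization). After L2 normalization every component lies in [-1, 1] and the squared components sum to 1, so for a vector with hundreds of dimensions individual components are usually much smaller in magnitude than 1.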
Cosine similarity for L2-normalized vectors is in [-1, 1]:
| Cosine | Interpretation |
|---|---|
| | very similar (often near-duplicate) |
| | semantically related |
| | weakly related |
| | unrelated |
| | actively dissimilar (rare for English sentences with the same encoder) |
For practical retrieval, thresholds vary by encoder: a cutoff tuned on all-MiniLM-L6-v2 generally does not transfer to bge-large-en, because the two models produce differently distributed scores. Always calibrate against a labelled set rather than picking thresholds in the abstract.
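Calibration against a labelled set can be as simple as picking the highest cutoff that still reaches a target recall. The sketch below assumes you already have cosine scores for query/document pairs together with relevance labels; the class ThresholdCalibration and its method are hypothetical names, and it looks at recall only, so precision at the chosen cutoff should be checked separately.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the library API. */
final class ThresholdCalibration {
    private ThresholdCalibration() {}

    /**
     * Returns the highest cutoff t such that the fraction of relevant pairs
     * with score >= t is at least targetRecall (0 < targetRecall <= 1).
     */
    static double thresholdForRecall(double[] scores, boolean[] relevant, double targetRecall) {
        if (scores.length != relevant.length) {
            throw new IllegalArgumentException("scores and labels must have the same length");
        }
        if (targetRecall <= 0.0 || targetRecall > 1.0) {
            throw new IllegalArgumentException("targetRecall must be in (0, 1]");
        }
        int positives = 0;
        for (boolean r : relevant) {
            if (r) positives++;
        }
        if (positives == 0) {
            throw new IllegalArgumentException("need at least one relevant pair");
        }
        // Collect the scores of the relevant pairs and sort them ascending.
        double[] positiveScores = new double[positives];
        int k = 0;
        for (int i = 0; i < scores.length; i++) {
            if (relevant[i]) positiveScores[k++] = scores[i];
        }
        Arrays.sort(positiveScores);
        // Number of relevant pairs that must score at or above the cutoff.
        int needed = (int) Math.ceil(targetRecall * positives);
        needed = Math.max(1, Math.min(positives, needed));
        // The needed-th highest relevant score is the highest cutoff that keeps that many.
        return positiveScores[positives - needed];
    }
}
```

Running this once per encoder on the same labelled pairs gives cutoffs that are comparable in effect, even though the raw numbers may differ widely between models.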
Tokenization Matters
The same text can produce different vectors with different tokenizers — this is not a bug, it’s the model’s input representation changing. Always pair an encoder with the tokenizer it was trained with. KBertJava.loadGGUF(…) reads the tokenizer config out of the GGUF metadata so you can’t accidentally mix them up; if you wire BertRuntime and HuggingFaceTokenizer manually in Kotlin, make sure both come from the same model.
Where the Code Lives
| Concern | Source |
|---|---|
| Encoder forward pass + mean pooling + L2 norm | |
| Provider-neutral SPI | |
| BERT → SPI adapter | |
| Java surface | |
| WordPiece / SentencePiece tokenizer | |