Embedding that fits the budget — pick a model that matches your corpus — step 3 of 9
Cosine similarity in 60 seconds
You don't need linear algebra to use embeddings. You need one function. It's called cosine similarity and it answers one question:
Do these two vectors point in roughly the same direction?
That's it. Two pieces of text that mean similar things produce vectors that point in similar directions. Two unrelated pieces of text produce vectors that point in different directions.
The formula is three lines of stdlib Python — run the editor.
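As a preview, here's one way those three lines can look, assuming vectors are plain Python lists of floats (the editor's version may differ in naming):

```python
import math

def cosine_similarity(a, b):
    # dot product of a and b, divided by the product of their magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0, same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0, perpendicular
```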
What the numbers mean
| Score | Interpretation |
|---|---|
| 1.0 | Identical direction (vectors of the same text) |
| 0.85–0.99 | Strong semantic match — usually a hit |
| 0.5–0.85 | Related topic, possibly the right chunk |
| 0.2–0.5 | Tangentially related — probably noise |
| Below 0.2 | Unrelated — definitely noise |
These ranges shift with the model. text-embedding-3-small tends to give higher absolute scores than voyage-3. Always look at the relative order, not the absolute number: the top-K ordering matters, and the score is a sanity check.
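To make "relative order, not absolute number" concrete, here is a small sketch of taking the top-K by score. The document names and score values are made up for illustration:

```python
# hypothetical similarity scores for one query against three chunks;
# only their ordering matters, not the raw values
scores = {"doc_a": 0.91, "doc_b": 0.62, "doc_c": 0.18}

# sort the keys by score, highest first, and keep the top 2
top_k = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_k)  # ['doc_a', 'doc_b']
```

A model that scored these same chunks 0.71, 0.44, and 0.09 would return the same top-K, which is why the ordering is what you trust.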
Why "cosine" and not regular distance?
Two reasons. First, embeddings have hundreds or thousands of dimensions, and in high-dimensional space, direction carries meaning while magnitude mostly carries length (more tokens → bigger vector). You want direction.
Second, cosine is bounded — it's always between -1 and 1, so you can compare scores across queries. Euclidean distance is unbounded and depends on vector length.
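You can see both points in a few lines. A vector scaled to twice its length has the same direction, so cosine similarity stays at 1.0 while Euclidean distance grows (a sketch using plain lists, not any particular embedding model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [1.0, 2.0, 3.0]
longer = [2.0, 4.0, 6.0]  # same direction, twice the magnitude

print(cosine(v, longer))     # 1.0: direction is unchanged
print(euclidean(v, longer))  # ~3.74: distance grows with magnitude
```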
The pre-flight you have to remember
Cosine similarity assumes vectors point out from the origin. If one vector is at the origin (all zeros), the formula divides by zero. In practice this means:
- Don't embed empty strings.
- Don't embed strings of only whitespace.
- Trim and validate your input before calling the API.
A surprising number of RAG bugs are "we embedded an empty string once and now the index has a zero vector that matches every query equally poorly."
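The fix is a one-line guard before the embedding call. A minimal sketch, where `safe_to_embed` is a hypothetical helper (not part of any embedding API):

```python
def safe_to_embed(text):
    # reject None, empty strings, and whitespace-only strings
    return bool(text and text.strip())

chunks = ["How do I reset my password?", "   ", ""]
# trim and drop anything that would produce a zero vector
valid = [c.strip() for c in chunks if safe_to_embed(c)]
print(valid)  # ['How do I reset my password?']
```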
What you'll do next
Step 4: given three vectors and a query, predict which one is closest. Step 5: fill in the cosine formula yourself. Steps 6 and 7: fix two bugs every junior engineer ships. Step 8: write the ranker. Step 9: build a tiny FAQ search engine end-to-end.