Embedding that fits the budget — pick a model that matches your corpus — step 7 of 9
Someone wrote a "fast similarity" that's just the dot product — they skipped the normalization step. It compiles, it runs, and it RETRIEVES THE WRONG DOCUMENT.
Why: the dot product equals |q| * |d| * cos(angle), so without normalization it rewards big vectors. A doc that points in the wrong direction but has a bigger magnitude can outscore a doc that points in the right direction with a smaller magnitude.
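To see the magnitude bias in isolation, here is a minimal sketch. The vectors `query`, `aligned`, and `off_axis` are hypothetical values chosen for illustration; they are not the ones used in the editor.

```python
def dot(a, b):
    """Raw dot product: grows with vector length as well as alignment."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors, chosen only to illustrate the bias.
query = [1.0, 1.0, 1.0]
aligned = [2.0, 2.0, 2.0]     # same direction as the query
off_axis = [10.0, 0.0, 0.0]   # different direction, larger magnitude

print(dot(query, aligned))    # 6.0
print(dot(query, off_axis))   # 10.0, so the off-axis vector "wins"

# Doubling a vector's length doubles its score without changing its direction.
print(dot(query, [20.0, 0.0, 0.0]))  # 20.0
```

The score tracks length as much as direction, which is why the ranking flips.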
In the editor, doc_match points in exactly the same direction as
the query (cosine = 1.0). doc_long points in a different direction
(cosine of about 0.5774) but has a bigger magnitude. The broken
function thinks doc_long is closer. Fix the function to divide the
dot product by the product of the two norms.
Expected output:
doc_match: 1.0000
doc_long: 0.5774
winner: doc_match
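One possible shape for the fix is sketched below. The function name `cosine_similarity` and the vector values are assumptions made for illustration, since the editor's actual code is not shown here; the vectors were picked so the scores match the expected output above.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; returning 0.0 is a choice made here.
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

# Hypothetical vectors (assumed, not taken from the editor).
query = [1.0, 1.0, 1.0]
docs = {
    "doc_match": [2.0, 2.0, 2.0],   # same direction as the query
    "doc_long": [10.0, 0.0, 0.0],   # different direction, bigger magnitude
}

scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")

winner = max(scores, key=scores.get)
print(f"winner: {winner}")
```

With these assumed vectors the raw dot products are 6 for doc_match and 10 for doc_long, so the unnormalized version picks doc_long; after dividing by the norms, doc_match wins.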