Attention and transformer blocks

Attention is not magic: it is a weighted lookup. For each token, the model asks "which tokens (itself included) matter to me right now?", turns those relevance scores into weights that sum to 1, and returns the weighted average of those tokens' values. A higher score means more of that token gets mixed in.

For example, scores [1, 3] become weights [0.25, 0.75] (1/4 and 3/4), and the output leans toward the more relevant token.
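The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration that uses the lesson's proportional normalization (score / total), not softmax. The function names and the two value vectors are hypothetical, chosen only to make the mixing visible.

```python
def normalize(scores):
    """Proportional normalization: each score divided by the total."""
    total = sum(scores)
    if total <= 0 or any(s < 0 for s in scores):
        raise ValueError("needs non-negative scores with a positive total")
    return [s / total for s in scores]


def weighted_lookup(scores, values):
    """Return (weights, output), where output is the weighted average of values."""
    weights = normalize(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


scores = [1, 3]
values = [[10.0, 0.0], [0.0, 10.0]]  # hypothetical value vectors, one per token

weights, output = weighted_lookup(scores, values)
print(weights)  # [0.25, 0.75]
print(output)   # [2.5, 7.5]
```

The output sits three quarters of the way toward the second token's value, which is what "leans toward the more relevant token" means in numbers.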

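Note: real attention turns scores into weights with softmax, which exponentiates each score before normalizing, so the weights are always positive and sum to 1 even when some scores are negative. This lesson uses plain proportional normalization (score / total) as a browser-safe stand-in. It has the same shape (scores in, weights summing to 1 out), but it only works for non-negative scores, and the numbers differ: softmax would turn [1, 3] into roughly [0.12, 0.88] rather than [0.25, 0.75].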
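To see how the stand-in differs from the real thing, here is a small sketch that computes both on the same scores. The helper names are illustrative. The softmax subtracts the maximum score before exponentiating, a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Exponentiate (shifted by the max for stability), then normalize."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def proportional(scores):
    """The lesson's stand-in: score / total. Only valid for non-negative scores."""
    total = sum(scores)
    return [s / total for s in scores]


scores = [1, 3]
soft = softmax(scores)
prop = proportional(scores)

print([round(w, 3) for w in soft])  # [0.119, 0.881]
print(prop)                         # [0.25, 0.75]

# Softmax depends only on the differences between scores, so it also
# handles negative scores; proportional([-1, 1]) would divide by zero.
shifted = softmax([-1, 1])
print(shifted == soft)              # True
```

Both produce weights that sum to 1, but softmax pushes more of the weight onto the highest score.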

A transformer block

A transformer stacks blocks. Leaving aside the residual connections and normalization that real models wrap around them, each block has two parts:

Attention — mix information across tokens (the weighted lookup above).
A small feed-forward network — process each token's mixed result.
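The two parts listed above can be sketched as a toy block in plain Python. Several simplifying assumptions apply: relevance scores are plain dot products between token vectors (real models first project tokens into learned queries, keys, and values), residual connections and normalization are left out, and the weights `w1` and `w2` are made-up numbers rather than trained ones.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(tokens):
    """Part 1: each token becomes a relevance-weighted average of all tokens."""
    mixed = []
    for query in tokens:
        weights = softmax([dot(query, other) for other in tokens])
        mixed.append([sum(w * t[i] for w, t in zip(weights, tokens))
                      for i in range(len(query))])
    return mixed


def feed_forward(vec, w1, w2):
    """Part 2: a tiny two-layer network applied to one token at a time."""
    hidden = [max(0.0, dot(row, vec)) for row in w1]  # linear layer + ReLU
    return [dot(row, hidden) for row in w2]           # second linear layer


def block(tokens, w1, w2):
    mixed = attention(tokens)                          # mix across tokens
    return [feed_forward(v, w1, w2) for v in mixed]    # process each token


# Hypothetical toy inputs: 3 tokens of dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # maps 2 -> 3
w2 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]     # maps 3 -> 2

out = block(tokens, w1, w2)
print(out)
```

Because the block returns the same number of tokens with the same dimension it received, its output can be fed straight into another block. A real model would give every block its own trained weights.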

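Stack N of these blocks and you have a transformer. A key reason transformers tend to outperform CNNs on language: attention lets any token look at any other token in a single layer, however far apart they are, while a CNN filter sees only a local window, so distant tokens can influence each other only after many stacked layers.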
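A rough back-of-the-envelope sketch of that difference follows. It assumes a plain 1D CNN with an odd kernel size, stride 1, and no dilation or pooling, so each layer extends a position's view by (kernel_size - 1) / 2 tokens on each side. The function names are illustrative, and the attention figure assumes both tokens fit in the model's context window.

```python
import math


def cnn_layers_needed(distance, kernel_size):
    """Stacked conv layers needed before a token `distance` positions away
    can influence this one (odd kernel, stride 1, no dilation)."""
    reach_per_layer = (kernel_size - 1) // 2
    if reach_per_layer < 1:
        raise ValueError("kernel_size must be at least 3")
    return math.ceil(distance / reach_per_layer)


def attention_layers_needed(distance):
    """One attention layer links any two tokens in the context window."""
    return 1


for d in (1, 10, 100):
    print(d, cnn_layers_needed(d, kernel_size=3), attention_layers_needed(d))
# 1 1 1
# 10 10 1
# 100 100 1
```

Practical CNNs shorten this path with dilation, striding, or pooling, so the linear growth shown here is the simplest case rather than a hard limit.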

Why a builder cares

When you read "12-layer transformer, 8 attention heads," now you know the shape: 12 blocks, each running 8 weighted lookups in parallel (the heads, each with its own learned notion of relevance) and then per-token processing. The practical intuition, attention = relevance-weighted average, is what lets you reason about why a model attended to the wrong part of a document instead of treating it as a black box.