Prompting for real output — composition, control, and the eight knobs that matter — step 4 of 8
Reference images and ControlNet — when prompts aren't enough
The eight knobs get you to a specific-looking image. But sometimes "specific-looking" still isn't enough. You need the exact pose. You need this building's silhouette. You need a character whose face matches this photo.
For these cases, you stop describing the image in text and start showing the model a reference.
Three flavors of image conditioning
The 2026 toolbox has three distinct ways to condition generation on an existing image. Each does a different thing.
1. Image-to-image (the strength dial)
You give the model an image, a prompt, and a strength parameter between 0.0 and 1.0. At strength 0.1 the output is nearly your input image with light prompt changes. At strength 0.9 the input is just a vague composition starter and the prompt dominates.
- Flux — supports image-to-image at all tiers; pass an `image` field with the source URL or base64, and a `strength` between 0 and 1.
- Fal.ai — many models on Fal expose image-to-image as `image_url` + `strength` in the input schema.
- nano-banana — supports image input as part of the multi-turn conversation. Edit-by-instruction is the same surface; you pass the image then describe what to change.
Use it for: rough sketch → polished render, color reference → matching scene, hand-drawn composition → photorealistic version.
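In code, an i2i call boils down to the source image, the prompt, and the strength in one request. Below is a minimal sketch against a Fal-style hosted endpoint; the model path, field names, and response shape are assumptions for illustration, so check the provider's actual input schema before copying anything.

```python
# Minimal image-to-image sketch against a Fal-style HTTP endpoint.
# The endpoint path, field names, and response shape are assumptions --
# verify against the real model's input schema.
import requests

ENDPOINT = "https://fal.run/fal-ai/flux/dev/image-to-image"  # hypothetical model path

payload = {
    "image_url": "https://example.com/rough-sketch.png",  # source image
    "prompt": "polished photorealistic product render, studio lighting",
    "strength": 0.65,  # low = stay close to the input, high = prompt dominates
}

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": "Key YOUR_FAL_KEY"},  # placeholder credential
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # usually includes a URL for the generated image
```

The parameter worth tuning is `strength`: 0.65 keeps the sketch's composition while letting the prompt restyle it; pull it toward 0.3 if the output drifts too far from the source.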
2. Edit-by-instruction (Flux Kontext, nano-banana)
Newer than i2i. Instead of "give me this image but adjusted by strength X," you say "change the background to a beach" and the model edits in place while preserving identity.
- Flux Kontext [pro], [max], [dev] — built specifically for this. Maintains character consistency across edit chains. Supports multi-step edits (edit, then edit again, identity holds).
- nano-banana family — the same shape inside a Gemini conversation; cheaper per turn but less precise than Flux Kontext for serious editing work.
Use it for: same product on different backgrounds, same character in different scenes, swapping text on a sign, recoloring an outfit while keeping the model identical.
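Because identity holds across edits, the natural code shape is a chain: each call takes the previous output plus a new instruction. A sketch, assuming a Fal-hosted Kontext endpoint (the model path, field names, and response shape here are illustrative, not the documented API):

```python
# Edit-by-instruction chain sketch. Endpoint, fields, and response shape
# are assumptions for illustration -- check the provider's schema.
import requests

ENDPOINT = "https://fal.run/fal-ai/flux-kontext/pro"  # hypothetical model path

def edit(image_url: str, instruction: str) -> str:
    """Apply one edit instruction, return the edited image's URL."""
    resp = requests.post(
        ENDPOINT,
        json={"image_url": image_url, "prompt": instruction},
        headers={"Authorization": "Key YOUR_FAL_KEY"},  # placeholder credential
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["images"][0]["url"]  # assumed response shape

# Multi-step edit: identity should hold across the chain.
v1 = edit("https://example.com/product.png", "change the background to a sandy beach")
v2 = edit(v1, "make the lighting warm golden-hour")
print(v2)
```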
3. ControlNet (structural conditioning)
The most powerful and most fiddly. You give the model:
- An image (or a pre-extracted control signal — an edge map, a depth map, a pose skeleton, a segmentation mask).
- A prompt.
- A control type (`canny`, `depth`, `pose`, `seg`, `scribble`, `lineart`, `mlsd`, etc.).
The output follows the structure of the control signal while filling in the content from the prompt. ControlNet pose conditioning is how you get "this exact pose, but as a knight in armor in a fantasy setting." ControlNet canny is how you get "this exact building silhouette, but rendered in oil painting style."
ControlNet is most commonly used through the open-weights Flux and Stable Diffusion ecosystem. You'll see it referenced in ComfyUI workflows, in self-hosted Flux setups, and via specific Replicate/Fal models like controlnet-pose or controlnet-depth.
API surfaces vary widely; if you need this in a hosted API, search the model index on Replicate or Fal for "controlnet" plus your control type. The call shape varies by provider — don't memorize one signature.
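For a feel of the call shape, here's a sketch using Replicate's Python client. The model slug and input keys are placeholders (per the caveat above, every deployment names these differently):

```python
# ControlNet call sketch via Replicate's Python client (pip install replicate;
# needs REPLICATE_API_TOKEN in the environment). The model slug and input
# keys are placeholders -- search the Replicate/Fal index for real ones.
import replicate

output = replicate.run(
    "some-org/controlnet-pose",  # hypothetical model slug
    input={
        "image": "https://example.com/pose-reference.png",  # pose source
        "prompt": "a knight in full plate armor, fantasy castle courtyard",
        # Many deployments also expose a conditioning-strength knob, e.g.:
        # "controlnet_conditioning_scale": 0.8,
    },
)
print(output)
```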
When you'd actually reach for these
The decision tree:
- A pure text prompt is enough for most marketing/social/illustration work. The eight knobs cover the common case.
- Image-to-image when you have a rough that you want polished — a sketch, a color reference, a low-quality phone shot.
- Edit-by-instruction (Flux Kontext) when you have a finished image and want variants that preserve identity — same product on twelve backgrounds, same character in five scenes.
- ControlNet when the structure must be exact — pose-locked character animation frames, building-silhouette-preserved architectural rendering, matching the layout of a wireframe to a finished UI mockup.
The mistake to avoid: starting with ControlNet because it sounds powerful. It's a power-user feature with significant setup overhead. Most production briefs are solved with a good text prompt and, if needed, Flux Kontext for the multi-variant case.
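If it helps to see the branching explicitly, the tree compresses to a few lines (a toy sketch, not a library):

```python
# Toy encoding of the decision tree above -- purely illustrative.
def pick_conditioning(has_reference: bool,
                      needs_identity: bool,
                      needs_exact_structure: bool) -> str:
    if not has_reference:
        return "text prompt (the eight knobs)"
    if needs_exact_structure:
        return "controlnet"           # pose/canny/depth-locked structure
    if needs_identity:
        return "edit-by-instruction"  # Flux Kontext / nano-banana
    return "image-to-image"           # strength dial on a rough

# Same product on twelve backgrounds: a reference exists, identity must hold.
print(pick_conditioning(True, needs_identity=True, needs_exact_structure=False))
# -> edit-by-instruction
```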
What the next step asks
You'll be given five real-world briefs, each with one knob obviously missing. The drill is identifying which one. After that you'll read the failure modes (hands, text, character drift), then write the scoring function.