
An image prompt describes a frozen moment; a video prompt has to describe change over time. We break a shot into eight parts you can fill like a checklist, hand you a copyable template, and finish on the one contradiction that wrecks more clips than anything else.
Episode 2 of Act I. Last time you picked a hosted tool and learned to read the Artificial Analysis Video Arena leaderboard. This episode teaches the durable craft underneath every tool: how to actually write a video prompt. The model field reshuffles monthly, but the anatomy of a good prompt survives every swap.
The one idea: an image prompt describes a frozen moment; a video prompt has to describe change over time. Miss the motion and the model invents its own, which is where that "living photo" drift comes from.
What we cover:
The copyable workflow: a fill-in-the-blank template, built up live one slot at a time from "a woman dancing" to a fully directed shot, plus a second fast run on a product shot.
The pitfall: contradictory instructions (a locked-off camera AND a dramatic push-in), too many simultaneous actions, and tangling camera motion with subject motion. How to recognize it and the durable fix: change one variable at a time, separate camera and subject into clean clauses, avoid "fast," and judge a prompt over several runs, not one lucky roll.
Next up: starting from an image instead of text, the keyframe trick, negative prompts and failure modes, and seeds.
AI-generated podcast by OCDevel.
Last episode, we did the boring-but-load-bearing work: you picked a hosted video tool, and you learned to read the Artificial Analysis Video Arena leaderboard instead of trusting whatever launch reel was trending that week. Good. You've got a tool open in a browser tab. Now comes the part nobody really teaches you, the part that decides whether you get the shot you pictured or a soupy mess that sort of gestures at it.
You have to actually write the prompt.
And here's the thing that trips up almost everyone coming from image tools, or from typing things into a chatbot. Writing a video prompt is not the same job as writing an image prompt. There's one idea underneath this whole episode, and if you take nothing else away, take this. An image prompt describes a frozen moment. A video prompt has to describe change over time. That's the whole game. A still is a single instant. A video is what happens between the first frame and the last frame, and if you don't tell the model what changes, it'll make something up. Usually something you didn't want.
So today we're going to break a video prompt into its parts, the way a mechanic lays an engine out on the bench, so you can see what each piece does. Then I'll give you a fill-in-the-blank template you can steal, build one prompt up from nothing in front of you, and finish on the single mistake that wrecks more clips than any other. The components survive a model swap. The leaderboard reshuffles every month, but the anatomy of a good prompt has been stable for years, and it's the same whether you're typing into Veo, Sora, Kling, or whatever tops the chart next week.
The eight parts of a shot
Here's something reassuring. The major labs all wrote their own prompting guides, independently, and they basically arrived at the same checklist. Google's guide for Veo lists seven elements. OpenAI's guide for Sora breaks a prompt into scene, cinematography, action, dialogue, and sound. ByteDance's guide for Seedance walks you through subject, then action, then environment, then camera, then style. When that many teams converge on nearly the same list without coordinating, that's your signal the list is real. The dialect changes between tools. The parts don't.
So let me give you the parts. There are eight slots. I'll name each one, tell you what it controls, and we'll come back and fill them all in later with a real example.
Slot one is the subject. Who or what is in the frame. This sounds obvious, but the difference between "a person" and "a silver-haired woman in a red silk dress" is the difference between a generic result and a specific one. Concrete noun, appearance, wardrobe, expression. The more specific the subject, the less the model improvises.
Slot two is the action. What the subject does over the length of the clip. This is the first slot that's unique to video, and it's the one beginners forget. The trick that works is to describe the action in beats, little sequential steps. Not "she moves," but "she spins once, her dress flares out, then she dips backward." You're giving the model a tiny piece of choreography instead of a vibe.
Slot three is the camera move. What the viewpoint does over time. The camera can push toward the subject, drift alongside it, circle it, or sit dead still. We'll spend real time on the vocabulary in a minute, because this is where a clip goes from looking like a screen-saver to looking like a movie.
Slot four is the lens and framing. How tight the shot is, and what the implied optics are. Wide shot or close-up. A wide-angle look or a portrait look. How much of the background is in focus. This one's a property of the frame, so it's closer to an image idea than a motion idea.
Slot five is lighting. Soft or hard, where it's coming from, what time of day. ByteDance's Seedance guide makes a strong claim here that I think is right: if you can only add one thing to improve a shot, add a lighting description. Lighting is the highest-leverage thing you can add.
Slot six is mood and color, what a colorist would call the grade. Warm or cool, the emotional tone, the film-stock look. Grade just means the color treatment laid over the footage, the thing that makes one clip feel like a sun-bleached memory and another feel like a cold thriller.
Slot seven is pacing. The tempo. Slow, languid, gentle, or frantic. This is a knob on both the subject's motion and the camera's motion, and as we'll see at the end, it's also one of your main tools for fixing a broken shot.
Slot eight is the setting. Where this happens, the time of day, the weather. A rooftop at dusk. A wet street at night. Part of this is a frozen backdrop, and part of it can move, like rain falling or fog drifting.
The split that organizes everything
Now look at those eight slots again, because they sort into two piles, and this sorting is the most useful mental move in the episode.
Some of them describe a frozen frame. Subject, lens and framing, lighting, mood and color, setting. If you handed those five to an image tool, it could draw you a picture. They're the still half.
The other three describe change over time. Action, camera move, pacing. An image tool has no use for these, because a picture doesn't move. They're the motion half.
Here's why that split matters so much. When people come from image generation, they write a gorgeous still description, hit go, and get a clip that barely moves, or worse, one where the whole frame slowly melts and warps as the model invents motion it was never given. That "living photo" drift, where a face subtly morphs and the background breathes, is almost always a missing-motion problem. You filled the still slots and left the motion slots empty, so the model improvised.
The discipline is simple. After you've written the still description, stop and ask one question. What changes between the first frame and the last frame? Then write that down. That single habit, naming the change, is what separates someone who fights the tool from someone who directs it.
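If it helps to see that habit as code, here's a minimal sketch in Python. The Shot class, the slot names, and the missing_motion helper are my own illustration, not part of any tool's API. It just holds the eight slots and tells you which of the three motion slots you left empty.

```python
from dataclasses import dataclass

# The three slots that describe change over time (the motion half).
MOTION_SLOTS = ("action", "camera_move", "pacing")


@dataclass
class Shot:
    # Hypothetical container for the eight slots; names are illustrative.
    subject: str = ""
    action: str = ""
    camera_move: str = ""
    lens_framing: str = ""
    lighting: str = ""
    mood_color: str = ""
    pacing: str = ""
    setting: str = ""


def missing_motion(shot: Shot) -> list[str]:
    """Return the motion slots that were left empty."""
    return [name for name in MOTION_SLOTS if not getattr(shot, name).strip()]


# A gorgeous still description with no motion at all.
still_only = Shot(
    subject="a silver-haired woman in a red silk dress",
    lighting="warm golden-hour light",
    setting="a city rooftop at dusk",
)
print(missing_motion(still_only))
```

Run on that still-only description, the check names all three motion slots, which is the cue to stop and write down what changes between the first frame and the last.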
Talking to the camera
Let me give you the camera vocabulary, because this is the part that feels intimidating if you've never held a camera, and it's secretly the easiest to fake. These are real film-crew terms, and the reason they work is that the models trained on movie scripts and shot lists, where these words appear constantly. You're not learning camera operation. You're learning the handful of words the model already knows.
I'll group them so they stick. First, ways the camera moves through space.
A dolly is the big one. Originally it was a camera on a little wheeled cart, and to "dolly in" means the whole camera physically glides toward the subject. Dolly out, it backs away. A lot of tools also accept "push in" and "pull out" as plain-English synonyms, and honestly those are clearer, so use them. It feels like stepping toward someone, or backing off.
Then there's trucking, sometimes called tracking, which is the camera sliding sideways while staying the same distance from the subject, like walking alongside someone. A tracking shot more loosely just means the camera follows a moving subject. A pan is when the camera stays put and rotates side to side, like turning your head. A tilt is the same idea but vertical, like nodding. A crane, or boom, lifts the camera straight up or sets it down, which is how you get those shots that rise to reveal how big a place is.
A few more worth knowing. An orbit, or arc shot, circles around the subject while keeping it centered, which is your hero-reveal move. Handheld means exactly what it sounds like, the camera has a little organic shake, and it reads as raw and documentary and alive. And then the most important one for reasons we'll get to: static, also called locked-off. That's the camera bolted to a tripod, not moving at all. Remember that word, because it's the opposite of every move I just listed, and mixing it up with a moving camera is the classic way to break a shot.
There are spicier ones the models know too. A whip pan is a violent, blurry fast pan, usually a transition. A Dutch angle, also called a canted angle, is when you roll the camera so the horizon goes diagonal, and it instantly signals unease or tension. A rack focus is when focus shifts from one thing to another during the shot, like a face going soft as someone behind them snaps sharp. Runway's guide even lists the abbreviation F-P-V, first person view, that fast immersive flying-drone feel, as a keyword it recognizes. Google's Veo guide explicitly calls out the Dutch angle and the rack focus as behaviors you can unlock just by naming them.
Second group: how tight the frame is, what crews call shot size. A wide shot, or establishing shot, puts the subject small in a big environment, and it's called establishing because it sets up where we are at the start of a scene. A medium shot is roughly waist up. A close-up fills the frame with a face. An extreme close-up is a detail, just the eyes, or a watch. An over-the-shoulder frames one person from behind another's shoulder, which is the bread and butter of any conversation scene.
Third, the angle. Eye level is neutral. A low angle looks up at the subject and makes them feel powerful. A high angle looks down and makes them feel small. That's it. Those three do most of the emotional work.
And fourth, the lens itself, which is worth a minute because the numbers look scary and aren't. Lens focal length is measured in millimeters, and the number just tells you how wide or how tight the lens sees. A twenty-four millimeter lens is wide, it takes in a lot, and it can distort a face if you get close. A thirty-five is close to how your own eyes see. A fifty, the one photographers call the nifty fifty, looks natural and gives you nice background blur. An eighty-five is the classic portrait lens, it flatters faces and melts the background into a creamy blur. OpenAI's Sora guide will happily take a virtual focal length, you can literally type fifty millimeter or eighty-five millimeter into the prompt and it understands.
The other lens word is depth of field, which just means how much of the scene is in sharp focus from front to back. Shallow focus means only your subject is sharp and everything behind dissolves into blur. That blur even has its own name, bokeh. Deep focus means everything's sharp, near and far, so the whole scene stays readable. Shallow to isolate a person, deep to keep the context. Both Veo and Sora list this as a control you can ask for directly.
One quick trap while we're here. The word "zoom" is ambiguous. A true zoom changes the lens, but a lot of people use it to mean the camera moving closer, which is really a dolly. If you want the camera to physically move toward the subject, say "push in" or "dolly in." It's less likely to confuse the model.
Light and color
Lighting and color are where a clip stops looking like a tech demo and starts looking like something you'd pay to watch. And again, it's a vocabulary, not a skill you need to own a single lamp for.
Start with time of day, because it's the easiest win. Golden hour is that warm, soft, low light right after sunrise or before sunset, and it makes almost anything look cinematic and a little romantic. Blue hour is the cool twilight just before sunrise or after sunset, calm and moody. Overcast is flat, even, soft daylight with weak shadows, which is forgiving and neutral. Google's Veo guide leans on phrases exactly like "soft overcast daylight with even shadows."
Then the quality of the light. Soft light is diffused, with gentle feathered shadows, and it flatters faces. Hard light is sharp and high-contrast, with crisp-edged shadows, and it's dramatic. There's a pair of terms worth knowing: high-key means bright and low-contrast and cheerful, the look of a clean commercial, while low-key means dark and shadowy and moody, the look of film noir. Rim light, or back light, comes from behind the subject and traces a bright outline around them, which is how you separate a person from a dark background.
A few richer words pay off if you want a specific feeling. Chiaroscuro, a term borrowed from Renaissance painting, means extreme contrast between light and dark, deep blacks and bright highlights. Practical lights are light sources you can actually see in the shot, a lamp, a neon sign, a candle, a phone screen, instead of invisible movie lights. Volumetric light, sometimes called god rays, is when you can see the actual beams of light, through fog or dust or a window. Naming any of these gives the model a concrete target instead of a vague "make it look nice."
Color, the grade, is the last layer. Warm pushes toward amber and orange, cool toward blue and teal. The single most common Hollywood look has a name, teal and orange, warm skin tones against cool teal shadows, and you'll see it on basically every blockbuster poster. Desaturated means the color's pulled out, which reads as gritty and serious. And you can ask for film-stock looks directly, "thirty-five millimeter film," a "Kodak-style warm grade," add a little film grain and a vignette where the corners darken, and suddenly your clip has that analog, shot-on-real-film feeling. Sora's own guide uses tags exactly like "nineteen-seventies thirty-five millimeter film, natural flares, Kodak grade."
One warning. The word "cinematic" by itself is nearly meaningless now. Everyone types it, the models have seen it a billion times, and it barely steers anything. Concrete grade words beat it every time. Say "warm Kodak film look with soft grain," not "cinematic."
How to order the words
So you've got eight slots and a pile of vocabulary. Does the order matter? It does, more than you'd think.
Several 2026 write-ups, including Nemo Video's prompt guide and a prompt-adherence guide from Digen, point out that these models weight the front of the prompt most heavily. The first twenty or thirty words carry the most authority. So front-load. Put your subject and your main action first, where the model is paying the closest attention, and save the lighting and mood and grade for the back half. If something has to get dropped because you asked for too much, you want it to be the atmosphere, not the subject.
There are basically two prompt styles, and which one a tool prefers is the main thing that changes between models. One style is the rich cinematic paragraph, full flowing sentences, like you're a director briefing a crew. Google's Veo and OpenAI's Sora both love this, and Sora will even take a labeled, sectioned prompt with separate blocks for scene, cinematography, action, and sound. The other style is terse and keyword-forward, short and motion-led. Runway's Gen-4 guide is the clearest example, it tells you the model thrives on simplicity, start simple, add detail only as needed. Kling sits in the middle, structured natural language with a fixed order of parts.
Here's the move that keeps you sane. Don't memorize one tool's dialect. Learn the eight components, which are universal, and then translate them into whatever your tool likes. For Runway, comma-separate the components and keep it tight. For Veo or Sora, write them as flowing sentences. Every major lab publishes its own prompt guide naming its preferred dialect, so when you commit to a tool, read its guide once and learn how it likes to be spoken to. The components are the language. The dialect is just the accent.
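Here's a rough sketch of that translation step, assuming you keep your components as a plain list in front-loaded order. The two function names are made up for illustration, and real tools will want more nuance than this, so treat it as a starting point rather than how any lab formats prompts.

```python
# Components in front-loaded order: subject and main action first.
components = [
    "a silver-haired woman in a red silk dress",
    "she spins once, her dress flares out, then she dips backward",
    "slow push-in",
    "soft golden-hour light",
]


def keyword_style(parts: list[str]) -> str:
    """Terse, comma-separated rendering for keyword-forward tools."""
    cleaned = [p.strip().rstrip(".") for p in parts if p.strip()]
    return ", ".join(cleaned)


def paragraph_style(parts: list[str]) -> str:
    """Sentence rendering for tools that prefer flowing prose."""
    sentences = []
    for p in parts:
        p = p.strip().rstrip(".")
        if p:
            sentences.append(p[0].upper() + p[1:] + ".")
    return " ".join(sentences)


print(keyword_style(components))
print(paragraph_style(components))
```

Same components both times. Only the accent changes.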
Quick bit of plumbing while we're here, because the names get confusing. The thing you type into is often not the same name as the model underneath. Google's website for Veo is called Flow. ByteDance's app for Seedance is called Dreamina. The model is the engine, the front-end is the dashboard you actually sit at. Don't let the two names throw you.
Be specific, but don't overpack
Two forces pull against each other here, and learning to balance them is most of the craft.
Force one: specificity wins. Models take you literally. Sora's guide has the cleanest examples of this. Don't write "a beautiful street," write "wet asphalt, a zebra crosswalk, neon reflections." Don't write "moves quickly," write "jogs three steps, brakes, stops at the curb." Observable, concrete details, not abstract adjectives. The phrase to remember is show, don't tell. "Beautiful" tells the model nothing it can render. "Wet asphalt with neon reflections" it can render exactly.
But force two pulls the other way: you can overpack a prompt, and when you do, it gets worse, not better. Past a certain point the model starts dropping or blending your instructions. Sora's guide is blunt about it: treat your prompt as a wish list, not a contract, because the model may strain under too much. Kling actually publishes a budget: its faster tier wants three or four elements maximum, and even the newer one tops out around five to seven, because over-complex prompts bog it down. Seedance suggests a target of roughly sixty to a hundred words.
So how do you balance them? Decide your one non-negotiable. Usually that's the subject plus the key action, or a specific camera move you really want. Front-load that, make it specific, and let the rest of the prompt be looser. Be precise about what matters, your product, your brand color, the exact composition, and leave the atmosphere broad. When you have to cut, cut atmosphere before you cut the subject or the action. A tight prompt that nails one thing beats a bloated one that smears five things together.
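You can check both forces mechanically before you spend a generation. This is a small sketch with some assumptions baked in: the sixty-to-a-hundred word range is the Seedance suggestion mentioned above, the thirty-word window is the "first twenty or thirty words" rule of thumb, and the function names are mine. Other tools will want different numbers.

```python
WORD_RANGE = (60, 100)  # assumed target range; adjust per tool
FRONT_WINDOW = 30       # assumed size of the heavily weighted opening


def word_count(prompt: str) -> int:
    return len(prompt.split())


def within_budget(prompt: str, word_range: tuple[int, int] = WORD_RANGE) -> bool:
    low, high = word_range
    return low <= word_count(prompt) <= high


def front_loaded(prompt: str, must_have: list[str], window: int = FRONT_WINDOW) -> bool:
    """True if every non-negotiable phrase appears in the first `window` words."""
    head = " ".join(prompt.lower().split()[:window])
    return all(phrase.lower() in head for phrase in must_have)


prompt = (
    "A silver-haired woman in a red silk dress spins once, her dress flares out, "
    "then she dips backward. Soft golden-hour light, warm grade."
)
print(word_count(prompt), within_budget(prompt))
print(front_loaded(prompt, ["silver-haired woman", "spins once"]))
```

A substring match is crude, so read a True from front_loaded as "probably fine," not as a guarantee the model will honor it.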
The template, built up live
Okay. Let me hand you the actual reusable template, and then build one real prompt from nothing so you can watch each slot change the result.
Here's the skeleton, in front-loaded order. Subject, with appearance and wardrobe and expression. Then the action, in beats. Then the camera move. Then the lens, framing, and angle. Then the lighting. Then the mood and color grade. Then the setting, with time of day and weather. Then a pacing word. For a keyword tool like Runway, drop the connecting words and comma-separate. For a paragraph tool like Veo or Sora, write it as real sentences.
Now watch it grow. I'll start as bare as possible and add one slot at a time.
Start with just a subject. "A woman dancing." Run that, and you get a generic person in a generic room, the model inventing the motion and the camera, and probably that unstable living-photo drift because you gave it nothing to animate.
Add a specific subject. "A silver-haired woman in a red silk dress, dancing barefoot." Now her identity and wardrobe are locked. Already steadier.
Add the action, in beats, which is the motion slot. "She spins once, her dress flaring outward, then dips backward." Now the model has actual choreography, a defined change from first frame to last, instead of ambient wobble. This single addition fixes more drift than anything else.
Add the camera move. "A slow push-in toward her as she spins." Now there's a deliberate viewpoint, and the shot reads cinematic instead of static.
Add lens, framing, and angle. "Medium-wide shot, eye level, a fifty millimeter look, shallow focus that lifts her off the background." She pops, the background goes to soft bokeh.
Add lighting. "Warm golden-hour key light, with a soft rim light tracing her shoulders." Depth, and that flattering outline that separates her from the world behind her. Remember, Seedance's team calls lighting the single highest-leverage thing you can add.
Add the mood and grade. "A nostalgic, nineteen-seventies thirty-five millimeter film look, warm Kodak-style grade, faint grain, natural lens flares." The aesthetic locks in.
Add the setting. "On a city rooftop strung with fairy lights, at dusk, a light breeze." Now the world is grounded.
And finally pacing. "A languid, slow-motion tempo." The rhythm's set.
Read that final prompt back and you can hear how complete it is, every slot filled, front-loaded, specific where it counts. That build, by the way, is almost exactly the structure of a worked example in Sora's own prompting guide, so it's not my invention, it's the pattern the labs teach.
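If you want to replay that build yourself, here's one way to sketch it in Python. The slot labels and the build_stages helper are illustrative names of mine. The prompt text is the same wording we just built up, with the numbers written as digits.

```python
steps = [
    ("subject", "A silver-haired woman in a red silk dress, dancing barefoot."),
    ("action", "She spins once, her dress flaring outward, then dips backward."),
    ("camera", "A slow push-in toward her as she spins."),
    ("lens", "Medium-wide shot, eye level, a 50mm look, shallow focus that lifts her off the background."),
    ("lighting", "Warm golden-hour key light, with a soft rim light tracing her shoulders."),
    ("grade", "A nostalgic 1970s 35mm film look, warm Kodak-style grade, faint grain, natural lens flares."),
    ("setting", "On a city rooftop strung with fairy lights, at dusk, a light breeze."),
    ("pacing", "A languid, slow-motion tempo."),
]


def build_stages(steps):
    """Yield (slot_just_added, prompt_so_far) after each slot is appended."""
    parts = []
    for slot, text in steps:
        parts.append(text)
        yield slot, " ".join(parts)


stages = list(build_stages(steps))
for slot, prompt in stages:
    print(f"+{slot}: {len(prompt.split())} words")

final_prompt = stages[-1][1]
print(final_prompt)
```

Generating from each stage in turn, rather than only the final one, lets you see what each slot actually bought you.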
If you want one more level, some tools, Sora and Seedance among them, take a timed shot list, where you break even a single clip into labeled time segments, each with its own camera setup and action. Zero to two and a half seconds, the reveal, a slow push-in. Two and a half to four seconds, the dip, handheld. That's a genuinely powerful technique, but it's a level-two move. Get the eight slots solid first.
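For the curious, a timed shot list is easy to keep as data. The sketch below is an assumption about layout, not any tool's required syntax: the Segment class, the bracketed time labels, and the "Camera:" wording are all invented here, so check your tool's guide for the format it actually expects.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    action: str
    camera: str


def format_shot_list(segments: list[Segment]) -> str:
    """Render segments as labeled time ranges, checking they are contiguous."""
    lines = []
    previous_end = None
    for seg in segments:
        if seg.end <= seg.start:
            raise ValueError(f"segment must end after it starts: {seg}")
        if previous_end is not None and seg.start != previous_end:
            raise ValueError(f"gap or overlap before {seg.start:g}s")
        lines.append(f"[{seg.start:g}s-{seg.end:g}s] {seg.action}. Camera: {seg.camera}.")
        previous_end = seg.end
    return "\n".join(lines)


shot_list = [
    Segment(0, 2.5, "The reveal", "slow push-in"),
    Segment(2.5, 4, "The dip", "handheld"),
]
print(format_shot_list(shot_list))
```

The contiguity check is the useful part: a gap or overlap between segments is one more way to hand the model an instruction it can't satisfy.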
Let me run the template once more, fast, with a completely different shot, so you can feel it flex. Say you're a marketer and you need a three-second product hero for a sneaker. Subject first: "a white running shoe on wet black stone." Action, in beats: "a single droplet falls and splashes off the toe." Camera: "a slow orbit around the shoe." Lens and framing: "extreme close-up, low angle, shallow focus." Lighting: "hard rim light catching the wet edges." Grade: "high-contrast, desaturated, cool teal." Setting: "a dark studio with one practical light off to the side." Pacing: "smooth and deliberate." Same eight slots, totally different result, and notice it took about thirty seconds to write because you're just walking the checklist. That's the whole point of having a template. You stop staring at a blank box wondering what to say, and you start filling in slots.
And notice what I did and didn't include. There's no dialogue, no character backstory, no plot. It's one subject, one clear action, one camera move. That restraint is deliberate. A three-second clip can carry one idea cleanly or four ideas badly, and the viewer watching it can only register one anyway.
One more pro touch worth stealing from Kling's guide: when you name a camera move or an action, give it an endpoint. Instead of just "the camera pushes in," try "the camera pushes in, then settles." Instead of "she turns," try "she turns to face us, then holds." Telling the model where the motion ends keeps it from drifting past the moment you wanted and sliding into warp territory. You're not just naming the change. You're naming where the change stops.
The one mistake that breaks the shot
Every episode I want to leave you with the single pitfall you'll actually hit, and for prompting, it's this: contradiction. You ask the model for two things that can't both be true, and instead of picking one, it averages them into mush.
The clearest version, and you will do this, is asking for a locked-off, static camera and a dramatic push-in, in the same prompt. Those are opposites. The camera can't be bolted down and gliding forward at once. So the model either silently ignores one of your instructions, or it splits the difference and gives you something unstable and warped. LTX's guide on common prompt mistakes describes exactly this, that contradictory elements make the model average competing signals rather than commit.
The sibling version is too many actions at once. "She runs, sits, waves, and turns" all in a three-second clip. The model can't choreograph all of it, so it blends them into something that looks like a person glitching. And the most common structural mistake, which Seedance's guide names as its number one, is failing to separate the camera's motion from the subject's motion. If you tangle them together, you get uncontrollable, shaky video. The fix is to state them as two clean, separate clauses. "The dancer spins slowly. The camera holds a fixed frame." Two sentences, two jobs, no confusion.
One specific word to be careful with: "fast." Seedance's team flags "fast" as the single keyword most likely to wreck quality, because fast camera plus fast cuts plus a busy scene almost guarantees jitter and artifacts. Reach for "slow," "smooth," "gradual," or "gentle" instead. You can always speed a clip up later in editing. You can't un-jitter it.
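You can catch the crudest versions of all three before you hit generate. What follows is a deliberately naive sketch: the word lists, the two-action limit, and the lint_prompt name are my assumptions, and simple keyword matching will miss plenty and occasionally cry wolf. It's a pre-flight check, not a judge.

```python
import re

# Illustrative word lists; extend them for your own vocabulary.
STATIC_TERMS = ["static", "locked-off", "locked off", "fixed frame"]
MOVING_TERMS = ["push-in", "push in", "dolly", "pull out", "orbit",
                "pan", "tilt", "crane", "handheld"]
ACTION_VERBS = ["runs", "sits", "waves", "turns", "spins", "dips", "jumps"]


def found(terms: list[str], text: str) -> list[str]:
    """Return the terms that appear in text as whole words or phrases."""
    return [t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", text)]


def lint_prompt(prompt: str, max_actions: int = 2) -> list[str]:
    text = prompt.lower()
    warnings = []
    static, moving = found(STATIC_TERMS, text), found(MOVING_TERMS, text)
    if static and moving:
        warnings.append(f"camera contradiction: {static[0]} vs {moving[0]}")
    actions = found(ACTION_VERBS, text)
    if len(actions) > max_actions:
        warnings.append(f"too many actions: {', '.join(actions)}")
    if found(["fast"], text):
        warnings.append("risky pacing word: fast")
    return warnings


bad = "Locked-off static camera with a dramatic push-in. She runs, sits, waves, and turns."
good = "The dancer spins slowly. The camera holds a fixed frame."
print(lint_prompt(bad))
print(lint_prompt(good))
```

The first prompt trips the camera and action checks, and the second, with its two clean clauses, passes quietly.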
So how do you recognize you've hit this? Three tells. The output picks one interpretation and quietly ignores the rest. Or it blends your instructions into morphing, warping mush. Or it flat-out skips your camera move, the subject moves but the camera sits there, or the reverse.
And here's the fix, which is the real durable skill, more valuable than any vocabulary word I gave you. Change one variable at a time. This is unanimous across every guide I read. If you change three things between generations, you'll never know which one helped. So move one slot, regenerate, keep what worked, and build up rather than down. Start simple, add one component, judge it, add the next. Use one primary camera instruction, not three. Control speed with a pacing word, not a pile of technical parameters. And move your most important instruction to the front.
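Here's a minimal sketch of that discipline, assuming you keep your slots in a dictionary. The one_at_a_time helper and the slot keys are hypothetical names. The point is just that every variant it produces differs from your baseline in exactly one slot, so you always know what caused a change.

```python
def one_at_a_time(base: dict[str, str], alternatives: dict[str, list[str]]):
    """Yield (changed_slot, variant) pairs that differ from base in one slot only."""
    for slot, options in alternatives.items():
        for option in options:
            if option != base.get(slot):
                variant = dict(base)  # copy so the baseline is never mutated
                variant[slot] = option
                yield slot, variant


base = {
    "subject": "a white running shoe on wet black stone",
    "camera": "slow orbit around the shoe",
    "pacing": "smooth and deliberate",
}
alternatives = {
    "camera": ["slow push-in, then settles", "static, locked-off"],
    "pacing": ["gentle"],
}

variants = list(one_at_a_time(base, alternatives))
for slot, variant in variants:
    print(f"changed {slot}: {variant[slot]}")
```

Generate from the baseline and from each variant, keep the winner as your new baseline, and repeat.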
There's a deeper reason to generate more than once, too. The same prompt gives a different result every run, that's just how these models work, which is why Sora's guide says treat the prompt as a wish list, not a contract. So don't judge a prompt on a single roll. Run it a few times and look at the spread. That ties straight back to last episode, where we said bench tools on your own shots instead of trusting the demo reel. Same instinct. You're not looking for one lucky clip. You're looking for a prompt that lands reliably.
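If you like keeping score, a tiny tally is enough. In this sketch the ratings are placeholders you'd type in by hand after watching each clip. They are not real results, and the hit_rate name is mine. Three runs is a thin sample, so treat the comparison as a hint, not a verdict.

```python
def hit_rate(ratings: list[bool]) -> float:
    """Fraction of runs you judged as landing the shot."""
    if not ratings:
        raise ValueError("rate at least one run")
    return sum(ratings) / len(ratings)


# Placeholder ratings, one per run: True means the clip landed.
runs = {
    "prompt_a": [True, False, False],
    "prompt_b": [True, True, False],
}

for name, ratings in runs.items():
    print(f"{name}: {hit_rate(ratings):.2f}")

best = max(runs, key=lambda name: hit_rate(runs[name]))
print("most reliable so far:", best)
```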
Where this goes next
That's the anatomy. Eight slots, sorted into the still half and the motion half, front-loaded, specific where it counts, and free of contradictions. Internalize that and you can sit down at any tool on the leaderboard and speak its language with an accent it understands.
And it sets up everything coming. Next time we'll look at starting from an image instead of pure text, and here's the lovely part: when you hand the model a still you've already approved, the still slots, subject, framing, lighting, setting, are already fixed by that picture. So the prompt collapses down to mostly the motion slots, action and camera and pacing, the exact three we pulled out today. After that, the keyframe trick, where you mint that perfect opening still yourself. Then negative prompts and the failure modes, the morphing hands and the flicker, where naming the change pays off again. And seeds, which are the real answer to that "different result every time" problem, the way you lock a look you've landed so you can come back to it.
But that's next time. For now, go write one prompt with all eight slots filled, run it three times, and watch how much closer you land to the shot in your head.