
The multi-shot problem isn't a model problem, it's a planning problem: because generative video is stateless, the cheapest edit you'll ever make is a shot list and a storyboard built before you spend a single credit. Learn the end-to-end workflow from brief to beats to board to animatic, and the continuity rules that make independently generated clips actually cut together.
This episode pairs a fast news rundown with the planning tutorial that anchors Act Three: how to storyboard multi-shot AI video scenes before you generate anything.
News (late June 2026)
Tutorial: Storyboarding Multi-Shot Scenes
Generative video is stateless, so continuity has to be decided on paper. Covers the shot list and its columns (Boords template), the storyboard and animatic, the brief-to-assemble workflow, continuity rules (180-degree rule, eyeline match, coverage), and why frameworks like MultiShotMaster and CoAgent exist. Scene-builder tools compared: LTX Studio, Runway Workflows, Google Flow, Kling 3.0, Higgsfield, Invideo, Krea, plus non-generative StudioBinder and Boords. Note: Sora was discontinued around April 2026.
Let's start with the news, and the story of the week belongs to Runway, which shipped four separate things between June twenty-fourth and June thirtieth. On June twenty-fourth, they added four-K support for Seedance 2.0 on the platform, with six new four-K ratios on the video endpoints, including thirty-eight-forty by twenty-one-sixty in sixteen by nine widescreen, and the vertical flip of that. That tier bills at a hundred and fifty credits per second, so it's the premium option.
On June twenty-fifth came Agent 2.0, an upgraded conversational agent aimed squarely at marketers. It builds complete campaigns, analyzes creative performance, and scales ads and videos across platforms and formats, all inside one conversation. Then on June twenty-sixth, Runway added Seedance 2.0 Mini, a much cheaper tier that takes text, image, and video inputs, supports keyframe control, and generates audio. Mini makes four-to-fifteen-second clips at four-eighty or seven-twenty p, and it prices at sixteen credits per second with a sixty-four-credit minimum. Compared to full four-K Seedance at a hundred and fifty credits a second, Mini is roughly nine times cheaper per second, which makes it the obvious choice for drafts and batch work.
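To make that price gap concrete, here is a minimal sketch of the credit arithmetic. The rates are the ones quoted above; the function names are mine, and I'm assuming the sixty-four-credit minimum works as a simple floor on each generation, which the announcement may or may not mean.

```python
# Rates quoted in the episode, in credits per second.
SEEDANCE_4K_RATE = 150
MINI_RATE = 16
MINI_MINIMUM = 64  # assumption: a flat floor applied per generation


def mini_cost(seconds):
    """Credits for one Mini clip, never less than the minimum."""
    return max(MINI_RATE * seconds, MINI_MINIMUM)


def four_k_cost(seconds):
    """Credits for one full 4K Seedance clip."""
    return SEEDANCE_4K_RATE * seconds


ratio = SEEDANCE_4K_RATE / MINI_RATE

print(mini_cost(4))     # 64
print(mini_cost(10))    # 160
print(four_k_cost(10))  # 1500
print(ratio)            # 9.375
```

Notice that four seconds at sixteen credits is exactly sixty-four, so under this reading the minimum is just the price of the shortest clip Mini will make.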
Rounding out the week, on June twenty-ninth and thirtieth, Runway shipped Seed Audio 1.0, first to paid plans, then via the API. It's a text-to-speech plus sound-effects and music model that generates up to a hundred and twenty seconds of audio, with an optional thirty-second audio reference, outputting standard audio file formats. It bills at a quarter-credit per second, and it puts native voice, effects, and music inside the Runway pipeline, competing with the dedicated audio finishing tools. One attribution note: Seed Audio and Seedance are ByteDance-lineage technology. Here they show up as models hosted on Runway's platform, so think "Runway added them," not "Runway trained them."
Just before that window, on June twenty-third, ByteDance announced Seedance 2.5 at its Volcano Engine conference. Treat it as just-announced and not yet shipped, with enterprise beta now and a public launch targeted for early July. Reportedly, and I stress reportedly since pricing and benchmarks aren't public, it does a single-pass thirty-second clip with scene changes inside one generation, accepts up to fifty reference assets across images, audio, and video, and supports native four-K. For reference, Seedance 2.0 ran about two dollars fifty per fifteen-second clip on some third-party platforms. One viral post called it the best video model in the world; treat that as a launch reel, not a fact, since no arena scores exist yet.
Finally, the leaderboard snapshot, which reshuffles monthly. On the Artificial Analysis Video Arena, in text-to-video with audio, Dreamina Seedance 2.0 leads. Without audio, Alibaba's HappyHorse line leads, with Seedance close behind. For open weights, Lightricks' LTX-2 owns the board. The durable read: treat every model as interchangeable.
Today's episode is the planning episode, the one that happens before you generate anything at all. And I want to open with the single idea that runs underneath everything we're going to do, because if you take one thing away, take this. The multi-shot problem is not a model problem. It's a planning problem. Over the last several episodes in Act Two, you learned how to make a single shot reliable. You learned character consistency with sheets and references. You learned keyframe chaining, seeding one shot from the last frame of the shot before it. You learned style look-locking so the palette and lighting hold. All of that made one shot trustworthy. Now the job changes. Now the job is to make a set of shots cut together, and that decision gets made before generation, on paper, in a shot list and a storyboard.
So why does this matter more for AI video than it does for a traditional film shoot? The answer is a single word, and that word is stateless. Generative video is stateless. Every text-to-video or image-to-video call produces a standalone clip that has no memory of the clip before it. There's no persistent memory, no feedback loop between generations. The model cannot recall an entity's exact visual identity from one shot and reproduce it in the next. So when you ask for a sequence without a plan, what you get is identity drift, scene discontinuity, and unstable style. Researchers who tried feeding multiple shot descriptions at once reported, and I'm quoting the spirit of it, wildly inconsistent characters and wildly inconsistent settings.
This statelessness is exactly why an entire ecosystem of frameworks exists to fight it. Kuaishou, the company behind Kling, released an open-source project called MultiShotMaster specifically for multi-shot generation. On the academic side, there's work like VideoMemory, which introduces what it calls a Dynamic Memory Bank, and CoAgent, which runs a plan, then generate, then verify, then refine loop. You don't need to memorize those names. What you need to internalize is why they exist. They exist because the raw model forgets, and somebody has to hold the memory. In the do-it-yourself workflow we're about to build, that somebody is you, and the memory is your shot list and your board.
There's a second reason planning matters more here, and it's about money: every reroll is a charge you can't take back. On a physical set, coverage is cheap once the crew is standing there. You roll the master, then the medium, then the close-up, all in one setup, because the lights are already hung and the actors are already in position. In AI video, none of that is true. Every single shot is a paid generation, measured in credits or compute-seconds. A shot that won't cut is a shot you pay to regenerate. So planning becomes the cheapest edit you will ever make. Say that back to yourself. Planning is the cheapest edit you'll ever make, because it's the only edit that costs nothing to redo.
Let's define the first artifact. A shot list is the plain-text plan that breaks a scene into individual camera setups, one row per shot, listing what each shot shows and how it's framed. In traditional film it's the director's blueprint, telling the crew what to capture. In AI video it's something even more direct, because each row maps almost one-to-one onto a generation prompt. That's the magic of it. Every row you write is a prompt you're going to run. So the shot list isn't paperwork you do and set aside. It's the actual script your generator reads, translated by you into prose.
Now, a full professional shot-list template can run to about fourteen columns, but let's walk the core set so you know what each one controls. First is the shot number, a reference ID like one-A or one-B within a scene, so you can talk about a specific setup. Second is the shot description, the action and any dialogue. Third is the shot size, or framing, and this is a whole vocabulary of its own. You've got the extreme wide shot, abbreviated E-W-S. The wide or long shot. The medium wide. The medium shot, M-S. The medium close-up, M-C-U. The close-up, C-U. And the extreme close-up, E-C-U. That ladder from extreme wide down to extreme close is how you control how much of the world and how much of the face the viewer sees in any given beat.
Next column is the angle. Eye-level, which reads neutral. High angle, looking down, which tends to diminish the subject. Low angle, looking up, which tends to empower. Overhead or bird's-eye, straight down. And the Dutch tilt, a canted, tilted frame that signals unease. After angle comes camera movement. Static, meaning locked off. Pan, tilt, dolly or tracking, crane, push-in, zoom, or handheld. Then there's the lens column, the focal length, and this one really matters because it controls compression and depth of field. A twenty-four-millimeter lens is wide. A fifty-millimeter is normal, close to how the eye sees. An eighty-five-millimeter is a portrait or telephoto lens that compresses the background and throws it soft. After that you've got location, subject or talent, props, lighting and time of day, audio, equipment, and notes or duration.
That's the full film template, but for AI video you can trim it to an actionable subset per row. The columns you truly need are subject, action, camera, and by camera I mean size plus angle plus movement bundled together, then lens, then light. Plus two more that aren't optional for us, because they're literal generator parameters. Duration and aspect ratio. Those aren't creative preferences you can smooth over in the edit. They're numbers the generator needs. And then I want you to add two columns that are AI-only, that no film shot list has ever needed. The first is reference assets, meaning which character-sheet image, which style reference, or which last-frame this shot seeds from. The second is seed, the number that pins a generation's randomness, so you can reproduce a result exactly or vary it deliberately.
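If you keep your shot list in code rather than a spreadsheet, the trimmed row might look something like this. It's a sketch, not a standard: the class name, the field names, and the reference file names are all hypothetical, chosen to mirror the columns just described.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Shot:
    shot_id: str                 # e.g. "1A"
    subject: str
    action: str
    camera: str                  # size + angle + movement, bundled
    lens: str
    light: str
    duration_s: int              # literal generator parameter
    aspect_ratio: str            # literal generator parameter
    references: List[str] = field(default_factory=list)  # AI-only column
    seed: Optional[int] = None                            # AI-only column


# One example row; the file names are placeholders.
shot_1a = Shot(
    shot_id="1A",
    subject="Mia",
    action="enters frame-left carrying a box",
    camera="extreme wide shot, high angle, slow push-in",
    lens="24mm",
    light="golden-hour window light",
    duration_s=5,
    aspect_ratio="9:16",
    references=["mia_sheet_03.png", "loft_style.png"],
    seed=4412,
)

print(shot_1a)
```

Leaving the seed optional is a deliberate choice here: a row with no seed yet is a shot you haven't generated, which is a useful thing to be able to see at a glance.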
Let me pause on aspect ratio for a second, because it's a silent killer. Aspect ratio is the width-to-height shape of the frame. Sixteen by nine is widescreen, the YouTube shape. Nine by sixteen is vertical, the Reels and TikTok shape. One by one is square. And two-point-three-nine to one is the wide anamorphic cinema shape. Here's the rule. Decide your aspect ratio once, at the top of the shot list, and never let it drift. A mismatched ratio between two shots is an instant, guaranteed won't-cut. Your viewer's eye catches it before their brain does.
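That rule is easy to check mechanically before you generate anything. Here's a small sketch, assuming each row of your shot list is a dictionary with an "id" and an "aspect_ratio" key; those key names and the function name are mine.

```python
def ratio_drift(shots, locked_ratio):
    """Return the IDs of shots whose aspect ratio differs from the locked one."""
    return [s["id"] for s in shots if s["aspect_ratio"] != locked_ratio]


shots = [
    {"id": "1A", "aspect_ratio": "9:16"},
    {"id": "1B", "aspect_ratio": "9:16"},
    {"id": "1C", "aspect_ratio": "16:9"},  # drifted on purpose for the demo
]

print(ratio_drift(shots, "9:16"))  # ['1C']
```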
Now the visual sibling of the shot list, the storyboard. A storyboard is a sequence of drawn or generated frames, one per shot, showing composition, character placement, and camera framing before anything moves. It's the visual version of the shot list. And then there's the animatic, which is where it comes alive. An animatic is the storyboard frames edited together with rough timing and scratch audio, rendered out as video. It's a moving storyboard. Its whole purpose is to test pacing and to check whether the sequence actually reads before you commit to full production. There's an even fancier cousin called previs, or previsualization, which does the same thing with rough three-D, but for our purposes the animatic is the workhorse.
Here's why the storyboard is uniquely, almost unfairly valuable in AI video, more than it ever was in film. Your storyboard frames double as keyframes. Think about that. A frame you mint in an image model to plan a shot is the exact same asset you feed an image-to-video generator as the first frame. Or, using the keyframe chaining you learned earlier, as the last frame seeding the next shot. The planning artifact and the production input are the same file. In film, a storyboard is a sketch you throw away once you shoot. In AI video, the storyboard is the shoot. That collapse is the single biggest reason boarding pays off more here than anywhere else.
And the animatic is a near-free way to catch a bad cut before you spend a single credit. Cut your stills onto a timeline with placeholder durations and watch it back, and you'll feel a jump cut or a screen-direction error while it still costs you nothing to fix. That feeling, that little wrongness in your gut when two frames sit next to each other, is worth more than any amount of regeneration. You want to provoke that feeling early, on free stills, not late, on paid clips.
So let me give you the copyable end-to-end workflow, start to finish, and then we'll go back and deepen the tricky parts. There are seven steps, and they run from brief, to beats, to shot list, to board and keyframes, to animatic, to generate, to assemble.
Step one is the brief. One or two sentences capturing the goal, the platform, and the who and where. The platform matters because it sets your aspect ratio and your length before you make any other decision. Here's my running example for the whole episode. A thirty-second vertical, nine-by-sixteen product teaser, where our character Mia unboxes a pair of headphones in a sunlit loft, and the mood is upbeat. That one sentence just locked my ratio to nine by sixteen and my length to thirty seconds, and it told me my character, my setting, and my tone.
Step two is beats. You break the brief into three to six story beats. Arrival, discovery, reaction, payoff, that kind of shape. And here's an important distinction. Beats are narrative. Shots are camera. They are not the same thing. One beat can be several shots. The discovery beat, Mia opening the box, might be a wide, a medium, and a close-up all by itself. Don't confuse the story unit with the camera unit.
Step three is the shot list itself. One row per shot, and each row carries subject, action, camera, lens, light, duration, aspect ratio, reference, and seed. And I want you to aim for coverage even if the whole sequence is only twenty seconds long. Coverage means options. We'll come back to what that word means precisely.
Step four is the board and keyframes. For each row, you mint one still in your image model. And crucially, you reuse the character sheet from the consistency episode as a reference, so Mia is literally the same person in every frame. And you apply the locked look, the style reference from the style episode, so the palette and lighting match. These stills are simultaneously your storyboard and your first and last frames. Same artifact, two jobs.
Step five is assembling a quick animatic. Drop those stills onto a timeline with rough durations and scratch audio, and watch it. This is where you fix continuity, now, while it's free.
Step six is generate. You run each shot as image-to-video, seeding from its board frame, and you chain last-frame to first-frame wherever two shots need to be continuous.
And step seven is assembling the real edit in an N-L-E, a non-linear editor, which is software where you arrange clips on a timeline in any order. DaVinci Resolve, CapCut, or Premiere Pro. That assembly step is its own episode, coming up next, so today we stop at the animatic and the generation.
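Step five's animatic can be roughed out as plain numbers before it ever touches a timeline. This sketch lays placeholder durations end to end and checks the total against the thirty-second brief. The six durations are invented for illustration, and the function name is mine.

```python
def animatic_timeline(durations):
    """Return (start, end) in seconds for each still, laid end to end."""
    timeline = []
    cursor = 0
    for d in durations:
        timeline.append((cursor, cursor + d))
        cursor += d
    return timeline


durations = [5, 4, 3, 6, 4, 8]  # hypothetical placeholder durations, one per shot
timeline = animatic_timeline(durations)
total = timeline[-1][1]

print(timeline)  # [(0, 5), (5, 9), (9, 12), (12, 18), (18, 22), (22, 30)]
print(total)     # 30
```

If the total overshoots the brief, that's a pacing decision to make now, on stills, not after the clips are paid for.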
Now let's go deep on continuity, because this is where sequences live or die. Remember, the model won't enforce any of these rules for you. It has no memory and no craft. So you enforce them, in the shot list and on the board. There are five big ones.
The first is the one-eighty-degree rule. Imagine a line, called the axis of action, running through your two subjects. You keep the camera on one side of that line across all your shots. Why? Because it preserves screen direction. A character stays on the same side of the frame, and eyelines stay consistent, so two people talking keep looking at each other. The moment you cross that line, the two people suddenly appear to look the same way, off in the same direction, instead of at each other, and it's deeply jarring even to a viewer who couldn't name why. For AI, the fix is beautifully concrete. Encode left and right position directly in each prompt. Write "Mia frame-left, facing right." Say it every time, so the generator can't flip her on you.
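Here's one way to hold yourself to that, sketched under some assumptions. Each shot records which side of frame each character sits on; one helper writes the position phrase for the prompt, and another flags any character who changes sides between appearances. The data layout, the function names, and the second character are all hypothetical, and the facing logic assumes a character on one side looks toward the other side.

```python
def position_phrase(character, side):
    """Build the screen-direction phrase to paste into a prompt."""
    facing = "right" if side == "left" else "left"
    return f"{character} frame-{side}, facing {facing}"


def screen_side_flips(shots):
    """Return (shot_id, character) pairs where a character switched frame side."""
    last_side = {}
    flips = []
    for shot in shots:
        for character, side in shot["positions"].items():
            if character in last_side and last_side[character] != side:
                flips.append((shot["id"], character))
            last_side[character] = side
    return flips


shots = [
    {"id": "1A", "positions": {"Mia": "left", "Courier": "right"}},
    {"id": "1B", "positions": {"Mia": "left"}},
    {"id": "1C", "positions": {"Mia": "right", "Courier": "left"}},  # crossed the line
]

print(position_phrase("Mia", "left"))  # Mia frame-left, facing right
print(screen_side_flips(shots))        # [('1C', 'Mia'), ('1C', 'Courier')]
```

Treat it as a lint, not a judge. A character who really does walk across the room will trip it too, and that's fine, because it forces you to confirm the flip was a choice.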
The second rule is the eyeline match. If you show someone looking off-frame, and then you cut to what they see, the gaze vectors have to be consistent. It's tied directly to the one-eighty rule. And the practical takeaway is that you plan the pair together, the looker and the looked-at, as a unit, never separately.
The third rule is matching on action. You overlap the same movement across a cut. She reaches for the box in the wide shot, and the close-up continues that same reach. In AI video, the way you get this is to generate an overlap of motion at your shot boundaries, so the editor has a moment of shared movement to cut on. Without that overlap, the cut feels like a hiccup.
The fourth idea is establishing versus coverage. An establishing shot places the viewer in the space, and it's usually a wide. Coverage means shooting from multiple sizes and angles so you have cutting options later. And a master shot is one angle that holds the whole scene start to finish, which can double as your establisher. The working rule is to plan at least a wide, a medium, and a close for every beat. That's coverage. That's what gives your future self, in the edit, room to breathe.
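The wide, medium, close rule is another one you can check on the shot list itself. This is a rough sketch that assumes every row is tagged with its beat and a simplified size of "wide", "medium", or "close"; the tags and the function name are my own shorthand, not a template column.

```python
REQUIRED = {"wide", "medium", "close"}


def missing_coverage(shots):
    """Map each under-covered beat to the sizes it still lacks."""
    seen = {}
    for shot in shots:
        seen.setdefault(shot["beat"], set()).add(shot["size"])
    return {
        beat: sorted(REQUIRED - sizes)
        for beat, sizes in seen.items()
        if REQUIRED - sizes
    }


shots = [
    {"id": "1A", "beat": "arrival", "size": "wide"},
    {"id": "1B", "beat": "arrival", "size": "medium"},
    {"id": "1C", "beat": "arrival", "size": "close"},
    {"id": "2A", "beat": "discovery", "size": "wide"},
    {"id": "2B", "beat": "discovery", "size": "close"},
]

print(missing_coverage(shots))  # {'discovery': ['medium']}
```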
The fifth is entrances and exits. If a character exits frame-right, they should enter the next shot frame-left, to preserve their travel direction. If they exit right and enter right again, they appear to have reversed. So note the exit edge and the entry edge in your shot list, explicitly, every time someone moves between shots.
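If you note those edges as their own fields, the check takes a few lines. In this sketch each row may carry an "exit" edge and an "entry" edge; the field names are assumptions, and the check only compares adjacent rows, so it presumes the same character is travelling across that cut.

```python
OPPOSITE = {"left": "right", "right": "left"}


def direction_errors(shots):
    """Return (previous_id, next_id) pairs where travel direction reverses."""
    errors = []
    for prev, nxt in zip(shots, shots[1:]):
        exit_edge = prev.get("exit")
        entry_edge = nxt.get("entry")
        if exit_edge and entry_edge and entry_edge != OPPOSITE[exit_edge]:
            errors.append((prev["id"], nxt["id"]))
    return errors


shots = [
    {"id": "2A", "exit": "right"},
    {"id": "2B", "entry": "left", "exit": "right"},
    {"id": "2C", "entry": "right"},  # exits right, then enters right: reversed
]

print(direction_errors(shots))  # [('2B', '2C')]
```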
Underneath all five of these sits what I'll call the consistency substrate, and it's built entirely from your earlier-episode primitives. Identity, look, and motion continuity are carried by three things. The character sheet, those multi-reference images defining the same face and wardrobe, which pins who is on screen. The look-lock, or style reference, which pins the palette and the lighting. And keyframe chaining, the last-frame seeding and first and last-frame interpolation, which pins motion continuity across a boundary. Everything you learned in Act Two is the foundation that makes today's planning actually hold.
Let me make the row-to-prompt translation completely concrete, because this is the moment the plan becomes a generation. Here's a worked row for shot one-A. It reads: one-A, extreme wide shot establishing, high angle, slow push-in, twenty-four-millimeter, golden-hour window light, Mia enters frame-left, five seconds, nine by sixteen, references the third image of the Mia sheet plus the loft style frame, seed forty-four-twelve. Now watch how that becomes a prompt. You'd write something like: extreme wide shot, high angle, slow push-in, twenty-four-millimeter look, golden-hour side light from a large window, sunlit loft, Mia entering from frame-left carrying a box, warm cinematic color, nine by sixteen. Then you attach the character reference and the style frame, you set the duration to five seconds, and you set the seed to forty-four-twelve.
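The same translation can be sketched as a small function. Everything here is illustrative: the dictionary keys, the reference file names, and the shape of the request are my own, and no particular generator's API is implied. The point is only that the prompt is assembled from the row in a fixed order, while duration, seed, and references travel alongside it as parameters.

```python
row = {
    "id": "1A",
    "camera": ["extreme wide shot", "high angle", "slow push-in"],
    "lens": "24mm look",
    "light": "golden-hour side light from a large window",
    "setting": "sunlit loft",
    "subject_action": "Mia entering from frame-left carrying a box",
    "grade": "warm cinematic color",
    "aspect_ratio": "9:16",
    "duration_s": 5,
    "references": ["mia_sheet_03.png", "loft_style.png"],  # placeholder names
    "seed": 4412,
}


def build_prompt(row):
    """Join the row's fields into one prompt, always in the same order."""
    parts = [
        *row["camera"],
        row["lens"],
        row["light"],
        row["setting"],
        row["subject_action"],
        row["grade"],
        row["aspect_ratio"],
    ]
    return ", ".join(parts)


def build_request(row):
    """Bundle the prompt with the non-text parameters (hypothetical shape)."""
    return {
        "prompt": build_prompt(row),
        "duration_s": row["duration_s"],
        "seed": row["seed"],
        "references": list(row["references"]),
    }


print(build_prompt(row))
print(build_request(row)["seed"])  # 4412
```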
And here's a discipline that will save you enormous grief. Keep subject, camera, lens, and light in the same order in every single row and every single prompt. Same grammar every time. When your prompts share a structure, drift becomes easy to spot, and you can vary exactly one axis at a time. Reuse the same seed across rerolls of the same shot, so everything stays constant while you change one word. Only change the seed when you genuinely want a different take. That one habit, one variable at a time, holding the seed, is what separates deliberate iteration from gambling.
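You can even police that habit. This sketch compares two takes of the same shot and reports which fields changed; treating "exactly one field changed" as the test for a deliberate take is my reading of the one-variable rule, and the field names are placeholders.

```python
def changed_fields(previous, current):
    """Return the sorted names of fields that differ between two takes."""
    keys = previous.keys() | current.keys()
    return sorted(k for k in keys if previous.get(k) != current.get(k))


def is_deliberate(previous, current):
    """True when exactly one field changed between the two takes."""
    return len(changed_fields(previous, current)) == 1


take_1 = {"camera": "medium shot", "lens": "50mm",
          "light": "golden-hour window light", "seed": 4412}
take_2 = dict(take_1, light="overcast window light")             # one word, seed held
take_3 = dict(take_1, light="overcast window light", seed=9001)  # two things at once

print(changed_fields(take_1, take_2))  # ['light']
print(is_deliberate(take_1, take_2))   # True
print(changed_fields(take_1, take_3))  # ['light', 'seed']
print(is_deliberate(take_1, take_3))   # False
```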
Now let's talk tools, because there are two philosophies here and you should know both. On one side is the bare text-to-video box, which only knows the current clip. On the other side is a scene builder, which is a tool that manages a whole multi-shot sequence in one project. You lay out shots as cards or panels, you keep your characters and your style attached across all of them, and you generate them together rather than one at a time. Let me walk the field.
LTX Studio is a script-to-storyboard pipeline. You paste in a screenplay and it segments it into scenes, generates storyboard thumbnails, suggests framing, and casts characters using what it calls Persistent Character Profiles, where you define age, wardrobe, and facial details once and hold them across shots. It also has keyframe-able camera controls like crane, orbit, and tracking, plus sound effects and soundtrack. It's priced in compute-seconds, with a free tier, a Lite tier at fifteen dollars a month, a Standard tier at thirty-five dollars a month with commercial rights and Veo 2, and a Pro tier at a hundred and twenty-five dollars a month with more seconds and both Veo 2 and Veo 3.
Runway takes a different angle with Workflows, which let you chain steps together, either Runway-built or custom no-code, wiring the outputs of one step into the inputs of the next for a repeatable pipeline. Its generation is also strong for concept work and storyboarding. Runway is credit-based, with a one-time free grant, a Standard plan around twelve to fifteen dollars a month, a Pro plan around twenty-eight to thirty-five, and an Unlimited or Max tier up in the seventies to nineties.
Google Flow is Google's Veo-powered app, and it has a Scene Builder specifically for stitching and chaining shots into longer sequences, with camera controls and native audio. It's credit-based too, with a free daily allotment, an AI Pro plan at about twenty dollars a month, and an AI Ultra plan at around two hundred and fifty. Note that the Veo tiers cost different amounts per generation, with the lighter Veo 3.1 tiers cheap and the quality tier much more expensive.
Kling, with Kling 3.0, added a native multi-shot storyboard tool in its Omni line, where you set duration, size, angle, pacing, and camera movement per shot, and the model weaves roughly two to six shots, up to about fifteen seconds, into one sequence from a structured prompt. Its subscriptions run from free with a daily credit allotment, up through Standard, Pro, Premier, and an Ultra tier around a hundred and eighty dollars with early four-K and the multi-shot storyboard. On the API, it runs roughly eight and a half cents per second standard up to about seventeen cents per second Pro.
Then there's Sora's storyboard, from OpenAI, which used a timeline-based editor with caption cards describing setting, characters, and actions over time. But this is important. OpenAI discontinued the Sora product around April twenty-sixth of 2026. The app is gone, and Sora 2 and Pro were pulled. The Sora 2 API remains for existing developers until roughly late September of 2026. So treat Sora's storyboard as a pattern to understand, an idea worth knowing, not a tool to send anyone to.
Higgsfield has a storyboarding tool called Popcorn, which generates up to about eight matching scenes with aligned characters, lighting, and tone from your prompts and reference images, and it has an export-to-Sora-2 path, with that same caveat about Sora. Plans start around five dollars a month, credit costs vary by model, and unused monthly credits don't roll over.
Invideo, through its Vision product, runs a script-to-shot-list-to-panel agent workflow. You load a full script, it breaks it into scenes and a shot list, and then a storyboard agent generates per-shot panels. It bundles Sora 2 and Veo 3.1 from around twenty-eight dollars a month.
And Krea is more of an aggregator. Its Pro-and-up plans give you access across Veo 3.1, Kling, Hailuo, Wan, and Runway from a single interface, and it has a Motion Transfer feature that applies the motion from one clip onto a still. It's units-based, from a free daily tier up through Basic, Pro, and Max plans. Krea is genuinely useful for one specific thing, benching the same board frame across several models to see which one nails it.
For the spreadsheet-minded who want pure planning without generation, there are two non-generative tools. StudioBinder gives you a free shot-list and storyboard creator linked to your script and breakdowns. And Boords is a dedicated storyboard tool with a free tier, a Standard plan at forty-four dollars a month with two hundred AI image credits, and a Workflow plan at eighty-nine dollars a month that removes watermarks.
Let me come back to that image-model step, minting your frames, because it's the heart of the do-it-yourself approach. For each shot-list row, you generate one still in your image model, with the character sheet attached for identity, the style reference attached for look-lock, and the framing, angle, and lens all described in words. Those stills are your storyboard for the animatic, and they're your first and last frames for the image-to-video pass. Planning and generation collapse into one artifact. I keep hammering this because it's the reason the AI workflow rewards boarding even more than film ever did.
Now let me name the pitfall, because you will hit it, everyone does. The pitfall is over-generating before planning. The tell, the diagnostic symptom, is a folder of forty gorgeous clips that absolutely refuse to assemble. You recognize it by three signatures. The first is identity drift, where the character's face or wardrobe subtly changes shot to shot, which is pure statelessness. You fix that upstream, with the character sheet as a persistent reference and by boarding every shot from it. The second is style drift, where the palette, lighting, or grain shifts between clips. You fix that with the locked look, the style reference, on every single generation. The third is the won't-cut continuity error, where the character jumps sides of the frame because the one-eighty was broken, or eyelines don't meet, or travel direction reverses, or the aspect ratios and lens looks just don't match. You recognize this one by feeling a jump when you lay two clips side by side, and you'll feel it in a two-minute animatic of stills long before you've paid to generate a single second.
And here's the meta-recognition, the thing to tattoo on your brain. If you find yourself rerolling the same shot more than a couple of times chasing a look, stop generating and go back to the board. The fix is almost always a planning decision. A reference you forgot to attach, a screen-direction note you didn't write, a seed you should have held. It is almost never solved by another spin of the dice, and every extra spin is paid.
Let me contrast the two workflows head-on so you can choose deliberately. The do-it-yourself path is a shot list in a spreadsheet, an image model for your frames, and assembly in an N-L-E. You get maximum control, you stay model-agnostic so you can bench each shot on whatever the current best model is per the leaderboard, and it's the cheapest option. The cost is more manual bookkeeping, because you carry your references and seeds by hand. The all-in-one scene-builder path, whether that's LTX Studio, Flow's Scene Builder, Runway Workflows, Kling's multi-shot, or Invideo Vision, handles persistence for you, with character profiles and project-wide style, and it's faster from script to board to sequence. The cost there is that you're locked to that platform's models and its credit economy, with less shot-by-shot control.
And when is planning actually overkill? Be honest with yourself here. For a single five-to-eight-second clip, a mood or texture loop, a one-shot social post, just prompt it. Don't board a single clip. But planning becomes essential in four cases. One, three or more shots that must feel like one scene. Two, any dialogue exchange, because that means eyelines and the one-eighty. Three, any deadline where rerolls cost real money and time. And four, any piece that has to match an existing brand look.
On the model landscape, I'll point you at the live board rather than freeze a ranking, because these names change monthly. Go to the Artificial Analysis Video Arena, which is an Elo leaderboard built from blind pairwise human votes on the same prompt, and bench on your own shots. As a dated snapshot only, and please verify before you trust it, text-to-video without audio was led by the HappyHorse line, with Dreamina Seedance 2.0, Kling 3.0, and Kling Omni close behind. Text-to-video with audio was led by Dreamina Seedance 2.0. And open weights were led by the LTX-2 family. The durable point, the one that outlives every reshuffle, is to treat every model as interchangeable, plan model-agnostically, and let the board and your own bench of a test shot pick the generator per project.
Finally, let me connect this to what came before and what's coming next. Backward, the character sheet from the consistency episode is your identity reference attached to every board frame and every generation. Keyframe chaining, the last-frame seeding and first and last-frame interpolation, is how adjacent shots stay continuous across a cut. And style look-locking is the style reference keeping palette and lighting stable across independently generated shots. Forward, this board and this shot list feed the assembly edit, where you arrange your generated clips on an N-L-E timeline. That's DaVinci Resolve, the free but pro-grade finisher, CapCut, the fast vertical-social tool with AI-agent automation, or Premiere Pro, the industry standard. Then comes dialogue and lip-sync, and then the directorial-control episodes. The animatic you built today is the skeleton that the assembly edit fleshes out. So plan it now, on paper, while it's free.