
A single clip caps out around five to ten seconds, so you build longer scenes by harvesting the last frame, seeding the next clip, and using first/last-frame interpolation to keep motion flowing across the seam. The catch is the join: generators decelerate the camera at clip ends, so naive concatenation reads as a freeze unless you match momentum and trim the deceleration tail.
This episode pairs an AI video news rundown with a hands-on tutorial on chaining keyframes into continuous multi-shot scenes.
Why clips cap at 5-10 sec, the last-frame to start-frame workflow, the frozen-seam problem (CineLOG), first/last-frame interpolation per model (Kling, Luma, Runway, Pika, Veo 3.1, Wan FLF2V, Seedance), native extend vs manual chaining, fighting temporal drift and color shift (Knot Forcing), and a copyable end-to-end workflow with DaVinci Resolve and CapCut (tensorpix, seedance-2ai).
News rundown for the window of June fifteenth through nineteenth, twenty twenty-six.
Lead story. xAI moved Grok Imagine Video one-point-five from preview to general availability on June sixteenth, across the Imagine API, the web app, and both iOS and Android. A high-speed Fast variant rolled out the same week, and Musk's wider-release push landed June seventeenth. It's an image-to-video model with native synchronized audio built in.
On specs, the press reports a seven-twenty p ceiling, twenty-four frames a second, and six-second standard clips, though the official docs page leaves out resolution and frame rate. Fast mode renders a six-second seven-twenty p clip in about twenty-five seconds, down from forty-plus in the prior version. The improvements they cite are motion physics and object detail. Hair, fabric, moving objects hold together instead of dissolving mid-frame.
Pricing has two tiers of certainty. Verify-grade, straight from xAI's own docs, is eight cents a second, with a sixty-requests-a-minute rate limit. Press framing puts that at eight cents a second at four-eighty p, and four dollars twenty a minute at seven-twenty p, which they pitch as roughly eighty-six percent below Sora 2 Pro's thirty-ish dollars a minute. SuperGrok at thirty a month raises your limits, and there's a free web tier. The catch is the seven-twenty p ceiling while several rivals already do ten-eighty. A higher-res Pro Mode is reportedly coming, no date.
So your next action: test the API at eight cents a second on a single image-to-video shot against your current model, and check whether that seven-twenty p ceiling blocks your delivery spec.
Item two. ByteDance shipped Seedance 2.0 Mini, a cheaper, faster tier in the Seedance 2.0 family, rolling out mid-June first on Dreamina, with broader API access reportedly around June twenty-second. It's roughly twice as fast as Seedance 2.0 Fast at comparable or better quality, around seven-point-three cents a second, about half of standard Seedance 2.0. It keeps the family's multimodal reference system, up to twelve reference inputs: six images, three audio, three video. Text-to-video and image-to-video both. A/B it against Seedance 2.0 Fast and compare cost per acceptable take.
Item three. Runway launched Studio Trim on June eighteenth. Trim, stitch, reorder, and export a final video in one place inside Studio, on all subscription tiers. Finishing moves into the gen tool, less round-tripping to an editor for rough cuts.
And the leaderboard snapshot, from the Artificial Analysis Video Arena, blind human Elo votes that reshuffle monthly. On image-to-video with audio, Dreamina Seedance 2.0 leads, Grok one-point-five preview is second. Without audio, Seedance 2.0 still tops it, with two Grok entries right behind. Press is calling Grok one-point-five the number one i2v model around thirteen-thirty Elo, but the verified Arena standing has it second or third, top of the no-audio cluster behind Seedance. Frame that number one as xAI's framing, not the Arena's. By job: prompt adherence goes to Seedance and HappyHorse; value-per-second fast i2v with audio is Grok at eight cents; the native-audio safe all-rounder is Veo 3.1; human motion and cinematic work goes to Kling 3.0.
Today we're chaining keyframes into continuous multi-shot scenes. This is the start of Act Two, where we stop thinking about one clip and start thinking about a scene, and then a sequence. Everything from Act One I'm going to assume you already know, and I'll just call it by name as we go. Prompt anatomy. Text-to-video versus image-to-video. Minting keyframes, meaning generating start frames in an image model. Aspect ratio, duration, resolution, frame rate. Seeds, negative prompts, the different prompt dialects each model wants. Native audio versus silent. Conversational editing. And cost-per-finished-clip economics, which is going to come back hard at the end. If any of those feel fuzzy, go back to Act One. Today builds straight on top of them.
And one callback to the last episode specifically, Character Consistency, where we covered character sheets, single reference images, native multi-reference, and trained character LoRAs. A LoRA, just to refresh, is a small fine-tune that bakes your character into the model. Keep that episode in your pocket, because chaining is exactly where a character starts to drift, and that toolkit is how you stop it. Drift is the enemy of the whole act, and almost everything we do today is some flavor of fighting it.
Let me set the frame for what we're actually doing. In Act One you made a clip. One prompt, one output, five or ten seconds, done. That's a shot. Today we're making a scene, which is several shots that read as one continuous moment, and then we're stacking scenes into a sequence. The difference matters because the moment you ask a viewer to watch two clips back to back, their eye starts hunting for the join. Your whole job in Act Two is to make that join invisible. Or to make it an intentional cut. But never an accident.
Why Clips Cap Out
Here's the wall you hit first. Almost every generator caps a single generation at five or ten seconds in mid-twenty-twenty-six. That's not a settings problem you can argue your way around. It's baked in. You ask for thirty seconds, the tool says no, or it gives you ten and quietly ignores the rest. So before we chain anything, you need to know exactly where each model's wall sits, because the wall determines how many links you'll need.
Let's walk the lineup. Sora 2 Pro pushes to twenty-five seconds, but that's Pro-only, web-only, and you get there through its Storyboard interface, not a plain prompt box. Kling, across its versions, two-point-one Pro, O1, two-point-six, and three-point-zero, gives you five or ten seconds. Veo 3.1 is native four, six, or eight seconds, and you need the full eight if you want ten-eighty p, 4K, reference images, or extension. So on Veo the long options are gated behind the longest clip. Runway Gen-4 does five and ten as standard, with some configurations stretching sixteen to twenty.
A few more, because the spread is real. Hailuo 02, which is MiniMax, gives you six seconds at any resolution, or ten seconds but only at seven-sixty-eight p or five-twelve p. So length costs you resolution there. Seedance 2.0 is the picky one. It accepts fixed lengths only, four, five, six, eight, ten, twelve, and fifteen seconds, and it flat rejects anything in between. Ask for seven seconds and it says no. And on Seedance, multi-shot prompting, where one generation contains its own internal cuts, starts working around twelve seconds. Pika 2.5 goes up to twenty, even twenty-five seconds. And Wan, the open-weights model, does seven-twenty p with first-last-frame locally in ComfyUI, which is a node-based visual workflow tool for running models on your own machine. We'll come back to Wan more than once today.
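If you want that lineup as something you can plan against, here's a minimal sketch. The caps are just the figures quoted above, not values read from any vendor API, and the helper names links_needed and snap_seedance are made up for illustration. It shows how a clip cap turns into a link count, and how a request for seven seconds on Seedance would have to snap to an accepted length.

```python
import math

# Per-generation caps in seconds, as quoted in this episode.
# These are the episode's figures, not values fetched from a vendor API.
# Hailuo 02 only reaches 10 seconds at reduced resolution.
MAX_CLIP_SECONDS = {
    "Sora 2 Pro": 25,
    "Kling": 10,
    "Veo 3.1": 8,
    "Runway Gen-4": 10,
    "Hailuo 02": 10,
    "Seedance 2.0": 15,
    "Pika 2.5": 25,
}

# Seedance 2.0 accepts only these fixed lengths.
SEEDANCE_LENGTHS = (4, 5, 6, 8, 10, 12, 15)


def links_needed(scene_seconds: float, clip_seconds: float) -> int:
    """Minimum number of clips needed to cover a scene, ignoring trims."""
    return math.ceil(scene_seconds / clip_seconds)


def snap_seedance(requested_seconds: int) -> int:
    """Round a requested length up to the next length Seedance accepts."""
    for length in SEEDANCE_LENGTHS:
        if length >= requested_seconds:
            return length
    return SEEDANCE_LENGTHS[-1]  # anything longer is capped at the maximum


if __name__ == "__main__":
    for model, cap in MAX_CLIP_SECONDS.items():
        print(f"{model}: {links_needed(30, cap)} link(s) for a 30-second scene")
    print("7 seconds on Seedance snaps to", snap_seedance(7))
```

Rounding up rather than to the nearest length is a choice, not a rule. You can always trim a longer clip, but you can't add frames to a short one.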
So why does the cap exist at all? Three reasons, and understanding them tells you why fighting the cap is a losing game. Reason one is cost, and the key word is that cost scales aggressively, not linearly. Think about it. A ten-second clip at twenty-four frames a second is two hundred forty frames. And these models don't generate frames one at a time in isolation. They work across a massive spatiotemporal volume, meaning they model space and time together, all at once. So every second you add doesn't add a fixed chunk of compute, it multiplies it. The math gets ugly fast.
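To put rough numbers on that, here's a toy sketch. The frame count is exact arithmetic. The cost curve is an assumption: it pretends compute grows with the square of clip length, which is only a stand-in for "aggressively, not linearly" and not a measured figure for any real model.

```python
def frame_count(seconds: float, fps: int = 24) -> int:
    """Number of frames in a clip of the given length."""
    return round(seconds * fps)


def relative_cost(seconds: float, baseline_seconds: float = 5.0,
                  exponent: float = 2.0) -> float:
    """Compute cost relative to a baseline clip.

    ASSUMPTION: cost grows as length ** exponent. The default of 2.0 is
    illustrative only; real models may scale better or worse than this.
    """
    return (seconds / baseline_seconds) ** exponent


if __name__ == "__main__":
    for seconds in (5, 10, 30):
        print(f"{seconds:>2}s: {frame_count(seconds)} frames, "
              f"about {relative_cost(seconds):.0f}x the cost of a 5s clip")
```

Under that assumed curve, doubling the length quadruples the cost, which is the intuition behind staying short and chaining.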
Reason two is training data. The models were trained on short clips. They simply never saw a coherent ninety-second continuous shot during training, so they have no idea how to produce one. You're asking for something outside everything they learned. Reason three is memory, and this is the one that bites hardest. Current AI models lack long-range memory and contextual awareness. They don't hold a stable mental model of your character's face, or the room's layout, or the direction the camera was moving ten seconds ago. So when you force length, you get drift. Identity drifts, settings drift, motion drifts. You get motion decay, and you get an accumulating AI sheen, that plasticky over-smoothed look creeping in as the clip runs long.
So flip how you think about the cap. It's not a limit fighting you. It's a guardrail protecting you from the model's own worst output. The cap is roughly where quality falls off a cliff, and the vendors drew the line there on purpose. Which means the answer isn't to push past it. The answer is to stay inside it and chain. Make a string of short, high-quality clips and join them. That's the whole game.
Core Chaining Workflow
The core move is simple to state. You take the final frame of Video A, and you use it as the starting input image for Video B. That last frame becomes a visual bridge, so the motion and the look flow from one clip into the next. The viewer's eye lands on the same image at the join, so there's nothing jarring to catch on.
And notice what this actually is under the hood. It's just image-to-video, the Act One primitive, with one twist. Instead of minting a fresh seed image in an image model, you harvest the seed image from the previous clip. That's it. Everything you already know about image-to-video applies. You're just sourcing the start frame from somewhere new. Hold onto that, because it means all your Act One instincts still work here.
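If you'd rather harvest that frame from a script than from an editor, one option is ffmpeg. This is a sketch, assuming ffmpeg is installed and on your PATH, and the file names are placeholders. It seeks to the last second of the clip and keeps overwriting a single image, so whatever is left on disk is the final frame.

```python
import subprocess


def last_frame_command(clip_path: str, still_path: str,
                       tail_seconds: float = 1.0) -> list[str]:
    """Build an ffmpeg command that writes the final frame of a clip to an image."""
    return [
        "ffmpeg",
        "-y",                          # overwrite the still if it already exists
        "-sseof", f"-{tail_seconds}",  # input option: seek to N seconds before the end
        "-i", clip_path,
        "-update", "1",                # keep overwriting one image, so the last frame wins
        still_path,
    ]


def harvest_last_frame(clip_path: str, still_path: str) -> None:
    """Run the command. Needs ffmpeg on PATH; raises if ffmpeg fails."""
    subprocess.run(last_frame_command(clip_path, still_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call harvest_last_frame() to actually run it.
    print(" ".join(last_frame_command("link_01.mp4", "link_01_last.png")))
```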
The Seam Problem
But here's where the trouble starts, and it's the central problem of this whole episode. Even when two clips share a frame, even when the last frame of A is literally the first frame of B, naive concatenation gives you a noticeable discontinuity. You slap the two clips end to end in a timeline and there's a hiccup. A jump cut. The image is continuous, the picture matches, but the motion isn't continuous. And the human eye is brutally good at catching motion glitches. So what's going on?
There are two root causes, and they're worth understanding because the fix follows directly from them. Root cause one is inconsistent velocity. Generators tend to accelerate the camera over the first frames of a clip and decelerate it toward the final frames. It's like the model eases in and eases out of every shot. So when you append clip B, the camera has just slowed to a stop at the end of A, and now it has to ease in again at the start of B. The result is the camera briefly stops or freezes right at the transition. That's the mechanical origin of what people call the frozen-seam pitfall. The picture hitches because the motion flatlines for a beat.
Root cause two is divergent trajectories. The direction and the acceleration of the camera differ between the two clips. Clip A was drifting left and slowing down. Clip B starts drifting up and speeding up. Same image at the join, totally different momentum. And that mismatch reads as a jolt, a little visual whiplash at the cut.
So how do you hide the seam? Three approaches. First, match camera momentum across the join, so the motion leaving A is the motion entering B. Second, use image-to-video chaining rather than hard concatenation, so the model is actually generating into the transition instead of you butting two finished clips together. And third, use first-last-frame tools, which generate motion into a supplied end frame. VideoGen Extend is one example of this. It generates the motion between the last frame of an original video and a final uploaded frame, and it works to keep lighting, motion, and visual style consistent across that span. The tool is filling the gap with coherent motion instead of leaving you a hard butt-join.
First and Last Frame Interpolation
This brings us to the single most important technique in the act, so let me slow down and contrast it carefully against what you already know. Single-frame image-to-video extrapolates freely. You hand the model one start image and a prompt, and it invents where things go from there. You're suggesting a direction, not specifying a destination. The model has latitude, which is great for discovery and bad for control.
First-last-frame is the opposite posture. Instead of one anchor, you give the model two, a start frame and an end frame, and it interpolates along a defined trajectory toward a known end state. You specify both ends, and the model fills the middle. And this is gold for chaining, for one clean reason. The end frame of one link can be the exact start frame of the next link. You're not hoping the clips line up. You're nailing both ends to known images, so the join is guaranteed to match. That's the difference between extrapolation, where you suggest, and interpolation, where you specify both ends and let the model connect them.
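Here's that idea as a tiny data structure, purely as a planning sketch. The Link class and build_chain helper are hypothetical names, and the frame paths are placeholders. The point it makes is structural: with N prompts you need N plus one keyframes, and every interior keyframe is shared by two links.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One first-last-frame generation: two anchors and a prompt."""
    start_frame: str
    end_frame: str
    prompt: str


def build_chain(keyframes: list[str], prompts: list[str]) -> list[Link]:
    """Pair consecutive keyframes so each link ends where the next begins."""
    if len(keyframes) != len(prompts) + 1:
        raise ValueError("need exactly one more keyframe than prompts")
    return [
        Link(keyframes[i], keyframes[i + 1], prompts[i])
        for i in range(len(prompts))
    ]


if __name__ == "__main__":
    chain = build_chain(
        ["kf_0.png", "kf_1.png", "kf_2.png", "kf_3.png"],
        ["slow push in", "continue push in", "settle on subject"],
    )
    for link in chain:
        print(f"{link.start_frame} -> {link.end_frame}: {link.prompt}")
```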
Now, every model calls this something different, and the naming is a mess, so let me translate the dialect for you, model by model. Kling calls it Start slash End Frame. On Kling O1 specifically it's labeled First Frame to Last Frame, and it's also on two-point-one Pro and three-point-zero, at five or ten seconds. Luma Dream Machine calls it Keyframes, a start plus an end. And Luma's Ray3 Modify adds start and end frames to video-to-video editing, where you can change the lighting, swap seasons, even morph a subject from one thing into another.
Keep going, because you'll meet these names in the wild. Runway calls its version Extend Video, which works off the last frame, plus there's a keyframe feature on Gen-4.5. Pika calls it Pikaframes, and Pika's version is generous. Up to five keyframes, with custom transition lengths and custom prompts for each segment, running twenty to twenty-five seconds total. So Pika lets you script a multi-beat sequence in one tool. Veo 3.1 calls it first-last frame to video. You give the start as an image and the end via a last-frame input, and you get eight-second clips out of it.
And then there's Wan, the open-weights model, which is the technically interesting one. Wan calls it FLF2V, first-last-frame to video. Its ComfyUI node enforces boundary conditions at the start and the end of the clip, meaning at time equals zero and time equals one it hard-locks the supplied frames. As a result it matches the first and last frames at around ninety-eight percent, at seven-twenty p, running natively in ComfyUI. That ninety-eight percent number is the highest boundary match in the lineup, and it's not an accident. Wan hits it precisely because it treats the start and end as hard boundary conditions rather than soft suggestions. Open-weights gives you tighter control than some of the hosted models here. Seedance 2.0 also does first-plus-last anchoring, combined with multi-shot, and it accepts up to nine reference images.
Now the caveat, and you have to internalize this or you'll get burned. The end frame is a strong directional guide. It is not pixel-perfect. The final generated frame may not match your reference exactly, even on the models that try hard. So the rule is, use subtle endpoint changes, not dramatic transformations. If your start frame and your end frame are close cousins, a small camera push, a slight turn of the head, the model interpolates cleanly. But if you ask it to get from a wide daylight street to an extreme close-up at night in one link, you're demanding an extreme change, and an extreme change equals a smear. The model hallucinates some melting nonsense to bridge a gap that big. Subtle bridges, clean motion. Big jumps, garbage. That's the whole caveat in one line.
Native Extend Versus Manual Chaining
Now, some models offer to do all of this for you automatically, and that's worth weighing honestly, because there's a real tradeoff. Take Runway Gen-4 Extend. It takes the final frame as the start of the next block automatically. You don't harvest anything by hand, it just continues. Runway's marketing claims a Temporal Anchor giving around ninety-nine percent consistency, and says you can chain indefinitely. But, and this is the real-world part, it degrades after two to three minutes regardless. Runway also markets a Story Block feature up to five minutes. I'd flag both the indefinite claim and the five-minute number as vendor marketing. Treat them as ceilings nobody actually reaches with quality intact.
Veo 3.1 extension works differently. Each extension adds seven seconds, and it's repeatable up to twenty times, so the clip you feed back in can be as long as a hundred forty-one seconds. Start from an eight-second base and a full run of twenty extensions works out to a hundred forty-eight seconds. That sounds huge. But there's a catch that ties straight back to the cap discussion. Extensions are limited to seven-twenty p. The moment you extend, you lose ten-eighty p and you lose 4K. So you can have long, or you can have sharp, not both on Veo. Sora 2's Storyboard gives you a timeline of beats. Pika 2.5 Studio gives you a timeline and a layer editor. Both are timeline-style tools for laying out a sequence.
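The Veo numbers are easy to sanity-check. This sketch assumes the chain starts from an eight-second base clip, which is the clip length quoted earlier for Veo features like extension, and uses the seven-second step and twenty-extension limit from above. The function name is made up for illustration.

```python
VEO_BASE_SECONDS = 8       # assumed starting clip length
VEO_EXTENSION_SECONDS = 7  # seconds added per extension
VEO_MAX_EXTENSIONS = 20    # repeat limit quoted in the episode


def veo_extended_length(extensions: int) -> int:
    """Total clip length in seconds after a given number of extensions."""
    if not 0 <= extensions <= VEO_MAX_EXTENSIONS:
        raise ValueError(f"extensions must be between 0 and {VEO_MAX_EXTENSIONS}")
    return VEO_BASE_SECONDS + extensions * VEO_EXTENSION_SECONDS


if __name__ == "__main__":
    for n in (0, 1, 19, 20):
        print(f"{n:>2} extensions: {veo_extended_length(n)} seconds")
```

On those assumptions, nineteen extensions gives the hundred-forty-one-second clip that goes in for the final pass, and all of it is at seven-twenty p.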
So here's the tradeoff laid out plainly. Native extend is convenient, and it's vendor-tuned, meaning the company optimized the seam-hiding for their own model. But it locks you to one model for the whole sequence, it often downgrades resolution, like Veo dropping to seven-twenty p, and the consistency still decays no matter what the marketing says. Manual last-frame chaining is more work, because you're harvesting and re-seeding frames by hand. But it's model-agnostic. You can re-mint or upscale the seed frame between links, and you can switch models per shot, using one model for the wide and another for the close-up. The price is that you own the seam-hiding and the color matching yourself. You're trading convenience for control.
And then there's the third path, which is Wan FLF2V running in ComfyUI. That gives you the tightest boundary control of anyone, that ninety-eight percent number, and it's free and it's local. But it's Act Three complexity. Node graphs, a local GPU, real setup. So it's not for this episode. Just know it exists, and know it's where the ceiling on control actually lives, for when you're ready to get your hands dirty later.
Keeping Motion Continuous Across the Seam
Okay, let's get practical about making seams disappear, because there are four levers and you'll pull all of them. The first lever is lighting. Inconsistent lighting between your first and last frame is poison. If the light source moves or changes between the anchors, you get shadows that move in contradictory directions and skin tones that shift mid-motion. The fix is to match the lighting at your anchors. Same key light, same direction, same color temperature on both the end of A and the start of B. If the anchors agree on light, the motion between them stays believable.
The second lever is pose continuity. Keep your compositions similar across the join. Small positional shifts, not extreme camera moves. Remember why, it's the same logic as the endpoint caveat. An extreme move forces the model to hallucinate a complex transition, and that's where you get smear and morph. So a character who's standing center-frame at the end of A should be standing roughly center-frame at the start of B. Nudge, don't leap.
The third lever is camera momentum, which is really about framing consistency. Maintain consistent framing across the seam. If you cut from a wide establishing shot directly into an extreme close-up using a chain, the model tries to interpolate across that enormous scale gap and gives you smeared, morphing output. Wide to extreme-close in one chained link is a recipe for goo. If you want that wide-to-close move, that's a hard cut, not a chain. We'll get to that distinction in the workflow.
The fourth lever is the frozen-frame mechanic, and now you understand its root, because we covered it back in the seam problem. The camera decelerates at the clip's end. So if the next clip starts from rest, the join reads as a freeze, that little hitch. Two ways to fix it. One, prompt continued motion at the start of the next clip, explicitly telling B that the camera is already moving when it begins, so it doesn't ease in from a standstill. Two, trim the decelerating tail frames before the seam, literally cut off the last few frames of A where the camera was slowing down, so you join on motion that's still live. Either one kills the freeze.
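To see why trimming the tail helps, here's a toy model. It assumes the camera follows a smoothstep ease-in, ease-out curve across each clip, which is an illustrative stand-in and not how any particular generator actually moves. The trim_tail helper and its keep_ratio threshold are likewise hypothetical.

```python
def eased_position(t: float) -> float:
    """Smoothstep easing: the move starts and ends at rest (toy assumption)."""
    return 3 * t**2 - 2 * t**3


def frame_speeds(frames: int) -> list[float]:
    """Camera travel per frame, as a fraction of the whole move."""
    positions = [eased_position(i / frames) for i in range(frames + 1)]
    return [b - a for a, b in zip(positions, positions[1:])]


def trim_tail(speeds: list[float], keep_ratio: float = 0.25) -> list[float]:
    """Drop trailing frames whose speed has fallen below keep_ratio of peak."""
    peak = max(speeds)
    end = len(speeds)
    while end > 1 and speeds[end - 1] < keep_ratio * peak:
        end -= 1
    return speeds[:end]


if __name__ == "__main__":
    speeds = frame_speeds(120)  # a 5-second clip at 24 fps
    trimmed = trim_tail(speeds)
    peak = max(speeds)
    print(f"untrimmed last frame: {speeds[-1] / peak:.1%} of peak speed")
    print(f"trimmed last frame:   {trimmed[-1] / peak:.1%} of peak speed")
    print(f"frames removed:       {len(speeds) - len(trimmed)}")
```

In this toy, the untrimmed clip ends almost at a standstill, and cutting a handful of frames leaves you joining on motion that's still live. It only handles the tail of A. The ease-in at the head of B is what the continued-motion prompt is for.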
Pitfalls That Compound
Now the bad news, and you need to hear all of it, because these are the failures that sneak up on long chains specifically. The big one is temporal drift, also called autoregressive error accumulation. Here's the mechanism. Minor errors in the early frames carry forward into every later link, compound on each other, and eventually end in semantic collapse, where the model loses track of what things even are. It shows up as periodic oscillation in background color, sudden object deformation, and abrupt motion shifts. And critically, it gets worse the longer the chain runs. A two-link chain might look fine. A ten-link chain can fall apart by the end.
Color shift deserves its own callout, because it's the most common version of this. The color and exposure drift across links, and it becomes more severe as the video grows longer. Link one is neutral, link three is slightly warm, link six is noticeably orange, and you didn't change anything. The drift just compounds. Keep that in mind, because we fight it directly in the assembly step with color matching.
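You can put a number on that drift before you ever open an editor. This is a deliberately crude sketch: it compares average red, green, and blue between a reference frame and a drifted one and computes per-channel gains to pull the drifted frame back. It is not what an editor's color-match tool does internally, and the function names are made up. Frames here are plain lists of RGB tuples so the example runs with no image library.

```python
Pixel = tuple[int, int, int]


def channel_means(frame: list[Pixel]) -> tuple[float, ...]:
    """Average red, green, and blue across all pixels in a frame."""
    count = len(frame)
    return tuple(sum(pixel[c] for pixel in frame) / count for c in range(3))


def match_gains(reference: list[Pixel], drifted: list[Pixel]) -> tuple[float, ...]:
    """Per-channel multipliers that move the drifted means onto the reference."""
    ref = channel_means(reference)
    cur = channel_means(drifted)
    return tuple(r / c if c else 1.0 for r, c in zip(ref, cur))


def apply_gains(frame: list[Pixel], gains: tuple[float, ...]) -> list[Pixel]:
    """Scale each channel and clamp to the 0-255 range."""
    return [
        tuple(min(255, max(0, round(value * gain))) for value, gain in zip(pixel, gains))
        for pixel in frame
    ]


if __name__ == "__main__":
    link_one = [(100, 100, 100)] * 4  # neutral gray
    link_six = [(120, 100, 80)] * 4   # same scene, drifted warm
    gains = match_gains(link_one, link_six)
    print("gains:", tuple(round(g, 3) for g in gains))
    print("corrected:", apply_gains(link_six, gains)[0])
```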
Then there's blur and noise creep, and the cause here is subtle and important. The model was trained on clean videos. But during chaining it conditions on its own imperfect output as the history it builds from. So any artifacts in your first frame get amplified through the motion, and they compound link over link. This is exactly why you don't just blindly harvest the last frame and slam it into the next clip. You extract a clean last frame and you re-mint it, or upscale it, in an image model before chaining. Cleaning the seed frame between links keeps the whole chain from rotting from the inside. Think of it as wiping the lens before each new shot.
There are more pitfalls, and they're quicker to name. Aspect-ratio mismatch, like feeding a landscape frame into a portrait video, which mangles the geometry. Resolution inadequacy, where your source frame is below ten-twenty-four by ten-twenty-four, and the recommendation is your first frame should be at least nineteen-twenty by ten-eighty. Overly complex scenes, which just give the model more to lose track of. Motion-direction mismatch between links, which is the divergent-trajectory jolt again. And second-clip identity drift, which ties straight back to Character Consistency. Your character looks right in clip A and subtly wrong by clip B. The fix is to re-feed the character sheet, or the multi-reference set, or the LoRA, into every single link, not just into clip A. The model has no memory, remember, so every link needs the reference fed fresh.
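Two of those pitfalls are cheap to catch before you spend a credit. Here's a sketch of a preflight check for a seed frame. The thresholds come from the numbers above, but how they're applied is my interpretation: the floor is read as "shortest side under ten-twenty-four," and the aspect tolerance of 0.01 is an arbitrary choice.

```python
def preflight(frame_size: tuple[int, int], video_size: tuple[int, int]) -> list[str]:
    """Return a list of problems with a seed frame; empty means it passes."""
    frame_w, frame_h = frame_size
    video_w, video_h = video_size
    problems = []

    # Aspect-ratio mismatch, such as a landscape frame into a portrait video.
    if abs(frame_w / frame_h - video_w / video_h) > 0.01:
        problems.append("aspect ratio mismatch between seed frame and target video")

    # Resolution: hard floor first, then the recommended size.
    shortest, longest = min(frame_w, frame_h), max(frame_w, frame_h)
    if shortest < 1024:
        problems.append("seed frame is below the 1024 pixel floor")
    elif longest < 1920 or shortest < 1080:
        problems.append("seed frame is below the recommended 1920x1080")

    return problems


if __name__ == "__main__":
    print(preflight((1920, 1080), (1920, 1080)))  # passes
    print(preflight((1920, 1080), (1080, 1920)))  # landscape into portrait
    print(preflight((768, 768), (768, 768)))      # too small
```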
And the mitigations all stack together. Use subtle endpoint changes. Use longer durations, say ten to fifteen seconds, for big motion arcs, so the model has room to move smoothly instead of rushing. Match the color grading and lighting between frames. Regenerate any artifacted frame instead of letting it ride. And here's the line to tattoo on your wall. Five extra minutes perfecting your first frame can save hours of regeneration. The first frame is the seed of everything downstream, and every flaw in it gets multiplied. So front-load the care.
Copyable Workflow
Let me give you the actual sequence of steps you'll run, start to finish, in spoken order so you can follow along.
Step one, storyboard the beats. Before you generate anything, decide which seams are continuous and which are hard cuts. A continuous seam is one you'll chain, where the camera flows unbroken from one shot to the next. A hard cut is two separate shots that you generate independently with no seam work at all, because the viewer expects a cut there anyway. This decision saves you enormous effort, because hard cuts are free. You only do seam-hiding labor on the continuous joins. So map your scene and label every join, chain or cut.
Step two, mint or extract your keyframes. Generate your first frames in an image model, that's the Act One keyframe-minting move. And here's where last episode pays off, reuse the character sheet, the references, or the LoRA from Character Consistency for every link, so your character holds across the whole scene. Make your first frame at least nineteen-twenty by ten-eighty, because resolution headroom up front saves you from the resolution-inadequacy pitfall later.
Step three, generate the links. Run image-to-video with your start frame. Add an end frame where you're using interpolation or keyframe tools, the first-last-frame stuff we covered. And as you generate, match lighting, pose, and camera framing between the anchors, and prompt continued motion so the clip doesn't start from rest. That's all four levers from before. This is the actual production step where the clips get made.
Step four, extract a clean last frame for the next link. Pull the final still out of your editor, and then re-mint or upscale it in image-to-image to remove the accumulated artifacts before you seed clip B. This is the rot-prevention step. Never feed a dirty frame forward. Clean it, then chain it.
Step five, trim the overlapping or decelerating frames at the seam. Cut the camera-deceleration tail off the end of each clip so you join on live motion, killing the freeze. And where the motion between two links is too divergent to match cleanly, add a short crossfade to smooth the jolt. A crossfade isn't cheating, it's a legitimate seam tool when momentum just won't line up.
Step six, assemble in an editor. Lay the links down in order, color-match them to fight the exposure drift we talked about, and export. That color-match pass is your direct counter to the accumulating color shift across links.
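If you ever assemble from a script instead of an editor, the fiddly part of a crossfade is the timing. This sketch builds a filter string for ffmpeg's xfade filter. It assumes every join gets a crossfade, that all clips share resolution and frame rate, and that the durations you pass in are the lengths after trimming. The helper names are made up. In practice you'd crossfade only the divergent joins and straight-cut the rest.

```python
def xfade_offsets(durations: list[float], fade: float) -> list[float]:
    """Start time of each crossfade, measured on the growing output timeline.

    Each fade overlaps two clips, so the output shrinks by `fade` per join.
    """
    offsets = []
    running = 0.0
    for join_number, duration in enumerate(durations[:-1], start=1):
        running += duration
        offsets.append(round(running - join_number * fade, 3))
    return offsets


def xfade_filter(durations: list[float], fade: float) -> str:
    """Build a filter_complex string chaining one xfade per join."""
    parts = []
    previous = "[0:v]"
    for i, offset in enumerate(xfade_offsets(durations, fade), start=1):
        label = f"[v{i}]"
        parts.append(
            f"{previous}[{i}:v]xfade=transition=fade:"
            f"duration={fade}:offset={offset}{label}"
        )
        previous = label
    return ";".join(parts)


if __name__ == "__main__":
    # Three 10-second links with quarter-second crossfades.
    print(xfade_filter([10.0, 10.0, 10.0], 0.25))
```

You'd pass that string to ffmpeg's filter_complex option with one input per clip and map the last label as the video output.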
A word on editors, since I keep mentioning them. DaVinci Resolve is a free, pro-grade, all-in-one editor, meaning editing, color, and audio all live in one app. To pull a frame in Resolve, you can go File, then Export, then Export Current Frame as Still, which works in version eighteen-point-five and up. Or on the Color page you can Grab Still into the Gallery. Either gets you that clean frame for harvesting. The other option is CapCut, which is a free, fast editor built for vertical social video. CapCut auto-detects scene changes, and it has free crossfade templates, which are exactly what you want for that divergent-motion crossfade back in step five. So Resolve for the full pro workflow, CapCut for fast vertical turnaround.
Let me make this concrete by building one thirty-second scene, shot by shot, so you can see the whole machine turning. Say the scene is a detective walking down a rainy alley toward a doorway, and it's one continuous push-in. Beat one, the detective stands at the mouth of the alley, wide shot, camera starting to drift forward. Beat two, she's mid-alley passing a flickering neon sign, camera still pushing. Beat three, she reaches the door and her hand comes up to the handle, camera settling. That's three beats, and at roughly ten seconds a link, that's three links for thirty seconds, all continuous, no hard cuts.
Now I run the workflow on it. Storyboard says all three joins are continuous chains, since it's one unbroken push-in. I mint the first frame in an image model, the detective at the alley mouth at nineteen-twenty by ten-eighty, and I feed her character sheet from Character Consistency so her face is locked. I generate link one as first-last-frame, start at the alley mouth, end frame is her a few steps in with the camera mid-push. I match the rainy blue-gray lighting at both anchors. Then I extract that end frame, upscale it in image-to-image to strip any wet-pavement noise that crept in, and that clean frame becomes the start of link two. Link two ends with her at the neon sign. I clean that frame, seed link three, which ends at the door. At each link I re-feed her character sheet so she doesn't drift into a different person by the door.
Then I trim. Each link decelerated slightly at its tail, so I cut a few frames off each before the join, so the push-in stays live across all three seams. The neon-to-door join had slightly divergent camera drift, so I drop a quarter-second crossfade there in CapCut. Finally I bring all three into Resolve, color-match link three back toward link one because it had warmed up a touch from drift, and export. Thirty seconds, three links, one continuous push, no visible seams. That's the entire act in one scene.
Pricing
Now let's talk money, because chaining changes the math in a way the pricing pages don't tell you. Start with the headline numbers. Kling three-point-zero silent is six credits a second at seven-twenty p, and eight credits a second at ten-eighty p. Turn on native audio and that bumps to nine and twelve credits a second respectively. So a five-second silent clip at seven-twenty p is thirty credits. A ten-second clip at ten-eighty p is eighty credits. And note, Kling charges strictly per second, so ten seconds costs exactly twice what five seconds costs, no discount for length. Kling two-point-six runs about thirty-five credits per five seconds, and about seventy for ten.
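Those Kling numbers reduce to a lookup and a multiply. Here's a sketch using only the rates quoted above. The table and function names are made up, and you should check the current pricing page before relying on any of it.

```python
# Kling 3.0 credits per second, keyed by (resolution, native audio on).
# Rates are the ones quoted in this episode.
KLING_3_CREDITS_PER_SECOND = {
    ("720p", False): 6,
    ("1080p", False): 8,
    ("720p", True): 9,
    ("1080p", True): 12,
}


def kling_credits(seconds: int, resolution: str, audio: bool = False) -> int:
    """Credits for one Kling 3.0 clip; pricing is strictly per second."""
    return seconds * KLING_3_CREDITS_PER_SECOND[(resolution, audio)]


if __name__ == "__main__":
    print("5s 720p silent:", kling_credits(5, "720p"))
    print("10s 1080p silent:", kling_credits(10, "1080p"))
    print("10s 1080p with audio:", kling_credits(10, "1080p", audio=True))
```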
Veo 3.1 is priced in dollars. It's about fifteen cents a second on Fast with audio, and about forty cents a second on Standard. Veo 3.1 Lite is cheaper, around five cents a second, or as low as three cents a second at seven-twenty p with no audio on Vertex, which is Google's cloud platform. So an eight-second Standard clip lands around three dollars and twenty cents. Keep that figure in mind, because now we multiply it.
Here's the chaining cost reality. A thirty-second continuous scene, like the alley I just walked through, is roughly three to six links depending on your clip length. So take that alley scene at three links of Veo Standard, three dollars twenty each, and your best case is about nine dollars and sixty cents. Strictly, three eight-second Veo links only cover twenty-four seconds, so a full thirty takes a fourth link and about twelve dollars eighty. But that's the best case, and best case rarely happens. You have to budget for re-rolls, because drift forces regeneration. Some links will smear, some will drift in color, some will let the character's face wander, and you'll regenerate them, maybe two or three times each. So your real cost-per-finished-second is meaningfully higher than the headline per-clip price. Even the three-link version might cost you twenty or thirty dollars by the time every link actually works.
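Here's that budget as arithmetic you can rerun with your own numbers. It's a sketch using the Veo Standard rate of forty cents a second and eight-second links from above. The attempts_per_link figure is a planning guess, not a measured re-roll rate, and the function names are made up.

```python
def scene_cost(links: int, seconds_per_link: float, price_per_second: float,
               attempts_per_link: float = 1.0) -> float:
    """Total spend for a chained scene, counting every attempt at every link."""
    total = links * seconds_per_link * price_per_second * attempts_per_link
    return round(total, 2)


def cost_per_finished_second(links: int, seconds_per_link: float,
                             price_per_second: float,
                             attempts_per_link: float = 1.0) -> float:
    """Spend divided by the seconds that actually make it into the scene."""
    total = scene_cost(links, seconds_per_link, price_per_second, attempts_per_link)
    return round(total / (links * seconds_per_link), 2)


if __name__ == "__main__":
    for attempts in (1, 2, 3):
        total = scene_cost(3, 8, 0.40, attempts)
        per_second = cost_per_finished_second(3, 8, 0.40, attempts)
        print(f"{attempts} attempt(s) per link: ${total:.2f} total, "
              f"${per_second:.2f} per finished second")
```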
This ties straight back to Act One's cost-per-finished-clip economics, just scaled up to a whole scene. The pricing-page number is your absolute best case, the world where every link works on the first try. Plan for the world where they don't. Budget the re-rolls into the scene, not just the clips, and you won't get a nasty surprise on the invoice.
Benchmarking
Last thing, picking your model, and there's a trap here I want you to sidestep. The go-to resource is Artificial Analysis Video Arena. It runs blind A-B Elo voting, meaning people vote on which of two clips is better without knowing which model made them, and it scores models with an Elo rating like chess. It keeps separate boards for text-to-video and image-to-video, and it's genuinely good signal for raw single-clip quality.
But here's the catch, and it's a big one for us. The Arena ranks single-clip quality. It does not rank chaining or continuity. A model can win on a gorgeous standalone six-second clip and then freeze at every single seam when you try to chain it. The thing the leaderboard measures and the thing you need are different things. The leaderboard tests the shot. The act needs the scene.
So don't pick your chaining model off the leaderboard alone. The only reliable test is to bench on your own shots. Your character, your specific camera move, your actual seam. Because seam behavior is model-specific and shot-specific. A model that handles a slow push-in beautifully might wreck a whip-pan. The only test that tells you whether a model chains well is your chain.