
A still you approve beats gambling on text-to-video, so this episode shows you how to mint a start frame in an image model and hand it to the video stage for motion. We cover the snapshot roster, prompting a frame with somewhere to go, matching aspect ratio and resolution, the full round trip, and the pitfalls you will actually hit.
A slow news week, then the core craft move: minting your start frame in an image model before the video stage ever runs.
News rundown. A rare quiet week for frontier video models (June 1-7, 2026). No new generation model or version bump from Google, ByteDance, Kuaishou, Runway, Luma, MiniMax-Hailuo, or OpenAI inside the window. The live stories are continuing ones: Google's Gemini Omni Flash (unveiled at I/O on May 19) is still consumer-only, with the developer API promised "in the coming weeks"; ByteDance's Seedance 2.0 still has no public developer API amid Hollywood copyright disputes; and MiniMax M3 shipped June 1 but it is a multimodal LLM with native video understanding, not a generator. Wildcards: a reported OpenAI-Disney licensed-character deal, Sora 2's Videos API shutting down Sept 24, 2026, and C2PA plus SynthID watermarking now standard (plan for EU AI Act / California labeling before August). Video Arena snapshot: Seedance 2.0 leads image-to-video and the with-audio board; treat any single Elo as a moving snapshot.
Tutorial: minting keyframes. Image-to-video beats text-to-video on control because a start frame locks composition, lighting, and style (why a still wins, start/end frame). The interchangeable snapshot roster: Google's Nano Banana family (up to true 4K, 500 free images/day), FLUX.2 (open-weight, self-hostable), Seedream 4.5 (4K, deterministic seeds), Imagen 4, Midjourney V8.1, and Ideogram 4.0 (best text). The thesis: prompt a frame with somewhere to go, implied motion not frozen, sharp focus, depth layers, low clutter, at the exact target aspect ratio and highest resolution the video stage accepts. Scene goes in the image prompt; motion goes in the video prompt. Pitfalls: gorgeous frames that won't animate, text that warps in motion, aspect/resolution mismatch, morphing, and SynthID watermark carry-through.
Bench your own shot, and verify the roster before you rely on it; these models and leaderboards churn monthly. AI-generated podcast by OCDevel.
Let's start with the news, and the honest headline this week is that it was quiet. The strict window, the first week of June twenty twenty-six, was a rare slow week on the model front. No major lab shipped a new video generator or a version bump. Not Google, not ByteDance, not Kuaishou, not Runway, not Luma, not MiniMax, not OpenAI. So instead of a launch parade, here's where things actually stand.
The first continuing story is Google's Gemini Omni Flash. Google unveiled the Omni family at its developer conference back on May nineteenth, and Flash, the first model in that family, started rolling out the same day inside the Gemini app, inside Google's filmmaking tool Flow, and free inside YouTube's Create and Shorts Remix features for eligible adults. It generates from text, image, audio, and video references, and unlike the generation-first rivals it also edits by conversation. Flash renders about ten seconds per clip, reportedly a distribution choice rather than a hard limit, and it carries Google's invisible SynthID watermark. The catch for us: as of this week, the developer and enterprise interface still has not launched. Google said "in the coming weeks." Until it does, Omni is locked to consumer surfaces, not a pipeline tool. If you're curious, try it free inside YouTube Create and judge whether that editing loop is worth re-tooling for.
Second continuing story: ByteDance's Seedance two point oh. It launched back in February through ByteDance's consumer apps, and it tops most public benchmarks, but the official developer interface still hasn't been released, reportedly held back while copyright disputes with Hollywood studios stay unresolved. It ships with C2PA provenance watermarking built in, the content credentials standard, and TikTok has reportedly labeled over a billion AI videos using it. The takeaway: the field's quality leader is still effectively consumer-only, which rules it out for programmatic work. Keep it for hero shots, but don't architect a production pipeline around it yet.
The one hard-dated launch inside the window is MiniMax M3, on June first. But it's a multimodal large language model with native video understanding, not a generator. It takes video in and produces text out, so think of it as a captioning, quality-control, or prompt-rewriting brain over footage, not a clip maker. MiniMax's actual generator, Hailuo two point three, got no update this week.
A few wildcards, all of them reported rather than confirmed: OpenAI and Disney are said to have a roughly one-billion-dollar licensed-character deal. OpenAI's Sora app is winding down, with its programmatic video interface scheduled to shut off September twenty-fourth, a real migration deadline. And on provenance, watermarking is now standard across the field, with commentary flagging European and California labeling rules as a plan-it-before-August item. On the live Video Arena leaderboard, Seedance still leads image-to-video and the with-audio board, but treat every score as a moving snapshot. That's the landscape.
Now the main event. Today we're learning to mint keyframes, which is a fancy way of saying we're going to build the very first frame of our video inside an image model, on purpose, and then hand that frame to the video model and tell it to animate. If you've been following along, this is the natural next rung on the ladder. We talked about start-frame control in the image-to-video episode, and we talked about aspect ratio and resolution constraints a couple of episodes back. Today we connect those two ideas into a single, repeatable move.
Let me define our terms up front so nothing trips you later. A keyframe, in our context, is just a still image that anchors a moment in your video. The start frame is the keyframe at the very beginning, the first image the video model sees. Some tools also let you set an end frame, the last image. Image-to-video means you give the model a picture to start from. Text-to-video means you give it only words and let it invent everything. That distinction is the whole reason this episode exists.
Here's why we go to the trouble of minting a start frame instead of just typing a prompt into the video model. Image-to-video gives you much higher controllability than text-to-video, because a still you approve pins down the face and the composition of your subject before any motion is generated. When you type a text prompt into a video model, you're gambling. You're rolling dice on what the subject looks like, how the shot is framed, what the lighting does, all at once, while also trying to control motion. When you mint a start frame, you separate those problems. You decide the opening image, you sign off on it, and the video model only has to figure out the motion in between. You lock the picture and let the AI fill the middle.
Think about what that buys you in terms of reproducibility. In the reproducibility episode we talked about seeds, the number that makes a model's random choices repeatable. A seed is useful, but a minted start frame is a stronger lever than a seed. With a seed you re-roll and pray. With a start frame you lock the exact composition, the exact lighting, the exact style you already signed off on, and only the motion stays variable. That's a much smaller surface of uncertainty. Google describes the image model as the layout and logic stage: you set your scene, your text, your characters in the image model, and then you push those frames into the video models for motion. The image model decides what; the video model decides how it moves.
And the good news is this workflow is universal now. As of twenty twenty-six, first-frame upload is supported everywhere that matters. Seedance two point oh, Kling three point oh, Runway's Gen-four, Pika two point oh, Sora, Google's Veo three, Luma's Dream Machine, they all let you upload a start frame, and many support an end frame too. Start and end frame control is now a standard feature, not an exotic one. So the skill we're building today applies no matter which video model you land on.
Let me walk you through the roster of image models you'll actually use to mint these frames. I want to stress one thing before I name names: this field reshuffles monthly. Verify the roster before you rely on it. The point of naming them is to show you the categories and the trade-offs, not to crown a permanent winner.
First, Google's Nano Banana family. Nano Banana is the nickname for Google's native image generation inside Gemini, and as of early twenty twenty-six it's a family of three. The original Nano Banana is Gemini two point five Flash Image, from last September, around four cents an image, known for multi-image blending, character consistency, and natural-language edits. Then there's Nano Banana Pro, which is Gemini three Pro Image. This one is reasoning-driven, meaning it plans the scene before it renders. It does accurate multilingual text, character consistency, accepts up to fourteen reference images, holds consistency for up to five people in a shot, and goes up to true four-K, that's four thousand ninety-six pixels on a side. It costs around thirteen cents an image at two-K and about twenty-four cents at four-K. The third is Nano Banana two, Gemini three point one Flash Image, from February, around seven cents an image at one-K. And here's the kicker for hobbyists: Google's free tier is the most generous of the majors, up to five hundred images a day free through Google AI Studio. Nano Banana Pro is also rolling into Flow, Google's AI filmmaking tool, where it's explicitly positioned as a frame-minter. That's exactly our use case.
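To put those per-image prices in perspective, here is a minimal Python sketch of what a batch of candidate frames costs. The prices are the approximate figures quoted above and will drift; the dictionary keys and the function name are made-up labels for illustration, not real API model IDs.

```python
# Approximate per-image prices quoted in this episode, in US cents.
# Snapshot only: check current pricing before budgeting.
# The keys are hypothetical labels, not official model identifiers.
PRICE_CENTS = {
    "nano-banana": 4,          # Gemini 2.5 Flash Image
    "nano-banana-2@1k": 7,     # Gemini 3.1 Flash Image at 1K
    "nano-banana-pro@2k": 13,  # Gemini 3 Pro Image at 2K
    "nano-banana-pro@4k": 24,  # Gemini 3 Pro Image at 4K
}


def batch_cost_dollars(model_key: str, candidates: int) -> float:
    """Cost in dollars of generating `candidates` frames on one tier."""
    return PRICE_CENTS[model_key] * candidates / 100


print(batch_cost_dollars("nano-banana-pro@2k", 20))  # 2.6
print(batch_cost_dollars("nano-banana-pro@4k", 20))  # 4.8
```

Twenty candidates at four-K come to under five dollars at these prices, which is why minting at high resolution is usually an easy call.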
Second, Black Forest Labs FLUX point two, announced late last November. This one matters because of its open-weight angle, meaning you can download and run the model yourself instead of only calling someone's server. It comes in variants. There's pro, the closed high-quality one. There's flex, which gives developers control over steps and the guidance scale, basically how hard the model leans on your prompt. There's dev, a thirty-two-billion-parameter open-weight model, the most powerful open-weight image generator and editor available. And there's klein, an Apache-licensed distilled version, only four billion parameters, small enough to fit in about thirteen gigabytes of video memory, the kind you'd find on a consumer graphics card. FLUX point two goes up to four megapixels, handles up to ten reference images for character, object, and style consistency with no fine-tuning, and models real-world lighting and physics to reduce that telltale AI look. Fair warning: the full dev checkpoint needs around ninety gigabytes of video memory to load, so that one's for serious local rigs or cloud hosting. The open-weight option is the differentiator if you want a repeatable, self-hosted pipeline that doesn't depend on anyone's interface staying live.
Third, ByteDance Seedream, not to be confused with Seedance, their video model. Seedream four point oh goes up to four-K across one-K, two-K, and four-K tiers, and you can ask for up to fifteen images at once. Seedream four point five, from December, natively hits four-K, improves typography and dense text, generates in roughly five to fifteen seconds, offers eight aspect ratios, and gives you up to six generations per call. Crucially for our purposes, it supports seed values for deterministic reproduction across runs, which makes it genuinely useful for A/B testing two versions of a frame.
Fourth, Google Imagen four, a separate line from Nano Banana. It comes in three tiers. Ultra is max fidelity at native two-K. Flagship is around four cents an image. Fast is around two cents and renders in under three seconds. It supports the standard aspect ratios and goes high-res, and Google says it solved text rendering, meaning legible, accurate typography. One thing to remember: every Imagen tier carries an imperceptible SynthID watermark, and that watermark survives into your video. Hold that thought, it comes back in the pitfalls.
Fifth, Midjourney. Version seven introduced Omni Reference for consistent characters and objects, a draft mode that's roughly ten times cheaper, and better photorealism. Its aspect ratios go all the way up to four-to-one, true panoramic and cinematic, including the vertical and horizontal social ratios. Version eight point one shipped at the end of April, around four to five times faster. Pricing is subscription only, ten to one hundred twenty dollars a month, with no permanent free tier. Its weak spot is text: only about thirty to forty percent accuracy.
Sixth, Ideogram, which is the opposite story on text. Ideogram three point oh renders text at roughly ninety to ninety-five percent accuracy, versus Midjourney's thirty to forty. It uses a hybrid design that pairs diffusion with dedicated typography and character placement, built for marketing, posters, and product visuals. Ideogram four point oh launched June third and tops its design arena among open-weight models. But here's the irony, and it's important: text that renders perfectly in the still tends to warp anyway once you animate it. We'll come back to that.
Now, how do you choose among these on any given day? Check the live leaderboard, the Artificial Analysis Image Arena, which ranks models on blind pairwise votes, so it moves. As a snapshot, on text-to-image the top was GPT Image two, then GPT Image one point five, then Nano Banana two, with Nano Banana Pro around fifth. On image-editing, GPT Image one point five led, then GPT Image two, then Nano Banana Pro, then Nano Banana two. Why do I call out the editing board separately? Because refining a frame you already minted is an editing task, not a from-scratch generation. When you're nudging a composition or fixing a hand, the editing rankings are the ones that matter. Treat that board the same way as every other: check it live, because it moves.
Okay. Now the heart of the episode, the thesis, the one idea I want you to carry out of here. Prompting a still that you intend to animate is fundamentally different from prompting a finished image. The core principle is this: don't prompt the perfect static photo. Prompt a frame with somewhere to go.
Let me unpack what that means in practice, because it's a handful of concrete habits. The first is implied motion, not frozen action. You want to capture a dynamic moment that suggests action, a subject mid-leap, mid-stride, mid-turn, without baking the motion into the pixels. If you give the model a completed, frozen action, you've left the animator nothing to extend. The pose is over. There's no obvious next instant. A mid-leap reads as "this is about to continue," and that's exactly what the video model wants.
The second habit is to avoid baked-in motion blur. You want the start frame sharp. Prefer asking for sharp focus over piling on negations. Clean, well-lit, sharp inputs produce stable animation. Here's the why: the video model adds blur itself during motion. If your still already has blur in it, you're compounding artifacts, blur on blur, and the result smears. So a crisp, sharp opening frame is not a stylistic preference, it's a technical requirement for clean motion.
Third, leave compositional room to move within. Build the frame in layers: foreground, midground, background. That depth gives the camera somewhere to push into for a dolly-in, which is a camera move where the camera physically moves toward the subject. Leave negative space, empty room in the frame, so a subject can enter it or the camera can pan across it. A shallow depth of field, where the background goes soft and blurry, the bokeh effect, isolates your subject and gives a believable plane for the camera to travel along. If every inch of the frame is packed and in focus, there's nowhere for motion to happen.
Fourth, and this is the cleanest contrast with finished-image prompting, don't over-detail and don't clutter. When you prompt a finished image, more detail is usually better. Here it's the reverse. A start frame locks composition, and that's restrictive if the frame is too complex or too static. Reduce unnecessary detail so the AI has flexibility to animate your main subject. Cluttered backgrounds and poor lighting cause unwanted warping when things start moving. So strip it down. Give the model fewer things to keep track of while it's busy inventing motion.
Fifth, divide the labor between your two prompts. This is the rule that ties it all together. Put the scene in the image prompt. Put the motion in the video prompt. When you get to the video stage, focus that prompt on describing the motion, not re-describing the image. If you re-describe elements that are already visible in the frame, in high detail, you can actually reduce the motion or cause unexpected results, because the model thinks you want it to regenerate those elements. And keep the motion simple. A single-axis move, like a slow dolly in, beats a complex transform that pans and zooms and rotates all at once. One clean motion is far more likely to come out right.
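One rough way to check that division of labor is to look for words your video prompt repeats from your image prompt. The Python sketch below is a crude word-overlap heuristic, not anything a video model exposes; the stopword list, the function names, and the example prompts are all invented for illustration.

```python
import re

# Small, hand-picked stopword list; an assumption, extend it as needed.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "with", "to", "at", "as", "into"}


def content_words(text):
    """Lowercased alphabetic words in `text`, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def redescribed_words(image_prompt, video_prompt):
    """Words the video prompt repeats from the image prompt, sorted."""
    return sorted(content_words(image_prompt) & content_words(video_prompt))


image_prompt = "A red fox mid-leap over a snowy log, sharp focus, soft morning light"
motion_only = "Slow dolly in"
redescribing = "A red fox leaps over a snowy log in soft morning light while the camera pans and zooms"

print(redescribed_words(image_prompt, motion_only))
# []
print(redescribed_words(image_prompt, redescribing))
# ['fox', 'light', 'log', 'morning', 'over', 'red', 'snowy', 'soft']
```

An empty list doesn't prove the prompt is good, and a long one doesn't prove it's bad, but a long list is a decent hint that you're describing the frame again instead of the motion.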
Sixth, a small but powerful trick: hold commands for clean openings. You can prompt something like hold the first frame for a second, or no motion for the first twelve frames, to keep your opening crisp before the motion kicks in. That gives you a clean, recognizable beat at the top of the shot before anything starts to move, which is great for titles or for letting a viewer register the scene.
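If you want to translate between "hold for a second" and "hold for twelve frames", the arithmetic is just seconds times frame rate. This Python sketch assumes twenty-four frames per second, which the episode doesn't specify; check the frame rate your video model actually outputs.

```python
def hold_frames(seconds, fps=24):
    """Number of frames covering a hold of `seconds` (fps is an assumed default)."""
    return round(seconds * fps)


def hold_seconds(frames, fps=24):
    """Duration in seconds of a hold lasting `frames` frames."""
    return frames / fps


print(hold_frames(1.0))  # 24
print(hold_seconds(12))  # 0.5
```

So at that assumed rate, a twelve-frame hold is only half a second, noticeably shorter than a one-second hold.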
And seventh, if you're using both a start and an end frame, there are consistency rules. Keep the same character design across both keyframes, same outfit, same face. Avoid drastic lighting or camera-angle changes between them. Don't pick start and end frames that are wildly different in color, style, or lighting, because those extremes break the transition. The model has to interpolate between your two frames, and if they're too far apart, it morphs and flails trying to get from one to the other.
Now let's nail down the aspect ratio and resolution piece, because this is a hard callback to the constraints episode and it's where a lot of people quietly lose quality. The rule is simple: pre-crop the still to your target video aspect ratio yourself. Match the first-frame aspect ratio to the output video aspect ratio. If you hand a landscape frame to a model and ask for a portrait output, you've forced the AI to either crop your frame dramatically or hallucinate the missing content to fill the new shape. Neither is good. So crop the source first. Don't delegate that decision to the model.
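Here is a minimal Python sketch of that pre-crop: given a source size and a target ratio, it computes the largest centered crop box. Centering is an assumption made for simplicity, since you'll often want to place the crop by eye, and the function name is hypothetical. The returned tuple uses the (left, top, right, bottom) order that Pillow's Image.crop expects.

```python
def center_crop_box(width, height, ratio_w, ratio_h):
    """Largest centered (left, top, right, bottom) box with ratio ratio_w:ratio_h."""
    # Integer cross-multiplication compares the two ratios without float error.
    if width * ratio_h > height * ratio_w:
        # Source is wider than the target: keep full height, trim the sides.
        new_w = height * ratio_w // ratio_h
        new_h = height
    else:
        # Source is taller than the target (or matches): keep full width.
        new_w = width
        new_h = width * ratio_h // ratio_w
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return (left, top, left + new_w, top + new_h)


# A 2048 square cropped for horizontal and for vertical video.
print(center_crop_box(2048, 2048, 16, 9))  # (0, 448, 2048, 1600)
print(center_crop_box(2048, 2048, 9, 16))  # (448, 0, 1600, 2048)
```

Either crop throws away a lot of a square frame, which is the argument for minting at the target ratio in the first place rather than cropping afterward.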
On resolution, the rule is match or exceed. For a ten-eighty-p output, give it at least a nineteen-twenty by ten-eighty frame. And here's the part people miss: the encoder pulls more detail out of a larger source image even at the same output resolution. A two thousand forty-eight by two thousand forty-eight frame yields a better ten-eighty-p clip than a one thousand twenty-four square does, because there's simply more information for the model to work from. That's the whole reason minting at two-K or four-K is worth the extra few cents even when your final clip is only ten-eighty-p. Nano Banana Pro's four-K, Seedream's four-K, FLUX's four megapixels, Imagen's two-K, those high resolutions pay off downstream.
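The match-or-exceed rule is easy to check before you upload. This Python sketch, with hypothetical function names, reports how much headroom a source frame has over the output size; it says nothing about how any particular encoder uses the extra pixels.

```python
def headroom(src_w, src_h, out_w, out_h):
    """Smallest per-axis ratio of source pixels to output pixels."""
    return min(src_w / out_w, src_h / out_h)


def meets_or_exceeds(src_w, src_h, out_w, out_h):
    """True when the source is at least as large as the output on both axes."""
    return headroom(src_w, src_h, out_w, out_h) >= 1.0


print(meets_or_exceeds(1920, 1080, 1920, 1080))    # True
print(meets_or_exceeds(1024, 1024, 1920, 1080))    # False
print(round(headroom(3840, 2160, 1920, 1080), 2))  # 2.0
```

A headroom below one means the video stage has to upscale your frame, which is the situation the rule is meant to keep you out of.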
One more wrinkle on ratios. Video dimensions usually have to be multiples of sixteen, so the model may adjust the final resolution slightly. Exact ratios aren't guaranteed and may drift a little. Pre-cropping to the documented target size minimizes that nudge. And mind the mapping: vertical social formats like TikTok and Reels are nine-to-sixteen, standard horizontal YouTube is sixteen-to-nine. The teaching beat here is to mint the still at the exact aspect ratio the video model offers, in one of its native ratios. Don't mint a ratio the video stage can't accept. Midjourney can do a gorgeous four-to-one panorama, but if your video model can't take four-to-one, that beautiful frame is useless to you. Match the downstream tool.
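To see that multiple-of-sixteen drift in numbers, here is a small Python sketch. It assumes the model snaps each dimension to the nearest multiple of sixteen; real models may round down, pad, or use a different multiple, so treat the rounding rule and the function names as assumptions.

```python
def snap_to_multiple(value, multiple=16):
    """Nearest multiple of `multiple` (halves round up), never below one multiple."""
    return max(multiple, (value + multiple // 2) // multiple * multiple)


def snapped_size(width, height, multiple=16):
    """Width and height after snapping each dimension independently."""
    return (snap_to_multiple(width, multiple), snap_to_multiple(height, multiple))


print(snapped_size(1280, 720))   # (1280, 720)
print(snapped_size(1920, 1080))  # (1920, 1088)
```

Seven-twenty is already a multiple of sixteen, but ten-eighty is not, so under this rounding rule a ten-eighty-p frame comes out eight pixels taller and no longer exactly sixteen-to-nine.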
Let me give you the whole thing as one copyable workflow, the round trip, start to finish. Step one: prompt the image model for a start frame. Describe scene, subject, lighting, and mood. Build it in depth layers with room to move. Ask for implied motion, not frozen action. Ask for sharp focus, no baked-in blur. Set it at the target aspect ratio and the highest resolution the video model accepts. And bench it across a couple of the interchangeable models, Nano Banana Pro, Seedream, FLUX, Imagen, since rankings shift monthly and one of them will suit your particular shot best.
Step two: pick and refine the still using the image model's editing mode. This is where the editing leaderboard I mentioned earlier applies, because now you're editing, not generating from scratch. Nudge the composition, fix a hand, clear out clutter. Both Nano Banana and FLUX do natural-language edits and multi-reference work. This step is your approval gate. This is the whole payoff of the episode: you sign off on a specific frame instead of gambling on text-to-video. When the frame is right, you commit.
Step three: feed it as the start frame to the video model. As an example, Veo uses your input image as the initial frame, so you select the image closest to your envisioned first scene. In Veo three point one's Ingredients to Video feature, you upload through the plus button, and you can add up to four reference images for consistency, with ten-eighty-p and four-K output. Other models have their own upload flows, but the idea is identical: this still becomes frame one.
Step four: animate with a motion-only prompt. Describe the change, not the frame. Use a single-axis camera move. Add a hold-frame opening if you want a clean start. And run a short test first, because longer transitions amplify artifacts. Then bench your own shot across video models, the same way you benched the still across image models. Your shot is the real benchmark, not anyone's leaderboard.
Before we close, let me walk you through the pitfalls you will actually hit, because forewarned is forearmed. The first is the gorgeous frame that won't animate. You make a perfectly composed, fully resolved, dead-still photo, and you feed it in, and the video model just locks up and barely moves. That's the trap. A frame that reads as finished art, complete and static, gives the model nothing to extend. The fix is everything we talked about: reduce detail, reduce clutter, give the subject and the camera room to move.
The second pitfall is text breaking. Even if you used Ideogram or Imagen or Nano Banana to render crisp, perfect text in the still, that text tends to warp and garble once it's in motion, because the models struggle to keep letters readable while they move. The mitigation is a high-contrast frame, a simple background, single-axis motion, and a high-fidelity video model. But the real rule is this: don't put load-bearing text in the animated region. Logos, captions, signage, anything that has to stay readable, add it in post, or keep it in a held, static zone of the frame. Don't make the moving part carry your words.
Third, the aspect and resolution mismatch we already covered, where the AI crops dramatically or hallucinates missing content. Pre-crop, don't delegate. Fourth, morphing and hallucination in general, where backgrounds churn, limbs warp, faces distort. That happens when the model's statistical guess about motion conflicts with real-world physics. Simpler frames plus simpler motion reduce it. And fifth, the watermark carry-through I promised you'd hear again: Imagen frames carry SynthID watermarks that propagate right into your video. It's just a one-line caution, but know it's there, especially if provenance matters for your distribution.
So that's the move. Mint a start frame in an image model, prompt it with somewhere to go, match the ratio and resolution to your video stage, sign off on the frame, then animate with a motion-only prompt. You've traded a gamble for a decision. Go bench your own shot, because every one of these models and leaderboards will have shuffled by the time you try it.