OCDevel AI Video Generation Podcast
Make finished, professional video with AI - not just one-off clips. Every episode pairs a fast news rundown on the AI video generation landscape with a hands-on tutorial that takes you from prompting a website to running a one-person studio. The news tracks what moves a producer's week: the fast-shifting model leaderboard - Veo, Sora, Kling, Seedance, Gemini Omni, Runway and whoever's leading this week — plus the capability changes (native audio, image-to-video, character consistency, price-per-second) that change how you shoot. Then the tutorial climbs a single ladder across the series: from typing a prompt and taking what you get, to reliably landing the shot you pictured, to stitching consistent multi-shot scenes with recurring characters, to a repeatable pipeline, to a one-person studio where a client brief comes in and a finished, on-brand cut comes out while you art-direct from the beach. Text-to-video and image-to-video, keyframes, character and style consistency, the edit, the grade, AI audio, and the business of actually delivering - one copyable workflow and one real pitfall per episode. For creators, marketers, indie filmmakers, and small studios who want to direct AI instead of gambling with it. AI-generated podcast by OCDevel.

Character Consistency: Sheets, References, and When Multi-Reference Beats a LoRA

10h ago

In mid-2026, native multi-reference inside the major video tools does what a custom character LoRA used to for most one-off and short-series jobs. Train a LoRA only when you'll reuse the same face hundreds of times, need exact lock, and control a clean dataset.

Show Notes

This tutorial climbs the next rung: keeping one character looking like the same person across multiple shots. We start with why drift happens (a video generator is stateless, so it re-derives a plausible face on every call) and the "second-clip identity drift" wall, documented as almost never random.

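Then the four anchors, weakest to strongest: the character sheet, a single reference image or start frame, native multi-reference, and a trained character LoRA.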

Decision rule: default to multi-reference; train a LoRA only for high-volume, exact-lock, stable-base work. Plus pitfalls (outfit drift, lighting, identity bleed, reference quality), provenance (SynthID and C2PA, the EU AI Act and SB 942), real-person rights (NO FAKES Act), and benching it yourself on the Video Arena leaderboard.

Transcript

Let's talk about the wall most people hit second. The first wall is getting any usable motion out of a single clip at all, just one shot where the person moves and it looks right. You climb that, you feel good, and then you go to make your second clip. And the person you so carefully created in clip one shows up looking like their cousin. Same general description, same hair color, but the face is off in a way you can feel before you can name it. That's the problem we're solving today: character consistency. Keeping one person looking like the same person across many shots, so you can cut them together into an actual scene.

Here's the thing you have to understand before any technique makes sense. A video generator has no memory. It is stateless between calls. Every single generation is a fresh sample from a probability distribution. The model does not remember the person it drew thirty seconds ago in your last clip. It re-derives a plausible face every time, from your prompt, plus the seed, plus whatever reference image you hand it. So unless you actively re-anchor the identity on every call, the model drifts back toward its average human for that description. Drift isn't a bug you can file. It's the default behavior of a memoryless sampler doing exactly what it was built to do.

Let me describe what drift actually looks like, because it's sneaky. It's the "almost the same person" problem. The face is ninety percent there. But the spacing between the eyes shifts a little. The nose narrows. The jaw squares off. The hairline creeps up or forward. The eye color desaturates. The freckles vanish. The outfit changes a button, or the fabric weave looks different. The body proportions subtly rescale. And here's why it matters so much: each shot, looked at on its own, looks completely fine. You'd approve any one of them. But cut them together and the audience feels a wrongness they can't articulate. The human brain is exquisitely tuned to faces. A five percent identity error reads, instantly, as "wait, is that the same actor, or did they recast?"

So why the second clip specifically? Because of how the workflow tricks you. On clip one, you generate several options and you hand-pick the best one. You feel great about it. Then on clip two you quietly assume the system remembers what you picked. It doesn't. It never did. People who work with trained pipelines have a great way of describing this: the first frame looks right, but once the person smiles, turns their head, or changes pose, the face stops looking like the same person. And this is documented, not folklore. Face drift has known causes. Low adapter capacity. Low angle diversity in your references. Conflicting references fighting each other. As the folks who train these character models put it, face drift is almost never a random bug. It's a consequence of something specific you can fix.

What makes drift worse, so you know your enemies. Changing the camera angle, going from a profile to a three-quarter to a straight-on front view. Changing the lighting, warm key versus cool, hard versus soft. Changing the expression, because a big smile literally reshapes the lower half of the face. Changing the distance, close-up versus wide. Changing the outfit or the context around the person. Every one of those is a chance for the sampler to slide. And the worst case of all is plain text-to-video, where you're describing the person in words with no image anchor at all. Image-to-video, where you start from an actual generated frame, is dramatically better, which connects straight back to our earlier episode on why a start frame wins control. The whole arc of today is one idea: stack anchors until the sampler can't drift.

So let's build those anchors, from the simplest to the heaviest. The first one is the character sheet.

A character sheet is an idea borrowed from animation and game art. You'll also hear it called a model sheet, or a turnaround. A turnaround, if you haven't met the term, is just the character shown rotating through a set of standard angles, the way an animator needs to see a character from every side before they draw it moving. The sheet itself is usually a single image, often a grid, showing one character from multiple angles and multiple expressions, all under neutral conditions. What goes on it? The turnaround first: front, three-quarter, profile or side, back, sometimes a three-quarter-back view, which is the orthographic rotation animators have used for decades. Then an expression set: neutral, a smile, a frown or angry face, surprised, talking, basically a little mouth and emotion chart. You want neutral flat shadowless lighting. A plain grey or white background. And at least one full-height shot, so you've captured the proportions and the wardrobe head to toe. The Higgsfield Soul ID guidance spells this out almost exactly: at least one full-height shot, varied angles, multiple expressions, consistent lighting, and no sunglasses, no heavy shadow, no cropped faces.

Why does a sheet work? Because it converts an abstract description into a concrete visual fingerprint. "A thirty-something woman with auburn hair" is a thousand possible faces. A sheet pins it down to one face, captured across every angle, so whatever angle your next shot needs, you already have a matching reference on hand. And it does one more thing that's subtle but powerful: it de-correlates identity from lighting and pose. Remember, mixing up identity with lighting and angle is the exact trap that breaks consistency. A neutral sheet separates them. Now here's a piece of foreshadowing for our Act Three episode on building a real pipeline: this same sheet you build now will later become the training dataset for a character LoRA. The fifteen to thirty images you'd otherwise have to go shoot. So building the sheet pays off twice.

You build a sheet in an image model. And quick callback to our Minting Keyframes episode: in this show, image generation lives here, as frames, not as standalone art. We make images to feed video. So which image models? Nano Banana, and its bigger sibling Nano Banana Pro. That's Google's Gemini-native image model, and Nano Banana is just the nickname that stuck. The Pro version reportedly accepts up to eight reference images, double the original's four, and some sources claim up to fourteen for multi-image composition. It's pitched explicitly at creating character sheets and consistent spokespersons. The face-consistency guidance there wants your references at least about a thousand by a thousand pixels, with three to six angles, and explicit identity-preservation language in the prompt. Then there's FLUX point two, from Black Forest Labs, which comes in Pro, Flex, and Max tiers. FLUX point two Pro reportedly generates consistent multi-pose character grids from two to ten reference images, with around ninety-five percent fidelity across front, side, back, and dynamic angles, though that's a vendor number, so treat it as reported rather than verified. There's Seedream four point five from ByteDance, praised for consistent edits, and a popular twenty twenty-six workflow pairs them: FLUX for the initial generation, Seedream for the consistent edits afterward. And rounding out the top tier for character consistency you've got Midjourney version seven, Ideogram version three, which offers a single-photo feature called Ideogram Character, and Flux Kontext.

Now the prompt craft for building the sheet, which is a callback to our Anatomy of a Video Prompt episode. You prompt for something like: character turnaround sheet, T-pose, front view, side view, back view, neutral expression, even studio lighting, plain grey background, no shadows. Generate a batch. Pick the single best frame. And then, here's the trick that makes it cohere: use that one best frame as the seed reference for all the expression and angle variants, so every variant inherits one single identity instead of each one being a fresh roll of the dice.
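Here's a minimal sketch of that prompt assembly in Python. The helper names (`sheet_prompt`, `variant_prompt`), the example character description, and the identity-preservation wording are hypothetical; the sheet terms themselves are the ones listed above. It only builds text. How you attach the master frame as a reference depends on your image tool.

```python
# Hypothetical helpers: they only assemble prompt text from the terms above.
NEUTRAL_CONDITIONS = ["neutral expression", "even studio lighting",
                      "plain grey background", "no shadows"]

def sheet_prompt(description, views):
    """Prompt for the initial turnaround batch."""
    view_terms = [f"{view} view" for view in views]
    parts = [description, "character turnaround sheet", "T-pose",
             *view_terms, *NEUTRAL_CONDITIONS]
    return ", ".join(parts)

def variant_prompt(view, expression):
    """Prompt for one variant, meant to be sent together with the master frame
    as the reference image so the variant inherits that one identity."""
    return ("same character as the reference image, keep the face identical, "
            f"{view} view, {expression} expression, "
            "even studio lighting, plain grey background, no shadows")

master_prompt = sheet_prompt("thirty-something woman with auburn hair",
                             ["front", "side", "back"])
variants = [variant_prompt(view, expression)
            for view in ["front", "three-quarter", "profile"]
            for expression in ["neutral", "smile", "surprised"]]

print(master_prompt)
print(len(variants), "variant prompts")
```

Every variant prompt points back at the same reference image, which is the whole trick: one picked frame, many views of it.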

That brings us naturally to the next anchor up: reference images handed directly to the generator.

This is the most direct anchor of all. You give the generator one image and you tell it, this person. This connects directly to Minting Keyframes and to Image-to-Video versus Text-to-Video, because a start frame, the literal first frame of your clip generated in an image model, carries identity into the video. The model animates outward from that exact pixel state instead of inventing a person from scratch. Reference-to-video, image-to-video, this is the single biggest consistency lever a beginner has. If you take one thing from today, take that.

There are two flavors of reference, and the distinction matters. The first is the start frame, the image-to-video case, where the reference literally is frame one of your clip. That's the highest control you can get. Identity is locked at time zero. The catch is that drift can still creep in over the duration of the clip as the person moves. The second flavor is a reference flag, sometimes called a character reference or a face reference. That's not the first frame. It's a guidance image the model consults to keep the face and outfit on-model while it generates freely, in a different angle, a different action, a different setting. Hosted tools expose this as a character reference or face reference slot.

On the hosted side, Ideogram Character generates consistent, photorealistic characters from just a single reference image. One clear photo plus a description of the action. Runway Gen-4 reportedly hits ninety-five percent plus character consistency across video shots using a single reference image, again, reportedly, that's their number. Single-reference is the floor. It's where you start. Multi-reference is the frontier, and we're getting there.

Two more things worth naming in this space are the no-training adapters, PuLID and IPAdapter. These are identity adapters that inject identity at generation time, requiring no training at all. You upload one photo, and PuLID forces the base model to match that face. IPAdapter, which lives in the open-weights world and in ComfyUI, does basically the same job, acting as a face or image prompt. Think of these as a conceptual middle ground between a bare reference image and a fully trained LoRA. You probably won't reach for them as a deadline creator using hosted tools, but they're worth knowing, because they do something hosted tools can't, and they live in Act Three territory.

Okay. Now the big one. Native multi-reference. This is the frontier, and honestly this is the thesis of the whole episode.

Here's the shift. Instead of one reference, you feed several reference images in a single call. Face from image A. Outfit from image B. Environment or prop from image C. And then, this is the skill, you name them in your prompt so the model knows which reference drives which thing. As of mid-2026 this capability is built right into the major hosted tools, and it now does what a custom LoRA used to do for a large fraction of jobs. No training run. No base-model lock-in. You just upload and prompt. That's the headline. For most working creators, this changes the whole calculation.

Let me walk the tools, with the standing caveat that this reshuffles monthly, so verify before you commit. Runway Gen-4 References, also called Gen-4 Image, takes one to three reference images. The convention is to label them image one, image two, image three, and then refer to those labels in your prompt so the model knows which input drives what. That labeling is what makes it work when you're combining elements across references. Resolutions there reportedly max out around seven-twenty by seven-twenty for a square, and twelve-eighty by seven-twenty for widescreen. Then Google's Veo three point one, inside Flow, has a feature called Ingredients to Video. You upload up to three reference images of a character, a product, or an object, and Veo maintains consistent visual identity across scenes, angles, and settings. A January twenty twenty-six update added vertical video via reference images, which matters if you're cutting for phones.

Kling has a feature called Elements. Kling two point five Turbo supports multi-reference up to four images to hold character consistency, and Kling three point zero adds a dedicated multi-reference and inpainting mode, where inpainting just means regenerating a masked region while keeping the rest. Then there's the most expansive one, Seedance two point zero from ByteDance, on Dreamina, with a feature called Omni Reference. This one's wild: up to nine images, three videos, and three audio files in a single generation, each assigned a role through an at-mention syntax in the prompt. You write something like at-char-one, at-outfit, at-scene, and route each reference to its job. It was released in February twenty twenty-six, and its reputation is what people call ID-Lock: faces, clothing, and visual style stay locked across your entire video.

Midjourney version seven is a good teaching moment all by itself. The old character-reference flag, the one people learned last year, is deprecated and incompatible with version seven. It just doesn't work anymore. The replacement is Omni-Reference, which has its own dedicated tab and an alpha flag, plus a strength slider where the sweet spot is reportedly somewhere around three hundred to five hundred. And the lesson there, callback to Prompt Dialects, is that flags churn. A parameter you carefully learned a year ago can simply vanish. Same intent, different dialect, not just per model but per version of the same model.

A few more to have in your back pocket. Higgsfield has Soul ID and Characters, which is a trained persistent identity inside Soul two point zero. You upload around twenty or more photos and it locks identity across style, lighting, and angle. A popular combo is using Ideogram for the identity and then Higgsfield Soul for the cinematic video. Nano Banana Pro, on the image side, with its up-to-eight references, is often used to build the consistent set of images that then feeds a video model. And you should at least know the names Vidu, Pika, and Hailuo, which is from MiniMax, all of which offer reference or subject-reference modes. Plus Sora two Pro, noted for strong character consistency and, notably, a public API you can build against.

Now here's the actual teaching point under all those product names, and it's the thing to internalize: the naming convention is the skill, not the tool. Look across all of them. Runway has three labeled slots. Seedance has at-mentions. Veo calls them Ingredients. Kling calls them Elements. Midjourney has a strength slider. Different words, identical underlying pattern. Multiple slots, each one labeled, and then prompt text that routes each reference to a role. Learn that pattern once and the specific syntax is just a dialect detail, exactly like we said in Prompt Dialects. The components are the language. The dialect is just the accent.
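To make the pattern concrete, here's a small Python sketch of "labeled slots plus routing text." Everything in it is illustrative: the `Reference` class, the file paths, and the two label styles are assumptions modeled on the conventions described above, not the exact syntax any tool accepts. Check each tool's current docs for the real dialect.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reference:
    role: str  # what this image controls: "character", "outfit", "scene"
    path: str  # hypothetical local file for the reference image

def numbered_labels(refs):
    """Slot-style dialect: inputs are called 'image 1', 'image 2', ..."""
    return {ref.role: f"image {i}" for i, ref in enumerate(refs, start=1)}

def at_mentions(refs):
    """Mention-style dialect: inputs are called '@character', '@outfit', ..."""
    return {ref.role: f"@{ref.role}" for ref in refs}

def routed_prompt(action, labels):
    """Prompt text that says which reference drives which element."""
    routing = ", ".join(f"{role} from {label}" for role, label in labels.items())
    return f"{action}. Use {routing}."

refs = [Reference("character", "jane_master.png"),
        Reference("outfit", "red_coat.png"),
        Reference("scene", "cafe.png")]

print(routed_prompt("She walks to the counter", numbered_labels(refs)))
print(routed_prompt("She walks to the counter", at_mentions(refs)))
```

The reference list and the action stay the same across both calls. Only the label function changes, which is the "dialect" part.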

So that's multi-reference. Now let's talk about the heavy artillery, the trained character LoRA, so you know when it's worth the trouble and, just as important, when it isn't.

LoRA stands for Low-Rank Adaptation. In plain terms, it's a small trained adapter file, anywhere from a few megabytes to around two hundred megabytes, that bolts onto a base image or video model and teaches it one specific thing. A face. A style. A product. It does this by learning a small low-rank update that gets added on top of the model's frozen weights, rather than retraining the whole enormous model. A character LoRA is just: this exact person, learned and baked in. And here's the difference from those runtime adapters we mentioned, PuLID and IPAdapter. Those inject identity from one photo at generation time, on the fly. A LoRA bakes the identity into the weights through an actual training process. It's a deeper, more permanent imprint.
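If you want to see why the adapter file is so small, here's a quick back-of-the-envelope sketch in Python. In the standard LoRA formulation, a weight matrix W stays frozen and training learns two thin matrices B and A, so the effective weight is W + B·A. The layer size and rank below are illustrative assumptions, not the dimensions of any particular model.

```python
def full_matrix_params(d_out, d_in):
    """Parameters in one full weight matrix of shape d_out x d_in."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Parameters in the low-rank pair: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 16  # illustrative sizes only
full = full_matrix_params(d_out, d_in)
low = lora_params(d_out, d_in, rank)

print(f"full matrix: {full:,} parameters")
print(f"LoRA update: {low:,} parameters ({low / full:.2%} of the full matrix)")
```

For this one example layer the trained update is under one percent of the full matrix, which is why the result ships as a small add-on file instead of a new copy of the model.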

How do you train one? Start with the dataset. The consensus sweet spot for a character is roughly fifteen to thirty images, and variety is the whole game. As people who do this put it, twenty-five diverse, well-captioned images with varied angles and lighting outperform sixty similar images every single time. And the flip side: if every image is one angle, the LoRA only works at that angle. So you want a minimum around a thousand pixels on a side, with varied poses, expressions, backgrounds, and lighting. And here's the payoff I promised you earlier: this dataset is the character sheet from before. The sheet you built at the start of this workflow is the training set. That's why we built it.
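A rough pre-flight check for that dataset might look like the sketch below. It assumes you've hand-labeled each image with its angle. The function name, the dict fields, and the three-angle minimum are assumptions for illustration (the minimum borrows from the three-to-six-angle reference guidance mentioned earlier). The image count and pixel thresholds come from the numbers above.

```python
def dataset_issues(images, min_count=15, max_count=30, min_side=1000, min_angles=3):
    """images: list of dicts with 'name', 'width', 'height', and a hand-assigned 'angle'.
    Returns a list of human-readable problems; an empty list means no flags."""
    issues = []
    if not min_count <= len(images) <= max_count:
        issues.append(f"{len(images)} images; target is {min_count}-{max_count}")
    too_small = [im["name"] for im in images
                 if min(im["width"], im["height"]) < min_side]
    if too_small:
        issues.append(f"under {min_side}px on a side: {', '.join(too_small)}")
    angles = {im["angle"] for im in images}
    if len(angles) < min_angles:
        issues.append(f"only {len(angles)} distinct angle(s): {', '.join(sorted(angles))}")
    return issues

# Twenty sharp images that are all front-facing: size and count pass, variety fails.
front_only = [{"name": f"img{i}.png", "width": 1024, "height": 1024, "angle": "front"}
              for i in range(20)]
print(dataset_issues(front_only))
```

It can't judge likeness or caption quality. It only catches the cheap mistakes before you pay for a run.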

Next, captions. Each image gets text-tagged. You use a trigger token, some unique made-up word like sks-jane, plus descriptive captions, so the model binds the identity to a word you can actually prompt with later. Then you pick a base model. On the image side that's FLUX point one dev or FLUX point two. On the video side, Wan two point two, from Alibaba, which is open and trains on as little as twelve gigabytes of VRAM, or LTX two point three. And here's a critical constraint: a LoRA is tied to its base model. Swap the base, and you retrain from scratch. Where do you actually run the training? Open-weights through ComfyUI, which is the most flexible path, handles both text-to-video and image-to-video, and is the canonical Act Three stack. Or hosted trainers: fal.ai, which has a fast FLUX trainer and a FLUX point two trainer, Replicate, with a fast FLUX trainer that runs in under two minutes for under two dollars, or Civitai.
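As a sketch of the captioning step: many trainers read a caption from a text file that sits next to each image with the same name, but that convention is an assumption here, so confirm it against your trainer's documentation. The file names and caption text are made up, and the trigger token is the example from above.

```python
from pathlib import Path

TRIGGER = "sks-jane"  # the made-up trigger token from the example above

def caption(trigger, details):
    """Put the trigger token first, then the descriptive caption."""
    return f"{trigger}, {details}"

def sidecar_path(image_path):
    """Caption file next to the image, same name with a .txt extension
    (assumed convention; check what your trainer expects)."""
    return Path(image_path).with_suffix(".txt")

# Hypothetical images and descriptions.
dataset = {
    "dataset/jane_front.png": "front view, neutral expression, grey background",
    "dataset/jane_profile.png": "profile view, smiling, outdoor daylight",
}

captions = {sidecar_path(image): caption(TRIGGER, details)
            for image, details in dataset.items()}

for path, text in captions.items():
    print(path, "->", text)
```

The descriptions name what varies from image to image, the angle, the expression, the setting, while the trigger token stays constant, so the token is the one thing tied to the identity.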

On cost and time, all reportedly: a training run is roughly two to five dollars. Fal.ai is around two dollars. Replicate's fast trainer is under two minutes and under two dollars, and there are walkthroughs that get FLUX done for about a dollar. So per run, it's cheap. But here's the honest framing: the real cost isn't the GPU minutes. It's the dataset creation, the iteration, and the evaluation. That's the labor. And this connects to our Cost Per Finished Clip episode: a three-dollar training run that gives you a hundred on-model shots is trivial per finished clip. A three-dollar run for three shots is not. Always judge it per finished clip, not per training run.
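The arithmetic behind that comparison, as a tiny Python sketch. The function name is hypothetical, and it deliberately counts only the training run, not the dataset labor or the per-clip generation cost.

```python
def training_cost_per_clip(training_cost, finished_clips):
    """Spread one training run's cost across the finished clips it produced."""
    if finished_clips <= 0:
        raise ValueError("need at least one finished clip")
    return training_cost / finished_clips

print(f"${training_cost_per_clip(3.00, 100):.2f} per finished clip")  # hundred-shot job
print(f"${training_cost_per_clip(3.00, 3):.2f} per finished clip")    # three-shot job
```

Same run, same three dollars: about three cents a clip in one case, a dollar a clip in the other.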

What do you get for it? LoRAs reportedly boost character fidelity forty to sixty percent over plain base prompting in open models, and ten to twenty good references can get you to ninety-five percent plus consistency across poses and styles. Those are the upside numbers.

Now let me be honest about the downsides, because this is where people get burned. First, effort. Curating a dataset, captioning it, iterating, and evaluating is real, unglamorous labor. Second, base-model lock-in. A fine-tuned model is a recurring dependency that needs ongoing maintenance, not a one-time asset you build and forget. Power users end up rebuilding their ComfyUI graphs every time a base model updates. Third, churn. Base models update monthly, so your LoRA slowly decays in relevance as the thing it was trained on moves on. Fourth, stacking complexity. If you run multiple LoRAs at once, say a character plus a style plus a lighting LoRA, you have to tune their weights against each other on every generation. And fifth, and this is the kicker: face drift is still possible even with a LoRA if your dataset lacked angle diversity. All those failure modes from the very beginning of the episode apply to LoRAs too. A LoRA trained only on front-facing neutral shots will still fall apart in profile, or on a big laugh. Training doesn't exempt you from physics.

So here's the decision, stated plainly, for a creator working against a deadline.

Default to native multi-reference. For most working creators in mid-2026, the multi-reference features in Runway, Veo, Kling, Seedance, and Midjourney Omni hit good-enough identity with zero training, zero lock-in, and instant iteration. That's the twenty twenty-six shift in one sentence: native multi-reference now does what a custom LoRA used to, for one-off campaigns, short series, and single-deadline jobs.

Train a LoRA only when most or all of these are true. One, high volume or recurring use. A brand mascot, a series host, a virtual spokesperson you'll render hundreds of times. That volume amortizes the setup cost. Two, you need total lock. The likeness has to be exact on every single frame, a real client's face, an owned IP character, where multi-reference's ninety to ninety-five percent just isn't enough. Three, you're staying on the same base model for a while. You're committed to a stack like Wan in ComfyUI and you're not churning weekly. And four, you control a clean dataset. You can get or make those fifteen to thirty good, varied images.
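The rule can be written down as a short Python sketch. The `Job` fields mirror the four conditions above. The threshold of three is one reading of "most or all" and is an assumption you should tune, not part of the rule itself.

```python
from dataclasses import dataclass

@dataclass
class Job:
    high_volume: bool        # same character rendered hundreds of times
    needs_exact_lock: bool   # likeness must be exact on every frame
    stable_base_model: bool  # staying on one base model for a while
    clean_dataset: bool      # fifteen to thirty good, varied images available

def choose_anchor(job, threshold=3):
    """Default to multi-reference; escalate only when enough conditions hold.
    threshold=3 is an assumed reading of 'most or all' of the four."""
    met = sum([job.high_volume, job.needs_exact_lock,
               job.stable_base_model, job.clean_dataset])
    return "train a LoRA" if met >= threshold else "native multi-reference"

one_off_campaign = Job(high_volume=False, needs_exact_lock=False,
                       stable_base_model=True, clean_dataset=True)
brand_mascot = Job(high_volume=True, needs_exact_lock=True,
                   stable_base_model=True, clean_dataset=True)

print(choose_anchor(one_off_campaign))
print(choose_anchor(brand_mascot))
```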

So the rule, simply: one video, or a short series, on a deadline, use native multi-reference. A character you'll reuse for months across many shots, where ninety-five percent isn't good enough, train a LoRA. And there's a middle ground worth knowing: the hosted trained-identity features like Higgsfield Soul ID and Ideogram Character, and the runtime adapters PuLID and IPAdapter. Those are train-lite. You get some of the lock without the full LoRA pipeline. And foreshadowing Act Three again: in the pipeline episode, the LoRA stops being a one-off and becomes part of an automated character library.

Let me give you one copyable workflow that ties all of this together, start to finish.

Step one, build the character sheet in an image model. Prompt a turnaround plus an expression set, neutral lighting, plain grey background. Generate, pick the single strongest frame as your master reference, then generate your angle and expression variants seeded from that master so they all share one identity. Save four to eight clean reference images. And remember, that set is the start of your future LoRA dataset if you ever need to escalate, since a training run wants closer to fifteen to thirty images.

Step two, lock the start frame. Callback to Minting Keyframes. For each shot, generate the exact opening frame in the image model, using your master reference, composed for that specific shot, the right angle, the right framing, the right action pose. That frame is your image-to-video anchor.

Step three, generate the clip with multi-reference. Feed in the start frame, plus the character reference, plus an outfit reference or a scene reference, using whatever labeling convention your tool uses. Runway's three labeled slots, Seedance's at-mentions for character, outfit, and scene, Veo's Ingredients, Kling's Elements, Midjourney's Omni strength around three hundred to five hundred. Write the prompt with subject, action, camera, and light, exactly the structure from our Anatomy of a Video Prompt episode. And draft cheap, finish sharp: generate on a low-res or fast tier while you're finding the take, then finish at full quality only on the good one. That's the take-ratio discipline from our Cost Per Finished Clip episode.

Step four, spot drift. Do a deliberate quality-control pass. Lay your shots side by side at the same frame timing. A/B each face against your master sheet. Check the eye spacing, the nose width, the jawline, the hairline, the eye color, the freckles and moles, the outfit details, the proportions. And watch specifically for drift accumulating across a single clip's duration, not just between clips. The person can start right and slide by the end.
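One way to keep that pass honest is to log what you flag, shot by shot. The sketch below is a hypothetical Python helper. The judgment is still a human comparing each shot against the master sheet; the code only organizes the notes using the checklist above.

```python
CHECKLIST = ("eye spacing", "nose width", "jawline", "hairline", "eye color",
             "freckles and moles", "outfit details", "proportions")

def drift_report(review_notes):
    """review_notes maps shot name -> set of features a reviewer flagged as off-model.
    Returns feature -> shots where it drifted, in checklist order."""
    flagged_anywhere = {f for flagged in review_notes.values() for f in flagged}
    unknown = flagged_anywhere - set(CHECKLIST)
    if unknown:
        raise ValueError(f"not on the checklist: {sorted(unknown)}")
    report = {}
    for feature in CHECKLIST:
        shots = [shot for shot, flagged in review_notes.items() if feature in flagged]
        if shots:
            report[feature] = shots
    return report

# Hypothetical review of three shots against the master sheet.
notes = {
    "shot_01": set(),
    "shot_02": {"jawline"},
    "shot_03": {"jawline", "eye color"},
}
print(drift_report(notes))
```

A feature that shows up across several shots tells you which reference to strengthen or which angle to add before you regenerate.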

Step five, correct. If multi-reference drifts, re-anchor. Use a stronger start frame. Add the angle that drifted to your reference set. Raise the reference strength. Match the lighting. And if you find you're regenerating the same character constantly and still fighting it, that, right there, is your signal from the decision rule to escalate to a trained LoRA or a Soul ID, using the sheet you already built as the dataset. The workflow loops back on itself.

Before we wrap, there are pitfalls beyond plain face drift that will absolutely bite you, so let's name them.

Outfit and prop consistency. Identity is not wardrobe. The model can keep the face perfectly and still swap a jacket's buttons, change a logo, or just lose a prop entirely between shots. The fix is to dedicate a separate reference slot to the outfit or prop, that's literally what those multi-reference slots are for, and to describe the wardrobe explicitly on every call.

Lighting changes break identity. A face under a warm hard key light versus a cool soft light reads as a different person to the model. This is exactly why your sheets use neutral flat lighting, so you don't bake a lighting accident into the identity itself. When a shot needs dramatic lighting, anchor the identity with a neutrally-lit reference and let the prompt drive the lighting separately. Keep those two things in different lanes.

Multi-character scenes mix faces, what's called identity bleed. Put two characters in one shot and the model can blend them. You'll get character A's nose on character B, or both of them drifting toward a shared average. The fix is distinct labeled references per character, character one and character two, spatial description in the prompt, the one on the left, the one on the right, and verifying both faces separately in quality control. And just expect more takes when there are two people in frame.

Reference image quality is a hard ceiling on everything downstream. A low-res, blurry, shadowed reference, or one with sunglasses, or a cropped face, degrades every result that follows. Minimum around a thousand pixels, sharp, evenly lit, face unobstructed. Garbage reference in, garbage consistency out, always.

Expression and angle limits. References, and LoRAs too, only generalize as far as their coverage goes. An all-front-facing sheet won't hold up in profile. A LoRA trained only on neutral faces breaks on a big laugh. So cover the actual range you intend to shoot, before you shoot it.

Then there's provenance, which is becoming both an industry standard and a legal requirement. Google's SynthID invisibly watermarks Imagen images, Veo video, audio, and Gemini text, and it survives standard post-processing. As of May twenty twenty-six there were over a hundred billion assets watermarked that way. The emerging industry standard is two layers: C2PA Content Credentials, which is a signed metadata manifest backed by a coalition of over six thousand members including Google, Adobe, OpenAI, and Meta, plus an imperceptible watermark underneath. And this is becoming law. The EU AI Act, Article fifty, mandates machine-readable AI marking starting August second, twenty twenty-six, with penalties up to fifteen million euros or three percent of global turnover. California's SB nine-four-two is a parallel law, effective January twenty twenty-six. So stripping these watermarks isn't just bad form, it's heading toward illegal.

And the last pitfall, which is really a warning, is likeness and rights for real people. The NO FAKES Act, from twenty twenty-five, defines a digital replica as a computer-generated, highly realistic representation that's readily identifiable as the voice or visual likeness of an individual, living or dead. The takeaway is direct: training a character LoRA, or building a reference set, on a real person, a celebrity, a client, a found face off the internet, carries real deepfake, likeness, and publicity-rights exposure. So use synthetic characters wherever you can. Get written consent for any real subject. And don't strip the provenance watermarks. That's not legal advice, it's craft hygiene that keeps you out of trouble.

One more thing, and then I'll let you go make something. Bench it yourself. The leaderboard reshuffles monthly. Head over to the Artificial Analysis Video Arena, which is a blind, human-preference Elo ranking. It measures which video a human actually prefers, averaged across hundreds of thousands of votes, not which one is objectively correct, because there's no objective correct for this. The standings churn constantly. As of this research, a model called HappyHorse one point zero from Alibaba reportedly held the number one spot at around thirteen hundred and thirty-three Elo, with Seedance two point zero at number two, prized for that ID-Lock consistency, and Sora two Pro noted for strong consistency plus a public API. But don't memorize the number one model. The consistency ranking reshuffles month to month, the news segment is where current standings live, and the only ranking that actually matters is your own A/B test on your own shots. Bench the same reference and the same prompt across two tools, and pick the one that holds your character. And judge it on cost per finished clip, including all the re-takes that drift forces you into, because that's the number that pays your rent.
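For a feel of what an Elo gap means, here's the textbook Elo expected-score formula as a Python sketch. Whether the arena computes its ratings exactly this way is an assumption, and the thirteen-hundred rating for the second model is a made-up number for illustration, since only the top score is quoted above.

```python
def expected_preference(rating_a, rating_b):
    """Textbook Elo: probability that A is preferred over B in one blind vote."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

leader = 1333       # the reported top score
challenger = 1300   # hypothetical rating, for illustration only

share = expected_preference(leader, challenger)
print(f"leader preferred in about {share:.0%} of head-to-head votes")
```

Under that formula a gap of a few dozen points is only a modest edge in a single blind vote, which is one more reason to trust your own A/B test over the ranking.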