AI Production Methodology
A systematic model-evaluation framework for choosing the right generative tools and finishing workflows—optimized for Quality, Consistency, and Deliverability.
Generative AI changes weekly; a reliable production pipeline needs a stable methodology. This page documents the evaluation framework I use to test new models, compare outputs, and convert promising generations into broadcast-ready deliverables through repeatable finishing chains.
Default target spec: 4K delivery; 24 or 30 fps; vertical or horizontal versions as needed.
Tools named publicly: Google AI Studio, Kling, Higgsfield (with supporting post tools where relevant).
How I score (1–10 rubric)
Every test receives a composite score across Quality, Consistency, and Deliverability. 8+ = production-ready; 7 = needs finishing/has known limitations; 6 and below = R&D only.
Quality
10–9: Indistinguishable from live-action or high-end VFX; motion/performance/lighting all read naturally; no visible AI artifacts
8: Broadcast-ready with minor finishing (controlled grain, light cleanup); passes client/director review without explanation
7: Good foundation but needs targeted post work (sky fixes, edge cleanup, ADR); "you can tell, but it works"
6 and below: Synthetic tells dominate (warping, physics breaks, plastic skin); useful for R&D/previs only
Consistency
10–9: Zero drift; identity/objects/wardrobe locked across all frames; cuts seamlessly into multi-shot sequences
8: Minor temporal texture crawl or subtle flicker that finishing can address; character/object identity holds
7: Noticeable drift on reveal-from-occlusion or fast motion; usable if you limit shot length or add reference conditioning
6 and below: Identity breaks, morphing, multi-character tracking fails; not reliable for continuity-dependent work
Deliverability
10–9: Drops into editorial clean; survives upscale/grade/QC with zero manual repair; reproducible under deadline
8: Requires standard finishing chain (upscale → cleanup → grain) but integrates predictably; timeline-friendly
7: Needs shot-specific fixes (banding paint-outs, edge roto, stabilization); adds post overhead but shippable
6 and below: Unpredictable finishing behavior, high manual repair cost, or can't hit spec (resolution/fps/aspect); blocks pipeline
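As a sanity check, the rubric above can be encoded directly. One assumption of mine in this sketch: the composite is taken as the minimum of the three axes (weakest-link scoring), since a shot is only as shippable as its worst score — the source does not specify the aggregation, so swap in a mean if that matches your practice better. Class and method names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TestScore:
    """One tool/workflow test, scored on the 1-10 rubric."""
    quality: int
    consistency: int
    deliverability: int

    def composite(self) -> int:
        # Assumption: weakest-link composite. A shot is only as
        # shippable as its lowest axis; use a mean if you prefer.
        return min(self.quality, self.consistency, self.deliverability)

    def verdict(self) -> str:
        c = self.composite()
        if c >= 8:
            return "production-ready"
        if c == 7:
            return "needs finishing / known limitations"
        return "R&D only"
```

Under this weakest-link reading, a shot scored Quality 8 / Consistency 8 / Deliverability 9 classifies as production-ready, while a single 7 on any axis pulls the whole test into the "needs finishing" band.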
How I test (repeatable protocol)
I run tests like production would: same creative goal, consistent inputs, then finish to delivery.
1) Standardize inputs
Use a start frame when possible (image-to-video) for tighter control than pure text-to-video
Keep prompt intent constant and vary parameters intentionally (not randomly)
2) A/B comparisons
Run the same shot concept across multiple tools/workflows
Evaluate tradeoffs (stability vs detail vs control) rather than searching for a single "best model"
3) Finish to delivery
Treat base generations as plates; measure how they hold up after a finishing chain
Output in the target spec (4K; 24 or 30 fps) so results reflect real-world deliverables
4) Document + transfer
Maintain comparison matrices and before/after visual libraries; this page is also part of that documentation system
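To keep the "document + transfer" step mechanical, each run can be logged as one flat row in a shareable comparison matrix. A minimal sketch using only the standard library; the field names are my own shorthand, not a fixed schema:

```python
import csv
import io

# One row per (shot concept, tool/workflow) run; scores use the 1-10 rubric.
FIELDS = ["shot_concept", "tool", "workflow",
          "quality", "consistency", "deliverability", "notes"]

def matrix_csv(rows):
    """Serialize test runs to CSV for a comparison matrix any teammate can read."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical example row:
run = {"shot_concept": "macro insects on skin", "tool": "tool-A",
       "workflow": "i2v + upscale + grain", "quality": 8,
       "consistency": 8, "deliverability": 9, "notes": "tattoo lock holds"}
```

The point is less the format than the discipline: identical columns across every test make A/B tradeoffs visible at a glance.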
Case study: finishing an image-to-video generation to 4K
Problem: A base generation can look great at native resolution but still read synthetic once you push it to delivery specs.
Test goal: Convert a strong image-to-video generation into something that holds up as a cinematic 4K shot.
Method:
Generate with a start frame (control-first)
Upscale to 4K (evaluate "natural detail" vs "hypersharpen" failure modes)
Cleanup pass to reduce compression/artifacting, then add controlled grain to avoid plastic perfection
Key finding: A repeatable "generate → upscale → cleanup/grain" chain consistently improves realism and compresses finishing time—but some issues (like sky banding) still require targeted fixes rather than assuming upscaling will solve everything.
Scores:
Quality: 8 (broadcast-ready with grain/cleanup)
Consistency: 8 (temporal stability holds through finishing)
Deliverability: 9 (predictable, timeline-friendly)
Visuals to show:
Slider/GIF: base → upscaled → final finished
Settings card: the finishing preset used
Case study: upscaler A/B on AI-generated vs. archival footage
Problem: Upscaling isn't a universal win—"more resolution" can reduce authenticity depending on source material.
Test goal: Compare two upscalers across (a) AI/VFX-style clips and (b) archival/doc footage.
Method:
Run the same source through both upscalers
Judge not just sharpness, but believability (avoid hypersharpen), artifact profile, and whether the result supports the story
Key findings:
For AI/VFX-style clips, one upscaler produced consistently cleaner, more believable detail while the other could skew hypersharpened
For archival/doc footage, both can reduce grain, but that can work against storytelling authenticity—sometimes the right call is minimal restoration
Cost/speed tradeoffs matter: one upscaler is significantly faster and more cost-effective for high-volume work
Scores (AI/VFX clips):
Upscaler A: Quality 9, Deliverability 9
Upscaler B: Quality 7 (hypersharpen risk), Deliverability 8
Visuals to show:
Two sliders: AI clip (A vs B), archival clip (A vs B)
Callout box: "When not to upscale"
Case study: hybrid live-action and AI insects (plate-first)
Problem: We needed realistic bees crawling over tattooed skin—without live insects on set and without a heavy manual VFX timeline.
Test goal: Build a repeatable hybrid workflow that preserves the live-action plate while adding convincing insect behavior.
Workflow tested (plate-first):
Color-grade the original footage (so AI inherits the intended look)
Pull a hero still from the graded plate
Generate multiple AI still candidates (bees placement/texture); iterate prompts
Use the best still as the start frame for video generation; iterate prompts/settings
Upscale/finish to 4K; integrate into edit; final color
What I learned:
Image generation tool choice dramatically impacts downstream video quality—some models produce stronger "hero stills" for realism and detail
For close-ups, workflow-based pipelines produce the most lifelike insect behavior at the best speed/cost tradeoff; for wider shots, other engines yield more organic movement
Tool and strategy selection should therefore change with the scale and detail in frame
Scores (close-up version):
Quality: 8 (convincing on close inspection, minor finishing)
Consistency: 8 (insect behavior stable, tattoo lock maintained)
Deliverability: 9 (fast iteration, predictable finishing)
Visuals to show:
4-step strip: original plate → AI still exploration → animated pass → final composite
2×2 grid: same prompt intent across different generators
Prompt screenshots: iteration trail
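The plate-first workflow above reduces to an ordered checklist, which is what makes it transferable between team members. The stage names here are my shorthand for the five steps listed above, not tool commands:

```python
# Plate-first hybrid workflow as an ordered checklist (stage names illustrative).
PLATE_FIRST_STAGES = [
    "grade_plate",         # color-grade original footage so AI inherits the look
    "pull_hero_still",     # export a hero frame from the graded plate
    "generate_stills",     # iterate AI still candidates (placement/texture)
    "animate_from_still",  # best still becomes the start frame for video gen
    "finish_4k",           # upscale, cleanup, integrate into edit, final color
]

def next_stage(completed):
    """Return the first stage not yet completed, or None when the shot is done."""
    for stage in PLATE_FIRST_STAGES:
        if stage not in completed:
            return stage
    return None
```

Encoding the order matters because the stages are not commutative: grading before generation is what lets the AI inherit the intended look.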
What I'm actively testing right now (and why it matters):
Motion control stress testing
Fight scenes, action beats, standard blocking, facial movement and acting nuance
Testing against complex backgrounds and fast motion to map failure boundaries
Multi-character scenes to identify when models lose secondary subjects
Object consistency edge cases
Reveal-from-occlusion drift (when the model must "discover" parts of a subject)
When Elements/reference conditioning is required to prevent identity drift
Tattoos, logos, and wardrobe detail lock across motion
Lip sync + performance realism
Where it holds up, where it breaks, and what needs ADR/cleanup for broadcast polish
Synthetic voice detection thresholds and when to plan for voice replacement
Platform R&D
Seedance 2.0
Higgsfield CinemaScope/VFX/animation workflows
Higgsfield Vibecoding for GFX/text/animation
Seedream platform evaluation (comprehensive feature testing)
These are the repeatable decisions that come out of the testing framework:
Generation strategy
Default to start-frame image-to-video for control; pure text-to-video for exploration only
Treat generations as plates, then finish them; don't judge models only at native resolution
Finishing chains
Live-action enhancement: Topaz Video AI models (Proteus/Artemis/Gaia depending on footage profile)
GenAI finishing: Upscale first (evaluate detail vs hypersharpen), then Topaz cleanup, then controlled grain
Add grain strategically: reduces synthetic perfection while retaining detail (e.g., Silver Rich grain ~40)
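Topaz's cleanup and grain steps are product-specific, but the same "cleanup, then controlled grain" shape can be approximated with ffmpeg for quick previews. This sketch only assembles the command line; the `hqdn3d` and `noise` filter strengths are illustrative guesses of mine, not calibrated equivalents of the Silver Rich ~40 preset:

```python
def build_finishing_cmd(src: str, dst: str,
                        denoise: str = "hqdn3d=3:2:6:4",
                        grain: int = 12) -> list[str]:
    """Assemble an ffmpeg call: light denoise pass, then film-style noise."""
    # noise=alls sets grain strength (0-100); allf=t+u = temporal + uniform,
    # which avoids the frozen "static overlay" look of non-temporal noise.
    vf = f"{denoise},noise=alls={grain}:allf=t+u"
    return ["ffmpeg", "-i", src, "-vf", vf,
            "-c:v", "libx264", "-crf", "16", dst]

# To run for real: subprocess.run(build_finishing_cmd("plate.mov", "final.mp4"))
```

Even as a rough stand-in, this keeps the order of operations honest: denoise first, grain last, so the added texture sits on top of a clean image rather than amplifying compression artifacts.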
VFX integration
For sky replacement and complex removals, build a clean background plate first, then comp back in for maximum control
Kling Omni object removal workflow has become the go-to for background plate cleanup
Prompting optimization
JSON prompting for complex, multi-constraint scenarios
Multi-reference image packing to improve prompt specificity
Motion mapping from simple 3D wireframes when precise choreography is required
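As an illustration of JSON prompting for a multi-constraint scenario, a shot brief might look like the fragment below. The field names are hypothetical: actual schemas vary by model, and most treat this as structured text rather than a validated spec, so the value is in forcing constraints to be explicit and separable.

```json
{
  "subject": "tattooed forearm, macro close-up",
  "action": "bees crawl slowly across the skin; wings flick occasionally",
  "camera": { "move": "slow push-in", "lens": "100mm macro", "fps": 24 },
  "lighting": "warm key from camera left, soft fill, shallow depth of field",
  "constraints": [
    "tattoo linework stays locked across all frames",
    "no morphing or extra legs",
    "photoreal skin texture, no plastic sheen"
  ]
}
```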
If you're building a GenAI production capability, the fastest path to consistent quality is a testing framework that converts experimentation into repeatable workflows.
This methodology ensures:
Predictable results under deadline pressure
No surprises in post (known limitations documented upfront)
Scalable knowledge transfer (any team member can apply the same rubric)
Strategic tool selection based on shot requirements, not hype cycles
The framework adapts as tools evolve—but the evaluation criteria stay stable, which means production pipelines stay stable even as the AI landscape changes weekly.