MARTYN WATTS


AI Production Methodology

A systematic model-evaluation framework for choosing the right generative tools and finishing workflows, optimized for Quality, Consistency, and Deliverability.

Generative AI changes weekly; a reliable production pipeline needs a stable methodology. This page documents the evaluation framework I use to test new models, compare outputs, and convert promising generations into broadcast-ready deliverables through repeatable finishing chains.

Default target spec: 4K delivery; 24 or 30 fps; vertical or horizontal versions as needed.
Tools named publicly: Google AI Studio, Kling, Higgsfield (with supporting post tools where relevant).


How I score (1–10 rubric)
Every test receives a composite score across Quality, Consistency, and Deliverability. 8+ = production-ready; 7 = needs finishing/has known limitations; 6 and below = R&D only.

Quality (Does it look real?)

  • 10–9: Indistinguishable from live-action or high-end VFX; motion/performance/lighting all read naturally; no visible AI artifacts

  • 8: Broadcast-ready with minor finishing (controlled grain, light cleanup); passes client/director review without explanation

  • 7: Good foundation but needs targeted post work (sky fixes, edge cleanup, ADR); "you can tell, but it works"

  • 6 and below: Synthetic tells dominate (warping, physics breaks, plastic skin); useful for R&D/previs only

Consistency (Can it stay the same?)

  • 10–9: Zero drift; identity/objects/wardrobe locked across all frames; cuts seamlessly into multi-shot sequences

  • 8: Minor temporal texture crawl or subtle flicker that finishing can address; character/object identity holds

  • 7: Noticeable drift on reveal-from-occlusion or fast motion; usable if you limit shot length or add reference conditioning

  • 6 and below: Identity breaks, morphing, multi-character tracking fails; not reliable for continuity-dependent work

Deliverability (Can it ship?)

  • 10–9: Drops into editorial clean; survives upscale/grade/QC with zero manual repair; reproducible under deadline

  • 8: Requires standard finishing chain (upscale → cleanup → grain) but integrates predictably; timeline-friendly

  • 7: Needs shot-specific fixes (banding paint-outs, edge roto, stabilization); adds post overhead but shippable

  • 6 and below: Unpredictable finishing behavior, high manual repair cost, or can't hit spec (resolution/fps/aspect); blocks pipeline
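The rubric above can be sketched as a small scoring helper. One assumption not stated on the page: I treat the composite verdict as gated by the weakest axis (a shot that fails Deliverability isn't production-ready no matter how good it looks). The class and method names are illustrative, not part of any tool.

```python
from dataclasses import dataclass

@dataclass
class ShotScore:
    quality: int         # 1-10: does it look real?
    consistency: int     # 1-10: can it stay the same?
    deliverability: int  # 1-10: can it ship?

    def verdict(self) -> str:
        # Assumption: the weakest axis gates the verdict
        low = min(self.quality, self.consistency, self.deliverability)
        if low >= 8:
            return "production-ready"
        if low == 7:
            return "needs finishing / known limitations"
        return "R&D only"

# Experiment 1 scores from this page: Quality 8, Consistency 8, Deliverability 9
print(ShotScore(quality=8, consistency=8, deliverability=9).verdict())  # prints: production-ready
```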


How I test (repeatable protocol)

I run tests like production would: same creative goal, consistent inputs, then finish to delivery.

1) Standardize inputs

  • Use a start frame when possible (image-to-video) for tighter control than pure text-to-video

  • Keep prompt intent constant and vary parameters intentionally (not randomly)

2) A/B comparisons

  • Run the same shot concept across multiple tools/workflows

  • Evaluate tradeoffs (stability vs detail vs control) rather than searching for a single "best model"

3) Finish to delivery

  • Treat base generations as plates; measure how they hold up after a finishing chain

  • Output in the target spec (4K; 24/30) so results reflect real-world deliverables

4) Document + transfer

  • Maintain comparison matrices and before/after visual libraries; this page is also part of that documentation system
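One lightweight way to keep the comparison matrices in step 4 machine-readable is a flat CSV log with one row per test run. The field names here are my own illustration, not a schema from any tool:

```python
import csv
import io

# Illustrative comparison-matrix fields (an assumption, not an official schema)
FIELDS = ["shot_concept", "tool", "quality", "consistency", "deliverability", "notes"]

def log_test(rows: list, **result) -> None:
    """Append one test run, filling any missing field with an empty string."""
    rows.append({f: result.get(f, "") for f in FIELDS})

rows = []
log_test(rows, shot_concept="same-shot A/B", tool="tool_A",
         quality=8, consistency=8, deliverability=9, notes="stable, less detail")
log_test(rows, shot_concept="same-shot A/B", tool="tool_B",
         quality=9, consistency=7, deliverability=8, notes="detailed, drifts")

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Same creative goal, same fields, every run: that is what makes the A/B comparisons in step 2 legible months later.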


Experiment 1: Cinematic 4K finishing chain (deliverability test)

Problem: A base generation can look great at native resolution but still read synthetic once you push it to delivery specs.

Test goal: Convert a strong image-to-video generation into something that holds up as a cinematic 4K shot.

Method:

  • Generate with a start frame (control-first)

  • Upscale to 4K (evaluate "natural detail" vs "hypersharpen" failure modes)

  • Cleanup pass to reduce compression/artifacting, then add controlled grain to avoid plastic perfection

Key finding: A repeatable "generate → upscale → cleanup/grain" chain consistently improves realism and compresses finishing time—but some issues (like sky banding) still require targeted fixes rather than assuming upscaling will solve everything.

Scores:

  • Quality: 8 (broadcast-ready with grain/cleanup)

  • Consistency: 8 (temporal stability holds through finishing)

  • Deliverability: 9 (predictable, timeline-friendly)

Visuals to show:

  • Slider/GIF: base → upscaled → final finished

  • Settings card: the finishing preset used
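The finishing chain from this experiment can be expressed as ordered, composable steps. The functions below are placeholders that just record what was applied; in a real pipeline each would call the actual tool (upscaler, cleanup pass, grain plugin). The ~40 grain value follows the Silver Rich setting mentioned later on this page.

```python
# Placeholder steps that record what was applied to a clip (a sketch, not real tooling)
def upscale_4k(ops):
    return ops + ["upscale to 4K (watch for hypersharpening)"]

def cleanup(ops):
    return ops + ["compression/artifact cleanup"]

def add_grain(ops, amount=40):
    return ops + [f"controlled grain ~{amount}"]

def finishing_chain(base):
    # Order matters: upscale first so cleanup and grain act on delivery resolution
    return add_grain(cleanup(upscale_4k(base)))

for step in finishing_chain(["base generation (start-frame i2v)"]):
    print(step)
```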


Experiment 2: Upscaler comparison—AI/VFX clips vs documentary/archival

Problem: Upscaling isn't a universal win—"more resolution" can reduce authenticity depending on source material.

Test goal: Compare two upscalers across (a) AI/VFX-style clips and (b) archival/doc footage.

Method:

  • Run the same source through both upscalers

  • Judge not just sharpness, but believability (avoid hypersharpen), artifact profile, and whether the result supports the story

Key findings:

  • For AI/VFX-style clips, one upscaler produced consistently cleaner, more believable detail while the other could skew hypersharpened

  • For archival/doc footage, both can reduce grain, but that can work against storytelling authenticity—sometimes the right call is minimal restoration

  • Cost/speed tradeoffs matter: one upscaler is significantly faster and more cost-effective for high-volume work

Scores (AI/VFX clips):

  • Upscaler A: Quality 9, Deliverability 9

  • Upscaler B: Quality 7 (hypersharpen risk), Deliverability 8

Visuals to show:

  • Two sliders: AI clip (A vs B), archival clip (A vs B)

  • Callout box: "When not to upscale"


Experiment 3: Hybrid VFX test—bees on skin (real plate enhancement)

Problem: We needed realistic bees crawling over tattooed skin—without live insects on set and without a heavy manual VFX timeline.

Test goal: Build a repeatable hybrid workflow that preserves the live-action plate while adding convincing insect behavior.

Workflow tested (plate-first):

  1. Color-grade the original footage (so AI inherits the intended look)

  2. Pull a hero still from the graded plate

  3. Generate multiple AI still candidates (bees placement/texture); iterate prompts

  4. Use the best still as the start frame for video generation; iterate prompts/settings

  5. Upscale/finish to 4K; integrate into edit; final color
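The five plate-first steps read naturally as a pipeline where each stage enriches a shot record. The stage functions are placeholders standing in for the real grade/generate/finish tools:

```python
# Placeholder stages for the plate-first workflow above (illustrative only)
def grade(shot):       return {**shot, "graded": True}
def pull_still(shot):  return {**shot, "hero_still": "best frame from graded plate"}
def gen_stills(shot):  return {**shot, "still_candidates": 4}
def animate(shot):     return {**shot, "video": "image-to-video from chosen still"}
def finish(shot):      return {**shot, "delivery": "4K, graded"}

def plate_first(shot):
    for stage in (grade, pull_still, gen_stills, animate, finish):
        shot = stage(shot)
    return shot

result = plate_first({"plate": "bees-on-skin close-up"})
print(result["delivery"])
```

The point of the ordering is that the AI inherits the intended look: grading happens before any generation, and the chosen still pins the video pass.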

What I learned:

  • Image generation tool choice dramatically impacts downstream video quality—some models produce stronger "hero stills" for realism and detail

  • For close-ups, workflow-based pipelines produce the most lifelike behavior at the best speed/cost tradeoff

  • For wider shots, other engines yield more organic movement, so the tool and strategy combination should change based on the scale and detail in frame; this close-up approach doesn't transfer directly

Scores (close-up version):

  • Quality: 8 (convincing on close inspection, minor finishing)

  • Consistency: 8 (insect behavior stable, tattoo lock maintained)

  • Deliverability: 9 (fast iteration, predictable finishing)

Visuals to show:

  • 4-step strip: original plate → AI still exploration → animated pass → final composite

  • 2×2 grid: same prompt intent across different generators

  • Prompt screenshots: iteration trail


Active Lab (Winter 2026)

What I'm actively testing right now (and why it matters):

Motion control stress testing

  • Fight scenes, action beats, standard blocking, facial movement and acting nuance

  • Testing against complex backgrounds and fast motion to map failure boundaries

  • Multi-character scenes to identify when models lose secondary subjects

Object consistency edge cases

  • Reveal-from-occlusion drift (when the model must "discover" parts of a subject)

  • When Elements/reference conditioning is required to prevent identity drift

  • Tattoos, logos, and wardrobe detail lock across motion

Lip sync + performance realism

  • Where it holds up, where it breaks, and what needs ADR/cleanup for broadcast polish

  • Synthetic voice detection thresholds and when to plan for voice replacement

Platform R&D

  • Seedance 2.0

  • Higgsfield CinemaScope/VFX/animation workflows

  • Higgsfield Vibecoding for GFX/text/animation

  • Seedream platform evaluation (comprehensive feature testing)


Workflow "rules of thumb"

These are the repeatable decisions that come out of the testing framework:

Generation strategy

  • Default to start-frame image-to-video for control; pure text-to-video for exploration only

  • Treat generations as plates, then finish them; don't judge models only at native resolution

Finishing chains

  • Live-action enhancement: Topaz Video AI models (Proteus/Artemis/Gaia depending on footage profile)

  • GenAI finishing: Upscale first (evaluate detail vs hypersharpen), then Topaz cleanup, then controlled grain

  • Add grain strategically: reduces synthetic perfection while retaining detail (e.g., Silver Rich grain ~40)

VFX integration

  • For sky replacement and complex removals, build a clean background plate first, then comp back in for maximum control

  • Kling Omni object removal workflow has become the go-to for background plate cleanup

Prompting optimization

  • JSON prompting for complex, multi-constraint scenarios

  • Multi-reference image packing to improve prompt specificity

  • Motion mapping from simple 3D wireframes when precise choreography is required
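As an example of JSON prompting for a multi-constraint shot, here is the shape such a prompt can take, serialized from Python. The keys are my own illustration; no model documents this exact schema:

```python
import json

# Illustrative multi-constraint prompt (keys are assumptions, not a documented schema)
prompt = {
    "subject": "bees crawling over tattooed skin",
    "camera": {"framing": "close-up", "movement": "slow push-in"},
    "constraints": [
        "tattoo detail locked across motion",
        "no warping or morphing",
        "natural insect behavior",
    ],
    "references": ["graded hero still as start frame"],
    "style": "cinematic, photoreal",
}
print(json.dumps(prompt, indent=2))
```

Structuring the constraints as a list rather than burying them in a prose sentence is what makes this approach useful for complex scenarios: each constraint can be toggled or varied intentionally between runs.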


Why This Matters

If you're building a GenAI production capability, the fastest path to consistent quality is a testing framework that converts experimentation into repeatable workflows.

This methodology ensures:

  • Predictable results under deadline pressure

  • No surprises in post (known limitations documented upfront)

  • Scalable knowledge transfer (any team member can apply the same rubric)

  • Strategic tool selection based on shot requirements, not hype cycles

The framework adapts as tools evolve—but the evaluation criteria stay stable, which means production pipelines stay stable even as the AI landscape changes weekly.

© 2023 K Communications LLC. All Rights Reserved