Bringing Old Family Photos To Life With AI

A technical log of an AI-assisted family film: colour-managed restoration, image-to-video animation, identity preservation, and the...

Peter McCann Strain · 21 May 2026 · 17 min read
Computer Vision • Production AI

The brief was an 85th-birthday gift for my grandmother: take a family archive spanning roughly a century, restore the damaged scans, animate the photographs without losing the people in them, and assemble one finished film. The deliverable shipped: 8 minutes 34 seconds of finished film. But the system that produced it is not the system I designed. The gap between the two is the engineering.

Old photographs are fragile evidence. A model can sharpen a face and quietly turn it into the wrong person. It can colourise a coat with confidence and no historical basis. So the problem was never whether AI could make this look striking; it was whether a pipeline could make the film feel alive while keeping every generated detail labelled as reconstruction, not record.

Privacy and disclosure

This is a private, non-commercial family project, shared here as technical evidence with consent. I avoid public family names. The film uses AI-assisted restoration and animation; the original scans remain the source of truth. It is a birthday celebration, not commercial client work, and not a memorial.

The full film

Before the machinery, the artifact. Most shots open on a restored still and cross-fade into motion, so the original photographic moment stays visible. The animation is deliberately restrained: blinking, small head movement, minor expression changes, almost no camera theatrics. Identity preservation was the binding constraint, so the goal was to keep the same person rather than render a nice face. And the restoration is generative, a respectful reconstruction rather than documentary recovery.

The full 8:34 film, published as a portfolio-safe web encode with AI-assisted restoration and animation disclosed on-page.

Constraints and the ethical frame

The fixed constraints shaped the architecture. The source was scanned family photography, mostly high-resolution 16-bit TIFFs, arriving in several batches as the project ran. I wanted an API-first build: restoration callable from Python, video generation via hosted APIs, no manual GUI workflow. And there was a hardware mismatch from day one. The plan assumed at least 24 GB of GPU VRAM; the actual machine was a 16 GB RTX 5080. Memory pressure was part of the work, and it is part of why the heaviest local models turned out to be brittle.

Before writing code, I set five rules:

| Rule | Why it mattered |
| --- | --- |
| No invented speech or lip-sync | A moving mouth would imply a person said something they never said. |
| Subtle motion only | Bring photographs gently to life, not turn them into fictional scenes. |
| Originals stay immutable | The scan is the evidence; the restoration is an interpretation. |
| Cut anything that looks wrong | Emotional correctness matters more than filling every slot in the film. |
| Disclose AI assistance | Viewers should not have to guess what is restored, generated, or animated. |

The last rule has a known gap. I built a reusable AI-disclosure title component, titles.render_disclosure, but the final-assembly script never calls it, so the delivered family cut carries no on-screen disclosure card. That is why this public page discloses the AI assistance directly around the video. A wider-release cut would render the card into the film itself.

The planned architecture

The original design had four phases, ordered deliberately: repair before enhancement, recover face geometry before colourisation, colourise before upscaling, and keep identity checks outside any model that might hallucinate detail.

| Phase | Goal | Planned approach |
| --- | --- | --- |
| Restoration | Turn degraded scans into clean, colour, high-resolution images | A local six-stage pipeline: preprocess, defect inpaint, face restore, colourise, super-resolve, diffusion refine. |
| Animation | Turn each restored photo into a short clip | Compare hosted image-to-video providers on the same photographs. |
| Evaluation | Measure whether each step improved the image and kept identity | Full-reference metrics on synthetic degradation, no-reference metrics on real scans, ArcFace identity checks. |
| Assembly | Build one finished film | Programmatic FFmpeg composition, cross-fades, music, titles, review cuts. |

It was a solid, research-backed plan, and it did not survive intact. The build became a Python package, animate_old_photos, exposing one CLI tool, aop. But the restoration that shipped is not the restoration in that diagram.

AI-assisted photo restoration and animation pipeline from source scans through restoration, cataloguing, animation, verification, and film assembly

Preprocessing: the quiet part that worked

Preprocessing looks boring until it goes wrong. Most downstream models expect 8-bit sRGB. The scans were 16-bit TIFFs carrying embedded scanner ICC profiles. A naive conversion breaks the project before restoration starts: colours shift if a scanner profile is silently treated as sRGB, and smooth areas of faded skin or sky posterise if 16-bit tone values are simply truncated to 8-bit.

The preprocess module is profile-aware and precision-aware. It reads the TIFF with tifffile, extracts the embedded ICC profile and uses it to drive a colour-managed transform into sRGB with a perceptual rendering intent, and keeps a 16-bit sRGB intermediate for precision-sensitive stages alongside an 8-bit PNG for model compatibility. The 16-to-8-bit reduction uses random-noise (stochastic) dithering before truncation to avoid visible banding; the docstrings originally mislabelled this as Floyd-Steinberg or ordered dithering, and it is neither. EXIF orientation is applied, working images are capped at 2048 px on the long edge, and the full-resolution original is kept untouched in originals/.
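The dithering step is easy to get subtly wrong, so here is a minimal sketch of random-noise dithering on its own. It is pure Python for readability and works on a flat list of samples; the function name and the uniform-noise choice are assumptions for illustration, not the package's actual code, and the ICC transform is deliberately left out.

```python
import random


def dither_16_to_8(samples, rng=None):
    """Reduce 16-bit samples (0..65535) to 8-bit (0..255) with random-noise dithering."""
    rng = rng or random.Random()
    out = []
    for value in samples:
        # Rescale so that 65535 maps exactly to 255.0.
        scaled = value * 255 / 65535
        # Add uniform noise in [0, 1), then truncate. On average the result
        # equals `scaled`, so smooth gradients break up into fine grain
        # instead of visible bands.
        quantised = int(scaled + rng.random())
        out.append(min(255, max(0, quantised)))
    return out


# A 16-bit level that falls between two 8-bit codes (about 128.5).
levels = dither_16_to_8([33024] * 10_000, random.Random(1))
average = sum(levels) / len(levels)
```

Plain truncation would map every one of those samples to 128. With dithering the output is a mix of 128 and 129 whose average stays close to the original tone, which is what keeps faded skin and sky from posterising.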

Rotation correction had a neat solution. Instead of checking every image by hand, I ran a lightweight face detector, InsightFace buffalo_l, at all four 90° rotations and picked the orientation with the most anatomically sane detected faces. If no face was found, the image was left alone rather than guessed.
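The selection logic can be sketched without any model at all. In the sketch below the detector is passed in as a callable; `count_sane_faces` is a hypothetical stand-in for whatever wraps the real face detector and its landmark sanity checks, and the image is a plain list of rows rather than a pixel array.

```python
from typing import Callable, Optional


def rotate90(img):
    """Rotate a row-major grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def best_rotation(img, count_sane_faces: Callable[[list], int]) -> Optional[int]:
    """Return the clockwise rotation (degrees) with the most plausible faces.

    Returns None when no rotation yields a face, so the caller can leave
    the image untouched instead of guessing.
    """
    best_angle, best_count = None, 0
    candidate = img
    for angle in (0, 90, 180, 270):
        found = count_sane_faces(candidate)
        # Strict comparison: on a tie the smaller rotation wins.
        if found > best_count:
            best_angle, best_count = angle, found
        candidate = rotate90(candidate)
    return best_angle


# Toy detector: "sees" two faces only when the marker is in the top-left cell.
def toy_detector(grid):
    return 2 if grid[0][0] == "U" else 0


sideways = [[".", "."], ["U", "."]]
angle = best_rotation(sideways, toy_detector)
```

Returning None rather than 0 for the no-face case keeps "already upright" and "no evidence either way" distinguishable in the logs.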

This stage worked as designed and was never revisited. It is the cleanest expression of the project's posture: protect the source, preserve information, use ML as a cheap signal only where it genuinely helps.

The local restoration stack: built, then bypassed

The planned restoration pipeline had six local stages, each a Stage subclass that declares its model weights, runs the model, writes an output, and records provenance. The architecture was sound. The model integrations were uneven, and the cumulative result was a stack good enough to learn from but not good enough to ship.

  • Stage 2, defect masking and inpainting. SAM 2 generated candidate damage masks; MAT was meant to inpaint larger holes. Mask generation worked and produced output for 80 photos. MAT integration stayed fragile, and heavy structural inpainting was rarely the bottleneck anyway.
  • Stage 3, coarse face restoration. GFPGAN v1.4 (default) and CodeFormer (selectable) ran and produced outputs for 79 of the 80 photos that reached it. A face restorer can make a face sharper and still make it subtly wrong, which is exactly why identity stayed a separate measurement problem.
  • Stage 4, colourisation. The first hard failure. DDColor never produced a single output because of a chain of dependency and import failures: a missing modelscope package, an unregistered image-colorization task, a missing weights file, and a broken basicsr sub-module. The local colouriser failed on the 14 black-and-white photos that needed it most. This was the most consequential failure in the project and the direct trigger for the hosted pivot.
  • Stage 5, super-resolution. Two routes were planned: SUPIR as the flagship, Real-ESRGAN as the fallback. The genuine ControlNet-based SUPIR was never run; the implemented path was an SDXL img2img approximation that produced watercolour-soft results, not faithful enough for family photography. Real-ESRGAN, the supposedly lesser baseline, became the default and produced outputs for 66 of 80 photos. This is not evidence SUPIR is weak; it is evidence that a frontier method without its intended control machinery is a different, worse model.
  • Stage 6, diffusion face refinement. A posterior-mean rectified-flow (PMRF) stage was scaffolded and never run. The hosted pivot made it unnecessary before it mattered.
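The stage contract described above (declare weights, run, write an output, record provenance) can be sketched as a small abstract base class. This is an illustration under assumptions, not the package's code: the method names, the JSON sidecar layout, and the `Passthrough` stand-in stage are all hypothetical.

```python
import hashlib
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Stage(ABC):
    """One restoration step: declares weights, transforms a file, logs provenance."""

    name = "stage"
    weights: tuple = ()  # weight files the stage needs (hypothetical names)

    @abstractmethod
    def process(self, data: bytes) -> bytes:
        """Run the model on encoded image bytes and return the new bytes."""

    def run(self, src: Path, out_dir: Path) -> Path:
        out_dir.mkdir(parents=True, exist_ok=True)
        data = src.read_bytes()
        result = self.process(data)
        dst = out_dir / src.name
        dst.write_bytes(result)
        # Sidecar record: enough to trace any output back to its exact input.
        record = {
            "stage": self.name,
            "weights": list(self.weights),
            "input": src.name,
            "input_sha256": sha256(data),
            "output_sha256": sha256(result),
        }
        Path(f"{dst}.provenance.json").write_text(json.dumps(record, indent=2))
        return dst


class Passthrough(Stage):
    """Stand-in stage that copies bytes unchanged, so the sketch runs without a model."""

    name = "passthrough"

    def process(self, data: bytes) -> bytes:
        return data


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "scan.png"
    src.write_bytes(b"not a real image")
    out = Passthrough().run(src, Path(tmp) / "stage_demo")
    provenance = json.loads(Path(f"{out}.provenance.json").read_text())
```

Keeping the source file read-only and writing every output to a stage-specific directory is what lets a failed or bypassed stage be dropped without touching anything upstream.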

One unglamorous theme ran through all of it: obtaining model weights was itself an engineering problem. On 8 May alone there were four distinct weight-acquisition failures: a SAM 2 checksum mismatch, no published hash for CodeFormer, a MAT Google Drive link returning HTTP 404, and a gated InsightFace repo returning HTTP 401/404. A research pipeline's reproducibility is gated by the weakest hosting link in its model supply chain. A weight registry (source, expected size, licence, checksum where available, explicit mirror substitution) reduced the friction but did not remove it.
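A registry entry of that shape is small enough to sketch. The field names follow the list above, but the dataclass, the `verify` helper, and the placeholder URL and licence string are assumptions for illustration; the hash check is skipped when no hash was published, which mirrors the CodeFormer case.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WeightEntry:
    name: str
    source: str                      # where the file is fetched from
    expected_size: int               # bytes
    licence: str
    sha256: Optional[str] = None     # None when upstream publishes no hash
    mirror_of: Optional[str] = None  # set when a mirror replaces a dead link


def verify(entry: WeightEntry, blob: bytes) -> list:
    """Return a list of problems; an empty list means the download is acceptable."""
    problems = []
    if len(blob) != entry.expected_size:
        problems.append(f"size {len(blob)} != expected {entry.expected_size}")
    if entry.sha256 is not None and hashlib.sha256(blob).hexdigest() != entry.sha256:
        problems.append("sha256 mismatch")
    return problems


# Demo with a fake weights blob and a placeholder URL.
blob = b"demo-weights"
entry = WeightEntry(
    name="demo_model",
    source="https://example.invalid/demo_model.pth",
    expected_size=len(blob),
    licence="placeholder",
    sha256=hashlib.sha256(blob).hexdigest(),
)
clean = verify(entry, blob)
tampered = verify(entry, b"demo-weightz")
```

Recording `mirror_of` explicitly means a substituted download is visible in the registry instead of silently standing in for the original source.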

The FLUX pivot

By the end of the restoration build the choice was clear: keep debugging a local stack that was losing, or restore from a hosted model. I pivoted restoration to Black Forest Labs' FLUX 2 [pro], run as a hosted instruction-driven image editor directly on the original colour-managed scans.

The capability needed was specific: remove damage, rebalance faded tone, colourise black-and-white photographs, sharpen faces, and preserve identity, composition, pose, and the number and position of people. One hosted call replaced the local stages that were brittle, incomplete, or simply not good enough for the deadline. I built and compared two routes: polishing the local pipeline outputs (stage_07_polish) versus polishing the colour-managed originals directly (stage_07b_polish_raw). Polishing from originals won; the local route preserved upstream mistakes and gaps.

The stage_07b_polish_raw directory holds 89 restored renders, one per preprocessed photo plus re-runs of re-scanned and late arrivals, and FLUX 2 [pro] restoration succeeded for every photograph. Generation used a fixed English-paragraph RAW_PROMPT constant, a fixed random seed of 1234, and a 4-megapixel cap.
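The megapixel cap is simple arithmetic, sketched here as a helper that scales an image down to fit a pixel budget while keeping its aspect ratio. The function name is hypothetical, and treating "4 megapixels" as 4,000,000 pixels is an assumption; the hosted API's own rounding rules are not modelled.

```python
import math


def fit_to_pixel_budget(width: int, height: int, max_pixels: int = 4_000_000):
    """Return (width, height) scaled down, never up, to fit within max_pixels."""
    if width * height <= max_pixels:
        return width, height
    # Scaling both sides by s scales the area by s squared.
    scale = math.sqrt(max_pixels / (width * height))
    # Floor both sides so the product can never exceed the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


small = fit_to_pixel_budget(1000, 800)
large = fit_to_pixel_budget(4000, 3000)
```

Pairing a deterministic resize like this with the fixed prompt and fixed seed keeps re-runs comparable: if a render changes, the input did.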

The delivered film does not show a six-stage local restoration cascade. It shows a hosted FLUX 2 [pro] instruction-edit pass over colour-managed source scans. The local pipeline was real, engineered, and tested, and largely unexercised in the result that shipped.

Diagram comparing the built local restoration stack with the hosted FLUX restoration path used for the delivered film

The hard objection to the pivot is that it makes most of the build dead weight. I built a six-stage restoration cascade, a weight registry, and a metric harness, and the film still shipped on a single hosted API call that any competent engineer could have wired up in an afternoon. That objection is largely correct on cost: measured against this one deliverable, the local stack was over-engineered, and a hosted-first plan would have reached the same film faster and cheaper. Where it stops holding is on what the local work actually bought. The colour-managed preprocessing that fed FLUX, the identity registry that constrained every animation prompt, and the seven-metric scaffolding that taught me which failures to watch for were all built in the local pass and all survived the pivot. A hosted model is rentable; the judgement about what to send it, and how to check what comes back, is not. The over-engineering was in the restoration cascade specifically, and that is the part the pivot retired.

Making the collection queryable

A pile of restored images is not a film. The middle layer turned the archive into something structured enough to reason over. It combined cheap heuristics, four model families, and a clustering step, each used for its best-fit job rather than asking one model to do everything.

| Tool | Job in the pipeline |
| --- | --- |
| Cheap heuristics | A NumPy black-and-white test, a damage score (high-frequency spikes, saturation drop, blockiness), and a 64-bit DCT perceptual hash for near-duplicates. |
| CLIP | Zero-shot scene classification, decade priors, photographic-medium guesses. |
| BLIP-2 | One-line captions for prompt templates and review. |
| ArcFace / InsightFace | Face detection, landmark checks, and 512-dimensional identity embeddings. |
| HDBSCAN | Density-based clustering of face embeddings into person identities, without pre-specifying how many people exist. |
| LLaVA-NeXT | A local VLM in 4-bit quantisation, running two structured-question passes per photo (a quality audit and a deep content audit) into a JSON sidecar. |
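For the near-duplicate check, the comparison side of a 64-bit perceptual hash is just a Hamming distance. The sketch below assumes the hashes have already been computed and stored as integers; the file names, the threshold of 6 bits, and the helper names are illustrative assumptions, and the DCT step itself is not shown.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def near_duplicates(hashes: dict, max_distance: int = 6) -> list:
    """Return pairs of names whose hashes differ by at most max_distance bits."""
    names = sorted(hashes)
    return [
        (first, second)
        for i, first in enumerate(names)
        for second in names[i + 1:]
        if hamming(hashes[first], hashes[second]) <= max_distance
    ]


base = 0xFFFF0000FFFF0000
hashes = {
    "a.tif": base,
    "b.tif": base ^ 0b111,               # 3 bits differ: likely a re-scan
    "c.tif": base ^ 0xFFFFFFFFFFFFFFFF,  # every bit differs: unrelated
}
pairs = near_duplicates(hashes)
```

The all-pairs loop is quadratic, which is fine for an archive of under a hundred photographs and would need an index for anything much larger.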

The people registry mattered most. Identity had to become a first-class, queryable object, not a vague hope. ArcFace detected faces and embedded them; HDBSCAN grouped the embeddings into likely people while leaving ambiguous faces as noise; promoted clusters became person records in a small SQLite registry; an HTML contact sheet went out for human naming and merging; the frozen identities then fed the animation stage. CLIP decade estimates stayed rough priors and VLM descriptions needed human judgement, but as software design this was the layer that made the film possible.
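The promotion step can be sketched with the standard-library sqlite3 module. The schema, table names, and the `min_faces` rule below are assumptions for illustration rather than the project's actual registry; the one grounded detail is that HDBSCAN labels noise points as -1, so those faces stay unassigned.

```python
import sqlite3

SCHEMA = """
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL,          -- machine label until a human names it
    display_name TEXT
);
CREATE TABLE face (
    id INTEGER PRIMARY KEY,
    photo TEXT NOT NULL,
    cluster INTEGER NOT NULL,     -- clustering label, -1 means noise
    person_id INTEGER REFERENCES person(id)
);
"""


def promote_clusters(conn: sqlite3.Connection, min_faces: int = 2) -> int:
    """Create one person record per sufficiently large cluster; return how many."""
    clusters = conn.execute(
        "SELECT cluster FROM face WHERE cluster >= 0 "
        "GROUP BY cluster HAVING COUNT(*) >= ? ORDER BY cluster",
        (min_faces,),
    ).fetchall()
    for (cluster,) in clusters:
        cur = conn.execute(
            "INSERT INTO person (label) VALUES (?)", (f"cluster_{cluster}",)
        )
        conn.execute(
            "UPDATE face SET person_id = ? WHERE cluster = ?",
            (cur.lastrowid, cluster),
        )
    conn.commit()
    return len(clusters)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO face (photo, cluster) VALUES (?, ?)",
    [("p1.png", 0), ("p2.png", 0), ("p3.png", 1), ("p4.png", -1)],
)
promoted = promote_clusters(conn)
unassigned = conn.execute(
    "SELECT COUNT(*) FROM face WHERE person_id IS NULL"
).fetchone()[0]
```

Leaving `display_name` empty until the contact sheet comes back keeps the machine's grouping and the human's naming as separate, auditable steps.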

Animation: four backends, one production choice

The animation system was provider-agnostic, with integrations for Google Veo 3.1, Runway Gen-4.5, Kling 3.0, and ByteDance Seedance 1.0 behind a shared dispatcher with cost tracking, rate limiting, and retries. The planned head-to-head on identical photos was never run at scale: Veo, Runway, and Seedance stayed test-only (Seedance produced a single test clip), and Kling 3.0 carried the entire production run. Every one of the 147 billed generations was a Kling job.
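A shared dispatcher of this kind can be sketched in a few dozen lines. Everything named here is hypothetical: the `Dispatcher` class, the `TransientError` type, the backend call signature, and the rule that cost is recorded only for successful jobs are assumptions, and no real provider SDK is called.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


class TransientError(Exception):
    """Raised by a backend for failures worth retrying (timeouts, HTTP 5xx)."""


@dataclass
class Dispatcher:
    backends: Dict[str, Callable[[str, str], str]]  # provider -> (image, prompt) -> clip
    cost_per_job: Dict[str, float]
    max_attempts: int = 3
    min_interval_s: float = 0.0                     # crude rate limit between calls
    spent: float = 0.0
    _last_call: float = field(default=float("-inf"), init=False, repr=False)

    def generate(self, provider: str, image: str, prompt: str) -> str:
        backend = self.backends[provider]
        for attempt in range(1, self.max_attempts + 1):
            wait = self.min_interval_s - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
            self._last_call = time.monotonic()
            try:
                clip = backend(image, prompt)
            except TransientError:
                if attempt == self.max_attempts:
                    raise
                continue
            # Assumption: only successful generations are billed.
            self.spent += self.cost_per_job[provider]
            return clip
        raise RuntimeError("max_attempts must be at least 1")


# Demo backend that fails once, then succeeds.
calls = {"n": 0}


def flaky_backend(image: str, prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientError("simulated timeout")
    return f"clip-for-{image}"


dispatcher = Dispatcher(backends={"demo": flaky_backend}, cost_per_job={"demo": 0.5})
clip = dispatcher.generate("demo", "photo_001.png", "subtle motion, locked camera")
```

Because each provider is just a callable behind the same signature, dropping three of the four backends from production needed no change to the calling code.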

Kling won on a specific lever. It exposes a cfg_scale guidance dial, which I set low, 0.35 against a default of 0.5, so the model trusts the input image over the text prompt. For this project that mattered more than cinematic camera motion. The goal was not to invent action; it was to keep the person.

The real animation problem: identity drift

The first clips made the failure mode obvious. The models could produce smooth, attractive motion easily. The hard part was stopping them from regenerating the face.

| Failure mode | What it looked like |
| --- | --- |
| Identity drift | A face became subtly different frame by frame. |
| Smile decay | A warm expression collapsed into neutrality after a second or two. |
| Pose collapse | People in groups shifted backwards, slumped, or warped. |
| Uncanny motion | Breathing, blinking, or mouth movement became too theatrical. |
| Frame loss | Aspect-ratio handling cropped people near the edges. |

I ran five rounds of prompt iteration. v1 used generic living-portrait templates; v2 built per-photo prompts from the catalogue and audit layer; v3 added aggressive identity-lock language plus 35 individually rewritten retry prompts; v4 targeted 7 stubborn clips with smile or pose failures; v5 handled final tuning, late images, and provider-specific aspect-ratio fixes. The prompts became less poetic and more operational: preserve face, hair, age, gaze, expression, and pose; keep the camera locked; no lip-sync, no new people, no face morphing, no exaggerated motion. The iteration paid off: 34 of the 80 scored photos (42.5%) were won by a v3-or-later revision, 27 by v3 and 7 by v4.

Identity-preservation loop showing prompt iteration, Kling generation, ArcFace checks, VLM-assisted review, and keep-or-cut decisions

The most practical bug had nothing to do with model intelligence. Kling silently centre-crops images that do not match a preferred aspect ratio. In a prompt demo that is a minor aesthetic issue; in a family photograph it can remove a person from the edge of the frame. The fix was a pad-and-crop sandwich: pad the image to a supported ratio with a softly blurred extension of its own edges before generation, then crop back afterward using the recorded padding metadata. Small systems detail, but it is what separates a demo from a media pipeline you can trust with someone's family.
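The geometry of the pad-and-crop sandwich can be sketched independently of any image library. The sketch below covers only the bookkeeping: how much canvas to add, and where the original sits in the generated frame. The names are hypothetical, the 16:9 target is an example ratio, and filling the padding with a blurred edge extension is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Padding:
    width: int      # original image size
    height: int
    padded_w: int   # canvas size sent to the video model
    padded_h: int
    left: int       # offset of the original inside the canvas
    top: int


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def plan_padding(width: int, height: int, ratio_w: int, ratio_h: int) -> Padding:
    """Centre the image on the smallest canvas with the requested aspect ratio."""
    if width * ratio_h < height * ratio_w:
        # Image is narrower than the target: add columns left and right.
        padded_w, padded_h = ceil_div(height * ratio_w, ratio_h), height
    else:
        # Image is wider than (or equal to) the target: add rows top and bottom.
        padded_w, padded_h = width, ceil_div(width * ratio_h, ratio_w)
    return Padding(width, height, padded_w, padded_h,
                   (padded_w - width) // 2, (padded_h - height) // 2)


def crop_box(p: Padding, out_w: int, out_h: int):
    """Box (left, top, right, bottom) of the original inside an out_w x out_h frame."""
    sx, sy = out_w / p.padded_w, out_h / p.padded_h
    return (round(p.left * sx), round(p.top * sy),
            round((p.left + p.width) * sx), round((p.top + p.height) * sy))


plan = plan_padding(1000, 1000, 16, 9)   # square photo, 16:9 target
box = crop_box(plan, 1920, 1080)         # where to crop each generated frame
```

Storing the `Padding` record next to the clip is what makes the crop-back step exact rather than a guess about what the provider did to the frame.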

Verification

The evaluation plan had two halves. The quantitative half was a seven-metric battery built on the pyiqa library: full-reference metrics (LPIPS, DISTS) usable only on a synthetic degraded test set, no-reference metrics (MUSIQ, CLIP-IQA+, QualiCLIP, a face-specific TOPIQ variant) for real scans, and an ArcFace cosine-similarity identity check before and after each processing step. None of the image-quality metrics answers the question that matters most here: is this still the same person? The ArcFace check is the only one that speaks to it, and only as a proxy.
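The identity check reduces to a cosine similarity between two embedding vectors. A minimal sketch, assuming the embeddings have already been extracted: the helper names are hypothetical, the 0.5 threshold is an illustrative assumption rather than a value from the project, and the toy vectors are three-dimensional where real ArcFace embeddings have 512 dimensions.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-length embedding")
    return dot / (norm_a * norm_b)


def identity_preserved(before, after, threshold: float = 0.5) -> bool:
    """True when the face after a processing step still matches the face before it."""
    return cosine_similarity(before, after) >= threshold


original = [0.9, 0.1, 0.4]
restored = [0.88, 0.12, 0.41]   # small change: same person
drifted = [-0.2, 0.9, -0.3]     # large change: identity drift
```

Running the check before and after every step, rather than once at the end, is what localises a drift to the stage that caused it.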

The harness was built and tested, but it never became the decision tool. The actual day-to-day decider was a vision-language-model judge, Gemini where configured and otherwise local LLaVA-NeXT, scoring each clip 1 to 10 on identity, mood, naturalness, and a composite.

| Evidence | Meaning |
| --- | --- |
| Mean composite 7.9/10 | A VLM judge's clip-quality score, across the 80 photos it scored, not a human audience rating. |
| Mean identity 8.31/10 | The model-judged identity signal was strong enough to support curation. |
| 0 clips scored below 7 | The final selected set cleared the model-review threshold. |
| 34 of 80 shots won by a v3+ revision | The prompt-iteration loop made a material difference. |

A 7.9 composite is a VLM's opinion, not objective truth, and it does only one job here: it gave the project a review loop that caught failures and kept the quality claim bounded to model-assisted review rather than a human panel.
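The keep-or-cut step that sits on top of those scores can be sketched as "best revision per photo, then a floor". The `Review` record, the tie-break on identity, and the sample scores are all invented for illustration; only the 1-to-10 scale and the threshold of 7 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    photo: str
    revision: str      # prompt revision that produced the clip, e.g. "v3"
    identity: float    # judge scores on a 1-10 scale
    composite: float


def select_winners(reviews, floor: float = 7.0) -> dict:
    """Keep the best-scoring clip per photo; cut photos whose best clip is below the floor."""
    best = {}
    for review in reviews:
        current = best.get(review.photo)
        # Higher composite wins; identity breaks ties.
        if current is None or (review.composite, review.identity) > (
            current.composite, current.identity
        ):
            best[review.photo] = review
    return {photo: r for photo, r in best.items() if r.composite >= floor}


# Illustrative scores only, not project data.
reviews = [
    Review("photo_01", "v2", identity=6.0, composite=6.5),
    Review("photo_01", "v3", identity=8.5, composite=8.0),
    Review("photo_02", "v2", identity=5.0, composite=5.5),
]
winners = select_winners(reviews)
```

In this toy run photo_01 is kept on its v3 clip and photo_02 is cut, which is the "cut anything that looks wrong" rule expressed as code.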

Assembly

The final film ran 8 minutes 34 seconds, composed from a curated set of restored and animated photographs. A narrative-ordering script sorted shots primarily by decade, then scene priority, then mood priority. Each shot is a small composition: the restored still is held briefly, letterboxed onto a 1920×1080 canvas with a blurred fill so the original framing stays visible, then cross-faded into its Kling clip; shots join with half-second cross-fades.
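Both the ordering rule and the cross-fade timing are small enough to sketch. The `Shot` fields, the convention that lower priority numbers sort first, and the sample durations are assumptions; the offsets are computed the way a chain of FFmpeg xfade filters expects them, as the time in the running output at which each transition starts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    name: str
    decade: int
    scene_priority: int   # assumption: lower number sorts earlier
    mood_priority: int
    duration: float       # seconds, still plus clip


def narrative_order(shots):
    """Sort by decade first, then scene priority, then mood priority."""
    return sorted(shots, key=lambda s: (s.decade, s.scene_priority, s.mood_priority))


def crossfade_timeline(durations, fade: float = 0.5):
    """Return (transition start offsets, total runtime) for chained cross-fades."""
    offsets, elapsed = [], 0.0
    for duration in durations[:-1]:
        # Each cross-fade overlaps the tail of one shot with the head of the next.
        elapsed += duration - fade
        offsets.append(elapsed)
    total = elapsed + durations[-1] if durations else 0.0
    return offsets, total


shots = [
    Shot("wedding", 1960, 1, 2, 6.0),
    Shot("school", 1950, 2, 1, 6.0),
    Shot("garden", 1950, 1, 3, 6.0),
]
ordered = [s.name for s in narrative_order(shots)]
offsets, total = crossfade_timeline([s.duration for s in narrative_order(shots)])
```

Each half-second overlap shortens the film by the fade length, so n shots lose (n - 1) × 0.5 seconds against the sum of their durations. That matters when the music bed has to be trimmed to an exact runtime.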

Four cuts were produced over 36 hours: v1 at roughly 8:05, v2 adding 4K masters, v3 polishing the ordering, v4 the final cut with the music bed. Audio is a single licensed instrumental, "Amberlight", looped and trimmed to the exact 514-second runtime. The delivered master is birthday_nancy_v4_4k_music.mp4, a 4K HEVC file, with a 1080p H.264 cut alongside for compatibility. The film opens on a birthday intro card and closes on a written tribute.

Final film assembly flow from restored stills and generated clips through ordering, transitions, music, titles, and web publication

Results

| Artifact / result | Result |
| --- | --- |
| Photos preprocessed | 87 distinct images, colour-managed |
| Hosted restoration renders | 89, including re-runs and late additions; FLUX 2 [pro] succeeded on every photo |
| Final film composition | A curated set of restored and animated photographs, ordered by decade and narrative priority |
| Runtime | 8:34 (514 s) |
| Billed video generations | 147 Kling 3.0 jobs |
| Software | One animate_old_photos package behind a single aop CLI |
| Test suite | Covers preprocessing, catalogue, animation, and assembly, with identity checks and parametrised degradation variants |
| Test status | 246 passing, 7 failing |

The seven failing tests map the divergence between plan and shipped route. One, test_default_model_choice_is_supir, fails because the super-resolution default was deliberately switched to Real-ESRGAN after the SUPIR route under-delivered. The other six belong to the never-completed face-refinement stage and prompt module. In a polished product I would update or delete them; in a technical log they earn their place, marking exactly where the design diverged from what reached the film.

Limitations

The local six-stage pipeline is largely unexercised in the delivered film; restoration shipped on hosted FLUX 2 [pro]. DDColor colourisation failed outright with zero outputs. The real ControlNet-based SUPIR never ran; an SDXL approximation stood in for it. The PMRF face-refinement stage never ran. The four-provider video comparison was never executed at scale; only Kling carried production. The quality scores are model-judged, not human-rated. CLIP decade labels are approximate priors, and a few photographs are visibly mis-dated. FLUX 2 [pro] is generative, so a colourised garment or sharpened texture is a plausible reconstruction, not a record of fact. The delivered cut carries no on-screen AI-disclosure card; the component exists but was never wired in.

A beautiful architecture is not a delivered film

The film worked: a complete 8:34 birthday film, warm and recognisable and alive, without pretending to recover the literal past. But the thesis is the gap the opening promised: a beautiful architecture and a successful deliverable are not the same thing. The local, metric-driven restoration cascade was the right design for a repeatable research system and the wrong thing to defend against a deadline, and the film got better the moment I moved restoration to a hosted frontier model and started measuring identity instead of hoping for it. The parts that survived the pivot (the colour-managed preprocessing, the identity registry, the review loop) were never really about restoration at all. They were about knowing what to send a rentable model, and how to check what it sent back. That judgement is the part you cannot rent, and it is the whole project: a pipeline built carefully enough to help someone feel something real from photographs they already loved.
