Satellite Pose Estimation and the Sim-to-Real Domain Gap

For the target client, there were no labelled images of the spacecraft in space. None. That single fact shaped every decision in the satellite pose-estimation work I did at Astroscale.

Pose estimation for orbital rendezvous is a 6-degree-of-freedom problem: a vision system has to recover where a target spacecraft is and how it is oriented, from a camera that will see it under lighting and geometry the training set never contained. The engineering problem was never "pick a good pose network." It was that the data you can get cheaply is not the data the model has to survive.

Three pieces of work attacked that gap from three directions, and they reached three different levels of maturity. A preprocessing investigation — a completed study with measured trade-offs. SPS-B — a multi-task architecture proposal that was never trained. And AstroGAN — implemented code that produced one hard result and ran straight into the limit of what that result could prove. I will take them in that order, because that is the order the problem forces.

Why synthetic data is unavoidable and untrustworthy

You cannot photograph enough real spacecraft in orbit to train a pose network. So you render. A synthetic pipeline gives you exact pose labels, controllable lighting, and as much camera variation as compute allows. That is the case for synthetic data, and it is a strong one.

The problem is what the renderer quietly teaches the model. A network trained on synthetic imagery does not learn "spacecraft pose." It learns the joint distribution of synthetic edges, synthetic contrast, synthetic shadow falloff, and synthetic sensor response, and then it is asked to reason about a camera domain with none of those properties. The gap between the renderer and the real sensor is the failure mode, not a finishing touch.

So before any architecture work, the right question is concrete: what actually differs between the domains, and by how much?

Measuring the gap before modelling it

The preprocessing investigation started there. It used the SPEED+ dataset, built for exactly this: the same spacecraft target captured across three domains — synthetic renders, lightbox lab imagery, and sunlamp lab imagery — all at 1920×1200.

Rather than assume the domains differed, the investigation computed per-domain image statistics: colour intensity, contrast, dynamic range, entropy, and edge count. The numbers make the gap legible.

SPEED+ domain statistics

The domain gap, measured before any model touched it

All three domains are 1920×1200 images of the same target. The numbers are per-domain image statistics — the same scene, three measurably different distributions.

Synthetic: mean intensity 41.27, mean entropy 5.25, mean edge count 29,431. The renderer domain. Cheap to label, dense in edge structure, but its own visual signature.

Lightbox: mean intensity 54.99, mean entropy 5.78, mean edge count 26,889. A lab capture closest to synthetic in edge content, and the domain that tuned best.

Sunlamp: mean intensity 46.83, mean entropy 4.22, mean edge count 6,764. Harsh directional light, with roughly a quarter of the edge content; the hardest domain to close.

Sunlamp carries roughly a quarter of lightbox's edge content and the lowest entropy of the three. A pose model trained on synthetic edges has the most to lose there — which is exactly where the domain-specific tuning trade-off showed up.

The edge-count figures are the ones that change how you think about the problem. Sunlamp's harsh directional light produces a mean edge count of 6,764, roughly a quarter of lightbox's 26,889 and synthetic's 29,431. A pose network trained on synthetic edge structure has, in a precise sense, the most to lose in the sunlamp domain. That asymmetry is measured, not assumed, and it predicted exactly where things would later go wrong.
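To make the diagnostics concrete, here is a minimal sketch of per-image statistics of this kind, assuming 8-bit greyscale input. The function name, the use of standard deviation as the contrast measure, and the gradient-threshold edge detector are my assumptions for illustration; the report does not state which edge detector or threshold produced the figures above, so this will not reproduce them.

```python
import numpy as np

def domain_stats(image, edge_threshold=0.1):
    """Summary statistics for one greyscale image with values in [0, 255]."""
    img = np.asarray(image, dtype=np.float64)
    # Shannon entropy of the 256-bin intensity histogram, in bits.
    hist, _ = np.histogram(img, bins=256, range=(0.0, 256.0))
    p = hist[hist > 0] / img.size
    entropy = float(-(p * np.log2(p)).sum())
    # Edge count: pixels whose gradient magnitude exceeds a threshold.
    # Assumption: a stand-in for whatever edge detector the report used.
    gy, gx = np.gradient(img / 255.0)
    edge_count = int((np.hypot(gx, gy) > edge_threshold).sum())
    return {
        "mean_intensity": float(img.mean()),
        "contrast": float(img.std()),            # assumed contrast measure
        "dynamic_range": float(img.max() - img.min()),
        "entropy": entropy,
        "edge_count": edge_count,
    }
```

Averaging these dictionaries over every image in a domain gives one row per domain, which is the shape of comparison the figures above rely on.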

A preprocessing pipeline, and the experiment that mattered

With the gap measured, the investigation built a classical preprocessing pipeline to narrow it: no neural network, just image processing chosen against the diagnostics. The steps, in order: greyscale conversion and normalisation to [0, 1]; luminance-based categorisation of each image as dark, extreme, or normal; category-dependent gamma correction (0.1 for dark, 1.8 for extreme, 0.5 for normal); CLAHE for local contrast; synthetic multi-exposure fusion; histogram equalisation; dynamic-range compression; and a final clipped normalisation.
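The first three steps can be sketched as follows. The gamma values are the ones quoted above. Everything else is an assumption made for illustration: the luminance thresholds, reading "extreme" as over-bright, and normalising by dividing by 255. The later steps (CLAHE, exposure fusion, equalisation, range compression) are left out.

```python
import numpy as np

# Gamma values are the ones quoted in the text.
GAMMA = {"dark": 0.1, "extreme": 1.8, "normal": 0.5}

def categorise(img01, dark_below=0.15, extreme_above=0.6):
    """Label an image in [0, 1] by mean luminance.

    Both thresholds are hypothetical placeholders, not values from the report.
    """
    mean = float(img01.mean())
    if mean < dark_below:
        return "dark"
    if mean > extreme_above:
        return "extreme"
    return "normal"

def gamma_correct(image_u8):
    """Normalise an 8-bit greyscale image to [0, 1], then apply category gamma."""
    img01 = np.asarray(image_u8, dtype=np.float64) / 255.0
    category = categorise(img01)
    corrected = np.power(img01, GAMMA[category])
    return category, np.clip(corrected, 0.0, 1.0)
```

A gamma below 1 lifts dark pixels and a gamma above 1 pulls bright ones down, so choosing the exponent from the image category pushes each image toward a common brightness range.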

The pipeline itself is unremarkable. What it was used to test is the interesting part: the same preprocessing was run in two places and gave opposite results.

Where the intervention belongs

The same transform, two placements, opposite results

The preprocessing pipeline did not change between these two experiments. Only where it sat in the workflow did.

Experiment A — inference only

Helped

Preprocessing applied at test time

The best models applied preprocessing steps selectively, keyed to each image’s measured properties. Test-set performance improved.

Selective, per-image. Test-set gains.

Experiment B — training + inference

Hurt

Preprocessing baked into the training data

The hypothesis was that normalising the training domain too would help further. It turned out false: performance got worse. Flattening the inputs took variability out of training that the model needed to learn from.

Variability removed. Performance regressed.

The takeaway is not “preprocessing works.” It is that a domain-gap fix is only a fix in the right part of the pipeline. At inference it narrows a specific mismatch; in training it can erase the very variety the model needs.

In Experiment A, preprocessing was applied only at inference. The best setup applied steps selectively, keyed to each image's measured properties, and test-set performance improved. In Experiment B, the hypothesis was that preprocessing the training data too would adjust the domain further in the model's favour. It turned out false: performance got worse. The explanation, in the report's own terms, is that normalising the training inputs took variability out of training that the model needed to learn from.

The transform did not change between the two experiments. Only its position in the pipeline did. At inference it narrows a specific, measurable mismatch between an incoming image and the training distribution. In training it flattens the input distribution and starves the model of the variety that makes it robust. Same operation, opposite sign, decided by placement alone.

The placement question

Before asking whether a domain-gap fix works, ask where it belongs: data generation, augmentation, model architecture, inference-time normalisation, or evaluation. The same intervention can help in one slot and hurt in another. A preprocessing trick that improves a test slice can quietly build a brittle training story.

The investigation also reported two pose-error figures, best read as a pair. A technique tuned for the lightbox domain reached roughly 7 degrees of orientation error on lightbox; the same technique on sunlamp sat at about 16 degrees. That is not a bug — it is the edge-count asymmetry surfacing as a measured trade-off. Tuning hard for the domain with rich edge structure cost accuracy in the domain with sparse edges, so the method the report settled on was an explicit compromise that balances the two rather than winning either outright. (The ~7° and ~16° are the only two numbers the report states in body text; the per-axis comparison charts are unlabelled, so the balanced method's own figure is not something I can quote.)
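For readers who want to know what "degrees of orientation error" usually means here: a common definition is the geodesic angle between the predicted and true rotation quaternions. The sketch below implements that definition. The report does not spell out its exact metric, so treat this as the standard formula rather than a reproduction of the ~7° and ~16° figures.

```python
import math

def orientation_error_deg(q_pred, q_true):
    """Geodesic angle in degrees between two rotation quaternions (w, x, y, z)."""
    norm_p = math.sqrt(sum(c * c for c in q_pred))
    norm_t = math.sqrt(sum(c * c for c in q_true))
    dot = sum(p * t for p, t in zip(q_pred, q_true)) / (norm_p * norm_t)
    # q and -q encode the same rotation, so use |dot|; clamp against rounding.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))
```

The absolute value matters: without it, a prediction of -q for a true rotation q would score as a 360° error even though the two describe the same orientation.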

SPS-B: a multi-task architecture proposal

The preprocessing work attacked the gap from the data side. SPS-B, designed with William Jones, attacked it from the architecture side. It is an eight-slide design document — not trained, not benchmarked, no accuracy or pose-error numbers — so what follows is design reasoning, not results.

The proposed architecture had a clear shape, every component carrying an explicit problem-to-solution rationale:

Three input datasets, each with its own augmentation strategy — a Unity synthetic set, an LHM physics-based synthetic set, and a ground real set. Very few labelled satellite-in-space images exist, and none for the specific client; combining real and synthetic sources pushes the model toward features common across datasets, which lowers overfitting and improves domain adaptation.
A large Vision Transformer backbone, pre-trained with transfer and self-supervised learning. CNNs discard spatial information and carry strong architectural priors; a transformer retains the spatial structure that target positioning depends on.
A feature pyramid built from fully connected layers. Transformers are not pyramidal, so a CNN-style FPN does not fit; the proposal builds the pyramidal structure from FC layers instead.
A four-headed "Hydra Net" multi-task head — object detection, pose estimation, heatmap, and segmentation sharing one backbone — on the reasoning that training on several related tasks builds a better underlying understanding and improves generalisation across the gap.

No numbers came out of it, but the design logic holds together: multi-source data to dilute any single domain's bias, a transformer to keep the spatial information a pose head needs, a shared backbone so related tasks reinforce each other. It is a coherent answer to "no labelled real data exists." The augmentation experiment in the AstroGAN work below was later inspired by SPS-B's pose-estimation thinking. That is the real, traceable link between the two, not evidence that SPS-B itself was validated.

AstroGAN: closing the gap from the model side

A renderer can give you a perfectly labelled greyscale image of a spacecraft. What it cannot give you is the image a real camera would have taken of that same spacecraft. AstroGAN is the model I built to translate the first into the second — and the third, most mature artefact in this story.

The sim-to-real gap for satellite imagery is an appearance-translation problem. Synthetic renders and real captures of the same target differ in texture, contrast, blur, sensor response, background, and illumination, but they share what a pose system cares about: target geometry and pose-relevant structure. The job is to move appearance toward the real domain while leaving that structure intact. That is squarely image-to-image translation, so AstroGAN is a GAN — and not a from-scratch one. It is a modification of the UVCGANv2 codebase, drawing on the UVCGAN and AptSim2Real lines of work. Starting from a strong, maintained translation codebase meant the engineering went where it mattered: adapting the model to greyscale satellite imagery and to data that does not come in clean pairs, rather than reimplementing a GAN training loop.

That clean-pairs assumption is the first thing that had to go. Most image-translation methods are easiest to reason about with paired data — the same scene in domain A and domain B, pixel-aligned. Satellite imagery does not give you that. What you can get is approximately paired data: a synthetic render and a real-domain image (rendered here with pyrenderer, a renderer built by William) that share target structure and pose context but are not pixel-aligned. Treating those as exact pairs asks the model to learn a lie; treating them as unrelated throws away the structural correspondence that makes translation tractable. Approximate pairing is the middle path, and committing to it shaped the rest of the architecture.
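One way to picture approximate pairing is a dataset that returns a synthetic image together with any real-domain image from the same coarse pose bucket, with no claim that the two line up pixel for pixel. The class below is a hypothetical illustration of that idea only. The class name, the pose-bucket key, and the random choice of partner are all my assumptions, and this is not the ApproxPairedDataset implementation from the AstroGAN code.

```python
import random

class ApproxPairSketch:
    """Hypothetical stand-in: pairs a synthetic image with a real-domain image
    of similar pose, without assuming pixel alignment."""

    def __init__(self, synthetic, real, seed=0):
        # Each item is (pose_key, image); pose_key is an assumed coarse pose bucket.
        self.synthetic = list(synthetic)
        self.real_by_key = {}
        for key, image in real:
            self.real_by_key.setdefault(key, []).append(image)
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.synthetic)

    def __getitem__(self, index):
        key, sim_image = self.synthetic[index]
        candidates = self.real_by_key.get(key)
        if not candidates:
            raise KeyError(f"no real-domain image for pose bucket {key!r}")
        # Any real image from the same pose bucket is an acceptable partner.
        return sim_image, self.rng.choice(candidates)
```

The point of the sketch is the contract, not the mechanism: the pair shares pose context, so a loss can compare structure, but nothing downstream is entitled to subtract one image from the other pixel by pixel.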

Six code changes on top of UVCGANv2

What AstroGAN actually changed

AstroGAN is a modification of the UVCGANv2 codebase, not a from-scratch model. These are the six concrete changes that turned it into a satellite sim-to-real translator.

1. data/data_new.py: Approximately-paired data handling. A new ApproxPairedDataset class handles real/simulated pairs that share features but lack pixel-wise alignment. select_dataset and construct_single_dataset were modified; legacy CycleGAN dataset support was removed.
2. models/generator/stylevitunet.py: Style encoder for greyscale imagery. Rebuilt for single-channel satellite images: in_channels set to 1, an extra convolutional layer with 512 filters for depth, and a spatial attention mechanism integrated.
3. models/generator/stylevitunet.py: AstroStyleVitUnet generator. Instance Normalization replaces Batch Normalization in the style encoder; spatial and channel attention; a UNet with skip connections for multi-scale feature integration.
4. models/discriminator/stylepatchdiscriminator.py: AstroStylePatch discriminator. Deeper than the original PixelDiscriminator, with residual connections, spectral normalization on every conv layer, and attention. It judges larger patches for context rather than 1×1 pixels.
5. base/new_losses.py: AstroSim2Real loss suite. Replaces the CycleGAN losses. The cycle-consistency requirement is eliminated; only the forward transformation is used, which is what makes weakly-paired data trainable.
6. train/new_train.py: New training loop. Adds mixed-precision training, an adjusted discriminator update frequency for balanced dynamics, and early stopping.

Six concrete code changes define AstroGAN on top of UVCGANv2, and each answers a specific property of the data. A new ApproxPairedDataset class commits the data layer to the weak-pairing assumption. The style encoder is rebuilt for the domain — single-channel rather than 3-channel RGB, a 512-filter convolutional layer for depth, and spatial attention — because "style" here is the brightness, contrast, texture, and sensor character that separate a renderer from a camera. The generator, AstroStyleVitUnet, uses UNet skip connections so multi-scale features survive the translation, because pose cues are not all at one scale. The discriminator, AstroStylePatch, is deeper than UVCGAN's PixelDiscriminator and judges larger patches rather than single pixels: a 1×1 discriminator can only ask "does this pixel look real," while a patch discriminator asks "does this region of structure look real," which is the question that matters when the failure modes are local texture and edge artefacts on a satellite body. Spectral normalisation on every layer keeps that deeper discriminator from overpowering the generator.
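The pixel-versus-patch distinction comes down to receptive field, which is simple arithmetic over kernel sizes and strides. The sketch below computes it for a stack of convolutions. The layer lists are illustrative assumptions: the patch example uses the classic 70×70 PatchGAN layout, not AstroStylePatch's actual configuration, which is not given here.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv layers.

    `layers` is a list of (kernel_size, stride) tuples, first layer first.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump   # each layer widens the view
        jump *= stride                 # strides space later layers further apart
    return field

pixel_disc = [(1, 1), (1, 1), (1, 1)]                  # 1x1 convs only
patch_disc = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]  # classic PatchGAN layout

print(receptive_field(pixel_disc))  # 1
print(receptive_field(patch_disc))  # 70
```

However many 1×1 layers are stacked, each output still sees exactly one input pixel, while five 4×4 layers already give each output a 70-pixel-wide view of local structure.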

The most consequential change is the loss design. Standard CycleGAN-family losses lean on cycle-consistency (translate A→B→A and penalise the round-trip error), which assumes the round trip is meaningful. With approximately paired data it is not; there is no clean A to return to. So AstroGAN drops cycle-consistency entirely, uses only the forward transformation, and replaces the CycleGAN losses with an eight-term objective.

The AstroSim2Real objective

Eight terms, two jobs

The loss suite is split between pushing appearance toward the real domain and protecting the geometry a pose system depends on. No single term is allowed to win.

Adversarial (MSE): Pushes outputs toward the real-domain distribution.
Style reconstruction (L1): Matches style codes of real vs generated images, from the style encoder.
Style classification (BCE-with-logits): Forces the output to read as the target domain, not a blend.
Content NCE (PatchNCE): Keeps local content consistent between input and output patches.
Content identity (L1): When style difference is zero, output must equal input.
Content luminance (Normalized L1): Constrains luminance patches, the brightness structure pose cues live in.
Perceptual (VGG features, L1): Aligns deep features so realism is not just pixel-level.
Gradient penalty (WGAN-GP): Enforces the Lipschitz constraint for stable training.

Sim-to-real FID

109 → 49

FID between real and synthetic imagery fell from 109 to 49 by 500 training epochs — the one hard, measured result. Loss-component ablations were not run.

Those eight terms split along the two ways a sim-to-real translator fails. It can preserve content but not shift the domain enough to matter, or it can shift style hard and destroy the geometry a pose system depends on. Adversarial, style-reconstruction, style-classification, and perceptual terms pull appearance toward the real domain; content-NCE, content-identity, and content-luminance terms protect structure; a gradient penalty keeps training stable. No single term is allowed to win, because either failure mode is a usable-looking image that is useless downstream.
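The grouping above can be written down as a small bookkeeping sketch: two of the simpler terms implemented directly, and a helper that reports a weighted subtotal per group so it is visible when one side starts to dominate. The term names mirror the list above, but every weight is a hypothetical placeholder, since the weights are not stated here, and the helper is my illustration rather than the code in base/new_losses.py.

```python
import numpy as np

# Grouping follows the text; every weight below is a hypothetical placeholder.
GROUPS = {
    "appearance": ["adversarial", "style_reconstruction",
                   "style_classification", "perceptual"],
    "structure": ["content_nce", "content_identity", "content_luminance"],
    "stability": ["gradient_penalty"],
}
WEIGHTS = {name: 1.0 for names in GROUPS.values() for name in names}

def adversarial_mse(scores_on_generated):
    """Least-squares generator loss: push discriminator scores toward 1."""
    scores = np.asarray(scores_on_generated, dtype=np.float64)
    return float(np.mean((scores - 1.0) ** 2))

def l1(a, b):
    """Mean absolute difference, as used by the identity-style terms."""
    return float(np.mean(np.abs(np.asarray(a, dtype=np.float64)
                                - np.asarray(b, dtype=np.float64))))

def grouped_loss(terms, weights=WEIGHTS):
    """Weighted subtotal per group for a dict of already-computed loss terms."""
    missing = [name for name in weights if name not in terms]
    if missing:
        raise ValueError(f"missing loss terms: {missing}")
    return {group: sum(weights[name] * terms[name] for name in names)
            for group, names in GROUPS.items()}
```

In a typical GAN setup the gradient penalty is added on the discriminator update instead of the generator's, which is why it is kept as its own group here and not folded into a single sum.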

The result is one hard, measured number. Measured as the Fréchet Inception Distance between real and synthetic imagery, the sim-to-real gap fell from an initial 109 to 49 by epoch 500 of training, which is more than halved.

What the FID number does and does not say

FID 109→49 measures distributional realism: translated images sit much closer to the real domain. It is reported without a stated dataset size, FID protocol, or held-out test definition, and it is not a pose-accuracy result. It says the translation works as a translation. It does not, on its own, say the downstream pose system improved.
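It helps to see what the metric computes. FID fits a Gaussian to each set of image features and takes the Fréchet distance between the two Gaussians, so it only ever sees means and covariances. The sketch below computes that distance for feature arrays supplied by the caller. In real FID the features come from an Inception-v3 network, which is omitted here; the function name and the eigenvalue route to the matrix square-root trace are my choices for a NumPy-only example, not the protocol behind the 109→49 figure.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input is an array of shape (n_samples, n_features).
    """
    a = np.asarray(feats_a, dtype=np.float64)
    b = np.asarray(feats_b, dtype=np.float64)
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    # Tr(sqrt(cov_a @ cov_b)) equals the sum of square roots of the product's
    # eigenvalues, which are real and non-negative up to rounding error.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * trace_sqrt)
```

Nothing in this computation looks at an individual image or at where anything sits within one, which is the structural reason a good FID cannot stand in for a pose-accuracy result.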

The honest half: FID measures the wrong thing

Here is where the whole project meets its limit. The translated images looked right — closer to the real domain by every distributional measure I had — and the obvious next move was to feed them to the pose network the exercise was meant to serve. The pose network did not agree. The integration broke, and there was no time to chase down whether the cause was a shallow bug or something deeper in the translation.

That is the fair criticism of everything above, and it is correct. FID rewards images whose Inception features sit close to the real domain in aggregate, and nothing in that metric guarantees the per-image geometry a pose regressor reads is intact. The loss suite is designed to protect that geometry — content-NCE, content-identity, and content-luminance all pulling against style collapse — but design intent is not a measurement. AstroGAN works as an image translator by the FID measure; whether that translation reliably helps a pose network is the question the available work did not close. The proposed next step, gradually increasing resolution toward 2028×2028 with longer training, is a roadmap, not a result.

The same shape shows up on the preprocessing side: inference-time normalisation is a patch over an unsolved domain shift, and a more principled fix would adapt the model rather than hand-tune each image. That is right too — but for this client there were zero labelled real images, so the model could only ever be trained on the wrong distribution. Within that constraint, two things hold. Experiment B showed you cannot close the gap by laundering the training data; that starves the model. And the edge-count diagnostics showed the gap is not uniform, so narrowing the inference-side mismatch is the cheap, measurable lever you actually have. It is a patch, not the destination — but it is the move available before there is data for anything better.

Limitations

SPS-B was never trained or benchmarked. It has no accuracy, pose-error, or generalisation results, and the ViT variant, parameter count, and dataset sizes were never specified.
The preprocessing investigation's measured pose-error figures are the lightbox ~7° / sunlamp ~16° trade-off pair only; the per-axis comparison charts are unlabelled, and the balanced method's own figure is not stated. The comparative-analysis section was left unfinished. Preprocessing applied during training made performance worse and was not pursued.
AstroGAN was not validated end-to-end on pose estimation; pose-estimation testing had unresolved issues, possibly a bug, possibly deeper, that were not extensively investigated. Some bugs remain in the codebase; it was not production-deployed. The FID 109→49 figure is reported without a stated dataset size, FID protocol, or held-out test definition. No quantitative ablations exist for the individual loss components or attention modules; their value is argued from design, not measured. The 2028×2028 high-resolution pipeline is a proposal, not a run.

Where this leaves the work

Three artefacts, three maturity levels, one problem. The instinct that runs through all of them is the one I still bring to language-agent systems: in physical-world computer vision, the simulator, the preprocessing, the augmentation strategy, the training split, and the evaluation method are part of the model. A larger backbone does not rescue a weak data-generation story.

The preprocessing result proved that the hard way — the same transform helped in one slot and harmed in another, decided by placement alone. AstroGAN proved the other half: you can close the appearance gap and still not know whether you closed the gap that matters, because FID measures realism and a pose network measures geometry, and those two things can come apart. For this client there were no labelled images of the spacecraft in space. You cannot photograph your way out of that. But you can measure the gap you are stuck with — per-domain intensity, contrast, dynamic range, entropy, edge count, the way the SPEED+ diagnostics did — and that measurement is what tells you which domain has the most to lose, where a fix belongs, and which half of the problem the next thousand epochs should be spent on.