Quantifying the Domain Gap in Multivariate Diffusion Downscaling: Real-to-Real versus Synthetic-to-Real Training Across Two Downscaling Scales
Hossein Yousefi Sohiᵃ, Andrew Bennettᵃ, Guo-Yue Niuᵃᵇ, Ali Behrangiᵃᶜᵈᵉ
ᵃ Department of Hydrology and Atmospheric Sciences, The University of Arizona, Tucson, AZ, USA
ᵇ Biosphere 2, The University of Arizona, Oracle, AZ, USA
ᶜ Department of Civil Engineering–Engineering Mechanics, The University of Arizona, Tucson, AZ, USA
ᵈ Department of Geosciences, The University of Arizona, Tucson, AZ, USA
ᵉ Remote Sensing and Spatial Analysis (RSSA) Graduate Interdisciplinary Program (GIDP), The University of Arizona, Tucson, AZ, USA
Abstract:
Kilometer-scale atmospheric forcings are critical for hydrologic and impact applications, yet dynamical downscaling remains computationally prohibitive. Diffusion-based generative downscaling offers a promising alternative; however, many studies train and validate on synthetic low-resolution inputs created by coarsening the high-resolution target. These perfect-prognosis settings can obscure the domain gap between real coarse predictors (reanalysis or regional model outputs) and high-resolution targets, a gap driven by systematic biases and cross-variable inconsistencies. Here we present a controlled experiment in joint multivariate diffusion downscaling over Arizona, in which eight coupled near-surface variables are downscaled together to preserve cross-variable physical consistency. To quantify the domain gap, we consider two downscaling scales and, at each scale, compare Real-to-Real and Synthetic-to-Real training. The first scale pairs WRF with a matched AORC-based synthetic counterpart, while the second pairs ERA5 and CONUS404 with an analogous AORC synthetic counterpart. We further benchmark CorrDiff (a U-Net mean predictor plus residual diffusion) against linear-regression and random-forest baselines, and assess deployment-relevant generalization through a transfer test that applies synthetic-trained models to real ERA5/WRF predictors. Performance is evaluated with a complementary set of distributional, event-based (threshold), and spatial-structure diagnostics that capture both overall accuracy and the realism of fine-scale variability. Overall, this design enables a controlled, matched quantification of the performance degradation incurred when models trained on synthetic coarse inputs are applied to real coarse predictors, and it clarifies which training configurations best preserve physically consistent fine-scale variability in the presence of realistic biases. These insights provide practical guidance for developing robust, analysis-ready downscaled forcings suitable for operational workflows and climate-impact applications.