-
**Simulation/observation model is not specified at the level needed to assess realism, generalization, or reproducibility (Sec. 2.1–2.2).** Key ambiguities include: the instrument model for each frequency (e.g., 545/857 GHz are Planck/HFI-like rather than Simons Observatory), beams/bandpasses/noise levels per band, pixel size and patch area, whether noise is white/correlated and homogeneous/inhomogeneous, and which sky components are included beyond tSZ+CIB+instrumental noise (e.g., primary CMB, kSZ, radio sources, Galactic dust). As written, it is hard to judge whether the task reflects real component-separation difficulty or is simplified in ways that could overstate performance. *Recommendation:* Expand Sec. 2.1–2.2 with a concrete data-model table: for each channel list central frequency, bandpass assumption, beam (FWHM or full beam transfer), pixelization, and noise level/model. Explicitly state whether 545/857 GHz are assumed as external Planck-like priors in an SO-era analysis, and how they are beam-matched and calibrated. Enumerate all simulated components (tSZ, CIB, CMB, kSZ, radio/IR point sources, Galactic dust/synchrotron, etc.), and if some are omitted, justify why and discuss implications. If feasible, add at least one ablation including the primary CMB (especially relevant at 90/150 GHz) to demonstrate robustness to a dominant real-sky contaminant.
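The requested data-model table could be kept machine-readable so that missing entries are easy to spot. The sketch below is only an illustration: the `ChannelSpec` class, its field names, and the `unspecified` helper are hypothetical, and the four frequencies are simply the ones named in this report, not a confirmed channel list. All instrument values are deliberately left as `None` because the manuscript does not state them.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ChannelSpec:
    """One row of a hypothetical data-model table (field names are assumptions)."""
    freq_ghz: float
    instrument: Optional[str] = None          # e.g. "SO-like" or "Planck/HFI-like"
    bandpass: Optional[str] = None            # e.g. "delta function" or "top-hat"
    beam_fwhm_arcmin: Optional[float] = None
    pixel_arcmin: Optional[float] = None
    noise_uk_arcmin: Optional[float] = None
    noise_model: Optional[str] = None         # e.g. "white, homogeneous"


def unspecified(spec: ChannelSpec) -> list:
    """Return the names of fields that still have no value."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]


# Frequencies mentioned in this report; every other entry is left unset on purpose.
channels = [ChannelSpec(freq_ghz=f) for f in (90.0, 150.0, 545.0, 857.0)]
report = {c.freq_ghz: unspecified(c) for c in channels}

for freq, missing in report.items():
    print(f"{freq:6.1f} GHz: missing {', '.join(missing)}")
```

A table of this kind is complete once `unspecified` returns an empty list for every channel.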
-
**SR-DAE architecture and training procedure are under-specified, preventing reproduction and making it difficult to attribute gains to the proposed design choices (Sec. 2.1–2.3).** The text mentions a U-Net with gated cross-attention, but does not provide layer-wise details (depth, channels, kernels/strides), where attention is inserted, the precise gated cross-attention formulation, normalization/activation choices, patch size, and the full training recipe (optimizer, LR schedule, batch size, augmentations, regularization, early stopping). *Recommendation:* In Sec. 2.3 (or an Appendix), add a reproducibility-focused specification: (i) input/output tensor shapes (patch size, pixel resolution), preprocessing/standardization per channel, and any beam-matching; (ii) a layer-by-layer architecture table for the SR-DAE including where and how gated cross-attention is applied (equations or pseudocode); (iii) the full training recipe (optimizer, LR schedule, batch size, weight decay, augmentations, gradient clipping, early stopping/selection criterion). If space is limited, provide a public configuration file/model card and cite it.
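As an example of the level of detail that would make the attention block reproducible, the sketch below writes out one common gated cross-attention formulation: a residual update scaled by `tanh` of a learnable gate. This is an assumption for illustration and is not claimed to be the authors' design; the function name, the weight matrices `Wq`, `Wk`, `Wv`, and the tensor shapes are all hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(x, ctx, Wq, Wk, Wv, gate):
    """x: (n_x, d_x) target tokens; ctx: (n_ctx, d_c) context tokens.

    Returns x + tanh(gate) * Attention(x -> ctx). With gate = 0 the block
    is the identity, so the context pathway is switched on gradually.
    """
    q = x @ Wq                                        # (n_x, d)
    k = ctx @ Wk                                      # (n_ctx, d)
    v = ctx @ Wv                                      # (n_ctx, d_x)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (n_x, n_ctx)
    return x + np.tanh(gate) * (attn @ v)


# Toy shapes, chosen arbitrarily for the demonstration.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
ctx = rng.normal(size=(32, 4))
Wq = rng.normal(size=(8, 6))
Wk = rng.normal(size=(4, 6))
Wv = rng.normal(size=(4, 8))

out_closed = gated_cross_attention(x, ctx, Wq, Wk, Wv, gate=0.0)
out_open = gated_cross_attention(x, ctx, Wq, Wk, Wv, gate=1.0)
```

Stating the equivalent of these few lines (where the queries, keys, and values come from, and how the gate is parameterized and initialized) would remove most of the ambiguity noted above.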
-
**The composite loss is central but not mathematically defined in a reproducible way (Sec. 2.3; Eq. (1)).** In particular, $L_\mathrm{spec}$ and $L_\mathrm{corr}$ lack explicit formulas, normalization, binning/$\ell$-range, masking/apodization/beam treatment, and how multi-channel CIB information enters. The weights $\lambda_1$ and $\lambda_2$ are not clearly stated/tuned. This blocks verification of what the model is actually optimizing and whether reported spectral fidelity is a direct consequence of the loss design. *Recommendation:* Extend Sec. 2.3 to explicitly define $L_\mathrm{spec}$ and $L_\mathrm{corr}$: specify the power-spectrum estimator (flat-sky Fourier vs spherical harmonics; 2D vs binned 1D $C_\ell$), binning and $\ell$-range, any $\ell$-dependent weighting (linear vs log), window/apodization/mask handling, and whether beam deconvolution is applied. For $L_\mathrm{corr}$, define the residual $r$, which CIB map(s) are used, and whether correlation is computed in pixel space (e.g., Pearson $r$), Fourier space, or in radial bins around clusters; state whether maps are mean-subtracted and how channels are aggregated. Report $\lambda_1$, $\lambda_2$ and how selected (validation sweep/heuristic), and add a brief sensitivity/ablation showing the effect of removing each term on (i) residual–CIB correlation and (ii) small-scale power recovery.
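For concreteness, one way the requested definitions could be written is sketched below. This is an illustrative form only and is not the authors' Eq. (1): the pixel term $L_\mathrm{pix}$, the log-ratio form of $L_\mathrm{spec}$ over $N_b$ binned spectra $\hat{C}_b$, the squared-Pearson form of $L_\mathrm{corr}$ over $N_\nu$ CIB channel maps $I_\nu$, and the sign convention of the residual are all assumptions.

$$
\begin{aligned}
L &= L_\mathrm{pix} + \lambda_1 L_\mathrm{spec} + \lambda_2 L_\mathrm{corr}, \\
L_\mathrm{spec} &= \frac{1}{N_b}\sum_{b=1}^{N_b}\left[\ln \hat{C}_b^\mathrm{rec} - \ln \hat{C}_b^\mathrm{true}\right]^2, \\
L_\mathrm{corr} &= \frac{1}{N_\nu}\sum_{\nu=1}^{N_\nu}\rho\left(r, I_\nu\right)^2,
\qquad r = \hat{y} - y_\mathrm{true},
\end{aligned}
$$

where $\rho(\cdot,\cdot)$ denotes the pixel-space Pearson coefficient of mean-subtracted maps. Whatever form is actually used, writing it at this level (estimator, binning, aggregation over channels) is what would allow the spectral-fidelity results to be traced back to the loss.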
-
**The Conditional Diffusion Model is described only at a high level, limiting assessment of the claimed posterior sampling and uncertainty quantification (Sec. 2.4–2.5).** Critical missing details include: forward noising equation and time horizon $T$, exact noise/variance schedule, whether the network predicts $\epsilon/x_0/v$, the explicit conditional training loss, how conditioning on the observed multi-frequency maps and diffusion time is injected (concatenation/FiLM/cross-attention), what is frozen vs fine-tuned from SR-DAE, and sampling settings beyond “50-step DDIM” and “10 samples.” *Recommendation:* Augment Sec. 2.4 with a complete CDM specification: write down the forward process $q(y_t\mid y_0)$, the reverse parameterization, the schedule ($\beta_t/\alpha_t$, $T$), the prediction target ($\epsilon/x_0/v$) and full conditional loss with consistent notation (e.g., $y=$tSZ, $x=$observed). Describe the conditioning pathway and time embeddings and whether SR-DAE weights are reused/frozen. For sampling, state DDIM parameters, number of steps, and initialization. Add an Appendix plot showing convergence of (i) PIT calibration and (ii) at least one reconstruction metric vs number of diffusion steps and vs number of samples.
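As a reference point for the requested specification, the standard variance-preserving formulation with $\epsilon$-prediction is written out below in the suggested notation ($y=$ tSZ, $x=$ observed maps). Whether the manuscript uses this parameterization is not stated, so it should be read as an assumed template rather than a description of the authors' model.

$$
\begin{aligned}
q(y_t \mid y_0) &= \mathcal{N}\!\left(y_t;\ \sqrt{\bar{\alpha}_t}\,y_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s), \\
y_t &= \sqrt{\bar{\alpha}_t}\,y_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}), \\
L_\mathrm{CDM} &= \mathbb{E}_{y_0,\,x,\,t,\,\epsilon}\left\|\epsilon - \epsilon_\theta(y_t, t, x)\right\|^2 .
\end{aligned}
$$

If the network instead predicts $x_0$ or $v$, or if the loss is reweighted in $t$, the corresponding expressions should replace the last line.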
-
**The CDM's quantitative behavior raises a major interpretability concern (Sec. 3.3).** The reported MSE of the CDM ensemble mean is dramatically worse than that of the deterministic SR-DAE (e.g., 9.5667 vs 1.339 in scaled units). This is hard to reconcile with interpreting the CDM mean as a posterior-mean estimate, because the posterior mean is the minimum-MSE point estimate, so a well-converged sample mean should not lose to a deterministic regressor by such a margin. The gap could indicate weak conditioning, a normalization/evaluation mismatch, too few sampling steps, or samples that add high-frequency structure that is plausible but uncorrelated with the specific truth realization; any of these would undermine claims about point-estimate fidelity. *Recommendation:* In Sec. 3.3 (and/or Appendix), provide a targeted diagnosis: (i) report MAE/RMSE/MSE for individual CDM samples, the CDM sample mean, and the CDM sample median, alongside SR-DAE, all computed under a clearly defined normalization; (ii) add $\ell$-space coherence or correlation diagnostics (e.g., coherence $(C_\ell^\mathrm{cross})^2/(C_\ell^\mathrm{rec}C_\ell^\mathrm{true})$ and also $C_\ell^\mathrm{rec}/C_\ell^\mathrm{true}$, not only the transfer $T(\ell)=C_\ell^\mathrm{cross}/C_\ell^\mathrm{true}$) to show whether the CDM mean preserves true structure or injects excess power; (iii) check whether the CDM mean approaches the SR-DAE output under any limiting setting (e.g., fewer steps, different conditioning strength), which would help interpret what diffusion is adding; and (iv) clarify how the diffusion model is intended to be used scientifically if its point estimate is worse (e.g., is SR-DAE the recommended point estimator and CDM used only for uncertainty/ensembles?).
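The three $\ell$-space diagnostics named in item (ii) can be computed together from binned flat-sky spectra. The sketch below is a minimal NumPy version under stated assumptions: square periodic patches with no mask, apodization, or beam correction, and hypothetical function names, pixel size, and bin edges. It also illustrates why the transfer function alone is insufficient: a reconstruction with added uncorrelated power has a transfer near 1 but a coherence near 0.5.

```python
import numpy as np


def binned_cross_spectrum(a, b, pix_rad, ell_edges):
    """Binned flat-sky cross-spectrum of two maps (no mask or beam handling)."""
    ny, nx = a.shape
    fa, fb = np.fft.fft2(a), np.fft.fft2(b)
    p2d = (fa * np.conj(fb)).real * pix_rad**2 / (nx * ny)
    ly = 2 * np.pi * np.fft.fftfreq(ny, d=pix_rad)
    lx = 2 * np.pi * np.fft.fftfreq(nx, d=pix_rad)
    ell = np.hypot(*np.meshgrid(lx, ly))
    idx = np.digitize(ell.ravel(), ell_edges) - 1
    nb = len(ell_edges) - 1
    good = (idx >= 0) & (idx < nb)
    sums = np.bincount(idx[good], weights=p2d.ravel()[good], minlength=nb)
    counts = np.bincount(idx[good], minlength=nb)
    return sums / np.maximum(counts, 1)   # empty bins return 0


def spectral_diagnostics(rec, true, pix_rad, ell_edges):
    """Transfer, power ratio, and coherence per multipole bin."""
    c_x = binned_cross_spectrum(rec, true, pix_rad, ell_edges)
    c_r = binned_cross_spectrum(rec, rec, pix_rad, ell_edges)
    c_t = binned_cross_spectrum(true, true, pix_rad, ell_edges)
    return {
        "transfer": c_x / c_t,
        "power_ratio": c_r / c_t,
        "coherence": c_x**2 / (c_r * c_t),
    }


# Toy white-noise maps; pixel size and bins are arbitrary choices.
rng = np.random.default_rng(0)
pix_rad = np.radians(1.0 / 60.0)
ell_edges = np.linspace(500, 10000, 6)
truth = rng.normal(size=(256, 256))
damped = 0.5 * truth                               # suppressed but coherent
noisy = truth + rng.normal(size=truth.shape)       # unbiased plus excess power

d_damped = spectral_diagnostics(damped, truth, pix_rad, ell_edges)
d_noisy = spectral_diagnostics(noisy, truth, pix_rad, ell_edges)
```

Applied to individual CDM samples and to the ensemble mean, the same three curves would show directly whether the large MSE comes from lost signal (low transfer) or from injected power that is uncorrelated with the truth (power ratio above 1 with low coherence).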
-
**Baseline methods (cILC and WF) are under-described and may not reflect best-practice competitive implementations, complicating the strength/fairness of the comparison (Sec. 2.2; Sec. 3.1).** For cILC, it is unclear whether weights are global or $\ell$/needlet-dependent, how SEDs are obtained, and how beams/noise are handled. For WF, it is unclear whether filtering is per-band or joint multi-frequency, how signal/noise spectra are estimated (from truth, from observed maps, from simulations), and whether cross-spectra are used to mitigate noise bias. *Recommendation:* In Sec. 2.2, specify the exact cILC and WF pipelines (domain: map/harmonic/needlet; $\ell$-binning; beam handling; covariance estimation; constraints; any regularization). Clarify whether the baselines are intentionally “simple” references, and if so, state this explicitly and discuss how stronger variants (e.g., needlet ILC/NILC/GNILC) could change results. Ideally, add one more competitive linear baseline (e.g., needlet/constrained NILC) or demonstrate that your cILC/WF hyperparameters are tuned on validation data to near-optimal performance under the same simulation assumptions.
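To make the cILC ambiguity concrete, the sketch below implements the simplest variant: a single global set of weights with unit response to one SED and zero response to another, $w = C^{-1}A\,(A^{\top}C^{-1}A)^{-1}e$. It is not claimed to be the pipeline used in the manuscript. The SED vectors are arbitrary placeholder numbers (not physical tSZ or CIB values), and the toy maps ignore beams and bandpasses.

```python
import numpy as np


def cilc_weights(cov, sed_keep, sed_null):
    """Global constrained-ILC weights: w.sed_keep = 1 and w.sed_null = 0."""
    A = np.column_stack([sed_keep, sed_null])      # (n_freq, 2) mixing matrix
    cinv_A = np.linalg.solve(cov, A)               # C^{-1} A
    e = np.array([1.0, 0.0])                       # keep first, null second
    return cinv_A @ np.linalg.solve(A.T @ cinv_A, e)


# Placeholder SEDs for four channels: illustrative numbers only.
sed_tsz = np.array([-1.0, -0.6, 3.0, 5.0])
sed_cib = np.array([0.1, 0.3, 2.0, 6.0])

# Toy multi-frequency data: two components plus white noise.
rng = np.random.default_rng(1)
n_pix = 5000
y_true = rng.normal(size=n_pix)
cib = rng.normal(size=n_pix)
noise = 0.5 * rng.normal(size=(4, n_pix))
maps = np.outer(sed_tsz, y_true) + np.outer(sed_cib, cib) + noise

cov = np.cov(maps)                                 # empirical channel covariance
w = cilc_weights(cov, sed_tsz, sed_cib)
y_cilc = w @ maps                                  # y_true plus projected noise
```

Here the covariance is estimated once over all pixels. An $\ell$-dependent or needlet implementation would instead compute `cov` and `w` per scale, which is exactly the distinction Sec. 2.2 needs to state.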
-
**Quantitative reporting is not yet systematic enough, and several internal inconsistencies reduce confidence (Sec. 2.5; Sec. 3.1–3.3; Fig. 3).** Examples: MSE values are reported in “scaled units” without a precise definition or mapping to Compton-$y$; SR-DAE/DAE test-set MSE is reported inconsistently (0.9458 vs 1.339); the null test is described as zero-centered but reports mean $\approx 0.3466$ with $\sigma\approx 0.1054$, i.e., an offset of roughly $3.3\sigma$ if $\sigma$ is the scatter of the null statistic; and residual/transfer/gain plots are mostly interpreted qualitatively without representative numeric values and uncertainties. *Recommendation:* Add a compact quantitative summary in Sec. 3.1–3.3: (i) a table of pixel-space metrics (MAE/RMSE/MSE, Pearson $r$) for cILC, WF, SR-DAE, and CDM mean on the main test, OOD, and high-noise splits, with error bars across patches; (ii) numeric summaries of transfer/residual metrics at representative multipoles (e.g., $\ell\approx 1000/3000/5000$) with scatter; (iii) explicitly define the scaling/normalization used for “scaled units” and provide a conversion back to physical $y$ units (or an interpretable normalization such as error normalized by $\mathrm{std}(y_\mathrm{true})$); and (iv) fix and reconcile the inconsistent MSE and null-test statements by clearly stating dataset, model variant, and normalization used in each figure/table.
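One interpretable normalization of the kind suggested in item (iii), together with the across-patch error bars requested in item (i), can be sketched as follows. The function names, the number of bootstrap draws, and the toy patches are assumptions for illustration and carry no values from the manuscript.

```python
import numpy as np


def normalized_rmse(rec, true):
    """RMSE divided by std(y_true): 0 is perfect, 1 matches a constant-mean guess."""
    return np.sqrt(np.mean((rec - true) ** 2)) / np.std(true)


def bootstrap_mean(values, n_boot=1000, seed=0):
    """Mean over patches and its bootstrap standard error."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    return values.mean(), boot_means.std(ddof=1)


# Toy patches: the "reconstruction" is truth plus noise at half the signal scatter.
rng = np.random.default_rng(2)
truths = [rng.normal(size=(32, 32)) for _ in range(50)]
recs = [t + 0.5 * rng.normal(size=t.shape) for t in truths]

per_patch = [normalized_rmse(r, t) for r, t in zip(recs, truths)]
mean_nrmse, err_nrmse = bootstrap_mean(per_patch)
print(f"normalized RMSE = {mean_nrmse:.3f} +/- {err_nrmse:.3f}")
```

Reporting every method and split in the same normalized unit would also make the 0.9458 vs 1.339 discrepancy straightforward to trace.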
-
**Dataset splitting, OOD definition, and leakage controls are not described with enough precision, and there is an apparent split arithmetic inconsistency (Sec. 2.1; Sec. 3.2).** The stated train/val/test counts (1066/228/229) sum to 1523, leaving no patches for a separate “top 5%” OOD subset ($\sim 76$ patches). Additionally, if patches overlap spatially or are drawn from the same underlying realization/lightcone region, train/test leakage could inflate performance. The OOD definition (top 5% by peak tSZ) may also not cover other realistic domain shifts (CIB SED variation, different foreground mix, calibration/beam errors). *Recommendation:* In Sec. 2.1 and Sec. 3.2, fix the dataset accounting by explicitly listing: total patch count, OOD count, and the remaining train/val/test counts (with integers that add up). State whether patches overlap and how you prevent spatial leakage (e.g., split by independent sky areas/lightcone segments or by halo IDs). Provide summary statistics (mass, redshift, peak $y$) for each split. If feasible, add a second OOD axis (e.g., altered CIB SED/noise/beam perturbation) or at least discuss that peak-$y$ OOD does not capture full real-data domain shift.
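The accounting problem can be checked in a few lines. The counts below are the ones quoted from the manuscript; the helper `split_is_consistent` is a hypothetical name, and the assumption being tested is that the OOD subset is disjoint from train/val/test.

```python
# Counts quoted from the manuscript (Sec. 2.1).
train, val, test = 1066, 228, 229
total_listed = train + val + test

# "Top 5%" of that pool, rounded to whole patches.
ood_expected = round(0.05 * total_listed)


def split_is_consistent(total, ood, train, val, test):
    """True when OOD, train, val, and test are disjoint and exhaust the total."""
    return ood + train + val + test == total


# If the listed counts already make up the full dataset, a disjoint OOD
# subset of the expected size cannot exist.
consistent_as_written = split_is_consistent(
    total_listed, ood_expected, train, val, test
)

print(total_listed, ood_expected, consistent_as_written)
```

The accounting would close if, for example, the OOD patches were held out in addition to the listed counts or were drawn from within the test split; which of these applies is for the authors to state.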
-
**Scientific validation via scaling relation is currently ambiguous in its physical interpretation (Sec. 2.5; Sec. 3.3.2).** The text refers to a $Y_\mathrm{SZ}$–“mass proxy” relation but also uses “peak tSZ signal” as a proxy, which is not a standard mass proxy in real analyses and is sensitive to beam/noise. The definition of $Y_\mathrm{SZ}$ (aperture, centering, background subtraction, beam correction) is not sufficiently specified, making it hard to interpret “tighter/less biased” claims. *Recommendation:* Clarify Sec. 2.5 and Sec. 3.3.2 by (i) explicitly defining the x-axis quantity (true halo mass $M_{500}$ from FLAMINGO vs peak $y$ vs another proxy) and renaming accordingly (e.g., $Y_\mathrm{SZ}$–$y_\mathrm{peak}$ if that is what is used); (ii) defining how $Y_\mathrm{SZ}$ is computed (aperture radius, centering, integration method, background subtraction, beam handling); and (iii) reporting fitted slope/normalization/scatter/bias with uncertainties (bootstrap) for truth, cILC, SR-DAE, and CDM.
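To show how many choices the definition of $Y_\mathrm{SZ}$ involves, the sketch below implements one simple option: a circular-aperture sum with a median background taken from a surrounding annulus. The function name, aperture radii, and toy cluster are hypothetical, no beam correction is applied, and nothing here is taken from the manuscript.

```python
import numpy as np


def aperture_ysz(y_map, center, r_ap, r_bg_out, pix_area=1.0):
    """Background-subtracted integrated y inside a circular aperture.

    center is (row, col) in pixels; radii are in pixels; the background is
    the median of the annulus r_ap < r <= r_bg_out.
    """
    yy, xx = np.indices(y_map.shape)
    r = np.hypot(yy - center[0], xx - center[1])
    in_ap = r <= r_ap
    in_bg = (r > r_ap) & (r <= r_bg_out)
    background = np.median(y_map[in_bg])
    return np.sum(y_map[in_ap] - background) * pix_area


# Toy cluster: Gaussian blob (sigma = 3 px, unit amplitude) on a constant offset.
yy, xx = np.indices((101, 101))
sigma = 3.0
blob = np.exp(-((yy - 50) ** 2 + (xx - 50) ** 2) / (2 * sigma**2))
toy_map = blob + 0.2

y_flat = aperture_ysz(np.full((101, 101), 0.2), (50, 50), 15, 25)
y_blob = aperture_ysz(toy_map, (50, 50), 15, 25)
y_expected = 2 * np.pi * sigma**2     # analytic integral of the blob
```

Each argument (`center`, `r_ap`, `r_bg_out`, and the implicit lack of beam correction) corresponds to one of the unspecified choices listed above, and the fitted scatter and bias can depend on all of them.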
-
**Limitations and failure modes of a simulation-trained, generative reconstruction approach are not discussed explicitly enough, despite being central for real-data applicability (Sec. 3.2–3.3; Sec. 4).** Key concerns include domain shift (single simulation suite/feedback model), learned tSZ–CIB correlations that might not match reality, sensitivity to unmodeled foregrounds and calibration/beam systematics, and the possibility of hallucinated small-scale structure in low-SNR regions (especially relevant for diffusion sampling). *Recommendation:* Strengthen Sec. 4 (or end of Sec. 3.3) with a focused limitations section: explicitly discuss domain shift risks (FLAMINGO baryonic physics, CIB modeling), potential biases in pressure profiles/scaling relations, sensitivity to missing components and instrument systematics, and hallucination risks. Outline concrete mitigations (training across multiple simulations/feedback models; foreground/model perturbation during training; domain adaptation; cross-validation with external observables such as X-ray or weak-lensing; conservative masking/SNR-based usage rules for diffusion samples).