This section audits numerical/empirical consistency: reported metrics, experimental design, baseline comparisons, statistical evidence, leakage risks, and reproducibility.
Out of 23 candidate numeric checks, 15 passed, 2 failed, and 6 were uncertain (mostly due to missing tabulated inputs for the intended recomputations). The two failures concern (i) a unit/pixel-scale consistency check for patch size vs pixel resolution, and (ii) a rounding/precision claim about matching to four decimal places for a median ratio example. Other recomputed percentages/ratios (held-out recovery percentages, skew/kurtosis ratios, algebraic constant $-1/\sqrt{2}$, and various percent recoveries) were numerically consistent within stated tolerances.
### Checked items
- ✔ C01_train_aug_count (p.3, end of §2 DATA)
- Claim: 8-fold augmentation yielding 1600 training pairs from $N_{\rm train} = 200$ paired patches.
- Checks: multiplication
- Verdict: PASS
- Notes: Exact integer arithmetic.
- ✖ C02_patch_pixels_area_consistency (p.2, §2 DATA)
- Claim: $5^\circ\times5^\circ$ patches at $256\times256$ pixels correspond to 5 arcmin pixel scale.
- Checks: unit_consistency
- Verdict: FAIL
- Notes: $60\times5/256=1.1719$ arcmin; inconsistent with 5 arcmin unless additional downsampling/definition. Flag for check.
- ✔ C03_patch8_percentile_from_1523 (p.3, §2 DATA (Patch 8 description))
- Claim: Patch 8 sits in the top 3.8% on deepest decrement and the top 10.8% on $\sigma$ (in $N_{\rm patch} = 1523$).
- Checks: percentage_to_count
- Verdict: PASS
- Notes: $0.038\times1523\approx58$, $0.108\times1523\approx165$.
- ✔ C04_bandavg_from_bins_ST_tSZ (p.6, Table 2)
- Claim: Band-average [500,6000] for ST+jm tSZ is $1.088 \pm 0.040$, given per-band values 1.128, 1.115, 1.062 over contiguous bins.
- Checks: weighted_average_by_interval
- Verdict: PASS
- Notes: Width-weighted average across [500,1500],[1500,3000],[3000,6000] gives 1.08845, consistent with 1.088 (noting averaging-definition caveat).
- ⚠ C05_bandavg_from_bins_ST_CIB (p.6, Table 2)
- Claim: Band-average [500,6000] for ST+jm CIB is $1.010 \pm 0.014$, given per-band values 0.996, 1.022, 1.009.
- Checks: weighted_average_by_interval
- Verdict: UNCERTAIN
- Notes: Missing bin ranges/ratios or reported total for weighted average.
- ⚠ C06_bandavg_from_bins_DDPM_tSZ (p.6, Table 2)
- Claim: Band-average [500,6000] for DDPM+jm tSZ is $1.065 \pm 0.123$, given per-band values 1.141, 1.079, 1.034.
- Checks: weighted_average_by_interval
- Verdict: UNCERTAIN
- Notes: Missing bin ranges/ratios or reported total for weighted average.
- ⚠ C07_bandavg_from_bins_DDPM_CIB (p.6, Table 2)
- Claim: Band-average [500,6000] for DDPM+jm CIB is $1.013 \pm 0.011$, given per-band values 1.027, 1.013, 1.008.
- Checks: weighted_average_by_interval
- Verdict: UNCERTAIN
- Notes: Missing bin ranges/ratios or reported total for weighted average.
- ✔ C08_skew_kurt_ratio_claims (p.8, §4.3 PDF Statistics)
- Claim: For tSZ, skewness ratio is 0.98 (ST) and 0.98 (DDPM); excess-kurtosis ratio is 0.96 (ST) and 0.94 (DDPM).
- Checks: ratio_recompute
- Verdict: PASS
- Notes: Recomputed ratios: skew ST/truth=0.9803, DDPM/truth=0.9777; kurt ST/truth=0.9599, DDPM/truth=0.9393.
- ✔ C09_min_core_recovery_percent (p.8, §4.3 PDF Statistics)
- Claim: Deepest cluster core is recovered to within 2–10%: truth min $-350$ $\mu$KCMB, ST $-342$, DDPM $-318$.
- Checks: relative_error
- Verdict: PASS
- Notes: Relative errors: ST 0.02286 (2.29%), DDPM 0.09143 (9.14%), within 2–10%.
- ✔ C10_crossr_recovery_heldout_ST (p.7, held-out evaluation paragraph + p.11/p.12 held-out bullets + Table 7)
- Claim: Held-out pixel $r_{\mathrm{tSZ}\times\mathrm{CIB}}$: gen $-0.159$ versus test-truth $-0.172$ corresponds to 92% recovery.
- Checks: percentage_recompute
- Verdict: PASS
- Notes: Recomputed recovery is 92.44%, consistent with 92% after rounding.
- ✔ C11_crossr_recovery_heldout_DDPM (p.12, DDPM held-out protocol bullets + Table 7)
- Claim: DDPM held-out: $r_{\mathrm{gen}} = -0.158$ vs test-truth $-0.173$ corresponds to 91.7% recovery.
- Checks: percentage_recompute
- Verdict: PASS
- Notes: Recomputed recovery is 91.33%, consistent within 0.5 percentage points.
- ✔ C12_core_depth_recovery_ST (p.11, held-out bullets + Table 7)
- Claim: Held-out ST deepest core: gen min $-220$ $\mu$KCMB vs test min $-337$ corresponds to $\sim$65% recovery.
- Checks: percentage_recompute
- Verdict: PASS
- Notes: Recomputed recovery is 65.28%, consistent with “$\sim$65%”.
- ✔ C13_core_depth_recovery_DDPM (p.12, DDPM held-out bullets + Table 7)
- Claim: DDPM held-out deepest core: gen min $-330$ $\mu$KCMB vs test $-345$ corresponds to 96% recovery.
- Checks: percentage_recompute
- Verdict: PASS
- Notes: Recomputed recovery is 95.65%, consistent with 96% after rounding.
- ✔ C14_algebraic_floor_minus_1_over_sqrt2 (p.6-7, §4.2.1 (residual diagnostic) + Fig. 6 text)
- Claim: Algebraic limit $r = -1/\sqrt{2} \approx -0.707$ and observed $-0.706$ are consistent.
- Checks: constant_recompute
- Verdict: PASS
- Notes: $-1/\sqrt{2} = -0.70710678$; observed $-0.706$ differs by $0.00111$, within abs tolerance.
- ✔ C15_sampling_coeff_error_factor (p.4, §3.6.1 Critical sampling coefficient)
- Claim: Coefficient error factor is $1/\sqrt{\beta_t} \approx 10$ for typical $\beta_t\sim 0.01$.
- Checks: order_of_magnitude_recompute
- Verdict: PASS
- Notes: $1/\sqrt{0.01}=10$ exactly.
- ✔ C16_corrunet_param_count_check (p.4, §3.6.2 Architecture)
- Claim: Model has 554,562 parameters.
- Checks: self_consistency_integer_format
- Verdict: PASS
- Notes: Comma-formatted integer parses consistently to 554562.
- ✖ C17_multi_freq_median_ratio_delta (p.13, §5.4 (text))
- Claim: med(CIB217/CIB150) = 2.9354 (truth) versus 2.9370 (gen+JM), and claim of matching to four decimal places.
- Checks: rounding_consistency
- Verdict: FAIL
- Notes: Rounded to 4 d.p., values are 2.9354 vs 2.9370 (not equal).
- ✔ C18_fig12_residual_dp_claim (p.13-15, §5.4 text + Fig. 12(b) caption)
- Claim: Body: cross-correlations match to four decimal places; Fig. 12(b): residual $|\Delta r|$ is $10^{-10}$ to $10^{-11}$, orders of magnitude below $10^{-4}$.
- Checks: order_of_magnitude_comparison
- Verdict: PASS
- Notes: $10^{-10}$ and $10^{-11}$ are $10^{-6}$ and $10^{-7}$ of $10^{-4}$, respectively.
- ✔ C19_crossr_ladder_percentages (p.16-18, §6 + Fig. 15)
- Claim: Cross-$r$ recovery ladder: truth $r = -0.163$; multi-channel gives $r = -0.086$ (53%); +soft gives $r = -0.093$ (57%).
- Checks: percentage_recompute
- Verdict: PASS
- Notes: Recomputed: 52.76% ($\approx53\%$), 57.06% ($\approx57\%$).
- ✔ C20_iteration_sweep_percentages (p.16-17, §6.1 + Fig. 14 / Fig. 15 text)
- Claim: Iteration sweep recovery: 32%/38%/47%/53%/56% at 200/400/800/1600/3200 steps.
- Checks: monotonicity_and_bounds
- Verdict: PASS
- Notes: Series is within [0,100] and non-decreasing.
- ✔ C21_break_even_samples_calc (p.19, §6.1.0.6)
- Claim: Break-even $N_{\rm samples} \approx 1$ h $/\,80$ s $\approx50$ samples.
- Checks: division_recompute
- Verdict: PASS
- Notes: $3600/80=45$; within loose tolerance given approximate inputs.
- ⚠ C22_bp_error_reduction_factors (p.22, §7.2 (quantitative improvements))
- Claim: Mean $|\mathrm{relative~error}|$ drops from 28.6%$\rightarrow$4.8% for $S_3(\ell)$ and 64.9%$\rightarrow$9.2% for $K_4(\ell)$.
- Checks: ratio_recompute
- Verdict: UNCERTAIN
- Notes: Unsupported ratio_recompute variant.
- ⚠ C23_peak_count_deficits_percent (p.25, Table 17)
- Claim: Peak count deficits: e.g., for $\nu=2$ truth=710 and ST=617 corresponds to $-13.1\%$, etc. Verify all percent deficits shown in parentheses.
- Checks: percentage_recompute
- Verdict: UNCERTAIN
- Notes: Missing inputs for core depth recovery recompute.
### Limitations
- Audit is restricted to the provided PDF text; no underlying map/spectrum data are available, so only arithmetic and simple logical/unit consistency checks are feasible.
- Figure-based numeric values not explicitly printed in the text/tables are not used (no plot digitization).
- Some 'band-averaged' quantities may use a definition (e.g., averaging over modes, log-bins, or patch-wise normalization) that differs from simple $\ell$-interval width weighting; related candidates are flagged as plausibility checks rather than definitive inconsistencies.
## Paper Ratings
| Dimension | Score |
|-----------|:-----:|
| Overall | 6/10 ██████░░░░ |
| Soundness | 6/10 ██████░░░░ |
| Novelty | 6/10 ██████░░░░ |
| Significance | 6/10 ██████░░░░ |
| Clarity | 5/10 █████░░░░░ |
| Evidence Quality | 6/10 ██████░░░░ |
Justification: The paper’s core linear-algebraic calibration (Eq. 12) is sound and well-supported by extensive experiments, and the generator-agnostic falsification plus the non-by-construction ladder provide useful, original insights. However, the audits flag important presentation and methodological caveats: overstatements about “phase-preserving” and simultaneous 1-pt/2-pt matching, an inconsistent Pearson definition, and missing implementation specifics for the spectral whitening/recolouring that limit reproducibility and rigor. Empirically, results are strong across a broad diagnostic suite, but several claims rely on small-N studies, paired-CDF conditioning risks, and uneven uncertainty reporting. Overall this is a solid, above-average contribution that would benefit from tightening the math claims, clarifying what is enforced vs learned, and standardizing uncertainty/implementation details.