This section audits numerical/empirical consistency: reported metrics, experimental design, baseline comparisons, statistical evidence, leakage risks, and reproducibility.
Seventeen numeric consistency checks were run: 16 passed and 1 failed. The only inconsistency detected is a mismatch in the Gaussian blur $\sigma$ between the methods description ($\sigma = 2.0$ pixels, Section 2.1) and the Figure 1 caption ($\sigma = 1.5$ pixels). The remaining arithmetic and logical checks (dataset counts, equation specialization, table range/mean/std sanity, percent-to-fraction conversion, and an approximate six-fold ratio claim) are internally consistent.
### Checked items
- ✔ C1_dataset_total_maps (Page 3, Section 2.1 “Dataset and out-of-distribution proxy”)
- Claim: “The full dataset contains $20,507$ maps for training and $10,203$ maps for evaluation.”
- Checks: parts_to_total (compute total size from parts)
- Verdict: PASS
- Notes: Checked that training_maps + evaluation_maps $= 20507 + 10203 = 30710$.
- ✔ C2_view1_gradient_feature_count (Page 3, Section 2.2.1 “View 1: Directional gradient and spectral features”)
- Claim: Directional gradient statistics: $6$ orientations, repeated at $4$ smoothing scales; retain mean and standard deviation for each scale and orientation.
- Checks: dimensionality_from_counts
- Verdict: PASS
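- Notes: Computed the implied gradient feature count: $6 \times 4 \times 2 = 48$ (orientations $\times$ smoothing scales $\times$ two statistics, mean and standard deviation).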
- ✔ C3_view1_total_feature_count_minimum (Pages 3-4, Section 2.2.1; View 1 includes gradient stats + $128$-bin radial power spectrum)
- Claim: View 1 comprises directional gradient statistics (implied $48$ features) plus a radial power spectrum averaged into $128$ bins.
- Checks: dimensionality_from_counts
- Verdict: PASS
- Notes: Minimum implied View 1 dimensionality from stated components: $48 + 128 = 176$.
- ✔ C4_view2_total_feature_count_minimum (Page 4, Section 2.2.2 “View 2: Compact power-spectrum and bispectrum vector”)
- Claim: View 2 includes a $128$-bin radial power spectrum, three spectral magnitude moments, a phase-coupling cosine mean, and pixel-level skewness and kurtosis.
- Checks: dimensionality_from_counts
- Verdict: PASS
- Notes: Minimum implied View 2 dimensionality from stated components: $128 + 3 + 1 + 2 = 134$, assuming one scalar per listed statistic.
- ✔ C5_equation1_vs_equation2_K_value (Page 2 Eq. (1) and Page 4 Eq. (2))
- Claim: Eq. (1) defines the score $s = \frac{1}{K} \sum_{k=1}^{K} \frac{\text{NLL}_k - \mu_k}{\sigma_k^2}$. Eq. (2) defines $s = \frac{1}{2} \sum_{k=1}^{2} \frac{\text{NLL}_k - \mu_k}{\sigma_k^2}$.
- Checks: symbolic_specialization_consistency
- Verdict: PASS
- Notes: Checked Eq.(1) specialized to $K=2$ matches Eq.(2) prefactor and summation limit.
- ✖ C6_blur_sigma_inconsistency_method_vs_figure (Page 3 Section 2.1 vs Page 5 Figure 1 caption)
- Claim: Section 2.1: Gaussian blur with $\sigma = 2.0$ pixels. Figure 1 caption: Gaussian blur ($\sigma = 1.5$ pixels).
- Checks: repeated_constant_mismatch
- Verdict: FAIL
- Notes: Expected exact match; mismatch indicates inconsistency unless explained.
- ✔ C7_image_resolution_consistency (Page 3 Section 2.2 and Page 5 Figure 1 caption)
- Claim: Maps are reconstructed to $176\times176$; Figure 1 caption also references $176\times176$ pixel map.
- Checks: repeated_constant_match
- Verdict: PASS
- Notes: Checked repeated resolution constant matches.
- ✔ C8_table1_range_ordering (Page 7, Table 1 “Score distribution statistics on the evaluation set.”)
- Claim: Score range is $[-1.432, 9.368]$.
- Checks: range_validity
- Verdict: PASS
- Notes: Checked range_min $<$ range_max.
- ✔ C9_table1_mean_within_range (Page 7, Table 1)
- Claim: Mean score (all maps) is $2.641$ and score range is $[-1.432, 9.368]$.
- Checks: mean_within_range
- Verdict: PASS
- Notes: Checked mean within stated range.
- ✔ C10_table1_std_nonnegative (Page 7, Table 1)
- Claim: Score standard deviation is $2.618$.
- Checks: nonnegativity
- Verdict: PASS
- Notes: Checked nonnegativity.
- ✔ C11_table1_range_width_vs_std_sanity (Page 7, Table 1)
- Claim: Score std is $2.618$ and score range is $[-1.432, 9.368]$.
- Checks: cheap_sanity_check_range_vs_std
- Verdict: PASS
- Notes: Computed width ($9.368 - (-1.432) = 10.800$) and width/std ($10.800 / 2.618 \approx 4.13$) for sanity; no strict violation.
- ✔ C12_training_hyperparams_lr_scientific_notation (Page 4, Section 2.3)
- Claim: Learning rate is $5\times10^{-4}$.
- Checks: scientific_notation_parse
- Verdict: PASS
- Notes: Checked that the raw string parses to a numeric value equal to $5\times10^{-4} = 0.0005$.
- ✔ C13_epochs_integer (Page 4, Section 2.3)
- Claim: “Each flow is trained for $6$ epochs.”
- Checks: integer_validity
- Verdict: PASS
- Notes: Checked value is a positive integer.
- ✔ C14_coupling_layers_integer (Page 4, Section 2.3)
- Claim: “The architecture of each flow consists of $8$ affine coupling layers.”
- Checks: integer_validity
- Verdict: PASS
- Notes: Checked value is a positive integer.
- ✔ C15_calibration_subset_vs_evaluation_size (Page 4 Section 2.4 and Page 3 Section 2.1)
- Claim: Calibration subset is $200$ maps drawn from evaluation set of $10,203$ maps.
- Checks: subset_size_feasibility
- Verdict: PASS
- Notes: Checked that the calibration subset size does not exceed the evaluation set size; computed fraction $200 / 10203 \approx 1.96\%$.
- ✔ C16_low_fpr_range_conversion (Page 5 Section 2.5; Pages 6-7 Section 3.2; Page 8 Figure 4 caption)
- Claim: Low-FPR metric range is from $0.1\%$ to $5\%$ FPR.
- Checks: percent_to_fraction_conversion
- Verdict: PASS
- Notes: Converted percents to fractions ($0.001$ and $0.05$) and checked ordering ($0.001 < 0.05$).
- ✔ C17_six_fold_improvement_claim (Page 6, Section 3.2 “Overall detection performance”)
- Claim: Mean TPR is $0.8919$; baseline approximately $0.15$; claim “roughly six-fold improvement”.
- Checks: ratio_check
- Verdict: PASS
- Notes: Checked ratio $0.8919 / 0.15 \approx 5.95$ against $6\times$ with the provided relative tolerance (baseline is approximate).
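The arithmetic behind several of the checks above can be reproduced in a few lines. The following is a minimal Python sketch, not the audit's actual tooling: all variable names are hypothetical, the values are transcribed from the claims as listed, the feature counts assume one scalar per named statistic, and the 5% relative tolerance for the six-fold claim is an assumption, since the tolerance actually used is not stated.

```python
import math

# C1: dataset total from its parts (values transcribed from the claim).
train_maps, eval_maps = 20_507, 10_203
total_maps = train_maps + eval_maps

# C2-C4: minimum implied feature counts.
# Assumption: each named statistic contributes exactly one scalar.
grad_features = 6 * 4 * 2          # orientations x smoothing scales x (mean, std)
view1_min = grad_features + 128    # plus 128-bin radial power spectrum
view2_min = 128 + 3 + 1 + 2        # spectrum + 3 moments + cosine mean + skew/kurtosis

# C8-C11: Table 1 sanity checks.
score_min, score_max = -1.432, 9.368
score_mean, score_std = 2.641, 2.618
range_is_ordered = score_min < score_max
mean_in_range = score_min <= score_mean <= score_max
range_width = score_max - score_min
width_over_std = range_width / score_std

# C15: calibration subset as a fraction of the evaluation set.
calib_fraction = 200 / eval_maps

# C16: low-FPR bounds converted from percent to fraction.
fpr_low, fpr_high = 0.1 / 100, 5 / 100

# C17: "roughly six-fold" claim.
# Assumption: 5% relative tolerance; the audit does not state its tolerance.
tpr_ratio = 0.8919 / 0.15
six_fold_ok = math.isclose(tpr_ratio, 6.0, rel_tol=0.05)

if __name__ == "__main__":
    print(f"C1  total maps        : {total_maps}")
    print(f"C2  gradient features : {grad_features}")
    print(f"C3  View 1 minimum    : {view1_min}")
    print(f"C4  View 2 minimum    : {view2_min}")
    print(f"C11 width, width/std  : {range_width:.3f}, {width_over_std:.2f}")
    print(f"C15 calib fraction    : {calib_fraction:.4f}")
    print(f"C16 FPR range         : [{fpr_low}, {fpr_high}]")
    print(f"C17 TPR ratio         : {tpr_ratio:.3f} (six-fold ok: {six_fold_ok})")
```

Because the baseline in C17 is only reported as approximately $0.15$, the ratio check is sensitive to the chosen tolerance; a tighter tolerance or a different rounding of the baseline could change the verdict.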
### Limitations
- Checks are restricted to arithmetic/logical consistency using only explicit numeric values in the provided PDF text; no access to underlying data, code, or supplementary materials.
- Figure-based quantitative claims (e.g., ROC values at specific FPRs, power-spectrum ratios) cannot be verified without extracting plotted data; plot-pixel/value extraction is excluded by scope.
- Several feature-vector dimensionalities are only partially specified; dimensionality checks provided are minimum implied counts based on stated components and may not equal the implementation if additional features were included but not described.
## Paper Ratings
| Dimension | Score |
|-----------|:-----:|
| Overall | 5/10 █████░░░░░ |
| Soundness | 5/10 █████░░░░░ |
| Novelty | 6/10 ██████░░░░ |
| Significance | 5/10 █████░░░░░ |
| Clarity | 5/10 █████░░░░░ |
| Evidence Quality | 4/10 ████░░░░░░ |
Justification: The paper proposes a physically motivated, dual-view ensemble of conditional flows and shows strong low-FPR performance on a blur-based OoD proxy, which is a useful and moderately novel contribution. However, the Mathematical Consistency Audit flags a critical ambiguity in the ‘likelihood-ratio’ framing and the $\sigma$ vs $\sigma^2$ normalization, and records a concrete inconsistency in the blur parameter; the Numerical Results Audit also confirms this mismatch. Empirically, evidence is narrow (single dataset/single OoD type), key baselines and ablations are missing, calibration may leak from the evaluation set, and uncertainty quantification is absent, all of which weaken the strength of the claims. Presentation is generally understandable but under-specified in several implementation and feature-extraction details and contains placeholder references, limiting reproducibility.