-
**Posterior-sample provenance and parameter-estimation configuration are insufficiently documented, making it unclear whether differences are purely waveform-driven or partly due to non-waveform analysis differences (Sec. 2.1.1; also impacts Sec. 3.1–3.4).** The paper does not clearly state the PE framework(s) used (Bilby/LALInference/RIFT/…), whether priors and likelihood settings are identical across waveform models (mass/spin/distance priors; reference frequency; parameterization), the detector network and data segment, PSD estimation, calibration-uncertainty treatment, low-frequency cutoff, marginalizations, sampler settings/convergence diagnostics, and whether all runs used matched settings. *Recommendation:* Expand Sec. 2.1.1 with a reproducibility-grade description of the PE runs and posterior sources. Include a compact configuration table listing, for each waveform model: PE code and version, likelihood and priors (explicitly confirming they are identical across models or enumerating differences), data segment duration, detector network, $f_{\rm low}$, PSD method, calibration treatment, marginalizations, sampler type and key settings, and convergence/quality diagnostics (e.g., ESS; multiple chains where applicable). If posteriors come from public LVK/GWOSC releases, provide DOIs/URLs and document any post-processing. Briefly discuss (Sec. 1 or Sec. 4) how residual non-waveform differences could bias the interpreted “waveform systematics.”
-
**Waveform-model characterization is corrupted/non-informative (Table 1; Sec. 2.1.2, 2.4, 3.3), which undermines the central grouping/attribution analysis.** The current “Key Characteristics” entries contain filler text and omit essential details (domain, higher modes, precession implementation, calibration/training range, validity limits). As written, readers cannot assess whether the groupings are correct or whether GW231123 lies near/over model validity boundaries. *Recommendation:* Replace Table 1 with accurate, citation-backed model descriptions. For each of NRSur7dq4, SEOBNRv5PHM, IMRPhenomTPHM, IMRPhenomXO4a, IMRPhenomXPHM, explicitly list: (i) time vs frequency domain; (ii) precession treatment (e.g., twisting-up vs fully precessing surrogate) and frame conventions; (iii) higher-mode content; (iv) calibration/training data and parameter-space coverage ($q$, spin magnitudes, etc.); (v) known limitations relevant to GW231123 (extrapolation risk). Then ensure Sec. 2.4 (group definitions) and Sec. 3.3 (attribution statements) directly reference these documented properties.
-
**Key numerical summary tables are truncated/incomplete or inconsistent with the narrative, preventing verification of the main quantitative claims (Table 2, Table 5; Sec. 3.1, Sec. 3.4).** Examples include the truncated SEOBNRv5PHM column label/entries (“SE…”) in Table 2 and truncated headers in Table 5; this makes it impossible to confirm reported spreads (e.g., in $m_{2}$, $\chi_{\rm eff}$, $z$) and the construction of systematic-inclusive intervals. *Recommendation:* Audit and correct all result tables (at minimum Table 2 and Table 5) directly from the underlying posterior samples. Ensure: complete medians and 90$\%$ credible intervals for all five models, untruncated headers, consistent parameter naming/units, and consistency between tables and text. Recompute Table 5 from the stated rule (Sec. 2.5.2: envelope/union of per-model 90$\%$ CIs) and explicitly cross-check that all quoted numerical ranges in Sec. 3.1 and Sec. 3.4 match the final tables.
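
  For concreteness, a minimal sketch of the recomputation we have in mind, assuming the Sec. 2.5.2 rule means equal-tailed per-model 90$\%$ CIs followed by their union/envelope; the names and data layout below are placeholders, not the authors' pipeline:

  ```python
  import numpy as np

  def credible_interval(samples, level=0.90):
      """Equal-tailed credible interval at the given level."""
      lo, hi = np.percentile(samples, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
      return lo, hi

  def systematic_envelope(per_model_samples, level=0.90):
      """Union (envelope) of the per-model equal-tailed CIs for one parameter."""
      intervals = [credible_interval(s, level) for s in per_model_samples.values()]
      return min(lo for lo, _ in intervals), max(hi for _, hi in intervals)

  # Placeholder usage for one parameter (e.g., m_2), one posterior sample array per model:
  # envelope = systematic_envelope({"NRSur7dq4": m2_nrsur, "SEOBNRv5PHM": m2_seob, ...})
  ```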
-
**Discrepancy metrics (JS divergence; 1D/2D overlaps) are central evidence but the implementation is under-specified and lacks robustness/sensitivity validation (Sec. 2.2; Table 4; figures summarizing JS/overlap).** KDE-based JS estimates can be sensitive to kernel choice, bandwidth, boundary leakage for bounded parameters ($\chi_{p} \in [0,1]$, $\cos$ tilts $\in [-1,1]$), grid choice, and numerical integration. The claimed JS bound of $[0,1]$ depends on the log base and normalization, neither of which is fully specified. *Recommendation:* In Sec. 2.2.1–2.2.2, provide full computational details: KDE kernel, bandwidth method (citing Scott’s rule alone is not sufficient; state the kernel and any multivariate bandwidth strategy), boundary handling for bounded variables, grid/domain definitions (global vs per-model support), discretization and numerical-integration scheme, and the JS definition including log base. Add sanity checks: overlap(model, model)$\approx 1$; stability under grid refinement; and at least a small sensitivity study for key parameters (e.g., $m_2$, $\chi_{\rm eff}$, $z$, $\chi_{p}$) varying the bandwidth and/or using an alternative divergence estimator (e.g., histograms with principled binning or a kNN-based estimator). Report uncertainties (e.g., bootstrap) for representative JS/overlap values in Table 4 and/or an appendix.
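
  To illustrate the level of specification we are asking for, a minimal 1D sketch under one concrete set of choices (Gaussian KDE with Scott's bandwidth, reflection at hard bounds, a shared evaluation grid, base-2 logarithms so the JSD lies in $[0,1]$, and a simple bootstrap error bar); this is one possible convention, not a claim about the authors' implementation:

  ```python
  import numpy as np
  from scipy.stats import gaussian_kde

  def reflect(samples, lo=None, hi=None):
      """Reduce KDE leakage past hard bounds by reflecting samples about them."""
      parts = [samples]
      if lo is not None:
          parts.append(2 * lo - samples)
      if hi is not None:
          parts.append(2 * hi - samples)
      return np.concatenate(parts)

  def js_divergence(a, b, lo=None, hi=None, n_grid=2000):
      """1D Jensen-Shannon divergence in bits (base-2 logs), bounded by [0, 1].
      Gaussian kernel, Scott bandwidth, densities normalized on a shared grid."""
      x_lo = lo if lo is not None else min(a.min(), b.min())
      x_hi = hi if hi is not None else max(a.max(), b.max())
      x, dx = np.linspace(x_lo, x_hi, n_grid, retstep=True)
      p = gaussian_kde(reflect(a, lo, hi))(x)
      q = gaussian_kde(reflect(b, lo, hi))(x)
      p, q = p / (p.sum() * dx), q / (q.sum() * dx)
      m = 0.5 * (p + q)
      eps = 1e-300  # guards the log where a density underflows to zero
      kl = lambda u, v: np.sum(u * np.log2((u + eps) / (v + eps))) * dx
      return 0.5 * kl(p, m) + 0.5 * kl(q, m)

  def js_with_bootstrap(a, b, n_boot=200, seed=0, **kwargs):
      """Point estimate plus a simple bootstrap standard error."""
      rng = np.random.default_rng(seed)
      boots = [js_divergence(rng.choice(a, a.size), rng.choice(b, b.size), **kwargs)
               for _ in range(n_boot)]
      return js_divergence(a, b, **kwargs), float(np.std(boots))
  ```

  The same conventions should also pass the overlap(model, model)$\approx 1$ and grid-refinement checks mentioned above.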
-
**The grouping-based “attribution” claims are currently stronger than justified by the analysis design (Sec. 2.4; Sec. 3.3).** Pooling posteriors within groups (domain/calibration/precession) does not isolate a single modeling attribute because group labels are confounded with waveform family and implementation details; group sizes are small; memberships overlap; and intra-group variability is not systematically compared to inter-group separation. As a result, statements such as “domain is the primary driver” read as causal rather than associative. *Recommendation:* Revise Sec. 2.4 and Sec. 3.3 to (i) explicitly state assumptions behind pooling and acknowledge confounding/overlap, (ii) soften causal language to “correlated with” unless controlled contrasts are demonstrated, and (iii) add quantitative tests comparing intra-group vs inter-group differences (e.g., variance-ratio/ANOVA-style summaries on key parameters or on PC/IC projections). Where possible, add within-family comparisons that reduce confounding (e.g., IMRPhenomXPHM vs IMRPhenomXO4a; IMRPhenomTPHM vs IMRPhenomX*; or EOB vs surrogate within similar modeling choices). Include bootstrap intervals for group-level metrics and perform sensitivity checks excluding potential outliers to see if trends persist.
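
  One simple quantitative contrast of the kind we mean (complementary to a variance-ratio/ANOVA-style summary): compare pairwise JS divergences for within-group versus between-group model pairs, one grouping attribute at a time. The grouping below is illustrative only and should be taken from the corrected Table 1:

  ```python
  import itertools
  import numpy as np

  def within_vs_between(js_matrix, labels, group_of):
      """Median pairwise JS divergence for model pairs sharing a group label
      vs pairs spanning different groups, for one grouping attribute."""
      within, between = [], []
      for (i, a), (j, b) in itertools.combinations(enumerate(labels), 2):
          (within if group_of[a] == group_of[b] else between).append(js_matrix[i, j])
      return np.median(within), np.median(between)

  # Illustrative grouping by domain (placeholder labels, to match the corrected Table 1):
  # labels = ["NRSur7dq4", "SEOBNRv5PHM", "IMRPhenomTPHM", "IMRPhenomXPHM", "IMRPhenomXO4a"]
  # group_of = {"NRSur7dq4": "TD", "SEOBNRv5PHM": "TD", "IMRPhenomTPHM": "TD",
  #             "IMRPhenomXPHM": "FD", "IMRPhenomXO4a": "FD"}
  # js_matrix[i, j] = JS divergence between models i and j for one parameter (e.g., m_2)
  ```

  Bootstrapping the posterior samples before recomputing the pairwise divergences would provide the group-level uncertainty intervals requested above.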
-
**PCA/ICA analysis mixes samples across models in a way that can conflate inter-model shifts with physical degeneracy structure, and weighting/balancing across models is not clearly defined (Sec. 2.3; Sec. 3.2).** If one model contributes more samples or broader posteriors, it can dominate the standardization and component directions. ICA results are not fully reproducible because the scaling/whitening conventions and loading matrices are not provided. *Recommendation:* Clarify and, if needed, revise the PCA/ICA procedure in Sec. 2.3: specify sample counts per model, whether you downsample/weight so each model contributes equally, the exact standardization convention (computed on pooled samples vs per-model then combined), and PCA$\rightarrow$ICA whitening details, random seeds, convergence criteria, and software versions. Add: (i) explained-variance ratios for the first several PCs; (ii) the full loading matrix (or an appendix) for PCs used in figures; (iii) an ICA loading table (clearly defining whether values are mixing/unmixing loadings or correlations). Consider adding a stability check: repeat PCA/ICA with equal-per-model downsampling and report whether the qualitative conclusions in Sec. 3.2 and Sec. 3.3 are unchanged.
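
  A sketch of the balanced variant we are suggesting, assuming per-model sample arrays with a common parameter ordering; apart from the scikit-learn calls, all names and settings are placeholders:

  ```python
  import numpy as np
  from sklearn.decomposition import PCA, FastICA
  from sklearn.preprocessing import StandardScaler

  def balanced_pca_ica(per_model_arrays, n_per_model=5000, n_components=4, seed=42):
      """Equal-per-model downsampling, pooled z-scoring, PCA, then ICA in PC space.
      per_model_arrays: dict of model name -> (n_samples, n_params) array, with the
      same parameter ordering for every model (each assumed to have >= n_per_model rows)."""
      rng = np.random.default_rng(seed)
      blocks = [arr[rng.choice(len(arr), size=n_per_model, replace=False)]
                for arr in per_model_arrays.values()]
      X = StandardScaler().fit_transform(np.vstack(blocks))   # pooled standardization

      pca = PCA(n_components=n_components).fit(X)
      ica = FastICA(n_components=n_components, random_state=seed,
                    max_iter=1000).fit(pca.transform(X))

      return {
          "explained_variance_ratio": pca.explained_variance_ratio_,  # report per PC
          "pca_loadings": pca.components_,   # rows = PCs, columns = input parameters
          "ica_unmixing": ica.components_,   # acts on PC-space data; state this convention
      }
  ```

  Reporting these outputs (plus seeds and library versions) would make the Sec. 3.2 projections reproducible and would let readers see whether the conclusions survive equal-per-model weighting.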
-
**Mass–redshift/distance degeneracy and cosmology assumptions are not discussed with enough care given that redshift is a headline systematic (Sec. 3.1; Sec. 3.4).** It is unclear what cosmology is used for converting luminosity distance to redshift, and whether the dominant disagreement is in distance/$(1+z)$ rather than detector-frame masses. Without showing detector-frame masses and luminosity distance, some “mass systematics” may be largely reparameterizations of distance/redshift differences. *Recommendation:* State explicitly the cosmology used for distance$\rightarrow$redshift conversion and whether it is fixed across models. Add detector-frame mass posteriors (e.g., $M_{\rm chirp,det}$, total mass in detector frame) and luminosity distance posteriors alongside source-frame masses and $z$ (either in Sec. 3.1 or an appendix). Discuss results in terms of the mass–distance–inclination/redshift degeneracy and clarify whether inter-model differences primarily appear in $D_L/z$ or also in detector-frame intrinsic scales.
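
  For clarity about the requested conversion, a minimal sketch with Planck15 purely as a placeholder (the paper should state whichever cosmology was actually used and whether it is fixed across models):

  ```python
  import numpy as np
  import astropy.units as u
  from astropy.cosmology import Planck15, z_at_value

  def detector_frame_masses(m_source, d_lum_mpc, cosmology=Planck15):
      """Redshift from luminosity distance under a stated cosmology, and the
      corresponding detector-frame masses m_det = m_source * (1 + z)."""
      z = np.array([float(z_at_value(cosmology.luminosity_distance, d * u.Mpc))
                    for d in np.atleast_1d(d_lum_mpc)])
      return m_source * (1.0 + z), z

  # Placeholder usage per posterior sample:
  # m1_det, z = detector_frame_masses(m1_source_samples, d_L_samples_in_Mpc)
  ```

  Plotting $m_{\rm det}$ and $D_L$ alongside the source-frame quantities would make it immediately clear whether the inter-model shifts originate in $D_L/z$ or in the detector-frame intrinsic scales.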
-
**One model (IMRPhenomXO4a) appears to behave as an outlier for key parameters (notably $m_2$ and remnant quantities; Sec. 3.1 and related figures), but the paper does not demonstrate that this is a well-sampled solution exploring the same likelihood mode rather than a sampling pathology or prior-boundary effect.** *Recommendation:* Provide diagnostics showing that the XO4a posterior is reliable and comparable: sampler convergence/ESS (or equivalent), checks for multimodality, and (if available) comparisons of maximum log-likelihood (or matched-filter/logL summaries) across waveform runs. Confirm identical priors/support ($q$, spins, distance) and identical waveform settings (modes included, $f_{\rm min}$, reference frequency). If XO4a is near a model-validity boundary or extrapolating, explicitly state this (Table 1 and Secs. 3.1 and 4) and temper the interpretation accordingly.
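
  A sketch of the cheap per-run comparability summary we have in mind; column names and prior bounds are placeholders, and this covers only the max-log-likelihood and prior-railing checks, not sampler-specific convergence/ESS diagnostics:

  ```python
  import numpy as np

  def run_summary(posterior, log_likelihood, prior_bounds, edge_frac=0.01):
      """Per-run summary: sample count, maximum log-likelihood, and the fraction of
      posterior mass within edge_frac of each prior bound (a crude railing flag)."""
      summary = {"n_samples": len(log_likelihood),
                 "max_logL": float(np.max(log_likelihood))}
      for name, (lo, hi) in prior_bounds.items():
          x = np.asarray(posterior[name])
          edge = edge_frac * (hi - lo)
          summary[f"{name}_frac_near_bound"] = float(np.mean((x < lo + edge) | (x > hi - edge)))
      return summary

  # Hypothetical usage per waveform run, with placeholder column names and bounds:
  # run_summary(samples, samples["log_likelihood"],
  #             {"mass_ratio": (0.05, 1.0), "a_1": (0.0, 0.99), "a_2": (0.0, 0.99)})
  ```

  Tabulating such summaries for all five runs would let readers judge whether XO4a explores the same likelihood mode or is being pushed against its prior or validity boundaries.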