- **Outcome distributions are incompatible with the current ordinary least squares (OLS) regression framework, undermining inference for key claims (Sec. 2.4.2, Sec. 3). Several outcomes are binary (STM_Perseverative_Error, LTM_Perseverative_Error), others are counts with likely overdispersion/zero inflation (STM/LTM perseveration counts), and Time_to_First_Reward is a time-to-event variable with explicit right-censoring at 3 hours. The manuscript itself notes strong non-normality (e.g., brain volume Shapiro–Wilk $p < 0.001$; non-normal residuals), making $p$-values/SEs from OLS unreliable for the central results and for downstream residual-based “resilience” indices.** *Recommendation:* Refit models using outcome-appropriate likelihoods (and report effect sizes with CIs): (i) logistic regression (binomial GLM) for binary outcomes, reporting odds ratios; (ii) Poisson/negative binomial GLMs for counts (assess overdispersion; consider zero-inflated/hurdle variants if many zeros); (iii) survival models for Time_to_First_Reward that treat non-finders as censored at $3\,{\rm h}$ (e.g., Cox PH or parametric accelerated failure-time). If linear models are retained for specific reasons, justify explicitly and provide robustness checks (e.g., HC3 robust SEs; transformation sensitivity; influence diagnostics). Update Sec. 2.4.2 to specify each model family and Sec. 3 to report the revised estimates and interpretations.
- **Global brain volume quantification via “counting non-zero voxels” in skull-stripped mean $b=0$ images is potentially fragile and may not measure volume as intended (Sec. 2.3.2, Sec. 2.2.2). Non-zero intensity is not equivalent to a brain mask: interpolation/data-type conversion can introduce non-zero background, and true brain voxels can become zero depending on preprocessing. Without confirming that non-brain voxels are exactly zero for every subject, volume estimates could be biased and/or vary with preprocessing quirks rather than anatomy. MRI acquisition/preprocessing details and QC are also too sparse to evaluate measurement validity (Sec. 2.2.2).** *Recommendation:* Clarify the skull-stripping pipeline in Sec. 2.2.2–2.3.2 (software, parameters, whether a binary mask was produced and applied). Prefer computing volume directly from a binarized brain mask in native space rather than intensity-based “non-zero” counting; if thresholding is used, define the threshold and show robustness. Add essential acquisition parameters (scanner, field strength, voxel size, TR/TE, diffusion directions, $b$-values) and QC steps (e.g., examples of masks, outlier handling, scan/session consistency). Report whether voxel sizes/resolution are identical across scans; if not, show how this is handled. These additions are necessary before concluding there is “no association” between age and global brain volume.
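To make the mask-versus-non-zero distinction concrete, a toy example in pure NumPy (synthetic image; the 0.5 mm isotropic voxel size and the threshold of 1.0 are illustrative assumptions, not values from the manuscript):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 3-D "skull-stripped" image: a bright brain sphere on a background
# that picks up faint non-zero values after interpolation/resampling
shape = (64, 64, 64)
zz, yy, xx = np.indices(shape)
brain = ((xx - 32) ** 2 + (yy - 32) ** 2 + (zz - 32) ** 2) < 20 ** 2
img = np.where(brain, rng.uniform(200, 400, shape), 0.0)
img += rng.uniform(0, 0.5, shape)  # near-zero background introduced by preprocessing

voxel_vol_mm3 = 0.5 * 0.5 * 0.5   # must come from each scan's header

naive = np.count_nonzero(img) * voxel_vol_mm3  # counts the background too
mask = img > 1.0                               # explicit binarized brain mask
masked = mask.sum() * voxel_vol_mm3

# naive grossly overestimates: every background voxel is non-zero here,
# while masked recovers the true sphere volume
```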
- **Key confounds for interpreting global brain volume (and its null association with age) are not addressed: body size/allometry and scan/session effects (Secs. 2.1–2.2, Sec. 3, Sec. 4.4). Total brain volume is strongly related to head/body size and can be sensitive to hydration/positioning/protocol variation. Without controlling for body size proxies (e.g., body mass, forearm length, head size) or verifying protocol uniformity (or including session covariates), between-subject variability could mask age effects or create spurious associations (including the reported brain volume–STM perseverative error relationship).** *Recommendation:* If available, include a body-size covariate (or intracranial volume/head size proxy) in brain-volume and brain-volume$\rightarrow$cognition models (Sec. 2.4.2, Sec. 3). Explicitly state whether all animals were scanned on the same scanner and protocol; if multiple sessions/protocol variations exist, include session/date as a covariate or random effect and report sensitivity. If these covariates are unavailable, add a clear limitation and temper claims about (lack of) atrophy/resistance in Sec. 4.4.
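A sketch of the suggested covariate adjustment, with invented column names (`body_mass`, `session`) and toy values; session enters either as a fixed effect or as a random intercept:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 33
df = pd.DataFrame({
    "brain_vol": rng.normal(1800, 60, n),      # mm^3, toy values
    "dnam_age": rng.uniform(1, 10, n),
    "body_mass": rng.normal(150, 15, n),       # hypothetical body-size proxy
    "session": np.repeat(["A", "B"], [17, 16]),  # hypothetical scan batches
})

# Fixed-effect adjustment for body size and scan session
adj = smf.ols("brain_vol ~ dnam_age + body_mass + C(session)", df).fit()

# Or treat session as a random intercept
mixed = smf.mixedlm("brain_vol ~ dnam_age + body_mass", df,
                    groups=df["session"]).fit()
```

Comparing the age coefficient with and without these covariates is the sensitivity check the recommendation asks for.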
- **The epigenetic clock variable is insufficiently documented, and the manuscript is unclear about chronological age vs DNAm age usage, limiting biological interpretation of “epigenetic age” effects (Sec. 2.1, Sec. 2.2.1, Sec. 3). The clock is referenced but not fully cited/described (training sample size/age range, tissue, performance metrics such as $R^2$/MAE). It is also unclear whether the age range reported is DNAm age or chronological age, how closely they correspond in this cohort, and whether “age acceleration” (DNAm residual vs chronological age) is considered.** *Recommendation:* In Sec. 2.2.1, provide a full citation and a concise description of the clock (species/tissue, training $N$, age range, cross-validated performance—$R^2$ and MAE). In Sec. 2.1 and Sec. 3, state explicitly which age variable is used in each analysis (DNAm age only vs chronological vs both). If chronological age exists, report its correlation with DNAm age and consider age-acceleration analyses (or justify not doing so). If chronological age is unknown/unreliable, state that clearly and moderate language equating DNAm age with “biological aging” (Sec. 4.3–4.4).
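The age-acceleration quantity is simple to specify; a sketch with simulated ages (the 0.9 slope and noise level are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
chron = rng.uniform(1, 15, 33)                # chronological age, if known
dnam = 0.9 * chron + rng.normal(0, 1.0, 33)   # toy DNAm-clock estimates

# Correspondence between the two age variables (report this)
r = np.corrcoef(chron, dnam)[0, 1]

# Age acceleration: residual of DNAm age regressed on chronological age
slope, intercept = np.polyfit(chron, dnam, 1)
age_accel = dnam - (slope * chron + intercept)
# Positive residuals = epigenetically "older" than chronological age predicts
```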
- **Behavioral paradigm and derived metrics are under-specified, and missingness/censoring handling is not transparent, reducing reproducibility and interpretability (Secs. 2.2.3, 2.3.3, 2.4.1, Sec. 3). Critical task details (arena geometry, number/layout of boxes, fixed vs randomized box identities across bats/phases, phase durations, inter-phase intervals) are missing. Time_to_First_Reward appears capped at 3 hours, but the analytic treatment of non-finders (censoring vs fixed maximum) is unclear. Reported degrees of freedom vary across outcomes (e.g., $t(29)$ vs $t(24)$), conflicting with the stated “final analytical sample size of 33 bats with complete data,” implying outcome-specific missingness not described.** *Recommendation:* Expand Sec. 2.2.3 and Sec. 2.3.3 with a concise but complete apparatus/procedure description (arena size, box number/positions, randomization, phase lengths, delays—explicitly including the $18$-hour LTM delay—and reward rules). Explicitly define how “no reward within $3\,{\rm h}$” cases are handled (preferably as censored in a survival model; see Major Issue 1). Add a per-metric missingness table: $N$ used in each model in Sec. 3, number censored for Time_to_First_Reward, and reasons for missing data (non-participation, tracking failures, incomplete logs). Update Sec. 2.4.1 to describe whether listwise deletion, inner-join merging, or metric-specific inclusion was used.
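The requested per-metric missingness table is straightforward to generate from the merged data frame; a sketch with invented columns and randomly placed missing values standing in for, e.g., tracking failures:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "bat_id": range(33),
    "stm_persev_error": rng.integers(0, 2, 33).astype(float),
    "time_to_reward": rng.uniform(0.1, 3.5, 33),
})
# Simulate outcome-specific missingness (e.g., tracking failures)
df.loc[rng.choice(33, 4, replace=False), "stm_persev_error"] = np.nan

# Per-metric N actually used in each model, plus censoring for the timed task
metrics = df.drop(columns="bat_id")
summary = pd.DataFrame({
    "n_used": metrics.notna().sum(),
    "n_missing": metrics.isna().sum(),
})
summary.loc["time_to_reward", "n_censored"] = int(
    (df["time_to_reward"] >= 3.0).sum()
)
```

Such a table makes the $t(29)$ vs $t(24)$ discrepancy immediately auditable.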
- **Mediation and “cognitive resilience” analyses are under-specified and not reported in a way that supports the claims (Sec. 2.4.2, Sec. 3). Mediation (Age $\rightarrow$ Brain_Volume $\rightarrow$ Cognition) is described but indirect-effect estimates/CIs are not clearly presented, and mediation is conceptually unlikely given the reported null Age$\rightarrow$Brain_Volume path. The resilience indices are residuals from age–cognition regressions, but (i) residualization models appear inconsistent with later covariate inclusion (sex/colony), (ii) residual definitions are not straightforward for binary/count/time-to-event outcomes, and (iii) metric-selection rules for resilience are inconsistent across Methods/Results (e.g., inclusion of STM_Perseveration_Count despite unclear age association).** *Recommendation:* Either (a) remove mediation from the manuscript if it is not central/supported, or (b) implement and report it fully: specify the mediator and outcome models (including covariates), bootstrap details, and report indirect effects with $95\%$ CIs in Sec. 3. For resilience, define it in a model-consistent way: residualize from the full baseline model including DNAm age + sex + colony (and other key covariates), and for non-Gaussian outcomes use appropriate residuals (e.g., deviance residuals for GLMs; martingale/deviance residuals or time-ratio residuals for survival/AFT). Pre-specify or clearly justify which metrics are used and provide a table aligning metric selection between Sec. 2.4.2 and Sec. 3.
- **Multiplicity, power, and interpretational scope are not adequately addressed given many models, borderline $p$-values, $N\approx 33$, and a cross-sectional design (Secs. 3, 4.3–4.4). Multiple behavioral outcomes and several brain-volume/cognition/resilience models are tested; selective emphasis on $p\approx 0.05$ findings risks false positives. In addition, concluding “resistance to global brain atrophy” is too strong from cross-sectional null results over a limited age window relative to lifespan.** *Recommendation:* In Sec. 2.4.2, define primary hypotheses/endpoints (or explicitly label the work exploratory) and apply a multiple-comparisons approach across the behavioral family (e.g., Benjamini–Hochberg FDR). In Sec. 3, report effect sizes with $95\%$ CIs (not only $p$-values) and explicitly note borderline/uncertain results. In Sec. 4.3–4.4, rephrase causal/mechanistic claims as hypotheses (“consistent with…”, “may reflect…”) and frame the brain-volume null as “no detectable cross-sectional association in this age range with this measurement.” Consider adding a brief sensitivity/power statement (e.g., what decline slope could be detected given observed variance and $N$).
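The Benjamini–Hochberg step-up procedure is easy to implement (or to call via `statsmodels`' `multipletests(..., method="fdr_bh")`); a self-contained sketch with an illustrative family of p-values, several of them borderline:

```python
import numpy as np

def bh_fdr(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    thresh = alpha * np.arange(1, m + 1) / m  # i/m * alpha
    below = ranked <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest i with p_(i) <= thresh
        reject[order[: k + 1]] = True         # reject all smaller p-values too
    return reject

# Toy family of behavioral p-values (not from the manuscript)
pvals = [0.004, 0.03, 0.049, 0.12, 0.51]
kept = bh_fdr(pvals)
```

Note how a raw $p = 0.049$ does not survive even a five-test family, which is exactly the concern about selective emphasis on borderline findings.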