- **External validity is limited by a single very small empirical design (N=10, 1,000 days, 60-day rolling window) and one bespoke two-factor specification (PC1 market + a hand-built tech long–short factor)** (Sec. 2.1, Sec. 2.2.1, Sec. 3.1–3.2). With N=10 the setting is not “high-dimensional,” and factor models typically show their main benefits in larger universes and/or richer factor sets; conversely, the extreme condition numbers reported for the factor model may be idiosyncratic to this universe, window length, and factor definition. As written, the conclusions in Sec. 4 can read as broadly ruling out structural factor covariance models under heteroskedasticity, which is stronger than the current evidence supports. *Recommendation:* Add robustness checks in Sec. 3 that vary at least: (i) window length (e.g., 40/60/120 days), (ii) asset universe size/composition (e.g., expand to 30–50 equities; try a different sector mix or market), and (iii) factor specification complexity (market-only; alternative sector split; optionally a standard style factor if available). Report how realized risk, condition numbers, and turnover move across these variants (see the grid-loop sketch after this list). If expansion is infeasible, explicitly narrow the claim in Sec. 4 to the studied small-universe/short-window setting and discuss why results might differ for larger universes, where factor structure is typically stabilizing.
- **Key methodological details are missing, limiting reproducibility and making it difficult to assess whether the factor model’s instability is intrinsic or implementation-induced** (Sec. 2.1–2.2.2). Missing/unclear items include: the exact GARCH(1,1) mean specification and innovation distribution (Gaussian vs t), parameter constraints, estimation method, and whether GARCH parameters are re-estimated each window (Sec. 2.1); the exact construction of the technology subset and long–short factor (constituents, weights, normalization/standardization, time-invariance) and any subsequent scaling after Gram–Schmidt (Sec. 2.2.1); PCA preprocessing (demeaning, correlation vs covariance matrix); and which Ledoit–Wolf constant-correlation variant/target/intensity formula and software implementation are used (Sec. 2.2.2). *Recommendation:* Expand Sec. 2.1–2.2 (or add an implementation appendix) specifying: (a) the full GARCH model (mean, distribution, estimation routine, re-estimation frequency, convergence handling); (b) factor definitions with an explicit ticker list for the tech leg, the long/short weighting scheme, normalization (e.g., dollar-neutral and unit-variance), and whether factors/loadings are re-scaled after orthogonalization; (c) PCA computation details (demeaning, matrix choice, sign-convention handling); and (d) the exact Ledoit–Wolf reference/variant and how δ_t and the constant-correlation target F_t are computed, including the library/code used (a specification sketch appears after this list). This will make the pipeline auditable and help interpret the source of the numerical problems.
- **There is a material inconsistency/ambiguity in the factor-model fit (R²): the text reports R² values “mostly between 0.4 and 0.6” (Sec. 3.2), but Fig. 2 (as described in the unstructured report and noted in the structured report) appears to show values around ~0.83–1.00, with spikes to 1.0.** This is central because Secs. 3.2–4 use the R² level and stability to argue that the factor span is adequate and that instability instead comes from Ψ_t and rescaling. If R² is miscomputed, aggregated differently than stated, or affected by leakage/look-ahead, the causal narrative becomes unreliable. *Recommendation:* Audit and reconcile the R² definition and plotting in Sec. 3.2 / Fig. 2: state precisely whether Fig. 2 shows (i) a cross-sectional average of per-asset OLS R², (ii) a variance-explained ratio from PCA, or (iii) something else; clarify whether R² is in-sample within the 60-day window or evaluated out-of-sample; and report summary statistics (mean/median/IQR/min/max) across time and assets (see the R² audit sketch after this list). Plot R² on the full [0,1] y-axis (optionally with an inset) and investigate the spikes to exactly 1.0 (potential degenerate windows, near-collinearity, or implementation errors). If any look-ahead is present, correct it and update the conclusions in Sec. 4 accordingly.
- **Time-indexing around GARCH standardization and rescaling is ambiguous/inconsistent** (Sec. 2.1–2.3; Eq. (1) vs narrative; Eqs. (3) and (5)). The manuscript describes one-step-ahead forecasts (for day t+1) but standardizes as z_{i,t}=r_{i,t}/\hat\sigma_{i,t} (Eq. (1)) and rescales using diag(\hat\sigma_t) (Eqs. (3)/(5)). Without explicit timing, it is unclear whether the Σ_t used for the weights targets Cov(r_{t+1}|F_t) or Cov(r_t|F_{t-1}), and mismatched indices could also contribute to the apparent instability. *Recommendation:* Make the timing explicit and consistent throughout Sec. 2: define whether \hat\sigma_{i,t} denotes the conditional s.d. of r_{i,t} given information at t−1, or the forecast for r_{i,t+1} given information at t. Then update Eq. (1) and the rescaling in Eqs. (3)/(5) to use matching indices (e.g., use diag(\hat\sigma_{t+1|t}) if Σ_t is meant to forecast next-day return covariance), as in the timing sketch after this list. Add one sentence in Sec. 2.3 clarifying which Σ is optimized to generate w_t and which realized return r_{t+1} evaluates it.
- **The diagnosis of the factor model’s extreme ill-conditioning is plausible but remains largely qualitative and under-identified: it is unclear whether instability originates in (i) innovation-space factor estimation, (ii) near-zero/noisy idiosyncratic variances Ψ_t, (iii) the GARCH rescaling step amplifying dispersion in volatility forecasts, or (iv) PD/regularization/solver handling** (Sec. 3.1–3.2, Sec. 4). The reported mean condition numbers (≈88,480 for the factor model) are unusually large for a low-rank-plus-diagonal covariance unless some ψ_{i,t} are extremely small or the numerical handling is problematic. *Recommendation:* Add targeted diagnostics in Sec. 3.2 to isolate the mechanism (see the conditioning-diagnostics sketch after this list): (a) report condition numbers for the innovation covariances Σ_{z,t} (before rescaling) for both methods; (b) decompose factor-covariance conditioning by reporting κ(B_tΩ_tB_tᵀ) and κ(B_tΩ_tB_tᵀ+Ψ_t) in innovation space, and κ after rescaling; (c) report the empirical distribution over time of the diagonal ψ_{i,t} (min/percentiles) and of \hat\sigma_{i,t} (min/percentiles), and show whether spikes in κ line up with extreme ψ or σ; (d) implement minimal regularizations (e.g., floor ψ_{i,t}≥ε, shrink Ψ_t toward a constant-diagonal target, or smooth Ψ_t over time) and show the impact on κ, realized risk, and turnover. These additions would convert the narrative in Sec. 4 from conjecture to evidence.
- **Performance evaluation is not statistically characterized and the realized “variance” metric is potentially misinterpreted** (Sec. 2.3.2, Sec. 3.1). The reported daily realized variance uses wᵀ r rᵀ w = (wᵀ r)², which is a squared realized portfolio return (a second moment), not a variance estimator unless it is carefully aggregated and mean effects are addressed. In addition, the comparison relies mainly on time-series averages without dispersion measures, confidence intervals, or paired tests, so it is unclear whether the difference (e.g., 0.000126 vs 0.000153) is statistically/economically meaningful. *Recommendation:* In Sec. 2.3.2, rename the metric as “squared realized return” (or explicitly justify interpreting its time-average as an out-of-sample second moment under a zero-mean approximation). Complement it with a standard out-of-sample variance estimate of portfolio returns over the backtest (or a rolling realized variance of portfolio returns). In Sec. 3.1, add dispersion measures (SD/IQR) for realized risk, condition numbers, and turnover; compute confidence intervals for mean differences (e.g., a block bootstrap over days, sketched after this list); and run simple paired tests on daily squared returns. Optionally report basic return metrics (mean return, volatility, Sharpe) to contextualize whether lower risk coincides with comparable returns.
- **Figures and key result presentation contain omissions and potential errors that materially affect interpretability** (Sec. 3.1–3.2). Figure 1 is described as multi-panel (variance/condition number/turnover) but appears incomplete; axis labels/units/time scale are unclear; and the condition numbers likely require log scaling to be readable. Figure 2 has the R² discrepancy noted above, and the y-axis treatment may visually overstate changes. These presentation issues impede verification of the main claims. *Recommendation:* Rebuild Figure 1 as a true 3-panel figure (or separate, clearly labeled subfigures) with explicit units (daily vs annualized), a date axis, and a legend placed outside the plotting area; plot condition numbers on a log10 scale (a plotting sketch appears after this list). For Figure 2, after reconciling R², use the full [0,1] scale (optionally add an inset), label the x-axis with dates, and include summary statistics in the caption. Ensure captions state clearly: the rolling window length, whether GARCH filtering is applied, and whether quantities are in innovation space or rescaled return space.
- **Portfolio optimization/PD handling is under-specified despite being central given the paper’s emphasis on ill-conditioning** (Sec. 2.3.1, Sec. 3.1). With extreme condition numbers, results can depend heavily on whether Σ_t is enforced to be PSD/PD (eigenvalue clipping, εI jitter), how the QP is solved, and the solver tolerances. Without these details, it is hard to attribute differences to the covariance estimators rather than to numerical optimization choices. *Recommendation:* In Sec. 2.3.1, specify the solver/library used for the long-only QP, its tolerances, and how non-PD or nearly singular Σ_t is treated (symmetrization, eigenvalue clipping, ridge adjustment εI, using singular values for κ); a minimal repair sketch appears after this list. Report how often PD fixes were needed under each estimator and whether any days were dropped. Consider adding weight-stability diagnostics (max weight, effective number of holdings 1/∑w_i²) to connect ill-conditioning to economically meaningful portfolio concentration beyond turnover.
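
The sketches below illustrate several of the recommendations above. All are minimal Python examples written under stated assumptions about the authors' pipeline; none is a claim about the manuscript's actual implementation.

For the robustness checks (first point), a grid-loop sketch; `run_backtest`, the window lengths, universe labels, and factor-specification labels are hypothetical placeholders for whatever the existing pipeline exposes:

```python
import itertools
import pandas as pd

# Hypothetical robustness grid mirroring recommendations (i)-(iii).
WINDOWS = [40, 60, 120]                                    # rolling window (days)
UNIVERSES = ["N10_original", "N30", "N50"]                 # universe variants
FACTORS = ["market_only", "market_plus_tech", "market_plus_sector"]

def run_backtest(window: int, universe: str, factors: str) -> dict:
    """Placeholder for the authors' pipeline; should return the Sec. 3.1
    summary metrics for one configuration."""
    return {"realized_risk": 0.0,           # replace with real metrics
            "mean_condition_number": 0.0,
            "mean_turnover": 0.0}

rows = []
for w, u, f in itertools.product(WINDOWS, UNIVERSES, FACTORS):
    rows.append({"window": w, "universe": u, "factors": f,
                 **run_backtest(window=w, universe=u, factors=f)})

# One table per metric makes it easy to see which conclusions survive the grid.
summary = pd.DataFrame(rows)
print(summary.pivot_table(index=["universe", "factors"], columns="window"))
```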
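
For the reproducibility point (Sec. 2.1–2.2.2), a sketch of the level of specificity the revision should reach, assuming Python with the `arch` and `scikit-learn` packages. The constant-mean/Student-t choices are illustrative only; note also that scikit-learn's `LedoitWolf` shrinks toward a scaled identity rather than the constant-correlation target, so if Sec. 2.2.2 means the constant-correlation variant, its target F_t and intensity δ_t need a documented custom implementation:

```python
import numpy as np
import pandas as pd
from arch import arch_model
from sklearn.covariance import LedoitWolf

def fit_garch_11(returns: pd.Series):
    """Illustrative GARCH(1,1) spec: constant mean, Student-t innovations.
    These are exactly the choices Sec. 2.1 should state explicitly, along
    with the re-estimation frequency and convergence handling."""
    am = arch_model(returns, mean="Constant", vol="GARCH", p=1, q=1, dist="t")
    res = am.fit(disp="off")                                # MLE
    sigma_filtered = res.conditional_volatility             # sigma_{t | t-1}
    sigma_next = float(np.sqrt(res.forecast(horizon=1).variance.iloc[-1, 0]))
    return res, sigma_filtered, sigma_next                  # sigma_{T+1 | T}

def shrink_innovation_cov(Z: np.ndarray):
    """Ledoit-Wolf shrinkage on a (T x N) window of standardized innovations.
    Caveat: this implementation shrinks toward a scaled identity, not the
    constant-correlation target; if the paper uses the constant-correlation
    variant, that target and delta_t must be implemented and cited."""
    lw = LedoitWolf().fit(Z)
    return lw.covariance_, lw.shrinkage_
```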
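
For the R² reconciliation (Sec. 3.2 / Fig. 2), a sketch of one unambiguous definition to report against: per-asset in-window OLS R² of innovations on the factors, averaged cross-sectionally, with full summary statistics. Array names and shapes are assumptions about the pipeline:

```python
import numpy as np

def window_r2(Z: np.ndarray, F: np.ndarray) -> np.ndarray:
    """Per-asset in-sample OLS R^2 within one rolling window.
    Z: (T_w, N) standardized innovations; F: (T_w, K) factor returns."""
    X = np.column_stack([np.ones(len(F)), F])       # intercept + K factors
    beta, *_ = np.linalg.lstsq(X, Z, rcond=None)    # (K+1, N) coefficients
    resid = Z - X @ beta
    tss = ((Z - Z.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - (resid ** 2).sum(axis=0) / tss     # length-N vector of R^2

def audit(r2_windows: np.ndarray) -> dict:
    """r2_windows: (n_windows, N) stack of per-window, per-asset R^2.
    Report the full distribution rather than a smoothed average, and flag
    windows where any R^2 reaches ~1 (degenerate or near-collinear windows
    are a likely source of the Fig. 2 spikes)."""
    q = np.percentile(r2_windows, [0, 25, 50, 75, 100])
    return {
        "min": q[0], "q25": q[1], "median": q[2], "q75": q[3], "max": q[4],
        "cross_sectional_mean_by_window": r2_windows.mean(axis=1),
        "windows_hitting_one": np.where((r2_windows > 0.999).any(axis=1))[0],
    }
```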
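
For the timing point (Eq. (1) vs Eqs. (3)/(5)), a sketch of one internally consistent convention, assuming Σ_t is meant to forecast next-day covariance: standardize with the filtered volatility σ_{t|t−1} and rescale the innovation covariance with the one-step-ahead forecast σ_{t+1|t}. Shapes and names are assumptions:

```python
import numpy as np

# Assumed shapes: R is (T, N) returns in the current window;
# sigma_filt is (T, N) with sigma_filt[t, i] = sd of r_{i,t} given info to t-1;
# sigma_fcst is (N,) with the one-step-ahead forecast sd for day t+1.

def standardized_innovations(R, sigma_filt):
    # Eq. (1) with explicit timing: z_{i,t} = r_{i,t} / sigma_{i, t | t-1}
    return R / sigma_filt

def rescale_to_return_space(Sigma_z, sigma_fcst):
    # Eqs. (3)/(5) with matching indices:
    # Sigma_{t+1|t} = D_{t+1|t} Sigma_z D_{t+1|t}, D = diag(sigma_{t+1|t})
    D = np.diag(sigma_fcst)
    return D @ Sigma_z @ D

# The weights w_t are then computed from Sigma_{t+1|t} and evaluated against
# the realized return r_{t+1}, which is the pairing Sec. 2.3 should state.
```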
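
For the ill-conditioning diagnostics (Sec. 3.2, items (a)–(d)), a sketch of the per-day decomposition, assuming B_t, Ω_t, the diagonal of Ψ_t, and the forecast volatilities are available; the floor ε is illustrative:

```python
import numpy as np

def cond(M: np.ndarray) -> float:
    """Condition number from singular values (robust to tiny negative eigenvalues)."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] / s[-1]

def conditioning_report(B, Omega, psi_diag, sigma_fcst, eps=1e-6):
    """B: (N, K) loadings; Omega: (K, K) factor covariance;
    psi_diag: (N,) idiosyncratic variances; sigma_fcst: (N,) GARCH forecasts."""
    low_rank = B @ Omega @ B.T                       # rank K < N: expect huge kappa
    Sigma_z = low_rank + np.diag(psi_diag)           # innovation-space covariance
    D = np.diag(sigma_fcst)
    Sigma_r = D @ Sigma_z @ D                        # rescaled return-space covariance

    # Minimal regularization: floor the idiosyncratic variances.
    Sigma_z_floored = low_rank + np.diag(np.maximum(psi_diag, eps))

    return {
        "kappa_low_rank": cond(low_rank),
        "kappa_innov": cond(Sigma_z),
        "kappa_rescaled": cond(Sigma_r),
        "kappa_innov_floored": cond(Sigma_z_floored),
        "min_psi": float(psi_diag.min()),            # track alongside kappa spikes
        "min_sigma": float(sigma_fcst.min()),
    }
```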
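
For the statistical characterization (Sec. 3.1), a sketch of a circular block bootstrap confidence interval for the mean difference in daily squared portfolio returns between the two estimators; block length and replication count are illustrative:

```python
import numpy as np

def block_bootstrap_mean_diff(x, y, block_len=20, n_boot=5000, seed=0):
    """x, y: aligned daily squared portfolio returns (e.g., shrinkage vs factor).
    Returns the observed mean difference and a 95% bootstrap CI that respects
    serial dependence by resampling circular blocks of paired differences."""
    rng = np.random.default_rng(seed)
    d = np.asarray(x) - np.asarray(y)                 # paired daily differences
    T = len(d)
    n_blocks = int(np.ceil(T / block_len))
    stats = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, T, size=n_blocks)
        idx = (starts[:, None] + np.arange(block_len)) % T   # circular blocks
        stats[b] = d[idx.ravel()[:T]].mean()
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return d.mean(), (lo, hi)
```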
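
For the figure rework (Sec. 3.1–3.2), a matplotlib sketch of the suggested three-panel layout with a date axis, a log scale for condition numbers, and the legend outside the plotting area; series names are placeholders:

```python
import matplotlib.pyplot as plt

def plot_backtest(dates, realized_var, cond_numbers, turnover):
    """Each metric argument: dict {method_name: daily series aligned with `dates`}."""
    fig, axes = plt.subplots(3, 1, figsize=(9, 8), sharex=True)
    panels = [
        (realized_var, "Daily squared portfolio return"),
        (cond_numbers, "Condition number (log scale)"),
        (turnover, "Daily turnover"),
    ]
    for ax, (series, ylabel) in zip(axes, panels):
        for name, values in series.items():
            ax.plot(dates, values, label=name)
        ax.set_ylabel(ylabel)
    axes[1].set_yscale("log")                      # condition numbers on log10 scale
    axes[-1].set_xlabel("Date")
    handles, labels = axes[0].get_legend_handles_labels()
    fig.legend(handles, labels, loc="upper center", ncol=2,
               bbox_to_anchor=(0.5, 1.02))         # legend outside the plotting area
    fig.tight_layout()
    return fig
```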
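
For the PD handling and weight-stability diagnostics (Sec. 2.3.1), a sketch of an eigenvalue-clipping repair and the effective-number-of-holdings measure; the clip level is illustrative, and the revision should report whichever fix (clipping, ridge εI, or none) was actually applied and how often it triggered:

```python
import numpy as np

def make_pd(Sigma, min_eig=1e-8):
    """Symmetrize and clip eigenvalues so the long-only QP sees a PD matrix.
    Returns the repaired matrix and whether any clipping was needed."""
    S = 0.5 * (Sigma + Sigma.T)
    eigval, eigvec = np.linalg.eigh(S)
    needed_fix = bool((eigval < min_eig).any())
    eigval = np.maximum(eigval, min_eig)
    return eigvec @ np.diag(eigval) @ eigvec.T, needed_fix

def effective_holdings(w):
    """Effective number of holdings 1 / sum(w_i^2); links ill-conditioning
    to portfolio concentration beyond turnover."""
    w = np.asarray(w)
    return 1.0 / np.sum(w ** 2)
```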