Published audits and honest negatives.
Research artifacts from Chart Library’s own eval pipeline. Positive findings ship into the product; null results get the same publication treatment — the review discipline is the point. Methodology at /learn/methodology.
What held up
Within a single shape-similar cohort (NVDA 1h, K=300), the volatility regime at the anchor date splits forward outcomes: a 70% win rate in high vol (n=77, mean 5d return +2.73%) versus a 26% win rate in low vol (n=104, mean 5d return −4.45%). Same shape, opposite-sign outcomes. Stratification surfaces what averaging hides.
In the feature_importance output for news-day cohorts, the sign-alignment feature carries a negative coefficient: analogs where sentiment direction disagreed with price direction outperformed over the 5d horizon. This empirically recovers the priced-in vs narrative-change distinction. Coverage is limited to anchors dated 2024-01-01 or later.
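A minimal sketch of this kind of stratification, using made-up numbers rather than the NVDA cohort. The function name `stratify_outcomes` and the `(regime, forward_return)` tuple layout are assumptions for illustration, not the pipeline's actual schema.

```python
from collections import defaultdict
from statistics import mean

def stratify_outcomes(analogs):
    """Group (regime, forward_return) pairs by regime and summarize each group."""
    groups = defaultdict(list)
    for regime, fwd_return in analogs:
        groups[regime].append(fwd_return)
    summary = {}
    for regime, returns in groups.items():
        wins = sum(1 for r in returns if r > 0)
        summary[regime] = {
            "n": len(returns),
            "win_rate": wins / len(returns),
            "mean_return": mean(returns),
        }
    return summary

# Toy cohort (hypothetical values, in percent): same shape, two volatility regimes.
toy_cohort = [
    ("high", 2.0), ("high", 3.0), ("high", -1.0), ("high", 4.0),
    ("low", -5.0), ("low", 1.0), ("low", -3.0), ("low", -2.0),
]
summary = stratify_outcomes(toy_cohort)
pooled_mean = mean(r for _, r in toy_cohort)

for regime, stats in sorted(summary.items()):
    print(regime, stats)
print("pooled mean:", pooled_mean)
```

In the toy data the pooled mean sits near zero while the two strata have opposite signs, which is the pattern the finding describes.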
Caveats: News article ingestion starts 2024-01-01. Cohorts dominated by older analogs see news features dropped from the regression by the >50% missing-data filter.
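The missing-data filter in this caveat could look roughly like the sketch below. The row layout (dicts with `None` for missing values), the feature names, and the helper `drop_sparse_features` are hypothetical; only the >50% threshold comes from the text.

```python
def drop_sparse_features(rows, max_missing=0.5):
    """Return feature names whose missing fraction does not exceed max_missing."""
    features = sorted({name for row in rows for name in row})
    kept = []
    for name in features:
        missing = sum(1 for row in rows if row.get(name) is None)
        if missing / len(rows) <= max_missing:
            kept.append(name)
    return kept

# Toy cohort: three pre-2024 analogs with no news data, one 2024 analog with it.
toy_rows = [
    {"vol_regime": 0.8, "news_sign_alignment": None},
    {"vol_regime": 0.3, "news_sign_alignment": None},
    {"vol_regime": 0.5, "news_sign_alignment": None},
    {"vol_regime": 0.6, "news_sign_alignment": -1.0},
]
kept_features = drop_sparse_features(toy_rows)
print(kept_features)
```

Here the news feature is 75% missing, so it is dropped before any regression would see it.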
The raw retrieval [p10, p90] interval is a nominal 80% band, but empirically it covers only ~68% of realized outcomes. Conformal offsets (CQR-style) restore coverage to 82.5% on held-out data.
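For readers unfamiliar with the technique, here is a generic CQR-style offset computed on a calibration set. It is a sketch under stated assumptions, not the production code: the function names, the toy intervals, and the clamping of the rank to the calibration size are all illustrative choices.

```python
import math

def cqr_offset(lows, highs, actuals, alpha=0.2):
    """Conformal offset: a high quantile of how far outcomes fall outside [low, high]."""
    # Score is positive when the outcome is outside the band, negative when inside.
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(lows, highs, actuals))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    rank = min(rank, n)  # simplification: strictly, rank > n means an unbounded offset
    return scores[rank - 1]

def coverage(lows, highs, actuals, offset=0.0):
    """Fraction of outcomes inside the band widened by offset on both sides."""
    hits = sum(1 for lo, hi, y in zip(lows, highs, actuals)
               if lo - offset <= y <= hi + offset)
    return hits / len(actuals)

# Toy calibration set (hypothetical): a fixed raw band of [-1, 1] that is too narrow.
cal_lows = [-1.0] * 9
cal_highs = [1.0] * 9
cal_actuals = [0.0, 0.5, -0.5, 1.5, -2.0, 0.2, 0.9, -1.2, 3.0]

offset = cqr_offset(cal_lows, cal_highs, cal_actuals, alpha=0.2)
raw_cov = coverage(cal_lows, cal_highs, cal_actuals)
adj_cov = coverage(cal_lows, cal_highs, cal_actuals, offset)
print(offset, raw_cov, adj_cov)
```

The offset is learned on calibration data and then added to both ends of new intervals; coverage should be checked on data not used to compute it.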
A prior cross-symbol eval silently reused symbols across splits and overstated accuracy. Moving to symbol-disjoint MD5-bucketed splits + 10-day purge/embargo closed the leak. The inflated baseline claim was retracted.
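A symbol-disjoint split of this kind can be sketched as below. Everything beyond "MD5-bucketed" and "10-day purge/embargo" is an assumption: the bucket count, which buckets go to test, the row layout, and the reading of the 10 days as calendar days (trading days would need a calendar lookup).

```python
import hashlib
from datetime import date, timedelta

def symbol_bucket(symbol, n_buckets=10):
    """Deterministic bucket from the MD5 of the ticker, stable across runs and machines."""
    digest = hashlib.md5(symbol.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def assign_split(symbol, test_buckets=frozenset({0, 1})):
    """Every row of a symbol lands in the same split, so no symbol spans both."""
    return "test" if symbol_bucket(symbol) in test_buckets else "train"

def purge_train(train_rows, test_start, test_end, embargo_days=10):
    """Drop training rows dated within embargo_days of the test window."""
    lo = test_start - timedelta(days=embargo_days)
    hi = test_end + timedelta(days=embargo_days)
    return [row for row in train_rows if not (lo <= row["date"] <= hi)]

# Toy training rows (hypothetical) around a March 2024 test window.
toy_train = [
    {"symbol": "AAA", "date": date(2024, 1, 1)},
    {"symbol": "AAA", "date": date(2024, 2, 25)},  # inside the pre-window purge zone
    {"symbol": "BBB", "date": date(2024, 3, 5)},   # inside the test window
    {"symbol": "BBB", "date": date(2024, 4, 20)},
]
kept = purge_train(toy_train, date(2024, 3, 1), date(2024, 3, 31))
print([row["date"].isoformat() for row in kept])
```

Hashing the symbol rather than sampling rows at random is what makes the split disjoint by construction: the assignment depends only on the ticker string.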
9,400 rows for delisted tickers were backfilled into the pattern library. Forward returns now include the subset of the past where companies did not survive. This is a conservative correction for survivorship bias, a common retrieval-side problem.
What didn’t — and why we published it anyway
Five independent regime filters (variance risk premium, VIX term structure, credit spread, yield curve, market breadth) layered on top of shape retrieval. Across 200 anchors × 6 modes, IQR shifts were below 0.4pp. Shape already captures regime implicitly.
Caveats: Bucketing was loose (±0.15 percentile). Tighter bucketing (±0.05) may restore a meaningful effect at the cost of higher variance. Filter stacking is untested.
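The bucketing trade-off can be illustrated with a small sketch. The `pct` field, the helper `regime_filter`, and the toy percentiles are hypothetical; only the ±0.15 and ±0.05 half-widths come from the text.

```python
def regime_filter(analogs, anchor_pct, half_width=0.15):
    """Keep analogs whose regime percentile is within half_width of the anchor's."""
    return [a for a in analogs if abs(a["pct"] - anchor_pct) <= half_width]

# Toy analogs (hypothetical regime percentiles in [0, 1]).
toy_analogs = [{"id": i, "pct": p}
               for i, p in enumerate([0.10, 0.42, 0.48, 0.53, 0.62, 0.90])]

loose = regime_filter(toy_analogs, anchor_pct=0.50, half_width=0.15)
tight = regime_filter(toy_analogs, anchor_pct=0.50, half_width=0.05)
print(len(loose), len(tight))
```

A tighter band matches the anchor's regime more closely but leaves fewer analogs, so any statistic computed on the survivors is noisier.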
One anchor suggested that earnings-window patterns underperform, with an effect of −3.65pp. Re-run across 100 anchors, the population effect shrinks to −0.52pp, and the paired test against a dividend placebo yields p = 0.08. The single-anchor result was an outlier, not a generalization.
The H3 sample was extended to 2020–2023 to include real COVID-era high-VIX anchors. The Q4 VIX bucket shows a −0.69pp paired diff versus −0.35pp for Q1. That is directionally consistent with the hypothesis, but no paired CI excludes zero at any VIX threshold.
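The text does not say which paired test was used. As one possible illustration, the sketch below runs an exact two-sided sign-flip permutation test on per-anchor differences (earnings-window outcome minus placebo outcome). The function name and the toy differences are assumptions, and exact enumeration is only practical for small samples.

```python
from itertools import product

def paired_sign_flip_pvalue(diffs):
    """Exact two-sided sign-flip p-value for the sum (equivalently mean) of paired diffs."""
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null, each paired difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Toy paired differences (hypothetical, in pp): three anchors, all negative.
toy_diffs = [-1, -2, -3]
p_value = paired_sign_flip_pvalue(toy_diffs)
print(p_value)
```

With only three anchors, even a unanimous direction cannot produce a small p-value, which is the same reason a single-anchor result says little about the population.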
Caveats: Extreme tail (VIX > 40) n = 14. Directionally clean, statistically underpowered.
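To make "underpowered" concrete, here is a rough confidence interval for a mean paired difference at n = 14. It uses a normal approximation, which is an assumption and is optimistic at this sample size (a t-interval would be wider); the toy differences are made up and are not the VIX > 40 data.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def paired_diff_ci(diffs, level=0.95):
    """Normal-approximation CI for the mean of paired differences."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = mean(diffs)
    half_width = z * stdev(diffs) / sqrt(len(diffs))
    return center - half_width, center + half_width

# Toy paired differences in pp (hypothetical): mean -0.5, sample stdev 2.0, n = 14.
toy_diffs = [-3.0, 2.0, -1.5, 1.0, -2.5, 1.5, -0.5,
             0.5, -4.0, 3.0, -1.0, 0.0, -2.0, -0.5]
ci_low, ci_high = paired_diff_ci(toy_diffs)
print(round(ci_low, 2), round(ci_high, 2))
```

The toy mean is negative, yet the interval still straddles zero: a directionally clean result that a sample this small cannot confirm.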
Distance ensemble between two retrieval spaces tested at α ∈ {0.3, 0.5, 0.7}. Best ensemble edges V5-alone by 0.2pp MAE at n = 20 anchors — within noise. V5-alone cleanly beats V2-alone. Intersection of top-500 lists typically shares < 5 pairs, so strict intersection ensembles are structurally unworkable.
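A sketch of what an α-weighted distance ensemble and a top-k intersection check might look like. Which space receives weight α, the assumption that both distances are on comparable scales, and the names `blend_distances` and `top_k_overlap` are all illustrative rather than taken from the eval.

```python
def blend_distances(d_a, d_b, alpha):
    """Convex blend alpha * d_a + (1 - alpha) * d_b over candidates present in both spaces."""
    shared = d_a.keys() & d_b.keys()
    return {key: alpha * d_a[key] + (1 - alpha) * d_b[key] for key in shared}

def top_k(distances, k):
    """Candidate keys with the k smallest distances."""
    return sorted(distances, key=distances.get)[:k]

def top_k_overlap(d_a, d_b, k):
    """Candidates that appear in the top-k list of both spaces."""
    return set(top_k(d_a, k)) & set(top_k(d_b, k))

# Toy distances (hypothetical) from one anchor to four candidates in two spaces.
space_a = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.8}
space_b = {"a": 0.7, "b": 0.3, "c": 0.1, "d": 0.2}

blended = blend_distances(space_a, space_b, alpha=0.5)
best = top_k(blended, 1)
overlap = top_k_overlap(space_a, space_b, k=2)
print(best, overlap)
```

The toy case mirrors the structural problem: the blend can still rank candidates, but when the two top-k lists barely overlap, a strict intersection leaves almost nothing to work with.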
Earnings-window underperformance stratified by GICS sector. Only three sectors cleared n ≥ 10 anchors after Bonferroni. All three paired CIs straddle zero. The aggregate effect is cross-sector, not concentrated.
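For reference, a Bonferroni correction divides the significance level by the number of tests. The sketch below applies it to made-up per-sector p-values; the sector labels, the p-values, and α = 0.05 are assumptions, not results from the eval.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant only if p <= alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}

# Toy p-values (hypothetical) for three sectors.
toy_p_values = {"sector_a": 0.03, "sector_b": 0.20, "sector_c": 0.45}
flags = bonferroni_significant(toy_p_values)
print(flags)
```

In the toy data, sector_a would pass an uncorrected 0.05 threshold but fails the corrected one of roughly 0.0167, which is the kind of false positive the correction guards against.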