How cohort distributions are built, calibrated, and audited.

Every number this API returns ships with a sample size and a calibrated confidence band. Below is the evaluation protocol we hold ourselves to — plus the specific places we’ve published honest-negative results so you can see the rigor on the inside.

Layer · Cohort intelligence

Beyond shape: the conditional analytics on top of retrieval.

Retrieval gives you a cohort. Cohort intelligence tells you which features inside the cohort separated winners from losers — regime stratification, per-feature importance, risk profile, and a narrative-change score that distinguishes priced-in news from genuine catalysts.

The same shape can have a 44pp win-rate spread between volatility regimes. Conditional analytics surface that information instead of averaging it away. For details, read the cohort intelligence page, or start with the canonical "what is cohort intelligence" page.

How to read this page

What we validate, and how we keep ourselves honest.

This page documents the validation discipline behind the live system — symbol-disjoint splits, calibration, conformal bands, same-symbol exclusion. It is the bar any future iteration has to clear before it ships.

Forward-looking architecture is in active research and the details stay private until release. See section 5.

1. Pipeline overview

a. Ingestion. Daily and minute bars across ~20K US equities, 2016 to present. Delisted tickers are backfilled (9,400 rows) so survivorship bias doesn’t silently inflate results.
b. Pattern representation. Each (symbol, date, timeframe) is encoded into a numerical pattern vector. The representation is intentionally not documented at an architectural level on public surfaces; what matters is that the same pattern input produces the same vector deterministically, and the similarity metric is magnitude-aware (not cosine).
c. Retrieval. At query time, we nearest-neighbor search across the indexed library. A cohort of the top-k matches is returned with complete forward-return history joined from precomputed caches.
d. Calibration. Raw retrieval quantiles systematically under-cover the true distribution (section 3 below). A split-conformal correction widens the bands so nominal 80% coverage empirically hits 80% on held-out data.
e. Metadata enrichment. Each anchor carries a multi-signal feature vector: market state (vol regime, yield curve, credit spread, macro composite), per-symbol context (relative volume, momentum, position vs ATH, sector RS, calendar proximity), and news intelligence (deduped article counts, transformer-scored sentiment, narrative-change score). Joined to the cohort at retrieval time.
f. Conditional analytics. Given the enriched cohort, the analytics layer returns the outcome distribution per horizon, a within-cohort logistic regression of features → win/loss with direction and CIs, regime stratification, and a risk profile. See /learn/intelligence for the full surface.
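
To illustrate why step (b) insists on a magnitude-aware metric, here is a minimal sketch. The production metric is not documented publicly, so Euclidean distance is used purely as a stand-in for "magnitude-aware", and `top_k`, the vectors, and the library keys are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Angle-only similarity: ignores how large the moves are."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Stand-in for a magnitude-aware metric (an assumption, not the real one)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, library, k):
    """Return the k closest (distance, key) pairs from a {key: vector} library."""
    scored = sorted((euclidean_distance(query, vec), key) for key, vec in library.items())
    return scored[:k]

# Two hypothetical pattern vectors with the same shape but 10x different size.
small_move = [0.01, -0.02, 0.03]
large_move = [0.10, -0.20, 0.30]

# Cosine treats them as essentially identical; Euclidean keeps them apart.
same_direction = cosine_similarity(small_move, large_move)
how_far_apart = euclidean_distance(small_move, large_move)
```

Under cosine, a 1% drift and a 10% move with the same outline would land in the same cohort. A magnitude-aware metric keeps them separate, which matters when the cohort's forward returns are read as a distribution.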

2. Evaluation protocol

The eval methodology is fixed in advance and re-used across every claim on this site. Three rules:

Symbol-disjoint splits. Train / validation / test splits are keyed on symbol (MD5-bucketed), not on date. This closes a leakage mode we caught and fixed: a prior evaluation reused the same symbol across splits and overstated accuracy by 2–3pp.
10-day purge / embargo. No anchor within 10 trading days of a test-set date contributes to training. Prevents near-date autocorrelation from sneaking in as “predictive” signal.
Same-symbol exclusion at retrieval. Matches on the anchor symbol within the prior N calendar days are dropped from the cohort. Stops “today’s NVDA matching yesterday’s NVDA” from looking like a real signal.
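
The three rules above can be sketched as follows. This is an illustration under stated assumptions, not the production code: the 80/10/10 bucket boundaries, the function names, the match-record layout, and the inclusive treatment of the 10-day boundary are all hypothetical, and `n_days` stands in for the unspecified N.

```python
import hashlib
from datetime import date, timedelta

def split_for_symbol(symbol, train_pct=80, val_pct=10):
    """Assign a symbol to a split by MD5 bucket, so every date for that
    symbol lands in the same split. Percentages are assumed values."""
    digest = hashlib.md5(symbol.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + val_pct:
        return "validation"
    return "test"

def purge_training_anchors(train_idx, test_idx, embargo=10):
    """Keep only training anchors more than `embargo` trading days away
    from every test date. Inputs are positions in a trading calendar."""
    return [i for i in train_idx
            if all(abs(i - t) > embargo for t in test_idx)]

def drop_same_symbol(matches, anchor_symbol, anchor_date, n_days):
    """Drop matches on the anchor's own symbol within the prior n_days
    calendar days (n_days is a placeholder for the unspecified N)."""
    cutoff = anchor_date - timedelta(days=n_days)
    return [m for m in matches
            if not (m["symbol"] == anchor_symbol
                    and cutoff <= m["date"] <= anchor_date)]
```

Hashing the symbol rather than sampling at random makes the assignment reproducible across runs and machines without storing a split table.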

3. Calibration — split conformal prediction

Nearest-neighbor retrieval returns a cohort of historical forward returns. Reading the p10 / p90 directly from that cohort gives you a band that systematically under-covers — neighbors are selected for shape similarity, not randomness, which shrinks the empirical variance. We measured this on our own data:

Band	Nominal coverage	Raw empirical	Calibrated empirical
[p10, p90], 5d	80%	~68%	~82%
[p10, p90], 10d	80%	~68%	~80%
[p25, p75], 5d	50%	~40%	~49%

The correction is split conformal for quantile regression (CQR-style):

1. Hold out a calibration set of anchors with known forward returns.
2. For each calibration anchor, compute the nonconformity score: max(p_lo − y, y − p_hi).
3. Take the ⌈(1 − α)(n + 1)⌉ / n empirical quantile of those scores, i.e. the ⌈(1 − α)(n + 1)⌉-th smallest of the n scores. That is the additive offset.
4. Calibrated band = [p_lo − offset, p_hi + offset]. On exchangeable data this guarantees marginal coverage of at least (1 − α).
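
The four steps above can be sketched in a few lines. This is a minimal illustration with hypothetical function names and toy inputs. It uses a single global offset, whereas the production system applies per-bucket offsets that are not published.

```python
import math

def conformal_offset(p_lo, p_hi, y, alpha):
    """Steps 2-3: score each calibration anchor, then take the
    ceil((1 - alpha) * (n + 1))-th smallest score as the offset."""
    scores = sorted(max(lo - obs, obs - hi)
                    for lo, hi, obs in zip(p_lo, p_hi, y))
    n = len(scores)
    rank = math.ceil((1 - alpha) * (n + 1))
    if rank > n:
        # Too few calibration points for this alpha: only an infinite band
        # carries the guarantee.
        return math.inf
    return scores[rank - 1]

def calibrate(lo, hi, offset):
    """Step 4: widen the raw band symmetrically by the offset."""
    return lo - offset, hi + offset

def coverage(bands, y):
    """Fraction of realized returns that fall inside their band."""
    hits = sum(1 for (lo, hi), obs in zip(bands, y) if lo <= obs <= hi)
    return hits / len(y)
```

A negative score means the realized return sat comfortably inside the raw band, and a positive score measures how far outside it landed. If raw bands already over-covered, the offset would come out negative and the calibrated band would narrow.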

Every /cohort response includes calibrated_return_pct alongside the raw return_pct so the caller can size off the honest band, not the shrunk one. The raw band is kept available for ranking use cases where absolute width matters less than relative ordering.

Conformal handles the band width. A separate rolling-90-day bias correction shifts the center of the predicted distribution when our recent picks have systematically over- or under-shot. Two layers, two jobs: conformal widens, bias correction shifts.
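
One way to picture the two layers working together is shown below. The page does not specify how the bias term is computed, so the trailing mean of (realized minus predicted) residuals, the calendar-day window, and all names here are assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import fmean

def rolling_bias(history, as_of, window_days=90):
    """Mean residual (realized - predicted center) over the trailing window.
    `history` is a list of (day, predicted_center, realized) tuples.
    Only outcomes dated strictly before `as_of` are used."""
    start = as_of - timedelta(days=window_days)
    residuals = [realized - predicted
                 for day, predicted, realized in history
                 if start <= day < as_of]
    return fmean(residuals) if residuals else 0.0

def apply_layers(p_lo, p_hi, offset, bias):
    """Conformal offset widens the band; bias correction shifts it."""
    return p_lo - offset + bias, p_hi + offset + bias
```

Keeping the two terms separate means a change in recent bias moves the band without altering its width, and a conformal refit changes the width without moving the center.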

4. Honest-negative results we’ve published

Publishing what didn’t work is as load-bearing as publishing what did. These results cleared the honest-negative gate:

Regime conditioning moves IQR by ≤ 0.37pp: five independent regime filters (VRP, VIX term, credit spread, yield curve, breadth) produce near-zero shift when layered on top of shape-similarity retrieval. Shape already captures regime implicitly.
Single-anchor findings lie: one anchor suggested earnings-window patterns underperform by −3.6pp. Across 100 anchors the effect shrinks to −0.5pp, and the paired test against a dividend placebo does not reach significance (p = 0.08). We published the gap.
Our own bands were mis-calibrated: nominal 80% bands covered ~68% empirically before we shipped the conformal correction. The audit is public; the fix is public.
Outcome head learns nothing from shape alone: a forward-return regression head added to the shape-only retrieval space yielded near-zero gain over the base rate (mean ~0.04–0.15pp on 5d, within noise). Conclusion: shape is for retrieval; outcome learning lives in the analytics layer above it, conditioned on multi-signal metadata. The architecture was redirected accordingly.

5. What this page deliberately will not tell you

A defensible methodology page teaches how the system is validated without teaching how to clone it. Out of scope on public surfaces:

— The specific embedding architecture (model family, parameter count, training recipe).
— The dimensionality of the vector space and the index structure.
— The exact per-bucket conformal offset values.
— The synthetic-data generation pipeline we use for certain training stages.
— Next-generation retrieval is in active research; no details until deliberate release.

Enterprise adopters with an NDA and real integration needs can request deeper technical diligence directly.

6. Limits & where we’re honest about uncertainty

— Training data overlaps the backtest period for some regimes. The as_of parameter closes retrieval-side leakage; it does not close model-training leakage.
— The forward-test corpus spans ~2 years. True tail regimes (VIX > 60) are under-represented.
— Conformal calibration assumes exchangeability. Extreme regime shifts can temporarily break the coverage guarantee; we refit periodically.
— Decomposition with small slice n (< 30) has CIs too wide to be actionable per-anchor; treat as exploration, not signal.
— Per-feature attribution from the conditional analytics layer is a within-cohort regression — coefficients are local to that shape and do not transport across cohorts.
— News features (deduped counts, sentiment, narrative-change score) are NULL pre-2024-01-01 and dropped from the regression for cohorts with > 50% missing. Coverage compounds with time.
— Narrative-change-score weights are starting points pending the meta-learning refit; threshold ranges are observed, not validated.
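
To make the small-slice limit above concrete, here is a sketch of how wide a win-rate interval is at n = 30. The page does not say which interval method the analytics layer uses, so the Wilson score interval here is an assumption chosen for illustration.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# A 50% win rate observed on 30 anchors versus 300 anchors.
small_slice = wilson_interval(15, 30)
large_slice = wilson_interval(150, 300)
```

With 15 wins out of 30, the 95% interval runs from roughly 33% to 67%, so a slice of that size cannot distinguish a weak edge from a coin flip. At n = 300 the same win rate gives an interval about a third as wide.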

Historical analysis, not investment advice. See /disclaimer.