Rolling Baselines & the Lookback Tradeoff
Every signal has a baseline behind it. The choice of how long to look back to define that baseline is one of the highest-leverage modeling decisions you will make. Too short and you measure noise; too long and you measure stale conditions. The sweet spot is empirical, not theoretical, and varies by signal type.
6.2.1 The Fundamental Tradeoff
Every rolling baseline balances two opposing pressures. A shorter window adapts to current conditions faster but uses less data, so the estimates of mean and standard deviation are noisier. A longer window uses more data and produces more stable estimates but adapts to regime change more slowly, so the estimates remain anchored to conditions that may no longer prevail. No window length wins on both axes. The choice is always a decision about which error you are more willing to tolerate.
The right way to frame the choice is to ask: what is the time scale of the phenomenon I am trying to measure? Vol clusters operate on a 10 to 30 day scale, so a vol baseline must be short enough to capture the cluster boundaries. Trend regimes operate on a 50 to 100 day scale, so a trend baseline can be longer. Macro regimes operate on a 200 to 500 day scale, so a regime baseline must be the longest. The window should match the natural time constant of the phenomenon, not the convenience of round numbers.
The empirical reality is that within a class of signals, the band of good windows is fairly tight. Walk-forward testing across multiple regimes typically shows a 30-day window beating a 10-day window for vol signals, and a 50-day window beating either 20 or 100 for short-term trend signals. The optimal window is rarely the shortest or longest; it lives in the empirically validated sweet spot for that signal type.
The baseline window is the lens through which the framework sees today. A bad lens distorts every signal that passes through it.
6.2.2 Short Windows: Fast But They Lie
A 10-day window is reactive. It catches regime change within days because the new data dominates the small sample. The cost is that it confuses single-week wobbles for new regimes. A 10-day mean of SPY returns can shift by 30 basis points on a single noisy week, even when the underlying drift has not changed at all. Sizing decisions that depend on a 10-day baseline will see far more “regime changes” than actually occurred.
The math behind this is direct. The standard error of the mean shrinks with the square root of sample size, so a 10-day mean has roughly three times the noise of a 90-day mean, even when both are estimating the same underlying parameter. For a metric with a daily standard deviation of 1%, the 10-day mean carries a standard error of approximately 0.32%; the 90-day mean, roughly 0.11%. The shorter window is genuinely noisier, and that noise propagates into every σ reading and every threshold check downstream.
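The square-root shrinkage is easy to verify by simulation. The sketch below draws many independent 10-day and 90-day samples from the same 1%-daily-sigma distribution and measures how much their sample means scatter; the figures and the 3:1 ratio are not tuned values, just the √(90/10) = 3 relationship made concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
daily_sigma = 0.01          # 1% daily standard deviation
n_trials = 20_000           # independent samples of each window length

# Draw many 10-day and 90-day samples from the same distribution
# and measure the scatter of their sample means.
short = rng.normal(0.0, daily_sigma, size=(n_trials, 10)).mean(axis=1)
long_ = rng.normal(0.0, daily_sigma, size=(n_trials, 90)).mean(axis=1)

se_short = short.std()      # approaches 0.01 / sqrt(10), about 0.32%
se_long = long_.std()       # approaches 0.01 / sqrt(90), about 0.11%
print(se_short, se_long, se_short / se_long)
```

The ratio of the two empirical standard errors converges to √9 = 3: the shorter window really does carry three times the noise around the same underlying mean.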
The use case for short windows is genuine high-frequency phenomena. Realized volatility computed on a 10-day window catches the start and end of vol clusters faster than a 90-day window will. Intraday flow imbalances on a 5-day window catch shifts in dealer positioning. The discipline is to use short windows only when the underlying phenomenon really does operate on that time scale, not because faster feels better.
6.2.3 Long Windows: Stable But They Lag
A 250-day window is the opposite. It uses more than a year of data to compute the mean and standard deviation, which makes the estimates very stable but also means a regime that started two months ago is still mostly being measured against pre-regime baseline conditions. A signal that flipped sign two months ago will still read as continuing the prior trend if the baseline is long enough.
The cost is concrete. When the underlying environment shifts, the long-window baseline takes weeks to months to fully adapt. During that adaptation period, the σ readings on the metric will systematically over- or under-estimate the actual unusualness of the print. A long-window baseline applied to a signal in a regime change will keep flagging “unusual” prints that are actually just the new regime’s normal — the framework loses information density during exactly the period where information density matters most.
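The lag is easy to demonstrate on synthetic data. The sketch below simulates a permanent mean shift and scores prints from the new regime against a 250-day baseline versus a 20-day baseline; the shift size and dates are illustrative, not calibrated to any real signal. Weeks after the shift, the long baseline is still scoring the new regime's ordinary prints as unusual while the short baseline has already adapted.

```python
import numpy as np

rng = np.random.default_rng(1)
# 500 days of a metric whose mean jumps from 0 to 3 at day 400,
# with unit daily noise: a clean, permanent regime shift.
x = np.concatenate([rng.normal(0, 1, 400), rng.normal(3, 1, 100)])

def z_score(series, t, window):
    # z of day t's print against a trailing baseline ending at t-1
    base = series[t - window:t]
    return (series[t] - base.mean()) / base.std()

days = range(420, 440)  # 20 to 40 days into the new regime
z_long = np.mean([z_score(x, t, 250) for t in days])
z_short = np.mean([z_score(x, t, 20) for t in days])

# The 250-day baseline, still dominated by pre-regime data, flags
# ordinary new-regime prints as roughly 2-sigma events; the 20-day
# baseline sits entirely inside the new regime and reads them as normal.
print(round(z_long, 2), round(z_short, 2))
```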
The use case for long windows is signals that are genuinely slow-moving. Macro regime classification, sector valuation spreads, and bond term premium all operate on multi-year time scales. A 90-day window applied to bond term premium would whip around with every quarterly Fed meeting; a 250-day or 500-day window correctly smooths that and reads the actual signal. The discipline is symmetric to the short-window case: use long windows only when the underlying phenomenon actually moves on that time scale.
6.2.4 Why 90 Days Is the AZTMM Default
The AZTMM 90-day default for most cross-sectional σ calculations is empirical, not theoretical. Walk-forward testing across the 2020-2026 period on the major signal classes (flow, dark pool, breadth, put/call) shows the 90-day window beating both 60-day and 120-day on a combined accuracy and stability metric across multiple regimes. The 90-day window is long enough to smooth the worst noise but short enough to adapt to genuine regime changes within roughly two to four weeks.
The other reason 90 days is robust is that it covers roughly one calendar quarter of trading. Most institutional positioning operates on a quarterly cycle (rebalance dates, earnings cycles, end-of-quarter mark-to-market), and a 90-day baseline captures one full cycle of that institutional rhythm. Shorter windows under-sample the cycle; longer windows blur multiple cycles together.
The default is not sacred. For specific signals, the framework uses different windows: 30-day for vol-of-vol, 50-day for trend filters, 250-day for regime baseline computations. The 90-day is the cross-sectional default for daily flow, dark pool, breadth, and put/call. When in doubt, 90 days is the right starting point, deviated from only with empirical justification.
| Signal Type | Window | Rationale |
|---|---|---|
| Realized vol, vol-of-vol | 30-day | vol clusters on 10–30 day scale |
| Trend filters (50DMA cross) | 50-day | matches trend signal natural period |
| Cross-sectional flow, dark pool, P/C | 90-day | quarterly institutional cycle |
| Breadth indicators | 90-day | matches sector rotation periodicity |
| HMM regime training, transitions | 250-day | regime persistence is multi-month |
| Macro / valuation spreads | 500-day | multi-year time constant |
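In practice, a matrix like the one above is worth encoding as an explicit lookup rather than a global constant. The sketch below is a hypothetical encoding; the signal names are illustrative, not the framework's actual identifiers. It fails loudly on unknown signals, which enforces the rule that every new signal needs an explicit window decision.

```python
# Hypothetical encoding of the window matrix above; signal names
# are illustrative, not the framework's actual identifiers.
BASELINE_WINDOWS = {
    "realized_vol": 30,
    "vol_of_vol": 30,
    "trend_filter": 50,
    "flow": 90,
    "dark_pool": 90,
    "put_call": 90,
    "breadth": 90,
    "regime_hmm": 250,
    "macro_valuation": 500,
}

def baseline_window(signal: str) -> int:
    # Fail loudly on unknown signals rather than falling back to a
    # global default: every new signal needs an explicit answer to
    # "what is the natural time constant of this phenomenon?"
    if signal not in BASELINE_WINDOWS:
        raise KeyError(f"no baseline window defined for {signal!r}")
    return BASELINE_WINDOWS[signal]
```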
6.2.5 Choosing Window Per Signal
The discipline is to never use a single global baseline window. The framework uses the matrix above, and any new signal added requires an explicit answer to “what is the natural time constant of this phenomenon?” before the baseline is set. The questions to ask in order: How quickly does the phenomenon itself change? How much noise is acceptable in the estimate? How important is fast adaptation versus stable measurement?
For most signals, the answer falls into one of three buckets. Fast-moving phenomena (vol, intraday flow) want 10 to 30 day windows. Mid-frequency phenomena (cross-sectional positioning, breadth, put/call) want 60 to 120 day windows; the AZTMM default of 90 days is in the middle of this range. Slow-moving phenomena (regime, valuation spreads, term premium) want 200 to 500 day windows. Outside these buckets, you are usually trying to use a window for a phenomenon it does not match, and the signal will degrade.
One subtle point: the baseline window for the mean and the baseline window for the standard deviation do not have to be identical. Some signals benefit from a longer mean window (more stable estimate of the average) and a shorter stdev window (faster adaptation to volatility regime change). The AZTMM default uses the same window for both because the simplicity is worth more than the marginal accuracy gain, but for production research, this decoupling can squeeze additional signal.
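The decoupling is a one-line change to a z-score computation. A minimal sketch, with the 90/30 split chosen for illustration rather than as tuned values: the mean baseline uses the longer window for stability, the dispersion baseline the shorter one for faster adaptation, and both exclude today's print.

```python
import numpy as np

def rolling_z(x, mean_window=90, std_window=30):
    """z-score of the latest print with decoupled baseline windows.

    The longer window stabilizes the mean estimate; the shorter one
    lets the dispersion estimate adapt to vol-regime change faster.
    Window values here are illustrative, not tuned defaults.
    """
    x = np.asarray(x, dtype=float)
    if len(x) <= mean_window:
        raise ValueError("need more history than the longest window")
    mu = x[-mean_window - 1:-1].mean()      # baseline excludes today
    sd = x[-std_window - 1:-1].std(ddof=1)  # shorter dispersion baseline
    return (x[-1] - mu) / sd
```

Setting `mean_window == std_window` recovers the AZTMM single-window default; the decoupled form is the production-research variant the text describes.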
6.2.6 Walk-Forward Testing
The empirical justification for any baseline window is walk-forward testing. The procedure is: pick a starting date, train on the prior N days, predict the next M days, compute the accuracy or signal-to-noise of the predictions, advance by the test length M so the out-of-sample segments tile the history without gaps or overlap, and repeat. The output is a series of out-of-sample performance numbers across multiple windows of different length. The window with the best out-of-sample numbers across the most regimes is the right default.
The discipline is to walk forward across at least two regimes, ideally three or four. A window that performs well only in Bull regimes is overfit to Bull; a window that performs well in both Bull and Crisis is genuinely robust. The 90-day default for cross-sectional signals was selected by walk-forward testing across 2020-2026 covering the COVID crash, the 2021 melt-up, the 2022 bear market, and multiple subsequent transitions. A window that wins across all four is the default; a window that wins only in two of four is regime-dependent and should not be used as a global default.
Walk-forward also protects against the most common sin in baseline selection: tuning the window to recent data. A window that looks great on 2024 data may be terrible on 2022 data. The walk-forward procedure forces the analyst to evaluate against periods that were not used in selection, which is the only honest test of robustness.
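The loop itself is short. The sketch below walks one baseline window forward through a series, scoring each out-of-sample segment by mean absolute z — a stand-in signal-to-noise proxy, not the actual AZTMM metric, which the text does not specify. Comparing the score series across candidate windows (and across regime subperiods) is the selection procedure described above.

```python
import numpy as np

def walk_forward_scores(series, window, horizon=10):
    """Out-of-sample score series for one baseline window (a sketch).

    Each fold: fit mean/std on the trailing `window` days, score the
    next `horizon` days as z-scores against that baseline, and record
    the mean absolute z. The scoring metric is a simple stand-in, not
    the framework's actual accuracy/stability measure.
    """
    series = np.asarray(series, dtype=float)
    scores = []
    t = window
    while t + horizon <= len(series):
        base = series[t - window:t]
        mu, sd = base.mean(), base.std()
        z = (series[t:t + horizon] - mu) / sd
        scores.append(np.abs(z).mean())
        t += horizon            # advance by the test length
    return np.array(scores)

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1000)
for w in (10, 30, 90):
    print(w, round(walk_forward_scores(x, w).mean(), 3))
```

Even on pure noise, the shorter windows tend to score higher mean |z|: their noisier baseline estimates inflate apparent unusualness, which is exactly the short-window lie described earlier.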
6.2.7 Common Mistakes
- Using a global window for all signals. Vol, flow, regime, and macro have different natural time scales.
- Using too-short windows for slow phenomena. A 10-day window on a regime signal is mostly noise.
- Using too-long windows for fast phenomena. A 250-day window on vol misses every cluster.
- Tuning the window on recent data. The window must validate across at least two regimes.
- Treating round numbers as natural windows. 30, 90, 250 happen to work; 50, 100, 200 are not magic.
- Failing to revalidate after a regime shift. A window that worked in 2020-2021 may not work in 2022-2026.
Key Takeaways
- Every baseline trades adaptation speed against estimate noise. There is no window that wins on both.
- Match the window to the time scale of the phenomenon: 30 for vol, 50 for trend, 90 for cross-sectional, 250+ for regime.
- The AZTMM 90-day default is empirically validated for cross-sectional signals, not theoretically derived.
- Walk-forward across multiple regimes is the only honest validation of a window choice.
- The mean-window and stdev-window do not have to be identical; advanced applications decouple them.
- A backtest without walk-forward is a hypothesis, not a result. Do not size against it.
