Rolling Baselines & the Lookback Tradeoff
Every signal has a baseline behind it. The choice of how long to look back to define that baseline is one of the highest-leverage modeling decisions you will make. Too short and you measure noise; too long and you measure stale conditions. The sweet spot is empirical, not theoretical, and varies by signal type.
6.2.1 The Fundamental Tradeoff
Every rolling baseline balances two opposing pressures. A shorter window adapts to current conditions faster but uses less data, so the estimates of mean and standard deviation are noisier. A longer window uses more data and produces more stable estimates but adapts to regime change more slowly, so the estimates remain anchored to conditions that may no longer prevail. No window length wins on both axes. The choice is always a decision about which error you are more willing to tolerate.
The right way to frame the choice is to ask: what is the time scale of the phenomenon I am trying to measure? Vol clusters operate on a 10 to 30 day scale, so a vol baseline must be short enough to capture the cluster boundaries. Trend regimes operate on a 50 to 100 day scale, so a trend baseline can be longer. Macro regimes operate on a 200 to 500 day scale, so a regime baseline must be the longest. The window should match the natural time constant of the phenomenon, not the convenience of round numbers.
The empirical reality is that within a class of signals, the band of good windows is fairly tight. Walk-forward testing across multiple regimes typically shows a 30-day window beating a 10-day window for vol signals, and a 50-day window beating either 20 or 100 for short-term trend signals. The optimal window is rarely the shortest or longest; it lives in the empirically validated sweet spot for that signal type.
The baseline window is the lens through which the framework sees today. A bad lens distorts every signal that passes through it.
6.2.2 Short Windows: Fast But They Lie
A 10-day window is reactive. It catches regime change within days because the new data dominates the small sample. The cost is that it confuses single-week wobbles for new regimes. A 10-day mean of SPY returns can shift by 30 basis points on a single noisy week, even when the underlying drift has not changed at all. Sizing decisions that depend on a 10-day baseline will see far more “regime changes” than actually occurred.
The math behind this is direct. The standard error of the mean shrinks with the square root of sample size, so a 10-day mean has roughly three times the noise of a 90-day mean, even when both are estimating the same underlying parameter. For a metric with a daily standard deviation of 1%, the 10-day mean carries a standard error of approximately 0.32%; the 90-day mean, roughly 0.11%. The shorter window is genuinely noisier, and that noise propagates into every σ reading and every threshold check downstream.
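The square-root shrinkage is easy to verify by simulation. The sketch below draws many independent 10-day and 90-day samples from the same 1%-daily-sigma distribution and measures how much their sample means scatter; the figures and the 3:1 ratio are not tuned values, just the √(90/10) = 3 relationship made concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
daily_sigma = 0.01          # 1% daily standard deviation
n_trials = 20_000           # independent samples of each window length

# Draw many 10-day and 90-day samples from the same distribution
# and measure the scatter of their sample means.
short = rng.normal(0.0, daily_sigma, size=(n_trials, 10)).mean(axis=1)
long_ = rng.normal(0.0, daily_sigma, size=(n_trials, 90)).mean(axis=1)

se_short = short.std()      # approaches 0.01 / sqrt(10), about 0.32%
se_long = long_.std()       # approaches 0.01 / sqrt(90), about 0.11%
print(se_short, se_long, se_short / se_long)
```

The ratio of the two empirical standard errors converges to √9 = 3: the shorter window really does carry three times the noise around the same underlying mean.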
The use case for short windows is genuine high-frequency phenomena. Realized volatility computed on a 10-day window catches the start and end of vol clusters faster than a 90-day window will. Intraday flow imbalances on a 5-day window catch shifts in dealer positioning. The discipline is to use short windows only when the underlying phenomenon really does operate on that time scale, not because faster feels better.
6.2.3 Long Windows: Stable But They Lag
A 250-day window is the opposite. It uses more than a year of data to compute the mean and standard deviation, which makes the estimates very stable but also means a regime that started two months ago is still mostly being measured against pre-regime baseline conditions. A signal that flipped sign two months ago will still read as continuing the prior trend if the baseline is long enough.
The cost is concrete. When the underlying environment shifts, the long-window baseline takes weeks to months to fully adapt. During that adaptation period, the σ readings on the metric will systematically over- or under-estimate the actual unusualness of the print. A long-window baseline applied to a signal in a regime change will keep flagging “unusual” prints that are actually just the new regime’s normal — the framework loses information density during exactly the period where information density matters most.
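The lag is easy to demonstrate on synthetic data. The sketch below simulates a permanent mean shift and scores prints from the new regime against a 250-day baseline versus a 20-day baseline; the shift size and dates are illustrative, not calibrated to any real signal. Weeks after the shift, the long baseline is still scoring the new regime's ordinary prints as unusual while the short baseline has already adapted.

```python
import numpy as np

rng = np.random.default_rng(1)
# 500 days of a metric whose mean jumps from 0 to 3 at day 400,
# with unit daily noise: a clean, permanent regime shift.
x = np.concatenate([rng.normal(0, 1, 400), rng.normal(3, 1, 100)])

def z_score(series, t, window):
    # z of day t's print against a trailing baseline ending at t-1
    base = series[t - window:t]
    return (series[t] - base.mean()) / base.std()

days = range(420, 440)  # 20 to 40 days into the new regime
z_long = np.mean([z_score(x, t, 250) for t in days])
z_short = np.mean([z_score(x, t, 20) for t in days])

# The 250-day baseline, still dominated by pre-regime data, flags
# ordinary new-regime prints as roughly 2-sigma events; the 20-day
# baseline sits entirely inside the new regime and reads them as normal.
print(round(z_long, 2), round(z_short, 2))
```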
The use case for long windows is signals that are genuinely slow-moving. Macro regime classification, sector valuation spreads, and bond term premium all operate on multi-year time scales. A 90-day window applied to bond term premium would whip around with every quarterly Fed meeting; a 250-day or 500-day window correctly smooths that and reads the actual signal. The discipline is symmetric to the short-window case: use long windows only when the underlying phenomenon actually moves on that time scale.
6.2.4 Why 90 Days Is the AZTMM Default
The AZTMM 90-day default for most cross-sectional σ calculations is empirical, not theoretical. Walk-forward testing across the 2020-2026 period on the major signal classes (flow, dark pool, breadth, put/call) shows the 90-day window beating both 60-day and 120-day on a combined accuracy and stability metric across multiple regimes. The 90-day window is long enough to smooth the worst noise but short enough to adapt to genuine regime changes within roughly two to four weeks.
The other reason 90 days is robust is that it covers roughly one calendar quarter of trading. Most institutional positioning operates on a quarterly cycle (rebalance dates, earnings cycles, end-of-quarter mark-to-market), and a 90-day baseline captures one full cycle of that institutional rhythm. Shorter windows under-sample the cycle; longer windows blur multiple cycles together.
The default is not sacred. For specific signals, the framework uses different windows: 30-day for vol-of-vol, 50-day for trend filters, 250-day for regime baseline computations. The 90-day is the cross-sectional default for daily flow, dark pool, breadth, and put/call. When in doubt, 90 days is the right starting point, deviated from only with empirical justification.
| Signal Type | Window | Rationale |
|---|---|---|
| Realized vol, vol-of-vol | 30-day | vol clusters on 10–30 day scale |
| Trend filters (50DMA cross) | 50-day | matches trend signal natural period |
| Cross-sectional flow, dark pool, P/C | 90-day | quarterly institutional cycle |
| Breadth indicators | 90-day | matches sector rotation periodicity |
| HMM regime training, transitions | 250-day | regime persistence is multi-month |
| Macro / valuation spreads | 500-day | multi-year time constant |
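In practice, a matrix like the one above is worth encoding as an explicit lookup rather than a global constant. The sketch below is a hypothetical encoding; the signal names are illustrative, not the framework's actual identifiers. It fails loudly on unknown signals, which enforces the rule that every new signal needs an explicit window decision.

```python
# Hypothetical encoding of the window matrix above; signal names
# are illustrative, not the framework's actual identifiers.
BASELINE_WINDOWS = {
    "realized_vol": 30,
    "vol_of_vol": 30,
    "trend_filter": 50,
    "flow": 90,
    "dark_pool": 90,
    "put_call": 90,
    "breadth": 90,
    "regime_hmm": 250,
    "macro_valuation": 500,
}

def baseline_window(signal: str) -> int:
    # Fail loudly on unknown signals rather than falling back to a
    # global default: every new signal needs an explicit answer to
    # "what is the natural time constant of this phenomenon?"
    if signal not in BASELINE_WINDOWS:
        raise KeyError(f"no baseline window defined for {signal!r}")
    return BASELINE_WINDOWS[signal]
```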
6.2.5 Choosing Window Per Signal
The discipline is to never use a single global baseline window. The framework uses the matrix above, and any new signal added requires an explicit answer to “what is the natural time constant of this phenomenon?” before the baseline is set. The questions to ask in order: How quickly does the phenomenon itself change? How much noise is acceptable in the estimate? How important is fast adaptation versus stable measurement?
For most signals, the answer falls into one of three buckets. Fast-moving phenomena (vol, intraday flow) want 10 to 30 day windows. Mid-frequency phenomena (cross-sectional positioning, breadth, put/call) want 60 to 120 day windows; the AZTMM default of 90 days is in the middle of this range. Slow-moving phenomena (regime, valuation spreads, term premium) want 200 to 500 day windows. Outside these buckets, you are usually trying to use a window for a phenomenon it does not match, and the signal will degrade.
One subtle point: the baseline window for the mean and the baseline window for the standard deviation do not have to be identical. Some signals benefit from a longer mean window (more stable estimate of the average) and a shorter stdev window (faster adaptation to volatility regime change). The AZTMM default uses the same window for both because the simplicity is worth more than the marginal accuracy gain, but for production research, this decoupling can squeeze additional signal.
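The decoupling is a one-line change to a z-score computation. A minimal sketch, with the 90/30 split chosen for illustration rather than as tuned values: the mean baseline uses the longer window for stability, the dispersion baseline the shorter one for faster adaptation, and both exclude today's print.

```python
import numpy as np

def rolling_z(x, mean_window=90, std_window=30):
    """z-score of the latest print with decoupled baseline windows.

    The longer window stabilizes the mean estimate; the shorter one
    lets the dispersion estimate adapt to vol-regime change faster.
    Window values here are illustrative, not tuned defaults.
    """
    x = np.asarray(x, dtype=float)
    if len(x) <= mean_window:
        raise ValueError("need more history than the longest window")
    mu = x[-mean_window - 1:-1].mean()      # baseline excludes today
    sd = x[-std_window - 1:-1].std(ddof=1)  # shorter dispersion baseline
    return (x[-1] - mu) / sd
```

Setting `mean_window == std_window` recovers the AZTMM single-window default; the decoupled form is the production-research variant the text describes.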
6.2.6 Walk-Forward Testing
The empirical justification for any baseline window is walk-forward testing. The procedure is: pick a starting date, train on the prior N days, predict the next M days, compute the accuracy or signal-to-noise of the predictions, advance by the test length M so the out-of-sample segments tile the history without gaps or overlap, and repeat. The output is a series of out-of-sample performance numbers across multiple windows of different length. The window with the best out-of-sample numbers across the most regimes is the right default.
The discipline is to walk forward across at least two regimes, ideally three or four. A window that performs well only in Bull regimes is overfit to Bull; a window that performs well in both Bull and Crisis is genuinely robust. The 90-day default for cross-sectional signals was selected by walk-forward testing across 2020-2026 covering the COVID crash, the 2021 melt-up, the 2022 bear market, and multiple subsequent transitions. A window that wins across all four is the default; a window that wins only in two of four is regime-dependent and should not be used as a global default.
Walk-forward also protects against the most common sin in baseline selection: tuning the window to recent data. A window that looks great on 2024 data may be terrible on 2022 data. The walk-forward procedure forces the analyst to evaluate against periods that were not used in selection, which is the only honest test of robustness.
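The loop itself is short. The sketch below walks one baseline window forward through a series, scoring each out-of-sample segment by mean absolute z — a stand-in signal-to-noise proxy, not the actual AZTMM metric, which the text does not specify. Comparing the score series across candidate windows (and across regime subperiods) is the selection procedure described above.

```python
import numpy as np

def walk_forward_scores(series, window, horizon=10):
    """Out-of-sample score series for one baseline window (a sketch).

    Each fold: fit mean/std on the trailing `window` days, score the
    next `horizon` days as z-scores against that baseline, and record
    the mean absolute z. The scoring metric is a simple stand-in, not
    the framework's actual accuracy/stability measure.
    """
    series = np.asarray(series, dtype=float)
    scores = []
    t = window
    while t + horizon <= len(series):
        base = series[t - window:t]
        mu, sd = base.mean(), base.std()
        z = (series[t:t + horizon] - mu) / sd
        scores.append(np.abs(z).mean())
        t += horizon            # advance by the test length
    return np.array(scores)

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1000)
for w in (10, 30, 90):
    print(w, round(walk_forward_scores(x, w).mean(), 3))
```

Even on pure noise, the shorter windows tend to score higher mean |z|: their noisier baseline estimates inflate apparent unusualness, which is exactly the short-window lie described earlier.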
6.2.7 Common Mistakes
- Using a global window for all signals. Vol, flow, regime, and macro have different natural time scales.
- Using too-short windows for slow phenomena. A 10-day window on a regime signal is mostly noise.
- Using too-long windows for fast phenomena. A 250-day window on vol misses every cluster.
- Tuning the window on recent data. The window must validate across at least two regimes.
- Treating round numbers as natural windows. 30, 90, 250 happen to work; 50, 100, 200 are not magic.
- Failing to revalidate after a regime shift. A window that worked in 2020-2021 may not work in 2022-2026.
Key Takeaways
- Every baseline trades adaptation speed against estimate noise. There is no window that wins on both.
- Match the window to the time scale of the phenomenon: 30 for vol, 50 for trend, 90 for cross-sectional, 250+ for regime.
- The AZTMM 90-day default is empirically validated for cross-sectional signals, not theoretically derived.
- Walk-forward across multiple regimes is the only honest validation of a window choice.
- The mean-window and stdev-window do not have to be identical; advanced applications decouple them.
- A backtest without walk-forward is a hypothesis, not a result. Do not size against it.
