What is temporal awareness?

Large language models constantly make decisions that play out over time. Answer now, or think longer? Optimize for this turn, or the whole conversation? Take the sure thing today, or the bigger payoff later?

Temporal awareness is how a model internally represents these time horizons — whether it’s leaning short-term or long-term — and whether that internal sense matches what it actually says out loud.

Internal vs. stated horizon

Here’s the key distinction this project is built around: the internal horizon is the time scale a model’s internal state is oriented toward, and the stated horizon is the time scale its words express.

The two don’t always agree. A model might say something short-term and harmless while its internal state is oriented toward long-horizon planning, or vice versa. That gap is the safety-relevant signal. If we can measure it, we can notice when a model’s words and its internal “intentions” come apart.

Formally, the project grounds this in intertemporal preference: a value function that weighs a reward by how far away in time it is, and an “internal horizon” — the point where the model stops caring about the future. The question is whether that internal horizon matches the one the model states. See the research program.
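To make the idea of a time-weighted value function concrete, here is a minimal sketch. It assumes exponential discounting with a hard cutoff at the horizon; the project's actual functional form is not stated here, and the names `discounted_value`, `prefers_later`, `discount_rate`, and `horizon` are hypothetical, chosen only for illustration.

```python
import math

def discounted_value(reward, delay, discount_rate, horizon):
    """Value today of `reward` arriving after `delay` time steps.

    Assumptions (illustrative, not the project's definition): exponential
    discounting, plus a hard cutoff at `horizon`, beyond which a reward
    counts for nothing.
    """
    if delay > horizon:
        return 0.0
    return reward * math.exp(-discount_rate * delay)

def prefers_later(small_now, large_later, delay, discount_rate, horizon):
    """True if the delayed reward is worth more today than the immediate one."""
    return discounted_value(large_later, delay, discount_rate, horizon) > small_now

# The same choice (10 now vs. 15 after 5 steps) under two hypothetical horizons.
short_horizon_choice = prefers_later(10.0, 15.0, delay=5, discount_rate=0.05, horizon=3)
long_horizon_choice = prefers_later(10.0, 15.0, delay=5, discount_rate=0.05, horizon=20)
print(short_horizon_choice, long_horizon_choice)
```

Under these assumptions, an agent whose horizon ends before the payoff arrives takes the sure thing, while one with a longer horizon waits. If a stated preference implied one horizon and the internal state another, that would be the kind of mismatch described above.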

Why it matters for AI safety

Three reasons this is more than a curiosity:

  1. Agents now run for a long time. As models are deployed in long, autonomous loops — dozens of conversational turns, hundreds of documents — their behavior can drift. The project finds that safety behavior in particular can degrade over repetitive sessions, with some safeguards weakening faster than others.

  2. Watching words alone is not enough. If we only monitor what a model says, we miss internal shifts until they surface as bad outputs. Reading the internal time horizon gives us an earlier signal.

  3. It’s measurable today. A model’s temporal scope turns out to be linearly readable from a single layer of activations. That makes a cheap, real-time probe possible — the kind of concrete, deployable monitor that safety work needs.
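As an illustration of what "linearly readable from a single layer" can mean in practice, the sketch below trains a logistic-regression probe on synthetic vectors. Everything in it is an assumption for demonstration: the activations are random stand-ins rather than real hidden states, the "long-term" direction is planted by hand, and the helper names are hypothetical. It is not the project's probe.

```python
import math
import random

def make_activations(n, direction, rng):
    """Synthetic stand-in for one layer's activations (not real model data).

    Label 1 ("long-term") shifts the vector along `direction`;
    label 0 ("short-term") shifts it the opposite way.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(0.0, 1.0) + sign * d for d in direction]
        data.append((vec, label))
    return data

def sigmoid(z):
    # Split by sign so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(vec, w, b):
    """Linear read-out: a single dot product plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, vec)) + b

def train_probe(data, dim, lr=0.1, epochs=50):
    """Logistic regression fitted by plain stochastic gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vec, label in data:
            err = sigmoid(score(vec, w, b)) - label
            w = [wi - lr * err * xi for wi, xi in zip(w, vec)]
            b -= lr * err
    return w, b

def accuracy(data, w, b):
    hits = sum((score(vec, w, b) > 0) == (label == 1) for vec, label in data)
    return hits / len(data)

rng = random.Random(0)
dim = 16
# Hypothetical "temporal scope" direction: only the first 4 coordinates carry signal.
direction = [1.0 if i < 4 else 0.0 for i in range(dim)]
train_set = make_activations(400, direction, rng)
held_out = make_activations(200, direction, rng)

w, b = train_probe(train_set, dim)
print(f"held-out accuracy: {accuracy(held_out, w, b):.2f}")
```

Once trained, the probe costs one dot product per example at inference time, which is what makes a real-time monitor plausible. Whether real activations separate this cleanly is an empirical question that the synthetic data here cannot answer.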

This isn’t a lone bet

Reading a safety-relevant property straight from a model’s activations is already a working idea, not a hope. Independent researchers have used simple linear probes to spot hidden “sleeper-agent” backdoors, and to detect almost perfectly (detection scores up to 0.999) when a 70-billion-parameter model is being strategically deceptive, for example by concealing insider trading or quietly underperforming on a safety test. We’re pointing that same well-tested playbook at a new target: a model’s sense of time.

What this project actually does

Early signs, honestly held

Two early findings hint at why this is worth doing — and one keeps us honest:

See the verified numbers on What we score, poke at the geometry on Explore, then pick an issue and join in.