How This Works

Testing methodology for continual learning research

What We're Studying

Continual learning (also called lifelong learning) addresses a fundamental problem: when neural networks learn new tasks, they tend to forget previously learned ones. This is called catastrophic forgetting.

We're researching methods to prevent forgetting in small transformer models (<10M parameters), aiming for techniques that are practical and efficient.

Testing Framework

Two-Stage Validation

We use a two-stage approach to balance speed with rigor:

Stage 1: DISCOVERY (fast)
• Synthetic conflict sequences
• 1 seed, ~10-15s per run
• Try many ideas quickly
• Controlled interference

Stage 2: VALIDATION (thorough)
• Permuted/Split MNIST
• 3-5 seeds, 30s-2min per run
• Prove ideas actually work
• Real benchmark performance

Most ideas are tested in Stage 1. Only promising results graduate to Stage 2.

Task Suites

Suite                          Time    Use Case
Synthetic Conflict Sequences   ~10s    Stage 1: Transformer-oriented, controlled interference
Permuted MNIST                 ~30s    Stage 2: Proven benchmark, attention conflicts
Split MNIST                    ~1min   Stage 2: Standard benchmark

Synthetic Conflict Sequences are designed specifically to stress transformer attention and embedding layers, with controlled interference patterns (cue conflicts, key-value remapping, rule switches).
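
To make the interference pattern concrete, here is a minimal sketch (Python/NumPy, illustrative only, not the benchmark's actual generator) of how a cue-conflict sequence could be constructed: consecutive tasks flip the mapping from cue token to label, so the tasks directly contradict each other.

import numpy as np

def make_cue_conflict_tasks(n_tasks=5, n_examples=512, seq_len=16, vocab=32, seed=0):
    # Toy generator for cue-conflict task sequences (an illustrative assumption,
    # not the project's benchmark code).
    rng = np.random.default_rng(seed)
    cue_tokens = np.array([0, 1])        # two reserved cue tokens
    tasks = []
    for t in range(n_tasks):
        x = rng.integers(2, vocab, size=(n_examples, seq_len))   # filler tokens
        cues = rng.integers(0, 2, size=n_examples)               # which cue appears
        pos = rng.integers(0, seq_len, size=n_examples)          # where it appears
        x[np.arange(n_examples), pos] = cue_tokens[cues]
        # Flip the cue -> label rule on every task: 100% opposite labels
        y = cues if t % 2 == 0 else 1 - cues
        tasks.append((x, y))
    return tasks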

Gradient Conflict Validation

Before running experiments, we validate that our benchmarks create real gradient conflict. This ensures we're measuring actual catastrophic forgetting, not just different tasks.

mean_cosine: -0.08
conflict_score: 0.6

mean_cosine < 0 means task gradients point in opposite directions (conflict). conflict_score > 0.5 means the majority of task pairs interfere. Our cue_conflict benchmark has 100% opposite labels between consecutive tasks.
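
As an illustration of the check, the two statistics could be computed like this (a sketch assuming one flattened gradient vector per task; the function and variable names are ours, not the project's):

import itertools
import numpy as np

def gradient_conflict_stats(task_grads):
    # task_grads: list of flattened gradient vectors, one per task.
    # mean_cosine    : average cosine similarity over all task pairs
    #                  (negative => gradients point in opposing directions)
    # conflict_score : fraction of task pairs with negative cosine similarity
    cosines = []
    for g1, g2 in itertools.combinations(task_grads, 2):
        cos = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12)
        cosines.append(cos)
    cosines = np.array(cosines)
    return {"mean_cosine": float(cosines.mean()),
            "conflict_score": float((cosines < 0).mean())}

For example, with five tasks there are ten task pairs, so a conflict_score of 0.6 would mean six of the ten pairs have opposing gradients.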

Experiment Tiers

Tier       Max Time   Seeds   When to Use
Quick      30s        1       Debugging, sanity checks
Standard   5min       3       Real experiments
Full       15min      5+      Validating breakthroughs

No experiment should take hours. If a run is too slow, we shrink the model or the dataset.
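
One way the tiers might be expressed in code (a hypothetical configuration mirroring the table above; the keys and helper function are illustrative, not the project's actual setup):

# Hypothetical encoding of the tiers above; names and structure are illustrative.
TIERS = {
    "quick":    {"max_time_s": 30,      "seeds": 1},   # debugging, sanity checks
    "standard": {"max_time_s": 5 * 60,  "seeds": 3},   # real experiments
    "full":     {"max_time_s": 15 * 60, "seeds": 5},   # validating breakthroughs (5+ seeds)
}

def within_budget(tier, elapsed_s):
    # If a run blows past its tier's budget, shrink the model or the dataset.
    return elapsed_s <= TIERS[tier]["max_time_s"]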

Key Metrics

Retention: mean accuracy on all tasks
Forgetting: average drop after learning new tasks

Retention is the primary metric. After training on all tasks sequentially, we measure accuracy on each task. Retention = average of these accuracies.

Forgetting measures how much accuracy drops on earlier tasks after learning later ones. Lower is better.

How Experiments Run

One model is trained sequentially on all tasks; after each training stage it is evaluated on every task seen so far:

Train on Task 1 → eval: T1=95%
Train on Task 2 → eval: T1=82%, T2=93%
Train on Task 3 → eval: T1=75%, T2=85%, T3=94%
Train on Task 4 → eval: T1=70%, T2=80%, T3=87%, T4=95%
Train on Task 5 → eval: T1=68%, T2=77%, T3=83%, T4=89%, T5=96%

Retention = mean(68, 77, 83, 89, 96) = 82.6%
Forgetting = mean drop from best = ((95-68) + (93-77) + (94-83) + (95-89)) / 4 = 15%
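
Both metrics follow directly from the accuracy matrix above. A short sketch of the arithmetic (the matrix layout, with rows as training stages and columns as tasks, is our own convention, not the project's code):

import numpy as np

# acc[i, j] = accuracy on task j measured right after training on task i
# (NaN where task j has not been trained yet); values from the example above.
acc = np.array([
    [95, np.nan, np.nan, np.nan, np.nan],
    [82, 93,     np.nan, np.nan, np.nan],
    [75, 85,     94,     np.nan, np.nan],
    [70, 80,     87,     95,     np.nan],
    [68, 77,     83,     89,     96],
])

retention = np.nanmean(acc[-1])                    # mean final accuracy -> 82.6
best = np.nanmax(acc[:, :-1], axis=0)              # best accuracy each earlier task ever reached
forgetting = float(np.mean(best - acc[-1, :-1]))   # mean drop from best -> 15.0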

Breakthrough Requirements

To claim a real breakthrough (not just a lucky run), we require:

Methods We Test

Method       Approach                                                            Retention   Forgetting
Baseline     Naive fine-tuning (no protection)                                   51%         20%
EWC-1000     Elastic Weight Consolidation - penalize important weight changes    58%         39%
Replay-200   Store and replay examples from previous tasks                       57%         23%
Hybrid       EWC + Replay combined                                               -           -

Research goal: Beat replay on retention while maintaining low forgetting. Note: 50% retention = random chance on binary classification.
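
For reference, the EWC penalty in the table above amounts to a quadratic constraint on weights that mattered for earlier tasks. A minimal sketch, assuming PyTorch and omitting the Fisher-information estimation (the function and argument names are ours, not the project's implementation):

import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    # Penalize moving weights that were important for previous tasks
    # (high Fisher information) away from their previously learned values.
    # lam=1000 corresponds to the EWC-1000 configuration above.
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During training on a new task:
#   loss = task_loss + ewc_penalty(model, fisher, old_params)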

Research is conducted autonomously and results sync to this dashboard every 5 minutes.