How This Works

Testing methodology for continual learning research

What We're Studying

Continual learning (also called lifelong learning) addresses a fundamental problem: when neural networks learn new tasks, they tend to forget previously learned ones. This is called catastrophic forgetting.

We're researching methods to prevent forgetting in small transformer models (<10M parameters), aiming for techniques that are practical and efficient.

Testing Framework

Two-Stage Validation

We use a two-stage approach to balance speed with rigor:

Stage 1: DISCOVERY (fast)
• Synthetic conflict sequences
• 1 seed, ~10-15s per run
• Try many ideas quickly
• Controlled interference

Stage 2: VALIDATION (thorough)
• Permuted/Split MNIST
• 3-5 seeds, 30s-2min per run
• Prove ideas actually work
• Real benchmark performance

Most ideas are tested in Stage 1. Only promising results graduate to Stage 2.

Task Suites

Suite                          Time    Use Case
Synthetic Conflict Sequences   ~10s    Stage 1: Transformer-oriented, controlled interference
Permuted MNIST                 ~30s    Stage 2: Proven benchmark, attention conflicts
Split MNIST                    ~1min   Stage 2: Standard benchmark

Synthetic Conflict Sequences are designed specifically to stress transformer attention and embedding layers, with controlled interference patterns (cue conflicts, key-value remapping, rule switches).
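
To make the interference pattern concrete, here is a minimal sketch (Python/NumPy, illustrative only, not the benchmark's actual generator) of how a cue-conflict sequence could be constructed: consecutive tasks flip the mapping from cue token to label, so the tasks directly contradict each other.

import numpy as np

def make_cue_conflict_tasks(n_tasks=5, n_examples=512, seq_len=16, vocab=32, seed=0):
    # Toy generator for cue-conflict task sequences (an illustrative assumption,
    # not the project's benchmark code).
    rng = np.random.default_rng(seed)
    cue_tokens = np.array([0, 1])        # two reserved cue tokens
    tasks = []
    for t in range(n_tasks):
        x = rng.integers(2, vocab, size=(n_examples, seq_len))   # filler tokens
        cues = rng.integers(0, 2, size=n_examples)               # which cue appears
        pos = rng.integers(0, seq_len, size=n_examples)          # where it appears
        x[np.arange(n_examples), pos] = cue_tokens[cues]
        # Flip the cue -> label rule on every task: 100% opposite labels
        y = cues if t % 2 == 0 else 1 - cues
        tasks.append((x, y))
    return tasks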

Gradient Conflict Validation

Before running experiments, we validate that our benchmarks create real gradient conflict. This ensures we're measuring actual catastrophic forgetting, not just different tasks.

mean_cosine: -0.08
conflict_score: 0.6

mean_cosine < 0 means task gradients point in opposite directions (conflict). conflict_score > 0.5 means the majority of task pairs interfere. Our cue_conflict benchmark has 100% opposite labels between consecutive tasks.
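
As an illustration of the check, the two statistics could be computed like this (a sketch assuming one flattened gradient vector per task; the function and variable names are ours, not the project's):

import itertools
import numpy as np

def gradient_conflict_stats(task_grads):
    # task_grads: list of flattened gradient vectors, one per task.
    # mean_cosine    : average cosine similarity over all task pairs
    #                  (negative => gradients point in opposing directions)
    # conflict_score : fraction of task pairs with negative cosine similarity
    cosines = []
    for g1, g2 in itertools.combinations(task_grads, 2):
        cos = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12)
        cosines.append(cos)
    cosines = np.array(cosines)
    return {"mean_cosine": float(cosines.mean()),
            "conflict_score": float((cosines < 0).mean())}

For example, with five tasks there are ten task pairs, so a conflict_score of 0.6 would mean six of the ten pairs have opposing gradients.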

Experiment Tiers

Tier       Max Time   Seeds   When to Use
Quick      30s        1       Debugging, sanity checks
Standard   5min       3       Real experiments
Full       15min      5+      Validating breakthroughs

No experiment should take hours. If a run is too slow, we shrink the model or the dataset.
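
One way the tiers might be expressed in code (a hypothetical configuration mirroring the table above; the keys and helper function are illustrative, not the project's actual setup):

# Hypothetical encoding of the tiers above; names and structure are illustrative.
TIERS = {
    "quick":    {"max_time_s": 30,      "seeds": 1},   # debugging, sanity checks
    "standard": {"max_time_s": 5 * 60,  "seeds": 3},   # real experiments
    "full":     {"max_time_s": 15 * 60, "seeds": 5},   # validating breakthroughs (5+ seeds)
}

def within_budget(tier, elapsed_s):
    # If a run blows past its tier's budget, shrink the model or the dataset.
    return elapsed_s <= TIERS[tier]["max_time_s"]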

Key Metrics

Retention: mean accuracy on all tasks
Forgetting: average drop after learning new tasks

Retention is the primary metric. After training on all tasks sequentially, we measure accuracy on each task. Retention = average of these accuracies.

Forgetting measures how much accuracy drops on earlier tasks after learning later ones. Lower is better.

How Experiments Run

One model is trained sequentially on all tasks; after each training stage it is evaluated on every task seen so far:

Train on Task 1 → eval: T1=95%
Train on Task 2 → eval: T1=82%, T2=93%
Train on Task 3 → eval: T1=75%, T2=85%, T3=94%
Train on Task 4 → eval: T1=70%, T2=80%, T3=87%, T4=95%
Train on Task 5 → eval: T1=68%, T2=77%, T3=83%, T4=89%, T5=96%

Retention = mean(68, 77, 83, 89, 96) = 82.6%
Forgetting = mean drop from best = ((95-68) + (93-77) + (94-83) + (95-89)) / 4 = 15%
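
Both metrics follow directly from the accuracy matrix above. A short sketch of the arithmetic (the matrix layout, with rows as training stages and columns as tasks, is our own convention, not the project's code):

import numpy as np

# acc[i, j] = accuracy on task j measured right after training on task i
# (NaN where task j has not been trained yet); values from the example above.
acc = np.array([
    [95, np.nan, np.nan, np.nan, np.nan],
    [82, 93,     np.nan, np.nan, np.nan],
    [75, 85,     94,     np.nan, np.nan],
    [70, 80,     87,     95,     np.nan],
    [68, 77,     83,     89,     96],
])

retention = np.nanmean(acc[-1])                    # mean final accuracy -> 82.6
best = np.nanmax(acc[:, :-1], axis=0)              # best accuracy each earlier task ever reached
forgetting = float(np.mean(best - acc[-1, :-1]))   # mean drop from best -> 15.0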

Breakthrough Requirements

To claim a real breakthrough (not just a lucky run), we require:

Methods We Test

Method       Approach                                                            Retention   Forgetting
Baseline     Naive fine-tuning (no protection)                                   51%         20%
EWC-1000     Elastic Weight Consolidation - penalize important weight changes    58%         39%
Replay-200   Store and replay examples from previous tasks                       57%         23%
Hybrid       EWC + Replay combined                                               -           -

Research goal: Beat replay on retention while maintaining low forgetting. Note: 50% retention = random chance on binary classification.
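
For reference, the EWC penalty in the table above amounts to a quadratic constraint on weights that mattered for earlier tasks. A minimal sketch, assuming PyTorch and omitting the Fisher-information estimation (the function and argument names are ours, not the project's implementation):

import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    # Penalize moving weights that were important for previous tasks
    # (high Fisher information) away from their previously learned values.
    # lam=1000 corresponds to the EWC-1000 configuration above.
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During training on a new task:
#   loss = task_loss + ewc_penalty(model, fisher, old_params)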

Research is conducted autonomously and results sync to this dashboard every 5 minutes.