Testing methodology for continual learning research
Continual learning (also called lifelong learning) addresses a fundamental problem: when neural networks learn new tasks, they tend to forget previously learned ones. This is called catastrophic forgetting.
We're researching methods to prevent forgetting in small transformer models (<10M parameters), aiming for techniques that are practical and efficient.
We use a two-stage approach to balance speed with rigor:
Most ideas are tested in Stage 1. Only promising results graduate to Stage 2.
| Suite | Time | Use Case |
|---|---|---|
| Synthetic Conflict Sequences | ~10s | Stage 1: Transformer-oriented, controlled interference |
| Permuted MNIST | ~30s | Stage 2: Proven benchmark, attention conflicts |
| Split MNIST | ~1min | Stage 2: Standard benchmark |
Synthetic Conflict Sequences are designed specifically to stress transformer attention and embedding layers, with controlled interference patterns (cue conflicts, key-value remapping, rule switches).
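A cue-conflict sequence can be generated along these lines. This is an illustrative sketch, not the project's actual generator: the function name, token layout, and sizes are assumptions. It builds tasks where consecutive tasks assign opposite labels to the same cue token, which is the controlled-interference pattern described above.

```python
import numpy as np

def make_cue_conflict_tasks(n_tasks=4, n_cues=8, seq_len=6, n_per_task=256, seed=0):
    """Sketch: each task maps cue tokens to binary labels; consecutive
    tasks flip every label, forcing maximal interference."""
    rng = np.random.default_rng(seed)
    base_labels = rng.integers(0, 2, size=n_cues)  # task 0's cue -> label map
    tasks = []
    for t in range(n_tasks):
        labels_map = base_labels ^ (t % 2)  # flip all labels on odd tasks
        cues = rng.integers(0, n_cues, size=n_per_task)
        # sequences are random filler tokens with the cue placed at position 0
        x = rng.integers(n_cues, n_cues + 16, size=(n_per_task, seq_len))
        x[:, 0] = cues
        y = labels_map[cues]
        tasks.append((x, y))
    return tasks
```

Because every label flips between adjacent tasks, naive fine-tuning on task *t+1* directly unlearns task *t*, which is exactly the stress the benchmark is meant to apply.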
Before running experiments, we validate that our benchmarks create real gradient conflict. This ensures we're measuring actual catastrophic forgetting, not just different tasks.
`mean_cosine < 0` means task gradients point in opposite directions (conflict); `conflict_score > 0.5` means a majority of task pairs interfere. Our `cue_conflict` benchmark flips 100% of labels between consecutive tasks.
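The two validation metrics can be computed from one flattened gradient vector per task. A minimal sketch (the function name is ours; how the per-task gradients are obtained is framework-specific and omitted):

```python
import numpy as np
from itertools import combinations

def gradient_conflict(task_grads):
    """Given one flattened gradient vector per task, return the mean
    pairwise cosine similarity and the fraction of task pairs whose
    gradients point in opposing directions (negative cosine)."""
    cosines = []
    for g_a, g_b in combinations(task_grads, 2):
        cos = g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b))
        cosines.append(cos)
    mean_cosine = float(np.mean(cosines))
    conflict_score = float(np.mean([c < 0 for c in cosines]))
    return mean_cosine, conflict_score
```

Two tasks with exactly opposite gradients give `mean_cosine = -1.0` and `conflict_score = 1.0`; unrelated tasks hover near zero on both.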
| Tier | Max Time | Seeds | When to Use |
|---|---|---|---|
| Quick | 30s | 1 | Debugging, sanity checks |
| Standard | 5min | 3 | Real experiments |
| Full | 15min | 5+ | Validating breakthroughs |
No experiment should take hours. If too slow, we shrink the model or data.
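The tiers above are easy to encode as a small registry so the time budget is enforced mechanically. This is a hypothetical schema, not the project's real config format:

```python
# Illustrative tier registry matching the table above; names and fields
# are assumptions, not an actual config schema from this project.
TIERS = {
    "quick":    {"max_seconds": 30,      "seeds": 1},  # debugging, sanity checks
    "standard": {"max_seconds": 5 * 60,  "seeds": 3},  # real experiments
    "full":     {"max_seconds": 15 * 60, "seeds": 5},  # validating breakthroughs
}

def budget_for(tier_name):
    """Look up a tier and enforce the 'no experiment takes hours' rule."""
    tier = TIERS[tier_name]
    assert tier["max_seconds"] <= 15 * 60, "shrink the model or data instead"
    return tier
```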
Retention is the primary metric. After training on all tasks sequentially, we measure accuracy on each task. Retention = average of these accuracies.
Forgetting measures how much accuracy drops on earlier tasks after learning later ones. Lower is better.
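Both metrics fall out of an accuracy matrix recorded after each training stage. A sketch under the common convention (our naming, not necessarily the project's): `acc[i][j]` is accuracy on task *j* measured right after training on task *i*.

```python
import numpy as np

def retention_and_forgetting(acc):
    """acc[i][j] = accuracy on task j after training on task i.
    Retention: mean final accuracy across all tasks.
    Forgetting: mean drop from each task's best earlier accuracy
    to its final accuracy (the last task has no later training,
    so it is excluded)."""
    acc = np.asarray(acc, dtype=float)
    n = acc.shape[0]
    final = acc[-1]
    retention = float(final.mean())
    drops = [acc[: n - 1, j].max() - final[j] for j in range(n - 1)]
    forgetting = float(np.mean(drops))
    return retention, forgetting
```

For example, with two tasks and `acc = [[0.9, 0.5], [0.6, 0.9]]`, retention is 0.75 and forgetting is 0.3 (task 0 dropped from 0.9 to 0.6).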
To claim a real breakthrough (not just a lucky run), we require Full-tier validation: 5+ seeds with consistent gains over the baselines below.
| Method | Approach | Retention | Forgetting |
|---|---|---|---|
| Baseline | Naive fine-tuning (no protection) | 51% | 20% |
| EWC-1000 | Elastic Weight Consolidation: penalize changes to weights important for earlier tasks | 58% | 39% |
| Replay-200 | Store and replay examples from previous tasks | 57% | 23% |
| Hybrid | EWC + Replay combined | - | - |
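The EWC row above rests on a quadratic penalty, L = L_new + (λ/2) Σᵢ Fᵢ (θᵢ − θ*ᵢ)², where F is the diagonal Fisher information estimated on the old task and θ* the old-task weights. A minimal numpy stand-in (the real implementation would live in whatever framework the models use; λ = 1000 mirrors the EWC-1000 row, an assumption about its naming):

```python
import numpy as np

def ewc_penalty(params, star_params, fisher, lam=1000.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.
    `fisher` holds diagonal Fisher estimates, one array per parameter
    tensor; `star_params` are the weights frozen after the old task."""
    penalty = 0.0
    for p, p_star, f in zip(params, star_params, fisher):
        penalty += float((f * (p - p_star) ** 2).sum())
    return 0.5 * lam * penalty
```

Weights with large Fisher values are expensive to move, which is how EWC trades some plasticity (and, per the table, not always low forgetting) for protection of old-task solutions.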
Research goal: beat Replay-200 on retention while keeping forgetting low. Note: 50% retention equals random chance on binary classification.
Research is conducted autonomously and results sync to this dashboard every 5 minutes.