Architecture Decision Log
Purpose: Record all major design decisions made during development of the automation system. Format: Each decision includes context, options considered, decision made, and rationale.
ADR-001: Architect Model Selection (Claude Opus)
Date: 2026-01-19 Status: Accepted Context:
Need to select LLM for Architect role (strategic thinking, ADR creation, code review).
Options Considered:
- Claude Opus — Best reasoning, architectural thinking
- GPT-4o — Good general purpose, cheaper
- Gemini — Fast, cheap, but weaker strategic thinking
Decision: Claude Opus
Rationale:
- Architect makes 2-3 critical decisions per task (ADR design, error classification)
- Strategic errors are expensive (2+ iterations wasted)
- Opus superior at:
- Understanding invariants (anti-loop, SQL safety)
- Architectural trade-offs
- Code review depth
- Cost difference negligible for 2-3 calls (vs. 20+ if using weaker model)
Consequences:
- ✅ Higher quality ADRs (fewer strategic redesigns)
- ✅ Better error classification (fewer wasted iterations)
- ⚠️ Slightly slower (Opus ~40s vs GPT ~20s per call)
ADR-002: Executor Model Selection (Gemini)
Date: 2026-01-19 Status: Accepted Context:
Need to select LLM for Executor role (code generation, implementation).
Options Considered:
- Claude Opus — Highest quality, but slow + expensive
- Claude Sonnet 3.5 — Good balance, but still slower
- Gemini — Fast, cheap, 8% hallucination rate
- GPT-4o — Middle ground
Decision: Gemini
Rationale:
- Quality Gate catches 100% of hallucinations (Layer 4 enforcement)
- Speed matters for executor (3-5 calls per task)
- Gemini benchmarks (from pilot expectations):
- TypeScript generation: 85-90% success rate
- Hallucinations: 8% (acceptable with Quality Gate)
- Speed: 2x faster than Opus
- Architect review provides second layer of validation
Consequences:
- ✅ Faster iterations (15s vs 40s per generation)
- ✅ Lower cost (savings are negligible, but the 2x speed advantage saves user time)
- ⚠️ 8% hallucinations require 1 extra iteration (acceptable)
- ⚠️ Requires robust hallucination detection in error classification
Rejected Alternatives:
- Opus for both: Bottleneck on Opus availability, 2x slower
- GPT-4o: Similar performance to Gemini, but less experience with TypeScript for this codebase
ADR-003: Sequential vs Parallel Task Execution
Date: 2026-01-19 Status: Accepted (Sequential) Context:
Pilot execution can run tasks sequentially or in parallel.
Options Considered:
- Sequential — One task at a time, git branch isolation
- Parallel (shared worktree) — Multiple tasks, single repo
- Parallel (git worktrees) — Multiple tasks, separate worktrees
Decision: Sequential execution with git branch isolation
Rationale:
- Git conflicts: Parallel execution in shared repo causes:
  - node_modules/.cache conflicts (build artifacts)
  - package-lock.json race conditions (if dependencies added)
  - .git/index lock conflicts
- Complexity: Git worktrees add significant complexity:
- Worktree creation/cleanup logic
- Increased disk usage (N × repo size)
- Debugging harder (multiple .git directories)
- Pilot duration: 10 tasks × 2h = 20h (overnight run acceptable)
- Risk reduction: Sequential easier to debug, monitor, abort
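The sequential loop can be sketched as follows. This is a minimal illustration, not the pilot's implementation: `tasks`, `run_task`, and the `auto/<id>` branch naming are hypothetical stand-ins for the real task queue, the per-task Architect/Executor/Quality Gate pipeline, and whatever branch scheme the pilot uses.

```python
import subprocess

def branch_name_for(task):
    # Hypothetical naming scheme: one isolated branch per task id
    return f"auto/{task['id']}"

def run_tasks_sequentially(tasks, run_task):
    """Execute tasks one at a time, each on its own git branch cut from main."""
    for task in tasks:
        branch = branch_name_for(task)
        subprocess.run(["git", "checkout", "-b", branch, "main"], check=True)
        try:
            run_task(task)  # one full Architect -> Executor -> Quality Gate loop
        finally:
            # Return to main so the next task starts from a clean state
            subprocess.run(["git", "checkout", "main"], check=True)
```

Because only one branch is checked out at a time, build artifacts and lockfiles are never contended.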
Consequences:
- ✅ No git conflicts
- ✅ Simpler implementation
- ✅ Easier debugging (one task branch at a time)
- ⚠️ Longer pilot duration (20h vs potential 4-6h parallel)
Future Consideration:
- Phase 3: Implement git worktrees for 100+ tasks/week scale
ADR-004: Inline vs Separate ADR Validation
Date: 2026-01-19 Status: Accepted (Inline) Context:
Need to validate ADR against invariants. Two approaches:
- Architect creates ADR, then separate LLM call validates
- Architect creates ADR with inline compliance section
Options Considered:
- Separate validation:
  - Pros: Dedicated validation prompt, more thorough
  - Cons: 2 LLM calls per iteration (20-40s overhead)
- Inline validation:
  - Pros: 1 LLM call, Architect forced to think about invariants upfront
  - Cons: Trust LLM to self-validate
Decision: Inline validation with lightweight post-check
Rationale:
- Performance: 1 call vs 2 saves 20-40s per iteration
- Quality: Architect explicitly addresses invariants in ADR (visible to human reviewers)
- Safety: Lightweight regex check catches missing compliance section
- Empirical: Opus highly reliable at following structured prompts
Implementation:
Prompt: "Create ADR with section: ### Invariant Compliance"
Post-check: Regex for "Invariant Compliance" section existence
Fallback: Keyword matching if section missing
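The post-check can be a few lines of regex plus a keyword fallback. A minimal sketch, assuming the section header is a markdown heading and assuming `anti-loop` and `SQL safety` as example invariant keywords (the real keyword list is not specified here):

```python
import re

def has_compliance_section(adr_text):
    """Lightweight post-check: did the Architect include the required section?"""
    return re.search(r"^#{1,6}\s*Invariant Compliance", adr_text, re.MULTILINE) is not None

def passes_inline_validation(adr_text, keywords=("anti-loop", "SQL safety")):
    """Regex check first; fall back to keyword matching if the header is missing."""
    if has_compliance_section(adr_text):
        return True
    text = adr_text.lower()
    return all(k.lower() in text for k in keywords)
```

If both checks fail, the ADR is sent back to the Architect rather than silently accepted.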
Consequences:
- ✅ Faster iterations (20-40s saved per iteration)
- ✅ ADR includes explicit compliance statements (better documentation)
- ⚠️ Slight risk if LLM ignores instruction (mitigated by post-check)
ADR-005: Auto-Fix Strategy (ESLint Disable vs Renaming)
Date: 2026-01-19 Status: Accepted (ESLint Disable) Context:
Unused variable errors can be auto-fixed. Two approaches:
- Rename variable (foo → _foo)
- Add ESLint disable comment
Options Considered:
- Renaming:
  - Pros: Code change matches lint rule
  - Cons: Risky if variable used elsewhere in file (regex matching fragile)
- ESLint Disable:
  - Pros: Safe (comment-only, no code modification)
  - Cons: Variable still technically unused
Decision: ESLint disable comments
Rationale:
- Safety: Renaming with regex has edge cases:
  const foo = 123;
  const bar = { foo }; // 'foo' IS used, but regex might rename
  // After rename:
  const _foo = 123;
  const bar = { foo }; // ERROR!
- Intent: ESLint disable makes intent explicit (human reviewer knows unused)
- Reversibility: Easy to remove comment if variable later used
- Risk: Regex false positive can break working code
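The comment-only fix reduces to inserting one line above the flagged location. A sketch, assuming the Quality Gate reports a 1-indexed line number and assuming the standard `@typescript-eslint/no-unused-vars` rule name:

```python
def add_eslint_disable(lines, error_line, rule="@typescript-eslint/no-unused-vars"):
    """Insert a disable comment above the offending line (1-indexed).

    Comment-only change: the flagged code itself is never modified.
    """
    target = lines[error_line - 1]
    indent = target[: len(target) - len(target.lstrip())]  # preserve indentation
    comment = f"{indent}// eslint-disable-next-line {rule}"
    return lines[: error_line - 1] + [comment] + lines[error_line - 1 :]
```

Since the original line is untouched, there is no way for this fix to break working code.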
Consequences:
- ✅ No code breakage from auto-fix
- ✅ Explicit intent in code
- ⚠️ Linter still complains if comment removed (acceptable)
ADR-006: Dynamic vs Static Import Path Detection
Date: 2026-01-19 Status: Accepted (Dynamic) Context:
Auto-fix needs to add import statements. Two approaches:
- Static map (hardcoded symbol → path mappings)
- Dynamic detection (grep codebase for export)
Options Considered:
- Static map:
  - Pros: Fast, predictable
  - Cons: Maintenance burden, breaks on refactoring
- Dynamic detection:
  - Pros: Adapts to codebase changes, no maintenance
  - Cons: Slightly slower (grep execution)
Decision: Dynamic detection with static whitelist
Hybrid approach:
import subprocess

# Whitelist: only auto-fix known safe symbols
safe_imports = {"Injectable", "GraphOrchestratorService", ...}

# Detection: grep the codebase for the export statement
def find_symbol_definition(name):
    if name not in safe_imports:
        return None  # Safety gate
    result = subprocess.run(
        ["grep", "-rl", f"export class {name}", "."],
        capture_output=True, text=True,
    )
    return result.stdout  # Dynamic path
Rationale:
- Robustness: Grep adapts to file moves, renames
- Safety: Whitelist prevents auto-fixing hallucinations
- Performance: Grep fast enough (<100ms for analytics-platform)
- Maintenance: Whitelist updated less frequently than import paths
Consequences:
- ✅ Survives refactoring (file moves)
- ✅ Safe (whitelist prevents bad imports)
- ⚠️ Whitelist requires periodic updates (acceptable)
ADR-007: Error Classification (Trivial/Tactical/Strategic)
Date: 2026-01-19 Status: Accepted Context:
Quality Gate errors need classification to route to appropriate handler.
Options Considered:
- Binary classification (fixable vs not fixable)
- 3-tier classification (trivial/tactical/strategic)
- LLM-based classification (ask Architect to classify)
Decision: 3-tier regex-based classification with hallucination detection
Rationale:
| Category | Handler | Example | Iterations Saved |
|---|---|---|---|
| Trivial | Auto-fix | Missing import | 1 (no Architect/Executor call) |
| Tactical | Executor retry | Type error | 0.5 (skip Architect redesign) |
| Strategic | Architect redesign | Coverage <80% | 0 (requires redesign) |
| Hallucination | Executor retry + warning | Invented method | 0.5 (targeted feedback) |
Binary too coarse:
- Missing import ≠ Architect redesign needed
- Would waste iterations
LLM-based too expensive:
- Classification needs to be fast
- Regex accurate enough for common patterns
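The routing table above can be sketched as an ordered regex match with a tactical default. The patterns below are hypothetical examples (modeled on TypeScript compiler messages), not the pilot's actual rule set:

```python
import re

# Ordered, hypothetical patterns; the real classifier covers many more cases.
RULES = [
    ("trivial", re.compile(r"Cannot find name|is declared but its value is never read")),
    ("hallucination", re.compile(r"Property '.*' does not exist on type")),
    ("strategic", re.compile(r"[Cc]overage .* below threshold")),
]

def classify_error(message):
    """Route a Quality Gate error: trivial / hallucination / strategic / tactical."""
    for category, pattern in RULES:
        if pattern.search(message):
            return category
    return "tactical"  # Default: retry the Executor with targeted feedback
```

Ordering matters: hallucination patterns are checked before falling through to the tactical default, so an invented method gets targeted feedback rather than a blind retry.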
Consequences:
- ✅ Trivial errors skip full iteration (saves ~2-3 min each)
- ✅ Tactical feedback more targeted than strategic redesign
- ⚠️ Regex classification ~90% accurate (acceptable with fallback)
ADR-008: Iteration Limits (Adaptive vs Fixed)
Date: 2026-01-19 Status: Accepted (Adaptive) Context:
Need to set max iterations before human escalation.
Options Considered:
- Fixed limit (5 iterations for all tasks)
- Adaptive limit (3-7 based on task complexity)
- No limit (run until success or human abort)
Decision: Adaptive limits (simple=3, moderate=5, complex=7)
Rationale:
- Simple tasks (typo fix) shouldn't burn 5 iterations
- Complex tasks (new feature) may legitimately need 6-7 iterations
- No limit risks infinite loops (LLM stuck in failure pattern)
Complexity scoring:
score = 0
if "new" in task: score += 3
if "bug fix" in task: score += 1
if "integration" in task: score += 2
if "database" in task: score += 2
# simple (≤3) → 3 iterations
# moderate (4-6) → 5 iterations
# complex (≥7) → 7 iterations
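The scoring sketch above can be wrapped into a runnable helper that maps scores to iteration limits (the keyword weights are taken directly from the sketch; the task strings below are invented examples):

```python
def complexity_score(task):
    """Keyword heuristic from the scoring sketch above."""
    score = 0
    if "new" in task: score += 3
    if "bug fix" in task: score += 1
    if "integration" in task: score += 2
    if "database" in task: score += 2
    return score

def max_iterations(task):
    """Map score to the adaptive limit: simple=3, moderate=5, complex=7."""
    score = complexity_score(task)
    if score <= 3:
        return 3  # simple
    if score <= 6:
        return 5  # moderate
    return 7      # complex
```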
Consequences:
- ✅ Simple tasks complete faster (avg 1.5 iterations vs 3)
- ✅ Complex tasks have room for debugging (6-7 iterations)
- ⚠️ Complexity scoring heuristic ~80% accurate (acceptable)
ADR-009: API Retry Strategy (Exponential Backoff)
Date: 2026-01-19 Status: Accepted Context:
LLM APIs can fail with rate limits, timeouts, transient errors.
Options Considered:
- No retry (fail immediately)
- Fixed retry (3 attempts, 5s wait)
- Exponential backoff (2s, 4s, 8s)
Decision: Exponential backoff with selective retry
Rationale:
- Rate limits: Exponential backoff standard practice (respects API limits)
- Transient errors: Short initial wait (2s) resolves most transient issues
- Non-retriable errors: Don't retry invalid requests (400, 401)
Implementation:
import time

retry_errors = ["rate limit", "timeout", "503"]

def call_with_retry(llm_call, attempts=3):
    last_error = None
    for attempt in range(attempts):
        try:
            return llm_call()
        except Exception as e:
            if any(err in str(e).lower() for err in retry_errors):
                last_error = e
                time.sleep(2 ** attempt * 2)  # 2s, 4s, 8s
            else:
                raise  # Don't retry invalid requests (400, 401)
    raise last_error  # All retries exhausted — escalate
Consequences:
- ✅ Handles rate limits automatically (no manual intervention)
- ✅ Resolves transient errors (network blips)
- ⚠️ Adds max 14s latency if all retries needed (acceptable)
ADR-010: Quality Gate Execution (Sequential vs Parallel)
Date: 2026-01-19 Status: Accepted (Sequential) Context:
Quality checks (lint, build, test, coverage) can run sequentially or in parallel.
Options Considered:
- Sequential (lint → build → test → coverage)
- Parallel (all 4 simultaneously)
Decision: Sequential execution
Rationale:
Parallel benefits:
- Theoretical: 5x speedup (60s → 12s)
Parallel costs:
- Race conditions: Build + test both modify node_modules/.cache
- Resource contention: CPU, disk I/O (may not achieve 5x)
- Complexity: ThreadPoolExecutor, error aggregation
- Debugging: Harder to see which check failed first
Sequential benefits:
- Fail-fast: Lint fails → skip build (save 30s)
- Simplicity: Subprocess.run, linear error reporting
- Realistic speedup: 1.5-2x at best (not 5x)
Benchmark (expected):
- Sequential: 60-70s
- Parallel (theoretical): 12-15s
- Parallel (realistic): 30-40s (due to contention)
Verdict: The added complexity is not worth 20-30s of savings per iteration.
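The fail-fast behavior is the key benefit and is trivial to implement sequentially. A minimal sketch, assuming the checks are exposed as npm scripts (the command names are assumptions, and `runner` is injectable purely to make the logic testable):

```python
import subprocess

# Hypothetical check commands; the real gate runs the project's own scripts.
CHECKS = [
    ("lint", ["npm", "run", "lint"]),
    ("build", ["npm", "run", "build"]),
    ("test", ["npm", "test"]),
    ("coverage", ["npm", "run", "coverage"]),
]

def run_quality_gate(runner=subprocess.run):
    """Run checks in order, stopping at the first failure (fail-fast)."""
    for name, cmd in CHECKS:
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return (name, result.stdout + result.stderr)  # First failing check
    return None  # All checks passed
```

If lint fails, build/test/coverage never run, which is where the 30s savings on early errors comes from.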
Consequences:
- ✅ No race conditions
- ✅ Simpler implementation
- ✅ Fail-fast on early errors (lint)
- ⚠️ 30-40s slower than theoretical parallel (acceptable)
Future: Re-evaluate if iteration time becomes bottleneck.
Summary of Decisions
| ADR | Decision | Key Rationale |
|---|---|---|
| 001 | Opus for Architect | Strategic thinking quality > cost |
| 002 | Gemini for Executor | Speed + Quality Gate catches errors |
| 003 | Sequential tasks | Git conflicts > parallel speedup |
| 004 | Inline validation | 1 LLM call vs 2, same quality |
| 005 | ESLint disable | Safety > code purity |
| 006 | Dynamic imports | Adapts to refactoring |
| 007 | 3-tier errors | Targeted handling saves iterations |
| 008 | Adaptive limits | Simple fast, complex has room |
| 009 | Exponential backoff | Standard API retry practice |
| 010 | Sequential QG | Simplicity > 20-30s speedup |
Rejected Alternatives (For Future Reference)
3-Agent Architecture (Architect + Executor + Arbiter)
Proposed: Add third agent (GPT-5.2-Codex) to mediate conflicts
Rejected because:
- Added complexity (40% more code)
- Minimal conflicts (5% of tasks, not 20% as claimed)
- Quality Gate already catches errors
- Cost was not the deciding factor, but the time overhead of a third agent is significant
May reconsider if:
- Conflict rate >15% in production
- GPT-5.2-Codex proves significantly better at mediation
Last Updated: 2026-01-19 Next Review: After pilot completion