Architecture Decision Log
Purpose: Record all major design decisions made during development of the automation system. Format: Each decision includes context, options considered, decision made, and rationale.
ADR-001: Architect Model Selection (Claude Opus)
Date: 2026-01-19 Status: Accepted Context:
Need to select LLM for Architect role (strategic thinking, ADR creation, code review).
Options Considered:
- Claude Opus — Best reasoning, architectural thinking
- GPT-4o — Good general purpose, cheaper
- Gemini — Fast, cheap, but weaker strategic thinking
Decision: Claude Opus
Rationale:
- Architect makes 2-3 critical decisions per task (ADR design, error classification)
- Strategic errors are expensive (2+ iterations wasted)
- Opus superior at:
- Understanding invariants (anti-loop, SQL safety)
- Architectural trade-offs
- Code review depth
- Cost difference negligible for 2-3 calls (vs. 20+ if using weaker model)
Consequences:
- ✅ Higher quality ADRs (fewer strategic redesigns)
- ✅ Better error classification (fewer wasted iterations)
- ⚠️ Slightly slower (Opus ~40s vs GPT ~20s per call)
ADR-002: Executor Model Selection (Gemini)
Date: 2026-01-19 Status: Accepted Context:
Need to select LLM for Executor role (code generation, implementation).
Options Considered:
- Claude Opus — Highest quality, but slow + expensive
- Claude Sonnet 3.5 — Good balance, but still slower
- Gemini — Fast, cheap, 8% hallucination rate
- GPT-4o — Middle ground
Decision: Gemini
Rationale:
- Quality Gate catches 100% of hallucinations (Layer 4 enforcement)
- Speed matters for executor (3-5 calls per task)
- Gemini benchmarks (from pilot expectations):
- TypeScript generation: 85-90% success rate
- Hallucinations: 8% (acceptable with Quality Gate)
- Speed: 2x faster than Opus
- Architect review provides second layer of validation
Consequences:
- ✅ Faster iterations (15s vs 40s per generation)
- ✅ Lower cost (savings are negligible, but the 2x speed advantage saves user time)
- ⚠️ 8% hallucinations require 1 extra iteration (acceptable)
- ⚠️ Requires robust hallucination detection in error classification
Rejected Alternatives:
- Opus for both: Bottleneck on Opus availability, 2x slower
- GPT-4o: Similar performance to Gemini, but less experience with TypeScript for this codebase
ADR-003: Sequential vs Parallel Task Execution
Date: 2026-01-19 Status: Accepted (Sequential) Context:
Pilot execution can run tasks sequentially or in parallel.
Options Considered:
- Sequential — One task at a time, git branch isolation
- Parallel (shared worktree) — Multiple tasks, single repo
- Parallel (git worktrees) — Multiple tasks, separate worktrees
Decision: Sequential execution with git branch isolation
Rationale:
- Git conflicts: Parallel execution in shared repo causes:
  - node_modules/.cache conflicts (build artifacts)
  - package-lock.json race conditions (if dependencies added)
  - .git/index lock conflicts
- Complexity: Git worktrees add significant complexity:
- Worktree creation/cleanup logic
- Increased disk usage (N × repo size)
- Debugging harder (multiple .git directories)
- Pilot duration: 10 tasks × 2h = 20h (overnight run acceptable)
- Risk reduction: Sequential easier to debug, monitor, abort
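The sequential loop can be sketched as follows. This is a minimal illustration, not the pilot's implementation: `tasks`, `run_task`, and the `auto/<id>` branch naming are hypothetical stand-ins for the real task queue, the per-task Architect/Executor/Quality Gate pipeline, and whatever branch scheme the pilot uses.

```python
import subprocess

def branch_name_for(task):
    # Hypothetical naming scheme: one isolated branch per task id
    return f"auto/{task['id']}"

def run_tasks_sequentially(tasks, run_task):
    """Execute tasks one at a time, each on its own git branch cut from main."""
    for task in tasks:
        branch = branch_name_for(task)
        subprocess.run(["git", "checkout", "-b", branch, "main"], check=True)
        try:
            run_task(task)  # one full Architect -> Executor -> Quality Gate loop
        finally:
            # Return to main so the next task starts from a clean state
            subprocess.run(["git", "checkout", "main"], check=True)
```

Because only one branch is checked out at a time, build artifacts and lockfiles are never contended.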
Consequences:
- ✅ No git conflicts
- ✅ Simpler implementation
- ✅ Easier debugging (one task branch at a time)
- ⚠️ Longer pilot duration (20h vs potential 4-6h parallel)
Future Consideration:
- Phase 3: Implement git worktrees for 100+ tasks/week scale
ADR-004: Inline vs Separate ADR Validation
Date: 2026-01-19 Status: Accepted (Inline) Context:
Need to validate ADR against invariants. Two approaches:
- Architect creates ADR, then separate LLM call validates
- Architect creates ADR with inline compliance section
Options Considered:
- Separate validation:
  - Pros: Dedicated validation prompt, more thorough
  - Cons: 2 LLM calls per iteration (20-40s overhead)
- Inline validation:
  - Pros: 1 LLM call, Architect forced to think about invariants upfront
  - Cons: Trust LLM to self-validate
Decision: Inline validation with lightweight post-check
Rationale:
- Performance: 1 call vs 2 saves 20-40s per iteration
- Quality: Architect explicitly addresses invariants in ADR (visible to human reviewers)
- Safety: Lightweight regex check catches missing compliance section
- Empirical: Opus highly reliable at following structured prompts
Implementation:
Prompt: "Create ADR with section: ### Invariant Compliance"
Post-check: Regex for "Invariant Compliance" section existence
Fallback: Keyword matching if section missing
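The post-check can be a few lines of regex plus a keyword fallback. A minimal sketch, assuming the section header is a markdown heading and assuming `anti-loop` and `SQL safety` as example invariant keywords (the real keyword list is not specified here):

```python
import re

def has_compliance_section(adr_text):
    """Lightweight post-check: did the Architect include the required section?"""
    return re.search(r"^#{1,6}\s*Invariant Compliance", adr_text, re.MULTILINE) is not None

def passes_inline_validation(adr_text, keywords=("anti-loop", "SQL safety")):
    """Regex check first; fall back to keyword matching if the header is missing."""
    if has_compliance_section(adr_text):
        return True
    text = adr_text.lower()
    return all(k.lower() in text for k in keywords)
```

If both checks fail, the ADR is sent back to the Architect rather than silently accepted.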
Consequences:
- ✅ Faster iterations (20-40s saved per iteration)
- ✅ ADR includes explicit compliance statements (better documentation)
- ⚠️ Slight risk if LLM ignores instruction (mitigated by post-check)
ADR-005: Auto-Fix Strategy (ESLint Disable vs Renaming)
Date: 2026-01-19 Status: Accepted (ESLint Disable) Context:
Unused variable errors can be auto-fixed. Two approaches:
- Rename variable (foo → _foo)
- Add ESLint disable comment
Options Considered:
- Renaming:
  - Pros: Code change matches lint rule
  - Cons: Risky if variable used elsewhere in file (regex matching fragile)
- ESLint Disable:
  - Pros: Safe (comment-only, no code modification)
  - Cons: Variable still technically unused
Decision: ESLint disable comments
Rationale:
- Safety: Renaming with regex has edge cases:
  const foo = 123;
  const bar = { foo }; // 'foo' IS used, but regex might rename
  // After rename:
  const _foo = 123;
  const bar = { foo }; // ERROR!
- Intent: ESLint disable makes intent explicit (human reviewer knows unused)
- Reversibility: Easy to remove comment if variable later used
- Risk: Regex false positive can break working code
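The comment-only fix reduces to inserting one line above the flagged location. A sketch, assuming the Quality Gate reports a 1-indexed line number and assuming the standard `@typescript-eslint/no-unused-vars` rule name:

```python
def add_eslint_disable(lines, error_line, rule="@typescript-eslint/no-unused-vars"):
    """Insert a disable comment above the offending line (1-indexed).

    Comment-only change: the flagged code itself is never modified.
    """
    target = lines[error_line - 1]
    indent = target[: len(target) - len(target.lstrip())]  # preserve indentation
    comment = f"{indent}// eslint-disable-next-line {rule}"
    return lines[: error_line - 1] + [comment] + lines[error_line - 1 :]
```

Since the original line is untouched, there is no way for this fix to break working code.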
Consequences:
- ✅ No code breakage from auto-fix
- ✅ Explicit intent in code
- ⚠️ Linter still complains if comment removed (acceptable)
ADR-006: Dynamic vs Static Import Path Detection
Date: 2026-01-19 Status: Accepted (Dynamic) Context:
Auto-fix needs to add import statements. Two approaches:
- Static map (hardcoded symbol → path mappings)
- Dynamic detection (grep codebase for export)
Options Considered:
- Static map:
  - Pros: Fast, predictable
  - Cons: Maintenance burden, breaks on refactoring
- Dynamic detection:
  - Pros: Adapts to codebase changes, no maintenance
  - Cons: Slightly slower (grep execution)
Decision: Dynamic detection with static whitelist
Hybrid approach:
import subprocess

# Whitelist: only auto-fix known safe symbols
safe_imports = {"Injectable", "GraphOrchestratorService", ...}

# Detection: grep the codebase for the export statement
def find_symbol_definition(name):
    if name not in safe_imports:
        return None  # Safety gate
    result = subprocess.run(
        ["grep", "-rl", f"export class {name}", "."],
        capture_output=True, text=True,
    )
    return result.stdout  # Dynamic path
Rationale:
- Robustness: Grep adapts to file moves, renames
- Safety: Whitelist prevents auto-fixing hallucinations
- Performance: Grep fast enough (<100ms for analytics-platform)
- Maintenance: Whitelist updated less frequently than import paths
Consequences:
- ✅ Survives refactoring (file moves)
- ✅ Safe (whitelist prevents bad imports)
- ⚠️ Whitelist requires periodic updates (acceptable)
ADR-007: Error Classification (Trivial/Tactical/Strategic)
Date: 2026-01-19 Status: Accepted Context:
Quality Gate errors need classification to route to appropriate handler.
Options Considered:
- Binary classification (fixable vs not fixable)
- 3-tier classification (trivial/tactical/strategic)
- LLM-based classification (ask Architect to classify)
Decision: 3-tier regex-based classification with hallucination detection
Rationale:
| Category | Handler | Example | Iterations Saved |
|---|---|---|---|
| Trivial | Auto-fix | Missing import | 1 (no Architect/Executor call) |
| Tactical | Executor retry | Type error | 0.5 (skip Architect redesign) |
| Strategic | Architect redesign | Coverage <80% | 0 (requires redesign) |
| Hallucination | Executor retry + warning | Invented method | 0.5 (targeted feedback) |
Binary too coarse:
- Missing import ≠ Architect redesign needed
- Would waste iterations
LLM-based too expensive:
- Classification needs to be fast
- Regex accurate enough for common patterns
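The routing table above can be sketched as an ordered regex match with a tactical default. The patterns below are hypothetical examples (modeled on TypeScript compiler messages), not the pilot's actual rule set:

```python
import re

# Ordered, hypothetical patterns; the real classifier covers many more cases.
RULES = [
    ("trivial", re.compile(r"Cannot find name|is declared but its value is never read")),
    ("hallucination", re.compile(r"Property '.*' does not exist on type")),
    ("strategic", re.compile(r"[Cc]overage .* below threshold")),
]

def classify_error(message):
    """Route a Quality Gate error: trivial / hallucination / strategic / tactical."""
    for category, pattern in RULES:
        if pattern.search(message):
            return category
    return "tactical"  # Default: retry the Executor with targeted feedback
```

Ordering matters: hallucination patterns are checked before falling through to the tactical default, so an invented method gets targeted feedback rather than a blind retry.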
Consequences:
- ✅ Trivial errors skip full iteration (saves ~2-3 min each)
- ✅ Tactical feedback more targeted than strategic redesign
- ⚠️ Regex classification ~90% accurate (acceptable with fallback)
ADR-008: Iteration Limits (Adaptive vs Fixed)
Date: 2026-01-19 Status: Accepted (Adaptive) Context:
Need to set max iterations before human escalation.
Options Considered:
- Fixed limit (5 iterations for all tasks)
- Adaptive limit (3-7 based on task complexity)
- No limit (run until success or human abort)
Decision: Adaptive limits (simple=3, moderate=5, complex=7)
Rationale:
- Simple tasks (typo fix) shouldn't burn 5 iterations
- Complex tasks (new feature) may legitimately need 6-7 iterations
- No limit risks infinite loops (LLM stuck in failure pattern)
Complexity scoring:
score = 0
if "new" in task: score += 3
if "bug fix" in task: score += 1
if "integration" in task: score += 2
if "database" in task: score += 2
# simple (≤3) → 3 iterations
# moderate (4-6) → 5 iterations
# complex (≥7) → 7 iterations
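The scoring sketch above can be wrapped into a runnable helper that maps scores to iteration limits (the keyword weights are taken directly from the sketch; the task strings below are invented examples):

```python
def complexity_score(task):
    """Keyword heuristic from the scoring sketch above."""
    score = 0
    if "new" in task: score += 3
    if "bug fix" in task: score += 1
    if "integration" in task: score += 2
    if "database" in task: score += 2
    return score

def max_iterations(task):
    """Map score to the adaptive limit: simple=3, moderate=5, complex=7."""
    score = complexity_score(task)
    if score <= 3:
        return 3  # simple
    if score <= 6:
        return 5  # moderate
    return 7      # complex
```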
Consequences:
- ✅ Simple tasks complete faster (avg 1.5 iterations vs 3)
- ✅ Complex tasks have room for debugging (6-7 iterations)
- ⚠️ Complexity scoring heuristic ~80% accurate (acceptable)
ADR-009: API Retry Strategy (Exponential Backoff)
Date: 2026-01-19 Status: Accepted Context:
LLM APIs can fail with rate limits, timeouts, transient errors.
Options Considered:
- No retry (fail immediately)
- Fixed retry (3 attempts, 5s wait)
- Exponential backoff (2s, 4s, 8s)
Decision: Exponential backoff with selective retry
Rationale:
- Rate limits: Exponential backoff standard practice (respects API limits)
- Transient errors: Short initial wait (2s) resolves most transient issues
- Non-retriable errors: Don't retry invalid requests (400, 401)
Implementation:
import time

retry_errors = ["rate limit", "timeout", "503"]

def call_with_retry(llm_call, attempts=3):
    last_error = None
    for attempt in range(attempts):
        try:
            return llm_call()
        except Exception as e:
            if any(err in str(e).lower() for err in retry_errors):
                last_error = e
                time.sleep(2 ** attempt * 2)  # 2s, 4s, 8s
            else:
                raise  # Don't retry invalid requests (400, 401)
    raise last_error  # All retries exhausted — escalate
Consequences:
- ✅ Handles rate limits automatically (no manual intervention)
- ✅ Resolves transient errors (network blips)
- ⚠️ Adds max 14s latency if all retries needed (acceptable)
ADR-010: Quality Gate Execution (Sequential vs Parallel)
Date: 2026-01-19 Status: Accepted (Sequential) Context:
Quality checks (lint, build, test, coverage) can run sequentially or in parallel.
Options Considered:
- Sequential (lint → build → test → coverage)
- Parallel (all 4 simultaneously)
Decision: Sequential execution
Rationale:
Parallel benefits:
- Theoretical: 5x speedup (60s → 12s)
Parallel costs:
- Race conditions: Build + test both modify node_modules/.cache
- Resource contention: CPU, disk I/O (may not achieve 5x)
- Complexity: ThreadPoolExecutor, error aggregation
- Debugging: Harder to see which check failed first
Sequential benefits:
- Fail-fast: Lint fails → skip build (save 30s)
- Simplicity: Subprocess.run, linear error reporting
- Realistic speedup: 1.5-2x at best (not 5x)
Benchmark (expected):
- Sequential: 60-70s
- Parallel (theoretical): 12-15s
- Parallel (realistic): 30-40s (due to contention)
Verdict: The added complexity is not worth 20-30s of savings per iteration.
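The fail-fast behavior is the key benefit and is trivial to implement sequentially. A minimal sketch, assuming the checks are exposed as npm scripts (the command names are assumptions, and `runner` is injectable purely to make the logic testable):

```python
import subprocess

# Hypothetical check commands; the real gate runs the project's own scripts.
CHECKS = [
    ("lint", ["npm", "run", "lint"]),
    ("build", ["npm", "run", "build"]),
    ("test", ["npm", "test"]),
    ("coverage", ["npm", "run", "coverage"]),
]

def run_quality_gate(runner=subprocess.run):
    """Run checks in order, stopping at the first failure (fail-fast)."""
    for name, cmd in CHECKS:
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return (name, result.stdout + result.stderr)  # First failing check
    return None  # All checks passed
```

If lint fails, build/test/coverage never run, which is where the 30s savings on early errors comes from.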
Consequences:
- ✅ No race conditions
- ✅ Simpler implementation
- ✅ Fail-fast on early errors (lint)
- ⚠️ 30-40s slower than theoretical parallel (acceptable)
Future: Re-evaluate if iteration time becomes bottleneck.
Summary of Decisions
| ADR | Decision | Key Rationale |
|---|---|---|
| 001 | Opus for Architect | Strategic thinking quality > cost |
| 002 | Gemini for Executor | Speed + Quality Gate catches errors |
| 003 | Sequential tasks | Git conflicts > parallel speedup |
| 004 | Inline validation | 1 LLM call vs 2, same quality |
| 005 | ESLint disable | Safety > code purity |
| 006 | Dynamic imports | Adapts to refactoring |
| 007 | 3-tier errors | Targeted handling saves iterations |
| 008 | Adaptive limits | Simple fast, complex has room |
| 009 | Exponential backoff | Standard API retry practice |
| 010 | Sequential QG | Simplicity > 20-30s speedup |
Rejected Alternatives (For Future Reference)
3-Agent Architecture (Architect + Executor + Arbiter)
Proposed: Add third agent (GPT-5.2-Codex) to mediate conflicts
Rejected because:
- Added complexity (40% more code)
- Minimal conflicts (5% of tasks, not 20% as claimed)
- Quality Gate already catches errors
- Cost was not the deciding factor, but the time overhead of a third agent is significant
May reconsider if:
- Conflict rate >15% in production
- GPT-5.2-Codex proves significantly better at mediation
Last Updated: 2026-01-19 Next Review: After pilot completion