ADR-004: Role-Based Talos Image Architecture
ADR-004: Role-Based Talos Image Architecture
Date: 2025-11-18
Status: Accepted
Context: CozyStack ARM64 Cluster Formation Requirements
Related: ADR-001: ARM64 Architecture Choice, ADR-002: TDG Methodology
Summary
Implement role-based Talos image architecture with separate compute and gateway node variants to enable proper Kubernetes cluster formation with Tailscale subnet routing.
Problem
Single Image with All Extensions Breaks Cluster Formation:
Current Implementation:
-EXTENSIONS="drbd zfs"
+EXTENSIONS="drbd zfs spin tailscale"
Critical Issue: Kubernetes nodes only reach “Ready” state when ALL configured Talos extensions are active and properly configured. With Tailscale extension on every node:
- Multiple Subnet Routers: Every node tries to configure as Tailscale subnet router
- Configuration Conflicts: Multiple nodes compete for same routing role
- Cluster Formation Failure: Nodes hang waiting for conflicting Tailscale configurations
- Ready State Never Achieved: Cluster never becomes operational
Architecture Requirements
CozyStack + Tailscale Integration Pattern:
- One subnet router per cluster: Exposes service/pod CIDR to external Tailscale network
- Multiple compute nodes: Run WebAssembly workloads without networking conflicts
- Clean role separation: Different node types have different extension requirements
Node Ready Condition Constraint:
- Nodes wait for ALL configured extensions to become active
- Failed extension configuration = node never reaches Ready state
- Heterogeneous extension sets = different node readiness requirements
Decision
✅ CHOSEN: Role-Based Image Architecture
1. Compute Node Images (*-compute)
EXTENSIONS="drbd zfs spin"
Purpose: WebAssembly workload execution
- Quantity: Majority of cluster nodes (scalable)
- Extensions: Only Spin WebAssembly runtime
- Ready Condition: Simple - waits only for Spin activation
- Network Role: Standard Kubernetes pod networking
2. Gateway Node Images (*-gateway)
EXTENSIONS="drbd zfs spin tailscale"
Purpose: Subnet routing + WebAssembly execution
- Quantity: Exactly one per cluster
- Extensions: Spin runtime + Tailscale subnet router
- Ready Condition: Complex - waits for both Spin + Tailscale activation
- Network Role: Tailscale subnet router for external access
Implementation Strategy
Matrix Build Strategy
strategy:
matrix:
variant:
- name: compute
extensions: "drbd zfs spin"
suffix: "-compute"
role: "WebAssembly workload nodes"
- name: gateway
extensions: "drbd zfs spin tailscale"
suffix: "-gateway"
role: "Subnet router + compute node"
Patch Generation
- Option A: Separate patches per variant
- Option B: Parameterized single patch with extension matrix
- Chosen: Matrix strategy with single parameterized patch
Architecture Benefits
1. Cluster Formation Reliability
- Compute nodes reach Ready state quickly (no Tailscale wait)
- Gateway node handles complex networking configuration independently
- No extension conflicts between node roles
2. Operational Clarity
- Clear node role designation at image selection time
- Simplified troubleshooting (role-specific extension issues)
- Predictable cluster behavior patterns
3. Scalability
- Add compute nodes without network configuration complexity
- Gateway node remains singleton (as required by Tailscale architecture)
- WebAssembly workloads can scale across all nodes
4. CozyStack Learning Demonstration
- Shows CozyStack build machinery flexibility
- Demonstrates Talos Linux customization patterns
- Provides template for other specialized node roles
Alternatives Considered
❌ Single Image with Conditional Extension Loading:
- Pros: Simpler build process
- Cons: Runtime complexity, configuration management issues
- Rejected: Violates “extensions always active” Talos principle
❌ Configuration-Time Extension Selection:
- Pros: Maximum flexibility
- Cons: Complex orchestration, error-prone deployment
- Rejected: Increases operational complexity
❌ Post-Boot Extension Management:
- Pros: Dynamic role assignment
- Cons: Not supported by Talos architecture, fragile
- Rejected: Architectural incompatibility
Validation Strategy
TDG Test Requirements
tests/cluster-formation/
├── 01-compute-only-cluster.sh # Multiple compute nodes form working cluster
├── 02-mixed-role-cluster.sh # Compute + gateway cluster formation
├── 03-tailscale-routing-test.sh # Gateway provides subnet routing
└── 04-extension-isolation.sh # No extension conflicts between roles
Success Criteria
- Compute nodes (spin-only) reach Ready state without Tailscale
- Gateway node (spin+tailscale) joins cluster and provides routing
- Mixed cluster demonstrates full WebAssembly + networking functionality
- Clear documentation guides node role selection
- TDG tests validate all cluster formation scenarios
Consequences
Positive:
- ✅ Reliable cluster formation with predictable node behavior
- ✅ Clear operational model for different node types
- ✅ Demonstrates advanced CozyStack build system usage
- ✅ Scalable architecture for larger cluster deployments
Negative:
- ⚠️ Requires image selection decision during node provisioning
- ⚠️ Slightly more complex CI build matrix
- ⚠️ Need clear documentation for role selection guidance
Neutral:
- 🔄 Two container images instead of one (manageable complexity)
- 🔄 Additional test coverage for cluster formation scenarios
Previous ADR: ADR-003: Patch Generation Best Practices
Next ADR: ADR-005: TBD (placeholder)
Next Steps: Implement matrix build strategy in CI workflow, create TDG tests for cluster formation validation, document node role selection guidance.