Test-Driven Generation Plan: CozyStack Moon and Back
Context for Next Claude Agent
This document follows the Test-Driven Generation (TDG) methodology introduced by Chanwit Kaewkasi. We define tests/acceptance criteria FIRST, then generate code that makes those tests pass.
Reference: I was wrong about Test-Driven Generation
Project Repositories Overview
Primary Presentation Repo (NEW)
- urmanac/cozystack-moon-and-back: Conference talk demo, December 3, 2025
- Purpose: Live demo + slides for CozySummit Virtual 2025
- Content: Terraform for AWS infrastructure, talk materials, demo scripts
- Audience: CozyStack community
Supporting Infrastructure Repos
- urmanac/aws-accounts: Terraform for all Urmanac AWS infrastructure
- Current: Bastion ASG, VPC, security groups (Sandbox account)
- Owner: Urmanac, LLC (Kingdon Barrett)
Flux Bootstrap Repos
- kingdon-ci/fleet-infra: Original Flux bootstrap (may be deprecated?)
- kingdon-ci/cozy-fleet: NEW Flux bootstrap repo for CozyStack
- Purpose: GitOps management of CozyStack clusters
- Status: Determine which is active/canonical
Questions for Operator
- Which Flux repo is canonical: fleet-infra or cozy-fleet?
- Should we consolidate or keep separate?
- Are there other repos in the dependency chain?
TDG Test Suite: Infrastructure Layer
Test 1: Network Foundation Exists
#!/bin/bash
# tests/01-network-foundation.sh
# GIVEN: A clean AWS account in eu-west-1
# WHEN: Terraform apply completes
# THEN: The following resources exist
test_vpc_exists() {
vpc_id=$(aws ec2 describe-vpcs \
--filters "Name=cidr,Values=10.10.0.0/16" \
--query 'Vpcs[0].VpcId' --output text)
[ "$vpc_id" != "None" ] && [ -n "$vpc_id" ]
}
test_single_public_subnet_exists() {
# Desktop design: Single public subnet, no private subnet needed
public_subnet=$(aws ec2 describe-subnets \
--filters "Name=cidr-block,Values=10.10.0.0/24" \
--query 'Subnets[0].SubnetId' --output text)
[ "$public_subnet" != "None" ] && [ -n "$public_subnet" ]
}
test_internet_gateway_only() {
# No NAT gateway needed - IPv6 + bastion Wireguard for internet
igw_state=$(aws ec2 describe-internet-gateways \
--filters "Name=attachment.vpc-id,Values=$vpc_id" \
--query 'InternetGateways[0].State' --output text)
[ "$igw_state" = "available" ]
}
test_route_tables_configured() {
# Public subnet should route 0.0.0.0/0 to the Internet gateway
# (no NAT gateway in this design; see test_internet_gateway_only)
# Route table should also carry the local route for the VPC CIDR
# Implementation TBD based on Terraform structure
true # Placeholder
}
# Run all tests
test_vpc_exists && \
test_single_public_subnet_exists && \
test_internet_gateway_only && \
test_route_tables_configured
Status: ❌ FAIL (VPC doesn’t exist yet)
Next Step: Generate Terraform in urmanac/aws-accounts to make this pass
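The route-table placeholder above can eventually be filled in along these lines; a minimal sketch assuming the single-public-subnet design from this test (reuses $vpc_id from test_vpc_exists; adjust once the Terraform structure is settled):
# Sketch only: the public route table should send 0.0.0.0/0 to an Internet gateway
test_route_tables_configured_sketch() {
  route_target=$(aws ec2 describe-route-tables \
    --filters "Name=vpc-id,Values=$vpc_id" \
    --query 'RouteTables[0].Routes[?DestinationCidrBlock==`0.0.0.0/0`].GatewayId' \
    --output text)
  echo "$route_target" | grep -q "^igw-"
}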
Test 2: Bastion with Static ENI
#!/bin/bash
# tests/02-bastion-static-eni.sh
# GIVEN: Network foundation from Test 1
# WHEN: Bastion ASG deploys with ENI attachment
# THEN: Bastion has static IP 10.10.0.100 via ENI
test_bastion_in_private_subnet() {
bastion_ip=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=tf-bastion" \
"Name=instance-state-name,Values=running" \
--query 'Reservations[0].Instances[0].PrivateIpAddress' \
--output text)
[ "$bastion_ip" = "10.20.13.140" ]
}
test_bastion_has_public_connectivity() {
# Bastion should be able to reach internet via NAT gateway
# Test by checking if it can resolve external DNS
instance_id=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=tf-bastion" \
--query 'Reservations[0].Instances[0].InstanceId' \
--output text)
# This would require SSM or actual SSH test
# Simplified: check security group allows egress
true # Placeholder
}
test_bastion_reachable_from_home() {
# SSH from operator's home IPv6 address works
# Requires actual connection test or security group validation
ssh -o ConnectTimeout=5 ubuntu@10.20.13.140 "echo 'Connected'" 2>/dev/null
}
test_bastion_scheduled_correctly() {
# ASG should have scheduled actions for 5hrs/day
asg_name="tf-asg"
scheduled_actions=$(aws autoscaling describe-scheduled-actions \
--auto-scaling-group-name "$asg_name" \
--query 'length(ScheduledUpdateGroupActions)')
[ "$scheduled_actions" -ge 2 ] # At least start and stop actions
}
# Run all tests
test_bastion_in_private_subnet && \
test_bastion_has_public_connectivity && \
test_bastion_reachable_from_home && \
test_bastion_scheduled_correctly
Status: ❌ FAIL (Bastion still in public subnet)
Next Step: Modify existing ASG/launch template in urmanac/aws-accounts
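The connectivity placeholder in Test 2 mentions SSM; a hedged sketch of that route, assuming the bastion's instance profile includes SSM permissions (which the current launch template may not grant) and reusing $instance_id from the test above:
# Sketch only: run a DNS lookup on the bastion through SSM and check it succeeded
check_bastion_egress_via_ssm() {
  command_id=$(aws ssm send-command \
    --instance-ids "$instance_id" \
    --document-name "AWS-RunShellScript" \
    --parameters 'commands=["getent hosts registry-1.docker.io"]' \
    --query 'Command.CommandId' --output text)
  sleep 5
  aws ssm get-command-invocation \
    --command-id "$command_id" \
    --instance-id "$instance_id" \
    --query 'Status' --output text | grep -q "Success"
}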
Test 3: Netboot Infrastructure Running
#!/bin/bash
# tests/03-netboot-infrastructure.sh
# GIVEN: Bastion running in private subnet
# WHEN: User data script completes
# THEN: All Docker containers are operational
test_docker_containers_running() {
containers=(
"dnsmasq"
"matchbox"
"registry-docker.io"
"registry-gcr.io"
"registry-ghcr.io"
"registry-quay.io"
"registry-registry.k8s.io"
"pihole"
)
for container in "${containers[@]}"; do
ssh ubuntu@10.20.13.140 "docker ps --filter name=$container --format '{{.Names}}'" | grep -q "$container"
if [ $? -ne 0 ]; then
echo "FAIL: Container $container not running"
return 1
fi
done
echo "PASS: All containers running"
return 0
}
test_dnsmasq_serving_dhcp() {
# Check dnsmasq config includes DHCP range for 10.20.13.0/24
ssh ubuntu@10.20.13.140 "docker exec dnsmasq cat /etc/dnsmasq.conf" | \
grep -q "dhcp-range=10.20.13"
}
test_matchbox_serving_talos() {
# Matchbox should respond on port 8080
# Check if it has Talos boot assets
curl -sf -o /dev/null http://10.20.13.140:8080/assets/talos/vmlinuz
}
test_registry_caches_operational() {
# All 5 registry pull-through caches should respond
for port in 5050 5051 5052 5053 5054; do
status=$(curl -s -o /dev/null -w '%{http_code}' "http://10.20.13.140:$port/v2/")
if [ "$status" != "200" ] && [ "$status" != "401" ]; then
echo "FAIL: Registry on port $port not responding"
return 1
fi
done
echo "PASS: All registry caches operational"
return 0
}
test_pihole_serving_dns() {
# Pi-hole should resolve DNS queries
dig @10.20.13.140 google.com +short | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}
# Run all tests
test_docker_containers_running && \
test_dnsmasq_serving_dhcp && \
test_matchbox_serving_talos && \
test_registry_caches_operational && \
test_pihole_serving_dns
Status: ❌ FAIL (Bastion user data doesn’t include container orchestration yet)
Next Step: Generate user data script with Docker compose or shell orchestration
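A minimal sketch of what that user data could look like with plain docker run (image tags, mount paths, and the port-to-registry assignment are assumptions to replace with the real configuration):
#!/bin/bash
# Sketch only: user-data fragment that starts the netboot and cache containers
set -euo pipefail
apt-get update && apt-get install -y docker.io

# Pull-through registry caches on ports 5050-5054 (assignment assumed; names match Test 3)
declare -A mirrors=(
  [docker.io]=5050 [gcr.io]=5051 [ghcr.io]=5052 [quay.io]=5053 [registry.k8s.io]=5054
)
for upstream in "${!mirrors[@]}"; do
  remote="https://$upstream"
  # Docker Hub's registry endpoint differs from its hostname
  [ "$upstream" = "docker.io" ] && remote="https://registry-1.docker.io"
  docker run -d --restart unless-stopped --name "registry-$upstream" \
    -p "${mirrors[$upstream]}:5000" \
    -e REGISTRY_PROXY_REMOTEURL="$remote" \
    registry:2
done

# Netboot services; the configs under /opt would be written earlier in the user data
docker run -d --restart unless-stopped --name matchbox -p 8080:8080 \
  -v /opt/matchbox:/var/lib/matchbox quay.io/poseidon/matchbox:latest
docker run -d --restart unless-stopped --name dnsmasq --net host --cap-add NET_ADMIN \
  -v /opt/dnsmasq.conf:/etc/dnsmasq.conf quay.io/poseidon/dnsmasq:latest
# Pi-hole owns DNS on 53; the dnsmasq config is expected to disable its own DNS (port=0)
docker run -d --restart unless-stopped --name pihole -p 53:53/udp -p 53:53/tcp \
  pihole/pihole:latest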
Test 4: Talos Node Netboots Successfully
#!/bin/bash
# tests/04-talos-netboot.sh
# GIVEN: Netboot infrastructure operational
# WHEN: Talos node instance launches
# THEN: Node boots Talos Linux from network
test_talos_node_gets_dhcp_lease() {
# Check dnsmasq logs for DHCP lease to new node
ssh ubuntu@10.20.13.140 "docker logs dnsmasq 2>&1 | tail -20" | \
grep -q "DHCPACK"
}
test_talos_node_pulls_from_matchbox() {
# Check matchbox logs for kernel/initrd requests
ssh ubuntu@10.20.13.140 "docker logs matchbox 2>&1 | tail -20" | \
grep -q "GET /assets/talos"
}
test_talos_node_reaches_ready_state() {
# Use talosctl to check node health
# Requires node IP from previous test
node_ip=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=talos-node-1" \
"Name=instance-state-name,Values=running" \
--query 'Reservations[0].Instances[0].PrivateIpAddress' \
--output text)
talosctl -n "$node_ip" health --wait-timeout 5m
}
test_talos_node_uses_registry_cache() {
# Check registry cache logs for image pulls from Talos node
# Loop over the registry cache containers by name (ports 5050-5054 map to these)
for registry in docker.io gcr.io ghcr.io quay.io registry.k8s.io; do
ssh ubuntu@10.20.13.140 "docker logs registry-$registry 2>&1 | tail -50" | \
grep -q "$node_ip" || return 1
done
}
# Run all tests
test_talos_node_gets_dhcp_lease && \
test_talos_node_pulls_from_matchbox && \
test_talos_node_reaches_ready_state && \
test_talos_node_uses_registry_cache
Status: ❌ FAIL (No Talos nodes launched yet)
Next Step: Create Talos node launch template, test manual launch
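Once a launch template exists, the manual launch for this test could be as simple as the following (the template name and tag are assumptions; $public_subnet is reused from Test 1):
# Sketch only: launch one Talos node from a hypothetical launch template
aws ec2 run-instances \
  --launch-template LaunchTemplateName=talos-node,Version='$Latest' \
  --subnet-id "$public_subnet" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=talos-node-1}]' \
  --count 1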
Test 5: CozyStack Cluster Operational
#!/bin/bash
# tests/05-cozystack-operational.sh
# GIVEN: 1-3 Talos nodes successfully netbooted
# WHEN: CozyStack bootstrap completes
# THEN: Kubernetes cluster is healthy with CozyStack installed
test_kubernetes_api_responding() {
# Assumes kubeconfig available from talosctl
talosctl -n 10.20.13.x kubeconfig
kubectl cluster-info | grep -q "Kubernetes control plane is running"
}
test_cozystack_installed() {
# Check for CozyStack CRDs and controllers
kubectl get crds | grep -q "cozystack.io"
! kubectl get pods -n cozy-system --no-headers | grep -E ' 0/[0-9]+ '
}
test_kubevirt_operational() {
# CozyStack uses KubeVirt for VMs
kubectl get pods -n kubevirt -o wide | grep -q "Running"
}
test_spinkube_extension_loaded() {
# Custom Talos image includes spin runtimeclass
kubectl get runtimeclass | grep -q "spin"
}
test_tailscale_extension_loaded() {
# Custom Talos image includes tailscale
# Check if tailscale daemon is running on nodes
talosctl -n 10.20.13.x get services | grep -q "tailscale"
}
# Run all tests
test_kubernetes_api_responding && \
test_cozystack_installed && \
test_kubevirt_operational && \
test_spinkube_extension_loaded && \
test_tailscale_extension_loaded
Status: ❌ FAIL (CozyStack not bootstrapped yet)
Next Step: Follow CozyStack installation guide, document bootstrap process
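As a rough sketch of the shape of that bootstrap (file names and the installer manifest are placeholders; the exact steps come from the CozyStack installation guide):
# Sketch only: rough shape of the Talos + CozyStack bootstrap
talosctl apply-config --insecure -n "$node_ip" --file controlplane.yaml   # patched machine config
talosctl bootstrap -n "$node_ip"                                          # first control-plane node only
talosctl kubeconfig -n "$node_ip"                                         # fetch kubeconfig for kubectl
# CozyStack itself is installed by applying the manifests the guide points at, e.g.:
# kubectl apply -f <cozystack installer manifest per the docs>
kubectl get pods -n cozy-system -w   # watch components come up (matches Test 5's namespace)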
Test 6: Demo Workload Runs on ARM64
#!/bin/bash
# tests/06-demo-workload.sh
# GIVEN: CozyStack cluster operational
# WHEN: SpinKube demo application deploys
# THEN: Application runs successfully on ARM64 nodes
test_spinkube_demo_deploys() {
# Deploy sample Spin application
kubectl apply -f demo/spinkube-hello-world.yaml
kubectl wait --for=condition=Ready pod -l app=spinkube-demo --timeout=2m
}
test_demo_responds_to_requests() {
# Port-forward and curl the demo app
kubectl port-forward svc/spinkube-demo 8080:80 &
PF_PID=$!
sleep 2
response=$(curl -s http://localhost:8080)
kill $PF_PID
echo "$response" | grep -q "Hello from Spin"
}
test_demo_runs_on_arm64() {
# Verify pod is scheduled on ARM64 node
node=$(kubectl get pod -l app=spinkube-demo \
-o jsonpath='{.items[0].spec.nodeName}')
arch=$(kubectl get node "$node" \
-o jsonpath='{.status.nodeInfo.architecture}')
[ "$arch" = "arm64" ]
}
test_demo_uses_cozystack_features() {
# Demonstrate CozyStack tenant isolation or other features
# TBD based on specific demo requirements
true # Placeholder
}
# Run all tests
test_spinkube_demo_deploys && \
test_demo_responds_to_requests && \
test_demo_runs_on_arm64 && \
test_demo_uses_cozystack_features
Status: ❌ FAIL (No demo workload created yet)
Next Step: Create SpinKube hello-world manifest, test deployment
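A first cut at the manifest could look like this, applied inline for a quick check; the apiVersion, executor name, and image reference are assumptions to verify against the SpinKube docs, and the app: label matches the selector Test 6 uses:
# Sketch only: demo/spinkube-hello-world.yaml applied inline
cat <<EOF | kubectl apply -f -
apiVersion: core.spinkube.dev/v1alpha1
kind: SpinApp
metadata:
  name: spinkube-demo
  labels:
    app: spinkube-demo
spec:
  image: ghcr.io/example/spin-hello-world:latest   # placeholder image; must answer "Hello from Spin"
  executor: containerd-shim-spin
  replicas: 1
EOF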
TDG Test Suite: Flux GitOps Layer
Test 7: Flux Bootstrap Successful
#!/bin/bash
# tests/07-flux-bootstrap.sh
# GIVEN: CozyStack cluster operational
# WHEN: Flux bootstrap completes from cozy-fleet repo
# THEN: Flux controllers are running and syncing
test_flux_namespace_exists() {
kubectl get namespace flux-system
}
test_flux_controllers_running() {
controllers=(
"source-controller"
"kustomize-controller"
"helm-controller"
"notification-controller"
)
for controller in "${controllers[@]}"; do
kubectl get deployment -n flux-system "$controller" \
-o jsonpath='{.status.availableReplicas}' | grep -q "^1$" || return 1
done
}
test_flux_syncing_from_cozy_fleet() {
# Check GitRepository points to correct repo
repo=$(kubectl get gitrepository -n flux-system flux-system \
-o jsonpath='{.spec.url}')
echo "$repo" | grep -q "kingdon-ci/cozy-fleet"
}
test_kustomizations_healthy() {
# All Kustomizations should be Ready
not_ready=$(kubectl get kustomizations -A -o json | \
jq -r '.items[] | select(.status.conditions[]? | select(.type=="Ready" and .status!="True")) | .metadata.name')
[ -z "$not_ready" ]
}
# Run all tests
test_flux_namespace_exists && \
test_flux_controllers_running && \
test_flux_syncing_from_cozy_fleet && \
test_kustomizations_healthy
Status: ❌ FAIL (Flux not bootstrapped yet)
Next Step: Determine canonical Flux repo, run bootstrap command
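Once the repo question is settled, bootstrap is a single CLI call plus a health check; a sketch assuming cozy-fleet and a per-cluster path (path and branch to confirm with the operator; GITHUB_TOKEN must hold a PAT):
# Sketch only: bootstrap Flux against kingdon-ci/cozy-fleet
flux bootstrap github \
  --owner=kingdon-ci \
  --repository=cozy-fleet \
  --branch=main \
  --path=clusters/cozy-aws
flux check   # confirms the controllers Test 7 expects are healthy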
TDG Test Suite: Cost & Compliance Layer
Test 8: Staying Within Free Tier
#!/bin/bash
# tests/08-cost-compliance.sh
# GIVEN: Infrastructure running for experiment duration
# WHEN: Checking AWS Cost Explorer
# THEN: Costs remain under target threshold
test_monthly_cost_under_target() {
# Target: < $0.10/month
current_month=$(date +%Y-%m-01)
next_month=$(date -d "$current_month + 1 month" +%Y-%m-01)
cost=$(aws ce get-cost-and-usage \
--time-period Start="$current_month",End="$next_month" \
--granularity MONTHLY \
--metrics BlendedCost \
--query 'ResultsByTime[0].Total.BlendedCost.Amount' \
--output text)
# Convert to cents for integer comparison
cost_cents=$(echo "$cost * 100" | bc | cut -d. -f1)
[ "$cost_cents" -lt 10 ]
}
test_t4g_free_tier_not_exceeded() {
# Check t4g instance hours don't exceed 750/month
# This requires custom metric or CloudWatch query
# Simplified: count running t4g instances
running_t4g=$(aws ec2 describe-instances \
--filters "Name=instance-type,Values=t4g.*" \
"Name=instance-state-name,Values=running" \
--query 'length(Reservations[].Instances[])')
# With 4 instances at 5hrs/day = 600hrs/month, under 750
[ "$running_t4g" -le 4 ]
}
test_no_unexpected_charges() {
# Check for charges from unexpected services
services=$(aws ce get-cost-and-usage \
--time-period Start="$current_month",End="$next_month" \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--query 'ResultsByTime[0].Groups[].Keys[0]' \
--output text)
# Should only see: EC2, EBS, (maybe S3 for Terraform state)
! echo "$services" | grep -qE "(RDS|Lambda|ECS|EKS|ElastiCache)"
}
# Run all tests
test_monthly_cost_under_target && \
test_t4g_free_tier_not_exceeded && \
test_no_unexpected_charges
Status: ⚠️ PARTIAL (Current costs ~$0.04/month, but no Talos nodes running yet)
Next Step: Monitor costs during experiments, implement auto-termination
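For the auto-termination piece, one hedged option is to reuse the bastion's scheduled-action pattern on the Talos node group (the ASG name and cron times below are assumptions; the window matches the 5hrs/day budget):
# Sketch only: open a 5-hour window, then scale the Talos node group to zero
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name talos-nodes-asg \
  --scheduled-action-name evening-scale-up \
  --recurrence "0 17 * * *" \
  --desired-capacity 3 --min-size 0 --max-size 3
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name talos-nodes-asg \
  --scheduled-action-name nightly-scale-down \
  --recurrence "0 22 * * *" \
  --desired-capacity 0 --min-size 0 --max-size 3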
Test 9: GDPR Compliance (Zero Risk Mode)
#!/bin/bash
# tests/09-gdpr-compliance.sh
# GIVEN: Infrastructure fully deployed
# WHEN: Auditing network configuration
# THEN: No public services accessible, zero GDPR risk
test_no_public_facing_services() {
# Check security groups - no ingress from 0.0.0.0/0 except SSH to bastion
public_ingress=$(aws ec2 describe-security-groups \
--filters "Name=ip-permission.cidr,Values=0.0.0.0/0" \
--query 'SecurityGroups[].GroupId' \
--output text)
# Should only find bastion security group (if any)
# Talos nodes should have no public ingress
for sg in $public_ingress; do
name=$(aws ec2 describe-security-groups \
--group-ids "$sg" \
--query 'SecurityGroups[0].GroupName' \
--output text)
# Only bastion-sg allowed to have public SSH (from specific IPv6)
if [ "$name" != "bastion-sg" ]; then
echo "FAIL: Unexpected public security group: $name"
return 1
fi
done
}
test_no_public_ip_addresses() {
# Talos nodes should have NO public IPs
public_ips=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=talos-node-*" \
"Name=instance-state-name,Values=running" \
--query 'Reservations[*].Instances[*].PublicIpAddress' \
--output text)
[ -z "$public_ips" ]
}
test_all_traffic_private() {
# VPC flow logs would show no traffic to/from internet
# Except through NAT gateway for egress
# Simplified: check route tables
# Talos nodes subnet should only route to NAT, not IGW
true # Placeholder - requires actual flow log analysis
}
# Run all tests
test_no_public_facing_services && \
test_no_public_ip_addresses && \
test_all_traffic_private
Status: ⚠️ PARTIAL (Need to verify after deployment)
Next Step: Audit security groups and routing tables
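A hedged starting point for the routing half of that audit, complementing the flow-log placeholder above (the subnet Name tag is an assumption):
# Sketch only: no subnet hosting Talos nodes should default-route through an Internet gateway
talos_subnets=$(aws ec2 describe-subnets \
  --filters "Name=tag:Name,Values=talos-*" \
  --query 'Subnets[].SubnetId' --output text)
for subnet in $talos_subnets; do
  igw_routes=$(aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$subnet" \
    --query 'RouteTables[].Routes[?DestinationCidrBlock==`0.0.0.0/0`].GatewayId' \
    --output text)
  echo "$igw_routes" | grep -q "^igw-" && { echo "FAIL: $subnet routes to IGW"; exit 1; }
done
echo "PASS: no Talos subnet has a default route to an Internet gateway"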
Repository Integration Strategy
Code Generation Targets
Primary: urmanac/cozystack-moon-and-back (presentation repo)
- /terraform/ - Infrastructure code (may reference aws-accounts modules)
- /tests/ - TDG test suite (these bash scripts)
- /demo/ - SpinKube demo manifests
- /slides/ - Talk materials (Markdown → reveal.js?)
- /docs/ - Setup guides, troubleshooting
Secondary: urmanac/aws-accounts (infrastructure repo)
- Modify existing Terraform for new VPC/subnets
- Add bastion user data for Docker containers
- Create Talos node launch template
Tertiary: kingdon-ci/cozy-fleet (Flux bootstrap)
- Determine if this is canonical or should migrate to presentation repo
- Add CozyStack-specific Flux resources
- Configure tenants, policies, etc.
Decision Tree for Code Placement
Is it infrastructure (VPC, EC2, IAM)?
├─ YES → urmanac/aws-accounts (Terraform)
└─ NO
Is it Kubernetes/Flux configuration?
├─ YES → kingdon-ci/cozy-fleet (GitOps)
└─ NO
Is it demo-specific or talk materials?
├─ YES → urmanac/cozystack-moon-and-back
└─ NO → Determine new home or extend existing repo
Flux Repository Consolidation Question
Need operator input:
- Keep a separate cozy-fleet repo for production GitOps?
- Create a new Flux bootstrap in cozystack-moon-and-back for the demo?
- Migrate everything to one canonical location?
Recommendation: Demo in cozystack-moon-and-back, production in cozy-fleet
Next Actions for Claude Agent (Priority Order)
Week 1: Foundation (Nov 17-23)
- Generate VPC Terraform → Make Test 1 pass
  - Target: urmanac/aws-accounts or cozystack-moon-and-back/terraform/
  - Deliverable: VPC, subnets, NAT gateway, route tables
- Modify Bastion for Private Subnet → Make Test 2 pass
  - Target: urmanac/aws-accounts (existing ASG/launch template)
  - Deliverable: Bastion at 10.20.13.140, SSH from home IPv6
- Generate Bastion User Data → Make Test 3 pass
  - Target: cozystack-moon-and-back/terraform/user-data.sh
  - Deliverable: Docker containers running (dnsmasq, matchbox, registries, pihole)
Week 2: Talos & CozyStack (Nov 24-30)
- Create Talos Launch Template → Make Test 4 pass
  - Target: urmanac/aws-accounts or cozystack-moon-and-back/terraform/
  - Deliverable: Manual launch works, netboot successful
- Bootstrap CozyStack → Make Test 5 pass
  - Target: Document in cozystack-moon-and-back/docs/bootstrap.md
  - Deliverable: Kubernetes cluster with CozyStack installed
- Setup Flux GitOps → Make Test 7 pass
  - Target: Determine canonical repo, bootstrap Flux
  - Deliverable: Flux syncing from Git, ready for app deployments
Week 3: Demo & Polish (Dec 1-4)
- Create SpinKube Demo → Make Test 6 pass
  - Target: cozystack-moon-and-back/demo/spinkube-hello.yaml
  - Deliverable: Working demo app on ARM64
- Build Talk Materials
  - Target: cozystack-moon-and-back/slides/
  - Deliverable: Slide deck with live demo script
- Practice & Contingency Plans
  - Fallback: Home lab demo if AWS has issues
  - Prepare backup slides with cost data and architecture diagrams
Success Criteria (TDG-Style)
Minimum Viable Demo (December 3):
- Test 1-3 passing (Network + Bastion)
- Test 4 passing (At least 1 Talos node netboots)
- Test 5 partial (CozyStack installed, even if not production-ready)
- Test 6 passing (SpinKube hello-world runs)
- Test 8 passing (Cost < $0.10/month proven)
- Slides + demo script ready
Stretch Goals:
- Test 7 passing (Flux GitOps working)
- Test 9 passing (GDPR compliance audit documented)
- 3-node cluster (vs. 1-node minimum)
- Custom Talos image with Tailscale + Spin extensions built
Ultimate Goal:
- Audience leaves thinking: “I could replicate this in my own environment”
- Community feedback: “This is a realistic approach to hybrid cloud”
- Operator satisfaction: “I learned something building this, and so did they”
TDG Success Story: Custom Talos Images (Nov 16-17, 2025)
The Problem: ARM64 Talos Images with Spin + Tailscale
Initial Requirement: Build custom ARM64 Talos images with Spin runtime and Tailscale extensions for CozyStack deployment on AWS t4g instances.
Classic Anti-Pattern (What We Almost Did):
- Start writing GitHub Actions workflow from scratch
- Guess at patch format by looking at examples
- Trial-and-error approach with commit-push-check cycles
- Debug failures by reading CI logs and making assumptions
- Accumulate “almost working” patches and debugging artifacts
- End up with 20+ commits of incremental fixes and confusion
The TDG Approach (What Actually Worked)
Red Phase: Write Tests First
Before writing any GitHub Actions or patch files, we defined exactly what success looks like:
# Test: Patch should apply cleanly to upstream
cd /tmp && git clone https://github.com/cozystack/cozystack.git
cd cozystack && git apply --check /path/to/our.patch
# Test: Expected changes should be present
grep "EXTENSIONS.*spin tailscale" packages/core/installer/hack/gen-profiles.sh
grep "arch: arm64" packages/core/installer/hack/gen-profiles.sh
grep "SPIN_IMAGE\|TAILSCALE_IMAGE" packages/core/installer/hack/gen-profiles.sh
Key Insight: Tests defined the exact file changes needed BEFORE we tried to create patches.
Green Phase: Make Tests Pass (The Hard Part)
First Attempt: Manual patch construction → Failed spectacularly
- Hand-crafted unified diff format
- Wrong line numbers (humans are bad at counting)
- Malformed patch structure (“fragment without header”)
- Multiple debugging cycles with broken patches
Second Attempt: Git-generated patches → Succeeded immediately
# Make actual changes to files
cd /tmp/cozystack
sed -i 's/EXTENSIONS="drbd zfs"/EXTENSIONS="drbd zfs spin tailscale"/' hack/gen-profiles.sh
sed -i 's/arch: amd64/arch: arm64/' hack/gen-profiles.sh
# ... other changes
# Let Git create proper patch
git diff > working.patch
git apply --check working.patch # ✓ PASSES
Critical Lesson: Don’t outsmart the tools. Use git diff to create patches, not string manipulation.
Refactor Phase: Clean and Validate
Problem Discovered: Multiple patch files in directory caused sequential application failures
- 01-arm64-spin-tailscale.patch (working)
- 01-gen-profiles-only.patch (leftover debugging, broken)
- test-*.patch (various debugging artifacts)
Solution: Cleanup + Comprehensive validation
# Remove all debugging artifacts
rm patches/test-*.patch patches/*-only.patch
# Create validation suite to prevent future regressions
./validate-complete.sh
# ✓ Patch applies cleanly to upstream
# ✓ All expected changes present
# ✓ Workflow syntax valid
# ✓ Dependencies configured
# ✓ Clean patch directory
Results: From Chaos to Confidence
Before TDG (Typical Approach):
- 15+ commits over multiple hours
- “patch fragment without header” errors
- “corrupt patch at line X” failures
- Manual debugging of GitHub Actions output
- Guessing what might be wrong
- Stream of half-working incremental fixes
After TDG (Test-First Approach):
- 3 clean commits: working patch + validation suite + docs
- Immediate success on each GitHub Actions run
- Local validation prevents CI failures
- Clear understanding of what each component does
- Reusable patterns for future patch generation
Key TDG Principles Validated
- Tests First: Writing validation scripts forced us to understand what “success” actually meant
- Red-Green-Refactor: Each cycle improved both the solution and our understanding
- Local Feedback: Running tests locally is infinitely faster than CI debugging
- Documentation: Writing ADR-003 prevented future developers (including ourselves) from repeating mistakes
Broader Applicability
This same TDG approach applies to:
- Terraform: Write terraform plan assertions before writing resources
- Kubernetes: Write kubectl wait tests before creating manifests
- Docker: Write container health checks before Dockerfile optimization
- Any Infrastructure Code: Define observable success criteria first
The Validation Suite Legacy
The validate-complete.sh script now ensures:
- No future patch generation mistakes
- Workflow changes are validated locally
- Repository cleanliness is maintained
- Documentation stays in sync
Future developers can run one command and know their changes will work.
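For reference, a minimal sketch of the kinds of checks validate-complete.sh bundles (the real script lives in the repo; the paths and patch name here mirror the ones mentioned above):
#!/bin/bash
# Sketch only: the shape of validate-complete.sh, not the script itself
set -euo pipefail

# 1. The patch applies cleanly to a fresh upstream checkout
rm -rf /tmp/cozystack && git clone --depth 1 https://github.com/cozystack/cozystack.git /tmp/cozystack
git -C /tmp/cozystack apply --check "$PWD/patches/01-arm64-spin-tailscale.patch"

# 2. The expected changes are present after applying
git -C /tmp/cozystack apply "$PWD/patches/01-arm64-spin-tailscale.patch"
grep -q "spin tailscale" /tmp/cozystack/packages/core/installer/hack/gen-profiles.sh
grep -q "arch: arm64" /tmp/cozystack/packages/core/installer/hack/gen-profiles.sh

# 3. Only the intended patch file remains (no debugging artifacts)
[ "$(ls patches/*.patch | wc -l)" -eq 1 ]

echo "All validation checks passed"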
Quote from the Trenches
“When you force yourself to write a test, you can run the test, and you don’t get a stream of commits of half-garbage because nobody knows how to write this stuff from scratch!”
The TDG methodology transformed debugging chaos into engineering confidence.
Handoff Notes for Next Claude Agent
Operator context:
- Works at NASA (via Navteca, LLC) but presenting personal work
- Home lab generates significant heat and consumes significant power
- Already has working home lab with Talos + CozyStack
- Needs cloud replica for talk demo + to prove economics
- Conference: CozySummit Virtual 2025, December 3 (~12 days)
- Budget: Stay within AWS free tier (<$0.10/month)
Technical state:
- AWS account: Sandbox (181107798310)
- Region: eu-west-1
- Existing: Bastion in public subnet, scheduled 5hrs/day
- MFA’d AWS credentials working via profile sb-terraform-mfa-session
- Terraform: Split between urmanac/aws-accounts and the new presentation repo
- Flux: Unclear which repo is canonical (fleet-infra vs cozy-fleet)
Immediate priorities:
- Generate Terraform for VPC/subnets (Test 1)
- Move bastion to private subnet (Test 2)
- Add Docker containers to bastion user data (Test 3)
TDG Test Suite: Integration Layer
Test 10: SpinApp GitOps Deployment
#!/bin/bash
# tests/10-spinapp-gitops.sh
# GIVEN: CozyStack cluster operational from Test 5
# WHEN: GitOps repository contains SpinApp manifest
# THEN: Application serves externally via MetalLB
test_spinapp_deployed() {
# Check SpinApp CRD exists and application is ready
kubectl get spinapp demo-spin-app -n demo \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' | grep -q "True"
}
test_metallb_service_allocated() {
# Verify MetalLB allocated external IP from ARP pool
external_ip=$(kubectl get svc demo-spin-app -n demo \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}')
[[ "$external_ip" =~ ^10\.20\.1\.[0-9]+$ ]] # VPC subnet range
}
test_external_access_works() {
# Test HTTP access from within VPC (bastion perspective)
ssh bastion "curl -f http://$external_ip:8080/health" | grep -q "OK"
}
test_gitops_sync_working() {
# Verify Flux/ArgoCD shows application in sync
# Implementation depends on GitOps tool choice
kubectl get gitrepository cozy-apps -n flux-system \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' | grep -q "True"
}
# Run all tests
test_spinapp_deployed && \
test_metallb_service_allocated && \
test_external_access_works && \
test_gitops_sync_working
Status: ❌ FAIL (No cluster yet)
Dependencies: Tests 1-5 (infrastructure), GitOps repository
Demo Value: ⭐⭐⭐⭐⭐ (Shows WebAssembly + GitOps + LoadBalancer)
Test 11: KubeVirt Cluster-API Integration
#!/bin/bash
# tests/11-kubevirt-cluster-api.sh
# GIVEN: CozyStack with KubeVirt provider from Test 5
# WHEN: Cluster-API creates guest Kubernetes cluster
# THEN: Nested cluster runs workloads successfully
test_cluster_api_ready() {
# Verify Cluster-API controllers operational
kubectl get clusters -A | grep -q "Provisioned"
}
test_guest_cluster_accessible() {
# Extract guest cluster kubeconfig and test access
kubectl get secret guest-cluster-kubeconfig -o jsonpath='{.data.value}' \
| base64 -d > /tmp/guest-kubeconfig
KUBECONFIG=/tmp/guest-kubeconfig kubectl get nodes | grep -q "Ready"
}
test_nested_workload_scheduling() {
# Deploy simple workload to guest cluster
KUBECONFIG=/tmp/guest-kubeconfig kubectl run test-pod \
--image=nginx:alpine --restart=Never
KUBECONFIG=/tmp/guest-kubeconfig kubectl wait pod test-pod \
--for=condition=Ready --timeout=300s
}
test_vm_resource_isolation() {
# Verify VMs have proper resource limits
kubectl get virtualmachine -A -o jsonpath='{.items[*].spec.template.spec.domain.resources}'
}
# Run all tests
test_cluster_api_ready && \
test_guest_cluster_accessible && \
test_nested_workload_scheduling && \
test_vm_resource_isolation
Status: ❌ FAIL (No KubeVirt yet)
Dependencies: Test 5 (CozyStack), KubeVirt + Cluster-API setup
Demo Value: ⭐⭐⭐⭐ (Shows infrastructure-as-code for Kubernetes)
Test 12: Moonlander + Harvey Cross-Cluster Management
#!/bin/bash
# tests/12-moonlander-harvey-integration.sh
# GIVEN: Multiple clusters from Tests 5 + 11
# WHEN: Moonlander copies kubeconfigs for Harvey
# THEN: Harvey (Crossplane) manages all clusters uniformly
test_moonlander_secret_propagation() {
# Verify Moonlander copied guest cluster kubeconfig to Harvey namespace
kubectl get secret guest-cluster-kubeconfig -n harvey-system \
-o jsonpath='{.data.kubeconfig}' | base64 -d | grep -q "clusters:"
}
test_harvey_crossplane_connectivity() {
# Check Harvey can list resources across all clusters
kubectl get providerconfigs -n harvey-system | grep -q "guest-cluster"
# Verify Crossplane can reach guest cluster
kubectl logs -n harvey-system deployment/harvey-controller | grep -q "successfully connected to guest-cluster"
}
test_cross_cluster_workload_deployment() {
# Harvey deploys workload to guest cluster via Crossplane
cat <<EOF | kubectl apply -f -
apiVersion: harvey.io/v1alpha1
kind: CrossClusterWorkload
metadata:
  name: test-cross-deployment
  namespace: harvey-system
spec:
  targetCluster: guest-cluster
  template:
    apiVersion: v1
    kind: Pod
    metadata:
      name: harvey-managed-pod
    spec:
      containers:
      - name: test
        image: alpine:latest
        command: [sleep, "3600"]
EOF
# Wait for Harvey to propagate workload
sleep 30
KUBECONFIG=/tmp/guest-kubeconfig kubectl get pod harvey-managed-pod | grep -q "Running"
}
test_unified_cluster_visibility() {
# Verify Harvey dashboard shows both host and guest clusters
kubectl port-forward -n harvey-system svc/harvey-dashboard 8080:80 &
sleep 5
curl -f http://localhost:8080/api/clusters | jq '.clusters | length' | grep -q "2"
pkill -f "kubectl port-forward"
}
# Run all tests
test_moonlander_secret_propagation && \
test_harvey_crossplane_connectivity && \
test_cross_cluster_workload_deployment && \
test_unified_cluster_visibility
Status: ❌ FAIL (No Moonlander/Harvey yet)
Dependencies: Tests 5 + 11, Moonlander secret propagation, Harvey/Crossplane
Demo Value: ⭐⭐⭐⭐⭐ (Shows advanced multi-cluster orchestration)
When operator returns, start with: “Welcome back! I’ve expanded the TDG test suite to include integration tests 10-12. These cover SpinApp GitOps deployment, KubeVirt nested clusters, and Moonlander+Harvey cross-cluster management. We now have 12 tests defined total. Want me to generate the VPC Terraform to make Test 1 pass first?”
Document created: 2025-11-16
TDG methodology: Write tests first, generate code to make them pass
Target: CozySummit Virtual 2025, December 3, 2025
For talk: “Home Lab to the Moon and Back” by Kingdon Barrett
Related Documentation
- 📝 Session Learnings - Deep architectural discoveries and TDG methodology application from November 16, 2025