Post-Tyranny-Tech-Infrastru.../docs/maintenance-tracking.md
Pieter 0c4d536246 feat: Add version tracking and maintenance monitoring (issue #15)
Complete implementation of automatic version tracking and drift detection:

New Scripts:
- scripts/collect-client-versions.sh: Query deployed versions from Docker
  - Connects via Ansible to running servers
  - Extracts versions from container images
  - Updates registry automatically

- scripts/check-client-versions.sh: Compare versions across clients
  - Multiple formats: table (colorized), CSV, JSON
  - Filter by outdated versions
  - Highlights drift with color coding

- scripts/detect-version-drift.sh: Identify version differences
  - Detects clients with outdated versions
  - Threshold-based staleness detection (default 30 days)
  - Actionable recommendations
  - Exit code 1 if drift detected (CI/monitoring friendly)

Updated Scripts:
- scripts/deploy-client.sh: Auto-collect versions after deployment
- scripts/rebuild-client.sh: Auto-collect versions after rebuild

Documentation:
- docs/maintenance-tracking.md: Complete maintenance guide
  - Version management workflows
  - Security update procedures
  - Monitoring integration examples
  - Troubleshooting guide

Features:
 Automatic version collection from deployed servers
 Multi-client version comparison reports
 Version drift detection with recommendations
 Integration with deployment workflows
 Export to CSV/JSON for external tools
 Canary-first update workflow support

Usage Examples:
```bash
# Collect versions
./scripts/collect-client-versions.sh dev

# Compare all clients
./scripts/check-client-versions.sh

# Detect drift
./scripts/detect-version-drift.sh

# Export for monitoring
./scripts/check-client-versions.sh --format=json
```

Closes #15

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-17 20:53:15 +01:00

477 lines
12 KiB
Markdown

# Maintenance and Version Tracking
Comprehensive guide to tracking software versions, maintenance history, and detecting version drift across all deployed clients.
## Overview
The infrastructure tracks:
- **Software versions** - Authentik, Nextcloud, Traefik, Ubuntu
- **Maintenance dates** - Last update, security patches, OS updates
- **Version drift** - Clients running different versions
- **Update history** - Audit trail of changes
All version and maintenance data is stored in [`clients/registry.yml`](../clients/registry.yml).
## Registry Structure
Each client tracks versions and maintenance:
```yaml
clients:
myclient:
versions:
authentik: "2025.10.3"
nextcloud: "30.0.17"
traefik: "v3.0"
ubuntu: "24.04"
maintenance:
last_full_update: 2026-01-17
last_security_patch: 2026-01-17
last_os_update: 2026-01-17
last_backup_verified: null
```
## Version Management Scripts
### Collect Client Versions
Query actual deployed versions from a running server:
```bash
# Collect versions from dev client
./scripts/collect-client-versions.sh dev
```
This script:
- Connects to the server via Ansible
- Queries Docker container image tags
- Queries Ubuntu OS version
- Updates the registry automatically
**Output:**
```
Collecting versions for client: dev
Querying deployed versions...
Collecting Docker container versions...
✓ Versions collected
Collected versions:
Authentik: 2025.10.3
Nextcloud: 30.0.17
Traefik: v3.0
Ubuntu: 24.04
✓ Registry updated
```
**Requirements:**
- Server must be deployed and reachable
- `HCLOUD_TOKEN` environment variable set
- Ansible configured with dynamic inventory
### Check All Client Versions
Compare versions across all clients:
```bash
# Default: Table format with color coding
./scripts/check-client-versions.sh
# Export as CSV
./scripts/check-client-versions.sh --format=csv
# Export as JSON
./scripts/check-client-versions.sh --format=json
# Show only clients with outdated versions
./scripts/check-client-versions.sh --outdated
```
**Table output:**
```
═══════════════════════════════════════════════════════════════════════════════
CLIENT VERSION REPORT
═══════════════════════════════════════════════════════════════════════════════
CLIENT STATUS AUTHENTIK NEXTCLOUD TRAEFIK UBUNTU
──────────────────────────────────────────────────────────────────────────────────────────────
dev deployed 2025.10.3 30.0.17 v3.0 24.04
client1 deployed 2025.10.2 30.0.16 v3.0 24.04
Latest versions:
Authentik: 2025.10.3
Nextcloud: 30.0.17
Traefik: v3.0
Ubuntu: 24.04
Note: Red indicates outdated version
```
**CSV output:**
```csv
client,status,authentik,nextcloud,traefik,ubuntu,last_update,outdated
dev,deployed,2025.10.3,30.0.17,v3.0,24.04,2026-01-17,no
client1,deployed,2025.10.2,30.0.16,v3.0,24.04,2026-01-10,yes
```
**JSON output:**
```json
{
"latest_versions": {
"authentik": "2025.10.3",
"nextcloud": "30.0.17",
"traefik": "v3.0",
"ubuntu": "24.04"
},
"clients": [
{
"name": "dev",
"status": "deployed",
"versions": {
"authentik": "2025.10.3",
"nextcloud": "30.0.17",
"traefik": "v3.0",
"ubuntu": "24.04"
},
"last_update": "2026-01-17",
"outdated": false
}
]
}
```
### Detect Version Drift
Identify clients with outdated versions:
```bash
# Default: Check all deployed clients
./scripts/detect-version-drift.sh
# Check clients not updated in 30+ days
./scripts/detect-version-drift.sh --threshold=30
# Check specific application only
./scripts/detect-version-drift.sh --app=authentik
# Summary output for monitoring
./scripts/detect-version-drift.sh --format=summary
```
**Output when drift detected:**
```
⚠ VERSION DRIFT DETECTED
Clients with outdated versions:
• client1
Authentik: 2025.10.2 → 2025.10.3
Nextcloud: 30.0.16 → 30.0.17
• client2
Last update: 2025-12-15 (>30 days ago)
Recommended actions:
1. Test updates on canary server first:
./scripts/rebuild-client.sh dev
2. Verify canary health:
./scripts/client-status.sh dev
3. Update outdated clients:
./scripts/rebuild-client.sh client1
./scripts/rebuild-client.sh client2
```
**Exit codes:**
- `0` - No drift detected (all clients up to date)
- `1` - Drift detected (action needed)
- `2` - Error (script failure)
**Summary format** (useful for monitoring):
```
Status: DRIFT DETECTED
Drift: Yes
Clients checked: 5
Clients with outdated versions: 2
Clients not updated in 30 days: 1
Affected clients: client1 client2
```
## Automatic Version Collection
Version collection is **automatically performed** after deployments:
### On New Deployment
`./scripts/deploy-client.sh myclient`:
1. Provisions infrastructure
2. Deploys applications
3. Updates registry with server info
4. **Collects and records versions** ← Automatic
### On Rebuild
`./scripts/rebuild-client.sh myclient`:
1. Destroys old infrastructure
2. Provisions new infrastructure
3. Deploys applications
4. Updates registry
5. **Collects and records versions** ← Automatic
If automatic collection fails (server not ready, network issue):
```
⚠ Could not collect versions automatically
Run manually later: ./scripts/collect-client-versions.sh myclient
```
## Maintenance Workflows
### Security Update Workflow
1. **Check current state**
```bash
./scripts/check-client-versions.sh
```
2. **Update canary first** (dev server)
```bash
./scripts/rebuild-client.sh dev
```
3. **Verify canary**
```bash
# Check health
./scripts/client-status.sh dev
# Verify versions updated
./scripts/collect-client-versions.sh dev
```
4. **Detect drift** (identify outdated clients)
```bash
./scripts/detect-version-drift.sh
```
5. **Roll out to production**
```bash
# Update each client
./scripts/rebuild-client.sh client1
./scripts/rebuild-client.sh client2
# Or batch update (be careful!)
for client in $(./scripts/list-clients.sh --role=production --format=csv | tail -n +2 | cut -d, -f1); do
./scripts/rebuild-client.sh "$client"
sleep 300 # Wait 5 minutes between updates
done
```
6. **Verify all updated**
```bash
./scripts/detect-version-drift.sh
```
### Monthly Maintenance Check
Run these checks monthly:
```bash
# 1. Version report
./scripts/check-client-versions.sh > reports/versions-$(date +%Y-%m).txt
# 2. Drift detection
./scripts/detect-version-drift.sh --threshold=30
# 3. Client health
for client in $(./scripts/list-clients.sh --status=deployed --format=csv | tail -n +2 | cut -d, -f1); do
./scripts/client-status.sh "$client"
done
```
### Update Maintenance Dates
Deployment scripts automatically update `last_full_update`. For other maintenance:
```bash
# After security patches (OS level)
yq eval -i ".clients.myclient.maintenance.last_security_patch = \"$(date +%Y-%m-%d)\"" clients/registry.yml
# After OS updates
yq eval -i ".clients.myclient.maintenance.last_os_update = \"$(date +%Y-%m-%d)\"" clients/registry.yml
# After backup verification
yq eval -i ".clients.myclient.maintenance.last_backup_verified = \"$(date +%Y-%m-%d)\"" clients/registry.yml
# Commit changes
git add clients/registry.yml
git commit -m "chore: Update maintenance dates"
git push
```
## Integration with Monitoring
### Continuous Drift Detection
Set up a cron job or CI pipeline:
```bash
#!/bin/bash
# check-drift.sh - Run daily
cd /path/to/infrastructure
# Check for drift
if ! ./scripts/detect-version-drift.sh --format=summary; then
# Send alert (Slack, email, etc.)
./scripts/detect-version-drift.sh | mail -s "Version Drift Detected" ops@example.com
fi
```
### Export for External Tools
```bash
# Export version data as JSON for monitoring tools
./scripts/check-client-versions.sh --format=json > /var/monitoring/client-versions.json
# Export drift status
./scripts/detect-version-drift.sh --format=summary > /var/monitoring/drift-status.txt
```
### Prometheus Metrics
Convert to Prometheus format:
```bash
#!/bin/bash
# export-metrics.sh
# Count clients by drift status
total=$(./scripts/list-clients.sh --status=deployed --format=csv | tail -n +2 | wc -l)
outdated=$(./scripts/check-client-versions.sh --format=csv --outdated | tail -n +2 | wc -l)
uptodate=$((total - outdated))
echo "# HELP clients_total Total number of deployed clients"
echo "# TYPE clients_total gauge"
echo "clients_total $total"
echo "# HELP clients_outdated Number of clients with outdated versions"
echo "# TYPE clients_outdated gauge"
echo "clients_outdated $outdated"
echo "# HELP clients_uptodate Number of clients with latest versions"
echo "# TYPE clients_uptodate gauge"
echo "clients_uptodate $uptodate"
```
## Version Pinning
To prevent automatic updates, pin versions in Ansible roles:
```yaml
# roles/authentik/defaults/main.yml
authentik_version: "2025.10.3" # Pinned version
# To update:
# 1. Change pinned version
# 2. Update canary: ./scripts/rebuild-client.sh dev
# 3. Verify and roll out
```
## Troubleshooting
### Version Collection Fails
**Problem:** `collect-client-versions.sh` cannot reach server
**Solutions:**
1. Check server is deployed and running:
```bash
./scripts/client-status.sh myclient
```
2. Verify HCLOUD_TOKEN is set:
```bash
echo $HCLOUD_TOKEN
```
3. Test Ansible connectivity:
```bash
cd ansible
ansible -i hcloud.yml myclient -m ping
```
4. Check Docker containers are running:
```bash
ansible -i hcloud.yml myclient -m shell -a "docker ps"
```
### Incorrect Version Reported
**Problem:** Registry shows wrong version
**Solutions:**
1. Re-collect versions manually:
```bash
./scripts/collect-client-versions.sh myclient
```
2. Verify Docker images:
```bash
ansible -i hcloud.yml myclient -m shell -a "docker images"
```
3. Check container inspect:
```bash
ansible -i hcloud.yml myclient -m shell -a "docker inspect authentik-server | jq '.[0].Config.Image'"
```
### Version Drift False Positives
**Problem:** Drift detected for canary with intentionally different version
**Solution:** Use `--app` filter to check specific applications:
```bash
# Check only production-critical apps
./scripts/detect-version-drift.sh --app=authentik
./scripts/detect-version-drift.sh --app=nextcloud
```
## Best Practices
1. **Always test on canary first**
- Update `dev` client before production
- Verify health before wider rollout
2. **Stagger production updates**
- Don't update all clients simultaneously
- Wait 5-10 minutes between updates
- Monitor each update for issues
3. **Track maintenance in registry**
- Keep `last_full_update` current
- Record `last_security_patch` dates
- Document backup verification
4. **Regular drift checks**
- Run weekly: `detect-version-drift.sh`
- Address drift within 7 days
- Maintain version consistency
5. **Document version changes**
- Add notes to registry when pinning versions
- Commit registry changes with descriptive messages
- Track major version upgrades separately
6. **Automate reporting**
- Export weekly version reports
- Alert on drift detection
- Dashboard for version overview
## Related Documentation
- [Client Registry](client-registry.md) - Registry system overview
- [Deployment Guide](deployment.md) - Deployment procedures
- [SSH Key Management](ssh-key-management.md) - Security and access