Infrastructure Architecture Decision Record
Post-X Society Multi-Tenant VPS Platform
Document Status: Living document
Created: December 2024
Last Updated: December 2024
Executive Summary
This document captures architectural decisions for a scalable, multi-tenant infrastructure platform starting with 10 identical VPS instances running Zitadel and Nextcloud, with plans to expand both server count and application offerings.
Key Technology Choices:
- OpenTofu over Terraform (truly open source, MPL 2.0)
- SOPS + Age over HashiCorp Vault (simple, no server, European-friendly)
- Zitadel over Keycloak (Swiss company, native multi-tenancy)
- Hetzner for all infrastructure (GDPR-compliant, EU-based)
1. Infrastructure Provisioning
Decision: OpenTofu + Ansible with Dynamic Inventory
Choice: Infrastructure as Code using OpenTofu for resource provisioning and Ansible for configuration management.
Why OpenTofu over Terraform:
- Truly open source (MPL 2.0) vs HashiCorp's BSL 1.1
- Drop-in replacement - same syntax, same providers
- Linux Foundation governance - no single company can close the license
- Active community after HashiCorp's 2023 license change
- No risk of future license restrictions
Approach:
- OpenTofu manages Hetzner resources (VPS instances, networks, firewalls, DNS)
- Ansible configures servers using the `hcloud` dynamic inventory plugin
- No static inventory files - Ansible queries the Hetzner API at runtime
Rationale:
- 10+ identical servers makes manual management unsustainable
- Version-controlled infrastructure in Git
- Dynamic inventory eliminates sync issues between OpenTofu and Ansible
- Skills transfer to other providers if needed
Implementation:
# ansible.cfg
[inventory]
enable_plugins = hetzner.hcloud.hcloud
# hcloud.yml (inventory config)
plugin: hetzner.hcloud.hcloud
locations:
  - fsn1
keyed_groups:
  - key: labels.role
    prefix: role
  - key: labels.client
    prefix: client
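With labels applied in OpenTofu, playbooks can target the generated groups directly. A minimal sketch, assuming servers carry a `role=client` label (the group name `role_client` follows from the `keyed_groups` prefix above):
# playbooks/ping.yml (hypothetical) -- smoke-test the dynamic inventory
- name: Verify connectivity to all client servers
  hosts: role_client        # group generated by keyed_groups (prefix "role")
  gather_facts: false
  tasks:
    - name: Ping each host discovered via the Hetzner API
      ansible.builtin.ping: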
2. Application Deployment
Decision: Modular Ansible Roles with Feature Flags
Choice: Each application is a separate Ansible role, enabled per-server via inventory variables.
Rationale:
- Allows heterogeneous deployments (client A wants Pretix, client B doesn't)
- Test new applications on single server before fleet rollout
- Clear separation of concerns
- Minimal refactoring when adding new applications
Structure:
ansible/
├── roles/
│ ├── common/ # Base setup, hardening, Docker
│ ├── traefik/ # Reverse proxy, SSL
│ ├── zitadel/ # Identity provider (Swiss, AGPL 3.0)
│ ├── nextcloud/
│ ├── pretix/ # Future
│ ├── listmonk/ # Future
│ ├── backup/ # Restic configuration
│ └── monitoring/ # Node exporter, promtail
Inventory Example:
all:
  children:
    clients:
      hosts:
        client-alpha:
          client_name: alpha
          domain: alpha.platform.nl
          apps:
            - zitadel
            - nextcloud
        client-beta:
          client_name: beta
          domain: beta.platform.nl
          apps:
            - zitadel
            - nextcloud
            - pretix
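One way to turn the per-host `apps` list into enabled roles is a small loop in the deploy playbook. A sketch, assuming each list entry matches a role name:
# playbooks/deploy.yml (sketch) -- apply only the roles listed in each host's `apps`
- name: Deploy selected applications per client
  hosts: clients
  become: true
  tasks:
    - name: Include one role per enabled application
      ansible.builtin.include_role:
        name: "{{ item }}"
      loop: "{{ apps | default([]) }}"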
3. DNS Management
Decision: Hetzner DNS via OpenTofu
Choice: Manage all DNS records through Hetzner DNS using OpenTofu.
Rationale:
- Single provider for infrastructure and DNS simplifies management
- OpenTofu provider available and well-maintained (same as Terraform provider)
- Cost-effective (included with Hetzner)
- GDPR-compliant (EU-based)
Domain Strategy:
- Start with subdomains: `{client}.platform.nl`
- Support custom domains later via variable override
- Wildcard approach not used - explicit records per service
Implementation:
resource "hcloud_server" "client" {
for_each = var.clients
name = each.key
server_type = each.value.server_type
# ...
}
resource "hetznerdns_record" "client_a" {
for_each = var.clients
zone_id = data.hetznerdns_zone.main.id
name = each.value.subdomain
type = "A"
value = hcloud_server.client[each.key].ipv4_address
}
SSL Certificates: Handled by Traefik with Let's Encrypt, automatic per-domain.
4. Identity Provider
Decision: Zitadel (replacing Keycloak)
Choice: Zitadel as the identity provider for all client installations.
Why Zitadel over Keycloak:
| Factor | Zitadel | Keycloak |
|---|---|---|
| Company HQ | 🇨🇭 Switzerland | 🇺🇸 USA (IBM/Red Hat) |
| GDPR Jurisdiction | EU-adequate | US jurisdiction |
| License | AGPL 3.0 | Apache 2.0 |
| Multi-tenancy | Native design | Added later (2024) |
| Language | Go (lightweight) | Java (resource-heavy) |
| Architecture | Event-sourced, API-first | Traditional |
Licensing Notes:
- Zitadel v3 (March 2025) changed from Apache 2.0 to AGPL 3.0
- For our use case (running Zitadel as IdP), this has zero impact
- AGPL only requires source disclosure if you modify Zitadel AND provide it as a service
- SDKs and APIs remain Apache 2.0
Company Background:
- CAOS Ltd., headquartered in St. Gallen, Switzerland
- Founded 2019, $15.5M funding (Series A)
- Switzerland has EU data protection adequacy status
- Public product roadmap, transparent development
Deployment:
# docker-compose.yml snippet
services:
  zitadel:
    image: ghcr.io/zitadel/zitadel:v3.x.x   # Pin version
    command: start-from-init
    environment:
      ZITADEL_DATABASE_POSTGRES_HOST: postgres
      ZITADEL_EXTERNALDOMAIN: ${CLIENT_DOMAIN}
    depends_on:
      - postgres
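The snippet above assumes a `postgres` service alongside Zitadel. A hedged sketch of that companion service (user, database name, volume path, and the `ZITADEL_DB_PASSWORD` variable are illustrative):
# docker-compose.yml snippet (companion database; settings illustrative)
services:
  postgres:
    image: postgres:16.1                        # Pin version
    restart: unless-stopped
    environment:
      POSTGRES_USER: zitadel
      POSTGRES_DB: zitadel
      POSTGRES_PASSWORD: ${ZITADEL_DB_PASSWORD}  # injected from SOPS-decrypted vars
    volumes:
      - ./postgres-data:/var/lib/postgresql/data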
Multi-tenancy Approach:
- Each client gets isolated Zitadel organization
- Single Zitadel instance can manage multiple organizations
- Or: fully isolated Zitadel per client (current choice for maximum isolation)
5. Backup Strategy
Decision: Dual Backup Approach
Choice: Hetzner automated snapshots + Restic application-level backups to Hetzner Storage Box.
Layer 1: Hetzner Snapshots
Purpose: Disaster recovery (complete server loss)
| Aspect | Configuration |
|---|---|
| Frequency | Daily (Hetzner automated) |
| Retention | 7 snapshots |
| Cost | 20% of VPS price |
| Restoration | Full server restore via Hetzner console/API |
Limitations:
- Crash-consistent only (may catch database mid-write)
- Same datacenter (not true off-site)
- Coarse granularity (all or nothing)
Layer 2: Restic to Hetzner Storage Box
Purpose: Granular application recovery, off-server storage
Backend Choice: Hetzner Storage Box
Rationale:
- GDPR-compliant (German/EU data residency)
- Same Hetzner network = fast transfers, no egress costs
- Cost-effective (~€3.81/month for BX10 with 1TB)
- Supports SFTP, CIFS/Samba, rsync, Restic-native
- Can be accessed from all VPSs simultaneously
Storage Hierarchy:
Storage Box (BX10 or larger)
└── /backups/
├── /client-alpha/
│ ├── /restic-repo/ # Encrypted Restic repository
│ └── /manual/ # Ad-hoc exports if needed
├── /client-beta/
│ └── /restic-repo/
└── /client-gamma/
└── /restic-repo/
Connection Method:
- Primary: SFTP (native Restic support, encrypted in transit)
- Optional: CIFS mount for manual file access
- Each client VPS gets Storage Box sub-account or uses main credentials with path restrictions
| Aspect | Configuration |
|---|---|
| Frequency | Nightly (after DB dumps) |
| Time | 03:00 local time |
| Retention | 7 daily, 4 weekly, 6 monthly |
| Encryption | Restic default (AES-256) |
| Repo passwords | Stored in SOPS-encrypted files |
What Gets Backed Up:
/opt/docker/
├── nextcloud/
│ └── data/ # ✓ User files
├── zitadel/
│ └── db-dumps/ # ✓ PostgreSQL dumps (not live DB)
├── pretix/
│ └── data/ # ✓ When applicable
└── configs/ # ✓ docker-compose files, env
Backup Ansible Role Tasks:
- Install Restic
- Initialize repo (if not exists)
- Configure SFTP connection to Storage Box
- Create pre-backup script (database dumps)
- Create backup script
- Create systemd timer
- Configure backup monitoring (alert on failure)
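A minimal sketch of a few of these tasks, assuming an Ansible role with hypothetical `restic-backup.service`/`restic-backup.timer` templates:
# roles/backup/tasks/main.yml (abridged sketch)
- name: Install Restic
  ansible.builtin.apt:
    name: restic
    state: present

- name: Install systemd units for the nightly backup
  ansible.builtin.template:
    src: "{{ item }}.j2"
    dest: "/etc/systemd/system/{{ item }}"
  loop:
    - restic-backup.service
    - restic-backup.timer

- name: Enable and start the backup timer
  ansible.builtin.systemd:
    name: restic-backup.timer
    enabled: true
    state: started
    daemon_reload: true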
Sizing Guidance:
- Start with BX10 (1TB) for 10 clients
- Monitor usage monthly
- Scale to BX20 (2TB) when approaching 70% capacity
Verification:
- Weekly `restic check` via cron
- Monthly test restore to staging environment
- Alerts on backup job failures
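The weekly check could be scheduled from the same role. A sketch, assuming repository settings live in a hypothetical `/etc/restic.env`:
# roles/backup/tasks/verify.yml (sketch)
- name: Schedule weekly Restic repository integrity check
  ansible.builtin.cron:
    name: restic-check
    weekday: "0"          # Sunday
    hour: "4"
    minute: "0"
    job: ". /etc/restic.env && restic check"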
6. Secrets Management
Decision: SOPS + Age Encryption
Choice: File-based secrets encryption using SOPS with Age encryption, stored in Git.
Why SOPS + Age over HashiCorp Vault:
- No additional server to maintain
- Truly open source (MPL 2.0 for SOPS, Apache 2.0 for Age)
- Secrets versioned alongside infrastructure code
- Simple to understand and debug
- Age is developed by Filippo Valsorda (FiloSottile), a European cryptography engineer
- Perfect for 10-50 server scale
- No vendor lock-in concerns
How It Works:
- Secrets stored in YAML files, encrypted with Age
- Only the values are encrypted, keys remain readable
- Decryption happens at Ansible runtime
- One Age key per environment (or shared across all)
Example Encrypted File:
# secrets/client-alpha.sops.yaml
db_password: ENC[AES256_GCM,data:kH3x9...,iv:abc...,tag:def...,type:str]
zitadel_admin: ENC[AES256_GCM,data:mN4y2...,iv:ghi...,tag:jkl...,type:str]
nextcloud_admin: ENC[AES256_GCM,data:pQ5z7...,iv:mno...,tag:pqr...,type:str]
restic_repo_password: ENC[AES256_GCM,data:rS6a1...,iv:stu...,tag:vwx...,type:str]
Key Management:
keys/
├── age-key.txt # Master key (NEVER in Git, backed up securely)
└── .sops.yaml # SOPS configuration (in Git)
.sops.yaml Configuration:
creation_rules:
  - path_regex: secrets/.*\.sops\.yaml$
    age: age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Secret Structure:
secrets/
├── .sops.yaml # SOPS config
├── shared.sops.yaml # Shared secrets (Storage Box, API tokens)
└── clients/
├── alpha.sops.yaml # Client-specific secrets
├── beta.sops.yaml
└── gamma.sops.yaml
Ansible Integration:
# Using the community.sops collection
- name: Load client secrets
  community.sops.load_vars:
    file: "secrets/clients/{{ client_name }}.sops.yaml"
    name: client_secrets

- name: Use decrypted secret
  ansible.builtin.template:
    src: docker-compose.yml.j2
    dest: /opt/docker/docker-compose.yml
  vars:
    db_password: "{{ client_secrets.db_password }}"
Daily Operations:
# Encrypt a new file
sops --encrypt --age $(cat keys/age-key.pub) secrets/clients/new.yaml > secrets/clients/new.sops.yaml
# Edit existing secrets (decrypts, opens editor, re-encrypts)
SOPS_AGE_KEY_FILE=keys/age-key.txt sops secrets/clients/alpha.sops.yaml
# View decrypted content
SOPS_AGE_KEY_FILE=keys/age-key.txt sops --decrypt secrets/clients/alpha.sops.yaml
Key Backup Strategy:
- Age private key stored in password manager (Bitwarden/1Password)
- Printed paper backup in secure location
- Key never stored in Git repository
- Consider key escrow for bus factor
Advantages for This Setup:
| Aspect | Benefit |
|---|---|
| Simplicity | No Vault server to maintain, secure, or update |
| Auditability | Git history shows who changed what secrets when |
| Portability | Works offline, no network dependency |
| Reliability | No secrets server = no secrets server downtime |
| Cost | Zero infrastructure cost |
7. Monitoring
Decision: Centralized Uptime Kuma
Choice: Uptime Kuma on dedicated monitoring server.
Rationale:
- Simple to deploy and maintain
- Beautiful UI for status overview
- Flexible alerting (email, Slack, webhook)
- Self-hosted (data stays in-house)
- Sufficient for "is it up?" monitoring at current scale
Deployment:
- Dedicated VPS or container on monitoring server
- Monitors all client servers and services
- Public status page optional per client
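A hedged sketch of the container deployment (image tag and port are the project's defaults; the data path is illustrative):
# docker-compose.yml on the monitoring server (sketch)
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1        # Pin at least the major version
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - ./uptime-kuma-data:/app/data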
Monitors per Client:
- HTTPS endpoint (Nextcloud)
- HTTPS endpoint (Zitadel)
- TCP port checks (database, if exposed)
- Docker container health (via API or agent)
Alerting:
- Primary: Email
- Secondary: Slack/Mattermost webhook
- Escalation: SMS for extended downtime (future)
Future Expansion Path: When deeper metrics needed:
- Add Prometheus + Node Exporter
- Add Grafana dashboards
- Add Loki for log aggregation
- Uptime Kuma remains for synthetic monitoring
8. Client Isolation
Decision: Full Isolation
Choice: Maximum isolation between clients at all levels.
Implementation:
| Layer | Isolation Method |
|---|---|
| Compute | Separate VPS per client |
| Network | Hetzner firewall rules, no inter-VPS traffic |
| Database | Separate PostgreSQL container per client |
| Storage | Separate Docker volumes |
| Backups | Separate Restic repositories |
| Secrets | Separate SOPS files per client |
| DNS | Separate records/domains |
Network Rules:
- Each VPS accepts traffic only on ports 80 and 443 (public) and 22 (restricted to management IPs)
- No private network between client VPSs
- Monitoring server can reach all clients (outbound checks)
Rationale:
- Security: Compromise of one client cannot spread
- Compliance: Data separation demonstrable
- Operations: Can maintain/upgrade clients independently
- Billing: Clear resource attribution
9. Deployment Strategy
Decision: Canary Deployments with Version Pinning
Choice: Staged rollouts with explicit version control.
Version Pinning
All container images use explicit tags:
# docker-compose.yml
services:
  nextcloud:
    image: nextcloud:28.0.1                 # Never use :latest
  zitadel:
    image: ghcr.io/zitadel/zitadel:v3.x.x   # Pin an explicit release
  postgres:
    image: postgres:16.1
Version updates require explicit change and commit.
Canary Process
Inventory Groups:
all:
  children:
    canary:
      hosts:
        client-alpha:   # Designated test client (internal or willing partner)
    production:
      hosts:
        client-beta:
        client-gamma:
        # ... remaining clients
Deployment Script:
#!/bin/bash
set -e
echo "=== Deploying to canary ==="
ansible-playbook deploy.yml --limit canary
echo "=== Waiting for verification ==="
read -p "Canary OK? Proceed to production? [y/N] " confirm
if [[ $confirm != "y" ]]; then
echo "Deployment aborted"
exit 1
fi
echo "=== Deploying to production ==="
ansible-playbook deploy.yml --limit production
Rollback Procedures
Scenario 1: Bad container version
# Revert version in docker-compose
git revert HEAD
# Redeploy
ansible-playbook deploy.yml --limit affected_hosts
Scenario 2: Database migration issue
# Restore from pre-upgrade Restic backup
restic -r sftp:user@backup-server:/client-x/restic-repo restore latest --target /tmp/restore
# Restore database dump
psql < /tmp/restore/db-dumps/zitadel.sql
# Revert and redeploy application
Scenario 3: Complete server failure
# Restore Hetzner snapshot via API
hcloud server rebuild <server-id> --image <snapshot-id>
# Or via OpenTofu
tofu apply -replace="hcloud_server.client[\"affected\"]"
10. Security Baseline
Decision: Comprehensive Hardening
All servers receive the common Ansible role with:
SSH Hardening
# /etc/ssh/sshd_config (managed by Ansible)
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy
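The common role might deploy this file from a template and validate it before reloading; a sketch with a hypothetical handler name:
# roles/common/tasks/ssh.yml (sketch)
- name: Deploy hardened sshd_config
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0644"
    validate: /usr/sbin/sshd -t -f %s    # reject a broken config before it lands
  notify: Restart sshd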
Firewall (UFW)
- 22/tcp: Management IPs only
- 80/tcp: Any (redirects to 443)
- 443/tcp: Any
- All other: Deny
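A sketch of these rules via the `community.general.ufw` module, assuming a `management_ips` variable holding the allowed source addresses:
# roles/common/tasks/firewall.yml (sketch)
- name: Allow SSH from management IPs only
  community.general.ufw:
    rule: allow
    port: "22"
    proto: tcp
    src: "{{ item }}"
  loop: "{{ management_ips }}"

- name: Allow public HTTP and HTTPS
  community.general.ufw:
    rule: allow
    port: "{{ item }}"
    proto: tcp
  loop: ["80", "443"]

- name: Deny everything else and enable the firewall
  community.general.ufw:
    state: enabled
    policy: deny
    direction: incoming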
Automatic Updates
# unattended-upgrades configuration
Unattended-Upgrade::Allowed-Origins {
"${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::AutoFixInterruptedDpkg "true";
Unattended-Upgrade::Automatic-Reboot "false"; # Manual reboot control
Fail2ban
# Jails enabled
- sshd
- traefik-auth (custom, for repeated 401s)
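The jails can be dropped in from the common role; a sketch with illustrative thresholds and a hypothetical handler (the custom traefik-auth jail would also need a matching filter definition):
# roles/common/tasks/fail2ban.yml (sketch)
- name: Enable the sshd jail with stricter limits
  ansible.builtin.copy:
    dest: /etc/fail2ban/jail.d/sshd.local
    content: |
      [sshd]
      enabled  = true
      maxretry = 5
      bantime  = 1h
  notify: Restart fail2ban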
Container Security
# Trivy scanning in CI/CD
- Scan images before deployment
- Block critical vulnerabilities
- Weekly scheduled scans of running containers
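A sketch of such a gate, assuming a GitHub Actions-style pipeline and the public `aquasec/trivy` image (adjust to whichever CI system is in use):
# .github/workflows/scan.yml (sketch)
name: Container image scan
on: [push]
jobs:
  image-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Scan a pinned image and fail on critical findings
        run: |
          docker run --rm aquasec/trivy:latest image \
            --severity CRITICAL --exit-code 1 nextcloud:28.0.1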
Additional Measures
- No password authentication anywhere
- Secrets encrypted with SOPS + Age, never plaintext in Git
- Regular dependency updates via Dependabot/Renovate
- SSH keys rotated annually
11. Onboarding Procedure
New Client Checklist
## Client Onboarding: {CLIENT_NAME}
### Prerequisites
- [ ] Client agreement signed
- [ ] Domain/subdomain confirmed: _______________
- [ ] Contact email: _______________
- [ ] Desired applications: [ ] Zitadel [ ] Nextcloud [ ] Pretix [ ] Listmonk
### Infrastructure
- [ ] Add client to `tofu/variables.tf`
- [ ] Add client to `ansible/inventory/clients.yml`
- [ ] Create secrets file: `sops secrets/clients/{name}.sops.yaml`
- [ ] Create Storage Box subdirectory for backups
- [ ] Run: `tofu apply`
- [ ] Run: `ansible-playbook playbooks/setup.yml --limit {client}`
### Verification
- [ ] HTTPS accessible
- [ ] Zitadel admin login works
- [ ] Nextcloud admin login works
- [ ] Backup job runs successfully
- [ ] Monitoring checks green
### Handover
- [ ] Send credentials securely (1Password link, Signal, etc.)
- [ ] Schedule onboarding call if needed
- [ ] Add to status page (if applicable)
- [ ] Document any custom configuration
### Estimated Time: 30-45 minutes
12. Offboarding Procedure
Client Removal Checklist
## Client Offboarding: {CLIENT_NAME}
### Pre-Offboarding
- [ ] Confirm termination date: _______________
- [ ] Data export requested? [ ] Yes [ ] No
- [ ] Final invoice sent
### Data Export (if requested)
- [ ] Export Nextcloud data
- [ ] Export Zitadel organization/users
- [ ] Provide secure download link
- [ ] Confirm receipt
### Infrastructure Removal
- [ ] Disable monitoring checks (set maintenance mode first)
- [ ] Create final backup (retain per policy)
- [ ] Remove from Ansible inventory
- [ ] Remove from OpenTofu config
- [ ] Run: `tofu apply` (destroys VPS)
- [ ] Remove DNS records (automatic via OpenTofu)
- [ ] Remove/archive SOPS secrets file
### Backup Retention
- [ ] Move Restic repo to archive path
- [ ] Set deletion date: _______ (default: 90 days post-termination)
- [ ] Schedule deletion job
### Cleanup
- [ ] Remove from status page
- [ ] Update client count in documentation
- [ ] Archive client folder in documentation
### Verification
- [ ] DNS no longer resolves
- [ ] IP no longer responds (server destroyed)
- [ ] Monitoring shows no alerts (host removed)
- [ ] Billing stopped
### Estimated Time: 15-30 minutes
Data Retention Policy
| Data Type | Retention Post-Offboarding |
|---|---|
| Application data (Restic) | 90 days |
| Hetzner snapshots | Deleted immediately (with VPS) |
| SOPS secrets files | Archived 90 days, then deleted |
| Logs | 30 days |
| Invoices/contracts | 7 years (legal requirement) |
13. Repository Structure
infrastructure/
├── README.md
├── docs/
│ ├── architecture-decisions.md # This document
│ ├── runbook.md # Operational procedures
│ └── clients/ # Per-client notes
│ ├── alpha.md
│ └── beta.md
├── tofu/ # OpenTofu configuration
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ ├── dns.tf
│ ├── firewall.tf
│ └── versions.tf
├── ansible/
│ ├── ansible.cfg
│ ├── hcloud.yml # Dynamic inventory config
│ ├── playbooks/
│ │ ├── setup.yml # Initial server setup
│ │ ├── deploy.yml # Deploy/update applications
│ │ ├── upgrade.yml # System updates
│ │ └── backup-restore.yml # Manual backup/restore
│ ├── roles/
│ │ ├── common/
│ │ ├── docker/
│ │ ├── traefik/
│ │ ├── zitadel/
│ │ ├── nextcloud/
│ │ ├── backup/
│ │ └── monitoring-agent/
│ └── group_vars/
│ └── all.yml
├── secrets/ # SOPS-encrypted secrets
│ ├── .sops.yaml # SOPS configuration
│ ├── shared.sops.yaml # Shared secrets
│ └── clients/
│ ├── alpha.sops.yaml
│ └── beta.sops.yaml
├── docker/
│ ├── docker-compose.base.yml # Common services
│ └── docker-compose.apps.yml # Application services
└── scripts/
├── deploy.sh # Canary deployment wrapper
├── onboard-client.sh
└── offboard-client.sh
Note: The Age private key (age-key.txt) is NOT stored in this repository. It must be:
- Stored in a password manager
- Backed up securely offline
- Available on deployment machine only
14. Open Decisions / Future Considerations
To Decide Later
- Shared Zitadel instance vs isolated instances per client
- Central logging (Loki) - when/if needed
- Prometheus metrics - when/if needed
- Custom domain SSL workflow
- Client self-service portal
Scaling Triggers
- 20+ servers: Consider Kubernetes or Nomad
- Multi-region: Add OpenTofu workspaces per region
- Team growth: Consider moving from SOPS to Infisical for better access control
- Complex secret rotation: May need dedicated secrets server
15. Technology Choices Rationale
Why We Chose Open Source / European-Friendly Tools
| Tool | Chosen | Avoided | Reason |
|---|---|---|---|
| IaC | OpenTofu | Terraform | BSL license concerns, HashiCorp trust issues |
| Secrets | SOPS + Age | HashiCorp Vault | Simplicity, no US vendor dependency, truly open source |
| Identity | Zitadel | Keycloak | Swiss company, GDPR-adequate jurisdiction, native multi-tenancy |
| DNS | Hetzner DNS | Cloudflare | EU-based, GDPR-native, single provider |
| Hosting | Hetzner | AWS/GCP/Azure | EU-based, cost-effective, GDPR-compliant |
| Backup | Restic + Hetzner Storage Box | Cloud backup services | Open source, EU data residency |
Guiding Principles:
- Prefer truly open source (OSI-approved) over source-available
- Prefer EU-based services for GDPR simplicity
- Avoid vendor lock-in where practical
- Choose simplicity appropriate to scale (10-50 servers)
16. Development Environment and Tooling
Decision: Isolated Python Environments with pipx
Choice: Use pipx for installing Python CLI tools (Ansible) in isolated virtual environments.
Why pipx:
- Prevents dependency conflicts between tools
- Each tool has its own Python environment
- No interference with system Python packages
- Easy to upgrade/rollback individual tools
- Modern best practice for Python CLI tools
Implementation:
# Install pipx
brew install pipx
pipx ensurepath
# Install Ansible in isolation
pipx install --include-deps ansible
# Inject additional dependencies as needed
pipx inject ansible requests python-dateutil
Benefits:
| Aspect | Benefit |
|---|---|
| Isolation | No conflicts with other Python tools |
| Reproducibility | Each team member gets same isolated environment |
| Maintainability | Easy to upgrade Ansible without breaking other tools |
| Clean system | No pollution of system Python packages |
Alternatives Considered:
- Homebrew Ansible - Rejected: Can conflict with system Python, harder to manage dependencies
- System pip install - Rejected: Pollutes global Python environment
- Manual venv - Rejected: More manual work, pipx automates this
Changelog
| Date | Change | Author |
|---|---|---|
| 2024-12 | Initial architecture decisions | Pieter / Claude |
| 2024-12 | Added Hetzner Storage Box as Restic backend | Pieter / Claude |
| 2024-12 | Switched from Terraform to OpenTofu (licensing concerns) | Pieter / Claude |
| 2024-12 | Switched from HashiCorp Vault to SOPS + Age (simplicity, open source) | Pieter / Claude |
| 2024-12 | Switched from Keycloak to Zitadel (Swiss company, GDPR jurisdiction) | Pieter / Claude |
| 2024-12 | Adopted pipx for isolated Python tool environments (Ansible) | Pieter / Claude |