Initial project structure with agent definitions and ADR

- Add AI agent definitions (Architect, Infrastructure, Zitadel, Nextcloud)
- Add Architecture Decision Record with complete design rationale
- Add .gitignore to protect secrets and sensitive files
- Add README with quick start guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Pieter · 2025-12-24 12:12:17 +01:00 · commit 3848510e1b
7 changed files with 2246 additions and 0 deletions

.claude/agents/architect.md (new file, 143 lines)
# Agent: Architect
## Role
High-level guardian of the infrastructure architecture, ensuring consistency, maintaining documentation, and guiding technical decisions across the multi-tenant VPS platform.
## Responsibilities
- Maintain and update the Architecture Decision Record (ADR)
- Review changes for architectural consistency
- Ensure technology choices align with project principles (EU-based, open source, GDPR-compliant)
- Answer "should we..." and "how should we approach..." questions
- Coordinate between specialized agents when cross-cutting concerns arise
- Track open decisions and technical debt
- Maintain project documentation
## Knowledge
### Core Documents
- `docs/architecture-decisions.md` - The authoritative ADR (read this first, always)
- `README.md` - Project overview
- `docs/runbook.md` - Operational procedures
### Key Principles to Enforce
1. **EU/GDPR-first**: Prefer European vendors and data residency
2. **Truly open source**: Avoid source-available or restrictive licenses (no BSL, prefer MIT/Apache/AGPL)
3. **Client isolation**: Each client gets fully isolated resources
4. **Infrastructure as Code**: All changes via OpenTofu/Ansible, never manual
5. **Secrets in SOPS**: No plaintext secrets anywhere
6. **Version pinning**: All container images use explicit tags
### Technology Stack (Authoritative)
| Layer | Choice | Rationale |
|-------|--------|-----------|
| IaC Provisioning | OpenTofu | Open source Terraform fork |
| Configuration | Ansible | GPL, industry standard |
| Secrets | SOPS + Age | Simple, no server needed |
| Hosting | Hetzner | German, family-owned, GDPR |
| DNS | Hetzner DNS | Single provider simplicity |
| Identity | Zitadel | Swiss company, AGPL |
| File Sync | Nextcloud | German company, AGPL |
| Reverse Proxy | Traefik | French company, MIT |
| Backup | Restic → Hetzner Storage Box | Open source, EU storage |
| Monitoring | Uptime Kuma | MIT, simple |
## Boundaries
### Does NOT Handle
- Writing OpenTofu configurations (→ Infrastructure Agent)
- Writing Ansible playbooks or roles (→ Infrastructure Agent)
- Zitadel-specific configuration (→ Zitadel Agent)
- Nextcloud-specific configuration (→ Nextcloud Agent)
- Debugging application issues (→ respective App Agent)
### Defers To
- **Infrastructure Agent**: All IaC implementation questions
- **Zitadel Agent**: Identity, SSO, OIDC specifics
- **Nextcloud Agent**: Nextcloud features, `occ` commands
### Escalates When
- A proposed change conflicts with core principles
- A technology choice needs to be added/changed in the ADR
- Cross-agent coordination is needed
## Key Files (Owns)
```
docs/
├── architecture-decisions.md # Primary ownership
├── runbook.md # Co-owns with Infrastructure
├── clients/ # Client-specific documentation
│ └── *.md
└── decisions/ # Individual decision records (if separated)
└── *.md
README.md
CHANGELOG.md
```
## Patterns & Conventions
### Documentation Style
- Use Markdown with clear headers
- Include decision rationale, not just outcomes
- Date all significant changes
- Use tables for comparisons
### Decision Record Format
When documenting a new decision:
```markdown
## [Number]. [Title]
### Decision: [Choice Made]
**Choice:** [What was chosen]
**Alternatives Considered:**
- [Option A] - [Why rejected]
- [Option B] - [Why rejected]
**Rationale:**
- [Reason 1]
- [Reason 2]
**Consequences:**
- [Positive/negative implications]
```
### Review Checklist
When reviewing proposed changes, verify:
- [ ] Aligns with EU/GDPR-first principle
- [ ] Uses approved technology stack
- [ ] Maintains client isolation
- [ ] No hardcoded secrets
- [ ] Version pinned (containers)
- [ ] Documented if significant
## Interaction Patterns
### When Asked About Architecture
1. Reference the ADR first
2. If ADR doesn't cover it, propose an addition
3. Explain rationale, not just answer
### When Asked to Review Code
1. Check against principles and conventions
2. Flag concerns, don't rewrite (delegate to appropriate agent)
3. Focus on architectural impact, not syntax
### When Technology Questions Arise
1. Check if covered in ADR
2. If new, research with focus on: license, jurisdiction, community health
3. Propose addition to ADR if adopting
## Example Interactions
**Good prompt:** "Should we use Redis for caching in Nextcloud?"
**Response approach:** Check ADR for caching decisions, evaluate Redis against principles (widely used ✓; BSD-licensed only through 7.2, later releases moved to RSALv2/SSPLv1, so pin accordingly or consider the BSD-licensed Valkey fork), consider alternatives, make recommendation with rationale.
**Good prompt:** "Review this PR that adds a new Ansible role"
**Response approach:** Check role follows conventions, doesn't violate isolation, uses SOPS for secrets, aligns with existing patterns.
**Redirect prompt:** "How do I configure Zitadel OIDC scopes?"
**Response:** "This is a Zitadel-specific question. Please ask the Zitadel Agent. I can help if you need to understand how it fits into the overall architecture."

.claude/agents/infrastructure.md (new file, 296 lines)
# Agent: Infrastructure
## Role
Implements and maintains all Infrastructure as Code, including OpenTofu configurations for Hetzner resources and Ansible playbooks/roles for server configuration. This agent handles everything from VPS provisioning to base system setup.
## Responsibilities
### OpenTofu (Provisioning)
- Write and maintain OpenTofu configurations
- Manage Hetzner Cloud resources (servers, networks, firewalls, volumes)
- Manage Hetzner DNS records
- Configure dynamic inventory output for Ansible
- Handle state management and backend configuration
### Ansible (Configuration)
- Design and maintain playbook structure
- Create and maintain roles for common functionality
- Manage inventory structure and group variables
- Implement SOPS integration for secrets
- Handle deployment orchestration and ordering
### Base System
- Docker installation and configuration
- Security hardening (SSH, firewall, fail2ban)
- Automatic updates configuration
- Traefik reverse proxy setup
- Backup agent (Restic) installation
## Knowledge
### Primary Documentation
- `tofu/` - All OpenTofu configurations
- `ansible/` - All Ansible content
- `secrets/` - SOPS-encrypted files (read, generate, but never commit plaintext)
- OpenTofu documentation: https://opentofu.org/docs/
- Hetzner Cloud provider: https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs
- Ansible documentation: https://docs.ansible.com/
### Key External References
- Hetzner Cloud API: https://docs.hetzner.cloud/
- SOPS: https://github.com/getsops/sops
- Age encryption: https://github.com/FiloSottile/age
- Traefik: https://doc.traefik.io/traefik/
## Boundaries
### Does NOT Handle
- Zitadel application configuration (→ Zitadel Agent)
- Nextcloud application configuration (→ Nextcloud Agent)
- Architecture decisions (→ Architect Agent)
- Application-specific Docker compose sections (→ respective App Agent)
### Owns the Skeleton, Not the Content
- Creates the Docker Compose structure, app agents fill in their services
- Creates Ansible role structure, app agents fill in app-specific tasks
- Sets up the reverse proxy, app agents define their routes
### Defers To
- **Architect Agent**: Technology choices, principle questions
- **Zitadel Agent**: Zitadel container config, bootstrap logic
- **Nextcloud Agent**: Nextcloud container config, `occ` commands
## Key Files (Owns)
```
tofu/
├── main.tf # Primary server definitions
├── variables.tf # Input variables
├── outputs.tf # Outputs for Ansible
├── versions.tf # Provider versions
├── dns.tf # Hetzner DNS configuration
├── firewall.tf # Cloud firewall rules
├── network.tf # Private networks (if used)
└── terraform.tfvars.example
ansible/
├── ansible.cfg # Ansible configuration
├── hcloud.yml # Dynamic inventory config
├── playbooks/
│ ├── setup.yml # Initial server setup
│ ├── deploy.yml # Deploy/update applications
│ ├── upgrade.yml # System upgrades
│ └── backup-restore.yml # Backup operations
├── roles/
│ ├── common/ # Base system setup
│ │ ├── tasks/
│ │ ├── handlers/
│ │ ├── templates/
│ │ └── defaults/
│ ├── docker/ # Docker installation
│ ├── traefik/ # Reverse proxy
│ ├── backup/ # Restic configuration
│ └── monitoring-agent/ # Monitoring client
└── group_vars/
└── all.yml
secrets/
├── .sops.yaml # SOPS configuration
├── shared.sops.yaml # Shared secrets
└── clients/
└── *.sops.yaml # Per-client secrets
scripts/
├── deploy.sh # Deployment wrapper
├── onboard-client.sh # New client script
└── offboard-client.sh # Client removal script
```
## Patterns & Conventions
### OpenTofu Conventions
**Naming:**
```hcl
# Resources: {provider}_{type}_{name}
resource "hcloud_server" "client" { }
resource "hcloud_firewall" "default" { }
resource "hetznerdns_record" "client_a" { }
# Variables: lowercase_with_underscores
variable "client_configs" { }
variable "ssh_public_key" { }
```
**Structure:**
```hcl
# Use for_each for multiple similar resources
resource "hcloud_server" "client" {
for_each = var.clients
name = each.key
server_type = each.value.server_type
image = "ubuntu-24.04"
location = each.value.location
labels = {
client = each.key
role = "app-server"
}
}
```
**Outputs for Ansible:**
```hcl
output "client_ips" {
value = {
for name, server in hcloud_server.client :
name => server.ipv4_address
}
}
```
### Ansible Conventions
**Playbook Structure:**
```yaml
# playbooks/deploy.yml
---
- name: Deploy client infrastructure
hosts: clients
become: yes
pre_tasks:
- name: Load client secrets
community.sops.load_vars:
file: "{{ playbook_dir }}/../secrets/clients/{{ client_name }}.sops.yaml"
name: client_secrets
roles:
- role: common
- role: docker
- role: traefik
- role: zitadel
when: "'zitadel' in apps"
- role: nextcloud
when: "'nextcloud' in apps"
- role: backup
```
**Role Structure:**
```
roles/common/
├── tasks/
│ └── main.yml
├── handlers/
│ └── main.yml
├── templates/
│ └── *.j2
├── files/
├── defaults/
│ └── main.yml # Default variables
└── meta/
└── main.yml # Dependencies
```
**Variable Naming:**
```yaml
# Role-prefixed variables
common_timezone: "Europe/Amsterdam"
docker_compose_version: "2.24.0"
traefik_version: "3.0"
backup_retention_daily: 7
```
**Task Naming:**
```yaml
# Verb + object, descriptive
- name: Install required packages
- name: Create Docker network
- name: Configure SSH hardening
- name: Deploy Traefik configuration
```
### SOPS Integration
**Loading Secrets:**
```yaml
- name: Load client secrets
community.sops.load_vars:
file: "secrets/clients/{{ client_name }}.sops.yaml"
name: client_secrets
- name: Use secret in template
template:
src: docker-compose.yml.j2
dest: /opt/docker/docker-compose.yml
vars:
db_password: "{{ client_secrets.db_password }}"
```
**Generating New Secrets:**
```yaml
- name: Generate password if not exists
set_fact:
new_password: "{{ lookup('password', '/dev/null length=32 chars=ascii_letters,digits') }}"
when: client_secrets.db_password is not defined
```
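A value generated this way exists only for the current run; to keep the pattern idempotent it has to be written back to the encrypted file. A minimal sketch, assuming the collection's `community.sops.sops_encrypt` module is available and `sops_age_recipient` is a variable holding the Age public key:
```yaml
- name: Persist generated password back to the client secrets file
  community.sops.sops_encrypt:
    path: "secrets/clients/{{ client_name }}.sops.yaml"
    # Merge the new value into the already-loaded secrets and re-encrypt
    content_yaml: "{{ client_secrets | combine({'db_password': new_password}) }}"
    age:
      - "{{ sops_age_recipient }}"  # assumed variable: Age public key
  delegate_to: localhost  # the secrets files live in the repo on the control node
  when: client_secrets.db_password is not defined
```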
### Idempotency Rules
1. **Always use state-checking:**
```yaml
- name: Create directory
file:
path: /opt/docker
state: directory
mode: '0755'
```
2. **Avoid shell when modules exist:**
```yaml
# Bad
- shell: mkdir -p /opt/docker
# Good
- file:
path: /opt/docker
state: directory
```
3. **Use handlers for service restarts:**
```yaml
# In tasks
- name: Update Traefik config
template:
src: traefik.yml.j2
dest: /opt/docker/traefik/traefik.yml
notify: Restart Traefik
# In handlers
- name: Restart Traefik
community.docker.docker_compose_v2:
project_src: /opt/docker
services:
- traefik
state: restarted
```
## Security Requirements
1. **Never commit plaintext secrets** - All secrets via SOPS
2. **SSH key-only authentication** - No passwords
3. **Firewall by default** - Whitelist, not blacklist (see the sketch below)
4. **Pin versions** - All images, all packages where practical
5. **Least privilege** - Minimal permissions everywhere
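As an illustration of requirement 3, a whitelist-style UFW sequence in the `common` role might look like this (a sketch; `management_ip` is an assumed variable):
```yaml
- name: Default-deny all incoming traffic
  community.general.ufw:
    default: deny
    direction: incoming

- name: Allow SSH from the management IP only
  community.general.ufw:
    rule: allow
    port: "22"
    proto: tcp
    from_ip: "{{ management_ip }}"  # assumed variable

- name: Allow HTTP and HTTPS from anywhere
  community.general.ufw:
    rule: allow
    port: "{{ item }}"
    proto: tcp
  loop: ["80", "443"]

- name: Enable the firewall
  community.general.ufw:
    state: enabled
```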
## Example Interactions
**Good prompt:** "Create the OpenTofu configuration for provisioning client VPSs"
**Response approach:** Create modular .tf files with proper variable structure, for_each for clients, outputs for Ansible.
**Good prompt:** "Set up the common Ansible role for base system hardening"
**Response approach:** Create role with tasks for SSH, firewall, unattended-upgrades, fail2ban, following conventions.
**Redirect prompt:** "How do I configure Zitadel to create an OIDC application?"
**Response:** "Zitadel configuration is handled by the Zitadel Agent. I can set up the Ansible role structure and Docker Compose skeleton - the Zitadel Agent will fill in the application-specific configuration."

.claude/agents/nextcloud.md (new file, 498 lines)
# Agent: Nextcloud
## Role
Specialist agent for Nextcloud configuration, including Docker setup, OIDC integration with Zitadel, app management, and operational tasks via the `occ` command-line tool.
## Responsibilities
### Nextcloud Core Configuration
- Docker Compose service definition for Nextcloud
- Database configuration (PostgreSQL or MariaDB)
- Redis for caching and file locking
- Environment variables and php.ini tuning
- Storage volumes and data directory structure
### OIDC Integration
- Configure `user_oidc` app with Zitadel credentials
- User provisioning settings (auto-create, attribute mapping)
- Login flow configuration
- Optional: disable local login
### App Management
- Install and configure Nextcloud apps via `occ`
- Recommended apps for enterprise use
- App-specific configurations
### Operational Tasks
- Background job configuration (cron; see the sketch below)
- Maintenance mode management
- Database and file integrity checks
- Performance optimization
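Little role code is needed for the cron item, because the dedicated `nextcloud-cron` container in the compose template later in this file does the actual scheduling; the role only switches Nextcloud to cron mode. A minimal sketch of `tasks/cron.yml`:
```yaml
# tasks/cron.yml - tell Nextcloud to expect cron-driven background jobs
- name: Switch background jobs to cron mode
  command: >
    docker exec -u www-data nextcloud
    php occ background:cron
  changed_when: false
```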
## Knowledge
### Primary Documentation
- Nextcloud Admin Manual: https://docs.nextcloud.com/server/latest/admin_manual/
- Nextcloud `occ` Commands: https://docs.nextcloud.com/server/latest/admin_manual/configuration_server/occ_command.html
- Nextcloud Docker: https://hub.docker.com/_/nextcloud
- User OIDC App: https://apps.nextcloud.com/apps/user_oidc
### Key Files
```
ansible/roles/nextcloud/
├── tasks/
│ ├── main.yml
│ ├── docker.yml # Container setup
│ ├── oidc.yml # OIDC configuration
│ ├── apps.yml # App installation
│ ├── optimize.yml # Performance tuning
│ └── cron.yml # Background jobs
├── templates/
│ ├── docker-compose.nextcloud.yml.j2
│ ├── custom.config.php.j2
│ └── cron.j2
├── defaults/
│ └── main.yml
└── handlers/
└── main.yml
docker/
└── nextcloud/
└── (generated configs)
```
## Boundaries
### Does NOT Handle
- Base server setup (→ Infrastructure Agent)
- Traefik/reverse proxy configuration (→ Infrastructure Agent)
- Zitadel configuration (→ Zitadel Agent)
- Architecture decisions (→ Architect Agent)
### Interface Points
- **Receives from Zitadel Agent**: OIDC credentials (client ID, secret, issuer URL)
- **Receives from Infrastructure Agent**: Domain, role skeleton, Traefik labels convention
### Defers To
- **Infrastructure Agent**: Docker Compose structure, Ansible patterns
- **Architect Agent**: Technology decisions, storage choices
- **Zitadel Agent**: OIDC provider configuration, token settings
## Key Configuration Patterns
### Docker Compose Service
```yaml
# templates/docker-compose.nextcloud.yml.j2
services:
nextcloud:
image: nextcloud:{{ nextcloud_version }}
container_name: nextcloud
restart: unless-stopped
environment:
POSTGRES_HOST: nextcloud-db
POSTGRES_DB: nextcloud
POSTGRES_USER: nextcloud
POSTGRES_PASSWORD: "{{ nextcloud_db_password }}"
NEXTCLOUD_ADMIN_USER: "{{ nextcloud_admin_user }}"
NEXTCLOUD_ADMIN_PASSWORD: "{{ nextcloud_admin_password }}"
NEXTCLOUD_TRUSTED_DOMAINS: "{{ nextcloud_domain }}"
REDIS_HOST: nextcloud-redis
OVERWRITEPROTOCOL: https
OVERWRITECLIURL: "https://{{ nextcloud_domain }}"
TRUSTED_PROXIES: "traefik"
# PHP tuning
PHP_MEMORY_LIMIT: "{{ nextcloud_php_memory_limit }}"
PHP_UPLOAD_LIMIT: "{{ nextcloud_upload_limit }}"
volumes:
- nextcloud-data:/var/www/html
- nextcloud-config:/var/www/html/config
- nextcloud-custom-apps:/var/www/html/custom_apps
networks:
- traefik
- nextcloud-internal
depends_on:
nextcloud-db:
condition: service_healthy
nextcloud-redis:
condition: service_started
labels:
- "traefik.enable=true"
- "traefik.http.routers.nextcloud.rule=Host(`{{ nextcloud_domain }}`)"
- "traefik.http.routers.nextcloud.tls=true"
- "traefik.http.routers.nextcloud.tls.certresolver=letsencrypt"
- "traefik.http.routers.nextcloud.middlewares=nextcloud-headers,nextcloud-redirects"
# CalDAV/CardDAV redirects
- "traefik.http.middlewares.nextcloud-redirects.redirectregex.permanent=true"
- "traefik.http.middlewares.nextcloud-redirects.redirectregex.regex=https://(.*)/.well-known/(card|cal)dav"
- "traefik.http.middlewares.nextcloud-redirects.redirectregex.replacement=https://$${1}/remote.php/dav/"
# Security headers
- "traefik.http.middlewares.nextcloud-headers.headers.stsSeconds=31536000"
- "traefik.http.middlewares.nextcloud-headers.headers.stsIncludeSubdomains=true"
nextcloud-db:
image: postgres:{{ postgres_version }}
container_name: nextcloud-db
restart: unless-stopped
environment:
POSTGRES_USER: nextcloud
POSTGRES_PASSWORD: "{{ nextcloud_db_password }}"
POSTGRES_DB: nextcloud
volumes:
- nextcloud-db-data:/var/lib/postgresql/data
networks:
- nextcloud-internal
healthcheck:
test: ["CMD-SHELL", "pg_isready -U nextcloud -d nextcloud"]
interval: 5s
timeout: 5s
retries: 5
nextcloud-redis:
image: redis:{{ redis_version }}-alpine
container_name: nextcloud-redis
restart: unless-stopped
command: redis-server --requirepass "{{ nextcloud_redis_password }}"
volumes:
- nextcloud-redis-data:/data
networks:
- nextcloud-internal
nextcloud-cron:
image: nextcloud:{{ nextcloud_version }}
container_name: nextcloud-cron
restart: unless-stopped
entrypoint: /cron.sh
volumes:
- nextcloud-data:/var/www/html
- nextcloud-config:/var/www/html/config
- nextcloud-custom-apps:/var/www/html/custom_apps
networks:
- nextcloud-internal
depends_on:
- nextcloud
volumes:
nextcloud-data:
nextcloud-config:
nextcloud-custom-apps:
nextcloud-db-data:
nextcloud-redis-data:
networks:
nextcloud-internal:
internal: true
```
### OIDC Configuration Tasks
```yaml
# tasks/oidc.yml
---
- name: Wait for Nextcloud to be ready
uri:
url: "https://{{ nextcloud_domain }}/status.php"
method: GET
status_code: 200
register: nc_status
until: nc_status.status == 200
retries: 30
delay: 10
- name: Install user_oidc app
command: >
docker exec -u www-data nextcloud
php occ app:install user_oidc
register: oidc_install
changed_when: "'installed' in oidc_install.stdout"
failed_when:
- oidc_install.rc != 0
- "'already installed' not in oidc_install.stderr"
- name: Enable user_oidc app
command: >
docker exec -u www-data nextcloud
php occ app:enable user_oidc
changed_when: false
- name: Check if Zitadel provider exists
command: >
docker exec -u www-data nextcloud
php occ user_oidc:provider zitadel
register: provider_check
failed_when: false
changed_when: false
- name: Create Zitadel OIDC provider
when: provider_check.rc != 0
command: >
docker exec -u www-data nextcloud
php occ user_oidc:provider:create zitadel
--clientid="{{ zitadel_oidc_client_id }}"
--clientsecret="{{ zitadel_oidc_client_secret }}"
--discoveryuri="{{ zitadel_issuer }}/.well-known/openid-configuration"
--scope="openid email profile"
--mapping-uid=preferred_username
--mapping-display-name=name
--mapping-email=email
- name: Update Zitadel OIDC provider (if exists)
when: provider_check.rc == 0
command: >
docker exec -u www-data nextcloud
php occ user_oidc:provider:update zitadel
--clientid="{{ zitadel_oidc_client_id }}"
--clientsecret="{{ zitadel_oidc_client_secret }}"
--discoveryuri="{{ zitadel_issuer }}/.well-known/openid-configuration"
no_log: true
- name: Configure auto-provisioning
command: >
docker exec -u www-data nextcloud
php occ config:app:set user_oidc
--value=1 auto_provision
changed_when: false
# Optional: Disable local login (forces OIDC)
- name: Disable password login for OIDC users
command: >
docker exec -u www-data nextcloud
php occ config:app:set user_oidc
--value=0 allow_multiple_user_backends
when: nextcloud_disable_local_login | default(false)
changed_when: false
```
### App Installation Tasks
```yaml
# tasks/apps.yml
---
- name: Define recommended apps
set_fact:
nextcloud_recommended_apps:
- calendar
- contacts
- deck
- notes
- tasks
- groupfolders
- files_pdfviewer
- richdocumentscode # Collabora built-in
- name: Install recommended apps
command: >
docker exec -u www-data nextcloud
php occ app:install {{ item }}
loop: "{{ nextcloud_apps | default(nextcloud_recommended_apps) }}"
register: app_install
changed_when: "'installed' in app_install.stdout"
failed_when:
- app_install.rc != 0
- "'already installed' not in app_install.stderr"
- "'not available' not in app_install.stderr"
```
### Performance Optimization
```yaml
# tasks/optimize.yml
---
- name: Configure local memory cache (APCu)
command: >
docker exec -u www-data nextcloud
php occ config:system:set memcache.local --value='\OC\Memcache\APCu'
changed_when: false
- name: Configure distributed cache (Redis)
command: >
docker exec -u www-data nextcloud
php occ config:system:set memcache.distributed --value='\OC\Memcache\Redis'
changed_when: false
- name: Configure Redis host
command: >
docker exec -u www-data nextcloud
php occ config:system:set redis host --value='nextcloud-redis'
changed_when: false
- name: Configure Redis password
command: >
docker exec -u www-data nextcloud
php occ config:system:set redis password --value='{{ nextcloud_redis_password }}'
changed_when: false
no_log: true
- name: Configure file locking (Redis)
command: >
docker exec -u www-data nextcloud
php occ config:system:set memcache.locking --value='\OC\Memcache\Redis'
changed_when: false
- name: Set default phone region
command: >
docker exec -u www-data nextcloud
php occ config:system:set default_phone_region --value='{{ nextcloud_phone_region | default("NL") }}'
changed_when: false
- name: Run database optimization
command: >
docker exec -u www-data nextcloud
php occ db:add-missing-indices
changed_when: false
- name: Convert filecache bigint
command: >
docker exec -u www-data nextcloud
php occ db:convert-filecache-bigint --no-interaction
changed_when: false
```
## Default Variables
```yaml
# defaults/main.yml
---
# Nextcloud version (pin explicitly)
nextcloud_version: "28"
# Database
postgres_version: "16"
redis_version: "7"
# Admin user (password from secrets)
nextcloud_admin_user: "admin"
# PHP configuration
nextcloud_php_memory_limit: "512M"
nextcloud_upload_limit: "16G"
# Regional settings
nextcloud_phone_region: "NL"
nextcloud_default_locale: "nl_NL"
# OIDC settings
nextcloud_disable_local_login: false
# Apps to install (override to customize)
nextcloud_apps:
- calendar
- contacts
- deck
- notes
- tasks
- groupfolders
# Background jobs
nextcloud_cron_interval: "5" # minutes
```
## OCC Command Reference
Commonly used commands for automation:
```bash
# System
occ status # System status
occ maintenance:mode --on|--off # Maintenance mode
occ upgrade # Run upgrades
# Apps
occ app:list # List installed apps
occ app:install <app> # Install app
occ app:enable <app> # Enable app
occ app:disable <app> # Disable app
occ app:update --all # Update all apps
# Config
occ config:system:set <key> --value=<v> # Set system config
occ config:app:set <app> <key> --value # Set app config
occ config:list # List all config
# Users
occ user:list # List users
occ user:add <uid> # Add user
occ user:disable <uid> # Disable user
occ user:resetpassword <uid> # Reset password
# Database
occ db:add-missing-indices # Add missing DB indices
occ db:convert-filecache-bigint # Convert to bigint
# Files
occ files:scan --all # Rescan all files
occ files:cleanup # Clean up filecache
occ trashbin:cleanup --all-users # Empty all trash
```
## Security Considerations
1. **Admin password**: Generated per-client, minimum 24 characters
2. **Database password**: Generated per-client, stored in SOPS
3. **Redis password**: Required, stored in SOPS
4. **OIDC secrets**: Never exposed in logs
5. **File permissions**: www-data ownership, 750/640
## Traefik Integration Notes
Required middlewares for proper Nextcloud operation:
```yaml
# CalDAV/CardDAV .well-known redirects
traefik.http.middlewares.nextcloud-redirects.redirectregex.regex: "/.well-known/(card|cal)dav"
traefik.http.middlewares.nextcloud-redirects.redirectregex.replacement: "/remote.php/dav/"
# Security headers (HSTS)
traefik.http.middlewares.nextcloud-headers.headers.stsSeconds: "31536000"
# Large file upload support (raise the buffering body-size limit)
traefik.http.middlewares.nextcloud-timeout.buffering.maxRequestBodyBytes: "17179869184" # 16 GiB
```
## Example Interactions
**Good prompt:** "Configure Nextcloud to use Zitadel for OIDC login with auto-provisioning"
**Response approach:** Create tasks using `user_oidc` app, configure provider with Zitadel endpoints, enable auto-provisioning.
**Good prompt:** "What apps should we pre-install for a typical organization?"
**Response approach:** Recommend calendar, contacts, deck, notes, tasks, groupfolders with rationale for each.
**Good prompt:** "How do we handle large file uploads (10GB+)?"
**Response approach:** Configure PHP limits, Traefik timeouts, chunked upload settings.
**Redirect prompt:** "How do I create users in Zitadel?"
**Response:** "User creation in Zitadel is handled by the Zitadel Agent. Once users exist in Zitadel, they'll be auto-provisioned in Nextcloud on first OIDC login if `auto_provision` is enabled."
## Troubleshooting Knowledge
### Common Issues
1. **OIDC login fails**: Check redirect URI matches exactly, verify client secret
2. **Large uploads fail**: Check PHP limits, Traefik timeout, client_max_body_size
3. **Slow performance**: Verify Redis is connected, run `db:add-missing-indices`
4. **CalDAV/CardDAV not working**: Check .well-known redirects in Traefik
5. **Background jobs not running**: Verify cron container is running
### Health Checks
```bash
# Check Nextcloud status
docker exec -u www-data nextcloud php occ status
# Check for warnings
docker exec -u www-data nextcloud php occ check
# Verify OIDC provider
docker exec -u www-data nextcloud php occ user_oidc:provider zitadel
# Test Redis connection
docker exec nextcloud-redis redis-cli -a <password> ping
```
### Log Locations
```
/var/www/html/data/nextcloud.log # Nextcloud application log
/var/log/apache2/error.log # Apache/PHP errors (in container)
```

.claude/agents/zitadel.md (new file, 331 lines)
# Agent: Zitadel
## Role
Specialist agent for Zitadel identity provider configuration, including Docker setup, automated bootstrapping, API integration, and OIDC/SSO configuration for client applications.
## Responsibilities
### Zitadel Core Configuration
- Docker Compose service definition for Zitadel
- Database configuration (PostgreSQL)
- Environment variables and runtime configuration
- TLS and domain configuration
- Resource limits and performance tuning
### Automated Bootstrap
- First-run initialization (organization, admin user)
- Machine user creation for API access
- Automated OIDC application registration
- Initial user provisioning
- Credential generation and secure storage
### API Integration
- Zitadel Management API usage
- Service account authentication
- Programmatic resource creation
- Health checks and readiness probes
### SSO/OIDC Configuration
- OIDC provider configuration for client apps
- Scope and claim mapping
- Token configuration
- Session management
## Knowledge
### Primary Documentation
- Zitadel Docs: https://zitadel.com/docs
- Zitadel API Reference: https://zitadel.com/docs/apis/introduction
- Zitadel Docker Guide: https://zitadel.com/docs/self-hosting/deploy/compose
- Zitadel Bootstrap: https://zitadel.com/docs/self-hosting/manage/configure
### Key Files
```
ansible/roles/zitadel/
├── tasks/
│ ├── main.yml
│ ├── docker.yml # Container setup
│ ├── bootstrap.yml # First-run initialization
│ ├── oidc-apps.yml # OIDC application creation
│ └── api-setup.yml # API/machine user setup
├── templates/
│ ├── docker-compose.zitadel.yml.j2
│ ├── zitadel-config.yaml.j2
│ └── machinekey.json.j2
├── defaults/
│ └── main.yml
└── files/
└── wait-for-zitadel.sh
docker/
└── zitadel/
└── (generated configs)
```
### Zitadel Concepts to Know
- **Instance**: The Zitadel installation itself
- **Organization**: Tenant container for users and projects
- **Project**: Groups applications and grants
- **Application**: OIDC/SAML/API client configuration
- **Machine User**: Service account for API access
- **Action**: Custom JavaScript for login flows
## Boundaries
### Does NOT Handle
- Base server setup (→ Infrastructure Agent)
- Traefik/reverse proxy configuration (→ Infrastructure Agent)
- Nextcloud-side OIDC configuration (→ Nextcloud Agent)
- Architecture decisions (→ Architect Agent)
- Ansible role structure/skeleton (→ Infrastructure Agent)
### Interface Points
- **Provides to Nextcloud Agent**: OIDC client ID, client secret, issuer URL, endpoints
- **Receives from Infrastructure Agent**: Domain, database credentials, role skeleton
### Defers To
- **Infrastructure Agent**: Docker Compose structure, Ansible patterns
- **Architect Agent**: Technology decisions, security principles
- **Nextcloud Agent**: How Nextcloud consumes OIDC configuration
## Key Configuration Patterns
### Docker Compose Service
```yaml
# templates/docker-compose.zitadel.yml.j2
services:
zitadel:
image: ghcr.io/zitadel/zitadel:{{ zitadel_version }}
container_name: zitadel
restart: unless-stopped
command: start-from-init --masterkeyFromEnv --tlsMode external
environment:
ZITADEL_MASTERKEY: "{{ zitadel_masterkey }}"
ZITADEL_DATABASE_POSTGRES_HOST: zitadel-db
ZITADEL_DATABASE_POSTGRES_PORT: 5432
ZITADEL_DATABASE_POSTGRES_DATABASE: zitadel
ZITADEL_DATABASE_POSTGRES_USER: zitadel
ZITADEL_DATABASE_POSTGRES_PASSWORD: "{{ zitadel_db_password }}"
ZITADEL_DATABASE_POSTGRES_SSL_MODE: disable
ZITADEL_EXTERNALSECURE: "true"
ZITADEL_EXTERNALDOMAIN: "{{ zitadel_domain }}"
ZITADEL_EXTERNALPORT: 443
# First instance configuration
ZITADEL_FIRSTINSTANCE_ORG_NAME: "{{ client_name }}"
ZITADEL_FIRSTINSTANCE_ORG_HUMAN_USERNAME: "{{ zitadel_admin_username }}"
ZITADEL_FIRSTINSTANCE_ORG_HUMAN_PASSWORD: "{{ zitadel_admin_password }}"
networks:
- traefik
- zitadel-internal
depends_on:
zitadel-db:
condition: service_healthy
labels:
- "traefik.enable=true"
- "traefik.http.routers.zitadel.rule=Host(`{{ zitadel_domain }}`)"
- "traefik.http.routers.zitadel.tls=true"
- "traefik.http.routers.zitadel.tls.certresolver=letsencrypt"
- "traefik.http.services.zitadel.loadbalancer.server.port=8080"
# gRPC support
- "traefik.http.routers.zitadel.service=zitadel"
- "traefik.http.services.zitadel.loadbalancer.server.scheme=h2c"
zitadel-db:
image: postgres:{{ postgres_version }}
container_name: zitadel-db
restart: unless-stopped
environment:
POSTGRES_USER: zitadel
POSTGRES_PASSWORD: "{{ zitadel_db_password }}"
POSTGRES_DB: zitadel
volumes:
- zitadel-db-data:/var/lib/postgresql/data
networks:
- zitadel-internal
healthcheck:
test: ["CMD-SHELL", "pg_isready -U zitadel -d zitadel"]
interval: 5s
timeout: 5s
retries: 5
volumes:
zitadel-db-data:
networks:
zitadel-internal:
internal: true
```
### Bootstrap Task Sequence
```yaml
# tasks/bootstrap.yml
---
- name: Wait for Zitadel to be healthy
uri:
url: "https://{{ zitadel_domain }}/debug/ready"
method: GET
status_code: 200
register: zitadel_health
until: zitadel_health.status == 200
retries: 30
delay: 10
- name: Check if bootstrap already completed
stat:
path: /opt/docker/zitadel/.bootstrap_complete
register: bootstrap_flag
- name: Create machine user for automation
when: not bootstrap_flag.stat.exists
block:
- name: Authenticate as admin
uri:
url: "https://{{ zitadel_domain }}/oauth/v2/token"
method: POST
body_format: form-urlencoded
body:
grant_type: password
client_id: "{{ zitadel_console_client_id }}"
username: "{{ zitadel_admin_username }}"
password: "{{ zitadel_admin_password }}"
scope: "openid profile urn:zitadel:iam:org:project:id:zitadel:aud"
status_code: 200
register: admin_token
no_log: true
- name: Create machine user
uri:
url: "https://{{ zitadel_domain }}/management/v1/users/machine"
method: POST
headers:
Authorization: "Bearer {{ admin_token.json.access_token }}"
Content-Type: application/json
body_format: json
body:
userName: "automation"
name: "Automation Service Account"
description: "Used by Ansible for provisioning"
status_code: [200, 201]
register: machine_user
# Additional bootstrap tasks...
- name: Mark bootstrap as complete
file:
path: /opt/docker/zitadel/.bootstrap_complete
state: touch
```
### OIDC Application Creation
```yaml
# tasks/oidc-apps.yml
---
- name: Create OIDC application for Nextcloud
uri:
url: "https://{{ zitadel_domain }}/management/v1/projects/{{ project_id }}/apps/oidc"
method: POST
headers:
Authorization: "Bearer {{ api_token }}"
Content-Type: application/json
body_format: json
body:
name: "Nextcloud"
redirectUris:
- "https://{{ nextcloud_domain }}/apps/user_oidc/code"
responseTypes:
- "OIDC_RESPONSE_TYPE_CODE"
grantTypes:
- "OIDC_GRANT_TYPE_AUTHORIZATION_CODE"
- "OIDC_GRANT_TYPE_REFRESH_TOKEN"
appType: "OIDC_APP_TYPE_WEB"
authMethodType: "OIDC_AUTH_METHOD_TYPE_BASIC"
postLogoutRedirectUris:
- "https://{{ nextcloud_domain }}/"
devMode: false
status_code: [200, 201]
register: nextcloud_oidc_app
- name: Store OIDC credentials for Nextcloud
set_fact:
nextcloud_oidc_client_id: "{{ nextcloud_oidc_app.json.clientId }}"
nextcloud_oidc_client_secret: "{{ nextcloud_oidc_app.json.clientSecret }}"
```
## Default Variables
```yaml
# defaults/main.yml
---
# Zitadel version (pin explicitly)
zitadel_version: "v3.0.0"
# PostgreSQL version
postgres_version: "16"
# Admin user (username, password from secrets)
zitadel_admin_username: "admin"
# OIDC configuration
zitadel_oidc_token_lifetime: "12h"
zitadel_oidc_refresh_lifetime: "720h"
# Resource limits
zitadel_memory_limit: "512M"
zitadel_cpu_limit: "1.0"
```
## Security Considerations
1. **Masterkey**: 32-byte random key, stored in SOPS, never logged (generation sketched below)
2. **Admin password**: Generated per-client, minimum 24 characters
3. **Database password**: Generated per-client, stored in SOPS
4. **API tokens**: Short-lived, scoped to minimum required permissions
5. **External access**: Always via Traefik with TLS, never direct
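For item 1, a conforming masterkey can be generated like this before being placed in the client's SOPS file (a sketch; pipe it into `sops` rather than leaving it in shell history):
```bash
# 32-character masterkey (Zitadel requires exactly 32 bytes)
tr -dc 'A-Za-z0-9' </dev/urandom | head -c 32
```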
## OIDC Endpoints Reference
For configuring client applications:
```yaml
# Variables to provide to other apps
zitadel_issuer: "https://{{ zitadel_domain }}"
zitadel_authorization_endpoint: "https://{{ zitadel_domain }}/oauth/v2/authorize"
zitadel_token_endpoint: "https://{{ zitadel_domain }}/oauth/v2/token"
zitadel_userinfo_endpoint: "https://{{ zitadel_domain }}/oidc/v1/userinfo"
zitadel_jwks_uri: "https://{{ zitadel_domain }}/oauth/v2/keys"
zitadel_logout_endpoint: "https://{{ zitadel_domain }}/oidc/v1/end_session"
```
## Example Interactions
**Good prompt:** "Create the Ansible tasks to bootstrap Zitadel with an admin user and create an OIDC app for Nextcloud"
**Response approach:** Create idempotent tasks using Zitadel API, with proper error handling and credential storage.
**Good prompt:** "How should we configure Zitadel token lifetimes for security?"
**Response approach:** Recommend secure defaults (short access tokens, longer refresh tokens), explain trade-offs.
**Redirect prompt:** "How do I configure Nextcloud to use the OIDC credentials?"
**Response:** "Nextcloud OIDC configuration is handled by the Nextcloud Agent. I'll provide the following variables that Nextcloud needs: `zitadel_issuer`, `nextcloud_oidc_client_id`, `nextcloud_oidc_client_secret`. The Nextcloud Agent will configure the `user_oidc` app with these values."
## Troubleshooting Knowledge
### Common Issues
1. **Zitadel won't start**: Check database connectivity, masterkey format
2. **OIDC redirect fails**: Verify redirect URIs match exactly (trailing slashes!)
3. **Token validation fails**: Check clock sync, external domain configuration
4. **gRPC errors**: Ensure Traefik h2c configuration is correct
### Health Check
```bash
# Verify Zitadel is healthy
curl -s https://auth.example.com/debug/ready
# Check OIDC configuration
curl -s https://auth.example.com/.well-known/openid-configuration | jq
```

.gitignore (new file, 57 lines)
# Secrets - NEVER commit these
secrets/**/*.yaml
secrets/**/*.yml
!secrets/.sops.yaml
keys/age-key.txt
*.key
*.pem
# OpenTofu/Terraform state and variables
tofu/.terraform/
tofu/.terraform.lock.hcl
tofu/terraform.tfstate
tofu/terraform.tfstate.backup
tofu/*.tfvars
!tofu/terraform.tfvars.example
# Ansible
ansible/*.retry
ansible/.vault_pass
# OS files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
Thumbs.db
Desktop.ini
# Editor files
.vscode/
.idea/
*.swp
*.swo
*~
.env
.env.local
# Logs
*.log
logs/
# Backup files
*.bak
*.backup
# Python (if using scripts)
__pycache__/
*.py[cod]
*$py.class
.venv/
venv/
# Temporary files
tmp/
temp/
*.tmp

README.md (new file, 111 lines)
# Post-X Society Multi-Tenant Infrastructure
Infrastructure as Code for a scalable multi-tenant VPS platform running Zitadel (identity provider) and Nextcloud (file sync/share) on Hetzner Cloud.
## 🏗️ Architecture
- **Provisioning**: OpenTofu (open source Terraform fork)
- **Configuration**: Ansible with dynamic inventory
- **Secrets**: SOPS + Age encryption
- **Hosting**: Hetzner Cloud (EU-based, GDPR-compliant)
- **Identity**: Zitadel (Swiss company, AGPL 3.0)
- **Storage**: Nextcloud (German company, AGPL 3.0)
## 📁 Repository Structure
```
infrastructure/
├── .claude/agents/ # AI agent definitions for specialized tasks
├── docs/ # Architecture decisions and runbooks
├── tofu/ # OpenTofu configurations for Hetzner
├── ansible/ # Ansible playbooks and roles
├── secrets/ # SOPS-encrypted secrets (git-safe)
├── docker/ # Docker Compose configurations
└── scripts/ # Deployment and management scripts
```
## 🚀 Quick Start
### Prerequisites
- [OpenTofu](https://opentofu.org/) >= 1.6
- [Ansible](https://docs.ansible.com/) >= 2.15
- [SOPS](https://github.com/getsops/sops) + [Age](https://github.com/FiloSottile/age)
- [Hetzner Cloud account](https://www.hetzner.com/cloud)
### Initial Setup
1. **Clone repository**:
```bash
git clone <repo-url>
cd infrastructure
```
2. **Generate Age encryption key**:
```bash
age-keygen -o keys/age-key.txt
# Store securely in password manager!
```
3. **Configure OpenTofu variables**:
```bash
cp tofu/terraform.tfvars.example tofu/terraform.tfvars
# Edit with your Hetzner API token and configuration
```
4. **Provision infrastructure**:
```bash
cd tofu
tofu init
tofu plan
tofu apply
```
5. **Deploy applications**:
```bash
cd ../ansible
ansible-playbook playbooks/setup.yml
```
## 🎯 Project Principles
1. **EU/GDPR-first**: European vendors and data residency
2. **Truly open source**: Avoid source-available or restrictive licenses
3. **Client isolation**: Full separation between tenants
4. **Infrastructure as Code**: All changes via version control
5. **Security by default**: Encryption, hardening, least privilege
## 📖 Documentation
- [Architecture Decision Record](docs/architecture-decisions.md) - Complete design rationale
- [Runbook](docs/runbook.md) - Operational procedures (coming soon)
- [Agent Definitions](.claude/agents/) - Specialized AI agent instructions
## 🤝 Contributing
This project uses specialized AI agents for development:
- **Architect**: High-level design decisions
- **Infrastructure**: OpenTofu + Ansible implementation
- **Zitadel**: Identity provider configuration
- **Nextcloud**: File sync/share configuration
See individual agent files in `.claude/agents/` for responsibilities.
## 🔒 Security
- Secrets are encrypted with SOPS + Age before committing
- Age private keys are **NEVER** stored in this repository
- See `.gitignore` for protected files
## 📝 License
TBD
## 🙋 Support
For issues or questions, please create a GitHub issue with the appropriate label:
- `agent:architect` - Architecture/design questions
- `agent:infrastructure` - IaC implementation
- `agent:zitadel` - Identity provider
- `agent:nextcloud` - File sync/share

docs/architecture-decisions.md (new file, 810 lines)
# Infrastructure Architecture Decision Record
## Post-X Society Multi-Tenant VPS Platform
**Document Status:** Living document
**Created:** December 2024
**Last Updated:** December 2024
---
## Executive Summary
This document captures architectural decisions for a scalable, multi-tenant infrastructure platform starting with 10 identical VPS instances running Zitadel and Nextcloud, with plans to expand both server count and application offerings.
**Key Technology Choices:**
- **OpenTofu** over Terraform (truly open source, MPL 2.0)
- **SOPS + Age** over HashiCorp Vault (simple, no server, European-friendly)
- **Hetzner** for all infrastructure (GDPR-compliant, EU-based)
---
## 1. Infrastructure Provisioning
### Decision: OpenTofu + Ansible with Dynamic Inventory
**Choice:** Infrastructure as Code using OpenTofu for resource provisioning and Ansible for configuration management.
**Why OpenTofu over Terraform:**
- Truly open source (MPL 2.0) vs HashiCorp's BSL 1.1
- Drop-in replacement - same syntax, same providers
- Linux Foundation governance - no single company can close the license
- Active community after HashiCorp's 2023 license change
- No risk of future license restrictions
**Approach:**
- **OpenTofu** manages Hetzner resources (VPS instances, networks, firewalls, DNS)
- **Ansible** configures servers using the `hcloud` dynamic inventory plugin
- No static inventory files - Ansible queries Hetzner API at runtime
**Rationale:**
- 10+ identical servers makes manual management unsustainable
- Version-controlled infrastructure in Git
- Dynamic inventory eliminates sync issues between OpenTofu and Ansible
- Skills transfer to other providers if needed
**Implementation:**
```ini
# ansible.cfg
[inventory]
enable_plugins = hetzner.hcloud.hcloud
```
```yaml
# hcloud.yml (inventory config)
plugin: hetzner.hcloud.hcloud
locations:
  - fsn1
keyed_groups:
  - key: labels.role
    prefix: role
  - key: labels.client
    prefix: client
```
---
## 2. Application Deployment
### Decision: Modular Ansible Roles with Feature Flags
**Choice:** Each application is a separate Ansible role, enabled per-server via inventory variables.
**Rationale:**
- Allows heterogeneous deployments (client A wants Pretix, client B doesn't)
- Test new applications on single server before fleet rollout
- Clear separation of concerns
- Minimal refactoring when adding new applications
**Structure:**
```
ansible/
├── roles/
│ ├── common/ # Base setup, hardening, Docker
│ ├── traefik/ # Reverse proxy, SSL
│ ├── zitadel/ # Identity provider (Swiss, AGPL 3.0)
│ ├── nextcloud/
│ ├── pretix/ # Future
│ ├── listmonk/ # Future
│ ├── backup/ # Restic configuration
│ └── monitoring/ # Node exporter, promtail
```
**Inventory Example:**
```yaml
all:
children:
clients:
hosts:
client-alpha:
client_name: alpha
domain: alpha.platform.nl
apps:
- zitadel
- nextcloud
client-beta:
client_name: beta
domain: beta.platform.nl
apps:
- zitadel
- nextcloud
- pretix
```
---
## 3. DNS Management
### Decision: Hetzner DNS via OpenTofu
**Choice:** Manage all DNS records through Hetzner DNS using OpenTofu.
**Rationale:**
- Single provider for infrastructure and DNS simplifies management
- OpenTofu provider available and well-maintained (same as Terraform provider)
- Cost-effective (included with Hetzner)
- GDPR-compliant (EU-based)
**Domain Strategy:**
- Start with subdomains: `{client}.platform.nl`
- Support custom domains later via variable override
- Wildcard approach not used - explicit records per service
**Implementation:**
```hcl
resource "hcloud_server" "client" {
for_each = var.clients
name = each.key
server_type = each.value.server_type
# ...
}
resource "hetznerdns_record" "client_a" {
for_each = var.clients
zone_id = data.hetznerdns_zone.main.id
name = each.value.subdomain
type = "A"
value = hcloud_server.client[each.key].ipv4_address
}
```
**SSL Certificates:** Handled by Traefik with Let's Encrypt, automatic per-domain.
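The `letsencrypt` resolver referenced by the Traefik router labels corresponds to a static-configuration entry along these lines (a sketch; the contact address is a placeholder):
```yaml
# traefik.yml (static configuration)
certificatesResolvers:
  letsencrypt:
    acme:
      email: ops@example.org          # placeholder contact address
      storage: /letsencrypt/acme.json # persisted via a Docker volume
      tlsChallenge: {}
```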
---
## 4. Identity Provider
### Decision: Zitadel (replacing Keycloak)
**Choice:** Zitadel as the identity provider for all client installations.
**Why Zitadel over Keycloak:**
| Factor | Zitadel | Keycloak |
|--------|---------|----------|
| Company HQ | 🇨🇭 Switzerland | 🇺🇸 USA (IBM/Red Hat) |
| GDPR Jurisdiction | EU-adequate | US jurisdiction |
| License | AGPL 3.0 | Apache 2.0 |
| Multi-tenancy | Native design | Added later (2024) |
| Language | Go (lightweight) | Java (resource-heavy) |
| Architecture | Event-sourced, API-first | Traditional |
**Licensing Notes:**
- Zitadel v3 (March 2025) changed from Apache 2.0 to AGPL 3.0
- For our use case (running Zitadel as IdP), this has zero impact
- AGPL only requires source disclosure if you modify Zitadel AND provide it as a service
- SDKs and APIs remain Apache 2.0
**Company Background:**
- CAOS Ltd., headquartered in St. Gallen, Switzerland
- Founded 2019, $15.5M funding (Series A)
- Switzerland has EU data protection adequacy status
- Public product roadmap, transparent development
**Deployment:**
```yaml
# docker-compose.yml snippet
services:
zitadel:
image: ghcr.io/zitadel/zitadel:v3.x.x # Pin version
command: start-from-init
environment:
ZITADEL_DATABASE_POSTGRES_HOST: postgres
ZITADEL_EXTERNALDOMAIN: ${CLIENT_DOMAIN}
depends_on:
- postgres
```
**Multi-tenancy Approach:**
- Each client gets isolated Zitadel organization
- Single Zitadel instance can manage multiple organizations
- Or: fully isolated Zitadel per client (current choice for maximum isolation)
---
## 5. Backup Strategy
### Decision: Dual Backup Approach
**Choice:** Hetzner automated snapshots + Restic application-level backups to Hetzner Storage Box.
#### Layer 1: Hetzner Snapshots
**Purpose:** Disaster recovery (complete server loss)
| Aspect | Configuration |
|--------|---------------|
| Frequency | Daily (Hetzner automated) |
| Retention | 7 snapshots |
| Cost | 20% of VPS price |
| Restoration | Full server restore via Hetzner console/API |
**Limitations:**
- Crash-consistent only (may catch database mid-write)
- Same datacenter (not true off-site)
- Coarse granularity (all or nothing)
#### Layer 2: Restic to Hetzner Storage Box
**Purpose:** Granular application recovery, off-server storage
**Backend Choice:** Hetzner Storage Box
**Rationale:**
- GDPR-compliant (German/EU data residency)
- Same Hetzner network = fast transfers, no egress costs
- Cost-effective (~€3.81/month for BX10 with 1TB)
- Supports SFTP, CIFS/Samba, rsync, Restic-native
- Can be accessed from all VPSs simultaneously
**Storage Hierarchy:**
```
Storage Box (BX10 or larger)
└── /backups/
├── /client-alpha/
│ ├── /restic-repo/ # Encrypted Restic repository
│ └── /manual/ # Ad-hoc exports if needed
├── /client-beta/
│ └── /restic-repo/
└── /client-gamma/
└── /restic-repo/
```
**Connection Method:**
- Primary: SFTP (native Restic support, encrypted in transit; initialization sketched below)
- Optional: CIFS mount for manual file access
- Each client VPS gets Storage Box sub-account or uses main credentials with path restrictions
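Repository initialization over the primary SFTP method might look like this (a sketch; the sub-account and hostname are illustrative):
```bash
# One-time per client; the repo password comes from the SOPS-managed secrets
export RESTIC_PASSWORD_FILE=/root/.restic-password
restic -r sftp:u123456-sub1@u123456.your-storagebox.de:/backups/client-alpha/restic-repo init
```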
| Aspect | Configuration |
|--------|---------------|
| Frequency | Nightly (after DB dumps) |
| Time | 03:00 local time |
| Retention | 7 daily, 4 weekly, 6 monthly |
| Encryption | Restic default (AES-256) |
| Repo passwords | Stored in SOPS-encrypted files |
**What Gets Backed Up:**
```
/opt/docker/
├── nextcloud/
│ └── data/ # ✓ User files
├── zitadel/
│ └── db-dumps/ # ✓ PostgreSQL dumps (not live DB)
├── pretix/
│ └── data/ # ✓ When applicable
└── configs/ # ✓ docker-compose files, env
```
**Backup Ansible Role Tasks:**
1. Install Restic
2. Initialize repo (if not exists)
3. Configure SFTP connection to Storage Box
4. Create pre-backup script (database dumps)
5. Create backup script (sketched below)
6. Create systemd timer
7. Configure backup monitoring (alert on failure)
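A condensed sketch of the script those tasks render (paths and the environment file are assumptions; the retention flags mirror the policy table above):
```bash
#!/bin/bash
# backup.sh - rendered by the backup role; run via systemd timer at 03:00
set -euo pipefail
source /etc/restic/restic.env   # RESTIC_REPOSITORY, RESTIC_PASSWORD (assumed path)

# Pre-backup: consistent database dump (not the live DB files)
docker exec zitadel-db pg_dump -U zitadel zitadel > /opt/docker/zitadel/db-dumps/zitadel.sql

# Back up the application tree
restic backup /opt/docker

# Apply the retention policy: 7 daily, 4 weekly, 6 monthly
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
```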
**Sizing Guidance:**
- Start with BX10 (1TB) for 10 clients
- Monitor usage monthly
- Scale to BX20 (2TB) when approaching 70% capacity
**Verification:**
- Weekly `restic check` via cron
- Monthly test restore to staging environment
- Alerts on backup job failures
---
## 6. Secrets Management
### Decision: SOPS + Age Encryption
**Choice:** File-based secrets encryption using SOPS with Age encryption, stored in Git.
**Why SOPS + Age over HashiCorp Vault:**
- No additional server to maintain
- Truly open source (MPL 2.0 for SOPS, Apache 2.0 for Age)
- Secrets versioned alongside infrastructure code
- Simple to understand and debug
- Age developed in the open by cryptographer Filippo Valsorda (FiloSottile)
- Perfect for 10-50 server scale
- No vendor lock-in concerns
**How It Works:**
1. Secrets stored in YAML files, encrypted with Age
2. Only the values are encrypted, keys remain readable
3. Decryption happens at Ansible runtime
4. One Age key per environment (or shared across all)
**Example Encrypted File:**
```yaml
# secrets/client-alpha.sops.yaml
db_password: ENC[AES256_GCM,data:kH3x9...,iv:abc...,tag:def...,type:str]
zitadel_admin: ENC[AES256_GCM,data:mN4y2...,iv:ghi...,tag:jkl...,type:str]
nextcloud_admin: ENC[AES256_GCM,data:pQ5z7...,iv:mno...,tag:pqr...,type:str]
restic_repo_password: ENC[AES256_GCM,data:rS6a1...,iv:stu...,tag:vwx...,type:str]
```
**Key Management:**
```
keys/
├── age-key.txt # Master key (NEVER in Git, backed up securely)
└── .sops.yaml # SOPS configuration (in Git)
```
**.sops.yaml Configuration:**
```yaml
creation_rules:
- path_regex: secrets/.*\.sops\.yaml$
age: age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
**Secret Structure:**
```
secrets/
├── .sops.yaml # SOPS config
├── shared.sops.yaml # Shared secrets (Storage Box, API tokens)
└── clients/
├── alpha.sops.yaml # Client-specific secrets
├── beta.sops.yaml
└── gamma.sops.yaml
```
**Ansible Integration:**
```yaml
# Using community.sops collection
- name: Load client secrets
community.sops.load_vars:
file: "secrets/clients/{{ client_name }}.sops.yaml"
name: client_secrets
- name: Use decrypted secret
ansible.builtin.template:
src: docker-compose.yml.j2
dest: /opt/docker/docker-compose.yml
vars:
db_password: "{{ client_secrets.db_password }}"
```
**Daily Operations:**
```bash
# Encrypt a new file
sops --encrypt --age $(cat keys/age-key.pub) secrets/clients/new.yaml > secrets/clients/new.sops.yaml
# Edit existing secrets (decrypts, opens editor, re-encrypts)
SOPS_AGE_KEY_FILE=keys/age-key.txt sops secrets/clients/alpha.sops.yaml
# View decrypted content
SOPS_AGE_KEY_FILE=keys/age-key.txt sops --decrypt secrets/clients/alpha.sops.yaml
```
**Key Backup Strategy:**
- Age private key stored in password manager (Bitwarden/1Password)
- Printed paper backup in secure location
- Key never stored in Git repository
- Consider key escrow for bus factor
**Advantages for This Setup:**
| Aspect | Benefit |
|--------|---------|
| Simplicity | No Vault server to maintain, secure, or update |
| Auditability | Git history shows who changed what secrets when |
| Portability | Works offline, no network dependency |
| Reliability | No secrets server = no secrets server downtime |
| Cost | Zero infrastructure cost |
---
## 7. Monitoring
### Decision: Centralized Uptime Kuma
**Choice:** Uptime Kuma on dedicated monitoring server.
**Rationale:**
- Simple to deploy and maintain
- Beautiful UI for status overview
- Flexible alerting (email, Slack, webhook)
- Self-hosted (data stays in-house)
- Sufficient for "is it up?" monitoring at current scale
**Deployment:**
- Dedicated VPS or container on monitoring server (compose sketch below)
- Monitors all client servers and services
- Public status page optional per client
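A minimal Docker Compose sketch for that monitoring host (the image tag is illustrative; pin whatever release is current):
```yaml
# docker-compose.yml on the monitoring server
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1.23.11  # pin explicitly, per project principles
    restart: unless-stopped
    volumes:
      - uptime-kuma-data:/app/data
    ports:
      - "3001:3001"  # typically placed behind Traefik instead
volumes:
  uptime-kuma-data:
```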
**Monitors per Client:**
- HTTPS endpoint (Nextcloud)
- HTTPS endpoint (Zitadel)
- TCP port checks (database, if exposed)
- Docker container health (via API or agent)
**Alerting:**
- Primary: Email
- Secondary: Slack/Mattermost webhook
- Escalation: SMS for extended downtime (future)
**Future Expansion Path:**
When deeper metrics needed:
1. Add Prometheus + Node Exporter
2. Add Grafana dashboards
3. Add Loki for log aggregation
4. Uptime Kuma remains for synthetic monitoring
---
## 8. Client Isolation
### Decision: Full Isolation
**Choice:** Maximum isolation between clients at all levels.
**Implementation:**
| Layer | Isolation Method |
|-------|------------------|
| Compute | Separate VPS per client |
| Network | Hetzner firewall rules, no inter-VPS traffic |
| Database | Separate PostgreSQL container per client |
| Storage | Separate Docker volumes |
| Backups | Separate Restic repositories |
| Secrets | Separate SOPS files per client |
| DNS | Separate records/domains |
**Network Rules:**
- Each VPS accepts traffic only on 80, 443, 22 (management IP only)
- No private network between client VPSs
- Monitoring server can reach all clients (outbound checks); see the firewall sketch below
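A sketch of how `tofu/firewall.tf` might express these rules (Hetzner cloud firewalls deny inbound traffic that no rule allows; `management_ips` is an assumed variable):
```hcl
resource "hcloud_firewall" "client" {
  for_each = var.clients
  name     = "${each.key}-fw"

  # SSH restricted to management addresses
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = var.management_ips
  }

  # HTTP/HTTPS open to the world
  dynamic "rule" {
    for_each = ["80", "443"]
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = rule.value
      source_ips = ["0.0.0.0/0", "::/0"]
    }
  }

  # Attach by label; servers carry a matching client label
  apply_to {
    label_selector = "client=${each.key}"
  }
}
```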
**Rationale:**
- Security: Compromise of one client cannot spread
- Compliance: Data separation demonstrable
- Operations: Can maintain/upgrade clients independently
- Billing: Clear resource attribution
---
## 9. Deployment Strategy
### Decision: Canary Deployments with Version Pinning
**Choice:** Staged rollouts with explicit version control.
#### Version Pinning
All container images use explicit tags:
```yaml
# docker-compose.yml
services:
nextcloud:
image: nextcloud:28.0.1 # Never use :latest
  zitadel:
    image: ghcr.io/zitadel/zitadel:v3.0.0
postgres:
image: postgres:16.1
```
Version updates require explicit change and commit.
#### Canary Process
**Inventory Groups:**
```yaml
all:
children:
canary:
hosts:
client-alpha: # Designated test client (internal or willing partner)
production:
hosts:
client-beta:
client-gamma:
# ... remaining clients
```
**Deployment Script:**
```bash
#!/bin/bash
set -e
echo "=== Deploying to canary ==="
ansible-playbook deploy.yml --limit canary
echo "=== Waiting for verification ==="
read -p "Canary OK? Proceed to production? [y/N] " confirm
if [[ $confirm != "y" ]]; then
echo "Deployment aborted"
exit 1
fi
echo "=== Deploying to production ==="
ansible-playbook deploy.yml --limit production
```
#### Rollback Procedures
**Scenario 1: Bad container version**
```bash
# Revert version in docker-compose
git revert HEAD
# Redeploy
ansible-playbook deploy.yml --limit affected_hosts
```
**Scenario 2: Database migration issue**
```bash
# Restore from pre-upgrade Restic backup
restic -r sftp:user@backup-server:/client-x/restic-repo restore latest --target /tmp/restore
# Restore database dump
psql < /tmp/restore/db-dumps/zitadel.sql
# Revert and redeploy application
```
**Scenario 3: Complete server failure**
```bash
# Restore Hetzner snapshot via API
hcloud server rebuild <server-id> --image <snapshot-id>
# Or via OpenTofu
tofu apply -replace="hcloud_server.client[\"affected\"]"
```
---
## 10. Security Baseline
### Decision: Comprehensive Hardening
All servers receive the `common` Ansible role with:
#### SSH Hardening
```
# /etc/ssh/sshd_config (managed by Ansible)
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy
```
#### Firewall (UFW)
```yaml
- 22/tcp: Management IPs only
- 80/tcp: Any (redirects to 443)
- 443/tcp: Any
- All other: Deny
```
#### Automatic Updates
```
// unattended-upgrades configuration (apt syntax)
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::AutoFixInterruptedDpkg "true";
Unattended-Upgrade::Automatic-Reboot "false";  // Manual reboot control
```
#### Fail2ban
```yaml
# Jails enabled
- sshd
- traefik-auth (custom, for repeated 401s; sketched below)
```
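The custom jail needs a filter matching Traefik's access-log format; a sketch of both files, with paths and thresholds as assumptions:
```ini
# /etc/fail2ban/filter.d/traefik-auth.conf - match 401 responses in the access log
[Definition]
failregex = ^<HOST> \S+ \S+ \[.*\] "\S+ [^"]*" 401

# /etc/fail2ban/jail.d/traefik-auth.conf
[traefik-auth]
enabled  = true
port     = http,https
filter   = traefik-auth
logpath  = /opt/docker/traefik/logs/access.log
maxretry = 5
findtime = 10m
bantime  = 1h
```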
#### Container Security
```yaml
# Trivy scanning in CI/CD
- Scan images before deployment
- Block critical vulnerabilities
- Weekly scheduled scans of running containers
```
#### Additional Measures
- No password authentication anywhere
- Secrets encrypted with SOPS + Age, never plaintext in Git
- Regular dependency updates via Dependabot/Renovate
- SSH keys rotated annually
---
## 11. Onboarding Procedure
### New Client Checklist
```markdown
## Client Onboarding: {CLIENT_NAME}
### Prerequisites
- [ ] Client agreement signed
- [ ] Domain/subdomain confirmed: _______________
- [ ] Contact email: _______________
- [ ] Desired applications: [ ] Zitadel [ ] Nextcloud [ ] Pretix [ ] Listmonk
### Infrastructure
- [ ] Add client to `tofu/variables.tf`
- [ ] Add client to `ansible/inventory/clients.yml`
- [ ] Create secrets file: `sops secrets/clients/{name}.sops.yaml`
- [ ] Create Storage Box subdirectory for backups
- [ ] Run: `tofu apply`
- [ ] Run: `ansible-playbook playbooks/setup.yml --limit {client}`
### Verification
- [ ] HTTPS accessible
- [ ] Zitadel admin login works
- [ ] Nextcloud admin login works
- [ ] Backup job runs successfully
- [ ] Monitoring checks green
### Handover
- [ ] Send credentials securely (1Password link, Signal, etc.)
- [ ] Schedule onboarding call if needed
- [ ] Add to status page (if applicable)
- [ ] Document any custom configuration
### Estimated Time: 30-45 minutes
```
---
## 12. Offboarding Procedure
### Client Removal Checklist
```markdown
## Client Offboarding: {CLIENT_NAME}
### Pre-Offboarding
- [ ] Confirm termination date: _______________
- [ ] Data export requested? [ ] Yes [ ] No
- [ ] Final invoice sent
### Data Export (if requested)
- [ ] Export Nextcloud data
- [ ] Export Zitadel organization/users
- [ ] Provide secure download link
- [ ] Confirm receipt
### Infrastructure Removal
- [ ] Disable monitoring checks (set maintenance mode first)
- [ ] Create final backup (retain per policy)
- [ ] Remove from Ansible inventory
- [ ] Remove from OpenTofu config
- [ ] Run: `tofu apply` (destroys VPS)
- [ ] Remove DNS records (automatic via OpenTofu)
- [ ] Remove/archive SOPS secrets file
### Backup Retention
- [ ] Move Restic repo to archive path
- [ ] Set deletion date: _______ (default: 90 days post-termination)
- [ ] Schedule deletion job
### Cleanup
- [ ] Remove from status page
- [ ] Update client count in documentation
- [ ] Archive client folder in documentation
### Verification
- [ ] DNS no longer resolves
- [ ] IP returns nothing
- [ ] Monitoring shows no alerts (host removed)
- [ ] Billing stopped
### Estimated Time: 15-30 minutes
```
### Data Retention Policy
| Data Type | Retention Post-Offboarding |
|-----------|---------------------------|
| Application data (Restic) | 90 days |
| Hetzner snapshots | Deleted immediately (with VPS) |
| SOPS secrets files | Archived 90 days, then deleted |
| Logs | 30 days |
| Invoices/contracts | 7 years (legal requirement) |
---
## 13. Repository Structure
```
infrastructure/
├── README.md
├── docs/
│ ├── architecture-decisions.md # This document
│ ├── runbook.md # Operational procedures
│ └── clients/ # Per-client notes
│ ├── alpha.md
│ └── beta.md
├── tofu/ # OpenTofu configuration
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ ├── dns.tf
│ ├── firewall.tf
│ └── versions.tf
├── ansible/
│ ├── ansible.cfg
│ ├── hcloud.yml # Dynamic inventory config
│ ├── playbooks/
│ │ ├── setup.yml # Initial server setup
│ │ ├── deploy.yml # Deploy/update applications
│ │ ├── upgrade.yml # System updates
│ │ └── backup-restore.yml # Manual backup/restore
│ ├── roles/
│ │ ├── common/
│ │ ├── docker/
│ │ ├── traefik/
│ │ ├── zitadel/
│ │ ├── nextcloud/
│ │ ├── backup/
│ │ └── monitoring-agent/
│ └── group_vars/
│ └── all.yml
├── secrets/ # SOPS-encrypted secrets
│ ├── .sops.yaml # SOPS configuration
│ ├── shared.sops.yaml # Shared secrets
│ └── clients/
│ ├── alpha.sops.yaml
│ └── beta.sops.yaml
├── docker/
│ ├── docker-compose.base.yml # Common services
│ └── docker-compose.apps.yml # Application services
└── scripts/
├── deploy.sh # Canary deployment wrapper
├── onboard-client.sh
└── offboard-client.sh
```
**Note:** The Age private key (`age-key.txt`) is NOT stored in this repository. It must be:
- Stored in a password manager
- Backed up securely offline
- Available on deployment machine only
---
## 14. Open Decisions / Future Considerations
### To Decide Later
- [ ] Shared Zitadel instance vs isolated instances per client
- [ ] Central logging (Loki) - when/if needed
- [ ] Prometheus metrics - when/if needed
- [ ] Custom domain SSL workflow
- [ ] Client self-service portal
### Scaling Triggers
- **20+ servers:** Consider Kubernetes or Nomad
- **Multi-region:** Add OpenTofu workspaces per region
- **Team growth:** Consider moving from SOPS to Infisical for better access control
- **Complex secret rotation:** May need dedicated secrets server
---
## 15. Technology Choices Rationale
### Why We Chose Open Source / European-Friendly Tools
| Tool | Chosen | Avoided | Reason |
|------|--------|---------|--------|
| IaC | OpenTofu | Terraform | BSL license concerns, HashiCorp trust issues |
| Secrets | SOPS + Age | HashiCorp Vault | Simplicity, no US vendor dependency, truly open source |
| Identity | Zitadel | Keycloak | Swiss company, GDPR-adequate jurisdiction, native multi-tenancy |
| DNS | Hetzner DNS | Cloudflare | EU-based, GDPR-native, single provider |
| Hosting | Hetzner | AWS/GCP/Azure | EU-based, cost-effective, GDPR-compliant |
| Backup | Restic + Hetzner Storage Box | Cloud backup services | Open source, EU data residency |
**Guiding Principles:**
1. Prefer truly open source (OSI-approved) over source-available
2. Prefer EU-based services for GDPR simplicity
3. Avoid vendor lock-in where practical
4. Choose simplicity appropriate to scale (10-50 servers)
---
## Changelog
| Date | Change | Author |
|------|--------|--------|
| 2024-12 | Initial architecture decisions | Pieter / Claude |
| 2024-12 | Added Hetzner Storage Box as Restic backend | Pieter / Claude |
| 2024-12 | Switched from Terraform to OpenTofu (licensing concerns) | Pieter / Claude |
| 2024-12 | Switched from HashiCorp Vault to SOPS + Age (simplicity, open source) | Pieter / Claude |
| 2024-12 | Switched from Keycloak to Zitadel (Swiss company, GDPR jurisdiction) | Pieter / Claude |