# Bluesky Collector - Operations Guide

## Quick Reference

### Current Status (March 30, 2026)

- **Collector:** ❌ STOPPED (data collection complete)
- **Scheduler:** ❌ STOPPED (no further automated runs)
- **Web Interface:** ✅ RUNNING (http://localhost:5001)
- **Database:** ✅ RUNNING (PostgreSQL on port 5433)

---

## Starting and Stopping Services

### View Current Service Status

```bash
cd /Users/pieter/Nextcloud-Hetzner/PXS\ Cloud/Projects/26004\ HEIO\ 2/04\ Applications/bluesky-collector
docker compose ps
```

### Start All Services

```bash
docker compose up -d
```

This starts:

- `db` - PostgreSQL database (port 5433)
- `web` - Web interface (port 5001)
- `collector` - Data collection service
- `scheduler` - Automated collection scheduler (runs every 4 hours)

### Stop Collection Only (Keep Web Interface)

```bash
docker compose stop scheduler collector
```

This configuration allows browsing collected data without gathering new content.

### Start Collection Services

```bash
docker compose start scheduler collector
```

### Stop All Services

```bash
docker compose down
```

**Warning:** This will stop the web interface and database. Data is preserved in Docker volumes.

### Stop and Remove Everything (Including Data)

```bash
docker compose down -v
```

**⚠️ DANGER:** This deletes all collected data permanently!
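After `docker compose up -d`, a quick way to confirm that all four services came up is to compare the running set against the expected names. A minimal sketch, assuming the compose service names listed above:

```bash
#!/usr/bin/env bash
# Sketch: compare running compose services against the expected four.
expected="collector db scheduler web"
# List only running services, sorted, as a single space-separated line
running=$(docker compose ps --services --filter status=running | sort | tr '\n' ' ' | sed 's/ *$//')
if [ "$running" = "$expected" ]; then
  echo "all services running"
else
  echo "running: ${running:-none} (expected: $expected)"
fi
```

Run it from the project directory so `docker compose` picks up the right `docker-compose.yml`.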
---

## Service Details

### Database (PostgreSQL)

- **Image:** `postgres:16-alpine`
- **Port:** 5433 (external) → 5432 (internal)
- **Data Volume:** `pgdata`
- **Access:**

```bash
docker compose exec db psql -U bluesky -d bluesky
```

### Web Interface

- **URL:** http://localhost:5001
- **Port:** 5001
- **Stack:** Flask + Gunicorn
- **Pages:**
  - `/` - Dashboard with collection stats
  - `/accounts` - Account toxicity summary
  - `/statuses` - Posts and replies browser
  - `/mentions` - Mentions browser
  - `/analysis` - Toxicity analysis overview
  - `/analysis/flagged` - Flagged content with human review
  - `/export` - Data export options

### Collector Service

- **Schedule:** Every 4 hours (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
- **Function:** Collects new posts and mentions from the Bluesky API
- **Logs:**

```bash
docker compose logs -f collector
```

### Scheduler Service

- **Image:** `mcuadros/ofelia`
- **Function:** Triggers collector and analyzer jobs on schedule
- **Jobs:**
  - `collect` - Runs at 0 minutes past every 4th hour
  - `analyze` - Runs at 30 minutes past every 4th hour
- **Logs:**

```bash
docker compose logs -f scheduler
```

---

## Manual Operations

### Run Manual Collection

```bash
docker compose exec collector python -m src
```

Collects posts and mentions immediately, outside of the schedule.

### Run Manual Analysis

```bash
docker compose exec collector python -m src.analyzer
```

Analyzes all unscored posts and mentions using the OpenAI API.

**Cost Warning:** Analysis incurs OpenAI API costs. Check batch size settings first.
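Before kicking off a manual analysis run, it can help to see how much unscored content is queued, since that is what drives API spend. A sketch, assuming the join keys used by the export and monitoring queries elsewhere in this guide (`toxicity_scores.uri`, `mention_toxicity_scores.mention_id`):

```bash
# Count posts and mentions that have no toxicity score yet
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT
  (SELECT COUNT(*) FROM posts p
     LEFT JOIN toxicity_scores ts ON ts.uri = p.uri
    WHERE ts.uri IS NULL) AS unscored_posts,
  (SELECT COUNT(*) FROM mentions m
     LEFT JOIN mention_toxicity_scores mts ON mts.mention_id = m.id
    WHERE mts.mention_id IS NULL) AS unscored_mentions;
"
```

Dividing `unscored_posts` by `ANALYZER_BATCH_SIZE` gives a rough count of API calls the run will make.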
### Analyze Specific Batch Size

```bash
docker compose exec collector python -m src.analyzer --batch-size 50 --limit 100
```

Options:

- `--batch-size N` - Number of posts per API call (default: 10)
- `--limit N` - Maximum posts to analyze (default: 0 = unlimited)
- `--concurrency N` - Parallel API requests (default: 3)

### View Recent Logs

```bash
# All services
docker compose logs --tail 100

# Specific service
docker compose logs --tail 50 collector
docker compose logs --tail 50 web

# Follow logs in real-time
docker compose logs -f collector
```

---

## Database Operations

### Access Database Shell

```bash
docker compose exec db psql -U bluesky -d bluesky
```

### Common Queries

#### Check Collection Status

```sql
SELECT started_at::date AS date,
       COUNT(*) AS runs,
       SUM(posts_collected) AS total_posts,
       SUM(mentions_collected) AS total_mentions,
       SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS successful
FROM collection_runs
WHERE started_at >= '2026-01-01'
GROUP BY started_at::date
ORDER BY date DESC;
```

#### Count Flagged Content

```sql
-- Posts/Replies
SELECT COUNT(*) FROM toxicity_scores WHERE overall >= 0.5;

-- Mentions (unique posts)
SELECT COUNT(DISTINCT m.post_uri)
FROM mention_toxicity_scores mts
JOIN mentions m ON m.id = mts.mention_id
WHERE mts.overall >= 0.5;
```

#### Human Review Progress

```sql
SELECT CASE WHEN review_status IS NULL THEN 'Unreviewed' ELSE review_status END AS status,
       COUNT(*) AS count
FROM toxicity_scores
WHERE overall >= 0.5
GROUP BY review_status;
```

### Backup Database

```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup_$(date +%Y%m%d).sql
```

Note the `-T` flag: without it, `docker compose exec` allocates a pseudo-TTY, which can corrupt the redirected dump.

### Restore Database

```bash
cat backup_20260330.sql | docker compose exec -T db psql -U bluesky -d bluesky
```

---

## Rebuilding Services

### Rebuild After Code Changes

```bash
# Rebuild specific service
docker compose build web
docker compose build collector

# Rebuild and restart
docker compose up -d --build web

# Rebuild everything
docker compose build
docker compose up -d
```

### Apply Database Migrations

```bash
# View available migrations
ls scripts/*.sql

# Apply specific migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/04-human-review.sql
```

---

## Configuration

### Environment Variables (.env file)

```bash
# Database
POSTGRES_USER=bluesky
POSTGRES_PASSWORD=changeme
POSTGRES_PORT=5433

# Web Interface
WEB_PORT=5001

# Bluesky API (for authenticated search)
BSKY_HANDLE=your-handle.bsky.social
BSKY_APP_PASSWORD=your-app-password

# OpenAI API (for toxicity analysis)
OPENAI_API_KEY=sk-...

# Analysis Settings
ANALYZER_MODEL=gpt-4.1-nano
ANALYZER_CONCURRENCY=3
ANALYZER_BATCH_SIZE=10
ANALYZER_LIMIT=0

# Collection Settings
MAX_PAGES_PER_ACCOUNT=50
MENTION_LOOKBACK_HOURS=12
LOG_LEVEL=INFO
```

### Tracked Accounts (config/accounts.yml)

```yaml
accounts:
  - handle: example.bsky.social  # Account to monitor
  - handle: another.bsky.social
```

Add or remove accounts, then restart the collector:

```bash
docker compose restart collector
```

---

## Troubleshooting

### Web Interface Not Loading

```bash
# Check if web service is running
docker compose ps web

# Check web logs for errors
docker compose logs --tail 50 web

# Restart web service
docker compose restart web
```

### Collector Not Running

```bash
# Check scheduler is running
docker compose ps scheduler

# Check collector status
docker compose ps collector

# Start scheduler if stopped
docker compose start scheduler

# Check scheduler logs
docker compose logs scheduler
```

### Database Connection Issues

```bash
# Check database health
docker compose ps db

# Restart database
docker compose restart db

# Check database logs
docker compose logs db
```

### Out of Disk Space

```bash
# Check Docker disk usage
docker system df

# Remove unused images/containers
docker system prune

# Check database size
docker compose exec db psql -U bluesky -d bluesky -c "SELECT pg_size_pretty(pg_database_size('bluesky'));"
```

### Analysis Failing (OpenAI API)

```bash
# Check API key is set
docker compose exec collector printenv | grep OPENAI_API_KEY

# Test API connectivity (the client reads OPENAI_API_KEY from the container environment;
# do not interpolate the key from the host shell, where it is typically unset)
docker compose exec collector python -c "from openai import OpenAI; OpenAI().models.list()"

# Check rate limits in logs
docker compose logs collector | grep -i "rate limit"
```

---

## Performance Tuning

### Increase Collection Speed

Edit `docker-compose.yml`:

```yaml
environment:
  MAX_PAGES_PER_ACCOUNT: 100   # Increase from 50
  MENTION_LOOKBACK_HOURS: 24   # Increase lookback
```

### Increase Analysis Speed

```yaml
environment:
  ANALYZER_CONCURRENCY: 5   # More parallel requests
  ANALYZER_BATCH_SIZE: 20   # Bigger batches
```

**Cost Warning:** Higher concurrency and larger batches mean higher OpenAI API costs.

### Change Collection Schedule

Edit `docker-compose.yml` under the collector labels:

```yaml
labels:
  ofelia.job-exec.collect.schedule: "0 0 */2 * * *"   # Every 2 hours
  ofelia.job-exec.analyze.schedule: "0 30 */2 * * *"  # 30 min after collection
```

Restart the scheduler after changes:

```bash
docker compose restart scheduler
```

---

## Data Export

### Export to CSV via Web Interface

1. Navigate to http://localhost:5001/export
2. Select date range and filters
3. Click "Export to CSV"

### Export via Command Line

#### All Posts

```bash
docker compose exec -T db psql -U bluesky -d bluesky -c "COPY (
  SELECT p.uri, p.author_did, a.handle, p.text, p.created_at, p.post_type,
         ts.overall, ts.toxic, ts.hate_speech, ts.threat
  FROM posts p
  LEFT JOIN accounts a ON a.did = p.author_did
  LEFT JOIN toxicity_scores ts ON ts.uri = p.uri
  WHERE p.created_at >= '2026-01-01'
) TO STDOUT CSV HEADER" > posts_export.csv
```

#### Flagged Content with Reviews

```bash
docker compose exec -T db psql -U bluesky -d bluesky -c "COPY (
  SELECT p.uri, a.handle, p.text, p.created_at,
         ts.overall, ts.human_reviewed, ts.review_status, ts.reviewed_at
  FROM toxicity_scores ts
  JOIN posts p ON p.uri = ts.uri
  LEFT JOIN accounts a ON a.did = p.author_did
  WHERE ts.overall >= 0.5 AND p.created_at >= '2026-01-01'
  ORDER BY ts.overall DESC
) TO STDOUT CSV HEADER" > flagged_export.csv
```

---

## Restarting Data Collection (If Needed)

### Resume Collection After a Pause

1. Start the services:

   ```bash
   docker compose start scheduler collector
   ```

2. Verify collection runs:

   ```bash
   docker compose logs -f collector
   ```

3. Check the database for new entries:

   ```bash
   docker compose exec db psql -U bluesky -d bluesky -c "
   SELECT MAX(created_at) FROM posts;
   SELECT COUNT(*) FROM collection_runs WHERE started_at > NOW() - INTERVAL '1 day';
   "
   ```

### Start Fresh Collection (Keep Database)

1. Stop the services:

   ```bash
   docker compose down
   ```

2. Start only the database and web interface:

   ```bash
   docker compose up -d db web
   ```

3. Truncate collection tracking (optional):

   ```bash
   docker compose exec db psql -U bluesky -d bluesky -c "TRUNCATE collection_runs;"
   ```

4. Start the collector:

   ```bash
   docker compose up -d scheduler collector
   ```

### Complete Reset (Delete All Data)

```bash
# Stop everything
docker compose down

# Remove data volume
docker volume rm bluesky-collector_pgdata

# Restart from scratch
docker compose up -d
```

**⚠️ WARNING:** This deletes all collected posts, mentions, and analysis results permanently!
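The dated `pg_dump` backups shown earlier accumulate one file per day. A minimal rotation sketch that keeps only the newest dumps; `BACKUP_DIR` and the retention count `KEEP` are assumptions, adjust to your setup:

```bash
#!/usr/bin/env bash
# Sketch: keep the newest KEEP backup_YYYYMMDD.sql dumps, delete the rest.
# BACKUP_DIR and KEEP are assumptions; adjust to your setup.
BACKUP_DIR="${BACKUP_DIR:-.}"
KEEP="${KEEP:-7}"
# Dated filenames sort chronologically, so reverse-sorting puts the newest
# first; everything after line KEEP is an older dump and gets removed.
ls -1 "$BACKUP_DIR"/backup_*.sql 2>/dev/null | sort -r | tail -n +"$((KEEP + 1))" | while read -r f; do
  rm -- "$f"
done
```

Avoids GNU-only flags (e.g. `head -n -7`), so it works with the BSD tools on macOS as well.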
---

## Monitoring

### Collection Health Check

```bash
# Last 5 collection runs
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT started_at, finished_at, status, posts_collected, mentions_collected, errors
FROM collection_runs
ORDER BY started_at DESC
LIMIT 5;
"
```

### Analysis Progress

```bash
# Count scored vs unscored
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT
  (SELECT COUNT(*) FROM posts) AS total_posts,
  (SELECT COUNT(*) FROM toxicity_scores) AS scored_posts,
  (SELECT COUNT(*) FROM mentions) AS total_mentions,
  (SELECT COUNT(*) FROM mention_toxicity_scores) AS scored_mentions;
"
```

### Disk Usage

```bash
# Database size
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT pg_size_pretty(pg_database_size('bluesky')) AS db_size,
       pg_size_pretty(pg_total_relation_size('posts')) AS posts_table,
       pg_size_pretty(pg_total_relation_size('mentions')) AS mentions_table;
"
```

---

## Security Notes

1. **Never commit .env file** - Contains API keys and passwords
2. **Change default passwords** - PostgreSQL default password is `changeme`
3. **Firewall rules** - Ports 5001 (web) and 5433 (database) exposed to localhost only
4. **API keys** - Bluesky and OpenAI credentials stored in environment variables
5. **Data retention** - Contains personal data (Bluesky posts); handle per GDPR requirements

---

## Support

### Documentation

- Main findings: `FINDINGS.md`
- This operations guide: `OPERATIONS.md`
- Git repository: https://forgejo.postxsociety.cloud/pieter/bluesky-collector

### Logs Location

- Docker logs: `docker compose logs [service]`
- Application logs: `./logs/` directory (if volume mounted)

### Common Issues

1. **Port conflicts:** Change `WEB_PORT` or `POSTGRES_PORT` in `.env`
2. **Out of memory:** Reduce `ANALYZER_CONCURRENCY` or `ANALYZER_BATCH_SIZE`
3. **API rate limits:** Reduce collection frequency or batch size
4. **Disk full:** Run `docker system prune` and consider data export/cleanup

---

**Last Updated:** March 30, 2026
**Project Status:** Data collection complete, web interface available for analysis