# Bluesky Collector - Operations Guide

## Quick Reference

### Current Status (March 30, 2026)

- Collector: ❌ STOPPED (data collection complete)
- Scheduler: ❌ STOPPED (no further automated runs)
- Web Interface: ✅ RUNNING (http://localhost:5001)
- Database: ✅ RUNNING (PostgreSQL on port 5433)
## Starting and Stopping Services

### View Current Service Status

```bash
cd /Users/pieter/Nextcloud-Hetzner/PXS\ Cloud/Projects/26004\ HEIO\ 2/04\ Applications/bluesky-collector
docker compose ps
```
### Start All Services

```bash
docker compose up -d
```

This starts:

- `db` - PostgreSQL database (port 5433)
- `web` - Web interface (port 5001)
- `collector` - Data collection service
- `scheduler` - Automated collection scheduler (runs every 4 hours)
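To confirm which services the compose file actually defines (useful after editing docker-compose.yml), you can list them without starting anything. This is standard Docker Compose behaviour, not project-specific:

```bash
# List the service names defined in docker-compose.yml
# (expected here: db, web, collector, scheduler)
docker compose config --services
```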
### Stop Collection Only (Keep Web Interface)

```bash
docker compose stop scheduler collector
```

This configuration allows browsing collected data without gathering new content.
### Start Collection Services

```bash
docker compose start scheduler collector
```

### Stop All Services

```bash
docker compose down
```

Warning: This will stop the web interface and database. Data is preserved in Docker volumes.

### Stop and Remove Everything (Including Data)

```bash
docker compose down -v
```

⚠️ DANGER: This deletes all collected data permanently!
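If you must tear down with `-v`, taking a backup first is cheap insurance. A minimal sketch, reusing the backup command from the Database Operations section below and assuming the `db` service is still running:

```bash
# Dump the database before destroying the volume
docker compose exec db pg_dump -U bluesky bluesky > backup_$(date +%Y%m%d).sql

# Only then remove containers and volumes
docker compose down -v
```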
## Service Details

### Database (PostgreSQL)

- Image: `postgres:16-alpine`
- Port: 5433 (external) → 5432 (internal)
- Data Volume: `pgdata`
- Access: `docker compose exec db psql -U bluesky -d bluesky`
### Web Interface

- URL: http://localhost:5001
- Port: 5001
- Stack: Flask + Gunicorn
- Pages:
  - `/` - Dashboard with collection stats
  - `/accounts` - Account toxicity summary
  - `/statuses` - Posts and replies browser
  - `/mentions` - Mentions browser
  - `/analysis` - Toxicity analysis overview
  - `/analysis/flagged` - Flagged content with human review
  - `/export` - Data export options
### Collector Service

- Schedule: Every 4 hours (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
- Function: Collects new posts and mentions from the Bluesky API
- Logs: `docker compose logs -f collector`
### Scheduler Service

- Image: `mcuadros/ofelia`
- Function: Triggers collector and analyzer jobs on schedule
- Jobs:
  - `collect` - Runs at 0 minutes past every 4th hour
  - `analyze` - Runs at 30 minutes past every 4th hour
- Logs: `docker compose logs -f scheduler`
## Manual Operations

### Run Manual Collection

```bash
docker compose exec collector python -m src
```

Collects posts and mentions immediately (outside of the schedule).

### Run Manual Analysis

```bash
docker compose exec collector python -m src.analyzer
```

Analyzes all unscored posts/mentions using the OpenAI API.

Cost Warning: Analysis incurs OpenAI API costs. Check batch size settings before large runs.
### Analyze Specific Batch Size

```bash
docker compose exec collector python -m src.analyzer --batch-size 50 --limit 100
```

Options:

- `--batch-size N` - Number of posts per API call (default: 10)
- `--limit N` - Maximum posts to analyze (default: 0 = unlimited)
- `--concurrency N` - Parallel API requests (default: 3)
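Before a full run, a small smoke test keeps costs negligible while verifying output. The flags are the ones documented above; the specific values are just an illustration:

```bash
# Score a handful of posts serially to sanity-check output and cost
docker compose exec collector python -m src.analyzer --batch-size 5 --limit 20 --concurrency 1
```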
### View Recent Logs

```bash
# All services
docker compose logs --tail 100

# Specific service
docker compose logs --tail 50 collector
docker compose logs --tail 50 web

# Follow logs in real-time
docker compose logs -f collector
```
## Database Operations

### Access Database Shell

```bash
docker compose exec db psql -U bluesky -d bluesky
```

### Common Queries

#### Check Collection Status

```sql
SELECT
    started_at::date as date,
    COUNT(*) as runs,
    SUM(posts_collected) as total_posts,
    SUM(mentions_collected) as total_mentions,
    SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as successful
FROM collection_runs
WHERE started_at >= '2026-01-01'
GROUP BY started_at::date
ORDER BY date DESC;
```
#### Count Flagged Content

```sql
-- Posts/Replies
SELECT COUNT(*) FROM toxicity_scores WHERE overall >= 0.5;

-- Mentions (unique posts)
SELECT COUNT(DISTINCT m.post_uri)
FROM mention_toxicity_scores mts
JOIN mentions m ON m.id = mts.mention_id
WHERE mts.overall >= 0.5;
```
#### Human Review Progress

```sql
SELECT
    CASE
        WHEN review_status IS NULL THEN 'Unreviewed'
        ELSE review_status
    END as status,
    COUNT(*) as count
FROM toxicity_scores
WHERE overall >= 0.5
GROUP BY review_status;
```
### Backup Database

```bash
docker compose exec db pg_dump -U bluesky bluesky > backup_$(date +%Y%m%d).sql
```

### Restore Database

```bash
cat backup_20260330.sql | docker compose exec -T db psql -U bluesky -d bluesky
```
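For larger databases, PostgreSQL's custom dump format is smaller and supports selective restore. This variant uses only standard `pg_dump`/`pg_restore` options; it is a sketch, not something the project scripts provide:

```bash
# Compressed custom-format dump
docker compose exec db pg_dump -U bluesky -Fc bluesky > backup_$(date +%Y%m%d).dump

# Restore it (--clean drops and recreates objects first)
docker compose exec -T db pg_restore -U bluesky -d bluesky --clean < backup_20260330.dump
```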
## Rebuilding Services

### Rebuild After Code Changes

```bash
# Rebuild specific service
docker compose build web
docker compose build collector

# Rebuild and restart
docker compose up -d --build web

# Rebuild everything
docker compose build
docker compose up -d
```

### Apply Database Migrations

```bash
# View available migrations
ls scripts/*.sql

# Apply specific migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/04-human-review.sql
```
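To apply all migrations in filename order, a shell loop works. The numeric prefixes suggest the scripts are meant to run sequentially; that ordering, and whether they are safe to re-run on an existing database, are assumptions:

```bash
# Apply every migration in lexical (numeric-prefix) order
# Assumes scripts are idempotent or the database is fresh
for f in scripts/*.sql; do
  echo "Applying $f"
  docker compose exec -T db psql -U bluesky -d bluesky < "$f"
done
```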
## Configuration

### Environment Variables (.env file)

```bash
# Database
POSTGRES_USER=bluesky
POSTGRES_PASSWORD=changeme
POSTGRES_PORT=5433

# Web Interface
WEB_PORT=5001

# Bluesky API (for authenticated search)
BSKY_HANDLE=your-handle.bsky.social
BSKY_APP_PASSWORD=your-app-password

# OpenAI API (for toxicity analysis)
OPENAI_API_KEY=sk-...

# Analysis Settings
ANALYZER_MODEL=gpt-4.1-nano
ANALYZER_CONCURRENCY=3
ANALYZER_BATCH_SIZE=10
ANALYZER_LIMIT=0

# Collection Settings
MAX_PAGES_PER_ACCOUNT=50
MENTION_LOOKBACK_HOURS=12
LOG_LEVEL=INFO
```
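Note that running containers keep their old environment until recreated, so `.env` edits do not take effect on their own. A quick way to recreate and verify (standard Docker usage, not project-specific):

```bash
# Recreate containers so they pick up .env changes
docker compose up -d

# Inspect the environment the collector actually received
docker compose exec collector printenv | grep -E 'ANALYZER_|MAX_PAGES|MENTION_'
```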
### Tracked Accounts (config/accounts.yml)

```yaml
accounts:
  - handle: example.bsky.social   # Account to monitor
  - handle: another.bsky.social
```

Add or remove accounts, then restart the collector:

```bash
docker compose restart collector
```
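To confirm a new account was picked up, check the `accounts` table; the `handle` column is the same one the export queries later in this guide join against. Exactly when a row appears depends on the next collection run, so treat this as a sketch:

```bash
# List the handles the database currently knows about
docker compose exec db psql -U bluesky -d bluesky -c "SELECT handle FROM accounts ORDER BY handle;"
```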
## Troubleshooting

### Web Interface Not Loading

```bash
# Check if web service is running
docker compose ps web

# Check web logs for errors
docker compose logs --tail 50 web

# Restart web service
docker compose restart web
```

### Collector Not Running

```bash
# Check scheduler is running
docker compose ps scheduler

# Check collector status
docker compose ps collector

# Start scheduler if stopped
docker compose start scheduler

# Check scheduler logs
docker compose logs scheduler
```
### Database Connection Issues

```bash
# Check database health
docker compose ps db

# Restart database
docker compose restart db

# Check database logs
docker compose logs db
```

### Out of Disk Space

```bash
# Check Docker disk usage
docker system df

# Remove unused images/containers
docker system prune

# Check database size
docker compose exec db psql -U bluesky -d bluesky -c "SELECT pg_size_pretty(pg_database_size('bluesky'));"
```
### Analysis Failing (OpenAI API)

```bash
# Check API key is set inside the container
docker compose exec collector printenv | grep OPENAI_API_KEY

# Test API connectivity; the client reads OPENAI_API_KEY from the
# container's environment (passing '$OPENAI_API_KEY' on the host shell
# would expand to the host's value, which is usually empty)
docker compose exec collector python -c "from openai import OpenAI; OpenAI().models.list()"

# Check rate limits in logs
docker compose logs collector | grep -i "rate limit"
```
## Performance Tuning

### Increase Collection Speed

Edit docker-compose.yml:

```yaml
environment:
  MAX_PAGES_PER_ACCOUNT: 100   # Increase from 50
  MENTION_LOOKBACK_HOURS: 24   # Increase lookback
```

### Increase Analysis Speed

```yaml
environment:
  ANALYZER_CONCURRENCY: 5    # More parallel requests
  ANALYZER_BATCH_SIZE: 20    # Bigger batches
```

Cost Warning: Higher concurrency and batch size = higher OpenAI API costs.

### Change Collection Schedule

Edit docker-compose.yml under the collector's labels:

```yaml
labels:
  ofelia.job-exec.collect.schedule: "0 0 */2 * * *"    # Every 2 hours
  ofelia.job-exec.analyze.schedule: "0 30 */2 * * *"   # 30 min after collection
```

Restart the scheduler after changes:

```bash
docker compose restart scheduler
```
## Data Export

### Export to CSV via Web Interface

1. Navigate to http://localhost:5001/export
2. Select date range and filters
3. Click "Export to CSV"

### Export via Command Line

#### All Posts

```bash
docker compose exec db psql -U bluesky -d bluesky -c "COPY (
  SELECT p.uri, p.author_did, a.handle, p.text, p.created_at, p.post_type,
         ts.overall, ts.toxic, ts.hate_speech, ts.threat
  FROM posts p
  LEFT JOIN accounts a ON a.did = p.author_did
  LEFT JOIN toxicity_scores ts ON ts.uri = p.uri
  WHERE p.created_at >= '2026-01-01'
) TO STDOUT CSV HEADER" > posts_export.csv
```
#### Flagged Content with Reviews

```bash
docker compose exec db psql -U bluesky -d bluesky -c "COPY (
  SELECT p.uri, a.handle, p.text, p.created_at,
         ts.overall, ts.human_reviewed, ts.review_status, ts.reviewed_at
  FROM toxicity_scores ts
  JOIN posts p ON p.uri = ts.uri
  LEFT JOIN accounts a ON a.did = p.author_did
  WHERE ts.overall >= 0.5 AND p.created_at >= '2026-01-01'
  ORDER BY ts.overall DESC
) TO STDOUT CSV HEADER" > flagged_export.csv
```
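A related query that is handy alongside this export: the breakdown of human review outcomes over flagged content. The column names come from the export above, but the values stored in `review_status` and the type of `human_reviewed` (assumed boolean here) are project-specific, so treat this as a sketch:

```bash
# Proportion of each review outcome among flagged, reviewed posts
# (assumes human_reviewed is a boolean flag)
docker compose exec db psql -U bluesky -d bluesky -c "
  SELECT review_status,
         COUNT(*) AS n,
         ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
  FROM toxicity_scores
  WHERE overall >= 0.5 AND human_reviewed
  GROUP BY review_status
  ORDER BY n DESC;
"
```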
## Restarting Data Collection (If Needed)

### Resume Collection After Pause

1. Start services:

   ```bash
   docker compose start scheduler collector
   ```

2. Verify collection runs:

   ```bash
   docker compose logs -f collector
   ```

3. Check database for new entries:

   ```bash
   docker compose exec db psql -U bluesky -d bluesky -c "
     SELECT MAX(created_at) FROM posts;
     SELECT COUNT(*) FROM collection_runs WHERE started_at > NOW() - INTERVAL '1 day';
   "
   ```
### Start Fresh Collection (Keep Database)

1. Stop services:

   ```bash
   docker compose down
   ```

2. Start only database and web:

   ```bash
   docker compose up -d db web
   ```

3. Truncate collection tracking (optional):

   ```bash
   docker compose exec db psql -U bluesky -d bluesky -c "TRUNCATE collection_runs;"
   ```

4. Start collector:

   ```bash
   docker compose up -d scheduler collector
   ```
### Complete Reset (Delete All Data)

```bash
# Stop everything
docker compose down

# Remove data volume
docker volume rm bluesky-collector_pgdata

# Restart from scratch
docker compose up -d
```

⚠️ WARNING: This deletes all collected posts, mentions, and analysis results permanently!
## Monitoring

### Collection Health Check

```bash
# Last 5 collection runs
docker compose exec db psql -U bluesky -d bluesky -c "
  SELECT started_at, finished_at, status, posts_collected, mentions_collected, errors
  FROM collection_runs
  ORDER BY started_at DESC
  LIMIT 5;
"
```
### Analysis Progress

```bash
# Count scored vs unscored
docker compose exec db psql -U bluesky -d bluesky -c "
  SELECT
    (SELECT COUNT(*) FROM posts) as total_posts,
    (SELECT COUNT(*) FROM toxicity_scores) as scored_posts,
    (SELECT COUNT(*) FROM mentions) as total_mentions,
    (SELECT COUNT(*) FROM mention_toxicity_scores) as scored_mentions;
"
```
### Disk Usage

```bash
# Database size
docker compose exec db psql -U bluesky -d bluesky -c "
  SELECT
    pg_size_pretty(pg_database_size('bluesky')) as db_size,
    pg_size_pretty(pg_total_relation_size('posts')) as posts_table,
    pg_size_pretty(pg_total_relation_size('mentions')) as mentions_table;
"
```
## Security Notes

- **Never commit the .env file** - It contains API keys and passwords
- **Change default passwords** - The PostgreSQL default password is `changeme`
- **Firewall rules** - Ports 5001 (web) and 5433 (database) are exposed to localhost only
- **API keys** - Bluesky and OpenAI credentials are stored in environment variables
- **Data retention** - Collected content contains personal data (Bluesky posts); handle it per GDPR requirements
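Two quick checks for the first two points, using plain git and shell (nothing project-specific):

```bash
# Verify .env is ignored by git; with -v this prints the matching
# .gitignore rule, and prints nothing (exit 1) if .env is NOT ignored
git check-ignore -v .env

# Restrict .env to the owner
chmod 600 .env
```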
## Support

### Documentation

- Main findings: `FINDINGS.md`
- This operations guide: `OPERATIONS.md`
- Git repository: https://forgejo.postxsociety.cloud/pieter/bluesky-collector

### Logs Location

- Docker logs: `docker compose logs [service]`
- Application logs: `./logs/` directory (if volume mounted)

### Common Issues

- Port conflicts: Change `WEB_PORT` or `POSTGRES_PORT` in .env
- Out of memory: Reduce `ANALYZER_CONCURRENCY` or `ANALYZER_BATCH_SIZE`
- API rate limits: Reduce collection frequency or batch size
- Disk full: Run `docker system prune` and consider data export/cleanup
Last Updated: March 30, 2026
Project Status: Data collection complete, web interface available for analysis