
Bluesky Collector - Operations Guide

Quick Reference

Current Status (March 30, 2026)

  • Collector: STOPPED (data collection complete)
  • Scheduler: STOPPED (no further automated runs)
  • Web Interface: RUNNING (http://localhost:5001)
  • Database: RUNNING (PostgreSQL on port 5433)

Starting and Stopping Services

View Current Service Status

cd /Users/pieter/Nextcloud-Hetzner/PXS\ Cloud/Projects/26004\ HEIO\ 2/04\ Applications/bluesky-collector
docker compose ps

Start All Services

docker compose up -d

This starts:

  • db - PostgreSQL database (port 5433)
  • web - Web interface (port 5001)
  • collector - Data collection service
  • scheduler - Automated collection scheduler (runs every 4 hours)

Stop Collection Only (Keep Web Interface)

docker compose stop scheduler collector

This configuration allows browsing collected data without gathering new content.

Start Collection Services

docker compose start scheduler collector

Stop All Services

docker compose down

Warning: This will stop the web interface and database. Data is preserved in Docker volumes.

Stop and Remove Everything (Including Data)

docker compose down -v

⚠️ DANGER: This deletes all collected data permanently!


Service Details

Database (PostgreSQL)

  • Image: postgres:16-alpine
  • Port: 5433 (external) → 5432 (internal)
  • Data Volume: pgdata
  • Access:
    docker compose exec db psql -U bluesky -d bluesky
    

Web Interface

  • URL: http://localhost:5001
  • Port: 5001
  • Stack: Flask + Gunicorn
  • Pages:
    • / - Dashboard with collection stats
    • /accounts - Account toxicity summary
    • /statuses - Posts and replies browser
    • /mentions - Mentions browser
    • /analysis - Toxicity analysis overview
    • /analysis/flagged - Flagged content with human review
    • /export - Data export options
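
The pages above can be smoke-tested from the command line. A minimal sketch, assuming the web service is up on the default port (curl prints 000 for an unreachable server, so a stopped service is easy to spot):

```shell
# Request each page and print its HTTP status code.
# BASE matches the default WEB_PORT; adjust if you changed it in .env.
BASE="http://localhost:5001"
pages="/ /accounts /statuses /mentions /analysis /analysis/flagged /export"
checked=0
for page in $pages; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$BASE$page" || true)
  echo "$page -> $code"
  checked=$((checked + 1))
done
```

A healthy deployment returns 200 on every line; anything else points at the web logs (see Troubleshooting).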

Collector Service

  • Schedule: Every 4 hours (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
  • Function: Collects new posts and mentions from Bluesky API
  • Logs:
    docker compose logs -f collector
    

Scheduler Service

  • Image: mcuadros/ofelia
  • Function: Triggers collector and analyzer jobs on schedule
  • Jobs:
    • collect - Runs at 0 minutes past every 4th hour
    • analyze - Runs at 30 minutes past every 4th hour
  • Logs:
    docker compose logs -f scheduler
    

Manual Operations

Run Manual Collection

docker compose exec collector python -m src

Collects posts and mentions immediately (outside of schedule).

Run Manual Analysis

docker compose exec collector python -m src.analyzer

Analyzes all unscored posts/mentions using OpenAI API.

Cost Warning: Analysis incurs OpenAI API costs. Check batch size settings.

Analyze Specific Batch Size

docker compose exec collector python -m src.analyzer --batch-size 50 --limit 100

Options:

  • --batch-size N - Number of posts per API call (default: 10)
  • --limit N - Maximum posts to analyze (default: 0 = unlimited)
  • --concurrency N - Parallel API requests (default: 3)
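
Before a large run, the number of API calls can be estimated as the post count divided by the batch size, rounded up. An illustration using the full corpus size and the default batch size:

```shell
# Back-of-envelope only: estimate API calls for a full analysis pass.
posts=15190    # total posts collected
batch=10       # default --batch-size
calls=$(( (posts + batch - 1) / batch ))   # ceiling division
echo "$calls API calls"
```

Multiply by your model's per-call cost to budget a run before starting it.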

View Recent Logs

# All services
docker compose logs --tail 100

# Specific service
docker compose logs --tail 50 collector
docker compose logs --tail 50 web

# Follow logs in real-time
docker compose logs -f collector

Database Operations

Access Database Shell

docker compose exec db psql -U bluesky -d bluesky

Common Queries

Check Collection Status

SELECT
    started_at::date as date,
    COUNT(*) as runs,
    SUM(posts_collected) as total_posts,
    SUM(mentions_collected) as total_mentions,
    SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as successful
FROM collection_runs
WHERE started_at >= '2026-01-01'
GROUP BY started_at::date
ORDER BY date DESC;

Count Flagged Content

-- Posts/Replies
SELECT COUNT(*) FROM toxicity_scores WHERE overall >= 0.5;

-- Mentions (unique posts)
SELECT COUNT(DISTINCT m.post_uri)
FROM mention_toxicity_scores mts
JOIN mentions m ON m.id = mts.mention_id
WHERE mts.overall >= 0.5;

Human Review Progress

SELECT
    CASE
        WHEN review_status IS NULL THEN 'Unreviewed'
        ELSE review_status
    END as status,
    COUNT(*) as count
FROM toxicity_scores
WHERE overall >= 0.5
GROUP BY review_status;

Backup Database

docker compose exec db pg_dump -U bluesky bluesky > backup_$(date +%Y%m%d).sql
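
Daily dumps accumulate quickly. One possible rotation sketch, demonstrated here on placeholder files in a temporary directory — in practice, point BACKUP_DIR at the folder where you keep real dumps:

```shell
# Keep only the 7 newest backup_*.sql dumps; delete older ones.
BACKUP_DIR=$(mktemp -d)                      # stand-in for your real dump folder
for d in 01 02 03 04 05 06 07 08 09 10; do   # simulate a run of daily dumps
  touch "$BACKUP_DIR/backup_202603$d.sql"
done
# Date-stamped names sort chronologically, so sort -r lists newest first.
ls -1 "$BACKUP_DIR"/backup_*.sql | sort -r | tail -n +8 | xargs rm -f --
remaining=$(ls -1 "$BACKUP_DIR"/backup_*.sql | wc -l)
echo "$remaining dumps kept"
```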

Restore Database

cat backup_20260330.sql | docker compose exec -T db psql -U bluesky -d bluesky

Rebuilding Services

Rebuild After Code Changes

# Rebuild specific service
docker compose build web
docker compose build collector

# Rebuild and restart
docker compose up -d --build web

# Rebuild everything
docker compose build
docker compose up -d

Apply Database Migrations

# View available migrations
ls scripts/*.sql

# Apply specific migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/04-human-review.sql

Configuration

Environment Variables (.env file)

# Database
POSTGRES_USER=bluesky
POSTGRES_PASSWORD=changeme
POSTGRES_PORT=5433

# Web Interface
WEB_PORT=5001

# Bluesky API (for authenticated search)
BSKY_HANDLE=your-handle.bsky.social
BSKY_APP_PASSWORD=your-app-password

# OpenAI API (for toxicity analysis)
OPENAI_API_KEY=sk-...

# Analysis Settings
ANALYZER_MODEL=gpt-4.1-nano
ANALYZER_CONCURRENCY=3
ANALYZER_BATCH_SIZE=10
ANALYZER_LIMIT=0

# Collection Settings
MAX_PAGES_PER_ACCOUNT=50
MENTION_LOOKBACK_HOURS=12
LOG_LEVEL=INFO

Tracked Accounts (config/accounts.yml)

accounts:
  - handle: example.bsky.social  # Account to monitor
  - handle: another.bsky.social

Add or remove accounts, then restart collector:

docker compose restart collector
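
A quick sanity check before the restart — a sketch, not a shipped script, demonstrated against an inline sample; point ACCOUNTS_FILE at config/accounts.yml in practice:

```shell
# Count the handle entries in an accounts file before restarting the collector.
ACCOUNTS_FILE=$(mktemp)          # stand-in for config/accounts.yml
cat > "$ACCOUNTS_FILE" <<'EOF'
accounts:
  - handle: example.bsky.social
  - handle: another.bsky.social
EOF
count=$(grep -c 'handle:' "$ACCOUNTS_FILE")
echo "tracked accounts: $count"
```

If the count does not match what you expect, fix the YAML before restarting so the collector does not silently skip accounts.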

Troubleshooting

Web Interface Not Loading

# Check if web service is running
docker compose ps web

# Check web logs for errors
docker compose logs --tail 50 web

# Restart web service
docker compose restart web

Collector Not Running

# Check scheduler is running
docker compose ps scheduler

# Check collector status
docker compose ps collector

# Start scheduler if stopped
docker compose start scheduler

# Check scheduler logs
docker compose logs scheduler

Database Connection Issues

# Check database health
docker compose ps db

# Restart database
docker compose restart db

# Check database logs
docker compose logs db

Out of Disk Space

# Check Docker disk usage
docker system df

# Remove unused images/containers
docker system prune

# Check database size
docker compose exec db psql -U bluesky -d bluesky -c "SELECT pg_size_pretty(pg_database_size('bluesky'));"

Analysis Failing (OpenAI API)

# Check API key is set
docker compose exec collector printenv | grep OPENAI_API_KEY

# Test API connectivity (the client reads OPENAI_API_KEY from the container's environment)
docker compose exec collector python -c "from openai import OpenAI; OpenAI().models.list(); print('API OK')"

# Check rate limits in logs
docker compose logs collector | grep -i "rate limit"

Performance Tuning

Increase Collection Speed

Edit docker-compose.yml:

environment:
  MAX_PAGES_PER_ACCOUNT: 100  # Increase from 50
  MENTION_LOOKBACK_HOURS: 24  # Increase lookback

Increase Analysis Speed

environment:
  ANALYZER_CONCURRENCY: 5     # More parallel requests
  ANALYZER_BATCH_SIZE: 20     # Bigger batches

Cost Warning: Higher concurrency and batch size = higher OpenAI API costs.

Change Collection Schedule

Edit docker-compose.yml under collector labels:

labels:
  ofelia.job-exec.collect.schedule: "0 0 */2 * * *"  # Every 2 hours
  ofelia.job-exec.analyze.schedule: "0 30 */2 * * *" # 30 min after collection

Restart scheduler after changes:

docker compose restart scheduler

Data Export

Export to CSV via Web Interface

  1. Navigate to http://localhost:5001/export
  2. Select date range and filters
  3. Click "Export to CSV"

Export via Command Line

All Posts

docker compose exec db psql -U bluesky -d bluesky -c "COPY (
  SELECT p.uri, p.author_did, a.handle, p.text, p.created_at, p.post_type,
         ts.overall, ts.toxic, ts.hate_speech, ts.threat
  FROM posts p
  LEFT JOIN accounts a ON a.did = p.author_did
  LEFT JOIN toxicity_scores ts ON ts.uri = p.uri
  WHERE p.created_at >= '2026-01-01'
) TO STDOUT CSV HEADER" > posts_export.csv

Flagged Content with Reviews

docker compose exec db psql -U bluesky -d bluesky -c "COPY (
  SELECT p.uri, a.handle, p.text, p.created_at,
         ts.overall, ts.human_reviewed, ts.review_status, ts.reviewed_at
  FROM toxicity_scores ts
  JOIN posts p ON p.uri = ts.uri
  LEFT JOIN accounts a ON a.did = p.author_did
  WHERE ts.overall >= 0.5 AND p.created_at >= '2026-01-01'
  ORDER BY ts.overall DESC
) TO STDOUT CSV HEADER" > flagged_export.csv

Restarting Data Collection (If Needed)

Resume Collection After Pause

  1. Start services:

    docker compose start scheduler collector
    
  2. Verify collection runs:

    docker compose logs -f collector
    
  3. Check database for new entries:

    docker compose exec db psql -U bluesky -d bluesky -c "
      SELECT MAX(created_at) FROM posts;
      SELECT COUNT(*) FROM collection_runs WHERE started_at > NOW() - INTERVAL '1 day';
    "
    

Start Fresh Collection (Keep Database)

  1. Stop services:

    docker compose down
    
  2. Start only database and web:

    docker compose up -d db web
    
  3. Truncate collection tracking (optional):

    docker compose exec db psql -U bluesky -d bluesky -c "TRUNCATE collection_runs;"
    
  4. Start collector:

    docker compose up -d scheduler collector
    

Complete Reset (Delete All Data)

# Stop everything
docker compose down

# Remove data volume
docker volume rm bluesky-collector_pgdata

# Restart from scratch
docker compose up -d

⚠️ WARNING: This deletes all collected posts, mentions, and analysis results permanently!


Monitoring

Collection Health Check

# Last 5 collection runs
docker compose exec db psql -U bluesky -d bluesky -c "
  SELECT started_at, finished_at, status, posts_collected, mentions_collected, errors
  FROM collection_runs
  ORDER BY started_at DESC
  LIMIT 5;
"

Analysis Progress

# Count scored vs unscored
docker compose exec db psql -U bluesky -d bluesky -c "
  SELECT
    (SELECT COUNT(*) FROM posts) as total_posts,
    (SELECT COUNT(*) FROM toxicity_scores) as scored_posts,
    (SELECT COUNT(*) FROM mentions) as total_mentions,
    (SELECT COUNT(*) FROM mention_toxicity_scores) as scored_mentions;
"

Disk Usage

# Database size
docker compose exec db psql -U bluesky -d bluesky -c "
  SELECT
    pg_size_pretty(pg_database_size('bluesky')) as db_size,
    pg_size_pretty(pg_total_relation_size('posts')) as posts_table,
    pg_size_pretty(pg_total_relation_size('mentions')) as mentions_table;
"

Security Notes

  1. Never commit .env file - Contains API keys and passwords
  2. Change default passwords - PostgreSQL default password is changeme
  3. Firewall rules - Ports 5001 (web) and 5433 (database) exposed to localhost only
  4. API keys - Bluesky and OpenAI credentials stored in environment variables
  5. Data retention - Contains personal data (Bluesky posts); handle per GDPR requirements
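
For point 1, git itself can confirm the ignore rule. A sketch, demonstrated in a throwaway repository — in practice, just run git check-ignore .env at the project root:

```shell
# Verify that .env is matched by .gitignore and would not be committed.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
echo ".env" > .gitignore
touch .env
if git check-ignore -q .env; then result="ignored"; else result="NOT ignored"; fi
echo ".env is $result"
```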

Support

Documentation

Logs Location

  • Docker logs: docker compose logs [service]
  • Application logs: ./logs/ directory (if volume mounted)

Common Issues

  1. Port conflicts: Change WEB_PORT or POSTGRES_PORT in .env
  2. Out of memory: Reduce ANALYZER_CONCURRENCY or ANALYZER_BATCH_SIZE
  3. API rate limits: Reduce collection frequency or batch size
  4. Disk full: Run docker system prune and consider data export/cleanup

Last Updated: March 30, 2026
Project Status: Data collection complete, web interface available for analysis