# Bluesky Collector - Operations Guide
## Quick Reference
### Current Status (March 30, 2026)
- **Collector:** ❌ STOPPED (data collection complete)
- **Scheduler:** ❌ STOPPED (no further automated runs)
- **Web Interface:** ✅ RUNNING (http://localhost:5001)
- **Database:** ✅ RUNNING (PostgreSQL on port 5433)
---
## Starting and Stopping Services
### View Current Service Status
```bash
cd /Users/pieter/Nextcloud-Hetzner/PXS\ Cloud/Projects/26004\ HEIO\ 2/04\ Applications/bluesky-collector
docker compose ps
```
### Start All Services
```bash
docker compose up -d
```
This starts:
- `db` - PostgreSQL database (port 5433)
- `web` - Web interface (port 5001)
- `collector` - Data collection service
- `scheduler` - Automated collection scheduler (runs every 4 hours)
### Stop Collection Only (Keep Web Interface)
```bash
docker compose stop scheduler collector
```
This configuration allows browsing collected data without gathering new content.
### Start Collection Services
```bash
docker compose start scheduler collector
```
### Stop All Services
```bash
docker compose down
```
**Warning:** This will stop the web interface and database. Data is preserved in Docker volumes.
### Stop and Remove Everything (Including Data)
```bash
docker compose down -v
```
**⚠️ DANGER:** This deletes all collected data permanently!
---
## Service Details
### Database (PostgreSQL)
- **Image:** `postgres:16-alpine`
- **Port:** 5433 (external) → 5432 (internal)
- **Data Volume:** `pgdata`
- **Access:**
```bash
docker compose exec db psql -U bluesky -d bluesky
```
### Web Interface
- **URL:** http://localhost:5001
- **Port:** 5001
- **Stack:** Flask + Gunicorn
- **Pages:**
  - `/` - Dashboard with collection stats
  - `/accounts` - Account toxicity summary
  - `/statuses` - Posts and replies browser
  - `/mentions` - Mentions browser
  - `/analysis` - Toxicity analysis overview
  - `/analysis/flagged` - Flagged content with human review
  - `/export` - Data export options
### Collector Service
- **Schedule:** Every 4 hours (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
- **Function:** Collects new posts and mentions from Bluesky API
- **Logs:**
```bash
docker compose logs -f collector
```
### Scheduler Service
- **Image:** `mcuadros/ofelia`
- **Function:** Triggers collector and analyzer jobs on schedule
- **Jobs:**
  - `collect` - Runs at 0 minutes past every 4th hour
  - `analyze` - Runs at 30 minutes past every 4th hour
- **Logs:**
```bash
docker compose logs -f scheduler
```
---
## Manual Operations
### Run Manual Collection
```bash
docker compose exec collector python -m src
```
Collects posts and mentions immediately (outside of schedule).
### Run Manual Analysis
```bash
docker compose exec collector python -m src.analyzer
```
Analyzes all unscored posts/mentions using OpenAI API.
**Cost Warning:** Analysis incurs OpenAI API costs. Check batch size settings.
### Analyze Specific Batch Size
```bash
docker compose exec collector python -m src.analyzer --batch-size 50 --limit 100
```
Options:
- `--batch-size N` - Number of posts per API call (default: 10)
- `--limit N` - Maximum posts to analyze (default: 0 = unlimited)
- `--concurrency N` - Parallel API requests (default: 3)
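Since each API call covers one batch, a run makes roughly `ceil(unscored / batch-size)` calls. A small sketch for estimating that before committing to a run; the backlog-counting query in the comment is an assumption based on the tables used elsewhere in this guide:

```shell
# calls_needed TOTAL BATCH_SIZE -> ceil(TOTAL / BATCH_SIZE), i.e. API calls for one run
calls_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Example backlog estimate (counting query is an assumption, not project tooling):
# UNSCORED=$(docker compose exec -T db psql -U bluesky -d bluesky -Atc \
#   "SELECT COUNT(*) FROM posts p LEFT JOIN toxicity_scores ts ON ts.uri = p.uri WHERE ts.uri IS NULL")
# calls_needed "$UNSCORED" 50
calls_needed 1234 50   # -> 25 API calls
```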
### View Recent Logs
```bash
# All services
docker compose logs --tail 100
# Specific service
docker compose logs --tail 50 collector
docker compose logs --tail 50 web
# Follow logs in real-time
docker compose logs -f collector
```
---
## Database Operations
### Access Database Shell
```bash
docker compose exec db psql -U bluesky -d bluesky
```
### Common Queries
#### Check Collection Status
```sql
SELECT
  started_at::date as date,
  COUNT(*) as runs,
  SUM(posts_collected) as total_posts,
  SUM(mentions_collected) as total_mentions,
  SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as successful
FROM collection_runs
WHERE started_at >= '2026-01-01'
GROUP BY started_at::date
ORDER BY date DESC;
```
#### Count Flagged Content
```sql
-- Posts/Replies
SELECT COUNT(*) FROM toxicity_scores WHERE overall >= 0.5;
-- Mentions (unique posts)
SELECT COUNT(DISTINCT m.post_uri)
FROM mention_toxicity_scores mts
JOIN mentions m ON m.id = mts.mention_id
WHERE mts.overall >= 0.5;
```
#### Human Review Progress
```sql
SELECT
  CASE
    WHEN review_status IS NULL THEN 'Unreviewed'
    ELSE review_status
  END as status,
  COUNT(*) as count
FROM toxicity_scores
WHERE overall >= 0.5
GROUP BY review_status;
```
### Backup Database
```bash
docker compose exec db pg_dump -U bluesky bluesky > backup_$(date +%Y%m%d).sql
```
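If backups run unattended (e.g. from cron), a rotation step keeps disk usage bounded. A minimal sketch; the backup directory, filename pattern, and retention count are assumptions, not project conventions:

```shell
# prune_backups DIR KEEP -> delete all but the KEEP newest bluesky_*.sql dumps in DIR
prune_backups() {
  ls -1t "$1"/bluesky_*.sql 2>/dev/null | tail -n +"$(( $2 + 1 ))" | while IFS= read -r f; do
    rm -- "$f"
  done
}

# Example usage (dump command matches the one above):
# docker compose exec -T db pg_dump -U bluesky bluesky > "backups/bluesky_$(date +%Y%m%d).sql"
# prune_backups backups 7
```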
### Restore Database
```bash
cat backup_20260330.sql | docker compose exec -T db psql -U bluesky -d bluesky
```
---
## Rebuilding Services
### Rebuild After Code Changes
```bash
# Rebuild specific service
docker compose build web
docker compose build collector
# Rebuild and restart
docker compose up -d --build web
# Rebuild everything
docker compose build
docker compose up -d
```
### Apply Database Migrations
```bash
# View available migrations
ls scripts/*.sql
# Apply specific migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/04-human-review.sql
```
---
## Configuration
### Environment Variables (.env file)
```bash
# Database
POSTGRES_USER=bluesky
POSTGRES_PASSWORD=changeme
POSTGRES_PORT=5433
# Web Interface
WEB_PORT=5001
# Bluesky API (for authenticated search)
BSKY_HANDLE=your-handle.bsky.social
BSKY_APP_PASSWORD=your-app-password
# OpenAI API (for toxicity analysis)
OPENAI_API_KEY=sk-...
# Analysis Settings
ANALYZER_MODEL=gpt-4.1-nano
ANALYZER_CONCURRENCY=3
ANALYZER_BATCH_SIZE=10
ANALYZER_LIMIT=0
# Collection Settings
MAX_PAGES_PER_ACCOUNT=50
MENTION_LOOKBACK_HOURS=12
LOG_LEVEL=INFO
```
### Tracked Accounts (config/accounts.yml)
```yaml
accounts:
  - handle: example.bsky.social  # Account to monitor
  - handle: another.bsky.social
```
Add or remove accounts, then restart collector:
```bash
docker compose restart collector
```
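A typo in a handle only surfaces later as a failed lookup, so a quick sanity check before restarting can save a collection cycle. A rough sketch; the pattern is an assumption (lowercase dotted handles like `example.bsky.social`):

```shell
# check_handles FILE -> nonzero if any "handle:" value in FILE looks malformed
check_handles() {
  ! grep -oE 'handle:[[:space:]]*[^[:space:]#]+' "$1" \
      | sed 's/handle:[[:space:]]*//' \
      | grep -qvE '^[a-z0-9-]+(\.[a-z0-9-]+)+$'
}

# Usage:
# check_handles config/accounts.yml && docker compose restart collector
```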
---
## Troubleshooting
### Web Interface Not Loading
```bash
# Check if web service is running
docker compose ps web
# Check web logs for errors
docker compose logs --tail 50 web
# Restart web service
docker compose restart web
```
### Collector Not Running
```bash
# Check scheduler is running
docker compose ps scheduler
# Check collector status
docker compose ps collector
# Start scheduler if stopped
docker compose start scheduler
# Check scheduler logs
docker compose logs scheduler
```
### Database Connection Issues
```bash
# Check database health
docker compose ps db
# Restart database
docker compose restart db
# Check database logs
docker compose logs db
```
### Out of Disk Space
```bash
# Check Docker disk usage
docker system df
# Remove unused images/containers
docker system prune
# Check database size
docker compose exec db psql -U bluesky -d bluesky -c "SELECT pg_size_pretty(pg_database_size('bluesky'));"
```
### Analysis Failing (OpenAI API)
```bash
# Check API key is set
docker compose exec collector printenv | grep OPENAI_API_KEY
# Test API connectivity (the SDK reads OPENAI_API_KEY from the container's
# environment; a host-side $OPENAI_API_KEY would not expand inside the container)
docker compose exec collector python -c "from openai import OpenAI; OpenAI().models.list()"
# Check rate limits in logs
docker compose logs collector | grep -i "rate limit"
```
---
## Performance Tuning
### Increase Collection Speed
Edit `docker-compose.yml`:
```yaml
environment:
  MAX_PAGES_PER_ACCOUNT: 100    # Increase from 50
  MENTION_LOOKBACK_HOURS: 24    # Increase lookback
```
### Increase Analysis Speed
```yaml
environment:
  ANALYZER_CONCURRENCY: 5    # More parallel requests
  ANALYZER_BATCH_SIZE: 20    # Bigger batches
```
**Cost Warning:** Higher concurrency and larger batches increase OpenAI API costs per run.
### Change Collection Schedule
Edit `docker-compose.yml` under collector labels:
```yaml
labels:
  ofelia.job-exec.collect.schedule: "0 0 */2 * * *"   # Every 2 hours
  ofelia.job-exec.analyze.schedule: "0 30 */2 * * *"  # 30 min after collection
```
Restart scheduler after changes:
```bash
docker compose restart scheduler
```
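The schedules above are six-field cron specs (a seconds field in front of the usual five), which is easy to get wrong when copying five-field crontab examples. A trivial field-count check before restarting; a sketch, not part of the project tooling:

```shell
# cron_fields_ok "SPEC" -> success iff the spec has exactly six whitespace-separated fields
cron_fields_ok() {
  [ "$(echo "$1" | wc -w)" -eq 6 ]
}

cron_fields_ok "0 0 */2 * * *" && echo "ok"       # six fields: accepted
cron_fields_ok "0 */2 * * *" || echo "too few"    # five-field crontab form: rejected
```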
---
## Data Export
### Export to CSV via Web Interface
1. Navigate to http://localhost:5001/export
2. Select date range and filters
3. Click "Export to CSV"
### Export via Command Line
#### All Posts
```bash
docker compose exec db psql -U bluesky -d bluesky -c "COPY (
  SELECT p.uri, p.author_did, a.handle, p.text, p.created_at, p.post_type,
         ts.overall, ts.toxic, ts.hate_speech, ts.threat
  FROM posts p
  LEFT JOIN accounts a ON a.did = p.author_did
  LEFT JOIN toxicity_scores ts ON ts.uri = p.uri
  WHERE p.created_at >= '2026-01-01'
) TO STDOUT CSV HEADER" > posts_export.csv
```
#### Flagged Content with Reviews
```bash
docker compose exec db psql -U bluesky -d bluesky -c "COPY (
  SELECT p.uri, a.handle, p.text, p.created_at,
         ts.overall, ts.human_reviewed, ts.review_status, ts.reviewed_at
  FROM toxicity_scores ts
  JOIN posts p ON p.uri = ts.uri
  LEFT JOIN accounts a ON a.did = p.author_did
  WHERE ts.overall >= 0.5 AND p.created_at >= '2026-01-01'
  ORDER BY ts.overall DESC
) TO STDOUT CSV HEADER" > flagged_export.csv
```
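For per-month exports only the date filter changes, so the statement can be parameterised. A hedged helper that builds the COPY statement for a given month (column list abbreviated from the query above; adjust to taste):

```shell
# monthly_copy_sql YYYY-MM -> a COPY statement covering exactly that month
monthly_copy_sql() {
  local start="$1-01"
  printf "COPY (SELECT p.uri, a.handle, p.text, p.created_at, ts.overall
FROM posts p
LEFT JOIN accounts a ON a.did = p.author_did
LEFT JOIN toxicity_scores ts ON ts.uri = p.uri
WHERE p.created_at >= '%s' AND p.created_at < date '%s' + interval '1 month'
) TO STDOUT CSV HEADER" "$start" "$start"
}

# Usage:
# docker compose exec -T db psql -U bluesky -d bluesky \
#   -c "$(monthly_copy_sql 2026-01)" > posts_2026-01.csv
```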
---
## Restarting Data Collection (If Needed)
### Resume Collection After Pause
1. Start services:
```bash
docker compose start scheduler collector
```
2. Verify collection runs:
```bash
docker compose logs -f collector
```
3. Check database for new entries:
```bash
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT MAX(created_at) FROM posts;
SELECT COUNT(*) FROM collection_runs WHERE started_at > NOW() - INTERVAL '1 day';
"
```
### Start Fresh Collection (Keep Database)
1. Stop services:
```bash
docker compose down
```
2. Start only database and web:
```bash
docker compose up -d db web
```
3. Truncate collection tracking (optional):
```bash
docker compose exec db psql -U bluesky -d bluesky -c "TRUNCATE collection_runs;"
```
4. Start collector:
```bash
docker compose up -d scheduler collector
```
### Complete Reset (Delete All Data)
```bash
# Stop everything
docker compose down
# Remove data volume
docker volume rm bluesky-collector_pgdata
# Restart from scratch
docker compose up -d
```
**⚠️ WARNING:** This deletes all collected posts, mentions, and analysis results permanently!
---
## Monitoring
### Collection Health Check
```bash
# Last 5 collection runs
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT started_at, finished_at, status, posts_collected, mentions_collected, errors
FROM collection_runs
ORDER BY started_at DESC
LIMIT 5;
"
```
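For unattended monitoring, the query above can feed a staleness check that exits nonzero when the last run is too old. A sketch assuming GNU `date -d` and a `YYYY-MM-DD HH:MM:SS` timestamp from psql; the alerting hookup is left open:

```shell
# is_stale "YYYY-MM-DD HH:MM:SS" MAX_HOURS -> success iff the timestamp is older than MAX_HOURS
is_stale() {
  local last now
  last=$(date -d "$1" +%s)
  now=$(date +%s)
  [ $(( now - last )) -gt $(( $2 * 3600 )) ]
}

# Usage (timestamp query matches the schema used above):
# LAST=$(docker compose exec -T db psql -U bluesky -d bluesky -Atc \
#   "SELECT MAX(started_at) FROM collection_runs")
# is_stale "$LAST" 5 && echo "ALERT: no collection run in the last 5 hours"
```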
### Analysis Progress
```bash
# Count scored vs unscored
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT
  (SELECT COUNT(*) FROM posts) as total_posts,
  (SELECT COUNT(*) FROM toxicity_scores) as scored_posts,
  (SELECT COUNT(*) FROM mentions) as total_mentions,
  (SELECT COUNT(*) FROM mention_toxicity_scores) as scored_mentions;
"
```
### Disk Usage
```bash
# Database size
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT
  pg_size_pretty(pg_database_size('bluesky')) as db_size,
  pg_size_pretty(pg_total_relation_size('posts')) as posts_table,
  pg_size_pretty(pg_total_relation_size('mentions')) as mentions_table;
"
```
---
## Security Notes
1. **Never commit .env file** - Contains API keys and passwords
2. **Change default passwords** - PostgreSQL default password is `changeme`
3. **Firewall rules** - Ports 5001 (web) and 5433 (database) exposed to localhost only
4. **API keys** - Bluesky and OpenAI credentials stored in environment variables
5. **Data retention** - Contains personal data (Bluesky posts); handle per GDPR requirements
---
## Support
### Documentation
- Main findings: `FINDINGS.md`
- This operations guide: `OPERATIONS.md`
- Git repository: https://forgejo.postxsociety.cloud/pieter/bluesky-collector
### Logs Location
- Docker logs: `docker compose logs [service]`
- Application logs: `./logs/` directory (if volume mounted)
### Common Issues
1. **Port conflicts:** Change `WEB_PORT` or `POSTGRES_PORT` in .env
2. **Out of memory:** Reduce `ANALYZER_CONCURRENCY` or `ANALYZER_BATCH_SIZE`
3. **API rate limits:** Reduce collection frequency or batch size
4. **Disk full:** Run `docker system prune` and consider data export/cleanup
---
**Last Updated:** March 30, 2026
**Project Status:** Data collection complete, web interface available for analysis