# Bluesky Collector - Operations Guide

## Quick Reference

### Current Status (March 30, 2026)
- **Collector:** ❌ STOPPED (data collection complete)
- **Scheduler:** ❌ STOPPED (no further automated runs)
- **Web Interface:** ✅ RUNNING (http://localhost:5001)
- **Database:** ✅ RUNNING (PostgreSQL on port 5433)

---

## Starting and Stopping Services

### View Current Service Status
```bash
cd /Users/pieter/Nextcloud-Hetzner/PXS\ Cloud/Projects/26004\ HEIO\ 2/04\ Applications/bluesky-collector
docker compose ps
```

### Start All Services
```bash
docker compose up -d
```

This starts:
- `db` - PostgreSQL database (port 5433)
- `web` - Web interface (port 5001)
- `collector` - Data collection service
- `scheduler` - Automated collection scheduler (runs every 4 hours)

### Stop Collection Only (Keep Web Interface)
```bash
docker compose stop scheduler collector
```

This configuration allows browsing collected data without gathering new content.

### Start Collection Services
```bash
docker compose start scheduler collector
```

### Stop All Services
```bash
docker compose down
```

**Warning:** This will stop the web interface and database. Data is preserved in Docker volumes.

### Stop and Remove Everything (Including Data)
```bash
docker compose down -v
```

**⚠️ DANGER:** This deletes all collected data permanently!

---

## Service Details

### Database (PostgreSQL)
- **Image:** `postgres:16-alpine`
- **Port:** 5433 (external) → 5432 (internal)
- **Data Volume:** `pgdata`
- **Access:**
  ```bash
  docker compose exec db psql -U bluesky -d bluesky
  ```

### Web Interface
- **URL:** http://localhost:5001
- **Port:** 5001
- **Stack:** Flask + Gunicorn
- **Pages:**
  - `/` - Dashboard with collection stats
  - `/accounts` - Account toxicity summary
  - `/statuses` - Posts and replies browser
  - `/mentions` - Mentions browser
  - `/analysis` - Toxicity analysis overview
  - `/analysis/flagged` - Flagged content with human review
  - `/export` - Data export options

### Collector Service
- **Schedule:** Every 4 hours (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
- **Function:** Collects new posts and mentions from Bluesky API
- **Logs:**
  ```bash
  docker compose logs -f collector
  ```

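To check when the next run falls on the 4-hour schedule listed above, the hour arithmetic can be sketched in Python. This is an illustration only, not the scheduler's own code: `next_run` and its parameters are hypothetical names, and it assumes a run fires at a fixed minute past every hour divisible by the interval (00, 04, 08, ...). Times are naive, so they are in whatever timezone the scheduler container uses.

```python
from datetime import datetime, timedelta


def next_run(now: datetime, every_hours: int = 4, minute: int = 0) -> datetime:
    """Return the first run strictly after `now`.

    Hypothetical helper: assumes a run fires at `minute` past every hour
    whose value is divisible by `every_hours` (00, 04, 08, ... for 4).
    """
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=1)
    # Walk forward hour by hour until the hour matches the schedule.
    while candidate.hour % every_hours != 0:
        candidate += timedelta(hours=1)
    return candidate


print(next_run(datetime(2026, 3, 30, 5, 10)))             # collection job
print(next_run(datetime(2026, 3, 30, 4, 10), minute=30))  # analysis job
```

Passing `minute=30` models the analysis job, which is documented as running 30 minutes past the same hours.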
### Scheduler Service
- **Image:** `mcuadros/ofelia`
- **Function:** Triggers collector and analyzer jobs on schedule
- **Jobs:**
  - `collect` - Runs at 0 minutes past every 4th hour
  - `analyze` - Runs at 30 minutes past every 4th hour
- **Logs:**
  ```bash
  docker compose logs -f scheduler
  ```

---

## Manual Operations

### Run Manual Collection
```bash
docker compose exec collector python -m src
```

Collects posts and mentions immediately (outside of schedule).

### Run Manual Analysis
```bash
docker compose exec collector python -m src.analyzer
```

Analyzes all unscored posts/mentions using the OpenAI API.

**Cost Warning:** Analysis incurs OpenAI API costs. Check batch size settings.

### Analyze Specific Batch Size
```bash
docker compose exec collector python -m src.analyzer --batch-size 50 --limit 100
```

Options:
- `--batch-size N` - Number of posts per API call (default: 10)
- `--limit N` - Maximum posts to analyze (default: 0 = unlimited)
- `--concurrency N` - Parallel API requests (default: 3)

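Before starting a large analysis run, it can help to estimate how many API calls the chosen options imply. The sketch below is a rough estimate under two assumptions: each API call handles exactly one batch (as the `--batch-size` description suggests), and the number of unscored items is already known, for example from the database. `estimate_api_calls` is a hypothetical helper, not part of the analyzer.

```python
import math


def estimate_api_calls(unscored: int, batch_size: int = 10, limit: int = 0) -> int:
    """Estimate API calls for one analyzer run (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # limit == 0 means "no cap", mirroring the documented --limit default.
    to_analyze = unscored if limit == 0 else min(unscored, limit)
    return math.ceil(to_analyze / batch_size)


print(estimate_api_calls(unscored=1000))
print(estimate_api_calls(unscored=1000, batch_size=50, limit=100))
```

The second call mirrors the `--batch-size 50 --limit 100` example above. `--concurrency` changes how many calls run in parallel, not how many are made, so it does not appear in the estimate.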
### View Recent Logs
```bash
# All services
docker compose logs --tail 100

# Specific service
docker compose logs --tail 50 collector
docker compose logs --tail 50 web

# Follow logs in real-time
docker compose logs -f collector
```

---

## Database Operations

### Access Database Shell
```bash
docker compose exec db psql -U bluesky -d bluesky
```

### Common Queries

#### Check Collection Status
```sql
SELECT
    started_at::date as date,
    COUNT(*) as runs,
    SUM(posts_collected) as total_posts,
    SUM(mentions_collected) as total_mentions,
    SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as successful
FROM collection_runs
WHERE started_at >= '2026-01-01'
GROUP BY started_at::date
ORDER BY date DESC;
```

#### Count Flagged Content
```sql
-- Posts/Replies
SELECT COUNT(*) FROM toxicity_scores WHERE overall >= 0.5;

-- Mentions (unique posts)
SELECT COUNT(DISTINCT m.post_uri)
FROM mention_toxicity_scores mts
JOIN mentions m ON m.id = mts.mention_id
WHERE mts.overall >= 0.5;
```

#### Human Review Progress
```sql
SELECT
    CASE
        WHEN review_status IS NULL THEN 'Unreviewed'
        ELSE review_status
    END as status,
    COUNT(*) as count
FROM toxicity_scores
WHERE overall >= 0.5
GROUP BY review_status;
```

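The review-progress query combines a score threshold with a NULL-to-label mapping. For readers who prefer to see that logic outside SQL, here is a minimal Python sketch of the same grouping. The sample rows and the `confirmed` status value are made up for illustration; the actual `review_status` values in the database are not listed in this guide.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

FLAG_THRESHOLD = 0.5  # same cut-off as the queries above


def review_progress(rows: Iterable[Tuple[float, Optional[str]]]) -> Counter:
    """Count flagged rows per review status; None becomes 'Unreviewed'."""
    counts: Counter = Counter()
    for overall, review_status in rows:
        if overall >= FLAG_THRESHOLD:
            label = "Unreviewed" if review_status is None else review_status
            counts[label] += 1
    return counts


# Made-up sample rows: (overall score, review_status).
sample = [(0.9, "confirmed"), (0.7, None), (0.55, None), (0.2, None)]
print(review_progress(sample))
```

Rows below the threshold are skipped entirely, which matches the `WHERE overall >= 0.5` filter being applied before grouping.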
### Backup Database
```bash
# -T disables TTY allocation so the dump is written unmodified when redirected
docker compose exec -T db pg_dump -U bluesky bluesky > backup_$(date +%Y%m%d).sql
```

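If backups are triggered from a script rather than by hand, the dated filename and the `pg_dump` invocation can be built in Python. This is a sketch under assumptions: `backup_command` and `run_backup` are hypothetical helpers, the script runs from the project directory, and `-T` is passed so that no pseudo-TTY is allocated while output is redirected to a file.

```python
import subprocess
from datetime import date
from typing import List, Tuple


def backup_command(day: date) -> Tuple[List[str], str]:
    """Build the pg_dump invocation and the dated output filename."""
    filename = f"backup_{day:%Y%m%d}.sql"
    # -T disables TTY allocation so the dump is not altered when redirected.
    command = ["docker", "compose", "exec", "-T", "db",
               "pg_dump", "-U", "bluesky", "bluesky"]
    return command, filename


def run_backup(day: date) -> str:
    """Run the dump and write it to the dated file; raises if pg_dump fails."""
    command, filename = backup_command(day)
    with open(filename, "wb") as out:
        subprocess.run(command, stdout=out, check=True)
    return filename


command, filename = backup_command(date(2026, 3, 30))
print(filename, " ".join(command))
```

Calling `run_backup(date.today())` would perform the dump. If `pg_dump` fails, an incomplete file is left behind, so a caller may want to delete it when the exception is raised.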
### Restore Database
```bash
cat backup_20260330.sql | docker compose exec -T db psql -U bluesky -d bluesky
```

---

## Rebuilding Services

### Rebuild After Code Changes
```bash
# Rebuild specific service
docker compose build web
docker compose build collector

# Rebuild and restart
docker compose up -d --build web

# Rebuild everything
docker compose build
docker compose up -d
```

### Apply Database Migrations
```bash
# View available migrations
ls scripts/*.sql

# Apply specific migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/04-human-review.sql
```

---

## Configuration

### Environment Variables (.env file)
```bash
# Database
POSTGRES_USER=bluesky
POSTGRES_PASSWORD=changeme
POSTGRES_PORT=5433

# Web Interface
WEB_PORT=5001

# Bluesky API (for authenticated search)
BSKY_HANDLE=your-handle.bsky.social
BSKY_APP_PASSWORD=your-app-password

# OpenAI API (for toxicity analysis)
OPENAI_API_KEY=sk-...

# Analysis Settings
ANALYZER_MODEL=gpt-4.1-nano
ANALYZER_CONCURRENCY=3
ANALYZER_BATCH_SIZE=10
ANALYZER_LIMIT=0

# Collection Settings
MAX_PAGES_PER_ACCOUNT=50
MENTION_LOOKBACK_HOURS=12
LOG_LEVEL=INFO
```

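To illustrate how the analysis settings might be consumed, here is a minimal Python sketch that reads the `ANALYZER_*` variables and falls back to the values shown above. It is not the analyzer's actual configuration code: `AnalyzerSettings` and `load_analyzer_settings` are hypothetical names, and treating the `.env` values as defaults is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AnalyzerSettings:
    model: str
    concurrency: int
    batch_size: int
    limit: int  # 0 means unlimited


def load_analyzer_settings(env: Mapping[str, str] = os.environ) -> AnalyzerSettings:
    """Read ANALYZER_* variables, falling back to the values shown above."""
    return AnalyzerSettings(
        model=env.get("ANALYZER_MODEL", "gpt-4.1-nano"),
        concurrency=int(env.get("ANALYZER_CONCURRENCY", "3")),
        batch_size=int(env.get("ANALYZER_BATCH_SIZE", "10")),
        limit=int(env.get("ANALYZER_LIMIT", "0")),
    )


print(load_analyzer_settings({"ANALYZER_BATCH_SIZE": "20"}))
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test; a non-numeric value raises `ValueError` at load time rather than later during analysis.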
### Tracked Accounts (config/accounts.yml)
```yaml
accounts:
  - handle: example.bsky.social  # Account to monitor
  - handle: another.bsky.social
```

Add or remove accounts, then restart collector:
```bash
docker compose restart collector
```

---

## Troubleshooting

### Web Interface Not Loading
```bash
# Check if web service is running
docker compose ps web

# Check web logs for errors
docker compose logs --tail 50 web

# Restart web service
docker compose restart web
```

### Collector Not Running
```bash
# Check scheduler is running
docker compose ps scheduler

# Check collector status
docker compose ps collector

# Start scheduler if stopped
docker compose start scheduler

# Check scheduler logs
docker compose logs scheduler
```

### Database Connection Issues
```bash
# Check database health
docker compose ps db

# Restart database
docker compose restart db

# Check database logs
docker compose logs db
```

### Out of Disk Space
```bash
# Check Docker disk usage
docker system df

# Remove unused images/containers
docker system prune

# Check database size
docker compose exec db psql -U bluesky -d bluesky -c "SELECT pg_size_pretty(pg_database_size('bluesky'));"
```

### Analysis Failing (OpenAI API)
```bash
# Check API key is set
docker compose exec collector printenv | grep OPENAI_API_KEY

# Test API connectivity. The client reads OPENAI_API_KEY from the container
# environment; a "$OPENAI_API_KEY" in this command would expand on the host instead.
docker compose exec collector python -c "from openai import OpenAI; OpenAI().models.list(); print('OpenAI API reachable')"

# Check rate limits in logs
docker compose logs collector | grep -i "rate limit"
```
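
When logs have been saved to a file or captured by a script, the same case-insensitive search for rate-limit messages can be done in Python. The sketch below assumes only that the relevant lines contain the phrase "rate limit" in some capitalization; `rate_limit_lines` is a hypothetical helper and the sample log lines are made up.

```python
from typing import Iterable, List


def rate_limit_lines(log_lines: Iterable[str]) -> List[str]:
    """Return lines mentioning 'rate limit', ignoring case (like grep -i)."""
    return [line for line in log_lines if "rate limit" in line.lower()]


# Made-up log lines for illustration; real collector output may differ.
sample_log = [
    "INFO analyzer: scored batch 12",
    "WARNING analyzer: Rate limit reached, retrying in 20s",
    "ERROR analyzer: rate limit exceeded",
]
print(len(rate_limit_lines(sample_log)))
```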
---

## Performance Tuning

### Increase Collection Speed
Edit `docker-compose.yml`:
```yaml
environment:
  MAX_PAGES_PER_ACCOUNT: 100  # Increase from 50
  MENTION_LOOKBACK_HOURS: 24  # Increase lookback
```

### Increase Analysis Speed
```yaml
environment:
  ANALYZER_CONCURRENCY: 5  # More parallel requests
  ANALYZER_BATCH_SIZE: 20  # Bigger batches
```

**Cost Warning:** Higher concurrency and batch size = higher OpenAI API costs.

### Change Collection Schedule
Edit `docker-compose.yml` under collector labels:
```yaml
labels:
  ofelia.job-exec.collect.schedule: "0 0 */2 * * *"   # Every 2 hours
  ofelia.job-exec.analyze.schedule: "0 30 */2 * * *"  # 30 min after collection
```

`docker compose restart` does not re-read `docker-compose.yml`, so recreate the collector to apply the new labels, then restart the scheduler so it picks them up:
```bash
docker compose up -d collector
docker compose restart scheduler
```
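
The schedule strings above have six fields rather than the usual five. Judging by the comments on the two examples, the first field is seconds, followed by minutes and hours. The Python sketch below expands such an expression so a change can be sanity-checked before restarting anything. It is a simplified illustration, not the scheduler's parser: it handles only `*`, `*/n`, and single numbers, and the field order is inferred from the examples.

```python
from typing import Dict, List

# Field order assumed from the examples above: seconds come first.
FIELDS = [("second", 0, 59), ("minute", 0, 59), ("hour", 0, 23),
          ("day_of_month", 1, 31), ("month", 1, 12), ("day_of_week", 0, 6)]


def expand_field(field: str, low: int, high: int) -> List[int]:
    """Expand '*', '*/n', or a single number into the values it matches."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        return list(range(low, high + 1, int(field[2:])))
    return [int(field)]


def parse_schedule(expr: str) -> Dict[str, List[int]]:
    """Split a six-field schedule and expand each field."""
    parts = expr.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return {name: expand_field(part, low, high)
            for part, (name, low, high) in zip(parts, FIELDS)}


print(parse_schedule("0 30 */2 * * *")["hour"])
```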
---

## Data Export

### Export to CSV via Web Interface
1. Navigate to http://localhost:5001/export
2. Select date range and filters
3. Click "Export to CSV"

### Export via Command Line

#### All Posts
```bash
# -T disables TTY allocation so the CSV is written unmodified when redirected
docker compose exec -T db psql -U bluesky -d bluesky -c "COPY (
    SELECT p.uri, p.author_did, a.handle, p.text, p.created_at, p.post_type,
           ts.overall, ts.toxic, ts.hate_speech, ts.threat
    FROM posts p
    LEFT JOIN accounts a ON a.did = p.author_did
    LEFT JOIN toxicity_scores ts ON ts.uri = p.uri
    WHERE p.created_at >= '2026-01-01'
) TO STDOUT CSV HEADER" > posts_export.csv
```

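Once `posts_export.csv` exists, it can be post-processed with Python's standard `csv` module. The sketch below filters the export down to rows at or above the 0.5 threshold used elsewhere in this guide. `flagged_rows` is a hypothetical helper and the two sample rows are made up; the column names come from the query above.

```python
import csv
import io
from typing import Dict, Iterable, List


def flagged_rows(lines: Iterable[str], threshold: float = 0.5) -> List[Dict[str, str]]:
    """Return exported rows whose `overall` score meets the threshold."""
    flagged = []
    for row in csv.DictReader(lines):
        # Unscored posts come out of the LEFT JOIN with an empty `overall`.
        if row["overall"] and float(row["overall"]) >= threshold:
            flagged.append(row)
    return flagged


# Made-up two-row export with the same header as the query above.
sample = io.StringIO(
    "uri,author_did,handle,text,created_at,post_type,overall,toxic,hate_speech,threat\n"
    "at://x/1,did:plc:a,a.example,hello,2026-01-02,post,0.8,0.8,0.1,0.0\n"
    "at://x/2,did:plc:b,b.example,hi,2026-01-03,reply,,,,\n"
)
print([row["uri"] for row in flagged_rows(sample)])
```

For a real export, pass a file opened with `open("posts_export.csv", newline="", encoding="utf-8")` so that post text containing line breaks is parsed correctly.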
#### Flagged Content with Reviews
```bash
# -T disables TTY allocation so the CSV is written unmodified when redirected
docker compose exec -T db psql -U bluesky -d bluesky -c "COPY (
    SELECT p.uri, a.handle, p.text, p.created_at,
           ts.overall, ts.human_reviewed, ts.review_status, ts.reviewed_at
    FROM toxicity_scores ts
    JOIN posts p ON p.uri = ts.uri
    LEFT JOIN accounts a ON a.did = p.author_did
    WHERE ts.overall >= 0.5 AND p.created_at >= '2026-01-01'
    ORDER BY ts.overall DESC
) TO STDOUT CSV HEADER" > flagged_export.csv
```

---

## Restarting Data Collection (If Needed)

### Resume Collection After Pause
1. Start services:
   ```bash
   docker compose start scheduler collector
   ```

2. Verify collection runs:
   ```bash
   docker compose logs -f collector
   ```

3. Check database for new entries:
   ```bash
   docker compose exec db psql -U bluesky -d bluesky -c "
   SELECT MAX(created_at) FROM posts;
   SELECT COUNT(*) FROM collection_runs WHERE started_at > NOW() - INTERVAL '1 day';
   "
   ```

### Start Fresh Collection (Keep Database)
1. Stop services:
   ```bash
   docker compose down
   ```

2. Start only database and web:
   ```bash
   docker compose up -d db web
   ```

3. Truncate collection tracking (optional):
   ```bash
   docker compose exec db psql -U bluesky -d bluesky -c "TRUNCATE collection_runs;"
   ```

4. Start collector:
   ```bash
   docker compose up -d scheduler collector
   ```

### Complete Reset (Delete All Data)
```bash
# Stop everything
docker compose down

# Remove data volume
docker volume rm bluesky-collector_pgdata

# Restart from scratch
docker compose up -d
```

**⚠️ WARNING:** This deletes all collected posts, mentions, and analysis results permanently!

---

## Monitoring

### Collection Health Check
```bash
# Last 5 collection runs
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT started_at, finished_at, status, posts_collected, mentions_collected, errors
FROM collection_runs
ORDER BY started_at DESC
LIMIT 5;
"
```

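The most recent `started_at` value from this query can be turned into a simple staleness check. The sketch below flags the collector as stale once two scheduled intervals have passed without a run. Both the helper name `is_stale` and the two-interval tolerance are assumptions chosen for illustration, not project policy.

```python
from datetime import datetime, timedelta


def is_stale(last_started: datetime, now: datetime,
             interval_hours: int = 4, missed_runs: int = 2) -> bool:
    """True when at least `missed_runs` scheduled intervals passed without a run."""
    return now - last_started >= timedelta(hours=interval_hours * missed_runs)


now = datetime(2026, 3, 30, 12, 0)
print(is_stale(datetime(2026, 3, 30, 8, 5), now))
print(is_stale(datetime(2026, 3, 29, 20, 0), now))
```

While the collector and scheduler are intentionally stopped, this check reports stale by design, so it is only meaningful when collection is supposed to be running.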
### Analysis Progress
```bash
# Count scored vs unscored
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT
    (SELECT COUNT(*) FROM posts) as total_posts,
    (SELECT COUNT(*) FROM toxicity_scores) as scored_posts,
    (SELECT COUNT(*) FROM mentions) as total_mentions,
    (SELECT COUNT(*) FROM mention_toxicity_scores) as scored_mentions;
"
```

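The four counts are easier to read as a percentage. A minimal Python sketch, assuming each scored item has exactly one row in its scores table; `scoring_coverage` is a hypothetical helper and the numbers in the example are made up.

```python
def scoring_coverage(total: int, scored: int) -> float:
    """Return the percentage of items that already have a score."""
    if total == 0:
        return 100.0  # nothing to score counts as fully covered
    return round(100.0 * scored / total, 1)


# Made-up counts for illustration.
print(scoring_coverage(total=12000, scored=9000))
```

The number still to be analyzed is simply `total - scored`, which can feed into an estimate of API calls for the next analysis run.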
### Disk Usage
```bash
# Database size
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT
    pg_size_pretty(pg_database_size('bluesky')) as db_size,
    pg_size_pretty(pg_total_relation_size('posts')) as posts_table,
    pg_size_pretty(pg_total_relation_size('mentions')) as mentions_table;
"
```

---

## Security Notes

1. **Never commit .env file** - Contains API keys and passwords
2. **Change default passwords** - PostgreSQL default password is `changeme`
3. **Firewall rules** - Ports 5001 (web) and 5433 (database) exposed to localhost only
4. **API keys** - Bluesky and OpenAI credentials stored in environment variables
5. **Data retention** - Contains personal data (Bluesky posts); handle per GDPR requirements

---

## Support

### Documentation
- Main findings: `FINDINGS.md`
- This operations guide: `OPERATIONS.md`
- Git repository: https://forgejo.postxsociety.cloud/pieter/bluesky-collector

### Logs Location
- Docker logs: `docker compose logs [service]`
- Application logs: `./logs/` directory (if volume mounted)

### Common Issues
1. **Port conflicts:** Change `WEB_PORT` or `POSTGRES_PORT` in .env
2. **Out of memory:** Reduce `ANALYZER_CONCURRENCY` or `ANALYZER_BATCH_SIZE`
3. **API rate limits:** Reduce collection frequency or batch size
4. **Disk full:** Run `docker system prune` and consider data export/cleanup

---

**Last Updated:** March 30, 2026
**Project Status:** Data collection complete, web interface available for analysis