# Bluesky Account Monitor
Collects posts, replies, and mentions for a list of Bluesky accounts, runs AI-powered toxicity analysis across 12 categories, and presents results on a web dashboard. Everything runs in Docker.
## Architecture
```
                 ┌──────────────────────────────────────────┐
                 │              Docker Compose              │
                 │                                          │
accounts.yml ──▶ │ collector ──▶ PostgreSQL ◀── web (Flask) │
                 │      │            ▲                      │
                 │      ▼            │                      │
                 │  analyzer ────────┘                      │
                 │      │                                   │
                 │      ▼                                   │
                 │  OpenAI API                              │
                 │                                          │
                 │ scheduler (Ofelia) ── cron triggers      │
                 └──────────────────────────────────────────┘
```
Four services:
- **db** — PostgreSQL 16 (Alpine), stores all data
- **collector** — Python async service that fetches posts and mentions from Bluesky
- **scheduler** — [Ofelia](https://github.com/mcuadros/ofelia) cron that triggers collection (every 4h) and analysis (every 4h + 30min offset)
- **web** — Flask + Gunicorn dashboard on port 5001
## Quick Start
```bash
# 1. Copy and edit your environment config
cp .env.example .env
# Fill in: BSKY_HANDLE, BSKY_APP_PASSWORD, OPENAI_API_KEY
# 2. Add your target accounts to config/accounts.yml
# 3. Start everything
docker compose up -d
# 4. Run the toxicity schema migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/02-toxicity.sql
# 5. Trigger an immediate first collection
docker compose exec collector python -m src
# 6. Run a test toxicity analysis (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer
# 7. Open the dashboard
open http://localhost:5001
```
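The `.env.example` file isn't reproduced here, but based on the variables referenced throughout this README, a minimal `.env` might look like this (all values are placeholders):

```bash
# Bluesky credentials (app password from Settings > App Passwords)
BSKY_HANDLE=yourname.bsky.social
BSKY_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx

# OpenAI key for the toxicity analyzer
OPENAI_API_KEY=sk-...

# Database
POSTGRES_PASSWORD=changeme
```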
## Collection
### What Gets Collected
| Source | API Endpoint | Stored In |
|--------|-------------|-----------|
| User's own posts & replies | `getAuthorFeed` (public) | `posts` table |
| Posts mentioning a user | `searchPosts` (requires auth) | `mentions` table |
All records include a `raw_json` JSONB column with the full API response for future-proof analysis.
### How It Works
- **Scheduled polling** via Ofelia — runs every 4 hours by default
- **Incremental collection** — only fetches posts newer than the last run
- **Rate limit aware** — reads API response headers and sleeps when approaching limits
- **Deduplication** — posts are upserted by URI; engagement counts are refreshed on re-encounters
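The deduplication step can be sketched in Python (a simplified in-memory stand-in for the real database upsert; the function and field names are illustrative, not the project's actual code):

```python
def upsert_posts(store: dict, fetched: list) -> tuple:
    """Upsert posts by URI: new posts are inserted; re-encountered
    posts only get their engagement counts refreshed."""
    inserted = updated = 0
    for post in fetched:
        uri = post["uri"]
        if uri in store:
            # Already collected: refresh engagement counts in place
            store[uri]["like_count"] = post["like_count"]
            updated += 1
        else:
            store[uri] = post
            inserted += 1
    return inserted, updated

store = {}
batch = [{"uri": "at://did:plc:x/app.bsky.feed.post/1", "like_count": 3}]
print(upsert_posts(store, batch))  # → (1, 0): first encounter, inserted
batch[0]["like_count"] = 7
print(upsert_posts(store, batch))  # → (0, 1): re-encounter, counts refreshed
```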
## Toxicity Analysis
The analyzer classifies every post and mention using OpenAI's GPT-4.1-nano, scoring content on 12 categories from 0.0 (absent) to 1.0 (extreme):
| Category | What it detects |
|----------|----------------|
| `toxic` | Rude, disrespectful, or aggressive language |
| `threat` | Violence, harm, intimidation, calls to action |
| `hate_speech` | Targeting any protected characteristic |
| `racism` | Race/ethnicity-based hostility |
| `antisemitism` | Anti-Jewish hate, conspiracy theories, coded language |
| `islamophobia` | Anti-Muslim hate, "omvolking" narratives |
| `sexism` | Gender-based discrimination or harassment |
| `homophobia` | Anti-LGBTQ+ rhetoric |
| `insult` | Personal attacks, name-calling |
| `dehumanization` | Comparing people to animals, vermin, disease |
| `extremism` | Far-right/left rhetoric, Nazi glorification, Great Replacement |
| `ableism` | Disability-targeting language, mental health slurs |
The prompt is tuned for Dutch political discourse and recognizes coded terms such as "gelukszoekers" ("fortune seekers", a pejorative for asylum seekers), "kutmarokkanen" (an ethnic slur against Moroccans), "landverrader" ("traitor to the nation"), and "linkse ratten" ("left-wing rats"). Political disagreement and criticism are not scored as toxic — only genuine hostility, hate, and threats.
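As an illustration of what a per-post score record could look like, here is a sketch assuming `overall` is the maximum category score and `flagged` compares it against the threshold (both assumptions — the project's actual aggregation may differ):

```python
FLAG_THRESHOLD = 0.5  # mirrors the ANALYZER_FLAG_THRESHOLD default

CATEGORIES = [
    "toxic", "threat", "hate_speech", "racism", "antisemitism",
    "islamophobia", "sexism", "homophobia", "insult",
    "dehumanization", "extremism", "ableism",
]

def score_record(scores: dict) -> dict:
    """Build a score row: all 12 categories (missing ones default
    to 0.0), an overall score, and a flagged bit."""
    full = {c: float(scores.get(c, 0.0)) for c in CATEGORIES}
    overall = max(full.values())
    return {**full, "overall": overall, "flagged": overall >= FLAG_THRESHOLD}

rec = score_record({"insult": 0.7, "toxic": 0.4})
print(rec["overall"], rec["flagged"])  # → 0.7 True
```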
### Batch Processing
Posts are sent to the API in batches (default 10 per call) to minimize cost and API calls. The ~500-token system prompt is sent once per batch instead of once per post, cutting input token cost by ~60%.
| | 1 post/call | 10 posts/call (default) |
|---|---|---|
| API calls for 60K posts | 60,000 | 6,000 |
| Estimated cost | ~$5.10 | ~$2.40 |
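The batching arithmetic above works out as follows (token counts are the README's own estimates, used here purely for illustration):

```python
import math

POSTS = 60_000
BATCH_SIZE = 10             # ANALYZER_BATCH_SIZE default
SYSTEM_PROMPT_TOKENS = 500  # ~500-token system prompt, per the text above

calls = math.ceil(POSTS / BATCH_SIZE)
print(calls)  # → 6000 API calls instead of 60000

# System-prompt tokens are paid once per call, so per post:
per_post_prompt_tokens = calls * SYSTEM_PROMPT_TOKENS / POSTS
print(per_post_prompt_tokens)  # → 50.0 prompt tokens/post vs 500 unbatched
```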
### Running the Analyzer
```bash
# Test run (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer
# Full run (all unscored posts)
docker compose exec collector python -m src.analyzer
# Check logs
docker compose logs collector | grep analyzer
cat logs/analyzer.log
```
The scheduled cron runs the analyzer automatically every 4 hours (30 minutes after each collection), so new posts are scored without manual intervention.
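The Ofelia configuration isn't shown in this README, but a schedule matching that cadence might look roughly like this (Ofelia's `job-exec` INI syntax with 6-field cron expressions that include a leading seconds field; the container name is an assumption):

```ini
; Sketch only — container names and schedule format are assumptions
[job-exec "collect"]
schedule = 0 0 */4 * * *
container = bluesky-collector
command = python -m src

[job-exec "analyze"]
schedule = 0 30 */4 * * *
container = bluesky-collector
command = python -m src.analyzer
```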
## Web Dashboard
Access at `http://localhost:5001` (or your configured `WEB_PORT`).
### Pages
- **Dashboard** — Overview of collection runs, account count, post/mention totals
- **Accounts** — List of tracked accounts with post counts and last activity
- **Statuses** — Browse all collected posts with filters and search
- **Mentions** — Browse mentions of tracked accounts
- **Analysis** — Toxicity overview: trend charts, category breakdown, recent analysis runs
- **Flagged Content** — Posts scoring above the flag threshold (default 0.5), filterable by category and type
- **Account Toxicity** — Per-account toxicity breakdown with comparative charts
- **Export** — Download data as CSV
## Configuration
### accounts.yml
```yaml
accounts:
  - handle: alice.bsky.social
  - handle: bob.bsky.social
  - handle: some-org.bsky.social
```
### Environment Variables
#### Collection
| Variable | Default | Description |
|----------|---------|-------------|
| `POSTGRES_PASSWORD` | `changeme` | Database password |
| `POSTGRES_PORT` | `5432` | Exposed PostgreSQL port |
| `LOG_LEVEL` | `INFO` | Python log level |
| `MAX_PAGES_PER_ACCOUNT` | `50` | Max API pages per account per run (50 pages = 5000 posts) |
| `MENTION_LOOKBACK_HOURS` | `12` | How far back to search mentions on first run |
| `BSKY_HANDLE` | — | Your Bluesky handle (required for mention search) |
| `BSKY_APP_PASSWORD` | — | App password from Settings > App Passwords |
#### Toxicity Analysis
| Variable | Default | Description |
|----------|---------|-------------|
| `OPENAI_API_KEY` | — | OpenAI API key (required) |
| `ANALYZER_MODEL` | `gpt-4.1-nano` | OpenAI model for classification |
| `ANALYZER_CONCURRENCY` | `3` | Max concurrent API calls (batches in flight) |
| `ANALYZER_BATCH_SIZE` | `10` | Posts per API call |
| `ANALYZER_LIMIT` | `0` | Max posts to process per run (0 = all) |
| `ANALYZER_FLAG_THRESHOLD` | `0.5` | Score above which a post is flagged |
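A sketch of how the analyzer might read these variables (the helper names are illustrative, not the project's actual code):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default

def env_float(name: str, default: float) -> float:
    """Same as env_int, for float-valued settings like thresholds."""
    raw = os.environ.get(name, "").strip()
    return float(raw) if raw else default

BATCH_SIZE = env_int("ANALYZER_BATCH_SIZE", 10)
CONCURRENCY = env_int("ANALYZER_CONCURRENCY", 3)
LIMIT = env_int("ANALYZER_LIMIT", 0)  # 0 = process all unscored posts
FLAG_THRESHOLD = env_float("ANALYZER_FLAG_THRESHOLD", 0.5)
```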
#### Web UI
| Variable | Default | Description |
|----------|---------|-------------|
| `WEB_PORT` | `5001` | Exposed web dashboard port |
## Database Schema
### Key Tables
- **`accounts`** — Tracked accounts (DID, handle, collection timestamps)
- **`posts`** — Posts from tracked accounts (text, timestamps, engagement counts, post type, raw JSON)
- **`mentions`** — Posts from anyone that mention a tracked account
- **`collection_runs`** — Audit trail of each collection run (timing, counts, errors)
- **`collection_state`** — Per-account bookmarks for incremental collection
- **`toxicity_scores`** — Per-post scores across all 12 categories + overall + flagged
- **`mention_toxicity_scores`** — Same structure for mentions
- **`analysis_runs`** — Audit trail of analyzer runs (timing, counts, cost, errors)
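For orientation, the `toxicity_scores` table might look roughly like this — a sketch inferred from the columns used in the queries in this README, not the actual contents of `scripts/02-toxicity.sql`:

```sql
-- Sketch only: columns inferred from this README's example queries
CREATE TABLE IF NOT EXISTS toxicity_scores (
    post_uri        text PRIMARY KEY REFERENCES posts (uri),
    toxic           real NOT NULL,
    threat          real NOT NULL,
    hate_speech     real NOT NULL,
    racism          real NOT NULL,
    antisemitism    real NOT NULL,
    islamophobia    real NOT NULL,
    sexism          real NOT NULL,
    homophobia      real NOT NULL,
    insult          real NOT NULL,
    dehumanization  real NOT NULL,
    extremism       real NOT NULL,
    ableism         real NOT NULL,
    overall         real NOT NULL,
    flagged         boolean NOT NULL DEFAULT false,
    scored_at       timestamptz NOT NULL DEFAULT now()
);
```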
### Useful Queries
```sql
-- Recent posts by a specific account
SELECT created_at, post_type, text, like_count
FROM posts
WHERE author_did = (SELECT did FROM accounts WHERE handle = 'alice.bsky.social')
ORDER BY created_at DESC LIMIT 20;
-- All mentions of a tracked account
SELECT m.post_text, m.post_created_at, m.mentioning_did
FROM mentions m
JOIN accounts a ON a.did = m.mentioned_did
WHERE a.handle = 'alice.bsky.social'
ORDER BY m.post_created_at DESC;
-- Most toxic posts (overall score)
SELECT p.text, t.overall, t.toxic, t.threat, t.hate_speech, t.racism
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
WHERE t.flagged = true
ORDER BY t.overall DESC LIMIT 20;
-- Toxicity by account
SELECT a.handle, avg(t.overall) AS avg_toxicity, count(*) AS scored_posts
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
JOIN accounts a ON a.did = p.author_did
GROUP BY a.handle
ORDER BY avg_toxicity DESC;
-- Analysis run history
SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
FROM analysis_runs ORDER BY started_at DESC LIMIT 10;
-- Collection run history
SELECT id, started_at, status, posts_collected, mentions_collected, duration_secs
FROM collection_runs ORDER BY started_at DESC LIMIT 10;
```
## Operations
### Manual Runs
```bash
# Collect posts
docker compose exec collector python -m src
# Run toxicity analysis
docker compose exec collector python -m src.analyzer
```
### Monitoring
```bash
# Follow logs
docker compose logs -f collector
# Quick data counts
docker compose exec -T db psql -U bluesky -d bluesky -c \
"SELECT (SELECT count(*) FROM posts) AS posts, (SELECT count(*) FROM mentions) AS mentions, (SELECT count(*) FROM toxicity_scores) AS scored;"
# Last analysis run
docker compose exec -T db psql -U bluesky -d bluesky -c \
"SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd FROM analysis_runs ORDER BY started_at DESC LIMIT 5;"
```
### Rebuilding After Code Changes
```bash
docker compose build collector web
docker compose up -d
```
### Add/Remove Accounts
Edit `config/accounts.yml` — changes take effect on the next scheduled or manual run. Removed accounts are marked inactive but their data is preserved.
### First Run / Backfill
The first run pages back up to `MAX_PAGES_PER_ACCOUNT` pages per account (default 50 pages, roughly 5,000 posts). For a deeper backfill, pass a higher value for a single run — note that the variable must be set inside the container with `-e`, since `docker compose exec` does not forward host environment variables:
```bash
docker compose exec -e MAX_PAGES_PER_ACCOUNT=200 collector python -m src
```
### Backup
The `pgdata` volume persists across container restarts. Back it up with standard PostgreSQL tools:
```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
```