# Bluesky Account Monitor
Collects posts, replies, and mentions for a list of Bluesky accounts, runs AI-powered toxicity analysis across 12 categories, and presents results on a web dashboard. Everything runs in Docker.
## Architecture
```
                 ┌──────────────────────────────────────────┐
                 │              Docker Compose              │
                 │                                          │
accounts.yml ──▶ │ collector ──▶ PostgreSQL ◀── web (Flask) │
                 │      │            ▲                      │
                 │      ▼            │                      │
                 │  analyzer ────────┘                      │
                 │      │                                   │
                 │      ▼                                   │
                 │  OpenAI API                              │
                 │                                          │
                 │ scheduler (Ofelia) ── cron triggers      │
                 └──────────────────────────────────────────┘
```
Four services:
- **db** — PostgreSQL 16 (Alpine), stores all data
- **collector** — Python async service that fetches posts and mentions from Bluesky
- **scheduler** — [Ofelia](https://github.com/mcuadros/ofelia) cron that triggers collection (every 4h) and analysis (every 4h + 30min offset)
- **web** — Flask + Gunicorn dashboard on port 5001
## Quick Start
```bash
# 1. Copy and edit your environment config
cp .env.example .env
# Fill in: BSKY_HANDLE, BSKY_APP_PASSWORD, OPENAI_API_KEY
# 2. Add your target accounts to config/accounts.yml
# 3. Start everything
docker compose up -d
# 4. Run the toxicity schema migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/02-toxicity.sql
# 5. Trigger an immediate first collection
docker compose exec collector python -m src
# 6. Run a test toxicity analysis (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer
# 7. Open the dashboard
open http://localhost:5001
```
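The `.env.example` file isn't reproduced here, but based on the variables referenced throughout this README, a minimal `.env` might look like this (all values are placeholders):

```bash
# Bluesky credentials (app password from Settings > App Passwords)
BSKY_HANDLE=yourname.bsky.social
BSKY_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx

# OpenAI key for the toxicity analyzer
OPENAI_API_KEY=sk-...

# Database
POSTGRES_PASSWORD=changeme
```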
## Collection
### What Gets Collected
| Source | API Endpoint | Stored In |
|--------|-------------|-----------|
| User's own posts & replies | `getAuthorFeed` (public) | `posts` table |
| Posts mentioning a user | `searchPosts` (requires auth) | `mentions` table |
All records include a `raw_json` JSONB column with the full API response for future-proof analysis.
### How It Works
- **Scheduled polling** via Ofelia — runs every 4 hours by default
- **Incremental collection** — only fetches posts newer than the last run
- **Rate limit aware** — reads API response headers and sleeps when approaching limits
- **Deduplication** — posts are upserted by URI; engagement counts are refreshed on re-encounters
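The deduplication step can be sketched in Python (a simplified in-memory stand-in for the real database upsert; the function and field names are illustrative, not the project's actual code):

```python
def upsert_posts(store: dict, fetched: list) -> tuple:
    """Upsert posts by URI: new posts are inserted; re-encountered
    posts only get their engagement counts refreshed."""
    inserted = updated = 0
    for post in fetched:
        uri = post["uri"]
        if uri in store:
            # Already collected: refresh engagement counts in place
            store[uri]["like_count"] = post["like_count"]
            updated += 1
        else:
            store[uri] = post
            inserted += 1
    return inserted, updated

store = {}
batch = [{"uri": "at://did:plc:x/app.bsky.feed.post/1", "like_count": 3}]
print(upsert_posts(store, batch))  # → (1, 0): first encounter, inserted
batch[0]["like_count"] = 7
print(upsert_posts(store, batch))  # → (0, 1): re-encounter, counts refreshed
```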
## Toxicity Analysis
The analyzer classifies every post and mention using OpenAI's GPT-4.1-nano, scoring content on 12 categories from 0.0 (absent) to 1.0 (extreme):
| Category | What it detects |
|----------|----------------|
| `toxic` | Rude, disrespectful, or aggressive language |
| `threat` | Violence, harm, intimidation, calls to action |
| `hate_speech` | Targeting any protected characteristic |
| `racism` | Race/ethnicity-based hostility |
| `antisemitism` | Anti-Jewish hate, conspiracy theories, coded language |
| `islamophobia` | Anti-Muslim hate, "omvolking" narratives |
| `sexism` | Gender-based discrimination or harassment |
| `homophobia` | Anti-LGBTQ+ rhetoric |
| `insult` | Personal attacks, name-calling |
| `dehumanization` | Comparing people to animals, vermin, disease |
| `extremism` | Far-right/left rhetoric, Nazi glorification, Great Replacement |
| `ableism` | Disability-targeting language, mental health slurs |
The prompt is tuned for Dutch political discourse and recognizes coded terms such as "gelukszoekers" ("fortune seekers", a pejorative for asylum seekers), "kutmarokkanen" (an ethnic slur against Moroccans), "landverrader" ("traitor to the nation"), and "linkse ratten" ("left-wing rats"). Political disagreement and criticism are not scored as toxic — only genuine hostility, hate, and threats.
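As an illustration of what a per-post score record could look like, here is a sketch assuming `overall` is the maximum category score and `flagged` compares it against the threshold (both assumptions — the project's actual aggregation may differ):

```python
FLAG_THRESHOLD = 0.5  # mirrors the ANALYZER_FLAG_THRESHOLD default

CATEGORIES = [
    "toxic", "threat", "hate_speech", "racism", "antisemitism",
    "islamophobia", "sexism", "homophobia", "insult",
    "dehumanization", "extremism", "ableism",
]

def score_record(scores: dict) -> dict:
    """Build a score row: all 12 categories (missing ones default
    to 0.0), an overall score, and a flagged bit."""
    full = {c: float(scores.get(c, 0.0)) for c in CATEGORIES}
    overall = max(full.values())
    return {**full, "overall": overall, "flagged": overall >= FLAG_THRESHOLD}

rec = score_record({"insult": 0.7, "toxic": 0.4})
print(rec["overall"], rec["flagged"])  # → 0.7 True
```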
### Batch Processing
Posts are sent to the API in batches (default 10 per call) to minimize cost and API calls. The ~500-token system prompt is sent once per batch instead of once per post, cutting input token cost by ~60%.
| | 1 post/call | 10 posts/call (default) |
|---|---|---|
| API calls for 60K posts | 60,000 | 6,000 |
| Estimated cost | ~$5.10 | ~$2.40 |
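The batching arithmetic above works out as follows (token counts are the README's own estimates, used here purely for illustration):

```python
import math

POSTS = 60_000
BATCH_SIZE = 10             # ANALYZER_BATCH_SIZE default
SYSTEM_PROMPT_TOKENS = 500  # ~500-token system prompt, per the text above

calls = math.ceil(POSTS / BATCH_SIZE)
print(calls)  # → 6000 API calls instead of 60000

# System-prompt tokens are paid once per call, so per post:
per_post_prompt_tokens = calls * SYSTEM_PROMPT_TOKENS / POSTS
print(per_post_prompt_tokens)  # → 50.0 prompt tokens/post vs 500 unbatched
```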
### Running the Analyzer
```bash
# Test run (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer
# Full run (all unscored posts)
docker compose exec collector python -m src.analyzer
# Check logs
docker compose logs collector | grep analyzer
cat logs/analyzer.log
```
The scheduled cron runs the analyzer automatically every 4 hours (30 minutes after each collection), so new posts are scored without manual intervention.
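The Ofelia configuration isn't shown in this README, but a schedule matching that cadence might look roughly like this (Ofelia's `job-exec` INI syntax with 6-field cron expressions that include a leading seconds field; the container name is an assumption):

```ini
; Sketch only — container names and schedule format are assumptions
[job-exec "collect"]
schedule = 0 0 */4 * * *
container = bluesky-collector
command = python -m src

[job-exec "analyze"]
schedule = 0 30 */4 * * *
container = bluesky-collector
command = python -m src.analyzer
```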
## Web Dashboard
Access at `http://localhost:5001` (or your configured `WEB_PORT`).
### Pages
- **Dashboard** — Overview of collection runs, account count, post/mention totals
- **Accounts** — List of tracked accounts with post counts and last activity
- **Statuses** — Browse all collected posts with filters and search
- **Mentions** — Browse mentions of tracked accounts
- **Analysis** — Toxicity overview: trend charts, category breakdown, recent analysis runs
- **Flagged Content** — Posts scoring above the flag threshold (default 0.5), filterable by category and type
- **Account Toxicity** — Per-account toxicity breakdown with comparative charts
- **Export** — Download data as CSV
## Configuration
### accounts.yml
```yaml
accounts:
  - handle: alice.bsky.social
  - handle: bob.bsky.social
  - handle: some-org.bsky.social
```
### Environment Variables
#### Collection
| Variable | Default | Description |
|----------|---------|-------------|
| `POSTGRES_PASSWORD` | `changeme` | Database password |
| `POSTGRES_PORT` | `5432` | Exposed PostgreSQL port |
| `LOG_LEVEL` | `INFO` | Python log level |
| `MAX_PAGES_PER_ACCOUNT` | `50` | Max API pages per account per run (50 pages = 5000 posts) |
| `MENTION_LOOKBACK_HOURS` | `12` | How far back to search mentions on first run |
| `BSKY_HANDLE` | — | Your Bluesky handle (required for mention search) |
| `BSKY_APP_PASSWORD` | — | App password from Settings > App Passwords |
#### Toxicity Analysis
| Variable | Default | Description |
|----------|---------|-------------|
| `OPENAI_API_KEY` | — | OpenAI API key (required) |
| `ANALYZER_MODEL` | `gpt-4.1-nano` | OpenAI model for classification |
| `ANALYZER_CONCURRENCY` | `3` | Max concurrent API calls (batches in flight) |
| `ANALYZER_BATCH_SIZE` | `10` | Posts per API call |
| `ANALYZER_LIMIT` | `0` | Max posts to process per run (0 = all) |
| `ANALYZER_FLAG_THRESHOLD` | `0.5` | Score above which a post is flagged |
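A sketch of how the analyzer might read these variables (the helper names are illustrative, not the project's actual code):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default

def env_float(name: str, default: float) -> float:
    """Same as env_int, for float-valued settings like thresholds."""
    raw = os.environ.get(name, "").strip()
    return float(raw) if raw else default

BATCH_SIZE = env_int("ANALYZER_BATCH_SIZE", 10)
CONCURRENCY = env_int("ANALYZER_CONCURRENCY", 3)
LIMIT = env_int("ANALYZER_LIMIT", 0)  # 0 = process all unscored posts
FLAG_THRESHOLD = env_float("ANALYZER_FLAG_THRESHOLD", 0.5)
```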
#### Web UI
| Variable | Default | Description |
|----------|---------|-------------|
| `WEB_PORT` | `5001` | Exposed web dashboard port |
## Database Schema
### Key Tables
- **`accounts`** — Tracked accounts (DID, handle, collection timestamps)
- **`posts`** — Posts from tracked accounts (text, timestamps, engagement counts, post type, raw JSON)
- **`mentions`** — Posts from anyone that mention a tracked account
- **`collection_runs`** — Audit trail of each collection run (timing, counts, errors)
- **`collection_state`** — Per-account bookmarks for incremental collection
- **`toxicity_scores`** — Per-post scores across all 12 categories + overall + flagged
- **`mention_toxicity_scores`** — Same structure for mentions
- **`analysis_runs`** — Audit trail of analyzer runs (timing, counts, cost, errors)
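For orientation, the `toxicity_scores` table might look roughly like this — a sketch inferred from the columns used in the queries in this README, not the actual contents of `scripts/02-toxicity.sql`:

```sql
-- Sketch only: columns inferred from this README's example queries
CREATE TABLE IF NOT EXISTS toxicity_scores (
    post_uri        text PRIMARY KEY REFERENCES posts (uri),
    toxic           real NOT NULL,
    threat          real NOT NULL,
    hate_speech     real NOT NULL,
    racism          real NOT NULL,
    antisemitism    real NOT NULL,
    islamophobia    real NOT NULL,
    sexism          real NOT NULL,
    homophobia      real NOT NULL,
    insult          real NOT NULL,
    dehumanization  real NOT NULL,
    extremism       real NOT NULL,
    ableism         real NOT NULL,
    overall         real NOT NULL,
    flagged         boolean NOT NULL DEFAULT false,
    scored_at       timestamptz NOT NULL DEFAULT now()
);
```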
### Useful Queries
```sql
-- Recent posts by a specific account
SELECT created_at, post_type, text, like_count
FROM posts
WHERE author_did = (SELECT did FROM accounts WHERE handle = 'alice.bsky.social')
ORDER BY created_at DESC LIMIT 20;
-- All mentions of a tracked account
SELECT m.post_text, m.post_created_at, m.mentioning_did
FROM mentions m
JOIN accounts a ON a.did = m.mentioned_did
WHERE a.handle = 'alice.bsky.social'
ORDER BY m.post_created_at DESC;
-- Most toxic posts (overall score)
SELECT p.text, t.overall, t.toxic, t.threat, t.hate_speech, t.racism
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
WHERE t.flagged = true
ORDER BY t.overall DESC LIMIT 20;
-- Toxicity by account
SELECT a.handle, avg(t.overall) AS avg_toxicity, count(*) AS scored_posts
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
JOIN accounts a ON a.did = p.author_did
GROUP BY a.handle
ORDER BY avg_toxicity DESC;
-- Analysis run history
SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
FROM analysis_runs ORDER BY started_at DESC LIMIT 10;
-- Collection run history
SELECT id, started_at, status, posts_collected, mentions_collected, duration_secs
FROM collection_runs ORDER BY started_at DESC LIMIT 10;
```
## Operations
### Manual Runs
```bash
# Collect posts
docker compose exec collector python -m src
# Run toxicity analysis
docker compose exec collector python -m src.analyzer
```
### Monitoring
```bash
# Follow logs
docker compose logs -f collector
# Quick data counts
docker compose exec -T db psql -U bluesky -d bluesky -c \
"SELECT (SELECT count(*) FROM posts) AS posts, (SELECT count(*) FROM mentions) AS mentions, (SELECT count(*) FROM toxicity_scores) AS scored;"
# Last analysis run
docker compose exec -T db psql -U bluesky -d bluesky -c \
"SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd FROM analysis_runs ORDER BY started_at DESC LIMIT 5;"
```
### Rebuilding After Code Changes
```bash
docker compose build collector web
docker compose up -d
```
### Add/Remove Accounts
Edit `config/accounts.yml` — changes take effect on the next scheduled or manual run. Removed accounts are marked inactive but their data is preserved.
### First Run / Backfill
The first run pages back up to `MAX_PAGES_PER_ACCOUNT` pages per account (default 50 pages, roughly 5,000 posts). For a deeper backfill, pass a higher value for a single run — note that the variable must be set inside the container with `-e`, since `docker compose exec` does not forward host environment variables:
```bash
docker compose exec -e MAX_PAGES_PER_ACCOUNT=200 collector python -m src
```
### Backup
The `pgdata` volume persists across container restarts. Back it up with standard PostgreSQL tools:
```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
```