# Bluesky Account Monitor

Collects posts, replies, and mentions for a list of Bluesky accounts, runs AI-powered toxicity analysis across 12 categories, and presents results on a web dashboard. Everything runs in Docker.

## Architecture

```
                 ┌────────────────────────────────────────────┐
                 │               Docker Compose               │
                 │                                            │
accounts.yml ───▶│  collector ──▶ PostgreSQL ◀── web (Flask)  │
                 │      │             ▲                       │
                 │      ▼             │                       │
                 │  analyzer ─────────┘                       │
                 │      │                                     │
                 │      ▼                                     │
                 │  OpenAI API                                │
                 │                                            │
                 │  scheduler (Ofelia) ── cron triggers       │
                 └────────────────────────────────────────────┘
```

Four services:

- **db** — PostgreSQL 16 (Alpine), stores all data
- **collector** — Python async service that fetches posts and mentions from Bluesky
- **scheduler** — [Ofelia](https://github.com/mcuadros/ofelia) cron that triggers collection (every 4h) and analysis (every 4h + 30min offset)
- **web** — Flask + Gunicorn dashboard on port 5001

## Quick Start

```bash
# 1. Copy and edit your environment config
cp .env.example .env
# Fill in: BSKY_HANDLE, BSKY_APP_PASSWORD, OPENAI_API_KEY

# 2. Add your target accounts to config/accounts.yml

# 3. Start everything
docker compose up -d

# 4. Run the toxicity schema migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/02-toxicity.sql

# 5. Trigger an immediate first collection
docker compose exec collector python -m src

# 6. Run a test toxicity analysis (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# 7. Open the dashboard
open http://localhost:5001
```

## Collection

### What Gets Collected

| Source | API Endpoint | Stored In |
|--------|--------------|-----------|
| User's own posts & replies | `getAuthorFeed` (public) | `posts` table |
| Posts mentioning a user | `searchPosts` (requires auth) | `mentions` table |

All records include a `raw_json` JSONB column with the full API response for future-proof analysis.
### How It Works

- **Scheduled polling** via Ofelia — runs every 4 hours by default
- **Incremental collection** — only fetches posts newer than the last run
- **Rate limit aware** — reads API response headers and sleeps when approaching limits
- **Deduplication** — posts are upserted by URI; engagement counts are refreshed on re-encounters

## Toxicity Analysis

The analyzer classifies every post and mention using OpenAI's GPT-4.1-nano, scoring content on 12 categories from 0.0 (absent) to 1.0 (extreme):

| Category | What it detects |
|----------|-----------------|
| `toxic` | Rude, disrespectful, or aggressive language |
| `threat` | Violence, harm, intimidation, calls to action |
| `hate_speech` | Targeting any protected characteristic |
| `racism` | Race/ethnicity-based hostility |
| `antisemitism` | Anti-Jewish hate, conspiracy theories, coded language |
| `islamophobia` | Anti-Muslim hate, "omvolking" narratives |
| `sexism` | Gender-based discrimination or harassment |
| `homophobia` | Anti-LGBTQ+ rhetoric |
| `insult` | Personal attacks, name-calling |
| `dehumanization` | Comparing people to animals, vermin, disease |
| `extremism` | Far-right/left rhetoric, Nazi glorification, Great Replacement |
| `ableism` | Disability-targeting language, mental health slurs |

The prompt is tuned for Dutch political discourse, recognizing coded terms like "gelukszoekers", "kutmarokkanen", "landverrader", "linkse ratten", etc. Political disagreement and criticism are not scored as toxic — only genuine hostility, hate, and threats are.

### Batch Processing

Posts are sent to the API in batches (default 10 per call) to minimize cost and API calls. The ~500-token system prompt is sent once per batch instead of once per post, cutting input token cost by ~60%.
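The batching can be sketched as follows — the chunking helper and the request shape are illustrative, not the analyzer's actual code:

```python
from itertools import islice

SYSTEM_PROMPT = "<the ~500-token classification prompt>"  # sent once per batch

def batched(posts: list[dict], size: int = 10):
    """Yield lists of up to `size` posts (mirrors ANALYZER_BATCH_SIZE)."""
    it = iter(posts)
    while chunk := list(islice(it, size)):
        yield chunk

def build_request(batch: list[dict], model: str = "gpt-4.1-nano") -> dict:
    """One chat-completion payload that scores a whole batch in a single call."""
    numbered = "\n".join(f"[{i}] {p['text']}" for i, p in enumerate(batch))
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # shared across the batch
            {"role": "user", "content": numbered},
        ],
    }
```

With 60,000 posts and a batch size of 10, `batched` yields 6,000 chunks, i.e. 6,000 API calls instead of 60,000.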
|  | 1 post/call | 10 posts/call (default) |
|---|---|---|
| API calls for 60K posts | 60,000 | 6,000 |
| Estimated cost | ~$5.10 | ~$2.40 |

### Running the Analyzer

```bash
# Test run (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# Full run (all unscored posts)
docker compose exec collector python -m src.analyzer

# Check logs
docker compose logs collector | grep analyzer
cat logs/analyzer.log
```

The scheduled cron runs the analyzer automatically every 4 hours (30 minutes after each collection), so new posts are scored without manual intervention.

## Web Dashboard

Access at `http://localhost:5001` (or your configured `WEB_PORT`).

### Pages

- **Dashboard** — Overview of collection runs, account count, post/mention totals
- **Accounts** — List of tracked accounts with post counts and last activity
- **Statuses** — Browse all collected posts with filters and search
- **Mentions** — Browse mentions of tracked accounts
- **Analysis** — Toxicity overview: trend charts, category breakdown, recent analysis runs
- **Flagged Content** — Posts scoring above the flag threshold (default 0.5), filterable by category and type
- **Account Toxicity** — Per-account toxicity breakdown with comparative charts
- **Export** — Download data as CSV

## Configuration

### accounts.yml

```yaml
accounts:
  - handle: alice.bsky.social
  - handle: bob.bsky.social
  - handle: some-org.bsky.social
```

### Environment Variables

#### Collection

| Variable | Default | Description |
|----------|---------|-------------|
| `POSTGRES_PASSWORD` | `changeme` | Database password |
| `POSTGRES_PORT` | `5432` | Exposed PostgreSQL port |
| `LOG_LEVEL` | `INFO` | Python log level |
| `MAX_PAGES_PER_ACCOUNT` | `50` | Max API pages per account per run (50 pages = 5,000 posts) |
| `MENTION_LOOKBACK_HOURS` | `12` | How far back to search mentions on the first run |
| `BSKY_HANDLE` | — | Your Bluesky handle (required for mention search) |
| `BSKY_APP_PASSWORD` | — | App password from Settings > App Passwords |

#### Toxicity Analysis

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENAI_API_KEY` | — | OpenAI API key (required) |
| `ANALYZER_MODEL` | `gpt-4.1-nano` | OpenAI model for classification |
| `ANALYZER_CONCURRENCY` | `3` | Max concurrent API calls (batches in flight) |
| `ANALYZER_BATCH_SIZE` | `10` | Posts per API call |
| `ANALYZER_LIMIT` | `0` | Max posts to process per run (0 = all) |
| `ANALYZER_FLAG_THRESHOLD` | `0.5` | Score above which a post is flagged |

#### Web UI

| Variable | Default | Description |
|----------|---------|-------------|
| `WEB_PORT` | `5001` | Exposed web dashboard port |

## Database Schema

### Key Tables

- **`accounts`** — Tracked accounts (DID, handle, collection timestamps)
- **`posts`** — Posts from tracked accounts (text, timestamps, engagement counts, post type, raw JSON)
- **`mentions`** — Posts from anyone that mention a tracked account
- **`collection_runs`** — Audit trail of each collection run (timing, counts, errors)
- **`collection_state`** — Per-account bookmarks for incremental collection
- **`toxicity_scores`** — Per-post scores across all 12 categories, plus overall score and flagged status
- **`mention_toxicity_scores`** — Same structure, for mentions
- **`analysis_runs`** — Audit trail of analyzer runs (timing, counts, cost, errors)

### Useful Queries

```sql
-- Recent posts by a specific account
SELECT created_at, post_type, text, like_count
FROM posts
WHERE author_did = (SELECT did FROM accounts WHERE handle = 'alice.bsky.social')
ORDER BY created_at DESC
LIMIT 20;

-- All mentions of a tracked account
SELECT m.post_text, m.post_created_at, m.mentioning_did
FROM mentions m
JOIN accounts a ON a.did = m.mentioned_did
WHERE a.handle = 'alice.bsky.social'
ORDER BY m.post_created_at DESC;

-- Most toxic posts (overall score)
SELECT p.text, t.overall, t.toxic, t.threat, t.hate_speech, t.racism
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
WHERE t.flagged = true
ORDER BY t.overall DESC
LIMIT 20;

-- Toxicity by account
SELECT a.handle, avg(t.overall) AS avg_toxicity, count(*) AS scored_posts
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
JOIN accounts a ON a.did = p.author_did
GROUP BY a.handle
ORDER BY avg_toxicity DESC;

-- Analysis run history
SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
FROM analysis_runs
ORDER BY started_at DESC
LIMIT 10;

-- Collection run history
SELECT id, started_at, status, posts_collected, mentions_collected, duration_secs
FROM collection_runs
ORDER BY started_at DESC
LIMIT 10;
```

## Operations

### Manual Runs

```bash
# Collect posts
docker compose exec collector python -m src

# Run toxicity analysis
docker compose exec collector python -m src.analyzer
```

### Monitoring

```bash
# Follow logs
docker compose logs -f collector

# Quick data counts
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT (SELECT count(*) FROM posts) AS posts,
          (SELECT count(*) FROM mentions) AS mentions,
          (SELECT count(*) FROM toxicity_scores) AS scored;"

# Last analysis run
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
   FROM analysis_runs ORDER BY started_at DESC LIMIT 5;"
```

### Rebuilding After Code Changes

```bash
docker compose build collector web
docker compose up -d
```

### Add/Remove Accounts

Edit `config/accounts.yml` — changes take effect on the next scheduled or manual run. Removed accounts are marked inactive, but their data is preserved.

### First Run / Backfill

The first run pages back up to `MAX_PAGES_PER_ACCOUNT` pages per account (default 50 pages, roughly 5,000 posts). For a deeper backfill, temporarily raise the limit for a manual run. Note that a host-side variable is not visible inside the already-running container, so pass it with `-e`:

```bash
docker compose exec -e MAX_PAGES_PER_ACCOUNT=200 collector python -m src
```

### Backup

The `pgdata` volume persists across container restarts. Back it up with standard PostgreSQL tools:

```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
```
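Finally, the incremental collection described under How It Works can be sketched as a newest-first walk that stops at the previously stored bookmark. Names and the exact stopping rule are illustrative; the real collector keeps its per-account bookmark in `collection_state`:

```python
def collect_new(pages, last_seen_uri):
    """Walk feed pages newest-first and stop at the bookmarked URI.

    `pages` is an iterable of post lists (each one API page, newest first).
    Returns (new_posts, new_bookmark), where the bookmark is the URI of the
    newest post seen, to be persisted for the next run.
    """
    new_posts = []
    for page in pages:
        for post in page:
            if post["uri"] == last_seen_uri:  # reached the previous run's frontier
                return new_posts, (new_posts[0]["uri"] if new_posts else last_seen_uri)
            new_posts.append(post)
    # First run, or the bookmark fell outside the paged window: keep everything.
    return new_posts, (new_posts[0]["uri"] if new_posts else last_seen_uri)
```

Combined with upserting by URI, re-encountering an already stored post is harmless: the insert becomes an update that refreshes engagement counts.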