# Bluesky Account Monitor

Collects posts, replies, and mentions for a list of Bluesky accounts, runs AI-powered toxicity analysis across 12 categories, and presents results on a web dashboard. Everything runs in Docker.

## Architecture

```
                 ┌────────────────────────────────────────────┐
                 │               Docker Compose               │
                 │                                            │
accounts.yml ───▶│  collector ──▶ PostgreSQL ◀── web (Flask)  │
                 │      │             ▲                       │
                 │      ▼             │                       │
                 │  analyzer ─────────┘                       │
                 │      │                                     │
                 │      ▼                                     │
                 │  OpenAI API                                │
                 │                                            │
                 │  scheduler (Ofelia) ── cron triggers       │
                 └────────────────────────────────────────────┘
```

Four services:

- **db** — PostgreSQL 16 (Alpine), stores all data
- **collector** — Python async service that fetches posts and mentions from Bluesky
- **scheduler** — [Ofelia](https://github.com/mcuadros/ofelia) cron that triggers collection (every 4h) and analysis (every 4h + 30min offset)
- **web** — Flask + Gunicorn dashboard on port 5001

## Quick Start

```bash
# 1. Copy and edit your environment config
cp .env.example .env
# Fill in: BSKY_HANDLE, BSKY_APP_PASSWORD, OPENAI_API_KEY

# 2. Add your target accounts to config/accounts.yml

# 3. Start everything
docker compose up -d

# 4. Run the toxicity schema migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/02-toxicity.sql

# 5. Trigger an immediate first collection
docker compose exec collector python -m src

# 6. Run a test toxicity analysis (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# 7. Open the dashboard
open http://localhost:5001
```

## Collection

### What Gets Collected

| Source | API Endpoint | Stored In |
|--------|--------------|-----------|
| User's own posts & replies | `getAuthorFeed` (public) | `posts` table |
| Posts mentioning a user | `searchPosts` (requires auth) | `mentions` table |

All records include a `raw_json` JSONB column with the full API response for future-proof analysis.
### How It Works

- **Scheduled polling** via Ofelia — runs every 4 hours by default
- **Incremental collection** — only fetches posts newer than the last run
- **Rate limit aware** — reads API response headers and sleeps when approaching limits
- **Deduplication** — posts are upserted by URI; engagement counts are refreshed on re-encounters

## Toxicity Analysis

The analyzer classifies every post and mention using OpenAI's GPT-4.1-nano, scoring content on 12 categories from 0.0 (absent) to 1.0 (extreme):

| Category | What it detects |
|----------|-----------------|
| `toxic` | Rude, disrespectful, or aggressive language |
| `threat` | Violence, harm, intimidation, calls to action |
| `hate_speech` | Targeting any protected characteristic |
| `racism` | Race/ethnicity-based hostility |
| `antisemitism` | Anti-Jewish hate, conspiracy theories, coded language |
| `islamophobia` | Anti-Muslim hate, "omvolking" narratives |
| `sexism` | Gender-based discrimination or harassment |
| `homophobia` | Anti-LGBTQ+ rhetoric |
| `insult` | Personal attacks, name-calling |
| `dehumanization` | Comparing people to animals, vermin, disease |
| `extremism` | Far-right/left rhetoric, Nazi glorification, Great Replacement |
| `ableism` | Disability-targeting language, mental health slurs |

The prompt is tuned for Dutch political discourse, recognizing coded terms like "gelukszoekers", "kutmarokkanen", "landverrader", "linkse ratten", etc. Political disagreement and criticism are not scored as toxic — only genuine hostility, hate, and threats are.

### Batch Processing

Posts are sent to the API in batches (default 10 per call) to minimize cost and API calls. The ~500-token system prompt is sent once per batch instead of once per post, cutting input token cost by ~60%.
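The batching can be sketched as follows — the chunking helper and the request shape are illustrative, not the analyzer's actual code:

```python
from itertools import islice

SYSTEM_PROMPT = "<the ~500-token classification prompt>"  # sent once per batch

def batched(posts: list[dict], size: int = 10):
    """Yield lists of up to `size` posts (mirrors ANALYZER_BATCH_SIZE)."""
    it = iter(posts)
    while chunk := list(islice(it, size)):
        yield chunk

def build_request(batch: list[dict], model: str = "gpt-4.1-nano") -> dict:
    """One chat-completion payload that scores a whole batch in a single call."""
    numbered = "\n".join(f"[{i}] {p['text']}" for i, p in enumerate(batch))
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # shared across the batch
            {"role": "user", "content": numbered},
        ],
    }
```

With 60,000 posts and a batch size of 10, `batched` yields 6,000 chunks, i.e. 6,000 API calls instead of 60,000.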
|  | 1 post/call | 10 posts/call (default) |
|---|---|---|
| API calls for 60K posts | 60,000 | 6,000 |
| Estimated cost | ~$5.10 | ~$2.40 |

### Running the Analyzer

```bash
# Test run (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# Full run (all unscored posts)
docker compose exec collector python -m src.analyzer

# Check logs
docker compose logs collector | grep analyzer
cat logs/analyzer.log
```

The scheduled cron runs the analyzer automatically every 4 hours (30 minutes after each collection), so new posts are scored without manual intervention.

## Web Dashboard

Access at `http://localhost:5001` (or your configured `WEB_PORT`).

### Pages

- **Dashboard** — Overview of collection runs, account count, post/mention totals
- **Accounts** — List of tracked accounts with post counts and last activity
- **Statuses** — Browse all collected posts with filters and search
- **Mentions** — Browse mentions of tracked accounts
- **Analysis** — Toxicity overview: trend charts, category breakdown, recent analysis runs
- **Flagged Content** — Posts scoring above the flag threshold (default 0.5), filterable by category and type
- **Account Toxicity** — Per-account toxicity breakdown with comparative charts
- **Export** — Download data as CSV

## Configuration

### accounts.yml

```yaml
accounts:
  - handle: alice.bsky.social
  - handle: bob.bsky.social
  - handle: some-org.bsky.social
```

### Environment Variables

#### Collection

| Variable | Default | Description |
|----------|---------|-------------|
| `POSTGRES_PASSWORD` | `changeme` | Database password |
| `POSTGRES_PORT` | `5432` | Exposed PostgreSQL port |
| `LOG_LEVEL` | `INFO` | Python log level |
| `MAX_PAGES_PER_ACCOUNT` | `50` | Max API pages per account per run (50 pages = 5,000 posts) |
| `MENTION_LOOKBACK_HOURS` | `12` | How far back to search mentions on the first run |
| `BSKY_HANDLE` | — | Your Bluesky handle (required for mention search) |
| `BSKY_APP_PASSWORD` | — | App password from Settings > App Passwords |

#### Toxicity Analysis

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENAI_API_KEY` | — | OpenAI API key (required) |
| `ANALYZER_MODEL` | `gpt-4.1-nano` | OpenAI model for classification |
| `ANALYZER_CONCURRENCY` | `3` | Max concurrent API calls (batches in flight) |
| `ANALYZER_BATCH_SIZE` | `10` | Posts per API call |
| `ANALYZER_LIMIT` | `0` | Max posts to process per run (0 = all) |
| `ANALYZER_FLAG_THRESHOLD` | `0.5` | Score above which a post is flagged |

#### Web UI

| Variable | Default | Description |
|----------|---------|-------------|
| `WEB_PORT` | `5001` | Exposed web dashboard port |

## Database Schema

### Key Tables

- **`accounts`** — Tracked accounts (DID, handle, collection timestamps)
- **`posts`** — Posts from tracked accounts (text, timestamps, engagement counts, post type, raw JSON)
- **`mentions`** — Posts from anyone that mention a tracked account
- **`collection_runs`** — Audit trail of each collection run (timing, counts, errors)
- **`collection_state`** — Per-account bookmarks for incremental collection
- **`toxicity_scores`** — Per-post scores across all 12 categories, plus overall score and flagged status
- **`mention_toxicity_scores`** — Same structure, for mentions
- **`analysis_runs`** — Audit trail of analyzer runs (timing, counts, cost, errors)

### Useful Queries

```sql
-- Recent posts by a specific account
SELECT created_at, post_type, text, like_count
FROM posts
WHERE author_did = (SELECT did FROM accounts WHERE handle = 'alice.bsky.social')
ORDER BY created_at DESC
LIMIT 20;

-- All mentions of a tracked account
SELECT m.post_text, m.post_created_at, m.mentioning_did
FROM mentions m
JOIN accounts a ON a.did = m.mentioned_did
WHERE a.handle = 'alice.bsky.social'
ORDER BY m.post_created_at DESC;

-- Most toxic posts (overall score)
SELECT p.text, t.overall, t.toxic, t.threat, t.hate_speech, t.racism
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
WHERE t.flagged = true
ORDER BY t.overall DESC
LIMIT 20;

-- Toxicity by account
SELECT a.handle, avg(t.overall) AS avg_toxicity, count(*) AS scored_posts
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
JOIN accounts a ON a.did = p.author_did
GROUP BY a.handle
ORDER BY avg_toxicity DESC;

-- Analysis run history
SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
FROM analysis_runs
ORDER BY started_at DESC
LIMIT 10;

-- Collection run history
SELECT id, started_at, status, posts_collected, mentions_collected, duration_secs
FROM collection_runs
ORDER BY started_at DESC
LIMIT 10;
```

## Operations

### Manual Runs

```bash
# Collect posts
docker compose exec collector python -m src

# Run toxicity analysis
docker compose exec collector python -m src.analyzer
```

### Monitoring

```bash
# Follow logs
docker compose logs -f collector

# Quick data counts
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT (SELECT count(*) FROM posts) AS posts,
          (SELECT count(*) FROM mentions) AS mentions,
          (SELECT count(*) FROM toxicity_scores) AS scored;"

# Last analysis run
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
   FROM analysis_runs ORDER BY started_at DESC LIMIT 5;"
```

### Rebuilding After Code Changes

```bash
docker compose build collector web
docker compose up -d
```

### Add/Remove Accounts

Edit `config/accounts.yml` — changes take effect on the next scheduled or manual run. Removed accounts are marked inactive, but their data is preserved.

### First Run / Backfill

The first run pages back up to `MAX_PAGES_PER_ACCOUNT` pages per account (default 50 pages, roughly 5,000 posts). For a deeper backfill, temporarily raise the limit for a manual run. Note that a host-side variable is not visible inside the already-running container, so pass it with `-e`:

```bash
docker compose exec -e MAX_PAGES_PER_ACCOUNT=200 collector python -m src
```

### Backup

The `pgdata` volume persists across container restarts. Back it up with standard PostgreSQL tools:

```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
```
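Finally, the incremental collection described under How It Works can be sketched as a newest-first walk that stops at the previously stored bookmark. Names and the exact stopping rule are illustrative; the real collector keeps its per-account bookmark in `collection_state`:

```python
def collect_new(pages, last_seen_uri):
    """Walk feed pages newest-first and stop at the bookmarked URI.

    `pages` is an iterable of post lists (each one API page, newest first).
    Returns (new_posts, new_bookmark), where the bookmark is the URI of the
    newest post seen, to be persisted for the next run.
    """
    new_posts = []
    for page in pages:
        for post in page:
            if post["uri"] == last_seen_uri:  # reached the previous run's frontier
                return new_posts, (new_posts[0]["uri"] if new_posts else last_seen_uri)
            new_posts.append(post)
    # First run, or the bookmark fell outside the paged window: keep everything.
    return new_posts, (new_posts[0]["uri"] if new_posts else last_seen_uri)
```

Combined with upserting by URI, re-encountering an already stored post is harmless: the insert becomes an update that refreshes engagement counts.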