# Bluesky Account Monitor

Collects posts, replies, and mentions for a list of Bluesky accounts, runs AI-powered toxicity analysis across 12 categories, and presents results on a web dashboard. Everything runs in Docker.
## Architecture

```
                 ┌──────────────────────────────────────────┐
                 │              Docker Compose              │
                 │                                          │
accounts.yml ───▶│ collector ──▶ PostgreSQL ◀── web (Flask) │
                 │     │             ▲                      │
                 │     ▼             │                      │
                 │  analyzer ────────┘                      │
                 │     │                                    │
                 │     ▼                                    │
                 │  OpenAI API                              │
                 │                                          │
                 │  scheduler (Ofelia) ── cron triggers     │
                 └──────────────────────────────────────────┘
```
Four services:
- `db` — PostgreSQL 16 (Alpine), stores all data
- `collector` — Python async service that fetches posts and mentions from Bluesky
- `scheduler` — Ofelia cron that triggers collection (every 4 h) and analysis (every 4 h, offset by 30 min)
- `web` — Flask + Gunicorn dashboard on port 5001
## Quick Start

```bash
# 1. Copy and edit your environment config
cp .env.example .env
#    Fill in: BSKY_HANDLE, BSKY_APP_PASSWORD, OPENAI_API_KEY

# 2. Add your target accounts to config/accounts.yml

# 3. Start everything
docker compose up -d

# 4. Run the toxicity schema migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/02-toxicity.sql

# 5. Trigger an immediate first collection
docker compose exec collector python -m src

# 6. Run a test toxicity analysis (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# 7. Open the dashboard
open http://localhost:5001
```
## Collection

### What Gets Collected

| Source | API Endpoint | Stored In |
|---|---|---|
| User's own posts & replies | `getAuthorFeed` (public) | `posts` table |
| Posts mentioning a user | `searchPosts` (requires auth) | `mentions` table |

All records include a `raw_json` JSONB column with the full API response for future-proof analysis.
### How It Works

- **Scheduled polling via Ofelia** — runs every 4 hours by default
- **Incremental collection** — only fetches posts newer than the last run
- **Rate-limit aware** — reads API response headers and sleeps when approaching limits
- **Deduplication** — posts are upserted by URI; engagement counts are refreshed on re-encounters
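The deduplication bullet can be sketched in Python, with in-memory SQLite standing in for PostgreSQL (both accept `INSERT ... ON CONFLICT DO UPDATE`; the table shape here is illustrative, not the project's actual schema):

```python
import sqlite3

# In-memory stand-in for the posts table; the upsert syntax below is valid
# in both SQLite and PostgreSQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (uri TEXT PRIMARY KEY, text TEXT, like_count INTEGER)")

def upsert_post(uri: str, text: str, like_count: int) -> None:
    """Insert a post, or refresh its engagement counts if the URI is already known."""
    db.execute(
        """INSERT INTO posts (uri, text, like_count) VALUES (?, ?, ?)
           ON CONFLICT (uri) DO UPDATE SET like_count = excluded.like_count""",
        (uri, text, like_count),
    )

upsert_post("at://did:plc:abc/app.bsky.feed.post/1", "hello", 2)
upsert_post("at://did:plc:abc/app.bsky.feed.post/1", "hello", 7)  # re-encounter

row = db.execute("SELECT count(*), max(like_count) FROM posts").fetchone()
print(row)  # → (1, 7): one row, refreshed like count
```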
## Toxicity Analysis

The analyzer classifies every post and mention using OpenAI's GPT-4.1-nano, scoring content on 12 categories from 0.0 (absent) to 1.0 (extreme):
| Category | What it detects |
|---|---|
| `toxic` | Rude, disrespectful, or aggressive language |
| `threat` | Violence, harm, intimidation, calls to action |
| `hate_speech` | Targeting any protected characteristic |
| `racism` | Race/ethnicity-based hostility |
| `antisemitism` | Anti-Jewish hate, conspiracy theories, coded language |
| `islamophobia` | Anti-Muslim hate, "omvolking" narratives |
| `sexism` | Gender-based discrimination or harassment |
| `homophobia` | Anti-LGBTQ+ rhetoric |
| `insult` | Personal attacks, name-calling |
| `dehumanization` | Comparing people to animals, vermin, disease |
| `extremism` | Far-right/left rhetoric, Nazi glorification, Great Replacement |
| `ableism` | Disability-targeting language, mental health slurs |
The prompt is tuned for Dutch political discourse, recognizing coded terms like "gelukszoekers", "kutmarokkanen", "landverrader", "linkse ratten", etc. Political disagreement and criticism are not scored as toxic — only genuine hostility, hate, and threats.
### Batch Processing
Posts are sent to the API in batches (default 10 per call) to minimize cost and API calls. The ~500-token system prompt is sent once per batch instead of once per post, cutting input token cost by ~60%.
| | 1 post/call | 10 posts/call (default) |
|---|---|---|
| API calls for 60K posts | 60,000 | 6,000 |
| Estimated cost | ~$5.10 | ~$2.40 |
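The call-count arithmetic behind the table is just chunking. A sketch using a hypothetical `batches` helper rather than the project's actual code:

```python
from itertools import islice

def batches(items, size=10):
    """Yield chunks of `size` posts, mirroring the default ANALYZER_BATCH_SIZE."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

posts = range(60_000)  # stand-ins for unscored posts
calls = sum(1 for _ in batches(posts, 10))
print(calls)  # → 6000: one system prompt per call instead of one per post
```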
### Running the Analyzer

```bash
# Test run (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# Full run (all unscored posts)
docker compose exec collector python -m src.analyzer

# Check logs
docker compose logs collector | grep analyzer
cat logs/analyzer.log
```
The scheduled cron runs the analyzer automatically every 4 hours (30 minutes after each collection), so new posts are scored without manual intervention.
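In docker-compose terms, scheduling like this is typically expressed as Ofelia `job-exec` labels on the collector container. A hedged sketch (the label names follow Ofelia's documented convention, but the exact values are illustrative, not copied from this repo's docker-compose.yml):

```yaml
services:
  collector:
    labels:
      ofelia.enabled: "true"
      ofelia.job-exec.collect.schedule: "0 0 */4 * * *"   # every 4 hours
      ofelia.job-exec.collect.command: "python -m src"
      ofelia.job-exec.analyze.schedule: "0 30 */4 * * *"  # 30 min after each collection
      ofelia.job-exec.analyze.command: "python -m src.analyzer"
```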
## Web Dashboard

Access at http://localhost:5001 (or your configured `WEB_PORT`).

### Pages

- **Dashboard** — Overview of collection runs, account count, post/mention totals
- **Accounts** — List of tracked accounts with post counts and last activity
- **Statuses** — Browse all collected posts with filters and search
- **Mentions** — Browse mentions of tracked accounts
- **Analysis** — Toxicity overview: trend charts, category breakdown, recent analysis runs
- **Flagged Content** — Posts scoring above the flag threshold (default 0.5), filterable by category and type
- **Account Toxicity** — Per-account toxicity breakdown with comparative charts
- **Export** — Download data as CSV
## Configuration

### accounts.yml

```yaml
accounts:
  - handle: alice.bsky.social
  - handle: bob.bsky.social
  - handle: some-org.bsky.social
```
### Environment Variables

#### Collection

| Variable | Default | Description |
|---|---|---|
| `POSTGRES_PASSWORD` | `changeme` | Database password |
| `POSTGRES_PORT` | `5432` | Exposed PostgreSQL port |
| `LOG_LEVEL` | `INFO` | Python log level |
| `MAX_PAGES_PER_ACCOUNT` | `50` | Max API pages per account per run (50 pages = 5,000 posts) |
| `MENTION_LOOKBACK_HOURS` | `12` | How far back to search mentions on first run |
| `BSKY_HANDLE` | — | Your Bluesky handle (required for mention search) |
| `BSKY_APP_PASSWORD` | — | App password from Settings > App Passwords |
#### Toxicity Analysis

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | — | OpenAI API key (required) |
| `ANALYZER_MODEL` | `gpt-4.1-nano` | OpenAI model for classification |
| `ANALYZER_CONCURRENCY` | `3` | Max concurrent API calls (batches in flight) |
| `ANALYZER_BATCH_SIZE` | `10` | Posts per API call |
| `ANALYZER_LIMIT` | `0` | Max posts to process per run (0 = all) |
| `ANALYZER_FLAG_THRESHOLD` | `0.5` | Score above which a post is flagged |
#### Web UI

| Variable | Default | Description |
|---|---|---|
| `WEB_PORT` | `5001` | Exposed web dashboard port |
## Database Schema

### Key Tables

- `accounts` — Tracked accounts (DID, handle, collection timestamps)
- `posts` — Posts from tracked accounts (text, timestamps, engagement counts, post type, raw JSON)
- `mentions` — Posts from anyone that mention a tracked account
- `collection_runs` — Audit trail of each collection run (timing, counts, errors)
- `collection_state` — Per-account bookmarks for incremental collection
- `toxicity_scores` — Per-post scores across all 12 categories, plus overall and flagged
- `mention_toxicity_scores` — Same structure, for mentions
- `analysis_runs` — Audit trail of analyzer runs (timing, counts, cost, errors)
### Useful Queries

```sql
-- Recent posts by a specific account
SELECT created_at, post_type, text, like_count
FROM posts
WHERE author_did = (SELECT did FROM accounts WHERE handle = 'alice.bsky.social')
ORDER BY created_at DESC LIMIT 20;

-- All mentions of a tracked account
SELECT m.post_text, m.post_created_at, m.mentioning_did
FROM mentions m
JOIN accounts a ON a.did = m.mentioned_did
WHERE a.handle = 'alice.bsky.social'
ORDER BY m.post_created_at DESC;

-- Most toxic posts (overall score)
SELECT p.text, t.overall, t.toxic, t.threat, t.hate_speech, t.racism
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
WHERE t.flagged = true
ORDER BY t.overall DESC LIMIT 20;

-- Toxicity by account
SELECT a.handle, avg(t.overall) AS avg_toxicity, count(*) AS scored_posts
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
JOIN accounts a ON a.did = p.author_did
GROUP BY a.handle
ORDER BY avg_toxicity DESC;

-- Analysis run history
SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
FROM analysis_runs ORDER BY started_at DESC LIMIT 10;

-- Collection run history
SELECT id, started_at, status, posts_collected, mentions_collected, duration_secs
FROM collection_runs ORDER BY started_at DESC LIMIT 10;
```
## Operations

### Manual Runs

```bash
# Collect posts
docker compose exec collector python -m src

# Run toxicity analysis
docker compose exec collector python -m src.analyzer
```
### Monitoring

```bash
# Follow logs
docker compose logs -f collector

# Quick data counts
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT (SELECT count(*) FROM posts) AS posts, (SELECT count(*) FROM mentions) AS mentions, (SELECT count(*) FROM toxicity_scores) AS scored;"

# Last analysis runs
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd FROM analysis_runs ORDER BY started_at DESC LIMIT 5;"
```
### Rebuilding After Code Changes

```bash
docker compose build collector web
docker compose up -d
```
### Add/Remove Accounts

Edit `config/accounts.yml` — changes take effect on the next scheduled or manual run. Removed accounts are marked inactive, but their data is preserved.
### First Run / Backfill

The first run pages back up to `MAX_PAGES_PER_ACCOUNT` pages (default 50, i.e. up to 5,000 posts). For a deeper backfill, temporarily increase this value (passed into the container with `-e`, as in the other commands):

```bash
docker compose exec -e MAX_PAGES_PER_ACCOUNT=200 collector python -m src
```
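Since the defaults imply roughly 100 posts per page (50 pages = 5,000 posts), the backfill ceiling is just pages times posts-per-page:

```python
POSTS_PER_PAGE = 100  # implied by the defaults above: 50 pages = 5,000 posts

def backfill_ceiling(max_pages: int) -> int:
    """Upper bound on posts fetched per account in one run."""
    return max_pages * POSTS_PER_PAGE

print(backfill_ceiling(50))   # default MAX_PAGES_PER_ACCOUNT
print(backfill_ceiling(200))  # the deeper backfill shown above
```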
### Backup

The `pgdata` volume persists across container restarts. Back it up with standard PostgreSQL tools:

```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
```