# Bluesky Account Monitor

Collects posts, replies, and mentions for a list of Bluesky accounts, runs AI-powered toxicity analysis across 12 categories, and presents results on a web dashboard. Everything runs in Docker.

## Architecture

```
                 ┌───────────────────────────────────────────┐
                 │              Docker Compose               │
                 │                                           │
accounts.yml ───▶│ collector ──▶ PostgreSQL ◀── web (Flask)  │
                 │     │              ▲                      │
                 │     ▼              │                      │
                 │ analyzer ──────────┘                      │
                 │     │                                     │
                 │     ▼                                     │
                 │  LLM API                                  │
                 │                                           │
                 │ scheduler (Ofelia) ── cron triggers       │
                 └───────────────────────────────────────────┘
```

Four services:

- **db** — PostgreSQL 16 (Alpine), stores all data
- **collector** — Python async service that fetches posts and mentions from Bluesky; the analyzer (`python -m src.analyzer`) runs in this same container
- **scheduler** — [Ofelia](https://github.com/mcuadros/ofelia) cron that triggers collection (every 4h) and analysis (every 4h, offset by 30 min)
- **web** — Flask + Gunicorn dashboard on port 5001

## Quick Start

```bash
# 1. Copy and edit your environment config
cp .env.example .env
# Fill in: BSKY_HANDLE, BSKY_APP_PASSWORD, LLM_API_KEY

# 2. Add your target accounts to config/accounts.yml

# 3. Start everything
docker compose up -d

# 4. Run the toxicity schema migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/02-toxicity.sql

# 5. Trigger an immediate first collection
docker compose exec collector python -m src

# 6. Run a test toxicity analysis (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# 7. Open the dashboard (macOS; on Linux use xdg-open)
open http://localhost:5001
```

## Collection

### What Gets Collected

| Source | API Endpoint | Stored In |
|--------|-------------|-----------|
| User's own posts & replies | `getAuthorFeed` (public) | `posts` table |
| Posts mentioning a user | `searchPosts` (requires auth) | `mentions` table |

All records include a `raw_json` JSONB column with the full API response for future-proof analysis.

### How It Works

- **Scheduled polling** via Ofelia — runs every 4 hours by default
- **Incremental collection** — only fetches posts newer than the last run
- **Rate limit aware** — reads API response headers and sleeps when approaching limits
- **Deduplication** — posts are upserted by URI; engagement counts are refreshed on re-encounters

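The rate-limit behaviour above can be sketched as a small pure function. This is an illustration, not the collector's actual code. The header names `ratelimit-remaining` and `ratelimit-reset` (taken here to be a Unix timestamp), the `min_remaining` margin, and the function name are all assumptions.

```python
import time
from typing import Mapping, Optional


def seconds_until_safe(
    headers: Mapping[str, str],
    now: Optional[float] = None,
    min_remaining: int = 10,
) -> float:
    """Return how long to sleep before the next request (0.0 if no wait is needed)."""
    now = time.time() if now is None else now
    try:
        # Assumed header names; keys are expected in lowercase here.
        remaining = int(headers["ratelimit-remaining"])
        reset_at = float(headers["ratelimit-reset"])  # assumed: Unix timestamp
    except (KeyError, ValueError):
        return 0.0  # headers absent or malformed: do not block the run
    if remaining > min_remaining:
        return 0.0  # plenty of budget left
    # Close to the limit: wait until the window resets (never a negative wait).
    return max(0.0, reset_at - now)
```

A caller would pass the headers of the last response and, if the result is positive, `await asyncio.sleep(...)` for that long before issuing the next request.
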
## Toxicity Analysis

The analyzer classifies every post and mention using an LLM, scoring content on 12 categories from 0.0 (absent) to 1.0 (extreme):

| Category | What it detects |
|----------|----------------|
| `toxic` | Rude, disrespectful, or aggressive language |
| `threat` | Violence, harm, intimidation, calls to action |
| `hate_speech` | Targeting any protected characteristic |
| `racism` | Race/ethnicity-based hostility |
| `antisemitism` | Anti-Jewish hate, conspiracy theories, coded language |
| `islamophobia` | Anti-Muslim hate, "omvolking" narratives |
| `sexism` | Gender-based discrimination or harassment |
| `homophobia` | Anti-LGBTQ+ rhetoric |
| `insult` | Personal attacks, name-calling |
| `dehumanization` | Comparing people to animals, vermin, disease |
| `extremism` | Far-right/left rhetoric, Nazi glorification, Great Replacement |
| `ableism` | Disability-targeting language, mental health slurs |

The prompt is tuned for Dutch political discourse, recognizing coded terms like "gelukszoekers", "kutmarokkanen", "landverrader", "linkse ratten", etc. Political disagreement and criticism are not scored as toxic — only genuine hostility, hate, and threats.

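As a rough sketch of how a per-post result could be validated, the function below parses a JSON object of category scores, clamps each value to the 0.0 to 1.0 range, and derives a flag. The function name and the JSON shape are hypothetical, and taking `overall` as the maximum category score is an assumption; the real analyzer may compute it differently.

```python
import json
from typing import Dict, Tuple

CATEGORIES = (
    "toxic", "threat", "hate_speech", "racism", "antisemitism", "islamophobia",
    "sexism", "homophobia", "insult", "dehumanization", "extremism", "ableism",
)


def normalize_scores(
    raw: str, flag_threshold: float = 0.5
) -> Tuple[Dict[str, float], float, bool]:
    """Parse one model reply into clamped category scores, an overall score, and a flag."""
    data = json.loads(raw)
    scores: Dict[str, float] = {}
    for name in CATEGORIES:
        value = float(data.get(name, 0.0))  # a missing category is treated as absent
        scores[name] = min(1.0, max(0.0, value))  # clamp to [0.0, 1.0]
    overall = max(scores.values())  # assumption: overall = highest category score
    flagged = overall > flag_threshold  # flagged only when strictly above the threshold
    return scores, overall, flagged
```

Clamping guards against a model returning values slightly outside the documented range, which would otherwise skew averages and threshold checks.
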
### Batch Processing

Posts are sent to the API in batches (default 10 per call) to reduce both cost and the number of API calls. The ~500-token system prompt is sent once per batch instead of once per post, cutting input token cost by ~60%.

| Metric | 1 post/call | 10 posts/call (default) |
|---|---|---|
| API calls for 60K posts | 60,000 | 6,000 |
| Estimated cost | ~$5.10 | ~$2.40 |

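The arithmetic behind batching can be made concrete with a short sketch. The helper names are hypothetical. The 250 tokens per post is an illustrative guess rather than a measured figure; under this simplified model (system prompt plus post text only) it happens to reproduce a 60% drop in input tokens.

```python
from math import ceil
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 10) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def input_tokens(
    n_posts: int,
    batch_size: int,
    prompt_tokens: int = 500,    # system prompt size stated above
    tokens_per_post: int = 250,  # assumption, for illustration only
) -> int:
    """Total input tokens when the system prompt is sent once per API call."""
    calls = ceil(n_posts / batch_size)
    return calls * prompt_tokens + n_posts * tokens_per_post


single = input_tokens(60_000, batch_size=1)    # 60,000 calls
batch10 = input_tokens(60_000, batch_size=10)  # 6,000 calls
saving = 1 - batch10 / single                  # fraction of input tokens saved
```

Only the prompt overhead shrinks with batch size; the post text itself is sent either way, which is why the saving flattens out as batches grow.
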
### Running the Analyzer

```bash
# Test run (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# Full run (all unscored posts)
docker compose exec collector python -m src.analyzer

# Check logs
docker compose logs collector | grep analyzer
cat logs/analyzer.log
```

The scheduled cron runs the analyzer automatically every 4 hours (30 minutes after each collection), so new posts are scored without manual intervention.

## Web Dashboard

Access at `http://localhost:5001` (or your configured `WEB_PORT`).

### Pages

- **Dashboard** — Overview of collection runs, account count, post/mention totals
- **Accounts** — List of tracked accounts with post counts and last activity
- **Statuses** — Browse all collected posts with filters and search
- **Mentions** — Browse mentions of tracked accounts
- **Analysis** — Toxicity overview: trend charts, category breakdown, recent analysis runs
- **Flagged Content** — Posts scoring above the flag threshold (default 0.5), filterable by category and type
- **Account Toxicity** — Per-account toxicity breakdown with comparative charts
- **Export** — Download data as CSV

## Configuration

### accounts.yml

```yaml
accounts:
  - handle: alice.bsky.social
  - handle: bob.bsky.social
  - handle: some-org.bsky.social
```

### Environment Variables

#### Collection

| Variable | Default | Description |
|----------|---------|-------------|
| `POSTGRES_PASSWORD` | `changeme` | Database password |
| `POSTGRES_PORT` | `5432` | Exposed PostgreSQL port |
| `LOG_LEVEL` | `INFO` | Python log level |
| `MAX_PAGES_PER_ACCOUNT` | `50` | Max API pages per account per run (50 pages = 5000 posts) |
| `MENTION_LOOKBACK_HOURS` | `12` | How far back to search mentions on first run |
| `BSKY_HANDLE` | — | Your Bluesky handle (required for mention search) |
| `BSKY_APP_PASSWORD` | — | App password from Settings > App Passwords |

#### Toxicity Analysis

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_API_KEY` | — | LLM API key (required) |
| `ANALYZER_MODEL` | `gpt-4.1-nano` | LLM model for classification |
| `ANALYZER_CONCURRENCY` | `3` | Max concurrent API calls (batches in flight) |
| `ANALYZER_BATCH_SIZE` | `10` | Posts per API call |
| `ANALYZER_LIMIT` | `0` | Max posts to process per run (0 = all) |
| `ANALYZER_FLAG_THRESHOLD` | `0.5` | Score above which a post is flagged |

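One common way to enforce a cap such as `ANALYZER_CONCURRENCY` in an async Python service is an `asyncio.Semaphore`. The sketch below shows that pattern only; `run_batches` and the `classify` callable are hypothetical names, and the project may limit concurrency differently.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


async def run_batches(
    batches: Sequence[List[Any]],
    classify: Callable[[List[Any]], Awaitable[Any]],
    concurrency: int = 3,
) -> List[Any]:
    """Classify all batches with at most `concurrency` calls in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(batch: List[Any]) -> Any:
        async with semaphore:  # waits here while `concurrency` calls are already running
            return await classify(batch)

    # gather() returns results in the same order as the input batches.
    return await asyncio.gather(*(guarded(batch) for batch in batches))
```

With the default of 3 and a batch size of 10, at most 30 posts are being scored at any moment, which keeps request bursts predictable for provider-side rate limits.
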
#### Web UI

| Variable | Default | Description |
|----------|---------|-------------|
| `WEB_PORT` | `5001` | Exposed web dashboard port |

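For orientation, here is a minimal sketch of reading the analyzer variables with the defaults listed in the tables above. The `AnalyzerSettings` class and `load_analyzer_settings` function are hypothetical names, not the project's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AnalyzerSettings:
    api_key: str
    model: str
    concurrency: int
    batch_size: int
    limit: int
    flag_threshold: float


def load_analyzer_settings(env: Optional[Mapping[str, str]] = None) -> AnalyzerSettings:
    """Build settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("LLM_API_KEY", "")
    if not api_key:
        raise RuntimeError("LLM_API_KEY is required")
    return AnalyzerSettings(
        api_key=api_key,
        model=env.get("ANALYZER_MODEL", "gpt-4.1-nano"),
        concurrency=int(env.get("ANALYZER_CONCURRENCY", "3")),
        batch_size=int(env.get("ANALYZER_BATCH_SIZE", "10")),
        limit=int(env.get("ANALYZER_LIMIT", "0")),  # 0 means "process everything"
        flag_threshold=float(env.get("ANALYZER_FLAG_THRESHOLD", "0.5")),
    )
```

Accepting an optional mapping instead of always reading `os.environ` makes the loader easy to exercise in tests without touching the real environment.
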
## Database Schema

### Key Tables

- **`accounts`** — Tracked accounts (DID, handle, collection timestamps)
- **`posts`** — Posts from tracked accounts (text, timestamps, engagement counts, post type, raw JSON)
- **`mentions`** — Posts from anyone that mention a tracked account
- **`collection_runs`** — Audit trail of each collection run (timing, counts, errors)
- **`collection_state`** — Per-account bookmarks for incremental collection
- **`toxicity_scores`** — Per-post scores across all 12 categories + overall + flagged
- **`mention_toxicity_scores`** — Same structure for mentions
- **`analysis_runs`** — Audit trail of analyzer runs (timing, counts, cost, errors)

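The upsert-by-URI deduplication used for `posts` can be illustrated on a cut-down table. This sketch uses Python's built-in `sqlite3` as a stand-in so it runs without a database server; PostgreSQL accepts the same `ON CONFLICT (uri) DO UPDATE` form. The three-column table and the `upsert_post` helper are simplifications for illustration, not the real schema or collector code.

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO posts (uri, text, like_count)
VALUES (?, ?, ?)
ON CONFLICT(uri) DO UPDATE SET like_count = excluded.like_count
"""


def upsert_post(conn: sqlite3.Connection, uri: str, text: str, like_count: int) -> None:
    """Insert a post, or refresh its engagement count if the URI is already stored."""
    conn.execute(UPSERT_SQL, (uri, text, like_count))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (uri TEXT PRIMARY KEY, text TEXT NOT NULL, like_count INTEGER NOT NULL)"
)

uri = "at://did:plc:example/app.bsky.feed.post/1"  # made-up URI
upsert_post(conn, uri, "hello", 2)  # first encounter: row is inserted
upsert_post(conn, uri, "hello", 7)  # re-encounter: only like_count is refreshed
rows = conn.execute("SELECT uri, like_count FROM posts").fetchall()
```

Because the URI is the primary key, re-collecting an overlapping window never creates duplicate rows; it only updates the counters.
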
### Useful Queries

```sql
-- Recent posts by a specific account
SELECT created_at, post_type, text, like_count
FROM posts
WHERE author_did = (SELECT did FROM accounts WHERE handle = 'alice.bsky.social')
ORDER BY created_at DESC LIMIT 20;

-- All mentions of a tracked account
SELECT m.post_text, m.post_created_at, m.mentioning_did
FROM mentions m
JOIN accounts a ON a.did = m.mentioned_did
WHERE a.handle = 'alice.bsky.social'
ORDER BY m.post_created_at DESC;

-- Most toxic posts (overall score)
SELECT p.text, t.overall, t.toxic, t.threat, t.hate_speech, t.racism
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
WHERE t.flagged = true
ORDER BY t.overall DESC LIMIT 20;

-- Toxicity by account
SELECT a.handle, avg(t.overall) AS avg_toxicity, count(*) AS scored_posts
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
JOIN accounts a ON a.did = p.author_did
GROUP BY a.handle
ORDER BY avg_toxicity DESC;

-- Analysis run history
SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
FROM analysis_runs ORDER BY started_at DESC LIMIT 10;

-- Collection run history
SELECT id, started_at, status, posts_collected, mentions_collected, duration_secs
FROM collection_runs ORDER BY started_at DESC LIMIT 10;
```

## Operations

### Manual Runs

```bash
# Collect posts
docker compose exec collector python -m src

# Run toxicity analysis
docker compose exec collector python -m src.analyzer
```

### Monitoring

```bash
# Follow logs
docker compose logs -f collector

# Quick data counts
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT (SELECT count(*) FROM posts) AS posts, (SELECT count(*) FROM mentions) AS mentions, (SELECT count(*) FROM toxicity_scores) AS scored;"

# Last 5 analysis runs
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd FROM analysis_runs ORDER BY started_at DESC LIMIT 5;"
```

### Rebuilding After Code Changes

```bash
docker compose build collector web
docker compose up -d
```

### Add/Remove Accounts

Edit `config/accounts.yml` — changes take effect on the next scheduled or manual run. Removed accounts are marked inactive but their data is preserved.

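### First Run / Backfill

The first run pages back up to `MAX_PAGES_PER_ACCOUNT` pages (default 50 pages, roughly 5,000 posts). For a deeper backfill, temporarily raise this value for a single run. Pass it with `-e` so the variable is set inside the container, since a variable set only in the host shell is not forwarded by `docker compose exec`:

```bash
docker compose exec -e MAX_PAGES_PER_ACCOUNT=200 collector python -m src
```
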
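The page cap can be pictured as a bounded cursor loop. The sketch below assumes a hypothetical `fetch_page(cursor)` callable that returns one page of posts plus the next cursor (or `None` when the feed is exhausted); it illustrates the idea and is not the collector's actual paging code.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Tuple[List[Dict], Optional[str]]


def collect_pages(
    fetch_page: Callable[[Optional[str]], Page],
    max_pages: int = 50,
) -> List[Dict]:
    """Follow cursors until the feed ends or max_pages pages have been fetched."""
    posts: List[Dict] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        items, cursor = fetch_page(cursor)
        posts.extend(items)
        if cursor is None:  # no further pages available
            break
    return posts
```

Raising the cap only matters for accounts whose history is longer than the default window; shorter feeds stop early when the cursor runs out.
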
### Backup

The `pgdata` volume persists across container restarts. Back it up with standard PostgreSQL tools:

```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
```

## License

MIT License

Copyright (c) 2026 Post X Society

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.