# Bluesky Account Monitor

Collects posts, replies, and mentions for a list of Bluesky accounts, runs AI-powered toxicity analysis across 12 categories, and presents results on a web dashboard. Everything runs in Docker.
## Architecture

```
                 ┌──────────────────────────────────────────┐
                 │              Docker Compose              │
                 │                                          │
accounts.yml ───▶│ collector ──▶ PostgreSQL ◀── web (Flask) │
                 │     │             ▲                      │
                 │     ▼             │                      │
                 │  analyzer ────────┘                      │
                 │     │                                    │
                 │     ▼                                    │
                 │  OpenAI API                              │
                 │                                          │
                 │  scheduler (Ofelia) ── cron triggers     │
                 └──────────────────────────────────────────┘
```
Four services:
- `db` — PostgreSQL 16 (Alpine), stores all data
- `collector` — Python async service that fetches posts and mentions from Bluesky
- `scheduler` — Ofelia cron that triggers collection (every 4 h) and analysis (every 4 h, offset by 30 min)
- `web` — Flask + Gunicorn dashboard on port 5001
## Quick Start

```bash
# 1. Copy and edit your environment config
cp .env.example .env
#    Fill in: BSKY_HANDLE, BSKY_APP_PASSWORD, OPENAI_API_KEY

# 2. Add your target accounts to config/accounts.yml

# 3. Start everything
docker compose up -d

# 4. Run the toxicity schema migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/02-toxicity.sql

# 5. Trigger an immediate first collection
docker compose exec collector python -m src

# 6. Run a test toxicity analysis (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# 7. Open the dashboard
open http://localhost:5001
```
## Collection

### What Gets Collected

| Source | API Endpoint | Stored In |
|---|---|---|
| User's own posts & replies | `getAuthorFeed` (public) | `posts` table |
| Posts mentioning a user | `searchPosts` (requires auth) | `mentions` table |

All records include a `raw_json` JSONB column with the full API response for future-proof analysis.
### How It Works

- **Scheduled polling via Ofelia** — runs every 4 hours by default
- **Incremental collection** — only fetches posts newer than the last run
- **Rate-limit aware** — reads API response headers and sleeps when approaching limits
- **Deduplication** — posts are upserted by URI; engagement counts are refreshed on re-encounters
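The deduplication bullet can be sketched in Python, with in-memory SQLite standing in for PostgreSQL (both accept `INSERT ... ON CONFLICT DO UPDATE`; the table shape here is illustrative, not the project's actual schema):

```python
import sqlite3

# In-memory stand-in for the posts table; the upsert syntax below is valid
# in both SQLite and PostgreSQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (uri TEXT PRIMARY KEY, text TEXT, like_count INTEGER)")

def upsert_post(uri: str, text: str, like_count: int) -> None:
    """Insert a post, or refresh its engagement counts if the URI is already known."""
    db.execute(
        """INSERT INTO posts (uri, text, like_count) VALUES (?, ?, ?)
           ON CONFLICT (uri) DO UPDATE SET like_count = excluded.like_count""",
        (uri, text, like_count),
    )

upsert_post("at://did:plc:abc/app.bsky.feed.post/1", "hello", 2)
upsert_post("at://did:plc:abc/app.bsky.feed.post/1", "hello", 7)  # re-encounter

row = db.execute("SELECT count(*), max(like_count) FROM posts").fetchone()
print(row)  # → (1, 7): one row, refreshed like count
```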
## Toxicity Analysis

The analyzer classifies every post and mention using OpenAI's GPT-4.1-nano, scoring content on 12 categories from 0.0 (absent) to 1.0 (extreme):
| Category | What it detects |
|---|---|
| `toxic` | Rude, disrespectful, or aggressive language |
| `threat` | Violence, harm, intimidation, calls to action |
| `hate_speech` | Targeting any protected characteristic |
| `racism` | Race/ethnicity-based hostility |
| `antisemitism` | Anti-Jewish hate, conspiracy theories, coded language |
| `islamophobia` | Anti-Muslim hate, "omvolking" narratives |
| `sexism` | Gender-based discrimination or harassment |
| `homophobia` | Anti-LGBTQ+ rhetoric |
| `insult` | Personal attacks, name-calling |
| `dehumanization` | Comparing people to animals, vermin, disease |
| `extremism` | Far-right/left rhetoric, Nazi glorification, Great Replacement |
| `ableism` | Disability-targeting language, mental health slurs |
The prompt is tuned for Dutch political discourse, recognizing coded terms like "gelukszoekers", "kutmarokkanen", "landverrader", "linkse ratten", etc. Political disagreement and criticism are not scored as toxic — only genuine hostility, hate, and threats.
### Batch Processing
Posts are sent to the API in batches (default 10 per call) to minimize cost and API calls. The ~500-token system prompt is sent once per batch instead of once per post, cutting input token cost by ~60%.
| | 1 post/call | 10 posts/call (default) |
|---|---|---|
| API calls for 60K posts | 60,000 | 6,000 |
| Estimated cost | ~$5.10 | ~$2.40 |
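The call-count arithmetic behind the table is just chunking. A sketch using a hypothetical `batches` helper rather than the project's actual code:

```python
from itertools import islice

def batches(items, size=10):
    """Yield chunks of `size` posts, mirroring the default ANALYZER_BATCH_SIZE."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

posts = range(60_000)  # stand-ins for unscored posts
calls = sum(1 for _ in batches(posts, 10))
print(calls)  # → 6000: one system prompt per call instead of one per post
```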
### Running the Analyzer

```bash
# Test run (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer

# Full run (all unscored posts)
docker compose exec collector python -m src.analyzer

# Check logs
docker compose logs collector | grep analyzer
cat logs/analyzer.log
```
The scheduled cron runs the analyzer automatically every 4 hours (30 minutes after each collection), so new posts are scored without manual intervention.
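In docker-compose terms, scheduling like this is typically expressed as Ofelia `job-exec` labels on the collector container. A hedged sketch (the label names follow Ofelia's documented convention, but the exact values are illustrative, not copied from this repo's docker-compose.yml):

```yaml
services:
  collector:
    labels:
      ofelia.enabled: "true"
      ofelia.job-exec.collect.schedule: "0 0 */4 * * *"   # every 4 hours
      ofelia.job-exec.collect.command: "python -m src"
      ofelia.job-exec.analyze.schedule: "0 30 */4 * * *"  # 30 min after each collection
      ofelia.job-exec.analyze.command: "python -m src.analyzer"
```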
## Web Dashboard

Access at http://localhost:5001 (or your configured `WEB_PORT`).

### Pages

- **Dashboard** — Overview of collection runs, account count, post/mention totals
- **Accounts** — List of tracked accounts with post counts and last activity
- **Statuses** — Browse all collected posts with filters and search
- **Mentions** — Browse mentions of tracked accounts
- **Analysis** — Toxicity overview: trend charts, category breakdown, recent analysis runs
- **Flagged Content** — Posts scoring above the flag threshold (default 0.5), filterable by category and type
- **Account Toxicity** — Per-account toxicity breakdown with comparative charts
- **Export** — Download data as CSV
## Configuration

### accounts.yml

```yaml
accounts:
  - handle: alice.bsky.social
  - handle: bob.bsky.social
  - handle: some-org.bsky.social
```
### Environment Variables

#### Collection

| Variable | Default | Description |
|---|---|---|
| `POSTGRES_PASSWORD` | `changeme` | Database password |
| `POSTGRES_PORT` | `5432` | Exposed PostgreSQL port |
| `LOG_LEVEL` | `INFO` | Python log level |
| `MAX_PAGES_PER_ACCOUNT` | `50` | Max API pages per account per run (50 pages = 5,000 posts) |
| `MENTION_LOOKBACK_HOURS` | `12` | How far back to search mentions on first run |
| `BSKY_HANDLE` | — | Your Bluesky handle (required for mention search) |
| `BSKY_APP_PASSWORD` | — | App password from Settings > App Passwords |
#### Toxicity Analysis

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | — | OpenAI API key (required) |
| `ANALYZER_MODEL` | `gpt-4.1-nano` | OpenAI model for classification |
| `ANALYZER_CONCURRENCY` | `3` | Max concurrent API calls (batches in flight) |
| `ANALYZER_BATCH_SIZE` | `10` | Posts per API call |
| `ANALYZER_LIMIT` | `0` | Max posts to process per run (0 = all) |
| `ANALYZER_FLAG_THRESHOLD` | `0.5` | Score above which a post is flagged |
#### Web UI

| Variable | Default | Description |
|---|---|---|
| `WEB_PORT` | `5001` | Exposed web dashboard port |
## Database Schema

### Key Tables

- `accounts` — Tracked accounts (DID, handle, collection timestamps)
- `posts` — Posts from tracked accounts (text, timestamps, engagement counts, post type, raw JSON)
- `mentions` — Posts from anyone that mention a tracked account
- `collection_runs` — Audit trail of each collection run (timing, counts, errors)
- `collection_state` — Per-account bookmarks for incremental collection
- `toxicity_scores` — Per-post scores across all 12 categories, plus overall and flagged
- `mention_toxicity_scores` — Same structure, for mentions
- `analysis_runs` — Audit trail of analyzer runs (timing, counts, cost, errors)
### Useful Queries

```sql
-- Recent posts by a specific account
SELECT created_at, post_type, text, like_count
FROM posts
WHERE author_did = (SELECT did FROM accounts WHERE handle = 'alice.bsky.social')
ORDER BY created_at DESC LIMIT 20;

-- All mentions of a tracked account
SELECT m.post_text, m.post_created_at, m.mentioning_did
FROM mentions m
JOIN accounts a ON a.did = m.mentioned_did
WHERE a.handle = 'alice.bsky.social'
ORDER BY m.post_created_at DESC;

-- Most toxic posts (overall score)
SELECT p.text, t.overall, t.toxic, t.threat, t.hate_speech, t.racism
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
WHERE t.flagged = true
ORDER BY t.overall DESC LIMIT 20;

-- Toxicity by account
SELECT a.handle, avg(t.overall) AS avg_toxicity, count(*) AS scored_posts
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
JOIN accounts a ON a.did = p.author_did
GROUP BY a.handle
ORDER BY avg_toxicity DESC;

-- Analysis run history
SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
FROM analysis_runs ORDER BY started_at DESC LIMIT 10;

-- Collection run history
SELECT id, started_at, status, posts_collected, mentions_collected, duration_secs
FROM collection_runs ORDER BY started_at DESC LIMIT 10;
```
## Operations

### Manual Runs

```bash
# Collect posts
docker compose exec collector python -m src

# Run toxicity analysis
docker compose exec collector python -m src.analyzer
```
### Monitoring

```bash
# Follow logs
docker compose logs -f collector

# Quick data counts
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT (SELECT count(*) FROM posts) AS posts, (SELECT count(*) FROM mentions) AS mentions, (SELECT count(*) FROM toxicity_scores) AS scored;"

# Last analysis runs
docker compose exec -T db psql -U bluesky -d bluesky -c \
  "SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd FROM analysis_runs ORDER BY started_at DESC LIMIT 5;"
```
### Rebuilding After Code Changes

```bash
docker compose build collector web
docker compose up -d
```
### Add/Remove Accounts

Edit `config/accounts.yml` — changes take effect on the next scheduled or manual run. Removed accounts are marked inactive, but their data is preserved.
### First Run / Backfill

The first run pages back up to `MAX_PAGES_PER_ACCOUNT` pages (default 50, i.e. up to 5,000 posts). For a deeper backfill, temporarily increase this value (passed into the container with `-e`, as in the other commands):

```bash
docker compose exec -e MAX_PAGES_PER_ACCOUNT=200 collector python -m src
```
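Since the defaults imply roughly 100 posts per page (50 pages = 5,000 posts), the backfill ceiling is just pages times posts-per-page:

```python
POSTS_PER_PAGE = 100  # implied by the defaults above: 50 pages = 5,000 posts

def backfill_ceiling(max_pages: int) -> int:
    """Upper bound on posts fetched per account in one run."""
    return max_pages * POSTS_PER_PAGE

print(backfill_ceiling(50))   # default MAX_PAGES_PER_ACCOUNT
print(backfill_ceiling(200))  # the deeper backfill shown above
```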
### Backup

The `pgdata` volume persists across container restarts. Back it up with standard PostgreSQL tools:

```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
```