# Bluesky Account Monitor
Collects posts, replies, and mentions for a list of Bluesky accounts, runs AI-powered toxicity analysis across 12 categories, and presents results on a web dashboard. Everything runs in Docker.
## Architecture
```
                 ┌─────────────────────────────────────────────┐
                 │               Docker Compose                │
                 │                                             │
accounts.yml ───▶│  collector ──▶ PostgreSQL ◀── web (Flask)   │
                 │                    ▲                        │
                 │                    │                        │
                 │  analyzer ─────────┘                        │
                 │      │                                      │
                 │      ▼                                      │
                 │   LLM API                                   │
                 │                                             │
                 │  scheduler (Ofelia) ── cron triggers        │
                 └─────────────────────────────────────────────┘
```
Four services:
- **db** — PostgreSQL 16 (Alpine), stores all data
- **collector** — Python async service that fetches posts and mentions from Bluesky
- **scheduler** — [Ofelia](https://github.com/mcuadros/ofelia) cron that triggers collection (every 4h) and analysis (every 4h + 30min offset)
- **web** — Flask + Gunicorn dashboard on port 5001
## Quick Start
```bash
# 1. Copy and edit your environment config
cp .env.example .env
# Fill in: BSKY_HANDLE, BSKY_APP_PASSWORD, LLM_API_KEY
# 2. Add your target accounts to config/accounts.yml
# 3. Start everything
docker compose up -d
# 4. Run the toxicity schema migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/02-toxicity.sql
# 5. Trigger an immediate first collection
docker compose exec collector python -m src
# 6. Run a test toxicity analysis (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer
# 7. Open the dashboard
open http://localhost:5001
```
## Collection
### What Gets Collected
| Source | API Endpoint | Stored In |
|--------|-------------|-----------|
| User's own posts & replies | `getAuthorFeed` (public) | `posts` table |
| Posts mentioning a user | `searchPosts` (requires auth) | `mentions` table |
All records include a `raw_json` JSONB column with the full API response for future-proof analysis.
### How It Works
- **Scheduled polling** via Ofelia — runs every 4 hours by default
- **Incremental collection** — only fetches posts newer than the last run
- **Rate limit aware** — reads API response headers and sleeps when approaching limits
- **Deduplication** — posts are upserted by URI; engagement counts are refreshed on re-encounters
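The upsert-by-URI behavior can be sketched as a pure merge (field names are illustrative; the real service writes to PostgreSQL, where the same effect is typically achieved with `ON CONFLICT`):

```python
def merge_posts(existing: dict, fetched: list[dict]) -> dict:
    """Upsert fetched posts into a URI-keyed store.

    New URIs are inserted whole; already-seen URIs keep their original
    row but refresh the engagement counters, mirroring the collector's
    dedup-plus-refresh behavior. Field names here are illustrative.
    """
    for post in fetched:
        uri = post["uri"]
        if uri in existing:
            # Re-encountered post: refresh engagement counts only.
            for field in ("like_count", "repost_count", "reply_count"):
                if field in post:
                    existing[uri][field] = post[field]
        else:
            existing[uri] = dict(post)
    return existing
```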
## Toxicity Analysis
The analyzer classifies every post and mention using an LLM, scoring content on 12 categories from 0.0 (absent) to 1.0 (extreme):
| Category | What it detects |
|----------|----------------|
| `toxic` | Rude, disrespectful, or aggressive language |
| `threat` | Violence, harm, intimidation, calls to action |
| `hate_speech` | Targeting any protected characteristic |
| `racism` | Race/ethnicity-based hostility |
| `antisemitism` | Anti-Jewish hate, conspiracy theories, coded language |
| `islamophobia` | Anti-Muslim hate, "omvolking" narratives |
| `sexism` | Gender-based discrimination or harassment |
| `homophobia` | Anti-LGBTQ+ rhetoric |
| `insult` | Personal attacks, name-calling |
| `dehumanization` | Comparing people to animals, vermin, disease |
| `extremism` | Far-right/left rhetoric, Nazi glorification, Great Replacement |
| `ableism` | Disability-targeting language, mental health slurs |
The prompt is tuned for Dutch political discourse, recognizing coded terms like "gelukszoekers", "kutmarokkanen", "landverrader", "linkse ratten", etc. Political disagreement and criticism are not scored as toxic — only genuine hostility, hate, and threats.
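A sketch of how the analyzer's raw LLM output might be normalized, assuming the model returns one JSON object of category scores per post (category names come from the table above; the clamping, `overall`-as-max, and flag logic are assumptions, not the analyzer's confirmed implementation):

```python
CATEGORIES = [
    "toxic", "threat", "hate_speech", "racism", "antisemitism",
    "islamophobia", "sexism", "homophobia", "insult",
    "dehumanization", "extremism", "ableism",
]

def normalize_scores(raw: dict, flag_threshold: float = 0.5) -> dict:
    """Clamp each category score to [0.0, 1.0] and derive overall/flagged.

    Missing categories default to 0.0. Taking 'overall' as the max
    category score is an assumption; the real analyzer may aggregate
    differently.
    """
    scores = {c: min(1.0, max(0.0, float(raw.get(c, 0.0)))) for c in CATEGORIES}
    scores["overall"] = max(scores[c] for c in CATEGORIES)
    scores["flagged"] = scores["overall"] >= flag_threshold
    return scores
```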
### Batch Processing
Posts are sent to the API in batches (default 10 per call) to minimize cost and API calls. The ~500-token system prompt is sent once per batch instead of once per post, cutting input token cost by ~60%.
| | 1 post/call | 10 posts/call (default) |
|---|---|---|
| API calls for 60K posts | 60,000 | 6,000 |
| Estimated cost | ~$5.10 | ~$2.40 |
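The batching itself is plain chunking; a minimal sketch, with the batch size matching the `ANALYZER_BATCH_SIZE` default:

```python
def batch(posts: list, size: int = 10) -> list[list]:
    """Split posts into fixed-size batches (the last may be shorter).

    Sending 10 posts per API call amortizes the ~500-token system
    prompt across the whole batch instead of repeating it per post.
    """
    return [posts[i:i + size] for i in range(0, len(posts), size)]
```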
### Running the Analyzer
```bash
# Test run (100 posts)
docker compose exec -e ANALYZER_LIMIT=100 collector python -m src.analyzer
# Full run (all unscored posts)
docker compose exec collector python -m src.analyzer
# Check logs
docker compose logs collector | grep analyzer
cat logs/analyzer.log
```
The scheduled cron runs the analyzer automatically every 4 hours (30 minutes after each collection), so new posts are scored without manual intervention.
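The 4-hour cadence with a 30-minute offset can be expressed as two cron jobs. A hypothetical sketch using Ofelia's `job-exec` Docker labels (the actual label names and schedules live in this repo's compose file; Ofelia's cron syntax is six-field, seconds first):

```yaml
services:
  collector:
    labels:
      ofelia.enabled: "true"
      # Collect on the hour, every 4 hours
      ofelia.job-exec.collect.schedule: "0 0 */4 * * *"
      ofelia.job-exec.collect.command: "python -m src"
      # Analyze 30 minutes after each collection
      ofelia.job-exec.analyze.schedule: "0 30 */4 * * *"
      ofelia.job-exec.analyze.command: "python -m src.analyzer"
```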
## Web Dashboard
Access at `http://localhost:5001` (or your configured `WEB_PORT`).
### Pages
- **Dashboard** — Overview of collection runs, account count, post/mention totals
- **Accounts** — List of tracked accounts with post counts and last activity
- **Statuses** — Browse all collected posts with filters and search
- **Mentions** — Browse mentions of tracked accounts
- **Analysis** — Toxicity overview: trend charts, category breakdown, recent analysis runs
- **Flagged Content** — Posts scoring above the flag threshold (default 0.5), filterable by category and type
- **Account Toxicity** — Per-account toxicity breakdown with comparative charts
- **Export** — Download data as CSV
## Configuration
### accounts.yml
```yaml
accounts:
  - handle: alice.bsky.social
  - handle: bob.bsky.social
  - handle: some-org.bsky.social
```
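Loading this file is straightforward with PyYAML (a sketch; the collector's actual loader and function name may differ):

```python
import yaml  # PyYAML

def load_handles(path: str = "config/accounts.yml") -> list[str]:
    """Return the list of handles from accounts.yml."""
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    return [entry["handle"] for entry in config.get("accounts", [])]
```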
### Environment Variables
#### Collection
| Variable | Default | Description |
|----------|---------|-------------|
| `POSTGRES_PASSWORD` | `changeme` | Database password |
| `POSTGRES_PORT` | `5432` | Exposed PostgreSQL port |
| `LOG_LEVEL` | `INFO` | Python log level |
| `MAX_PAGES_PER_ACCOUNT` | `50` | Max API pages per account per run (50 pages = 5000 posts) |
| `MENTION_LOOKBACK_HOURS` | `12` | How far back to search mentions on first run |
| `BSKY_HANDLE` | — | Your Bluesky handle (required for mention search) |
| `BSKY_APP_PASSWORD` | — | App password from Settings > App Passwords |
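Reading these settings with their documented fallbacks can be sketched with a small helper (variable names and defaults come from the table; the helper itself is illustrative):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to default."""
    value = os.environ.get(name, "").strip()
    return int(value) if value else default

max_pages = env_int("MAX_PAGES_PER_ACCOUNT", 50)
lookback_hours = env_int("MENTION_LOOKBACK_HOURS", 12)
```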
#### Toxicity Analysis
| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_API_KEY` | — | LLM API key (required) |
| `ANALYZER_MODEL` | `gpt-4.1-nano` | LLM model for classification |
| `ANALYZER_CONCURRENCY` | `3` | Max concurrent API calls (batches in flight) |
| `ANALYZER_BATCH_SIZE` | `10` | Posts per API call |
| `ANALYZER_LIMIT` | `0` | Max posts to process per run (0 = all) |
| `ANALYZER_FLAG_THRESHOLD` | `0.5` | Score above which a post is flagged |
#### Web UI
| Variable | Default | Description |
|----------|---------|-------------|
| `WEB_PORT` | `5001` | Exposed web dashboard port |
## Database Schema
### Key Tables
- **`accounts`** — Tracked accounts (DID, handle, collection timestamps)
- **`posts`** — Posts from tracked accounts (text, timestamps, engagement counts, post type, raw JSON)
- **`mentions`** — Posts from anyone that mention a tracked account
- **`collection_runs`** — Audit trail of each collection run (timing, counts, errors)
- **`collection_state`** — Per-account bookmarks for incremental collection
- **`toxicity_scores`** — Per-post scores across all 12 categories + overall + flagged
- **`mention_toxicity_scores`** — Same structure for mentions
- **`analysis_runs`** — Audit trail of analyzer runs (timing, counts, cost, errors)
### Useful Queries
```sql
-- Recent posts by a specific account
SELECT created_at, post_type, text, like_count
FROM posts
WHERE author_did = (SELECT did FROM accounts WHERE handle = 'alice.bsky.social')
ORDER BY created_at DESC LIMIT 20;
-- All mentions of a tracked account
SELECT m.post_text, m.post_created_at, m.mentioning_did
FROM mentions m
JOIN accounts a ON a.did = m.mentioned_did
WHERE a.handle = 'alice.bsky.social'
ORDER BY m.post_created_at DESC;
-- Most toxic posts (overall score)
SELECT p.text, t.overall, t.toxic, t.threat, t.hate_speech, t.racism
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
WHERE t.flagged = true
ORDER BY t.overall DESC LIMIT 20;
-- Toxicity by account
SELECT a.handle, avg(t.overall) AS avg_toxicity, count(*) AS scored_posts
FROM toxicity_scores t
JOIN posts p ON p.uri = t.post_uri
JOIN accounts a ON a.did = p.author_did
GROUP BY a.handle
ORDER BY avg_toxicity DESC;
-- Analysis run history
SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd
FROM analysis_runs ORDER BY started_at DESC LIMIT 10;
-- Collection run history
SELECT id, started_at, status, posts_collected, mentions_collected, duration_secs
FROM collection_runs ORDER BY started_at DESC LIMIT 10;
```
## Operations
### Manual Runs
```bash
# Collect posts
docker compose exec collector python -m src
# Run toxicity analysis
docker compose exec collector python -m src.analyzer
```
### Monitoring
```bash
# Follow logs
docker compose logs -f collector
# Quick data counts
docker compose exec -T db psql -U bluesky -d bluesky -c \
"SELECT (SELECT count(*) FROM posts) AS posts, (SELECT count(*) FROM mentions) AS mentions, (SELECT count(*) FROM toxicity_scores) AS scored;"
# Last analysis run
docker compose exec -T db psql -U bluesky -d bluesky -c \
"SELECT id, started_at, status, posts_scored, mentions_scored, cost_usd FROM analysis_runs ORDER BY started_at DESC LIMIT 5;"
```
### Rebuilding After Code Changes
```bash
docker compose build collector web
docker compose up -d
```
### Add/Remove Accounts
Edit `config/accounts.yml` — changes take effect on the next scheduled or manual run. Removed accounts are marked inactive but their data is preserved.
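The add/remove semantics amount to a set diff between the config file and the database (a sketch; the real implementation flips an active flag on the `accounts` table rather than deleting rows):

```python
def plan_account_changes(config_handles: list[str],
                         db_handles: list[str]) -> tuple[set, set]:
    """Return (to_add, to_deactivate).

    Handles present in config but not the DB are added; handles in the
    DB but dropped from config are deactivated, never deleted, so their
    collected posts and scores are preserved.
    """
    config_set, db_set = set(config_handles), set(db_handles)
    return config_set - db_set, db_set - config_set
```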
### First Run / Backfill
The first run pages back up to `MAX_PAGES_PER_ACCOUNT` pages per account (default 50, i.e. up to 5,000 posts). For a deeper backfill, temporarily increase this value:
```bash
docker compose exec -e MAX_PAGES_PER_ACCOUNT=200 collector python -m src
```
### Backup
The `pgdata` volume persists across container restarts. Back it up with standard PostgreSQL tools:
```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
```
## License
MIT License
Copyright (c) 2026 Post X Society
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.