# Mastodon Collector

Mastodon Collector gathers posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. It also provides automated toxicity analysis using OpenAI GPT-4o-mini; a web UI for account management, data browsing, and manual review of flagged content; and JSON/CSV APIs for your analysis pipeline.

## Quick Start

```bash
# 1. Add accounts to monitor
echo "@user@mastodon.social" >> accounts.txt

# 2. Start everything
docker compose up -d

# 3. Open the dashboard (`open` is macOS; on Linux use `xdg-open`, or paste the URL into a browser)
open http://localhost:8585
```

## Architecture

| Service       | Description                                    | Port  |
|---------------|------------------------------------------------|-------|
| **db**        | PostgreSQL 16                                  | 5432  |
| **web**       | Flask dashboard (Gunicorn)                     | 8585  |
| **collector** | Background service, polls every 4 hours        | —     |
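
As a rough illustration of what "polls every 4 hours" means for the collector service, here is a minimal sketch of a fixed-interval polling loop. It is not the project's actual code: `run_poll_loop`, `collect_once`, and `max_cycles` are hypothetical names, and the choice to log a failed cycle and carry on is an assumption.

```python
import time


def run_poll_loop(collect_once, interval_seconds=14400, max_cycles=None, sleep=time.sleep):
    """Call collect_once() repeatedly, waiting interval_seconds between cycle starts.

    max_cycles and sleep are injectable so the loop can be exercised quickly;
    a long-running service would leave max_cycles as None.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        started = time.monotonic()
        try:
            collect_once()
        except Exception as exc:  # assumption: one failed cycle should not stop the service
            print(f"collection cycle failed: {exc}")
        cycles += 1
        if max_cycles is not None and cycles >= max_cycles:
            break
        # Subtract the time the cycle took so starts stay roughly interval_seconds apart.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return cycles
```

Measuring elapsed time with `time.monotonic()` keeps the schedule from drifting when a collection run itself takes several minutes.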

## Adding Accounts

Two methods:

1. **Text file** — edit `accounts.txt`, one handle per line (`@user@instance.social`). Picked up on next collection cycle.
2. **Web UI** — go to http://localhost:8585/accounts and use the form.
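
To make the expected `accounts.txt` format concrete, the following sketch parses one handle per line in the `@user@instance.social` form. It is an illustration only: `parse_handles` and `HANDLE_RE` are hypothetical names, and skipping blank lines, `#` comments, duplicates, and malformed entries is an assumption about how such a file could be handled, not a description of the collector's real parser.

```python
import re

# Hypothetical pattern for handles such as @user@instance.social
HANDLE_RE = re.compile(r"^@?([A-Za-z0-9_]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})$")


def parse_handles(text):
    """Return unique (user, instance) pairs in file order."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        match = HANDLE_RE.match(line)
        if match is None:
            continue  # assumption: malformed lines are skipped, not fatal
        pair = (match.group(1), match.group(2).lower())
        if pair not in pairs:
            pairs.append(pair)
    return pairs
```

For example, `parse_handles(open("accounts.txt").read())` would yield a list of `(user, instance)` tuples for the file created in Quick Start.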

## Configuration

Edit `.env` to customize:

```
POSTGRES_PASSWORD=collector_secret      # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400             # Default: 4 hours (14400s)
OPENAI_API_KEY=sk-...                   # Required for toxicity analysis
```
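
For readers wiring these variables into their own scripts, this sketch shows one way to read them with the defaults listed above. `load_settings` and the returned keys are hypothetical, and treating a missing `OPENAI_API_KEY` as "analysis disabled" is an assumption rather than documented behavior.

```python
import os


def load_settings(env=None):
    """Read collector settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    interval = int(env.get("POLL_INTERVAL_SECONDS", "14400"))  # README default: 4 hours
    if interval <= 0:
        raise ValueError("POLL_INTERVAL_SECONDS must be a positive integer")
    api_key = env.get("OPENAI_API_KEY") or None
    return {
        "poll_interval_seconds": interval,
        "openai_api_key": api_key,
        # Assumption: without a key, toxicity analysis is simply skipped.
        "analysis_enabled": api_key is not None,
    }
```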

## Toxicity Analysis

The system combines automated toxicity detection with a manual review workflow.

### Features

- **Automated Classification**: Uses OpenAI GPT-4o-mini to analyze posts across 12 toxicity dimensions:
  - General toxicity, threats, hate speech
  - Racism, antisemitism, islamophobia
  - Sexism, homophobia, ableism
  - Insults, dehumanization, extremism
- **Flagging System**: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
- **Manual Review Interface**: Web dashboard at `/analysis/flagged` for human validation
- **Analysis Dashboard**: Statistics, trends, and category breakdowns at `/analysis`
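
The flagging rule in the list above can be sketched as a simple threshold check. The 0.5 cutoff comes from this README; the `scores` dictionary layout (an `"overall"` key plus per-category keys) and the `flag_for_review` name are hypothetical and may not match the real `toxicity_scores` columns.

```python
FLAG_THRESHOLD = 0.5  # from this README: overall toxicity >= 0.5 is flagged


def flag_for_review(scores):
    """Return (flagged, top_category) for one post.

    `scores` is a hypothetical dict holding an "overall" value plus
    per-category values, each between 0 and 1.
    """
    overall = scores["overall"]
    categories = {name: value for name, value in scores.items() if name != "overall"}
    # The highest-scoring category is a convenient hint for the human reviewer.
    top_category = max(categories, key=categories.get) if categories else None
    return overall >= FLAG_THRESHOLD, top_category
```

Note that the comparison is inclusive, so a post scoring exactly 0.5 is sent to review.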

### Running Analysis

```bash
# Analyze all unscored statuses (runs inside the collector container;
# the container name assumes the default Compose project name)
docker exec mastodon-collector-collector-1 python -m app.analyzer

# Limit to the first 100 statuses for testing
docker exec -e ANALYZER_LIMIT=100 mastodon-collector-collector-1 python -m app.analyzer
```

### Analysis Database Schema

Additional tables for toxicity analysis:

- `toxicity_scores` — toxicity scores per status (12 categories + overall)
- `analysis_runs` — audit trail of analysis runs with costs and duration

### Cost Estimation

- Batch processing: ~10 posts per API call
- Estimated cost: ~$0.12 per 1,000 posts analyzed
- Example: 16,906 posts ≈ $1.95 (about $0.115 per 1,000, which rounds to the estimate above)
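
To budget a run before starting it, the figures above can be turned into a small estimator. This is a back-of-the-envelope sketch: `estimate_analysis` is a hypothetical helper, and it uses the rounded $0.12 rate, so its result can differ by a few cents from a measured total.

```python
import math

POSTS_PER_CALL = 10         # approximate batch size from this README
COST_PER_1000_POSTS = 0.12  # approximate USD rate from this README


def estimate_analysis(n_posts):
    """Return (api_calls, estimated_cost_usd) for analyzing n_posts."""
    if n_posts < 0:
        raise ValueError("n_posts must be non-negative")
    api_calls = math.ceil(n_posts / POSTS_PER_CALL)
    cost = n_posts / 1000 * COST_PER_1000_POSTS
    return api_calls, round(cost, 2)
```

For instance, `estimate_analysis(1000)` gives 100 API calls at an estimated $0.12.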

## API Endpoints

Use these endpoints to feed your own analysis pipeline:

| Endpoint              | Description                          |
|-----------------------|--------------------------------------|
| `GET /api/stats`      | Overview stats (counts by type)      |
| `GET /api/statuses`   | Paginated statuses as JSON           |
| `GET /export`         | Download all statuses as CSV         |

### `/api/statuses` parameters

- `page` — page number (default: 1)
- `per_page` — results per page (default: 100, max: 500)
- `account_id` — filter by internal account ID
- `type` — filter by status type: `post`, `reply`, `mention`, `reblog`
- `since` — ISO datetime, only return statuses after this time
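
As an example of combining these parameters, the sketch below builds a request URL for `/api/statuses` using only the Python standard library. `statuses_url` and `BASE_URL` are hypothetical names, the client-side validation mirrors the limits listed above, and the shape of the JSON response is not documented here, so the sketch stops at constructing the URL.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8585"  # dashboard address from Quick Start
VALID_TYPES = {"post", "reply", "mention", "reblog"}


def statuses_url(page=1, per_page=100, account_id=None, status_type=None, since=None):
    """Build a /api/statuses URL from the documented query parameters."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 500:
        raise ValueError("per_page must be between 1 and 500")
    if status_type is not None and status_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    params = {"page": page, "per_page": per_page}
    if account_id is not None:
        params["account_id"] = account_id
    if status_type is not None:
        params["type"] = status_type
    if since is not None:
        params["since"] = since  # ISO datetime string, e.g. 2024-01-01T00:00:00
    return f"{BASE_URL}/api/statuses?{urlencode(params)}"
```

The resulting URL can be fetched with `urllib.request.urlopen` and decoded with `json.load`; incrementing `page` until a request returns no statuses is one plausible way to walk the full result set.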

## Database Schema

Main tables:

- `monitored_accounts` — accounts being tracked
- `statuses` — collected posts with plain text + HTML content
- `mentions` — who was @-mentioned in each status
- `media_attachments` — images/videos attached to statuses
- `tags` — hashtags used
- `collection_logs` — audit trail of each collection run

Each status stores `raw_json` with the full Mastodon API response for future analysis needs.

## Moving to a Server

```bash
# Copy the project
scp -r mastodon-collector/ user@server:~/

# On the server
cd mastodon-collector
# Edit .env with production secrets
docker compose up -d
```

## Stopping

```bash
docker compose down          # Stop services, keep data
docker compose down -v       # Stop services AND delete database
```

## Research & Reporting

See [ANALYSIS_REPORT.md](ANALYSIS_REPORT.md) for a complete methodology report including:
- Data collection statistics
- Toxicity analysis methodology
- Manual review results and findings
- False positive analysis
- Limitations and considerations

## License

MIT License - see [LICENSE](LICENSE) file for details.