# Mastodon Collector

Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. Includes automated toxicity analysis using LLMs, a web UI for account management, data browsing, and manual review of flagged content, plus JSON/CSV APIs for your analysis pipeline.

## Quick Start

```bash
# 1. Add accounts to monitor
echo "@user@mastodon.social" >> accounts.txt

# 2. Start everything
docker compose up -d

# 3. Open the dashboard
open http://localhost:8585
```

## Architecture

| Service       | Description                                    | Port  |
|---------------|------------------------------------------------|-------|
| **db**        | PostgreSQL 16                                  | 5432  |
| **web**       | Flask dashboard (Gunicorn)                     | 8585  |
| **collector** | Background service, polls every 4 hours        | —     |

## Adding Accounts

Two methods:

1. **Text file** — edit `accounts.txt`, one handle per line (`@user@instance.social`). Picked up on next collection cycle.
2. **Web UI** — go to http://localhost:8585/accounts and use the form.

## Configuration

Edit `.env` to customize:

```
POSTGRES_PASSWORD=collector_secret    # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400           # Default: 4 hours (14400s)
OPENAI_API_KEY=sk-...                 # Required for toxicity analysis
```

## Toxicity Analysis

The system includes automated toxicity detection and manual review capabilities:

### Features

- **Automated Classification**: Uses an LLM to analyze posts across 12 toxicity dimensions:
  - General toxicity, threats, hate speech
  - Racism, antisemitism, islamophobia
  - Sexism, homophobia, ableism
  - Insults, dehumanization, extremism
- **Flagging System**: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
- **Manual Review Interface**: Web dashboard at `/analysis/flagged` for human validation
- **Analysis Dashboard**: Statistics, trends, and category breakdowns at `/analysis`

### Running Analysis

```bash
# Analyze all unscored statuses (run inside collector container)
docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"

# Limit to first 100 statuses for testing
docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"
```

### Analysis Database Schema

Additional tables for toxicity analysis:

- `toxicity_scores` — toxicity scores per status (12 categories + overall)
- `analysis_runs` — audit trail of analysis runs with costs and duration

### Cost Estimation

- Batch processing: ~10 posts per API call
- Estimated cost: ~$0.12 per 1,000 posts analyzed
- Example: 16,906 posts ≈ $1.95

## API Endpoints

For plugging into your analysis pipeline:

| Endpoint              | Description                          |
|-----------------------|--------------------------------------|
| `GET /api/stats`      | Overview stats (counts by type)      |
| `GET /api/statuses`   | Paginated statuses as JSON           |
| `GET /export`         | Download all statuses as CSV         |

### `/api/statuses` parameters

- `page` — page number (default: 1)
- `per_page` — results per page (default: 100, max: 500)
- `account_id` — filter by internal account ID
- `type` — filter by status type: `post`, `reply`, `mention`, `reblog`
- `since` — ISO datetime, only return statuses after this time

## Database Schema

Main tables:

- `monitored_accounts` — accounts being tracked
- `statuses` — collected posts with plain text + HTML content
- `mentions` — who was @-mentioned in each status
- `media_attachments` — images/videos attached to statuses
- `tags` — hashtags used
- `collection_logs` — audit trail of each collection run

Each status stores `raw_json` with the full Mastodon API response for future analysis needs.

## Moving to a Server

```bash
# Copy the project
scp -r mastodon-collector/ user@server:~/

# On the server
cd mastodon-collector
# Edit .env with production secrets
docker compose up -d
```

## Stopping

```bash
docker compose down      # Stop services, keep data
docker compose down -v   # Stop services AND delete database
```

## Research & Reporting

See [ANALYSIS_REPORT.md](ANALYSIS_REPORT.md) for a complete methodology report including:

- Data collection statistics
- Toxicity analysis methodology
- Manual review results and findings
- False positive analysis
- Limitations and considerations

## License

MIT License - see [LICENSE](LICENSE) file for details.
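
## Example: Querying `/api/statuses`

The `/api/statuses` parameters listed under API Endpoints can be combined from a script. The sketch below is illustrative only: the helper names `build_statuses_url` and `fetch_page` are hypothetical and not part of this project, the base URL assumes the default dashboard port from Quick Start, and the shape of the JSON response is deliberately not assumed. Rejecting `per_page` values above 500 on the client side is a choice made for this example; how the server treats larger values is not documented here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the dashboard runs on the default local port from Quick Start.
BASE_URL = "http://localhost:8585"


def build_statuses_url(base_url=BASE_URL, page=1, per_page=100,
                       account_id=None, status_type=None, since=None):
    """Build a /api/statuses URL from the documented query parameters."""
    # The README documents a maximum of 500 results per page.
    if not 1 <= per_page <= 500:
        raise ValueError("per_page must be between 1 and 500")
    params = {"page": page, "per_page": per_page}
    if account_id is not None:
        params["account_id"] = account_id
    if status_type is not None:
        # Documented values: post, reply, mention, reblog
        params["type"] = status_type
    if since is not None:
        # ISO datetime string, e.g. "2024-01-01T00:00:00"
        params["since"] = since
    return f"{base_url}/api/statuses?{urlencode(params)}"


def fetch_page(url, timeout=30):
    """Fetch one page and return the parsed JSON as-is (no shape assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the stack running, `fetch_page(build_statuses_url(status_type="reply", since="2024-01-01T00:00:00"))` would request the first page of replies collected after that time. Inspect the returned JSON before relying on any particular field names.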